linux-loongson

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson synced 2025-09-04 02:25:58 +00:00

Author	SHA1	Message	Date
Waiman Long	0a751df456	blk-throttle: Fix incorrect display of io.max Commit `bf20ab538c` ("blk-throttle: remove CONFIG_BLK_DEV_THROTTLING_LOW") attempts to revert the code change introduced by commit `cd5ab1b0fc` ("blk-throttle: add .low interface"). However, it leaves behind the bps_conf[] and iops_conf[] fields in the throtl_grp structure which aren't set anywhere in the new blk-throttle.c code but are still being used by tg_prfill_limit() to display the limits in io.max. Now io.max always displays the following values if a block queue is used: <m>:<n> rbps=0 wbps=0 riops=0 wiops=0 Fix this problem by removing bps_conf[] and iops_conf[] and use bps[] and iops[] instead to complete the revert. Fixes: `bf20ab538c` ("blk-throttle: remove CONFIG_BLK_DEV_THROTTLING_LOW") Reported-by: Justin Forbes <jforbes@redhat.com> Closes: https://github.com/containers/podman/issues/22701#issuecomment-2120627789 Signed-off-by: Waiman Long <longman@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240530134547.970075-1-longman@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-30 19:44:29 -06:00
John Garry	41b7574252	scsi: bsg: Pass dev to blk_mq_alloc_queue() When calling bsg_setup_queue() -> blk_mq_alloc_queue(), we don't pass the dev as the queuedata, but rather manually set it afterwards. Just pass dev to blk_mq_alloc_queue() to have it set automatically. Signed-off-by: John Garry <john.g.garry@oracle.com> Link: https://lore.kernel.org/r/20240524084829.2132555-3-john.g.garry@oracle.com Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Tested-by: Himanshu Madhani <himanshu.madhani@oracle.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>	2024-05-30 20:22:15 -04:00
Damien Le Moal	29459c3eaa	block: Fix zone write plugging handling of devices with a runt zone A zoned device may have a last sequential write required zone that is smaller than other zones. However, all tests to check if a zone write plug write offset exceeds the zone capacity use the same capacity value stored in the gendisk zone_capacity field. This is incorrect for a zoned device with a last runt (smaller) zone. Add the new field last_zone_capacity to struct gendisk to store the capacity of the last zone of the device. blk_revalidate_seq_zone() and blk_revalidate_conv_zone() are both modified to get this value when disk_zone_is_last() returns true. Similarly to zone_capacity, the value is first stored using the last_zone_capacity field of struct blk_revalidate_zone_args. Once zone revalidation of all zones is done, this is used to set the gendisk last_zone_capacity field. The checks to determine if a zone is full or if a sector offset in a zone exceeds the zone capacity in disk_should_remove_zone_wplug(), disk_zone_wplug_abort_unaligned(), blk_zone_write_plug_init_request(), and blk_zone_wplug_prepare_bio() are modified to use the new helper functions disk_zone_is_full() and disk_zone_wplug_is_full(). disk_zone_is_full() uses the zone index to determine if the zone being tested is the last one of the disk and uses either the disk zone_capacity or last_zone_capacity accordingly. Fixes: `dd291d77cc` ("block: Introduce zone write plugging") Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Niklas Cassel <cassel@kernel.org> Link: https://lore.kernel.org/r/20240530054035.491497-4-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-30 15:03:52 -06:00
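The capacity check described in the commit above can be modelled in user space. The following is only a sketch in Python, not the kernel's C code: the `Disk` class is a hypothetical stand-in for the three `struct gendisk` fields the message mentions, and the function bodies are inferred from the description rather than copied from the patch.

```python
from dataclasses import dataclass


@dataclass
class Disk:
    """Hypothetical stand-in for the zone-related fields of struct gendisk."""
    nr_zones: int
    zone_capacity: int       # capacity, in sectors, of every zone except the last
    last_zone_capacity: int  # capacity, in sectors, of the (possibly runt) last zone


def disk_zone_is_last(disk: Disk, zno: int) -> bool:
    # The zone index alone decides whether this is the last zone of the disk.
    return zno == disk.nr_zones - 1


def disk_zone_is_full(disk: Disk, zno: int, offset_in_zone: int) -> bool:
    # Compare the write offset against the capacity that applies to this zone.
    if disk_zone_is_last(disk, zno):
        return offset_in_zone >= disk.last_zone_capacity
    return offset_in_zone >= disk.zone_capacity


# Example device (made-up numbers): four zones, the last one a 256-sector runt.
disk = Disk(nr_zones=4, zone_capacity=1024, last_zone_capacity=256)
```

With a single shared capacity value, an offset of 256 in the last zone would wrongly be treated as "not full". Selecting the capacity by zone index is what removes that case.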
Damien Le Moal	cd63999368	block: Fix validation of zoned device with a runt zone Commit `ecfe43b11b` ("block: Remember zone capacity when revalidating zones") introduced checks to ensure that the capacity of the zones of a zoned device is constant for all zones. However, this check ignores the possibility that a zoned device has a smaller last zone with a size not equal to the capacity of other zones. Such a device corresponds in practice to an SMR drive with a smaller last zone and all zones with a capacity equal to the zone size, leading to the last zone capacity being different than the capacity of other zones. Correctly handle such a device by fixing the check for the constant zone capacity in blk_revalidate_seq_zone() using the new helper function disk_zone_is_last(). This helper function is also used in blk_revalidate_zone_cb() when checking the zone size. Fixes: `ecfe43b11b` ("block: Remember zone capacity when revalidating zones") Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Niklas Cassel <cassel@kernel.org> Link: https://lore.kernel.org/r/20240530054035.491497-3-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-30 15:03:52 -06:00
Hannes Reinecke	e993db2d6e	block: check for max_hw_sectors underflow The logical block size needs to be smaller than the max_hw_sectors setting, otherwise we can't even transfer a single LBA. Signed-off-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Garry <john.g.garry@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-28 06:55:23 -06:00
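A rough illustration of the check in the commit above, as a Python sketch. The commit message does not show the code, so everything here is an assumption: the function name `validate_max_hw_sectors` is hypothetical, and the "underflow" is assumed to be `max_hw_sectors` rounding down to zero once it is aligned to the logical block size.

```python
SECTOR_SHIFT = 9  # sectors are 512 bytes


def validate_max_hw_sectors(max_hw_sectors: int, logical_block_size: int) -> int:
    """Return max_hw_sectors rounded down to a whole number of logical blocks.

    Raises ValueError when not even one logical block fits (hypothetical
    stand-in for the kernel returning an error).
    """
    if logical_block_size < (1 << SECTOR_SHIFT):
        raise ValueError("logical block size below one sector")
    logical_block_sectors = logical_block_size >> SECTOR_SHIFT
    if logical_block_sectors > max_hw_sectors:
        # Without this check the rounding below would silently produce 0.
        raise ValueError("max_hw_sectors smaller than one logical block")
    return max_hw_sectors - (max_hw_sectors % logical_block_sectors)
```

For example, a 4096-byte logical block spans 8 sectors, so a limit of 255 sectors is aligned down to 248, while a limit of 4 sectors is rejected instead of being rounded to 0.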
Christoph Hellwig	e528bede6f	block: stack max_user_sectors The max_user_sectors is one of the three factors determining the actual max_sectors limit for READ/WRITE requests. Because of that it needs to be stacked at least for the device mapper multi-path case where requests are directly inserted on the lower device. For SCSI disks this is important because the sd driver actually sets its own advisory limit that is lower than max_hw_sectors based on the block limits VPD page. While this is a bit odd and unusual, the same effect can happen if a user or udev script tweaks the value manually. Fixes: `4f563a6473` ("block: add a max_user_discard_sectors queue limit") Reported-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Mike Snitzer <snitzer@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240523182618.602003-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-28 06:54:36 -06:00
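"Stacking" a limit means combining the value of an upper (stacked) device with that of the device underneath it. The Python sketch below shows one plausible combining rule. It is a model and not the kernel code: it assumes that a value of 0 means "not set", and `stack_max_user_sectors` is a hypothetical helper name.

```python
def min_not_zero(a: int, b: int) -> int:
    """Smaller of two limits, treating 0 as 'no limit configured'."""
    if a == 0:
        return b
    if b == 0:
        return a
    return min(a, b)


def stack_max_user_sectors(top: int, bottom: int) -> int:
    # The stacked device must not advertise more than the lower device allows.
    return min_not_zero(top, bottom)


# Made-up example: a multipath device with no user limit of its own, stacked
# on a SCSI disk whose driver set an advisory limit of 1280 sectors.
stacked = stack_max_user_sectors(0, 1280)
```

In this model the upper device inherits the lower device's 1280-sector limit, which matches the multi-path case the commit message describes, where requests are inserted directly on the lower device.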
hexue	30a0e3135f	block: delete redundant function declaration blk_stats_alloc_enable was used for block hybrid poll, the related function definition was removed by patch: commit `54bdd67d0f` ("blk-mq: remove hybrid polling") but the function declaration was not deleted. Signed-off-by: hexue <xue01.he@samsung.com> Link: https://lore.kernel.org/r/20240527084533.1485210-1-xue01.he@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-27 13:58:06 -06:00
Linus Torvalds	b4d88a60fe	block-6.10-20240523 Merge tag 'block-6.10-20240523' of git://git.kernel.dk/linux Pull more block updates from Jens Axboe: "Followup block updates, mostly due to NVMe being a bit late to the party. But nothing major in there, so not a big deal. 
In detail, this contains: - NVMe pull request via Keith: - Fabrics connection retries (Daniel, Hannes) - Fabrics logging enhancements (Tokunori) - RDMA delete optimization (Sagi) - ublk DMA alignment fix (me) - null_blk sparse warning fixes (Bart) - Discard support for brd (Keith) - blk-cgroup list corruption fixes (Ming) - blk-cgroup stat propagation fix (Waiman) - Regression fix for plugging stall with md (Yu) - Misc fixes or cleanups (David, Jeff, Justin)" * tag 'block-6.10-20240523' of git://git.kernel.dk/linux: (24 commits) null_blk: fix null-ptr-dereference while configuring 'power' and 'submit_queues' blk-throttle: remove unused struct 'avg_latency_bucket' block: fix lost bio for plug enabled bio based device block: t10-pi: add MODULE_DESCRIPTION() blk-mq: add helper for checking if one CPU is mapped to specified hctx blk-cgroup: Properly propagate the iostat update up the hierarchy blk-cgroup: fix list corruption from reorder of WRITE ->lqueued blk-cgroup: fix list corruption from resetting io stat cdrom: rearrange last_media_change check to avoid unintentional overflow nbd: Fix signal handling nbd: Remove a local variable from nbd_send_cmd() nbd: Improve the documentation of the locking assumptions nbd: Remove superfluous casts nbd: Use NULL to represent a pointer brd: implement discard support null_blk: Fix two sparse warnings ublk_drv: set DMA alignment mask to 3 nvme-rdma, nvme-tcp: include max reconnects for reconnect logging nvmet-rdma: Avoid o(n^2) loop in delete_ctrl nvme: do not retry authentication failures ...	2024-05-23 13:44:47 -07:00
Dr. David Alan Gilbert	4a482e691c	blk-throttle: remove unused struct 'avg_latency_bucket' 'avg_latency_bucket' is unused since commit `bf20ab538c` ("blk-throttle: remove CONFIG_BLK_DEV_THROTTLING_LOW") Remove it. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Link: https://lore.kernel.org/r/20240522172458.334173-1-linux@treblig.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-22 11:30:47 -06:00
Yu Kuai	9a42891c35	block: fix lost bio for plug enabled bio based device With the following two conditions, bio will be lost: 1) blk plug is not enabled, for example, __blkdev_direct_IO_simple() and __blkdev_direct_IO_async(); 2) bio plug is enabled, for example write IO for raid1/raid10 while bitmap is enabled; Root cause is that blk_finish_plug() will add the bio to current->bio_list, while such bio will not be handled: __submit_bio_noacct current->bio_list = bio_list_on_stack; blk_start_plug do { dm_submit_bio md_handle_request raid10_write_request -> generate new bio for underlying disks raid1_add_bio_to_plug -> bio is added to plug } while ((bio = bio_list_pop(&bio_list_on_stack[0]))) -> previous bio are all handled blk_finish_plug raid10_unplug raid1_submit_write submit_bio_noacct if (current->bio_list) bio_list_add(&current->bio_list[0], bio) -> add new bio current->bio_list = NULL -> new bio is lost Fix the problem by moving the plug into the while loop, so that current->bio_list will still be handled after blk_finish_plug(). By the way, enabling the plug for raid1/raid10 in this case will also avoid deferring IO handling to the daemon thread, which should also improve IO performance. Fixes: `060406c61c` ("block: add plug while submitting IO") Reported-by: Changhui Zhong <czhong@redhat.com> Closes: https://lore.kernel.org/all/CAGVVp+Xsmzy2G9YuEatfMT6qv1M--YdOCQ0g7z7OVmcTbBxQAg@mail.gmail.com/ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Tested-by: Changhui Zhong <czhong@redhat.com> Link: https://lore.kernel.org/r/20240521200308.983986-1-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-21 19:37:33 -06:00
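The ordering problem in the commit above is easier to see in a toy model. The Python sketch below is not kernel code: `submit_bio`, the `"md:"`/`"disk:"` string prefixes, and the list-based plug are all hypothetical simplifications. It only models the one property that matters here, namely that bios flushed from the plug are appended to the on-stack list and are seen only if the loop is still running.

```python
from collections import deque


def submit_bio(top_bios, plug_inside_loop):
    """Toy model of __submit_bio_noacct(); returns (reached_disk, lost)."""
    bio_list = deque(top_bios)  # models current->bio_list (bio_list_on_stack)
    plug = []                   # bios parked by raid1_add_bio_to_plug()
    reached_disk = []

    def handle(bio):
        if bio.startswith("md:"):
            # md generates a new bio for the underlying disk and plugs it.
            plug.append("disk:" + bio[len("md:"):])
        else:
            reached_disk.append(bio)

    def finish_plug():
        # Unplugging resubmits each parked bio; because current->bio_list is
        # set, the bio is queued on that list instead of being issued directly.
        bio_list.extend(plug)
        plug.clear()

    if plug_inside_loop:
        while bio_list:
            handle(bio_list.popleft())
            finish_plug()  # queued bios are picked up by the next iteration
    else:
        while bio_list:
            handle(bio_list.popleft())
        finish_plug()      # too late: the loop has already finished
    lost = list(bio_list)  # current->bio_list = NULL drops whatever remains
    return reached_disk, lost
```

Calling `submit_bio(["md:w1"], plug_inside_loop=False)` leaves `disk:w1` stranded in the list, while `plug_inside_loop=True` lets the loop drain it, which is the effect the fix aims for.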
Linus Torvalds	3413efa888	Compactifying bdev flags We can easily have up to 24 flags with sane atomicity, _without_ pushing anything out of the first cacheline of struct block_device. Merge tag 'pull-bd_flags-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull bdev flags update from Al Viro: "Compactifying bdev flags. We can easily have up to 24 flags with sane atomicity, _without_ pushing anything out of the first cacheline of struct block_device" * tag 'pull-bd_flags-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: bdev: move ->bd_make_it_fail to ->__bd_flags bdev: move ->bd_ro_warned to ->__bd_flags bdev: move ->bd_has_subit_bio to ->__bd_flags bdev: move ->bd_write_holder into ->__bd_flags bdev: move ->bd_read_only to ->__bd_flags bdev: infrastructure for flags wrapper for access to ->bd_partno Use bdev_is_paritition() instead of open-coding it	2024-05-21 13:02:56 -07:00
Linus Torvalds	38da32ee70	bd_inode series Replacement of bdev->bd_inode with sane(r) set of primitives. Merge tag 'pull-bd_inode-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull bdev bd_inode updates from Al Viro: "Replacement of bdev->bd_inode with sane(r) set of primitives by me and Yu Kuai" * tag 'pull-bd_inode-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: RIP ->bd_inode dasd_format(): killing the last remaining user of ->bd_inode nilfs_attach_log_writer(): use ->bd_mapping->host instead of ->bd_inode block/bdev.c: use the knowledge of inode/bdev coallocation gfs2: more obvious initializations of mapping->host fs/buffer.c: massage the remaining users of ->bd_inode to ->bd_mapping blk_ioctl_{discard,zeroout}(): we only want ->bd_inode->i_mapping here... grow_dev_folio(): we only want ->bd_inode->i_mapping there use ->bd_mapping instead of ->bd_inode->i_mapping block_device: add a pointer to struct address_space (page cache of bdev) missing helpers: bdev_unhash(), bdev_drop() block: move two helpers into bdev.c block2mtd: prevent direct access of bd_inode dm-vdo: use bdev_nr_bytes(bdev) instead of i_size_read(bdev->bd_inode) blkdev_write_iter(): saner way to get inode and bdev bcachefs: remove dead function bdev_sectors() ext4: remove block_device_ejected() erofs_buf: store address_space instead of inode erofs: switch erofs_bread() to passing offset instead of block number	2024-05-21 09:51:42 -07:00
Linus Torvalds	5ad8b6ad9a	getting rid of bogus set_blocksize() uses, switching it to struct file * and verifying that caller has device opened exclusively. Merge tag 'pull-set_blocksize' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs blocksize updates from Al Viro: "This gets rid of bogus set_blocksize() uses, switches it over to be based on a 'struct file *' and verifies that the caller has the device opened exclusively" * tag 'pull-set_blocksize' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: make set_blocksize() fail unless block device is opened exclusive set_blocksize(): switch to passing struct file * btrfs_get_bdev_and_sb(): call set_blocksize() only for exclusive opens swsusp: don't bother with setting block size zram: don't bother with reopening - just use O_EXCL for open swapon(2): open swap with O_EXCL swapon(2)/swapoff(2): don't bother with block size pktcdvd: sort set_blocksize() calls out bcache_register(): don't bother with set_blocksize()	2024-05-21 08:34:51 -07:00
Jeff Johnson	f0eab3e8d1	block: t10-pi: add MODULE_DESCRIPTION() Fix the allmodconfig 'make W=1' issue: WARNING: modpost: missing MODULE_DESCRIPTION() in block/t10-pi.o Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240516-md-t10-pi-v1-1-44a3469374aa@quicinc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-20 08:07:44 -06:00
Linus Torvalds	eb6a9339ef	Mainly singleton patches, documented in their respective changelogs. Notable series include: - Some maintenance and performance work for ocfs2 in Heming Zhao's series "improve write IO performance when fragmentation is high". - Some ocfs2 bugfixes from Su Yue in the series "ocfs2 bugs fixes exposed by fstests". - kfifo header rework from Andy Shevchenko in the series "kfifo: Clean up kfifo.h". - GDB script fixes from Florian Rommel in the series "scripts/gdb: Fixes for $lx_current and $lx_per_cpu". - After much discussion, a coding-style update from Barry Song explaining one reason why inline functions are preferred over macros. The series is "codingstyle: avoid unused parameters for a function-like macro". Merge tag 'mm-nonmm-stable-2024-05-19-11-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-mm updates from Andrew Morton: "Mainly singleton patches, documented in their respective changelogs. Notable series include: - Some maintenance and performance work for ocfs2 in Heming Zhao's series "improve write IO performance when fragmentation is high". - Some ocfs2 bugfixes from Su Yue in the series "ocfs2 bugs fixes exposed by fstests". - kfifo header rework from Andy Shevchenko in the series "kfifo: Clean up kfifo.h". - GDB script fixes from Florian Rommel in the series "scripts/gdb: Fixes for $lx_current and $lx_per_cpu". - After much discussion, a coding-style update from Barry Song explaining one reason why inline functions are preferred over macros. 
The series is "codingstyle: avoid unused parameters for a function-like macro"" * tag 'mm-nonmm-stable-2024-05-19-11-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (62 commits) fs/proc: fix softlockup in __read_vmcore nilfs2: convert BUG_ON() in nilfs_finish_roll_forward() to WARN_ON() scripts: checkpatch: check unused parameters for function-like macro Documentation: coding-style: ask function-like macros to evaluate parameters nilfs2: use __field_struct() for a bitwise field selftests/kcmp: remove unused open mode nilfs2: remove calls to folio_set_error() and folio_clear_error() kernel/watchdog_perf.c: tidy up kerneldoc watchdog: allow nmi watchdog to use raw perf event watchdog: handle comma separated nmi_watchdog command line nilfs2: make superblock data array index computation sparse friendly squashfs: remove calls to set the folio error flag squashfs: convert squashfs_symlink_read_folio to use folio APIs scripts/gdb: fix detection of current CPU in KGDB scripts/gdb: make get_thread_info accept pointers scripts/gdb: fix parameter handling in $lx_per_cpu scripts/gdb: fix failing KGDB detection during probe kfifo: don't use "proxy" headers media: stih-cec: add missing io.h media: rc: add missing io.h ...	2024-05-19 14:02:03 -07:00
Ming Lei	7b815817aa	blk-mq: add helper for checking if one CPU is mapped to specified hctx Commit `a46c27026d` ("blk-mq: don't schedule block kworker on isolated CPUs") rules out isolated CPUs from hctx->cpumask, and hctx->cpumask should only be used for scheduling kworker. Add helper blk_mq_cpu_mapped_to_hctx() and apply it in the cpuhp handlers. This patch avoids forgetting to clear INACTIVE of hctx state in case one isolated CPU comes online, and fixes a hang when allocating requests from this hctx's tags. Cc: Raju Cheerla <rcheerla@redhat.com> Fixes: `a46c27026d` ("blk-mq: don't schedule block kworker on isolated CPUs") Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20240517020514.149771-1-ming.lei@redhat.com Tested-by: Raju Cheerla <rcheerla@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-17 09:40:26 -06:00
Waiman Long	9d230c0996	blk-cgroup: Properly propagate the iostat update up the hierarchy During a cgroup_rstat_flush() call, the lowest level of nodes are flushed first before their parents. Since commit `3b8cc62987` ("blk-cgroup: Optimize blkcg_rstat_flush()"), iostat propagation was still done to the parent. Grandparent, however, may not get the iostat update if the parent has no blkg_iostat_set queued in its lhead lockless list. Fix this iostat propagation problem by queuing the parent's global blkg->iostat into one of its percpu lockless lists to make sure that the delta will always be propagated up to the grandparent and so on toward the root blkcg. Note that successive calls to __blkcg_rstat_flush() are serialized by the cgroup_rstat_lock. So no special barrier is used in the reading and writing of blkg->iostat.lqueued. Fixes: `3b8cc62987` ("blk-cgroup: Optimize blkcg_rstat_flush()") Reported-by: Dan Schatzberg <schatzberg.dan@gmail.com> Closes: https://lore.kernel.org/lkml/ZkO6l%2FODzadSgdhC@dschatzberg-fedora-PF3DHTBV/ Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20240515143059.276677-1-longman@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-15 20:15:54 -06:00
Ming Lei	d0aac23635	blk-cgroup: fix list corruption from reorder of WRITE ->lqueued __blkcg_rstat_flush() can be run anytime, especially when blk_cgroup_bio_start is being executed. If WRITE of `->lqueued` is re-ordered with READ of 'bisc->lnode.next' in the loop of __blkcg_rstat_flush(), `next_bisc` can be assigned with one stat instance being added in blk_cgroup_bio_start(), then the local list in __blkcg_rstat_flush() could be corrupted. Fix the issue by adding one barrier. Cc: Tejun Heo <tj@kernel.org> Cc: Waiman Long <longman@redhat.com> Fixes: `3b8cc62987` ("blk-cgroup: Optimize blkcg_rstat_flush()") Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20240515013157.443672-3-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-15 20:14:20 -06:00
Ming Lei	6da6680632	blk-cgroup: fix list corruption from resetting io stat Since commit `3b8cc62987` ("blk-cgroup: Optimize blkcg_rstat_flush()"), each iostat instance is added to blkcg percpu list, so blkcg_reset_stats() can't reset the stat instance by memset(), otherwise the llist may be corrupted. Fix the issue by only resetting the counter part. Cc: Tejun Heo <tj@kernel.org> Cc: Waiman Long <longman@redhat.com> Cc: Jay Shin <jaeshin@redhat.com> Fixes: `3b8cc62987` ("blk-cgroup: Optimize blkcg_rstat_flush()") Signed-off-by: Ming Lei <ming.lei@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Waiman Long <longman@redhat.com> Link: https://lore.kernel.org/r/20240515013157.443672-2-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-15 20:14:20 -06:00
Linus Torvalds	113d1dd9c8	SCSI misc on 20240514 Updates to the usual drivers (ufs, lpfc, qla2xxx, mpi3mr, libsas). The major update (which causes a conflict with block, see below) is Christoph removing the queue limits and their associated block helpers. The remaining patches are assorted minor fixes and deprecated function updates plus a bit of constification. Signed-off-by: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi Pull SCSI updates from James Bottomley: "Updates to the usual drivers (ufs, lpfc, qla2xxx, mpi3mr, libsas). The major update (which causes a conflict with block, see below) is Christoph removing the queue limits and their associated block helpers. 
The remaining patches are assorted minor fixes and deprecated function updates plus a bit of constification" * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (141 commits) scsi: mpi3mr: Sanitise num_phys scsi: lpfc: Copyright updates for 14.4.0.2 patches scsi: lpfc: Update lpfc version to 14.4.0.2 scsi: lpfc: Add support for 32 byte CDBs scsi: lpfc: Change lpfc_hba hba_flag member into a bitmask scsi: lpfc: Introduce rrq_list_lock to protect active_rrq_list scsi: lpfc: Clear deferred RSCN processing flag when driver is unloading scsi: lpfc: Update logging of protection type for T10 DIF I/O scsi: lpfc: Change default logging level for unsolicited CT MIB commands scsi: target: Remove unused list 'device_list' scsi: iscsi: Remove unused list 'connlist_err' scsi: ufs: exynos: Add support for Tensor gs101 SoC scsi: ufs: exynos: Add some pa_dbg_ register offsets into drvdata scsi: ufs: exynos: Allow max frequencies up to 267Mhz scsi: ufs: exynos: Add EXYNOS_UFS_OPT_TIMER_TICK_SELECT option scsi: ufs: exynos: Add EXYNOS_UFS_OPT_UFSPR_SECURE option scsi: ufs: dt-bindings: exynos: Add gs101 compatible scsi: qla2xxx: Fix debugfs output for fw_resource_count scsi: qedf: Ensure the copied buf is NUL terminated scsi: bfa: Ensure the copied buf is NUL terminated ...	2024-05-14 18:25:53 -07:00
Linus Torvalds	a3d1f54d7a	for-6.10-tag Merge tag 'for-6.10-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "This update brings a few minor performance improvements, otherwise there's a lot of refactoring, cleanups and other sort of not user visible changes. 
Performance improvements: - inline b-tree locking functions, improvement in metadata-heavy changes - relax locking on a range that's being reflinked, allows read operations to run in parallel - speed up NOCOW write checks (throughput +9% on a sample test) - extent locking ranges have been reduced in several places, namely around delayed ref processing Core: - more page to folio conversions: - relocation - send - compression - inline extent handling - super block write and wait - extent_map structure optimizations: - reduced structure size - code simplifications - add shrinker for allocated objects, the numbers can go high and could exhaust memory on smaller systems (reported) as they may not get an opportunity to be freed fast enough - extent locking optimizations: - reduce locking ranges where it does not seem to be necessary and are safe due to other means of synchronization - potential improvements due to lower contention, allocation/freeing and state management operations of extent state tracking structures - delayed ref cleanups and simplifications - updated trace points - improved error handling, warnings and assertions - cleanups and refactoring, unification of error handling paths" * tag 'for-6.10-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (122 commits) btrfs: qgroup: fix initialization of auto inherit array btrfs: count super block write errors in device instead of tracking folio error state btrfs: use the folio iterator in btrfs_end_super_write() btrfs: convert super block writes to folio in write_dev_supers() btrfs: convert super block writes to folio in wait_dev_supers() bio: Export bio_add_folio_nofail to modules btrfs: remove duplicate included header from fs.h btrfs: add a cached state to extent_clear_unlock_delalloc btrfs: push extent lock down in submit_one_async_extent btrfs: push lock_extent down in cow_file_range() btrfs: move can_cow_file_range_inline() outside of the extent lock btrfs: push lock_extent into 
cow_file_range_inline btrfs: push extent lock into cow_file_range btrfs: push extent lock into run_delalloc_cow btrfs: remove unlock_extent from run_delalloc_compressed btrfs: push extent lock down in run_delalloc_nocow btrfs: adjust while loop condition in run_delalloc_nocow btrfs: push extent lock into run_delalloc_nocow btrfs: push the extent lock into btrfs_run_delalloc_range btrfs: lock extent when doing inline extent in compression ...	2024-05-14 17:25:36 -07:00
Linus Torvalds	0c9f4ac808	for-6.10/block-20240511 Merge tag 'for-6.10/block-20240511' of git://git.kernel.dk/linux Pull block updates from Jens Axboe: - Add a partscan attribute in sysfs, fixing an issue with systemd relying on an internal interface that went away. - Attempt #2 at making long running discards interruptible. The previous attempt went into 6.9, but we ended up mostly reverting it as it had issues. - Remove old ida_simple API in bcache - Support for zoned write plugging, greatly improving the performance on zoned devices. - Remove the old throttle low interface, which has been experimental since 2017 and never made it beyond that and isn't being used. - Remove page->index debugging checks in brd, as it hasn't caught anything and prepares us for removing in struct page. 
- MD pull request from Song - Don't schedule block workers on isolated CPUs * tag 'for-6.10/block-20240511' of git://git.kernel.dk/linux: (84 commits) blk-throttle: delay initialization until configuration blk-throttle: remove CONFIG_BLK_DEV_THROTTLING_LOW block: fix that util can be greater than 100% block: support to account io_ticks precisely block: add plug while submitting IO bcache: fix variable length array abuse in btree_iter bcache: Remove usage of the deprecated ida_simple_xx() API md: Revert "md: Fix overflow in is_mddev_idle" blk-lib: check for kill signal in ioctl BLKDISCARD block: add a bio_await_chain helper block: add a blk_alloc_discard_bio helper block: add a bio_chain_and_submit helper block: move discard checks into the ioctl handler block: remove the discard_granularity check in __blkdev_issue_discard block/ioctl: prefer different overflow check null_blk: Fix the WARNING: modpost: missing MODULE_DESCRIPTION() block: fix and simplify blkdevparts= cmdline parsing block: refine the EOF check in blkdev_iomap_begin block: add a partscan sysfs attribute for disks block: add a disk_has_partscan helper ...	2024-05-13 13:03:54 -07:00
Linus Torvalds	1b0aabcc9a	vfs-6.10.misc Merge tag 'vfs-6.10.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "This contains the usual miscellaneous features, cleanups, and fixes for vfs and individual fses. Features: - Free up FMODE_* bits. I've freed up bits 6, 7, 8, and 24. That means we now have six free FMODE_* bits in total (but bit #6 already got used for FMODE_WRITE_RESTRICTED) - Add FOP_HUGE_PAGES flag (follow-up to FMODE_* cleanup) - Add fd_raw cleanup class so we can make use of automatic cleanup provided by CLASS(fd_raw, f)(fd) for O_PATH fds as well - Optimize seq_puts() - Simplify __seq_puts() - Add new anon_inode_getfile_fmode() api to allow specifying f_mode instead of open-coding it in multiple places - Annotate struct file_handle with __counted_by() and use struct_size() - Warn in get_file() whether f_count resurrection from zero is attempted (epoll/drm discussion) - Folio-sophize aio - Export the subvolume id in statx() for both btrfs and bcachefs - Relax linkat(AT_EMPTY_PATH) requirements - Add F_DUPFD_QUERY fcntl() allowing to compare two file descriptors for dup() equality replacing kcmp() Cleanups: - Compile out swapfile inode checks when swap isn't enabled - Use (1 << n) notation for FMODE_ bitshifts for clarity - Remove redundant variable assignment in fs/direct-io - Cleanup uses of strncpy in orangefs - Speed up and cleanup writeback - Move fsparam_string_empty() helper into header since it's currently open-coded in multiple places - Add kernel-doc comments to proc_create_net_data_write() - Don't needlessly read dentry->d_flags twice Fixes: - Fix out-of-range warning in nilfs2 - Fix ecryptfs overflow due to wrong encryption packet size calculation - Fix 
overly long line in xfs file_operations (follow-up to FMODE_* cleanup) - Don't raise FOP_BUFFER_{R,W}ASYNC for directories in xfs (follow-up to FMODE_* cleanup) - Don't call xfs_file_open from xfs_dir_open (follow-up to FMODE_* cleanup) - Fix stable offset api to prevent endless loops - Fix afs file server rotations - Prevent xattr node from overflowing the eraseblock in jffs2 - Move fdinfo PTRACE_MODE_READ procfs check into the .permission() operation instead of .open() operation since this caused userspace regressions" * tag 'vfs-6.10.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (39 commits) afs: Fix fileserver rotation getting stuck selftests: add F_DUPDFD_QUERY selftests fcntl: add F_DUPFD_QUERY fcntl() file: add fd_raw cleanup class fs: WARN when f_count resurrection is attempted seq_file: Simplify __seq_puts() seq_file: Optimize seq_puts() proc: Move fdinfo PTRACE_MODE_READ check into the inode .permission operation fs: Create anon_inode_getfile_fmode() xfs: don't call xfs_file_open from xfs_dir_open xfs: drop fop_flags for directories xfs: fix overly long line in the file_operations shmem: Fix shmem_rename2() libfs: Add simple_offset_rename() API libfs: Fix simple_offset_rename_exchange() jffs2: prevent xattr node from overflowing the eraseblock vfs, swap: compile out IS_SWAPFILE() on swapless configs vfs: relax linkat() AT_EMPTY_PATH - aka flink() - requirements fs/direct-io: remove redundant assignment to variable retval fs/dcache: Re-use value stored to dentry->d_flags instead of re-reading ...	2024-05-13 11:40:06 -07:00
Linus Torvalds	f4345f05c0	block-6.9-20240510 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmY+Tr0QHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpok2EACLqHZ2faer07Rpmm9pYzQxPVAJ1iQn3+dX kQLVStabysazYC8ZRk8VrgUZDJAblHDEct/f4eMVIX0abQoXB6nghkAUPFXledso SjCJCzWzhZN7xbC3FAQoxldpFjQxlI3y5xiqcD33bfTouPCDAwkThqCtXIEMwyff OwRDvrB8SrGLHIJYPntfKFnI1DTpu41ZFY/508olxs/uZSuUbxdPFQj2rh8gC5on b86HvsCS8laGY6EO86bbTjjp9WJJ1MMMaFrPzwc9deWbh/lJDB70hptxrHBLLv5Z i+CctM+KEYB/KRL+YjXZSOS2tYmoeA9jEbrtcqiEX87h3F3bfhH4XAp2EDkNoeG9 bzLRaR8tNsKoBkpauGgptjtxpJKHDs5ax4mJgsphOGGv6VBi+RfUlmIe76XFspzZ YJHRjpJ1FsBjyWtQ60W2FWdN+1IrvZ0GriN/Wk68ReHPGp9pBsnK6LXSUSAQ5cq2 eCKppG7S7ZVMylXZhNOK2E8vgz1XqsaE0wf7jI5tPjQKyZBVFrlIQ0sVBr01mKkv ycOJEUx6dGnJrzJPaH3T2m0lhCrUDJaM1/eUWFWvdQEWXyDlGmF6cU0smiSYbeca LP4TU+iB5s5D5nltzHq9D+9E96MWm6axhKAfoJZLpPI6+I73xD3hOAw3ufKIlNi9 jWjtMbArpQ== =k1hJ -----END PGP SIGNATURE----- Merge tag 'block-6.9-20240510' of git://git.kernel.dk/linux Pull block fixes from Jens Axboe: - NVMe pull request via Keith: - nvme target fixes (Sagi, Dan, Maurizo) - new vendor quirk for broken MSI (Sean) - Virtual boundary fix for a regression in this merge window (Ming) * tag 'block-6.9-20240510' of git://git.kernel.dk/linux: nvmet-rdma: fix possible bad dereference when freeing rsps nvmet: prevent sprintf() overflow in nvmet_subsys_nsid_exists() nvmet: make nvmet_wq unbound nvmet-auth: return the error code to the nvmet_auth_ctrl_hash() callers nvme-pci: Add quirk for broken MSIs block: set default max segment size in case of virt_boundary	2024-05-10 10:24:16 -07:00
Yu Kuai	a3166c5170	blk-throttle: delay initialization until configuration Other cgroup policy like bfq, iocost are lazy-initialized when they are configured for the first time for the device, but blk-throttle is initialized unconditionally from blkcg_init_disk(). Delay initialization of blk-throttle as well, to save some cpu and memory overhead if it's not configured. Noted that once it's initialized, it can't be destroyed until disk removal, even if it's disabled. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240509121107.3195568-3-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-09 09:44:56 -06:00
Yu Kuai	bf20ab538c	blk-throttle: remove CONFIG_BLK_DEV_THROTTLING_LOW On the one hand, it has been marked EXPERIMENTAL since 2017, and it looks like there have been no users, no testers and no developers since then; it's just not active at all. On the other hand, even if the config is disabled, there are still many fields in throtl_grp and throtl_data and many functions that are only used for throtl low. Lastly, blk-throtl is currently initialized during disk initialization and destroyed during disk removal, and it exposes many functions to be called directly from the block layer. Remove throtl low to make the code much cleaner and follow-up work much easier. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20240509121107.3195568-2-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-09 09:44:55 -06:00
Yu Kuai	7be835694d	block: fix that util can be greater than 100% util means the percentage of time that the disk has IO, and theoretically it should not be greater than 100%. However, there is a gap for rq-based disks: io_ticks is updated when an rq is allocated; however, before such an rq is dispatched to the driver, it is not accounted as inflight from blk_mq_start_request(), hence diskstats_show()/part_stat_show() will not update io_ticks. For example: 1) at t0, issue a new IO, rq is allocated, and blk_account_io_start() updates io_ticks; 2) something is wrong with the driver, and the rq can't be dispatched; 3) at t0 + 10s, the driver recovers and the rq is dispatched and done, io_ticks is updated; Then if the user is using "iostat 1" to monitor "util", between t0 - t0+9s, util will be zero, and between t0+9s - t0+10s, util will be 1000%. Fix this problem by updating io_ticks from diskstats_show() and part_stat_show() if there are rq allocated. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240509123717.3223892-3-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-09 07:59:44 -06:00
Yu Kuai	99dc422335	block: support to account io_ticks precisely Currently, io_ticks is accounted based on sampling, specifically update_io_ticks() will always account io_ticks by 1 jiffies from bdev_start_io_acct()/blk_account_io_start(), and the result can be inaccurate, for example (HZ is 250): Test script: fio -filename=/dev/sda -bs=4k -rw=write -direct=1 -name=test -thinktime=4ms Test result: util is about 90%, while the disk is really idle. This behaviour was introduced by commit `5b18b5a737` ("block: delete part_round_stats and switch to less precise counting"); however, a key point was missed: that patch also improved performance a lot: Before the commit: part_round_stats: if (part->stamp != now) stats |= 1; part_in_flight() -> there can be lots of tasks here in 1 jiffies. part_round_stats_single() __part_stat_add() part->stamp = now; After the commit: update_io_ticks: stamp = part->bd_stamp; if (time_after(now, stamp)) if (try_cmpxchg()) __part_stat_add() -> only one task can reach here in 1 jiffies. Hence in order to account io_ticks precisely, we only need to know if there is IO inflight at most once in one jiffies. Note that for rq-based devices, iterating tags should not be used here because 'tags->lock' is grabbed in blk_mq_find_and_get_req(), hence part_stat_lock_inc/dec() and part_in_flight() are used to trace inflight. The additional overhead is quite small: - per cpu add/dec for each IO for rq-based device; - per cpu sum for each jiffies; And it's verified by null-blk that there is no performance degradation under heavy IO pressure. Fixes: `5b18b5a737` ("block: delete part_round_stats and switch to less precise counting") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240509123717.3223892-2-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-09 07:59:44 -06:00
Yu Kuai	060406c61c	block: add plug while submitting IO So that if caller didn't use plug, for example, __blkdev_direct_IO_simple() and __blkdev_direct_IO_async(), block layer can still benefit from caching nsec time in the plug. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240509123825.3225207-1-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-09 07:57:37 -06:00
Matthew Wilcox (Oracle)	8fde439b2d	bio: Export bio_add_folio_nofail to modules Several modules use __bio_add_page() today and may need to be converted to bio_add_folio_nofail(). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:10 +02:00
Christoph Hellwig	719c15a75e	blk-lib: check for kill signal in ioctl BLKDISCARD Discards can access a significant capacity and take longer than the user expected. A user may change their mind about wanting to run that command and attempt to kill the process and do something else with their device. But since the task is uninterruptible, they have to wait for it to finish, which could be many hours. Open code blkdev_issue_discard in the BLKDISCARD ioctl handler and check for a fatal signal at each iteration so the user doesn't have to wait for their regretted operation to complete naturally. Heavily based on an earlier patch from Keith Busch. Reported-by: Conrad Meyer <conradmeyer@meta.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240506042027.2289826-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-07 07:29:42 -06:00
Keith Busch	0f8e9ecc46	block: add a bio_await_chain helper Add a helper to wait for an entire chain of bios to complete. [hch: split from a larger patch, moved and changed the name now that it is non-static] Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240506042027.2289826-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-07 07:29:42 -06:00
Christoph Hellwig	e8b4869bc7	block: add a blk_alloc_discard_bio helper Factor out a helper from __blkdev_issue_discard that chews off as much as possible from a discard range and allocates a bio for it. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240506042027.2289826-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-07 07:29:42 -06:00
Christoph Hellwig	81c2168c22	block: add a bio_chain_and_submit helper This is basically blk_next_bio just with the bio allocation moved to the caller to allow for more flexible bio handling in the caller. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240506042027.2289826-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-07 07:29:42 -06:00
Christoph Hellwig	30f1e72414	block: move discard checks into the ioctl handler Most bio operations get basic sanity checking in submit_bio and anything more complicated than that is done in the callers. Discards are a bit different from that in that a lot of checking is done in __blkdev_issue_discard, and the specific errnos for that are returned to userspace. Move the checks that require specific errnos to the ioctl handler instead, and just leave the basic sanity checking in submit_bio for the other handlers. This introduces two changes in behavior: 1) the logical block size alignment check of the start and len is lost for non-ioctl callers. This matches what is done for other operations including reads and writes. We should probably verify this for all bios, but for now make discards match the normal flow. 2) for non-ioctl callers all errors are reported on I/O completion now instead of synchronously. Callers in general mostly ignore or log errors so this will actually simplify the code once cleaned up Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240506042027.2289826-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-07 07:29:42 -06:00
Christoph Hellwig	0942592045	block: remove the discard_granularity check in __blkdev_issue_discard We now set a default granularity in the queue limits API, so don't bother with this extra check. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240506042027.2289826-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-07 07:29:42 -06:00
Justin Stitt	ccb326b5f9	block/ioctl: prefer different overflow check Running syzkaller with the newly reintroduced signed integer overflow sanitizer shows this report: [ 62.982337] ------------[ cut here ]------------ [ 62.985692] cgroup: Invalid name [ 62.986211] UBSAN: signed-integer-overflow in ../block/ioctl.c:36:46 [ 62.989370] 9pnet_fd: p9_fd_create_tcp (7343): problem connecting socket to 127.0.0.1 [ 62.992992] 9223372036854775807 + 4095 cannot be represented in type 'long long' [ 62.997827] 9pnet_fd: p9_fd_create_tcp (7345): problem connecting socket to 127.0.0.1 [ 62.999369] random: crng reseeded on system resumption [ 63.000634] GUP no longer grows the stack in syz-executor.2 (7353): 20002000-20003000 (20001000) [ 63.000668] CPU: 0 PID: 7353 Comm: syz-executor.2 Not tainted 6.8.0-rc2-00035-gb3ef86b5a957 #1 [ 63.000677] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 [ 63.000682] Call Trace: [ 63.000686] <TASK> [ 63.000731] dump_stack_lvl+0x93/0xd0 [ 63.000919] __get_user_pages+0x903/0xd30 [ 63.001030] __gup_longterm_locked+0x153e/0x1ba0 [ 63.001041] ? _raw_read_unlock_irqrestore+0x17/0x50 [ 63.001072] ? try_get_folio+0x29c/0x2d0 [ 63.001083] internal_get_user_pages_fast+0x1119/0x1530 [ 63.001109] iov_iter_extract_pages+0x23b/0x580 [ 63.001206] bio_iov_iter_get_pages+0x4de/0x1220 [ 63.001235] iomap_dio_bio_iter+0x9b6/0x1410 [ 63.001297] __iomap_dio_rw+0xab4/0x1810 [ 63.001316] iomap_dio_rw+0x45/0xa0 [ 63.001328] ext4_file_write_iter+0xdde/0x1390 [ 63.001372] vfs_write+0x599/0xbd0 [ 63.001394] ksys_write+0xc8/0x190 [ 63.001403] do_syscall_64+0xd4/0x1b0 [ 63.001421] ? 
arch_exit_to_user_mode_prepare+0x3a/0x60 [ 63.001479] entry_SYSCALL_64_after_hwframe+0x6f/0x77 [ 63.001535] RIP: 0033:0x7f7fd3ebf539 [ 63.001551] Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 f1 14 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48 [ 63.001562] RSP: 002b:00007f7fd32570c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 [ 63.001584] RAX: ffffffffffffffda RBX: 00007f7fd3ff3f80 RCX: 00007f7fd3ebf539 [ 63.001590] RDX: 4db6d1e4f7e43360 RSI: 0000000020000000 RDI: 0000000000000004 [ 63.001595] RBP: 00007f7fd3f1e496 R08: 0000000000000000 R09: 0000000000000000 [ 63.001599] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 [ 63.001604] R13: 0000000000000006 R14: 00007f7fd3ff3f80 R15: 00007ffd415ad2b8 ... [ 63.018142] ---[ end trace ]--- Historically, the signed integer overflow sanitizer did not work in the kernel due to its interaction with `-fwrapv` but this has since been changed [1] in the newest version of Clang; It was re-enabled in the kernel with Commit `557f8c582a` ("ubsan: Reintroduce signed overflow sanitizer"). Let's rework this overflow checking logic to not actually perform an overflow during the check itself, thus avoiding the UBSAN splat. [1]: https://github.com/llvm/llvm-project/pull/82432 Signed-off-by: Justin Stitt <justinstitt@google.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240507-b4-sio-block-ioctl-v3-1-ba0c2b32275e@google.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-07 07:29:20 -06:00
Ming Lei	ffd379c13f	block: set default max segment size in case of virt_boundary For devices with a virt_boundary limit, the driver may provide a zero max segment size, so we have to set it to UINT_MAX by default. Otherwise, it may cause a warning in the driver when handling the sglist. Fix it by setting the default max segment size to UINT_MAX. Cc: Christoph Hellwig <hch@lst.de> Cc: Mike Snitzer <snitzer@kernel.org> Fixes: `b561ea56a2` ("block: allow device to have both virt_boundary_mask and max segment size") Tested-by: Geert Uytterhoeven <geert+renesas@glider.be> Reported-by: Geert Uytterhoeven <geert+renesas@glider.be> Closes: https://lore.kernel.org/linux-block/7e38b67c-9372-a42d-41eb-abdce33d3372@linux-m68k.org/ Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20240424134722.2584284-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-06 20:27:51 -06:00
INAGAKI Hiroshi	bc2e07dfd2	block: fix and simplify blkdevparts= cmdline parsing Fix the cmdline parsing of the "blkdevparts=" parameter using strsep(), which makes the code simpler. Before commit `146afeb235` ("block: use strscpy() to instead of strncpy()"), we used a strncpy() to copy a block device name and partition names. The commit simply replaced a strncpy() and NULL termination with a strscpy(). It did not update calculations of length passed to strscpy(). While the length passed to strncpy() is just a length of valid characters without NULL termination ('\0'), strscpy() takes it as a length of the destination buffer, including a NULL termination. Since the source buffer is not necessarily NULL terminated, the current code copies "length - 1" characters and puts a NULL character in the destination buffer. It replaces the last character with NULL and breaks the parsing. As an example, that buffer will be passed to parse_parts() and breaks parsing sub-partitions due to the missing ')' at the end, like the following. example (Check Point V-80 & OpenWrt): - Linux Kernel 6.6 [ 0.000000] Kernel command line: console=ttyS0,115200 earlycon=uart8250,mmio32,0xf0512000 crashkernel=30M mvpp2x.queue_mode=1 blkdevparts=mmcblk1:48M@10M(kernel-1),1M(dtb-1),720M(rootfs-1),48M(kernel-2),1M(dtb-2),720M(rootfs-2),300M(default_sw),650M(logs),1M(preset_cfg),1M(adsl),-(storage) maxcpus=4 ... [ 0.884016] mmc1: new HS200 MMC card at address 0001 [ 0.889951] mmcblk1: mmc1:0001 004GA0 3.69 GiB [ 0.895043] cmdline partition format is invalid. [ 0.895704] mmcblk1: p1 [ 0.903447] mmcblk1boot0: mmc1:0001 004GA0 2.00 MiB [ 0.908667] mmcblk1boot1: mmc1:0001 004GA0 2.00 MiB [ 0.913765] mmcblk1rpmb: mmc1:0001 004GA0 512 KiB, chardev (248:0) 1. "48M@10M(kernel-1),..." is passed to strscpy() with length=17 from parse_parts() 2. strscpy() returns -E2BIG and the destination buffer has "48M@10M(kernel-1\0" 3. "48M@10M(kernel-1\0" is passed to parse_subpart() 4. 
parse_subpart() fails to find ')' when parsing a partition name, and returns error - Linux Kernel 6.1 [ 0.000000] Kernel command line: console=ttyS0,115200 earlycon=uart8250,mmio32,0xf0512000 crashkernel=30M mvpp2x.queue_mode=1 blkdevparts=mmcblk1:48M@10M(kernel-1),1M(dtb-1),720M(rootfs-1),48M(kernel-2),1M(dtb-2),720M(rootfs-2),300M(default_sw),650M(logs),1M(preset_cfg),1M(adsl),-(storage) maxcpus=4 ... [ 0.953142] mmc1: new HS200 MMC card at address 0001 [ 0.959114] mmcblk1: mmc1:0001 004GA0 3.69 GiB [ 0.964259] mmcblk1: p1(kernel-1) p2(dtb-1) p3(rootfs-1) p4(kernel-2) p5(dtb-2) 6(rootfs-2) p7(default_sw) p8(logs) p9(preset_cfg) p10(adsl) p11(storage) [ 0.979174] mmcblk1boot0: mmc1:0001 004GA0 2.00 MiB [ 0.984674] mmcblk1boot1: mmc1:0001 004GA0 2.00 MiB [ 0.989926] mmcblk1rpmb: mmc1:0001 004GA0 512 KiB, chardev (248:0 By the way, strscpy() takes a length of destination buffer and it is often confusing when copying characters with a specified length. Using strsep() helps to separate the string by the specified character. Then, we can use strscpy() naturally with the size of the destination buffer. Separating the string on the fly is also useful to omit the redundant string copy, reducing memory usage and improve the code readability. Fixes: `146afeb235` ("block: use strscpy() to instead of strncpy()") Suggested-by: Naohiro Aota <naota@elisp.net> Signed-off-by: INAGAKI Hiroshi <musashino.open@gmail.com> Reviewed-by: Daniel Golle <daniel@makrotopia.org> Link: https://lore.kernel.org/r/20240421074005.565-1-musashino.open@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-03 09:57:53 -06:00
Christoph Hellwig	0c12028aec	block: refine the EOF check in blkdev_iomap_begin blkdev_iomap_begin rounds down the offset to the logical block size before stashing it in iomap->offset and checking that it still is inside the inode size. Change the i_size check to use the raw pos value so that we don't try a zero size write if iter->pos is unaligned. Fixes: `487c607df7` ("block: use iomap for writes to block devices") Reported-by: syzbot+0a3683a0a6fecf909244@syzkaller.appspotmail.com Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: syzbot+0a3683a0a6fecf909244@syzkaller.appspotmail.com Link: https://lore.kernel.org/r/20240503081042.2078062-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-03 09:05:11 -06:00
Christoph Hellwig	a4217c6740	block: add a partscan sysfs attribute for disks Userspace had been unknowingly relying on a non-stable interface of kernel internals to determine if partition scanning is enabled for a given disk. Provide a stable interface for this purpose instead. Cc: stable@vger.kernel.org # 6.3+ Depends-on: `140ce28dd3` ("block: add a disk_has_partscan helper") Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/linux-block/ZhQJf8mzq_wipkBH@gardel-login/ Link: https://lore.kernel.org/r/20240502130033.1958492-3-hch@lst.de [axboe: add links and commit message from Keith] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-03 09:00:07 -06:00
Christoph Hellwig	140ce28dd3	block: add a disk_has_partscan helper Add a helper to check if partition scanning is enabled instead of open coding the check in a few places. This now always checks for the hidden flag even if all but one of the callers are never reachable for hidden gendisks. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240502130033.1958492-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-03 08:59:59 -06:00
Al Viro	203c1ce0bb	RIP ->bd_inode Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-05-03 02:36:56 -04:00
Al Viro	df65f1660b	block/bdev.c: use the knowledge of inode/bdev coallocation Here we know that bdevfs inodes are coallocated with struct block_device and we can get to ->bd_inode value without any dereferencing. Introduce an inlined helper (static, not exported, purely internal for bdev.c) that gets an associated inode by block_device - BD_INODE(bdev). NOTE: leave it static; nobody outside of block/bdev.c has any business playing with that. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-05-03 02:36:55 -04:00
Al Viro	881494ed03	blk_ioctl_{discard,zeroout}(): we only want ->bd_inode->i_mapping here... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/r/20240411145346.2516848-6-viro@zeniv.linux.org.uk Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-05-03 02:36:51 -04:00
Al Viro	224941e837	use ->bd_mapping instead of ->bd_inode->i_mapping Just the low-hanging fruit... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/r/20240411145346.2516848-2-viro@zeniv.linux.org.uk Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-05-03 02:36:51 -04:00
Al Viro	e33aef2c58	block_device: add a pointer to struct address_space (page cache of bdev) points to ->i_data of coallocated inode. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/r/20240411145346.2516848-1-viro@zeniv.linux.org.uk Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-05-03 02:36:50 -04:00
Al Viro	2638c20876	missing helpers: bdev_unhash(), bdev_drop() bdev_unhash(): make block device invisible to lookups by device number bdev_drop(): drop reference to associated inode. Both are internal, for use by genhd and partition-related code - similar to bdev_add(). The logics in there (especially the lifetime-related parts of it) ought to be cleaned up, but that's a separate story; here we just encapsulate getting to associated inode. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-05-03 02:36:21 -04:00
Yu Kuai	186ddac207	block: move two helpers into bdev.c disk_live() and block_size() access bd_inode directly, prepare to remove the field bd_inode from block_device, and only access bd_inode in block layer. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/r/20240411145346.2516848-8-viro@zeniv.linux.org.uk Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-05-03 02:36:21 -04:00
Al Viro	39c3b4e7d0	blkdev_write_iter(): saner way to get inode and bdev ... same as in other methods - bdev_file_inode() and I_BDEV() of that. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/r/20240411145346.2516848-5-viro@zeniv.linux.org.uk Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-05-03 02:35:57 -04:00
Al Viro	811ba89a88	bdev: move ->bd_make_it_fail to ->__bd_flags Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-05-02 20:04:18 -04:00
Al Viro	49a43dae93	bdev: move ->bd_ro_warned to ->__bd_flags Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-05-02 20:04:17 -04:00
Al Viro	ac2b6f9dee	bdev: move ->bd_has_subit_bio to ->__bd_flags In bdev_alloc() we have all flags initialized to false, so assignment to ->bd_has_submit_bio in there is a no-op unless we have partno != 0 and flag already set on entire device. In device_add_disk() we have just allocated the block_device in question and it had been a full-device one, so the flag is guaranteed to be still clear when we get to assignment. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-05-02 20:00:37 -04:00
Al Viro	4c80105e39	bdev: move ->bd_write_holder into ->__bd_flags Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-05-02 19:50:29 -04:00
Al Viro	01e198f01d	bdev: move ->bd_read_only to ->__bd_flags Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-05-02 19:50:29 -04:00
Al Viro	1116b9fa15	bdev: infrastructure for flags Replace bd_partno with a 32bit field (__bd_flags). The lower 8 bits contain the partition number, the upper 24 are for flags. Helpers: bdev_{test,set,clear}_flag(bdev, flag), with atomic_or() and atomic_andnot() used to set/clear. NOTE: this commit does not actually move any flags over there - they are still bool fields. As the result, it shifts the fields wrt cacheline boundaries; that's going to be restored once the first 3 flags are dealt with. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-05-02 19:50:11 -04:00
Al Viro	b8c873edbf	wrapper for access to ->bd_partno On the next step it's going to get folded into a field where flags will go. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-05-02 17:48:09 -04:00
Al Viro	3f9b8fb46e	Use bdev_is_paritition() instead of open-coding it Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-05-02 17:48:09 -04:00
Al Viro	d18a867958	make set_blocksize() fail unless block device is opened exclusive Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-05-02 17:39:44 -04:00
Al Viro	ead083aeee	set_blocksize(): switch to passing struct file * Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-05-02 17:39:44 -04:00
Damien Le Moal	d7580149ef	block: Cleanup blk_revalidate_zone_cb() Define the code for checking conventional and sequential write required zones using the functions blk_revalidate_conv_zone() and blk_revalidate_seq_zone() respectively. This simplifies the zone type switch-case in blk_revalidate_zone_cb(). No functional changes. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20240501110907.96950-15-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-01 08:08:43 -06:00
Damien Le Moal	c9c8aea03c	block: Simplify zone write plug BIO abort When BIOs plugged in a zone write plug are aborted, blk_zone_wplug_bio_io_error() clears the BIO BIO_ZONE_WRITE_PLUGGING flag so that bio_io_error(bio) does not end up calling blk_zone_write_plug_bio_endio() and we thus need to manually drop the reference on the zone write plug held by the aborted BIO. Move the call to disk_put_zone_wplug() that is always following the call to blk_zone_wplug_bio_io_error() inside that function to simplify the code. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240501110907.96950-14-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-01 08:08:43 -06:00
Damien Le Moal	b5a64ec2ea	block: Simplify blk_zone_write_plug_bio_endio() We already have the disk variable obtained from the bio when calling disk_get_zone_wplug(). So use that variable instead of dereferencing the bio bdev again for the disk argument of disk_get_zone_wplug(). Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240501110907.96950-13-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-01 08:08:43 -06:00
Damien Le Moal	347bde9da1	block: Improve zone write request completion handling blk_zone_complete_request() must be called to handle the completion of a zone write request handled with zone write plugging. This function is called from blk_complete_request(), blk_update_request() and also in blk_mq_submit_bio() error path. Improve this by moving this function call into blk_mq_finish_request() as all requests are processed with this function when they complete as well as when they are freed without being executed. This also improves blk_update_request() used by scsi devices as these may repeatedly call this function to handle partial completions. To be consistent with this change, blk_zone_complete_request() is renamed to blk_zone_finish_request() and blk_zone_write_plug_complete_request() is renamed to blk_zone_write_plug_finish_request(). Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240501110907.96950-12-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-01 08:08:43 -06:00
Damien Le Moal	c4c3ffdab2	block: Improve blk_zone_write_plug_bio_merged() Improve blk_zone_write_plug_bio_merged() to check that we successfully get a reference on the zone write plug of the merged BIO, as expected since for a merge we already have at least one request and one BIO referencing the zone write plug. Comments in this function are also improved to better explain the references to the BIO zone write plug. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240501110907.96950-11-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-01 08:08:43 -06:00
Damien Le Moal	096bc7ea33	block: Fix handling of non-empty flush write requests to zones Zone write plugging ignores empty (no data) flush operations but handles flush BIOs that have data to ensure that the flush machinery generated write is processed in order. However, the call to blk_zone_write_plug_attempt_merge() which sets a request RQF_ZONE_WRITE_PLUGGING flag is called after blk_insert_flush(), thus missing indicating that a non empty flush request completion needs handling by zone write plugging. Fix this by moving the call to blk_zone_write_plug_attempt_merge() before blk_insert_flush(). And while at it, rename that function as blk_zone_write_plug_init_request() to be clear that it is not just about merging plugged BIOs in the request. While at it, also add a WARN_ONCE() check that the zone write plug for the request is not NULL. Fixes: `dd291d77cc` ("block: Introduce zone write plugging") Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240501110907.96950-10-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-01 08:08:43 -06:00
Damien Le Moal	af147b740f	block: Fix flush request sector restore Make sure that a request bio is not NULL before trying to restore the request start sector. Reported-by: Yi Zhang <yi.zhang@redhat.com> Fixes: `6f8fd758de` ("block: Restore sector of flush requests") Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240501110907.96950-9-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-01 08:08:43 -06:00
Damien Le Moal	7b29518728	block: Do not remove zone write plugs still in use Large write BIOs that span a zone boundary are split in blk_mq_submit_bio() before being passed to blk_zone_plug_bio() for zone write plugging. Such split BIO will be chained with one fragment targeting one zone and the remainder of the BIO targeting the next zone. The two BIOs can be executed in parallel, without a predetermined order relative to each other and their completion may be reversed: the remainder first completing and the first fragment then completing. In such case, bio_endio() will not immediately execute blk_zone_write_plug_bio_endio() for the parent BIO (the remainder of the split BIO) as the BIOs are chained. blk_zone_write_plug_bio_endio() for the parent BIO will be executed only once the first fragment completes. In the case of a device with small zones and very large BIOs, such a completion pattern can lead to disk_should_remove_zone_wplug() returning true for the zone of the parent BIO when the parent BIO request completes and blk_zone_write_plug_complete_request() is executed. This triggers the removal of the zone write plug from the hash table using disk_remove_zone_wplug(). With the zone write plug of the parent BIO missing, the call to disk_get_zone_wplug() in blk_zone_write_plug_bio_endio() returns NULL and triggers a warning. This pattern can be recreated fairly easily using a scsi_debug device with small zones and btrfs. E.g. modprobe scsi_debug delay=0 dev_size_mb=1024 sector_size=4096 \ zbc=host-managed zone_cap_mb=3 zone_nr_conv=0 zone_size_mb=4 mkfs.btrfs -f -O zoned /dev/sda mount -t btrfs /dev/sda /mnt fio --name=wrtest --rw=randwrite --direct=1 --ioengine=libaio \ --bs=4k --iodepth=16 --size=1M --directory=/mnt --time_based \ --runtime=10 umount /dev/sda Will result in the warning: [ 29.035538] WARNING: CPU: 3 PID: 37 at block/blk-zoned.c:1207 blk_zone_write_plug_bio_endio+0xee/0x1e0 ... [ 29.058682] Call Trace: [ 29.059095] <TASK> [ 29.059473] ? 
__warn+0x80/0x120 [ 29.059983] ? blk_zone_write_plug_bio_endio+0xee/0x1e0 [ 29.060728] ? report_bug+0x160/0x190 [ 29.061283] ? handle_bug+0x36/0x70 [ 29.061830] ? exc_invalid_op+0x17/0x60 [ 29.062399] ? asm_exc_invalid_op+0x1a/0x20 [ 29.063025] ? blk_zone_write_plug_bio_endio+0xee/0x1e0 [ 29.063760] bio_endio+0xb7/0x150 [ 29.064280] btrfs_clone_write_end_io+0x2b/0x60 [btrfs] [ 29.065049] blk_update_request+0x17c/0x500 [ 29.065666] scsi_end_request+0x27/0x1a0 [scsi_mod] [ 29.066356] scsi_io_completion+0x5b/0x690 [scsi_mod] [ 29.067077] blk_complete_reqs+0x3a/0x50 [ 29.067692] __do_softirq+0xcf/0x2b3 [ 29.068248] ? sort_range+0x20/0x20 [ 29.068791] run_ksoftirqd+0x1c/0x30 [ 29.069339] smpboot_thread_fn+0xcc/0x1b0 [ 29.069936] kthread+0xcf/0x100 [ 29.070438] ? kthread_complete_and_exit+0x20/0x20 [ 29.071314] ret_from_fork+0x31/0x50 [ 29.071873] ? kthread_complete_and_exit+0x20/0x20 [ 29.072563] ret_from_fork_asm+0x11/0x20 [ 29.073146] </TASK> either when fio executes or when unmount is executed. Fix this by modifying disk_should_remove_zone_wplug() to check that the reference count to a zone write plug is not larger than 2, that is, that the only references left on the zone are the caller held reference (blk_zone_write_plug_complete_request()) and the initial extra reference for the zone write plug taken when it was initialized (and that is dropped when the zone write plug is removed from the hash table). To be consistent with this change, make sure to drop the request or BIO held reference to the zone write plug before calling disk_zone_wplug_unplug_bio(). All references are also dropped using disk_put_zone_wplug() instead of atomic_dec() to ensure that the zone write plug is freed if it needs to be. Comments are also improved to clarify zone write plugs reference handling. 
Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Fixes: `dd291d77cc` ("block: Introduce zone write plugging") Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20240501110907.96950-8-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-01 08:08:43 -06:00
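The reference count rule described in this commit can be illustrated with a minimal userspace sketch. The struct and function names below are hypothetical stand-ins, not the kernel's definitions, and the "idle" conditions are simplified assumptions; only the "no more than 2 references" check comes from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct blk_zone_wplug. */
struct wplug_sketch {
	int ref;         /* references currently held on the plug */
	bool has_bios;   /* BIOs are still queued in the plug (assumed condition) */
	bool zone_idle;  /* zone is empty or full (assumed condition) */
};

/*
 * Returns true only when the plug is idle and the sole remaining
 * references are the caller's reference and the initial reference
 * taken when the plug was inserted in the hash table.
 */
bool wplug_should_remove(const struct wplug_sketch *p)
{
	assert(p != NULL);

	if (p->has_bios)
		return false;
	/* A third reference means a request or BIO still uses the plug. */
	if (p->ref > 2)
		return false;
	return p->zone_idle;
}
```

With a chained parent BIO still holding a reference, `ref` is 3 and the plug stays in the hash table, which is the situation the warning above exposed.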
Damien Le Moal	79ae35a423	block: Unhash a zone write plug only if needed Fix disk_remove_zone_wplug() to ensure that a zone write plug already removed from a disk hash table of zone write plugs is not removed again. Do this by checking the BLK_ZONE_WPLUG_UNHASHED flag of the plug and calling hlist_del_init_rcu() only if the flag is not set. Furthermore, since BIO completions can happen at any time, that is, decrementing of the zone write plug reference count can happen at any time, make sure to use disk_put_zone_wplug() instead of atomic_dec() to ensure that the zone write plug is freed when its last reference is dropped. In order to do this, disk_remove_zone_wplug() is moved after the definition of disk_put_zone_wplug(). disk_should_remove_zone_wplug() is moved as well to keep it together with disk_remove_zone_wplug(). To be consistent with this change, add a check in disk_put_zone_wplug() to ensure that a zone write plug being freed was already removed from the disk hash table. Fixes: `dd291d77cc` ("block: Introduce zone write plugging") Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240501110907.96950-7-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-01 08:08:43 -06:00
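A minimal sketch of the "unhash only once" pattern from this commit, assuming a simplified plug with a flag word and a plain integer reference count. All names are hypothetical; the real code uses `BLK_ZONE_WPLUG_UNHASHED`, `hlist_del_init_rcu()` and `disk_put_zone_wplug()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define WPLUG_SKETCH_UNHASHED (1u << 0) /* stands in for BLK_ZONE_WPLUG_UNHASHED */

/* Hypothetical, simplified zone write plug. */
struct hashed_plug_sketch {
	unsigned int flags;
	int ref;
	bool in_table;
};

/* Drop one reference; returns true when the plug would be freed. */
bool plug_put(struct hashed_plug_sketch *p)
{
	assert(p != NULL && p->ref > 0);

	p->ref--;
	if (p->ref == 0) {
		/* A plug being freed must already be out of the hash table. */
		assert(p->flags & WPLUG_SKETCH_UNHASHED);
		return true;
	}
	return false;
}

/* Remove the plug from the table at most once; returns true if it did. */
bool plug_remove(struct hashed_plug_sketch *p)
{
	assert(p != NULL);

	if (p->flags & WPLUG_SKETCH_UNHASHED)
		return false;

	p->flags |= WPLUG_SKETCH_UNHASHED;
	p->in_table = false; /* stands in for hlist_del_init_rcu() */
	plug_put(p);         /* drop the hash table's reference */
	return true;
}
```

Calling `plug_remove()` twice only drops the table reference once, so a later `plug_put()` by the last user is the one that frees the plug.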
Damien Le Moal	9e78c38ab3	block: Hold a reference on zone write plugs to schedule submission Since a zone write plug BIO work is a field of struct blk_zone_wplug, we must ensure that a zone write plug is never freed when its BIO submission work is queued or running. Do this by holding a reference on the zone write plug when the submission work is scheduled for execution with queue_work() and releasing the reference at the end of the execution of the work function blk_zone_wplug_bio_work(). The helper function disk_zone_wplug_schedule_bio_work() is introduced to get a reference on a zone write plug and queue its work. This helper is used in disk_zone_wplug_unplug_bio() and disk_zone_wplug_handle_error(). Fixes: `dd291d77cc` ("block: Introduce zone write plugging") Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240501110907.96950-6-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-01 08:08:43 -06:00
Damien Le Moal	19aad274c2	block: Fix reference counting for zone write plugs in error state When a zone is reset or finished, disk_zone_wplug_set_wp_offset() is called to update the zone write plug write pointer offset and to clear the zone error state (BLK_ZONE_WPLUG_ERROR flag) if it is set. However, this processing is missing dropping the reference to the zone write plug that was taken in disk_zone_wplug_set_error() when the error flag was first set. Furthermore, the error state handling must release the zone write plug lock to first execute a report zones command. When the report zone races with a reset or finish operation that clears the error, we can end up decrementing the zone write plug reference count twice: once in disk_zone_wplug_set_wp_offset() for the reset/finish operation and one more time in disk_zone_wplugs_work() once disk_zone_wplug_handle_error() completes. Fix this by introducing disk_zone_wplug_clear_error() as the symmetric function of disk_zone_wplug_set_error(). disk_zone_wplug_clear_error() decrements the zone write plug reference count obtained in disk_zone_wplug_set_error() only if the error handling has not started yet, that is, only if disk_zone_wplugs_work() has not yet taken the zone write plug off the error list. This ensures that either disk_zone_wplug_clear_error() or disk_zone_wplugs_work() drops the zone write plug reference count. Fixes: `dd291d77cc` ("block: Introduce zone write plugging") Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240501110907.96950-5-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-01 08:08:43 -06:00
Damien Le Moal	74b7ae5f48	block: Fix zone write plug initialization from blk_revalidate_zone_cb() When revalidating the zones of a zoned block device, blk_revalidate_zone_cb() must allocate a zone write plug for any sequential write required zone that is not empty nor full. However, the current code tests the latter case by comparing the zone write pointer offset to the zone size instead of the zone capacity. Furthermore, disk_get_and_lock_zone_wplug() is called with a sector argument equal to the zone start instead of the current zone write pointer position. This commit fixes both issues by calling disk_get_and_lock_zone_wplug() for a zone that is not empty and with a write pointer offset lower than the zone capacity and use the zone capacity sector as the sector argument for disk_get_and_lock_zone_wplug(). Fixes: `dd291d77cc` ("block: Introduce zone write plugging") Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240501110907.96950-4-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-01 08:08:42 -06:00
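The "not empty and not full" test from this commit can be sketched as a single predicate. The function name and the use of 512-byte sector counts are assumptions for illustration; the point is only that the upper bound must be the zone capacity rather than the zone size.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper: a sequential write required zone needs a zone
 * write plug during revalidation only if it is partially written, that
 * is, its write pointer offset is above 0 and below the zone capacity.
 * All values are in sectors relative to the zone start.
 */
bool zone_needs_wplug(unsigned int wp_offset, unsigned int zone_capacity)
{
	assert(zone_capacity > 0);

	return wp_offset > 0 && wp_offset < zone_capacity;
}
```

For a zone with a 4 MiB size (8192 sectors) and a 3 MiB capacity (6144 sectors), a write pointer offset of 6144 means the zone is full, so `zone_needs_wplug(6144, 6144)` is false. Passing the zone size as the bound instead, as the code did before this fix, would wrongly report that a plug is needed.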
Damien Le Moal	6b7593b5fb	block: Exclude conventional zones when faking max open limit For a device that has no limits for the maximum number of open and active zones, we default to using the number of zones, limited to BLK_ZONE_WPLUG_DEFAULT_POOL_SIZE (128), for the maximum number of open zones indicated to the user. However, for a device that has conventional zones and less zones than BLK_ZONE_WPLUG_DEFAULT_POOL_SIZE, we should not account conventional zones and set the limit to the number of sequential write required zones. Furthermore, for cases where the limit is equal to the number of sequential write required zones, we can advertize a limit of 0 to indicate "no limits". Fix this by moving the zone write plug mempool resizing from disk_revalidate_zone_resources() to disk_update_zone_resources() where we can safely compute the number of conventional zones and update the limits. Fixes: `843283e96e` ("block: Fake max open zones limit when there is no limit") Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240501110907.96950-3-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-01 08:08:42 -06:00
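The limit computation described in this commit can be sketched as plain arithmetic. The function and macro names are hypothetical; the value 128 is the BLK_ZONE_WPLUG_DEFAULT_POOL_SIZE quoted in the message, and the sketch assumes the device reports no open or active zone limits of its own.

```c
#include <assert.h>

#define POOL_SIZE_SKETCH 128u /* stands in for BLK_ZONE_WPLUG_DEFAULT_POOL_SIZE */

/*
 * Hypothetical helper: compute the max open zones limit advertised for
 * a device without native limits. Conventional zones are excluded, and
 * a limit covering every sequential zone is reported as 0 ("no limit").
 */
unsigned int fake_max_open_zones(unsigned int nr_zones, unsigned int nr_conv_zones)
{
	unsigned int nr_seq_zones, limit;

	assert(nr_conv_zones <= nr_zones);

	nr_seq_zones = nr_zones - nr_conv_zones;
	limit = nr_seq_zones < POOL_SIZE_SKETCH ? nr_seq_zones : POOL_SIZE_SKETCH;

	return limit == nr_seq_zones ? 0 : limit;
}
```

A device with 1000 sequential zones gets a limit of 128, while a device with 100 zones of which 10 are conventional has only 90 sequential zones and is advertised with 0.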
Linus Torvalds	52034cae02	vfs-6.9-rc6.fixes Merge tag 'vfs-6.9-rc6.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: "This contains a few small fixes for this merge window and the attempt to handle the ntfs removal regression that was reported a little while ago: - After the removal of the legacy ntfs driver we received reports about regressions for some people that do mount "ntfs" explicitly and expect the driver to be available. Since ntfs3 is a drop-in for legacy ntfs we alias legacy ntfs to ntfs3 just like ext3 is aliased to ext4. We also enforce legacy ntfs is always mounted read-only and give it custom file operations to ensure that ioctl()'s can't be abused to perform write operations. - Fix an unbalanced module_get() in bdev_open(). - Two smaller fixes for the netfs work done earlier in this cycle. - Fix the errno returned from the new FS_IOC_GETUUID and FS_IOC_GETFSSYSFSPATH ioctls. Both commands just pull information out of the superblock so there's no need to call into the actual ioctl handlers. So instead of returning ENOIOCTLCMD to indicate to fallback we just return ENOTTY directly avoiding that indirection" * tag 'vfs-6.9-rc6.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: netfs: Fix the pre-flush when appending to a file in writethrough mode netfs: Fix writethrough-mode error handling ntfs3: add legacy ntfs file operations ntfs3: enforce read-only when used as legacy ntfs driver ntfs3: serve as alias for the legacy ntfs driver block: fix module reference leakage from bdev_open_by_dev error path fs: Return ENOTTY directly if FS_IOC_GETUUID or FS_IOC_GETFSSYSFSPATH fail	2024-04-26 11:01:28 -07:00
Arnd Bergmann	597bc741e5	block/partitions/ldm: convert strncpy() to strscpy() The strncpy() here can cause a non-terminated string, which older gcc versions such as gcc-9 warn about: In function 'ldm_parse_tocblock', inlined from 'ldm_validate_tocblocks' at block/partitions/ldm.c:386:7, inlined from 'ldm_partition' at block/partitions/ldm.c:1457:7: block/partitions/ldm.c:134:2: error: 'strncpy' specified bound 16 equals destination size [-Werror=stringop-truncation] 134 \| strncpy (toc->bitmap1_name, data + 0x24, sizeof (toc->bitmap1_name)); \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ block/partitions/ldm.c:145:2: error: 'strncpy' specified bound 16 equals destination size [-Werror=stringop-truncation] 145 \| strncpy (toc->bitmap2_name, data + 0x46, sizeof (toc->bitmap2_name)); \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ New versions notice that the code is correct after all because of the following termination, but replacing the strncpy() with strscpy_pad() or strcpy() avoids the warning and simplifies the code at the same time. Use the padding version here to keep the existing behavior, in case the code relies on not including uninitialized data. Link: https://lkml.kernel.org/r/20240409140059.3806717-4-arnd@kernel.org Reviewed-by: Justin Stitt <justinstitt@google.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Alexey Starikovskiy <astarikovskiy@suse.de> Cc: Bob Moore <robert.moore@intel.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Len Brown <lenb@kernel.org> Cc: Lin Ming <ming.m.lin@intel.com> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Nicolas Schier <nicolas@fjasle.eu> Cc: Rafael J. 
Wysocki <rafael.j.wysocki@intel.com> Cc: "Richard Russon (FlatCap)" <ldm@flatcap.org> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-04-25 21:07:07 -07:00
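The difference between a bounded copy that may leave the destination unterminated and a padding copy can be sketched in userspace C. `copy_pad_sketch()` below is a hypothetical stand-in written for this example; it is not the kernel's strscpy_pad(), and its -1 return value on truncation is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical userspace stand-in for a padding string copy: copies at
 * most size - 1 bytes, always NUL-terminates, zero-fills the rest of
 * dst, and returns the copied length or -1 if src was truncated.
 * src must provide at least size readable bytes or a NUL before that.
 */
long copy_pad_sketch(char *dst, const char *src, size_t size)
{
	size_t len = 0;
	long ret;

	assert(dst != NULL && src != NULL && size > 0);

	/* Copy at most size - 1 bytes so one byte is always left for the NUL. */
	while (len < size - 1 && src[len] != '\0') {
		dst[len] = src[len];
		len++;
	}

	/* src[len] is still within the first size bytes of the source. */
	ret = (src[len] == '\0') ? (long)len : -1;

	/* NUL-terminate and zero-fill the remainder of the destination. */
	memset(dst + len, 0, size - len);
	return ret;
}
```

With a 16-byte destination, a 19-character source is cut to 15 characters plus the terminator, whereas strncpy() with a bound of 16 would fill all 16 bytes and leave no terminator, which is the case the compiler warning quoted above refers to.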
Johannes Thumshirn	57787fa42f	block: check if zone_wplugs_hash exists in queue_zone_wplugs_show Changhui reported a kernel crash when running this simple shell reproducer: # cd /sys/kernel/debug/block && find . -type f -exec grep -aH . {} \; The above results in a NULL pointer dereference if a device does not have a zone_wplugs_hash allocated. To fix this, return early if we don't have a zone_wplugs_hash. Reported-by: Changhui Zhong <czhong@redhat.com> Fixes: `a98b05b02f` ("block: Replace zone_wlock debugfs entry with zone_wplugs entry") Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/e5fec079dfca448cc21c425cfa5d7b291f5faa67.1714046443.git.johannes.thumshirn@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-25 07:47:46 -06:00
Damien Le Moal	a8f59e5a5d	block: use a per disk workqueue for zone write plugging A zone write plug BIO work function blk_zone_wplug_bio_work() calls submit_bio_noacct_nocheck() to execute the next unplugged BIO. This function may block. So executing zone plugs BIO works using the block layer global kblockd workqueue can potentially lead to performance or latency issues as the number of concurrent work for a workqueue is limited to WQ_DFL_ACTIVE (256). 1) For a system with a large number of zoned disks, issuing write requests to otherwise unused zones may be delayed waiting for a work thread to become available. 2) Requeue operations which use kblockd but are independent of zone write plugging may also end up being delayed. To avoid these potential performance issues, create a workqueue per zoned device to execute zone plugs BIO work. The workqueue max active parameter is set to the maximum number of zone write plugs allocated with the zone write plug mempool. This limit is equal to the maximum number of open zones of the disk and defaults to 128 for disks that do not have a limit on the number of open zones. Fixes: `dd291d77cc` ("block: Introduce zone write plugging") Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240420075811.1276893-3-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-23 09:46:34 -06:00
Linus Torvalds	977b1ef518	block-6.9-20240420 Merge tag 'block-6.9-20240420' of git://git.kernel.dk/linux Pull block fixes from Jens Axboe: "Just two minor fixes that should go into the 6.9 kernel release, one fixing a regression with partition scanning errors, and one fixing a WARN_ON() that can get triggered if we race with a timer" * tag 'block-6.9-20240420' of git://git.kernel.dk/linux: blk-iocost: do not WARN if iocg was already offlined block: propagate partition scanning errors to the BLKRRPART ioctl	2024-04-20 11:28:02 -07:00
Jiapeng Chong	8294d49adb	block/mq-deadline: Remove some unused functions These functions are defined in the mq-deadline.c file, but not called elsewhere, so delete these unused functions. block/mq-deadline.c:134:1: warning: unused function 'deadline_earlier_request'. block/mq-deadline.c:148:1: warning: unused function 'deadline_latter_request'. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=8803 Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Link: https://lore.kernel.org/r/20240419025610.34298-1-jiapeng.chong@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-19 08:10:36 -06:00
Li Nan	01bc4fda9e	blk-iocost: do not WARN if iocg was already offlined In iocg_pay_debt(), warn is triggered if 'active_list' is empty, which is intended to confirm iocg is active when it has debt. However, warn can be triggered during a blkcg or disk removal, if iocg_waitq_timer_fn() is run at that time: WARNING: CPU: 0 PID: 2344971 at block/blk-iocost.c:1402 iocg_pay_debt+0x14c/0x190 Call trace: iocg_pay_debt+0x14c/0x190 iocg_kick_waitq+0x438/0x4c0 iocg_waitq_timer_fn+0xd8/0x130 __run_hrtimer+0x144/0x45c __hrtimer_run_queues+0x16c/0x244 hrtimer_interrupt+0x2cc/0x7b0 The warn in this situation is meaningless. Since this iocg is being removed, the state of the 'active_list' is irrelevant, and 'waitq_timer' is canceled after removing 'active_list' in ioc_pd_free(), which ensures iocg is freed after iocg_waitq_timer_fn() returns. Therefore, add the check if iocg was already offlined to avoid warn when removing a blkcg or disk. Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20240419093257.3004211-1-linan666@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-19 08:06:24 -06:00
Christoph Hellwig	752863bdda	block: propagate partition scanning errors to the BLKRRPART ioctl Commit `4601b4b130` ("block: reopen the device in blkdev_reread_part") lost the propagation of I/O errors from the low-level read of the partition table to the user space caller of the BLKRRPART. Apparently some user space relies on it, so restore the propagation. This isn't exactly pretty as other block device open calls explicitly do not care about these errors, so add a new BLK_OPEN_STRICT_SCAN to opt into the error propagation. Fixes: `4601b4b130` ("block: reopen the device in blkdev_reread_part") Reported-by: Saranya Muruganandam <saranyamohan@google.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Link: https://lore.kernel.org/r/20240417144743.2277601-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-18 09:34:34 -06:00
Damien Le Moal	99a9476b27	block: Do not special-case plugging of zone write operations With the block layer zone write plugging being automatically done for any write operation to a zone of a zoned block device, a regular request plugging handled through current->plug can only ever see at most a single write request per zone. In such case, any potential reordering of the plugged requests will be harmless. We can thus remove the special casing for write operations to zones and have these requests plugged as well. This allows removing the function blk_mq_plug and instead directly using current->plug where needed. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240408014128.205141-29-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-17 08:44:03 -06:00
Damien Le Moal	97abee507b	block: Do not force select mq-deadline with CONFIG_BLK_DEV_ZONED Now that zone block device write ordering control does not depend anymore on mq-deadline and zone write locking, there is no need to force select the mq-deadline scheduler when CONFIG_BLK_DEV_ZONED is enabled. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240408014128.205141-28-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-17 08:44:03 -06:00
Damien Le Moal	02ccd7c360	block: Remove zone write locking Zone write locking is now unused and replaced with zone write plugging. Remove all code that was implementing zone write locking, that is, the various helper functions controlling request zone write locking and the gendisk attached zone bitmaps. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240408014128.205141-27-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-17 08:44:03 -06:00
Damien Le Moal	a98b05b02f	block: Replace zone_wlock debugfs entry with zone_wplugs entry In preparation to completely remove zone write locking, replace the "zone_wlock" mq-debugfs entry that was listing zones that are write-locked with the zone_wplugs entry which lists the zones that currently have a write plug allocated. The write plug information provided is: the zone number, the zone write plug flags, the zone write plug write pointer offset and the number of BIOs currently waiting for execution in the zone write plug BIO list. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240408014128.205141-26-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-17 08:44:03 -06:00
Damien Le Moal	d9f1439a30	block: Move zone related debugfs attribute to blk-zoned.c block/blk-mq-debugfs-zone.c contains a single debugfs attribute function. Defining this outside of block/blk-zoned.c does not really help in any way, so move this zone related debugfs attribute to block/blk-zoned.c and delete block/blk-mq-debugfs-zone.c. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240408014128.205141-25-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-17 08:44:03 -06:00
Damien Le Moal	bca150f0d4	block: Do not check zone type in blk_check_zone_append() Zone append operations are only allowed to target sequential write required zones. blk_check_zone_append() uses bio_zone_is_seq() to check this. However, this check is not necessary because: 1) For NVMe ZNS namespace devices, only sequential write required zones exist, making the zone type check useless. 2) For null_blk, the driver will fail the request anyway, thus notifying the user that a conventional zone was targeted. 3) For all other zoned devices, zone append is now emulated using zone write plugging, which checks that a zone append operation does not target a conventional zone. In preparation for the removal of zone write locking and its conventional zone bitmap (used by bio_zone_is_seq()), remove the bio_zone_is_seq() call from blk_check_zone_append(). Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240408014128.205141-24-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-17 08:44:03 -06:00
Damien Le Moal	e4eb37cc0f	block: Remove elevator required features The only elevator feature ever implemented is ELEVATOR_F_ZBD_SEQ_WRITE for signaling that a scheduler implements zone write locking to tightly control the dispatching order of write operations to zoned block devices. With the removal of zone write locking support in mq-deadline and the reliance of all block device drivers on the block layer zone write plugging to control ordering of write operations to zones, the elevator feature ELEVATOR_F_ZBD_SEQ_WRITE is completely unused. Remove it, and also remove the now unused code for filtering the possible schedulers for a block device based on required features. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240408014128.205141-23-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-17 08:44:03 -06:00
Damien Le Moal	fde02699c2	block: mq-deadline: Remove support for zone write locking With the block layer generic plugging of write operations for zoned block devices, mq-deadline, or any other scheduler, can only ever see at most one write operation per zone at any time. There are thus no sequentiality requirements for these writes and no need to tightly control the dispatching of write requests using zone write locking. Remove all the code that implements this control in the mq-deadline scheduler and remove advertizing support for the ELEVATOR_F_ZBD_SEQ_WRITE elevator feature. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240408014128.205141-22-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-17 08:44:03 -06:00
Damien Le Moal	9b3c08b90f	block: Simplify blk_revalidate_disk_zones() interface The only user of blk_revalidate_disk_zones() second argument was the SCSI disk driver (sd). Now that this driver does not require this update_driver_data argument, remove it to simplify the interface of blk_revalidate_disk_zones(). Also update the function kdoc comment to be more accurate (i.e. there is no gendisk ->revalidate method). Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240408014128.205141-21-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-17 08:44:03 -06:00
Damien Le Moal	63b5385e78	block: Remove BLK_STS_ZONE_RESOURCE The zone append emulation of the scsi disk driver was the only driver using BLK_STS_ZONE_RESOURCE. With this code removed, BLK_STS_ZONE_RESOURCE is now unused. Remove this macro definition and simplify blk_mq_dispatch_rq_list() where this status code was handled. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240408014128.205141-20-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-17 08:44:03 -06:00
Damien Le Moal	946dd71ed8	block: Allow BIO-based drivers to use blk_revalidate_disk_zones() In preparation for allowing BIO based device drivers to use zone write plugging and its zone append emulation, allow these drivers to call blk_revalidate_disk_zones() so that all zone resources necessary to zone write plugging can be initialized. To do so, remove the check in blk_revalidate_disk_zones() restricting the use of this function to mq request-based drivers to allow also BIO-based drivers to use it. This is safe to do as long as the BIO-based block device queue is already setup and usable, as it should, and can be safely frozen. The helper function disk_need_zone_resources() is added to control the allocation and initialization of the zone write plug hash table and of the conventional zone bitmap only for mq devices and for BIO-based devices that require zone append emulation. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240408014128.205141-12-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-17 08:44:03 -06:00
Damien Le Moal	9b1ce7f0c6	block: Implement zone append emulation Given that zone write plugging manages all writes to zones of a zoned block device and tracks the write pointer position of all zones that are not full nor empty, emulating zone append operations using regular writes can be implemented generically, without relying on the underlying device driver to implement such emulation. This is needed for devices that do not natively support the zone append command (e.g. SMR hard-disks). A device may request zone append emulation by setting its max_zone_append_sectors queue limit to 0. For such a device, the function blk_zone_wplug_prepare_bio() changes zone append BIOs into non-mergeable regular write BIOs. Modified zone append BIOs are flagged with the new BIO flag BIO_EMULATES_ZONE_APPEND. This flag is checked on completion of the BIO in blk_zone_write_plug_bio_endio() to restore the original REQ_OP_ZONE_APPEND operation code of the BIO. The block layer internal inline helper function bio_is_zone_append() is added to test if a BIO is either a native zone append operation (REQ_OP_ZONE_APPEND operation code) or if it is flagged with BIO_EMULATES_ZONE_APPEND. Given that both native and emulated zone append BIO completion handling should be similar, the functions blk_update_request() and blk_zone_complete_request_bio() are modified to use bio_is_zone_append() to execute blk_zone_update_request_bio() for both native and emulated zone append operations. This commit contains contributions from Christoph Hellwig <hch@lst.de>. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240408014128.205141-11-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-17 08:44:03 -06:00
Damien Le Moal	ccdbf0aad2	block: Allow zero value of max_zone_append_sectors queue limit In preparation for adding a generic zone append emulation using zone write plugging, allow device drivers supporting zoned block devices to set the max_zone_append_sectors queue limit of a device to 0 to indicate the lack of native support for zone append operations and that the block layer should emulate these operations using regular write operations. blk_queue_max_zone_append_sectors() is modified to allow passing 0 as the max_zone_append_sectors argument. The function queue_max_zone_append_sectors() is also modified to ensure that the minimum of the max_hw_sectors and chunk_sectors limit is used whenever the max_zone_append_sectors limit is 0. This minimum is consistent with the value set for the max_zone_append_sectors limit by the function blk_validate_zoned_limits() when limits for a queue are validated. The helper functions queue_emulates_zone_append() and bdev_emulates_zone_append() are added to test if a queue (or block device) emulates zone append operations. In order for blk_revalidate_disk_zones() to accept zoned block devices relying on zone append emulation, the direct check to the max_zone_append_sectors queue limit of the disk is replaced by a check using the value returned by queue_max_zone_append_sectors(). Similarly, queue_zone_append_max_show() is modified to use the same accessor so that the sysfs attribute advertises the non-zero limit that will be used, regardless if it is for native or emulated commands. For stacking drivers, a top device should not need to care if the underlying devices have native or emulated zone append operations. blk_stack_limits() is thus modified to set the top device max_zone_append_sectors limit using the new accessor queue_limits_max_zone_append_sectors(). queue_max_zone_append_sectors() is modified to use this function as well. Stacking drivers that require zone append emulation, e.g. dm-crypt, can still request this feature by calling blk_queue_max_zone_append_sectors() with a 0 limit. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240408014128.205141-10-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-17 08:44:03 -06:00
Damien Le Moal	843283e96e	block: Fake max open zones limit when there is no limit For a zoned block device that has no limit on the number of open zones and no limit on the number of active zones, the zone write plug mempool is created with a size of 128 zone write plugs. For such case, set the device max_open_zones queue limit to this value to indicate to the user the potential performance penalty that may happen when writing simultaneously to more zones than the mempool size. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240408014128.205141-9-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-17 08:44:03 -06:00
Damien Le Moal	dd291d77cc	block: Introduce zone write plugging Zone write plugging implements a per-zone "plug" for write operations to control the submission and execution order of write operations to sequential write required zones of a zoned block device. Per-zone plugging guarantees that at any time there is at most only one write request per zone being executed. This mechanism is intended to replace zone write locking which implements a similar per-zone write throttling at the scheduler level, but is implemented only by mq-deadline. Unlike zone write locking which operates on requests, zone write plugging operates on BIOs. A zone write plug is simply a BIO list that is atomically manipulated using a spinlock and a kblockd submission work. A write BIO to a zone is "plugged" to delay its execution if a write BIO for the same zone was already issued, that is, if a write request for the same zone is being executed. The next plugged BIO is unplugged and issued once the write request completes. This mechanism allows to: - Untangle zone write ordering from block IO schedulers. This allows removing the restriction on using mq-deadline for writing to zoned block devices. Any block IO scheduler, including "none" can be used. - Zone write plugging operates on BIOs instead of requests. Plugged BIOs waiting for execution thus do not hold scheduling tags and thus are not preventing other BIOs from executing (reads or writes to other zones). Depending on the workload, this can significantly improve the device use (higher queue depth operation) and performance. - Both blk-mq (request based) zoned devices and BIO-based zoned devices (e.g. device mapper) can use zone write plugging. It is mandatory for the former but optional for the latter. BIO-based drivers can use zone write plugging to implement write ordering guarantees, or the drivers can implement their own if needed. 
- The code is less invasive in the block layer and is mostly limited to blk-zoned.c with some small changes in blk-mq.c, blk-merge.c and bio.c. Zone write plugging is implemented using struct blk_zone_wplug. This structure includes a spinlock, a BIO list and a work structure to handle the submission of plugged BIOs. Zone write plug structures are managed using a per-disk hash table. Plugging of zone write BIOs is done using the function blk_zone_write_plug_bio() which returns false if a BIO execution does not need to be delayed and true otherwise. This function is called from blk_mq_submit_bio() after a BIO is split to avoid large BIOs spanning multiple zones which would cause mishandling of zone write plugs. This change enables zone write plugging by default for any mq request-based block device. BIO-based device drivers can also use zone write plugging by explicitly calling blk_zone_write_plug_bio() in their ->submit_bio method. For such devices, the driver must ensure that a BIO passed to blk_zone_write_plug_bio() is already split and not straddling zone boundaries. Only write and write zeroes BIOs are plugged. Zone write plugging does not introduce any significant overhead for other operations. A BIO that is being handled through zone write plugging is flagged using the new BIO flag BIO_ZONE_WRITE_PLUGGING. A request handling a BIO flagged with this new flag is flagged with the new RQF_ZONE_WRITE_PLUGGING flag. The completion of BIOs and requests flagged trigger respectively calls to the functions blk_zone_write_bio_endio() and blk_zone_write_complete_request(). The latter function is used to trigger submission of the next plugged BIO using the zone plug work. blk_zone_write_bio_endio() does the same for BIO-based devices. This ensures that at any time, at most one request (blk-mq devices) or one BIO (BIO-based devices) is being executed for any zone. 
The handling of zone write plugs using a per-zone plug spinlock maximizes parallelism and device usage by allowing multiple zones to be written simultaneously without lock contention. Zone write plugging ignores flush BIOs without data. However, any flush BIO that has data is always plugged so that the write part of the flush sequence is serialized with other regular writes. Given that any BIO handled through zone write plugging will be the only BIO in flight for the target zone when it is executed, the unplugging and submission of a BIO will have no chance of successfully merging with plugged requests or requests in the scheduler. To overcome this potential performance degradation, blk_mq_submit_bio() calls the function blk_zone_write_plug_attempt_merge() to try to merge other plugged BIOs with the one just unplugged and submitted. Successful merging is signaled using blk_zone_write_plug_bio_merged(), called from bio_attempt_back_merge(). Furthermore, to avoid recalculating the number of segments of plugged BIOs to attempt merging, the number of segments of a plugged BIO is saved using the new struct bio field __bi_nr_segments. To avoid growing the size of struct bio, this field is added as a union with the bio_cookie field. This is safe to do as polling is always disabled for plugged BIOs. When BIOs are plugged in a zone write plug, the device request queue usage counter is always incremented. This reference is kept and reused for blk-mq devices when the plugged BIO is unplugged and submitted again using submit_bio_noacct_nocheck(). For this case, the unplugged BIO is already flagged with BIO_ZONE_WRITE_PLUGGING and blk_mq_submit_bio() proceeds directly to allocating a new request for the BIO, re-using the usage reference count taken when the BIO was plugged. This extra reference count is dropped in blk_zone_write_plug_attempt_merge() for any plugged BIO that is successfully merged. 
Given that BIO-based devices will not take this path, the extra reference is dropped after a plugged BIO is unplugged and submitted. Zone write plugs are dynamically allocated and managed using a hash table (an array of struct hlist_head) with RCU protection. A zone write plug is allocated when a write BIO is received for the zone and not freed until the zone is fully written, reset or finished. To detect when a zone write plug can be freed, the write state of each zone is tracked using a write pointer offset which corresponds to the offset of a zone write pointer relative to the zone start. Write operations always increment this write pointer offset. Zone reset operations set it to 0 and zone finish operations set it to the zone size. If a write error happens, the wp_offset value of a zone write plug may become incorrect and out of sync with the device managed write pointer. This is handled using the zone write plug flag BLK_ZONE_WPLUG_ERROR. The function blk_zone_wplug_handle_error() is called from the new disk zone write plug work when this flag is set. This function executes a report zone to update the zone write pointer offset to the current value as indicated by the device. The disk zone write plug work is scheduled whenever a BIO flagged with BIO_ZONE_WRITE_PLUGGING completes with an error or when bio_zone_wplug_prepare_bio() detects an unaligned write. Once scheduled, the disk zone write plugs work keeps running until all zone errors are handled. To match the new data structures used for zoned disks, the function disk_free_zone_bitmaps() is renamed to the more generic disk_free_zone_resources(). The function disk_init_zone_resources() is also introduced to initialize zone write plugs resources when a gendisk is allocated. 
In order to guarantee that the user can simultaneously write up to a number of zones equal to a device max active zone limit or max open zone limit, zone write plugs are allocated using a mempool sized to the maximum of these 2 device limits. For a device that does not have active and open zone limits, 128 is used as the default mempool size. If a change to the device active and open zone limits is detected, the disk mempool is resized when blk_revalidate_disk_zones() is executed. This commit contains contributions from Christoph Hellwig <hch@lst.de>. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20240408014128.205141-8-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-17 08:44:03 -06:00
Damien Le Moal	ecfe43b11b	block: Remember zone capacity when revalidating zones In preparation for adding zone write plugging, modify blk_revalidate_disk_zones() to get the capacity of zones of a zoned block device. This capacity value as a number of 512B sectors is stored in the gendisk zone_capacity field. Given that host-managed SMR disks (including zoned UFS drives) and all known NVMe ZNS devices have the same zone capacity for all zones, blk_revalidate_disk_zones() returns an error if different capacities are detected for different zones. This also adds a check to verify that the values reported by the device for zone capacities are correct, that is, that the zone capacity is never 0, does not exceed the zone size and is equal to the zone size for conventional zones. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240408014128.205141-7-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-17 08:44:02 -06:00
Damien Le Moal	dd850ff3ee	block: Allow using bio_attempt_back_merge() internally Remove "static" from the definition of bio_attempt_back_merge() and declare this function in block/blk.h to allow using it internally from other block layer files. The definition of enum bio_merge_status is also moved to block/blk.h. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240408014128.205141-6-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-17 08:44:02 -06:00
Damien Le Moal	a0508c36ef	block: Introduce blk_zone_update_request_bio() On completion of a zone append request, the request sector indicates the location of the written data. This value must be returned to the user through the BIO iter sector. This is done in 2 places: in blk_complete_request() and in blk_update_request(). Introduce the inline helper function blk_zone_update_request_bio() to avoid duplicating this BIO update for zone append requests, and to compile out this helper call when CONFIG_BLK_DEV_ZONED is not enabled. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240408014128.205141-4-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-17 08:44:02 -06:00
Damien Le Moal	c0da26f950	block: Remove req_bio_endio() Moving req_bio_endio() code into its only caller, blk_update_request(), allows reducing accesses to and tests of bio and request fields. Also, given that partial completions of zone append operations is not possible and that zone append operations cannot be merged, the update of the BIO sector using the request sector for these operations can be moved directly before the call to bio_endio(). Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240408014128.205141-3-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-17 08:44:02 -06:00
Damien Le Moal	6f8fd758de	block: Restore sector of flush requests On completion of a flush sequence, blk_flush_restore_request() restores the bio of a request to the original submitted BIO. However, the last use of the request in the flush sequence may have been for a POSTFLUSH which does not have a sector. So make sure to restore the request sector using the iter sector of the original BIO. This BIO has not changed yet since the completions of the flush sequence intermediate steps use requeueing of the request until all steps are completed. Restoring the request sector ensures that blk_mq_end_request() will see a valid sector as originally set when the flush BIO was submitted. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240408014128.205141-2-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-17 08:44:02 -06:00
John Garry	de4c7bef9d	block: Call blkdev_dio_unaligned() from blkdev_direct_IO() blkdev_dio_unaligned() is called from __blkdev_direct_IO(), __blkdev_direct_IO_simple(), and __blkdev_direct_IO_async(), and all these are only called from blkdev_direct_IO(). Move the blkdev_dio_unaligned() call to the common callsite, blkdev_direct_IO(). Pass those functions the bdev pointer from blkdev_direct_IO(), as it is non-trivial to look up. Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: John Garry <john.g.garry@oracle.com> Link: https://lore.kernel.org/r/20240415122020.1541594-1-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-15 08:12:22 -06:00
Linus Torvalds	d7ad058156	block-6.9-20240412 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmYZWKYQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpl7VD/0VyVykpWz9ydtEQvlTLndMASi4U0RpSPPy mPX8ydz5k8IywiaAaKxCWK0hCSEXWceQKrbqJqQau9M33JD1Ss7x44BQR/XBZD5J vRV5PbnMpIFL1knGzf9xgXKAeEHI2Hi/VKuW3dMoXcXHABfHlbS2ulvpQ97S9oAo iCS3728sFOiqcKh2SSSMI04f0o+48F2mh3g8CAd45Kx/kwbEogaxc6U590wbnA+5 +TTVb+XRT0x55DCq0awtXDxLs4030IkcqJ+S5891m5Dfm8LUt0m0ekr8R9ks/sA+ +axbcLLw3M3B6QTsT8QCytpAyW5eNBdyGSj4BFl5AOGjUDjuassdzXFtKevItgEd kPsuYexNHERnntb+zEXfO7wVvyewy4Dby/WjoYi2hOhvkt73UrR9y93cp3Lq3iI4 G1+JNLNVkQo/bI0ZtyN5+q39odHcvhGvAr6zAygR+fqU6qOymRvQBpV/bnnMrv2j wxlGC5r7tGd7kEwTg/UBK+sVR8OyOm8IJAszK372KRORDRiYLlw7cF+4d5jpG9P5 frk6z6QrJPoH15iaOjZY62PKEEXiuJbQm2FJYMZvD5upR+UHVvg2xVdLMy/tSnaX zgNzlPFPKXXQ72IS8shJ9JE0XZWfZxirbUNIL+1Z2HUUj3t5SduBV/dWEeupPvuv 5Ml9Lg2uMA== =rCiN -----END PGP SIGNATURE----- Merge tag 'block-6.9-20240412' of git://git.kernel.dk/linux Pull block fixes from Jens Axboe: - MD pull request via Song: - UAF fix (Yu) - Avoid out-of-bounds shift in blk-iocost (Rik) - Fix for q->blkg_list corruption (Ming) - Relax virt boundary mask/size segment checking (Ming) * tag 'block-6.9-20240412' of git://git.kernel.dk/linux: block: fix that blk_time_get_ns() doesn't update time after schedule block: allow device to have both virt_boundary_mask and max segment size block: fix q->blkg_list corruption during disk rebind blk-iocost: avoid out of bounds shift raid1: fix use-after-free for original bio in raid1_write_request()	2024-04-12 10:22:33 -07:00
Yu Kuai	3ec4848913	block: fix that blk_time_get_ns() doesn't update time after schedule While monitoring the throttle time of IO from iocost, it's found that such time is always zero after the io_schedule() from ioc_rqos_throttle, for example, with the following debug patch: + printk("%s-%d: %s enter %llu\n", current->comm, current->pid, __func__, blk_time_get_ns()); while (true) { set_current_state(TASK_UNINTERRUPTIBLE); if (wait.committed) break; io_schedule(); } + printk("%s-%d: %s exit %llu\n", current->comm, current->pid, __func__, blk_time_get_ns()); It can be observed that blk_time_get_ns() always returns the same time: [ 1068.096579] fio-1268: ioc_rqos_throttle enter 1067901962288 [ 1068.272587] fio-1268: ioc_rqos_throttle exit 1067901962288 [ 1068.274389] fio-1268: ioc_rqos_throttle enter 1067901962288 [ 1068.472690] fio-1268: ioc_rqos_throttle exit 1067901962288 [ 1068.474485] fio-1268: ioc_rqos_throttle enter 1067901962288 [ 1068.672656] fio-1268: ioc_rqos_throttle exit 1067901962288 [ 1068.674451] fio-1268: ioc_rqos_throttle enter 1067901962288 [ 1068.872655] fio-1268: ioc_rqos_throttle exit 1067901962288 And I think the root cause is that 'PF_BLOCK_TS' is always cleared by blk_flush_plug() before schedule(), hence blk_plug_invalidate_ts() will never be called: blk_time_get_ns plug->cur_ktime = ktime_get_ns(); current->flags |= PF_BLOCK_TS; io_schedule: io_schedule_prepare blk_flush_plug __blk_flush_plug /* the flag is cleared, while time is not */ current->flags &= ~PF_BLOCK_TS; schedule sched_update_worker /* the flag is not set, hence plug->cur_ktime is not cleared */ if (tsk->flags & PF_BLOCK_TS) blk_plug_invalidate_ts() blk_time_get_ns /* got the time stashed before schedule */ return plug->cur_ktime; Fix the problem by clearing cached time in __blk_flush_plug(). 
Fixes: `06b23f92af` ("block: update cached timestamp post schedule/preemption") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240411032349.3051233-2-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-12 08:31:54 -06:00
Christoph Hellwig	ec84ca4025	scsi: block: Remove now unused queue limits helpers Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240409143748.980206-24-hch@lst.de Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>	2024-04-12 06:32:01 -04:00
Christoph Hellwig	4373d2ecca	scsi: bsg: Pass queue_limits to bsg_setup_queue() This allows bsg_setup_queue() to pass them to blk_mq_alloc_queue() and thus set up the limits at queue allocation time. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240409143748.980206-3-hch@lst.de Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>	2024-04-11 21:37:48 -04:00
Yu Kuai	9617cd6f24	block: fix module reference leakage from bdev_open_by_dev error path At the time bdev_may_open() is called, module reference is grabbed already, hence module reference should be released if bdev_may_open() failed. This problem is found by code review. Fixes: `ed5cc702d3` ("block: Add config option to not allow writing to mounted devices") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240406090930.2252838-22-yukuai1@huaweicloud.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-04-11 11:14:36 +02:00
Ming Lei	b561ea56a2	block: allow device to have both virt_boundary_mask and max segment size When one stacking device is over one device with virt_boundary_mask and another one with max segment size, the stacking device has both limits set. This was allowed before `d690cb8ae1` ("block: add an API to atomically update queue limits"). Relax the limit so that we won't break this kind of stacking setup. Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218687 Reported-by: janpieter.sollie@edpnet.be Fixes: `d690cb8ae1` ("block: add an API to atomically update queue limits") Link: https://lore.kernel.org/linux-block/ZfGl8HzUpiOxCLm3@fedora/ Cc: Christoph Hellwig <hch@lst.de> Cc: Mike Snitzer <snitzer@kernel.org> Cc: dm-devel@lists.linux.dev Cc: Song Liu <song@kernel.org> Cc: linux-raid@vger.kernel.org Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Link: https://lore.kernel.org/r/20240407131931.4055231-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-07 15:50:33 -06:00
Ming Lei	8b8ace0803	block: fix q->blkg_list corruption during disk rebind Multiple gendisk instances can be allocated/added for a single request queue in case of disk rebind. A blkg may still stay in q->blkg_list when calling blkcg_init_disk() for rebind, then q->blkg_list becomes corrupted. Fix the list corruption issue by: - add blkg_init_queue() to initialize q->blkg_list & q->blkcg_mutex only - move calling blkg_init_queue() into blk_alloc_queue() The list corruption should have started with commit `f1c006f1c6` ("blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()") which delays removing blkg from q->blkg_list into blkg_free_workfn(). Fixes: `f1c006f1c6` ("blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()") Fixes: `1059699f87` ("block: move blkcg initialization/destroy into disk allocation/release handler") Cc: Yu Kuai <yukuai3@huawei.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240407125910.4053377-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-07 15:50:13 -06:00
Christian Brauner	210a03c9d5	fs: claw back a few FMODE_* bits There's a bunch of flags that are purely based on what the file operations support while also never being conditionally set or unset. IOW, they're not subject to change for individual files. Imho, such flags don't need to live in f_mode they might as well live in the fops structs itself. And the fops struct already has that lonely mmap_supported_flags member. We might as well turn that into a generic fop_flags member and move a few flags from FMODE_* space into FOP_* space. That gets us four FMODE_* bits back and the ability for new static flags that are about file ops to not have to live in FMODE_* space but in their own FOP_* space. It's not the most beautiful thing ever but it gets the job done. Yes, there'll be an additional pointer chase but hopefully that won't matter for these flags. I suspect there's a few more we can move into there and that we can also redirect a bunch of new flag suggestions that follow this pattern into the fop_flags field instead of f_mode. Link: https://lore.kernel.org/r/20240328-gewendet-spargel-aa60a030ef74@brauner Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-04-07 13:49:02 +02:00
Rik van Riel	beaa51b360	blk-iocost: avoid out of bounds shift UBSAN catches undefined behavior in blk-iocost, where sometimes iocg->delay is shifted right by a number that is too large, resulting in undefined behavior on some architectures. [ 186.556576] ------------[ cut here ]------------ UBSAN: shift-out-of-bounds in block/blk-iocost.c:1366:23 shift exponent 64 is too large for 64-bit type 'u64' (aka 'unsigned long long') CPU: 16 PID: 0 Comm: swapper/16 Tainted: G S E N 6.9.0-0_fbk700_debug_rc2_kbuilder_0_gc85af715cac0 #1 Hardware name: Quanta Twin Lakes MP/Twin Lakes Passive MP, BIOS F09_3A23 12/08/2020 Call Trace: <IRQ> dump_stack_lvl+0x8f/0xe0 __ubsan_handle_shift_out_of_bounds+0x22c/0x280 iocg_kick_delay+0x30b/0x310 ioc_timer_fn+0x2fb/0x1f80 __run_timer_base+0x1b6/0x250 ... Avoid that undefined behavior by simply taking the "delay = 0" branch if the shift is too large. I am not sure what the symptoms of an undefined value delay will be, but I suspect it could be more than a little annoying to debug. Signed-off-by: Rik van Riel <riel@surriel.com> Cc: Tejun Heo <tj@kernel.org> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Jens Axboe <axboe@kernel.dk> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20240404123253.0f58010f@imladris.surriel.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-05 20:07:40 -06:00
Linus Torvalds	8a05ef7087	block-6.9-20240405 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmYQT2oQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpt8dEADbrvjvMvjTSfskku0sof/Yv+0RkQfleRjD 9nch6bcYHmnbgSNpKsf62gDKmGWWLfjiWaxzBy2u6ZJ+m/Yg7QWSqPZqM15Ayy05 SsgJtb6N7AgTEOy3fpNLwLaQpSp0Mtx3lGPNJpahJmL9Wl+ZKl8EKoBL1GrvJpc7 DCPenrbEtrXb+uunm8AnyDHgYhVmRx6S3K41JeINTC7ZiG5hc01xkh5DXNCXMF9I c+0asZDsADltbh6jA3tud12pnhdJFpSkHM3nsnFWB0rNKsXKRRSSj/Eexbq5+tYU 38GQgwDtl8bwvCxmYRLj1PISrOROBiKC0or3wCTWW/3PInj4BS3Qry3j7r7HpYrJ 4uy8REgHp2inZACZToBaRoZK2wrJeCHJDogZag3VAuthIsetRqb+uj+qGd+yQK/G XEIy2d9KwFC1mqXeUKy0jZVS2IfE4YQ8ZRB76ZP8wCK3a9mrfAv/WmONeZu9NFGs qvvpCoNJLDRT2WoygsbeXTmSRxGX2FK9F7VIfKFzpD6/JGY58S/N7QznMOVZKmBe Gnb7c+7tCVpCEpcRInN3UrUawKVWX/0x5YMcxi6vCkrf/asIuqqSkxL48vfPQ1b+ r5rEsnXsMzbr6o7zbbowIFFZzdFbIutKHFmN8mPAThKf5JmX+Ccm7g3YMxVUqVut dOmIJZeclg== =ns3r -----END PGP SIGNATURE----- Merge tag 'block-6.9-20240405' of git://git.kernel.dk/linux Pull block fixes from Jens Axboe: - NVMe pull request via Keith: - Atomic queue limits fixes (Christoph) - Fabrics fixes (Hannes, Daniel) - Discard overflow fix (Li) - Cleanup fix for null_blk (Damien) * tag 'block-6.9-20240405' of git://git.kernel.dk/linux: nvme-fc: rename free_ctrl callback to match name pattern nvmet-fc: move RCU read lock to nvmet_fc_assoc_exists nvmet: implement unique discovery NQN nvme: don't create a multipath node for zero capacity devices nvme: split nvme_update_zone_info nvme-multipath: don't inherit LBA-related fields for the multipath node block: fix overflow in blk_ioctl_discard() nullblk: Fix cleanup order in null_add_dev() error path	2024-04-05 17:04:11 -07:00
Linus Torvalds	fae0268777	vfs-6.9-rc3.fixes -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZg/C8wAKCRCRxhvAZXjc oljxAQCneq62ginESgeQLw88fzSBTV4C50xXUA+Qz18AEgA/fgD+J3DlWquEHhMM tJmfs3aUn9w7+wDpukcsLjJfJEiSYA8= =f2Z6 -----END PGP SIGNATURE----- Merge tag 'vfs-6.9-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: "This contains a few small fixes. This comes with some delay because I wanted to wait on people running their reproducers and the Easter Holidays meant that those replies came in a little later than usual: - Fix handling of preventing writes to mounted block devices. Since last kernel we allow to prevent writing to mounted block devices provided CONFIG_BLK_DEV_WRITE_MOUNTED isn't set and the block device is opened with restricted writes. When we switched to opening block devices as files we altered the mechanism by which we recognize when a block device has been opened with write restrictions. The detection logic assumed that only read-write mounted filesystems would apply write restrictions to their block devices from other openers. That of course is not true since it also makes sense to apply write restrictions for filesystems that are read-only. Fix the detection logic using an FMODE_* bit. We still have a few left since we freed up a couple a while ago. I also picked up a patch to free up four additional FMODE_* bits scheduled for the next merge window. - Fix counting the number of writers to a block device. This just changes the logic to be consistent. - Fix a bug in aio causing a NULL pointer dereference after we implemented batched processing in aio. - Finally, add the changes we discussed that allows to yield block devices early even though file closing itself is deferred.
This also allows us to remove two holder operations to get and release the holder to align lifetime of file and holder of the block device" * tag 'vfs-6.9-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: aio: Fix null ptr deref in aio_complete() wakeup fs,block: yield devices early block: count BLK_OPEN_RESTRICT_WRITES openers block: handle BLK_OPEN_RESTRICT_WRITES correctly	2024-04-05 09:47:26 -07:00
Kefeng Wang	688c8b9208	blk-cgroup: use group allocation/free of per-cpu counters API Use group allocation/free of per-cpu counters api to accelerate blkg_rwstat_init/exit() and simplify code. Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Link: https://lore.kernel.org/r/20240325035955.50019-1-wangkefeng.wang@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-03 09:10:17 -06:00
Li Nan	22d24a544b	block: fix overflow in blk_ioctl_discard() There is no check for overflow of 'start + len' in blk_ioctl_discard(). A hung task occurs if a discard ioctl is submitted with the following parameters: start = 0x80000000000ff000, len = 0x8000000000fff000; Add the overflow validation now. Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240329012319.2034550-1-linan666@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-02 07:43:24 -06:00
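To see why the parameters quoted above slip past a naive range check, here is a minimal sketch. It is not the code in blk_ioctl_discard(): `discard_range_ok` and its `dev_size` parameter are hypothetical names for illustration. With the quoted values, `start + len` wraps around to 0x10fe000, which looks like a small, valid end offset unless the wrap is detected first.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical range check (illustration only): accept the byte range
 * [start, start + len) only if the sum does not wrap around and the
 * range ends within a device of dev_size bytes.
 */
static bool discard_range_ok(uint64_t start, uint64_t len, uint64_t dev_size)
{
	if (len > UINT64_MAX - start)	/* start + len would overflow */
		return false;
	return start + len <= dev_size;
}
```

Testing `len > UINT64_MAX - start` before adding avoids relying on the wrapped value at all.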
Christoph Hellwig	7a324d8389	blk-cgroup: use bio_list_merge_init Use bio_list_merge_init instead of open coding bio_list_merge and bio_list_init. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20240328084147.2954434-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-01 11:53:37 -06:00
John Garry	d3a3a086ad	blk-throttle: Only use seq_printf() in tg_prfill_limit() Currently tg_prfill_limit() uses a combination of snprintf() and strcpy() to generate the values parts of the limits string, before passing them as arguments to seq_printf(). Convert to use only a sequence of seq_printf() calls per argument, which is simpler. Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240327094020.3505514-1-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-01 11:53:37 -06:00
Ming Lei	a46c27026d	blk-mq: don't schedule block kworker on isolated CPUs The kernel parameters `isolcpus=` and 'nohz_full=' are used to isolate CPUs for specific tasks, and block IO is not expected to disturb these CPUs. The blk-mq kworker shouldn't be scheduled on isolated CPUs. Also, if the blk-mq kworker runs on isolated CPUs, long block IO latency can result. The kernel workqueue only respects CPU isolation for WQ_UNBOUND; for a bound WQ, the responsibility is on the user because the CPU is specified as a WQ API parameter, such as mod_delayed_work_on(cpu), queue_delayed_work_on(cpu) and queue_work_on(cpu). So avoid running the blk-mq kworker on isolated CPUs by removing isolated CPUs from hctx->cpumask. Meanwhile, use the queue map instead of hctx->cpumask to check whether all CPUs in this hw queue are offline; this avoids any cost in the fast IO code path and is safe since hctx->cpumask is only used in these two cases. Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Andrew Theurer <atheurer@redhat.com> Cc: Joe Mario <jmario@redhat.com> Cc: Sebastian Jug <sejug@redhat.com> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Bart Van Assche <bvanassche@acm.org> Cc: Tejun Heo <tj@kernel.org> Tested-by: Joe Mario <jmario@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Ewan D. Milne <emilne@redhat.com> Link: https://lore.kernel.org/r/20240322021244.1056223-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-04-01 11:53:36 -06:00
Linus Torvalds	033e8088a4	block-6.9-20240329 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmYG3agQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpvX+EADYziHzPZjy4Ik30nRhc2H3J9GvI0YGSPhB srRWO/ElJa3CvEXmDW7gCdHoc4vKnryyC/rXkTCrTmffzYp1BBKZb9kApUvFo1xc ax2Pww1hFTIDf5YsTCgg9wR+953lCwNyolfND+nQj/2LLOmfbypqSCkG7bDfwuhA vIiwxbgBZM4yi/354xoIUTEimbSHuRzyLXyvCZo5nBxiEFTBXIQMY8UgIXRiNb7v zi0LRxcbtfkcUcxs1seyE3Lke8P+gkx0SPo7r9LLRTNPuJ/fwIfqHvYPDTAj2toj P71kELJLdDNLivmxX5kbC4EqGueo9L6aaKkYQHD4RRlM98LtxuhNHdzYt2YoXVDk 2gg58VNZh7aNzpPlqa4FDf2Sjp0M0k3G/LX9IpmySTL0VVLYvQWhr1qyji2O1yAj m4W6RK1mYE98rt66cxqKKtrWYl6oJj3J0P/KcfPNe6nIdYYgefQxJwh+B4LMSfrr sgDUXxYIwfsbKuSeagIXEWw8FMlFO3nfSOu6BIGdRRcwNl/fJzrVODYoMrAq7+sP mGRnhYAz4HKfWImFFsIla+A6gJEFVGrftqeWLbeB/mjvV/kPhZT+5n+Wv8zS3bjv 9/upgnnaIoRZo6/RA/ev/HXR3YEygZKSHK162lTzGmhw85BSxmFMq/WvB99U3m17 Qeh+J+yviw== =zEIA -----END PGP SIGNATURE----- Merge tag 'block-6.9-20240329' of git://git.kernel.dk/linux Pull block fixes from Jens Axboe: "Small round of minor fixes or cleanups for the 6.9-rc2 kernel, one fixing an issue introduced in 6.8" * tag 'block-6.9-20240329' of git://git.kernel.dk/linux: block: Do not force full zone append completion in req_bio_endio() block: don't reject too large max_user_sectors in blk_validate_limits block: Make blk_rq_set_mixed_merge() static	2024-03-29 09:40:22 -07:00
Damien Le Moal	55251fbdf0	block: Do not force full zone append completion in req_bio_endio() This reverts commit `748dc0b65e`. Partial zone append completions cannot be supported as there are no guarantees that the fragmented data will be written sequentially in the same manner as with a full command. Commit `748dc0b65e` ("block: fix partial zone append completion handling in req_bio_endio()") changed req_bio_endio() to always advance a partially failed BIO by its full length, but this can lead to incorrect accounting. So revert this change and let low level device drivers handle this case by always failing zone append operations completely. With this revert, users will still see an IO error for a partially completed zone append BIO. Fixes: `748dc0b65e` ("block: fix partial zone append completion handling in req_bio_endio()") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240328004409.594888-2-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-03-28 17:04:48 -06:00
Christian Brauner	22650a9982	fs,block: yield devices early Currently a device is only really released once the umount returns to userspace due to how file closing works. That ultimately could cause an old umount assumption to be violated that concurrent umount and mount don't fail. So an exclusively held device with a temporary holder should be yielded before the filesystem is gone. Add a helper that allows callers to do that. This also allows us to remove the two holder ops that Linus wasn't excited about. Link: https://lore.kernel.org/r/20240326-vfs-bdev-end_holder-v1-1-20af85202918@kernel.org Fixes: `f3a608827d` ("bdev: open block device as files") # mainline only Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-03-27 13:17:15 +01:00
Christian Brauner	3ff56e285d	block: count BLK_OPEN_RESTRICT_WRITES openers The original changes in v6.8 do allow for a block device to be reopened with BLK_OPEN_RESTRICT_WRITES provided the same holder is used as per bdev_may_open(). I think this has a bug. The first opener @f1 of that block device will set bdev->bd_writers to -1. The second opener @f2 using the same holder will pass the check in bdev_may_open() that bdev->bd_writers must not be greater than zero. The first opener @f1 now closes the block device and in bdev_release() will end up calling bdev_yield_write_access() which calls bdev_writes_blocked() and sets bdev->bd_writers to 0 again. Now @f2 holds a file to that block device which was opened with exclusive write access but bdev->bd_writers has been reset to 0. So now @f3 comes along and succeeds in opening the block device with BLK_OPEN_WRITE betraying @f2's request to have exclusive write access. This isn't a practical issue yet because afaict there's no codepath inside the kernel that reopens the same block device with BLK_OPEN_RESTRICT_WRITES but it will be if there is. Fix this by counting the number of BLK_OPEN_RESTRICT_WRITES openers. So we only allow writes again once all BLK_OPEN_RESTRICT_WRITES openers are done. Link: https://lore.kernel.org/r/20240323-abtauchen-klauen-c2953810082d@brauner Fixes: `ed5cc702d3` ("block: Add config option to not allow writing to mounted devices") Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-03-27 12:59:25 +01:00
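The @f1/@f2/@f3 scenario above can be modeled with a single signed counter. This is a sketch under assumptions, not the kernel implementation: `struct bdev_sketch` and the helper names are hypothetical, a positive count stands for ordinary writers, and a negative count stands for the number of write-restricting openers.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the bd_writers accounting (illustration only). */
struct bdev_sketch {
	int writers;	/* > 0: writers, < 0: restricting openers */
};

/* Open with write restriction: refused while ordinary writers exist. */
static bool open_restrict(struct bdev_sketch *b)
{
	if (b->writers > 0)
		return false;
	b->writers--;	/* count this opener instead of storing a fixed -1 */
	return true;
}

/* Close a restricting opener: undo exactly one count. */
static void close_restrict(struct bdev_sketch *b)
{
	b->writers++;
}

/* Open for writing: refused while any restricting opener remains. */
static bool open_write(struct bdev_sketch *b)
{
	if (b->writers < 0)
		return false;
	b->writers++;
	return true;
}
```

Because each restricting opener contributes its own decrement, closing @f1 leaves the counter at -1 while @f2 is still open, so a later write open keeps failing until @f2 closes as well.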
Christian Brauner	ddd65e19c6	block: handle BLK_OPEN_RESTRICT_WRITES correctly Last kernel release we introduced CONFIG_BLK_DEV_WRITE_MOUNTED. By default this option is set. When it is set the long-standing behavior of being able to write to mounted block devices is enabled. But in order to guard against unintended corruption by writing to the block device buffer cache CONFIG_BLK_DEV_WRITE_MOUNTED can be turned off. In that case it isn't possible to write to mounted block devices anymore. A filesystem may open its block devices with BLK_OPEN_RESTRICT_WRITES which disallows concurrent BLK_OPEN_WRITE access. When we still had the bdev handle around we could recognize BLK_OPEN_RESTRICT_WRITES because the mode was passed around. Since we managed to get rid of the bdev handle we changed that logic to recognize BLK_OPEN_RESTRICT_WRITES based on whether the file was opened writable and writes to that block device are blocked. That logic doesn't work because we do allow BLK_OPEN_RESTRICT_WRITES to be specified without BLK_OPEN_WRITE. Fix the detection logic and use an FMODE_* bit. We could've also abused O_EXCL as an indicator that BLK_OPEN_RESTRICT_WRITES has been requested. For userspace open paths O_EXCL will never be retained but for internal opens where we open files that are never installed into a file descriptor table this is fine. But it would be a gamble that this doesn't cause bugs. Note that BLK_OPEN_RESTRICT_WRITES is an internal only flag that cannot directly be raised by userspace. It is implicitly raised during mounting. Passes xfstests and blktests with CONFIG_BLK_DEV_WRITE_MOUNTED set and unset.
Link: https://lore.kernel.org/r/ZfyyEwu9Uq5Pgb94@casper.infradead.org Link: https://lore.kernel.org/r/20240323-zielbereich-mittragen-6fdf14876c3e@brauner Fixes: `321de651fa` ("block: don't rely on BLK_OPEN_RESTRICT_WRITES when yielding write access") Reviewed-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reported-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-03-27 09:31:41 +01:00
Christoph Hellwig	038105a200	block: don't reject too large max_user_sectors in blk_validate_limits We already cap down the actual max_sectors to the max of the hardware and user limit, so don't reject the configuration. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20240326060745.2349154-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-03-26 11:28:52 -06:00
John Garry	dc53d9eac1	block: Make blk_rq_set_mixed_merge() static Since commit `8e756373d7` ("block: Move bio merge related functions into blk-merge.c"), blk_rq_set_mixed_merge() has only been referenced in blk-merge.c, so make it static. Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240325083501.2816408-1-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-03-26 11:28:20 -06:00
Linus Torvalds	0a7b0acece	vfs-6.9-rc1.fixes -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZfglxgAKCRCRxhvAZXjc ovK9APsF7/TMFhNbtW+JsghSyrEk0cOVPizi8JkRDDWNW3qY+wEAxtydhbmWpbKq MpIjMHqwjPx3zXBL8Ec/b4vAoJqpJwQ= =NgvO -----END PGP SIGNATURE----- Merge tag 'vfs-6.9-rc1.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: "This contains a few small fixes for this merge window: - Undo the hiding of silly-rename files in afs. If they're hidden they can't be deleted by rm manually anymore causing regressions - Avoid caching the preferred address for an afs server to avoid accidentally overriding an explicitly specified preferred server address - Fix bad stat() and rmdir() interaction in afs - Take a passive reference on the superblock when opening a block device so the holder is available to concurrent callers from the block layer - Clear private data pointer in fscache_begin_operation() to avoid it being falsely treated as valid" * tag 'vfs-6.9-rc1.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fscache: Fix error handling in fscache_begin_operation() fs,block: get holder during claim afs: Fix occasional rmdir-then-VNOVNODE with generic/011 afs: Don't cache preferred address afs: Revert "afs: Hide silly-rename files from userspace"	2024-03-18 09:15:50 -07:00
Christian Brauner	59a55a63c2	fs,block: get holder during claim Now that we open block devices as files we need to deal with the realities that closing is a deferred operation. An operation on the block device such as e.g., freeze, thaw, or removal that runs concurrently with umount, tries to acquire a stable reference on the holder. The holder might already be gone though. Make that reliable by grabbing a passive reference to the holder during bdev_open() and releasing it during bdev_release(). Fixes: `f3a608827d` ("bdev: open block device as files") # mainline only Reported-by: Christoph Hellwig <hch@infradead.org> Link: https://lore.kernel.org/r/ZfEQQ9jZZVes0WCZ@infradead.org Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@infradead.org> Tested-by: Yi Zhang <yi.zhang@redhat.com> Reported-by: https://lore.kernel.org/r/CAHj4cs8tbDwKRwfS1=DmooP73ysM__xAb2PQc6XsAmWR+VuYmg@mail.gmail.com Link: https://lore.kernel.org/r/20240315-freibad-annehmbar-ca68c375af91@brauner Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-03-18 10:32:44 +01:00
Jiapeng Chong	4c4ab8ae41	block: fix mismatched kerneldoc function name No functional modification involved. block/blk-settings.c:281: warning: expecting prototype for queue_limits_commit_set(). Prototype was for queue_limits_set() instead. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=8539 Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Link: https://lore.kernel.org/r/20240314025615.71269-1-jiapeng.chong@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-03-14 09:40:47 -06:00
Christoph Hellwig	bf5e3a30f7	Revert "blk-lib: check for kill signal" This reverts commit `8a08c5fd89`. It turns out while this is a perfectly valid and long overdue thing to do for user initiated discards / zeroing from the ioctl handler, it actually breaks file system use of the discard helper by interrupting in places the file system doesn't expect, and by leaving the bio chain in a state that the file system callers of (at least) __blkdev_issue_discard do not expect. Revert the change for now, we'll redo it for the next merge window after refactoring the code to better split the file system vs ioctl callers and cleaning up a few other loose ends. Reported-by: Chandan Babu R <chandanbabu@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20240314021623.1908895-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-03-13 20:35:48 -06:00
Bart Van Assche	256aab46e3	Revert "block/mq-deadline: use correct way to throttling write requests" The code "max(1U, 3 * (1U << shift) / 4)" comes from the Kyber I/O scheduler. The Kyber I/O scheduler maintains one internal queue per hwq and hence derives its async_depth from the number of hwq tags. Using this approach for the mq-deadline scheduler is wrong since the mq-deadline scheduler maintains one internal queue for all hwqs combined. Hence this revert. Cc: stable@vger.kernel.org Cc: Damien Le Moal <dlemoal@kernel.org> Cc: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com> Cc: Zhiguo Niu <Zhiguo.Niu@unisoc.com> Fixes: `d47f9717e5` ("block/mq-deadline: use correct way to throttling write requests") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20240313214218.1736147-1-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-03-13 15:56:14 -06:00
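For reference, the expression quoted in the revert above evaluates to three quarters of 2^shift, with a floor of 1. The sketch below only demonstrates that arithmetic; `three_quarter_depth` is a hypothetical name, and the code assumes `shift` is small enough that `3 * (1U << shift)` does not overflow an unsigned int.

```c
/*
 * Illustration of "max(1U, 3 * (1U << shift) / 4)": 75% of 2^shift,
 * never less than 1. Assumes shift is small enough that the
 * multiplication does not overflow (hypothetical helper, not the
 * scheduler's code).
 */
static unsigned int three_quarter_depth(unsigned int shift)
{
	unsigned int depth = 3 * (1U << shift) / 4;

	return depth > 1U ? depth : 1U;
}
```

The value depends only on a per-hardware-queue tag count, which is why the commit message argues it fits a scheduler with one internal queue per hwq and not one that keeps a single queue for all hwqs combined.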
Jens Axboe	b874d4aae5	block: limit block time caching to in_task() context We should not have any callers of this from non-task context, but Jakub ran [1] into one from blk-iocost. Rather than risk running into others, or future ones, just limit blk_time_get_ns() to when it is called from a task. Any other usage is invalid. [1] https://lore.kernel.org/lkml/CAHk-=wiOaBLqarS2uFhM1YdwOvCX4CZaWkeyNDY1zONpbYw2ig@mail.gmail.com/ Fixes: `da4c8c3d09` ("block: cache current nsec time in struct blk_plug") Reported-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-03-13 14:12:53 -06:00
Linus Torvalds	bff4b74625	Revert "dm: use queue_limits_set" This reverts commit `8e0ef41286`. It's broken, and causes the boot to fail on encrypted volumes. Reported-and-bisected-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lore.kernel.org/all/20240311235023.GA1205@cmpxchg.org/ Acked-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2024-03-11 17:11:28 -07:00
Linus Torvalds	1ddeeb2a05	for-6.9/block-20240310 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmXuFO4QHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpq33D/9hyNyBce2A9iyo026eK8EqLDoed6BPzuvB kLKj5tsGvX4YlfuswvP86M5dgibTASXclnfUK394TijW/JPOfJ3mNhi9gMnHzRoK ZaR1di0Lum56dY1FkpMmWiGmE4fB79PAtXYKtajOkuoIcNzylncEAAACUY4/Ouhg Cm+LMg2prcc+m9g8rKDNQ51pUFg4U21KAUTl35XLMUAaQk1ahW3EDEVYhweC/zwE V/5hJsv8UY72+oQGY2Dc/YgQk/Zj4ZDh7C+oHR9XeB/ro99kr3/Vopagu0gBMLZi Rq6qqz6PVMhVcuz8uN2rsTQKXmXhsBn9/adsl4AKtdxcW5D5moWb5BLq1P0WQylc nzMxa1d6cVcTKZpaUQQv3Rj6ZMrLuDwP277UYHfn5x1oPWYRZCG7FtHuOo1gNcpG DrSNwVG6BSDcbABqI+MIS2oD1JoUMyevjwT7e2hOXukZhc6GLO5F3ODWE5j3KnCR S/aGSAmcdR4fTcgavULqWdQVt7SYl4f1IxT8KrUirJGVhc2LgahaWj69ooklVHoU fPDFRiruwJ5YkH4RWCSDm9mi4kAz6eUf+f4yE06wZOFOb2fT8/1ZK2Snpz2KeXuZ INO0RejtFzT8L0OUlu7dBmF20y6rgAYt87lR8mIt71yuuATIrVhzlX1VdsvhdrAo VLHGV1Ncgw== =WlVL -----END PGP SIGNATURE----- Merge tag 'for-6.9/block-20240310' of git://git.kernel.dk/linux Pull block updates from Jens Axboe: - MD pull requests via Song: - Cleanup redundant checks (Yu Kuai) - Remove deprecated headers (Marc Zyngier, Song Liu) - Concurrency fixes (Li Lingfeng) - Memory leak fix (Li Nan) - Refactor raid1 read_balance (Yu Kuai, Paul Luse) - Clean up and fix for md_ioctl (Li Nan) - Other small fixes (Gui-Dong Han, Heming Zhao) - MD atomic limits (Christoph) - NVMe pull request via Keith: - RDMA target enhancements (Max) - Fabrics fixes (Max, Guixin, Hannes) - Atomic queue_limits usage (Christoph) - Const use for class_register (Ricardo) - Identification error handling fixes (Shin'ichiro, Keith) - Improvement and cleanup for cached request handling (Christoph) - Moving towards atomic queue limits. 
Core changes and driver bits so far (Christoph) - Fix UAF issues in aoeblk (Chun-Yi) - Zoned fix and cleanups (Damien) - s390 dasd cleanups and fixes (Jan, Miroslav) - Block issue timestamp caching (me) - noio scope guarding for zoned IO (Johannes) - block/nvme PI improvements (Kanchan) - Ability to terminate long running discard loop (Keith) - bdev revalidation fix (Li) - Get rid of old nr_queues hack for kdump kernels (Ming) - Support for async deletion of ublk (Ming) - Improve IRQ bio recycling (Pavel) - Factor in CPU capacity for remote vs local completion (Qais) - Add shared_tags configfs entry for null_blk (Shin'ichiro) - Fix for a regression in page refcounts introduced by the folio unification (Tony) - Misc fixes and cleanups (Arnd, Colin, John, Kunwu, Li, Navid, Ricardo, Roman, Tang, Uwe) * tag 'for-6.9/block-20240310' of git://git.kernel.dk/linux: (221 commits) block: partitions: only define function mac_fix_string for CONFIG_PPC_PMAC block/swim: Convert to platform remove callback returning void cdrom: gdrom: Convert to platform remove callback returning void block: remove disk_stack_limits md: remove mddev->queue md: don't initialize queue limits md/raid10: use the atomic queue limit update APIs md/raid5: use the atomic queue limit update APIs md/raid1: use the atomic queue limit update APIs md/raid0: use the atomic queue limit update APIs md: add queue limit helpers md: add a mddev_is_dm helper md: add a mddev_add_trace_msg helper md: add a mddev_trace_remap helper bcache: move calculation of stripe_size and io_opt into bcache_device_init virtio_blk: Do not use disk_set_max_open/active_zones() aoe: fix the potential use-after-free problem in aoecmd_cfg_pkts block: move capacity validation to blkpg_do_ioctl() block: prevent division by zero in blk_rq_stat_sum() drbd: atomically update queue limits in drbd_reconsider_queue_parameters ...	2024-03-11 11:43:44 -07:00
Linus Torvalds	910202f00a	vfs-6.9.super -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZem4DwAKCRCRxhvAZXjc ooTRAQDRI6Qz6wJym5Yblta8BScMGbt/SgrdgkoCvT6y83MtqwD+Nv/AZQzi3A3l 9NdULtniW1reuCYkc8R7dYM8S+yAwAc= =Y1qX -----END PGP SIGNATURE----- Merge tag 'vfs-6.9.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull block handle updates from Christian Brauner: "Last cycle we changed opening of block devices, and opening a block device would return a bdev_handle. This allowed us to implement support for restricting and forbidding writes to mounted block devices. It was accompanied by converting and adding helpers to operate on bdev_handles instead of plain block devices. That was already a good step forward but ultimately it isn't necessary to have special purpose helpers for opening block devices internally that return a bdev_handle. Fundamentally, opening a block device internally should just be equivalent to opening files. So now all internal opens of block devices return files just as a userspace open would. Instead of introducing a separate indirection into bdev_open_by_() via struct bdev_handle bdev_file_open_by_() is made to just return a struct file. Opening and closing a block device just becomes equivalent to opening and closing a file. This all works well because internally we already have a pseudo fs for block devices and so opening block devices is simple. There's a few places where we needed to be careful such as during boot when the kernel is supposed to mount the rootfs directly without init doing it. Here we need to take care to ensure that we flush out any asynchronous file close. That's what we already do for opening, unpacking, and closing the initramfs. So nothing new here. The equivalence of opening and closing block devices to regular files is a win in and of itself. But it also has various other advantages. We can remove struct bdev_handle completely. 
Various low-level helpers are now private to the block layer. Other helpers were simply removable completely. A follow-up series that is already reviewed build on this and makes it possible to remove bdev->bd_inode and allows various clean ups of the buffer head code as well. All places where we stashed a bdev_handle now just stash a file and use simple accessors to get to the actual block device which was already the case for bdev_handle" * tag 'vfs-6.9.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (35 commits) block: remove bdev_handle completely block: don't rely on BLK_OPEN_RESTRICT_WRITES when yielding write access bdev: remove bdev pointer from struct bdev_handle bdev: make struct bdev_handle private to the block layer bdev: make bdev_{release, open_by_dev}() private to block layer bdev: remove bdev_open_by_path() reiserfs: port block device access to file ocfs2: port block device access to file nfs: port block device access to files jfs: port block device access to file f2fs: port block device access to files ext4: port block device access to file erofs: port device access to file btrfs: port device access to file bcachefs: port block device access to file target: port block device access to file s390: port block device access to file nvme: port block device access to file block2mtd: port device access to files bcache: port block device access to files ...	2024-03-11 10:52:34 -07:00
Linus Torvalds	54126fafea	vfs-6.9.iomap -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZem4UQAKCRCRxhvAZXjc ouERAQDg63R9s3bKmUgGqngf9cfr//VCTE+WVARwOUTdn2iDbwEA1IME7X1kL/Vz EdhEjyqO6xom+ao/Vqxe0XIDNz70vgs= =8RdE -----END PGP SIGNATURE----- Merge tag 'vfs-6.9.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull iomap updates from Christian Brauner: - Restore read-write hints in struct bio through the bi_write_hint member for the sake of UFS devices in mobile applications. This can result in up to 40% lower write amplification in UFS devices. The patch series that builds on this will be coming in via the SCSI maintainers (Bart) - Overhaul the iomap writeback code. Afterwards ->map_blocks() is able to map multiple blocks at once as long as they're in the same folio. This reduces CPU usage for buffered write workloads on e.g., xfs on systems with lots of cores (Christoph) - Record processed bytes in iomap_iter() trace event (Kassey) - Extend iomap_writepage_map() trace event after Christoph's ->map_block() changes to map multiple blocks at once (Zhang) * tag 'vfs-6.9.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (22 commits) iomap: Add processed for iomap_iter iomap: add pos and dirty_len into trace_iomap_writepage_map block, fs: Restore the per-bio/request data lifetime fields fs: Propagate write hints to the struct block_device inode fs: Move enum rw_hint into a new header file fs: Split fcntl_rw_hint() fs: Verify write lifetime constants at compile time fs: Fix rw_hint validation iomap: pass the length of the dirty region to ->map_blocks iomap: map multiple blocks at a time iomap: submit ioends immediately iomap: factor out a iomap_writepage_map_block helper iomap: only call mapping_set_error once for each failed bio iomap: don't chain bios iomap: move the iomap_sector sector calculation out of iomap_add_to_ioend iomap: clean up the iomap_alloc_ioend calling convention iomap: move all remaining per-folio 
logic into iomap_writepage_map iomap: factor out a iomap_writepage_handle_eof helper iomap: move the PF_MEMALLOC check to iomap_writepages iomap: move the io_folios field out of struct iomap_ioend ...	2024-03-11 10:07:03 -07:00
Colin Ian King	5205a4aa8f	block: partitions: only define function mac_fix_string for CONFIG_PPC_PMAC The helper function mac_fix_string is only required with CONFIG_PPC_PMAC, add #if CONFIG_PPC_PMAC and #endif around the function. Cleans up clang scan build warning: block/partitions/mac.c:23:20: warning: unused function 'mac_fix_string' [-Wunused-function] Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Link: https://lore.kernel.org/r/20240308133921.2058227-1-colin.i.king@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-03-09 07:31:42 -07:00
Jens Axboe	d37977f0af	Merge tag 'md-6.9-20240306' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.9/block Pull MD atomic queue limits changes from Song. * tag 'md-6.9-20240306' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md: block: remove disk_stack_limits md: remove mddev->queue md: don't initialize queue limits md/raid10: use the atomic queue limit update APIs md/raid5: use the atomic queue limit update APIs md/raid1: use the atomic queue limit update APIs md/raid0: use the atomic queue limit update APIs md: add queue limit helpers md: add a mddev_is_dm helper md: add a mddev_add_trace_msg helper md: add a mddev_trace_remap helper	2024-03-06 11:15:24 -07:00
Christoph Hellwig	dd27a84b06	block: remove disk_stack_limits disk_stack_limits is unused now, remove it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Song Liu <song@kernel.org> Tested-by: Song Liu <song@kernel.org> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240303140150.5435-12-hch@lst.de	2024-03-06 08:59:54 -08:00
Li Lingfeng	b9355185d2	block: move capacity validation to blkpg_do_ioctl() Commit `6d4e80db4e` ("block: add capacity validation in bdev_add_partition()") added a check of a partition's start and end sectors to prevent exceeding the size of the disk when adding partitions. However, there is still no check when resizing partitions. Move the check to blkpg_do_ioctl() to cover resizing partitions. Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240305032132.548958-1-lilingfeng@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-03-06 08:32:06 -07:00
Roman Smirnov	93f52fbeaf	block: prevent division by zero in blk_rq_stat_sum() The expression dst->nr_samples + src->nr_samples may have zero value on overflow. It is necessary to add a check to avoid division by zero. Found by Linux Verification Center (linuxtesting.org) with Svace. Signed-off-by: Roman Smirnov <r.smirnov@omp.ru> Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru> Link: https://lore.kernel.org/r/20240305134509.23108-1-r.smirnov@omp.ru Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-03-06 08:31:54 -07:00
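The failure mode described above, a sample-count sum that wraps to zero and is then used as a divisor, can be sketched as follows. This is an illustration under assumptions, not the kernel's blk_rq_stat_sum(): `struct stat_sketch` and `stat_merge` are hypothetical, and the counts are assumed to be 32-bit unsigned values.

```c
#include <stdint.h>

/* Hypothetical running-mean accumulator (illustration only). */
struct stat_sketch {
	uint64_t mean;
	uint32_t nr_samples;
};

/*
 * Merge src_samples samples whose values add up to src_total into dst.
 * Unsigned addition wraps modulo 2^32, so the combined count can be 0
 * even when both inputs are non-zero; skip the merge in that case
 * rather than divide by zero.
 */
static void stat_merge(struct stat_sketch *dst, uint64_t src_total,
		       uint32_t src_samples)
{
	uint32_t n = dst->nr_samples + src_samples;

	if (n == 0)
		return;

	dst->mean = (src_total + dst->mean * dst->nr_samples) / n;
	dst->nr_samples = n;
}
```

For example, merging two samples totalling 40 into an accumulator holding two samples with mean 10 gives (40 + 20) / 4 = 15.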
Li kunyu	5f2ad31fbb	sed-opal: Remove the ret variable from the function The ret variable in the function serves no purpose and can be removed. Signed-off-by: Li kunyu <kunyu@nfschina.com> Link: https://lore.kernel.org/r/20240306101444.1244-1-kunyu@nfschina.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-03-06 08:29:49 -07:00
Li kunyu	2449be8c8c	sed-opal: Remove unnecessary ‘0’ values from ret ret is assigned first, so it does not need to initialize the assignment. Signed-off-by: Li kunyu <kunyu@nfschina.com> Link: https://lore.kernel.org/r/20240306100659.106521-1-kunyu@nfschina.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-03-06 08:29:43 -07:00
Li zeming	217fcc4807	sed-opal: Remove unnecessary ‘0’ values from err err is assigned first, so it does not need to initialize the assignment. Signed-off-by: Li zeming <zeming@nfschina.com> Link: https://lore.kernel.org/r/20240306100216.69340-1-zeming@nfschina.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-03-06 08:29:38 -07:00
Li zeming	147fe61334	sed-opal: Remove unnecessary ‘0’ values from error error is assigned first, so it does not need to initialize the assignment. Signed-off-by: Li zeming <zeming@nfschina.com> Link: https://lore.kernel.org/r/20240306095608.26839-1-zeming@nfschina.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-03-06 08:29:31 -07:00
Ricardo B. Marliere	f8c7511db0	block: make block_class constant Since commit `43a7206b09` ("driver core: class: make class_register() take a const *"), the driver core allows for struct class to be in read-only memory, so move the block_class structure to be declared at build time placing it into read-only memory, instead of having to be dynamically allocated at boot time. Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240305-class_cleanup-block-v1-1-130bb27b9c72@marliere.net Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-03-06 08:29:20 -07:00
Tony Battersby	38b43539d6	block: Fix page refcounts for unaligned buffers in __bio_release_pages() Fix an incorrect number of pages being released for buffers that do not start at the beginning of a page. Fixes: `1b151e2435` ("block: Remove special-casing of compound pages") Cc: stable@vger.kernel.org Signed-off-by: Tony Battersby <tonyb@cybernetics.com> Tested-by: Greg Edwards <gedwards@ddn.com> Link: https://lore.kernel.org/r/86e592a9-98d4-4cff-a646-0c0084328356@cybernetics.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-03-06 08:26:42 -07:00
Christian Brauner	86835c39e0	Merge tag 'vfs-6.9.rw_hint' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs Pull write hint fix from Christian Brauner: UFS devices are widely used in mobile applications, e.g. in smartphones. UFS vendors need data lifetime information to achieve good performance. Providing data lifetime information to UFS devices can result in up to 40% lower write amplification. Hence this patch series that restores the bi_write_hint member in struct bio. After this patch series has been merged, patches that implement data lifetime support in the SCSI disk (sd) driver will be sent to the Linux kernel SCSI maintainer. The following changes are included in this patch series: - Improvements for the F_GET_RW_HINT and F_SET_RW_HINT fcntls. - Move enum rw_hint into a new header file. - Support F_SET_RW_HINT for block devices to make it easy to test data lifetime support. - Restore the bio.bi_write_hint member and restore support in the VFS layer and also in the block layer for data lifetime information. The shell script that has been used to test the patch series combined with the SCSI patches is available at the end of this cover letter.
* tag 'vfs-6.9.rw_hint' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: block, fs: Restore the per-bio/request data lifetime fields fs: Propagate write hints to the struct block_device inode fs: Move enum rw_hint into a new header file fs: Split fcntl_rw_hint() fs: Verify write lifetime constants at compile time fs: Fix rw_hint validation Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-03-04 18:35:21 +01:00
Christoph Hellwig	8e0ef41286	dm: use queue_limits_set Use queue_limits_set which validates the limits and takes care of updating the readahead settings instead of directly assigning them to the queue. For that make sure all limits are actually updated before the assignment. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Link: https://lore.kernel.org/r/20240228225653.947152-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-03-01 08:54:42 -07:00
Christoph Hellwig	c1373f1cf4	block: add a queue_limits_stack_bdev helper Add a small wrapper around blk_stack_limits that allows passing a bdev for the bottom device and prints an error in case of misaligned device. The name fits into the new queue limits API and the intent is to eventually replace disk_stack_limits. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240228225653.947152-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-03-01 08:54:42 -07:00
Christoph Hellwig	631d4efb80	block: add a queue_limits_set helper Add a small wrapper around queue_limits_commit_update for stacking drivers that don't want to update existing limits, but set an entirely new set. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240228225653.947152-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-03-01 08:54:42 -07:00
Ming Lei	ec30b461f3	blk-mq: don't change nr_hw_queues and nr_maps for kdump kernel For most of ARCHs, 'nr_cpus=1' is passed for kdump kernel, so nr_hw_queues for each mapping is supposed to be 1 already. More importantly, this way may cause trouble for driver, because blk-mq and driver see different queue mapping since driver should setup hardware queue setting before calling into allocating blk-mq tagset. So not overriding nr_hw_queues and nr_maps for kdump kernel. Cc: Wen Xiong <wenxiong@us.ibm.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240228040857.306483-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-02-28 07:22:06 -07:00
Christian Brauner	ab838b3fd9	block: remove bdev_handle completely We just need to use the holder to indicate whether a block device open was exclusive or not. We did use to do that before but had to give that up once we switched to struct bdev_handle. Before struct bdev_handle we only stashed stuff in file->private_data if this was an exclusive open but after struct bdev_handle we always set file->private_data to a struct bdev_handle and so we had to use bdev_handle->mode or bdev_handle->holder. Now that we don't use struct bdev_handle anymore we can revert back to the old behavior. Link: https://lore.kernel.org/r/20240123-vfs-bdev-file-v2-32-adbd023e19cc@kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-02-25 12:05:28 +01:00
Christian Brauner	321de651fa	block: don't rely on BLK_OPEN_RESTRICT_WRITES when yielding write access Make it possible to detect a block device that was opened with restricted write access based only on BLK_OPEN_WRITE and bdev->bd_writers < 0 so we won't have to claim another FMODE_* flag. Link: https://lore.kernel.org/r/20240123-vfs-bdev-file-v2-31-adbd023e19cc@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-02-25 12:05:28 +01:00
Christian Brauner	7c09a4ed61	bdev: remove bdev pointer from struct bdev_handle We can always go directly via: * I_BDEV(bdev_file->f_inode) * I_BDEV(bdev_file->f_mapping->host) So keeping struct bdev in struct bdev_handle is redundant. Link: https://lore.kernel.org/r/20240123-vfs-bdev-file-v2-30-adbd023e19cc@kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-02-25 12:05:27 +01:00
Christian Brauner	a56aefca8d	bdev: make struct bdev_handle private to the block layer Link: https://lore.kernel.org/r/20240123-vfs-bdev-file-v2-29-adbd023e19cc@kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-02-25 12:05:27 +01:00
Christian Brauner	b1211a25c4	bdev: make bdev_{release, open_by_dev}() private to block layer Move both of them to the private block header. There's no caller in the tree anymore that uses them directly. Link: https://lore.kernel.org/r/20240123-vfs-bdev-file-v2-28-adbd023e19cc@kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-02-25 12:05:27 +01:00
Christian Brauner	e97d06a465	bdev: remove bdev_open_by_path() Link: https://lore.kernel.org/r/20240123-vfs-bdev-file-v2-27-adbd023e19cc@kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-02-25 12:05:27 +01:00
Christian Brauner	190f676afa	block/genhd: port disk_scan_partitions() to file This may run from a kernel thread via device_add_disk(). So this could also use __fput_sync() if we were worried about EBUSY. But when it is called from a kernel thread it's always BLK_OPEN_READ so EBUSY can't really happen even if we do BLK_OPEN_RESTRICT_WRITES or BLK_OPEN_EXCL. Otherwise it's called from an ioctl on the block device which is only called from userspace and can rely on task work. Link: https://lore.kernel.org/r/20240123-vfs-bdev-file-v2-3-adbd023e19cc@kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-02-25 12:05:22 +01:00
Christian Brauner	e5ca9d3916	block/ioctl: port blkdev_bszset() to file Link: https://lore.kernel.org/r/20240123-vfs-bdev-file-v2-2-adbd023e19cc@kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-02-25 12:05:22 +01:00
Christian Brauner	f3a608827d	bdev: open block device as files Add two new helpers to allow opening block devices as files. This is not the final infrastructure. This still opens the block device before opening a struct file. Until we have removed all references to struct bdev_handle we can't switch the order: * Introduce blk_to_file_flags() to translate from block specific to flags usable to open a new file. * Introduce bdev_file_open_by_{dev,path}(). * Introduce temporary sb_bdev_handle() helper to retrieve a struct bdev_handle from a block device file and update places that directly reference struct bdev_handle to rely on it. * Don't count block device opens against the number of open files. A bdev_file_open_by_{dev,path}() file is never installed into any file descriptor table. One idea that came to mind was to use kernel_tmpfile_open() which would require us to pass a path and it would then call do_dentry_open() going through the regular fops->open::blkdev_open() path. But then we're back to the problem of routing block specific flags such as BLK_OPEN_RESTRICT_WRITES through the open path and would have to waste FMODE_* flags every time we add a new one. With this we can avoid using a flag bit and we have more leeway in how we open block devices from bdev_open_by_{dev,path}(). Link: https://lore.kernel.org/r/20240123-vfs-bdev-file-v2-1-adbd023e19cc@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-02-25 12:05:21 +01:00
Chengming Zhou	82c6515d8a	bdev: remove SLAB_MEM_SPREAD flag usage The SLAB_MEM_SPREAD flag is already a no-op as of 6.8-rc1, remove its usage so we can delete it from slab. No functional change. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Link: https://lore.kernel.org/r/20240224134646.829105-1-chengming.zhou@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-02-24 13:16:08 -07:00
Qais Yousef	af550e4c96	block/blk-mq: Don't complete locally if capacities are different The logic in blk_mq_complete_need_ipi() assumes SMP systems where all CPUs have equal compute capacities and only LLC cache can make a difference on perceived performance. But this assumption falls apart on HMP systems where LLC is shared, but the CPUs have different capacities. Staying local then can have a big performance impact if the IO request was done from a CPU with higher capacity but the interrupt is serviced on a lower capacity CPU. Use the new cpus_equal_capacity() function to check if we need to send an IPI. Without the patch I see the BLOCK softirq always running on little cores (where the hardirq is serviced). With it I can see it running on all cores. This was noticed after the topology change [1] where now on a big.LITTLE we truly get that the LLC is shared between all cores whereas in the past it was being misrepresented for historical reasons. The logic exposed a missing dependency on capacities for such systems where there can be a big performance difference between the CPUs. This of course introduced a noticeable change in behavior depending on how the topology is presented. Leading to regressions in some workloads as the performance of the BLOCK softirq on littles can be noticeably worse on some platforms. Worth noting that we could have checked for capacities being greater than or equal instead of for equality. This will lead to favouring higher performance always. But opted for equality instead to match the performance of the requester without making an assumption that can lead to power trade-offs which these systems tend to be sensitive about. If the requester would like to run faster, it's better to rely on the scheduler to give the IO requester via some facility to run on a faster core; and then if the interrupt triggered on a CPU with different capacity we'll make sure to match the performance the requester is supposed to run at. 
[1] https://lpc.events/event/16/contributions/1342/attachments/962/1883/LPC-2022-Android-MC-Phantom-Domains.pdf Signed-off-by: Qais Yousef <qyousef@layalina.io> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20240223155749.2958009-3-qyousef@layalina.io Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-02-24 12:48:01 -07:00
Keith Busch	8a08c5fd89	blk-lib: check for kill signal Some of these block operations can access a significant capacity and take longer than the user expected. A user may change their mind about wanting to run that command and attempt to kill the process and do something else with their device. But since the task is uninterruptible, they have to wait for it to finish, which could be many hours. Check for a fatal signal at each iteration so the user doesn't have to wait for their regretted operation to complete naturally. Reported-by: Conrad Meyer <conradmeyer@meta.com> Tested-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20240223155910.3622666-5-kbusch@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-02-24 12:46:46 -07:00
Keith Busch	0eb4db4706	block: io wait hang check helper This is the same in two places, and another will be added soon. Create a helper for it. Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20240223155910.3622666-4-kbusch@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-02-24 12:46:46 -07:00
Keith Busch	76a27e1b53	block: cleanup __blkdev_issue_write_zeroes Use min to calculate the next number of sectors like everyone else. Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20240223155910.3622666-3-kbusch@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-02-24 12:46:46 -07:00
Keith Busch	5affe497c3	block: blkdev_issue_secure_erase loop style Use consistent coding style in this file. All the other loops for the same purpose use "while (nr_sects)", so they win. Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20240223155910.3622666-2-kbusch@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-02-24 12:46:46 -07:00
Li Nan	03f12122b2	block: fix deadlock between bd_link_disk_holder and partition scan 'open_mutex' of gendisk is used to protect open/close block devices. But in bd_link_disk_holder(), it is used to protect the creation of symlink between holding disk and slave bdev, which introduces some issues. When bd_link_disk_holder() is called, the driver is usually in the process of initialization/modification and may suspend submitting io. At this time, any io that holds 'open_mutex', such as scanning partitions, can cause deadlocks. For example, in raid: T1: bdev_open_by_dev -> lock open_mutex [1] -> ... -> efi_partition -> ... -> md_submit_bio -> md_handle_request -> wait mddev_resume. T2: md_ioctl -> mddev_suspend (suspend all io) -> md_add_new_disk -> bind_rdev_to_array -> bd_link_disk_holder -> try lock open_mutex [2]. T1 scans partitions, T2 adds a new device to raid. T1 waits for T2 to resume mddev, but T2 waits for open_mutex held by T1. Deadlock occurs. Fix it by introducing a local mutex 'blk_holder_mutex' to replace 'open_mutex'. Fixes: `1b0a2d950e` ("md: use new apis to suspend array for ioctls involed array reconfiguration") Reported-by: mgperkow@gmail.com Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218459 Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240221090122.1281868-1-linan666@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-02-23 07:44:19 -07:00
Damien Le Moal	522d73526f	block: Do not include rbtree.h in blk-zoned.c The block zone code does not use RB-tree. So remove the include of linux/rbtree.h as it is not needed. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20240222131724.1803520-2-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-02-22 10:35:18 -07:00
Damien Le Moal	c8f6f88d25	block: Clear zone limits for a non-zoned stacked queue Device mapper may create a non-zoned mapped device out of a zoned device (e.g., the dm-zoned target). In such a case, some queue limits such as max_zone_append_sectors and zone_write_granularity end up being non-zero values for a block device that is not zoned. Avoid this by clearing these limits in blk_stack_limits() when the stacked zoned limit is false. Fixes: `3093a47972` ("block: inherit the zoned characteristics in blk_stack_limits") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20240222131724.1803520-1-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-02-22 10:35:18 -07:00
Christoph Hellwig	a3911966bd	block: fix virt_boundary handling in blk_validate_limits Don't set the default max_segment_size value when a virt_boundary is used. Fixes: `d690cb8ae1` ("block: add an API to atomically update queue limits") Reported-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Geert Uytterhoeven <geert+renesas@glider.be> Link: https://lore.kernel.org/r/20240221125010.3609444-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-02-21 07:21:28 -07:00
Christoph Hellwig	74fa8f9c55	block: pass a queue_limits argument to blk_alloc_disk Pass a queue_limits to blk_alloc_disk and apply it if non-NULL. This will allow allocating queues with valid queue limits instead of setting the values one at a time later. Also change blk_alloc_disk to return an ERR_PTR instead of just NULL which can't distinguish errors. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Link: https://lore.kernel.org/r/20240215071055.2201424-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-02-19 16:58:23 -07:00
Greg Joyce	5429c8de56	block: sed-opal: handle empty atoms when parsing response The SED Opal response parsing function response_parse() does not handle the case of an empty atom in the response. This causes the entry count to be too high and the response fails to be parsed. Recognizing, but ignoring, empty atoms allows response handling to succeed. Signed-off-by: Greg Joyce <gjoyce@linux.ibm.com> Link: https://lore.kernel.org/r/20240216210417.3526064-2-gjoyce@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-02-16 15:52:45 -07:00
Christoph Hellwig	27e32cd23f	block: pass a queue_limits argument to blk_mq_alloc_disk Pass a queue_limits to blk_mq_alloc_disk and apply it if non-NULL. This will allow allocating queues with valid queue limits instead of setting the values one at a time later. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240213073425.1621680-11-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-02-13 08:56:59 -07:00
Christoph Hellwig	9ac4dd8c47	block: pass a queue_limits argument to blk_mq_init_queue Pass a queue_limits to blk_mq_init_queue and apply it if non-NULL. This will allow allocating queues with valid queue limits instead of setting the values one at a time later. Also rename the function to blk_mq_alloc_queue as that is a much better name for a function that allocates a queue and always pass the queuedata argument instead of having a separate version for the extra argument. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240213073425.1621680-10-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-02-13 08:56:59 -07:00
Christoph Hellwig	ad751ba1f8	block: pass a queue_limits argument to blk_alloc_queue Pass a queue_limits to blk_alloc_queue and apply it after validating and capping the values using blk_validate_limits. This will allow allocating queues with valid queue limits instead of setting the values one at a time later. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240213073425.1621680-9-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-02-13 08:56:59 -07:00
Christoph Hellwig	ff956a3be9	block: use queue_limits_commit_update in queue_discard_max_store Convert queue_discard_max_store to use queue_limits_commit_update to check and update the max_discard_sectors limit and freeze the queue before doing so to ensure we don't have requests in flight while changing the limits. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240213073425.1621680-8-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-02-13 08:56:59 -07:00
Christoph Hellwig	4f563a6473	block: add a max_user_discard_sectors queue limit Add a new max_user_discard_sectors limit that mirrors max_user_sectors and stores the value that the user manually set. This now allows updates of the max_hw_discard_sectors to not worry about the user limit. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240213073425.1621680-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-02-13 08:56:59 -07:00
Christoph Hellwig	0327ca9d53	block: use queue_limits_commit_update in queue_max_sectors_store Convert queue_max_sectors_store to use queue_limits_commit_update to check and update the max_sectors limit and freeze the queue before doing so to ensure we don't have requests in flight while changing the limits. Note that this removes the previously held queue_lock that doesn't protect against any other reader or writer. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240213073425.1621680-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-02-13 08:56:59 -07:00
Christoph Hellwig	d690cb8ae1	block: add an API to atomically update queue limits Add a new queue_limits_{start,commit}_update pair of functions that allows taking an atomic snapshot of queue limits, update it, and commit it if it passes validity checking. Also use the low-level validation helper to implement blk_set_default_limits instead of duplicating the initialization. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240213073425.1621680-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-02-13 08:56:59 -07:00
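The snapshot, modify, validate, commit flow that this commit introduces can be sketched in plain userspace C. The types, the validation rule, and the function names below (`limits_start_update_sketch`, `limits_commit_update_sketch`) are hypothetical stand-ins rather than the kernel API, and the locking that makes the real update atomic is deliberately left out.

```c
#include <assert.h>

/* Hypothetical, heavily reduced set of limits. */
struct limits_sketch {
    unsigned int max_sectors;
    unsigned int max_hw_sectors;
};

struct queue_sketch {
    struct limits_sketch limits;
};

/* Step 1: hand the caller a private copy of the current limits. */
static struct limits_sketch
limits_start_update_sketch(const struct queue_sketch *q)
{
    return q->limits;
}

/*
 * Step 2: validate the edited copy and install it only if it is sane.
 * Returns 0 on success and -1 when validation fails, in which case
 * the queue keeps its previous limits.
 */
static int limits_commit_update_sketch(struct queue_sketch *q,
                                       const struct limits_sketch *lim)
{
    /* Assumed rule for the example: soft limit within the hardware one. */
    if (lim->max_sectors == 0 || lim->max_sectors > lim->max_hw_sectors)
        return -1;

    q->limits = *lim;   /* all fields change together */
    return 0;
}
```

Working on a copy means a rejected update never leaves the queue with a half-applied set of values, which is the property the commit message calls an atomic update.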
Christoph Hellwig	c490f226a0	block: decouple blk_set_stacking_limits from blk_set_default_limits blk_set_stacking_limits uses very little from blk_set_default_limits. Open code these initializations in preparation for rewriting blk_set_default_limits. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240213073425.1621680-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-02-13 08:56:59 -07:00
Christoph Hellwig	b9947297d0	block: refactor disk_update_readahead Factor out a blk_apply_bdi_limits limits helper that can be used with an explicit queue_limits argument, which will be useful later. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240213073425.1621680-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-02-13 08:56:59 -07:00
Kanchan Joshi	60d21aac52	block: support PI at non-zero offset within metadata Block layer integrity processing assumes that protection information (PI) is placed in the first bytes of each metadata block. Remove this limitation and include the metadata before the PI in the calculation of the guard tag. Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Signed-off-by: Chinmay Gameti <c.gameti@samsung.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240201130126.211402-3-joshi.k@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-02-12 08:49:31 -07:00
Kanchan Joshi	6b5c132a3f	block: refactor guard helpers Allow computation using the existing guard value. This is a prep patch. Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240201130126.211402-2-joshi.k@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-02-12 08:49:31 -07:00
Johannes Thumshirn	71f4ecdbb4	block: remove gfp_flags from blkdev_zone_mgmt Now that all callers pass in GFP_KERNEL to blkdev_zone_mgmt() and use memalloc_no{io,fs}_{save,restore}() to define the allocation scope, we can drop the gfp_mask parameter from blkdev_zone_mgmt() as well as blkdev_zone_reset_all() and blkdev_zone_reset_all_emulated(). Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Link: https://lore.kernel.org/r/20240128-zonefs_nofs-v3-5-ae3b7c8def61@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-02-12 08:41:16 -07:00
Kunwu Chan	48ff13a618	block: Simplify the allocation of slab caches Use the new KMEM_CACHE() macro instead of direct kmem_cache_create to simplify the creation of SLAB caches. Signed-off-by: Kunwu Chan <chentao@kylinos.cn> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240131094323.146659-1-chentao@kylinos.cn Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-02-08 11:29:40 -07:00
Pavel Begunkov	e516c3fc6c	block: optimise in irq bio put caching When enlisting a bio into ->free_list_irq we protect the list by disabling irqs. It's likely they're already disabled and performance of local_irq_{save,restore}() is decent, but it's not zero cost. Let's only use the irq cache when we're serving a hard irq, which allows to remove local_irq_{save,restore}(), and fall back to bio_free() in all left cases. Profiles indicate that the bio_put() cost is reduced by ~3.5 times (1.76% -> 0.49%), and total throughput of a CPU bound benchmark improve by around 1% (t/io_uring with high QD and several drives). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/36d207540b7046c653cc16e5ff08fe7234b19f81.1707314970.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-02-08 10:18:48 -07:00
Pavel Begunkov	c9f5f3aa19	block: extend bio caching to task context bio_put_percpu_cache() puts all non-iopoll bios into the irq-safe list, which entails disabling irqs. The overhead of that is not that bad when interrupts are already off but getting worse otherwise. We can optimise it when we're in the task context by using ->free_list directly just as the IOPOLL path does. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/4774e1a0f905f96c63174b0f3e4f79f0d9b63246.1707314970.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-02-08 10:18:47 -07:00
Tejun Heo	2a427b49d0	blk-iocost: Fix an UBSAN shift-out-of-bounds warning When iocg_kick_delay() is called from a CPU different than the one which set the delay, @now may be in the past of @iocg->delay_at leading to the following warning: UBSAN: shift-out-of-bounds in block/blk-iocost.c:1359:23 shift exponent 18446744073709 is too large for 64-bit type 'u64' (aka 'unsigned long long') ... Call Trace: <TASK> dump_stack_lvl+0x79/0xc0 __ubsan_handle_shift_out_of_bounds+0x2ab/0x300 iocg_kick_delay+0x222/0x230 ioc_rqos_merge+0x1d7/0x2c0 __rq_qos_merge+0x2c/0x80 bio_attempt_back_merge+0x83/0x190 blk_attempt_plug_merge+0x101/0x150 blk_mq_submit_bio+0x2b1/0x720 submit_bio_noacct_nocheck+0x320/0x3e0 __swap_writepage+0x2ab/0x9d0 The underflow itself doesn't really affect the behavior in any meaningful way; however, the past timestamp may exaggerate the delay amount calculated later in the code, which shouldn't be a material problem given the nature of the delay mechanism. If @now is in the past, this CPU is racing another CPU which recently set up the delay and there's nothing this CPU can contribute w.r.t. the delay. Let's bail early from iocg_kick_delay() in such cases. Reported-by: Breno Leitão <leitao@debian.org> Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: `5160a5a53c` ("blk-iocost: implement delay adjustment hysteresis") Link: https://lore.kernel.org/r/ZVvc9L_CYk5LO1fT@slm.duckdns.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-02-08 10:11:39 -07:00
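The failure mode in this commit, an unsigned time delta that underflows and is then used as a shift count, can be reproduced in a small userspace sketch. The function name `decayed_delay_sketch`, its parameters, and the halving-per-interval decay are assumptions made for illustration and do not mirror the kernel function's signature.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical decay helper: halve `delay` once per `interval` time
 * units elapsed since `delay_at`. If `now` lies before `delay_at`
 * (another CPU set the delay more recently than our timestamp), bail
 * early and return the delay unchanged instead of computing a wrapped
 * delta.
 */
static uint64_t decayed_delay_sketch(uint64_t delay, uint64_t delay_at,
                                     uint64_t now, uint64_t interval)
{
    uint64_t shift;

    assert(interval != 0);

    if (now < delay_at)
        return delay;           /* nothing useful to contribute */

    shift = (now - delay_at) / interval;
    if (shift >= 64)
        return 0;               /* shifting by >= 64 is undefined in C */

    return delay >> shift;
}
```

Without the early return, `now - delay_at` would wrap to a value near 2^64 and produce a shift exponent far larger than the width of the type, which is what the UBSAN report in the commit message shows.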
Jan Kara	f814bdda77	blk-wbt: Fix detection of dirty-throttled tasks The detection of dirty-throttled tasks in blk-wbt has been subtly broken since its beginning in 2016. Namely if we are doing cgroup writeback and the throttled task is not in the root cgroup, balance_dirty_pages() will set dirty_sleep for the non-root bdi_writeback structure. However blk-wbt checks dirty_sleep only in the root cgroup bdi_writeback structure. Thus detection of recently throttled tasks is not working in this case (we noticed this when we switched to cgroup v2 and suddenly writeback was slow). Since blk-wbt has no easy way to get to proper bdi_writeback and furthermore its intention has always been to work on the whole device rather than on individual cgroups, just move the dirty_sleep timestamp from bdi_writeback to backing_dev_info. That fixes the checking for recently throttled task and saves memory for everybody as a bonus. CC: stable@vger.kernel.org Fixes: `b57d74aff9` ("writeback: track if we're sleeping on progress in balance_dirty_pages()") Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240123175826.21452-1-jack@suse.cz [axboe: fixup indentation errors] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-02-06 09:44:03 -07:00
Bart Van Assche	449813515d	block, fs: Restore the per-bio/request data lifetime fields Restore support for passing data lifetime information from filesystems to block drivers. This patch reverts commit `b179c98f76` ("block: Remove request.write_hint") and commit `c75e707fe1` ("block: remove the per-bio/request write hint"). This patch does not modify the size of struct bio because the new bi_write_hint member fills a hole in struct bio. pahole reports the following for struct bio on an x86_64 system with this patch applied: /* size: 112, cachelines: 2, members: 20 */ /* sum members: 110, holes: 1, sum holes: 2 */ /* last cacheline: 48 bytes */ Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20240202203926.2478590-7-bvanassche@acm.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-02-06 14:31:05 +01:00
Tang Yizhou	3bca7640b4	blk-throttle: Eliminate redundant checks for data direction After calling throtl_peek_queued(), the data direction can be determined so there is no need to call bio_data_dir() to check the direction again. Signed-off-by: Tang Yizhou <yizhou.tang@shopee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240123081248.3752878-1-yizhou.tang@shopee.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-02-05 10:16:12 -07:00
Jens Axboe	06b23f92af	block: update cached timestamp post schedule/preemption Mark the task as having a cached timestamp when we assign it, so we can efficiently check if it needs updating post being scheduled back in. This covers both the actual schedule out case, which would've flushed the plug, and the preemption case which doesn't touch the plugged requests (for many reasons, one of them being then we'd need to have preemption disabled around plug state manipulation). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-02-05 10:07:34 -07:00
Jens Axboe	da4c8c3d09	block: cache current nsec time in struct blk_plug Querying the current time is the most costly thing we do in the block layer per IO, and depending on kernel config settings, we may do it many times per IO. None of the callers actually need nsec granularity. Take advantage of that by caching the current time in the plug, with the assumption here being that any time checking will be temporally close enough that the slight loss of precision doesn't matter. If the block plug gets flushed, eg on preempt or schedule out, then we invalidate the cached clock. On a basic peak IOPS test case with iostats enabled, this changes the performance from: IOPS=108.41M, BW=52.93GiB/s, IOS/call=31/31 IOPS=108.43M, BW=52.94GiB/s, IOS/call=32/32 IOPS=108.29M, BW=52.88GiB/s, IOS/call=31/32 IOPS=108.35M, BW=52.91GiB/s, IOS/call=32/32 IOPS=108.42M, BW=52.94GiB/s, IOS/call=31/31 IOPS=108.40M, BW=52.93GiB/s, IOS/call=32/32 IOPS=108.31M, BW=52.89GiB/s, IOS/call=32/31 to IOPS=118.79M, BW=58.00GiB/s, IOS/call=31/32 IOPS=118.62M, BW=57.92GiB/s, IOS/call=31/31 IOPS=118.80M, BW=58.01GiB/s, IOS/call=32/31 IOPS=118.78M, BW=58.00GiB/s, IOS/call=32/32 IOPS=118.69M, BW=57.95GiB/s, IOS/call=32/31 IOPS=118.62M, BW=57.92GiB/s, IOS/call=32/31 IOPS=118.63M, BW=57.92GiB/s, IOS/call=31/32 which is more than a 9% improvement in performance. Looking at perf diff, we can see a huge reduction in time overhead: 10.55% -9.88% [kernel.vmlinux] [k] read_tsc 1.31% -1.22% [kernel.vmlinux] [k] ktime_get Note that since this relies on blk_plug for the caching, it's only applicable to the issue side. But this is where most of the time calls happen anyway. On the completion side, cached time stamping is done with struct io_comp patch, as long as the driver supports it. It's also worth noting that the above testing doesn't enable any of the higher cost CPU items on the block layer side, like wbt, cgroups, iocost, etc, which all would add additional time querying and hence overhead. 
IOW, results would likely look even better in comparison with those enabled, as distros would do. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-02-05 10:07:28 -07:00
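The plug-based time caching described in the entry above can be illustrated with a small userspace sketch. All names here (`struct plug_sketch`, `plug_time_get_ns()`, `plug_flush()`, `counting_clock()`) are hypothetical and only stand in for the idea; the real helpers and field names in the kernel may differ. The clock is passed in as a function pointer so the example stays self-contained and the number of clock reads can be observed.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical plug: a cached timestamp of 0 means "nothing cached". */
struct plug_sketch {
    uint64_t cur_ktime;
};

typedef uint64_t (*clock_fn)(void);

/* Demonstration clock: counts how often the expensive read happens. */
static uint64_t clock_reads;
static uint64_t counting_clock(void)
{
    return ++clock_reads * 1000;
}

/* Return the cached time if a plug is active, reading the clock at most once. */
static uint64_t plug_time_get_ns(struct plug_sketch *plug, clock_fn clock)
{
    if (plug == NULL)
        return clock();             /* no plug: always query the clock */
    if (plug->cur_ktime == 0)
        plug->cur_ktime = clock();  /* first query under this plug */
    return plug->cur_ktime;
}

/* Flushing the plug (e.g. on schedule out) invalidates the cached time. */
static void plug_flush(struct plug_sketch *plug)
{
    plug->cur_ktime = 0;
}
```

The trade-off is the one the commit message names: callers inside one plug window see a slightly stale timestamp in exchange for skipping repeated clock reads.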
Jens Axboe	08420cf70c	block: add blk_time_get_ns() and blk_time_get() helpers Convert any user of ktime_get_ns() to use blk_time_get_ns(), and ktime_get() to blk_time_get(), so we have a unified API for querying the current time in nanoseconds or as ktime. No functional changes intended, this patch just wraps ktime_get_ns() and ktime_get() with a block helper. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-02-05 10:07:22 -07:00
Jens Axboe	c4e47bbb00	block: move cgroup time handling code into blk.h In preparation for moving time keeping into blk.h, move the cgroup related code for timestamps in here too. This will help avoid a circular dependency, and also moves it into a more appropriate header as this one is private to the block layer code. Leave struct bio_issue in blk_types.h as it's a proper time definition. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-02-05 10:07:17 -07:00
Christoph Hellwig	72e84e909e	blk-mq: special case cached requests less Share the main merge / split / integrity preparation code between the cached request vs newly allocated request cases, and add comments explaining the cached request handling. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Tested-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20240124092658.2258309-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-02-05 10:06:53 -07:00
Christoph Hellwig	337e89feb7	blk-mq: introduce a blk_mq_peek_cached_request helper Add a new helper to check if there is a suitable cached request in blk_mq_submit_bio. This removes open coded logic in blk_mq_submit_bio and moves some checks that so far are in blk_mq_use_cached_rq to be performed earlier. This avoids the case where we first check with the cached request but then later end up allocating a new one anyway and need to grab a queue reference. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Tested-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20240124092658.2258309-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-02-05 10:05:16 -07:00
Christoph Hellwig	0f299da55a	blk-mq: move blk_mq_attempt_bio_merge out blk_mq_get_new_requests blk_mq_attempt_bio_merge has nothing to do with allocating a new request, it avoids allocating a new request. Move the call out of blk_mq_get_new_requests and into the only caller. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Tested-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20240124092658.2258309-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-02-05 10:03:51 -07:00
Hongyu Jin	f3c89983cb	block: Fix where bio IO priority gets set Commit `82b74cac28` ("blk-ioprio: Convert from rqos policy to direct call") pushed setting bio I/O priority down into blk_mq_submit_bio() -- which is too low within block core's submit_bio() because it skips setting I/O priority for block drivers that implement fops->submit_bio() (e.g. DM, MD, etc). Fix this by moving bio_set_ioprio() up from blk-mq.c to blk-core.c and call it from submit_bio(). This ensures all block drivers call bio_set_ioprio() during initial bio submission. Fixes: `a78418e6a0` ("block: Always initialize bio IO priority on submit") Co-developed-by: Yibin Ding <yibin.ding@unisoc.com> Signed-off-by: Yibin Ding <yibin.ding@unisoc.com> Signed-off-by: Hongyu Jin <hongyu.jin@unisoc.com> Reviewed-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> [snitzer: revised commit header] Signed-off-by: Mike Snitzer <snitzer@kernel.org> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20240130202638.62600-2-snitzer@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-02-01 11:00:06 -07:00
Christoph Hellwig	19871b5c7a	iomap: pass the length of the dirty region to ->map_blocks Let the file system know how much dirty data exists at the passed in offset. This allows file systems to allocate the right amount of space that actually is written back if they can't eagerly convert (e.g. because they don't support unwritten extents). Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20231207072710.176093-15-hch@lst.de Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-02-01 14:20:13 +01:00
Christian A. Ehrhardt	13f3956eb5	block: Fix WARNING in _copy_from_iter Syzkaller reports a warning in _copy_from_iter because an iov_iter is supposedly used in the wrong direction. The reason is that syzkaller managed to generate a request with a transfer direction of SG_DXFER_TO_FROM_DEV. This instructs the kernel to copy user buffers into the kernel, read into the copied buffers and then copy the data back to user space. Thus the iovec is used in both directions. Detect this situation in the block layer and construct a new iterator with the correct direction for the copy-in. Reported-by: syzbot+a532b03fdfee2c137666@syzkaller.appspotmail.com Closes: https://lore.kernel.org/lkml/0000000000009b92c10604d7a5e9@google.com/t/ Reported-by: syzbot+63dec323ac56c28e644f@syzkaller.appspotmail.com Closes: https://lore.kernel.org/lkml/0000000000003faaa105f6e7c658@google.com/T/ Signed-off-by: Christian A. Ehrhardt <lk@c--e.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240121202634.275068-1-lk@c--e.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-01-23 08:56:55 -07:00
Li Lingfeng	7777f47f2e	block: Move checking GENHD_FL_NO_PART to bdev_add_partition() Commit `1a721de848` ("block: don't add or resize partition on the disk with GENHD_FL_NO_PART") prevented all operations about partitions on disks with GENHD_FL_NO_PART in blkpg_do_ioctl() since they are meaningless. However, it changed error code in some scenarios. So move checking GENHD_FL_NO_PART to bdev_add_partition() to eliminate impact. Fixes: `1a721de848` ("block: don't add or resize partition on the disk with GENHD_FL_NO_PART") Reported-by: Allison Karlitskaya <allison.karlitskaya@redhat.com> Closes: https://lore.kernel.org/all/CAOYeF9VsmqKMcQjo1k6YkGNujwN-nzfxY17N3F-CMikE1tYp+w@mail.gmail.com/ Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240118130401.792757-1-lilingfeng@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-01-22 09:51:29 -07:00
Linus Torvalds	9d1694dc91	for-6.8/block-2024-01-18 Merge tag 'for-6.8/block-2024-01-18' of git://git.kernel.dk/linux Pull block fixes from Jens Axboe: - NVMe pull request via Keith: - tcp, fc, and rdma target fixes (Maurizio, Daniel, Hannes, Christoph) - discard fixes and improvements (Christoph) - timeout debug improvements (Keith, Max) - various cleanups (Daniel, Max, Giuxen) - trace event string fixes (Arnd) - shadow doorbell setup on reset fix (William) - a write zeroes quirk for SK Hynix (Jim) - MD pull request via Song: - Sparse warning since v6.0 (Bart) - /proc/mdstat regression since v6.7 (Yu Kuai) - Use symbolic error value (Christian) - IO Priority documentation update (Christian) - Fix for accessing queue limits without having entered the queue (Christoph, me) - Fix for loop dio support (Christoph) - Move null_blk off deprecated ida interface (Christophe) - Ensure nbd initializes full msghdr (Eric) - Fix for a regression with the folio conversion, which is now easier to hit because of an unrelated change (Matthew) - Remove redundant check in virtio-blk (Li) - Fix for a potential hang in sbitmap (Ming) - Fix for partial zone appending (Damien) - Misc changes and fixes (Bart, me, Kemeng, Dmitry) * tag 'for-6.8/block-2024-01-18' of git://git.kernel.dk/linux: (45 commits) Documentation: block: ioprio: Update schedulers loop: fix the the direct I/O support check when used on top of block devices blk-mq: Remove the hctx 'run' debugfs attribute nbd: always initialize struct msghdr completely block: Fix iterating over an empty bio with bio_for_each_folio_all block: bio-integrity: fix kcalloc() arguments order virtio_blk: remove duplicate check if queue is broken in virtblk_done sbitmap: remove stale comment in sbq_calc_wake_batch block: Correct a documentation comment in blk-cgroup.c null_blk: Remove usage of the deprecated ida_simple_xx() API block: ensure we hold a queue reference when using queue limits blk-mq: rename blk_mq_can_use_cached_rq block: print symbolic error name instead of error code blk-mq: fix IO hang from sbitmap wakeup race nvmet-rdma: avoid circular locking dependency on install_queue() nvmet-tcp: avoid circular locking dependency on install_queue() nvme-pci: set doorbell config before unquiescing block: fix partial zone append completion handling in req_bio_endio() block/iocost: silence warning on 'last_period' potentially being unused md/raid1: Use blk_opf_t for read and write operations ...	2024-01-18 18:22:40 -08:00
Bart Van Assche	49e60333d7	blk-mq: Remove the hctx 'run' debugfs attribute Nobody uses the debugfs hctx 'run' attribute. Hence remove this attribute and also the code that updates the corresponding member variable. Suggested-by: Jens Axboe <axboe@kernel.dk> Cc: Gabriel Ryan <gabe@cs.columbia.edu> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20240117203609.4122520-1-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-01-17 14:16:34 -07:00
Dmitry Antipov	be50df31c4	block: bio-integrity: fix kcalloc() arguments order When compiling with gcc version 14.0.1 20240116 (experimental) and W=1, I've noticed the following warning: block/bio-integrity.c: In function 'bio_integrity_map_user': block/bio-integrity.c:339:38: warning: 'kcalloc' sizes specified with 'sizeof' in the earlier argument and not in the later argument [-Wcalloc-transposed-args] 339 | bvec = kcalloc(sizeof(*bvec), nr_vecs, GFP_KERNEL); | ^ block/bio-integrity.c:339:38: note: earlier argument should specify number of elements, later size of each element Since 'n' and 'size' arguments of 'kcalloc()' are multiplied to calculate the final size, their actual order doesn't affect the result and so this is not a bug. But it's still worth fixing. Fixes: `492c5d4559` ("block: bio-integrity: directly map user buffers") Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru> Reviewed-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20240116143437.89060-1-dmantipov@yandex.ru Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-01-16 09:51:22 -07:00
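The argument order that the kcalloc() warning asks for is the same as for the standard C `calloc()`: element count first, element size second. The following is a userspace sketch using `calloc()` as a stand-in for `kcalloc()`; `struct vec_sketch` and `alloc_vecs()` are hypothetical names for illustration only.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical element type standing in for the vector being allocated. */
struct vec_sketch {
    void *page;
    unsigned int len;
    unsigned int offset;
};

/* Allocate a zeroed array: number of elements first, size of each second. */
static struct vec_sketch *alloc_vecs(size_t nr_vecs)
{
    return calloc(nr_vecs, sizeof(struct vec_sketch));
}
```

Swapping the two arguments yields the same byte count, as the commit message notes, but the conventional order keeps the code clear of -Wcalloc-transposed-args and matches what readers expect.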
Nicky Chorley	521277d12b	block: Correct a documentation comment in blk-cgroup.c Commit `99e6038743` ("blk-cgroup: pass a gendisk to the blkg allocation helpers") changed blkg_alloc() to take a struct gendisk instead of a struct request_queue, but the documentation comment still referred to q. So, update that comment to refer to disk instead and fix a typo. Signed-off-by: Nicky Chorley <ndchorley@gmail.com> Link: https://lore.kernel.org/r/20240114191056.6992-1-ndchorley@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-01-15 07:23:38 -07:00
Jens Axboe	7b4f36cd22	block: ensure we hold a queue reference when using queue limits q_usage_counter is the only thing preventing us from the limits changing under us in __bio_split_to_limits, but blk_mq_submit_bio doesn't hold it while calling into it. Move the splitting inside the region where we know we've got a queue reference. Ideally this could still remain a shared section of code, but let's keep the fix simple and defer any refactoring here to later. Reported-by: Christoph Hellwig <hch@lst.de> Fixes: `900e080752` ("block: move queue enter logic into blk_mq_submit_bio()") Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-01-12 21:09:42 -07:00
Christoph Hellwig	309ce67414	blk-mq: rename blk_mq_can_use_cached_rq blk_mq_can_use_cached_rq doesn't just check if we can use the request, but also performs the work to actually use it. Remove the _can in the naming, and improve the comment describing the function. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240111135705.2155518-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-01-12 09:11:34 -07:00
Christian Heusel	25c1772a04	block: print symbolic error name instead of error code Utilize the %pe print specifier to get the symbolic error name as a string (e.g. "-ENOMEM") in the log message instead of the error code to increase its readability. This change was suggested in https://lore.kernel.org/all/92972476-0b1f-4d0a-9951-af3fc8bc6e65@suswa.mountain/ Signed-off-by: Christian Heusel <christian@heusel.eu> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20240111231521.1596838-1-christian@heusel.eu Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-01-12 09:07:46 -07:00
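The effect of printing a symbolic error name can be approximated in userspace. This sketch does not use the kernel's %pe specifier (which is not available in standard printf); instead a hypothetical `err_name()` lookup covers a few errno values and `format_err()` falls back to the number for anything it does not know.

```c
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical lookup: map a negative errno to its symbolic name. */
static const char *err_name(int err)
{
    switch (err) {
    case -ENOMEM: return "-ENOMEM";
    case -EINVAL: return "-EINVAL";
    case -EIO:    return "-EIO";
    default:      return NULL; /* unknown to this small table */
    }
}

/* Write the symbolic name if known, otherwise the numeric code. */
static int format_err(char *buf, size_t len, int err)
{
    const char *name = err_name(err);

    if (name != NULL)
        return snprintf(buf, len, "%s", name);
    return snprintf(buf, len, "%d", err);
}
```

A log line built this way reads "-ENOMEM" rather than "-12", which is the readability gain the commit is after.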
Ming Lei	5266caaf56	blk-mq: fix IO hang from sbitmap wakeup race In blk_mq_mark_tag_wait(), __add_wait_queue() may be re-ordered with the following blk_mq_get_driver_tag() in case of getting driver tag failure. Then in __sbitmap_queue_wake_up(), waitqueue_active() may not observe the added waiter in blk_mq_mark_tag_wait() and wake up nothing, meantime blk_mq_mark_tag_wait() can't get driver tag successfully. This issue can be reproduced by running the following test in loop, and fio hang can be observed in < 30min when running it on my test VM in laptop. modprobe -r scsi_debug modprobe scsi_debug delay=0 dev_size_mb=4096 max_queue=1 host_max_queue=1 submit_queues=4 dev=`ls -d /sys/bus/pseudo/drivers/scsi_debug/adapter/host/target//block/* | head -1 | xargs basename` fio --filename=/dev/"$dev" --direct=1 --rw=randrw --bs=4k --iodepth=1 \ --runtime=100 --numjobs=40 --time_based --name=test \ --ioengine=libaio Fix the issue by adding one explicit barrier in blk_mq_mark_tag_wait(), which is just fine in case of running out of tag. Cc: Jan Kara <jack@suse.cz> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Reported-by: Changhui Zhong <czhong@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20240112122626.4181044-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-01-12 08:48:46 -07:00
Linus Torvalds	01d550f0fc	for-6.8/block-2024-01-08 Merge tag 'for-6.8/block-2024-01-08' of git://git.kernel.dk/linux Pull block updates from Jens Axboe: "Pretty quiet round this time around. This contains: - NVMe updates via Keith: - nvme fabrics spec updates (Guixin, Max) - nvme target updates (Guixin, Evan) - nvme attribute refactoring (Daniel) - nvme-fc numa fix (Keith) - MD updates via Song: - Fix/Cleanup RCU usage from conf->disks[i].rdev (Yu Kuai) - Fix raid5 hang issue (Junxiao Bi) - Add Yu Kuai as Reviewer of the md subsystem - Remove deprecated flavors (Song Liu) - raid1 read error check support (Li Nan) - Better handle events off-by-1 case (Alex Lyakas) - Efficiency improvements for passthrough (Kundan) - Support for mapping integrity data directly (Keith) - Zoned write fix (Damien) - rnbd fixes (Kees, Santosh, Supriti) - Default to a sane discard size granularity (Christoph) - Make the default max transfer size naming less confusing (Christoph) - Remove support for deprecated host aware zoned model (Christoph) - Misc fixes (me, Li, Matthew, Min, Ming, Randy, liyouhong, Daniel, Bart, Christoph)" * tag 'for-6.8/block-2024-01-08' of git://git.kernel.dk/linux: (78 commits) block: Treat sequential write preferred zone type as invalid block: remove disk_clear_zoned sd: remove the !ZBC && blk_queue_is_zoned case in sd_read_block_characteristics drivers/block/xen-blkback/common.h: Fix spelling typo in comment blk-cgroup: fix rcu lockdep warning in blkg_lookup() blk-cgroup: don't use removal safe list iterators block: floor the discard granularity to the physical block size mtd_blkdevs: use the default discard granularity bcache: use the default discard granularity zram: use the default discard granularity null_blk: use the default discard granularity nbd: use the default discard granularity ubd: use the default discard granularity block: default the discard granularity to sector size bcache: discard_granularity should not be smaller than a sector block: remove two comments in bio_split_discard block: rename and document BLK_DEF_MAX_SECTORS loop: don't abuse BLK_DEF_MAX_SECTORS aoe: don't abuse BLK_DEF_MAX_SECTORS null_blk: don't cap max_hw_sectors to BLK_DEF_MAX_SECTORS ...	2024-01-11 13:58:04 -08:00
Damien Le Moal	748dc0b65e	block: fix partial zone append completion handling in req_bio_endio() Partial completion of a zone append request is not allowed, but if a zone append completion indicates a number of completed bytes different from the original BIO size, only the BIO status is set to error. This leads to bio_advance() not setting the BIO size to 0 and thus to bio_endio() not being called at the end of req_bio_endio(). Make sure a partially completed zone append is failed and completed immediately by forcing the completed number of bytes (nbytes) to be equal to the BIO size, thus ensuring that bio_endio() is called. Fixes: `297db73184` ("block: fix req_bio_endio append error handling") Cc: stable@kernel.vger.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240110092942.442334-1-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-01-10 09:01:16 -07:00
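The completion rule from the zone append entry above can be sketched as a small state update. This is an illustration under assumptions, not the kernel's req_bio_endio(): `struct bio_sketch`, the `STS_*` constants and `req_bio_complete()` are hypothetical names, and the return value stands in for "the caller should now end the bio".

```c
#include <stdbool.h>

#define STS_OK    0
#define STS_IOERR 1

/* Hypothetical, heavily reduced view of a bio. */
struct bio_sketch {
    unsigned int size;   /* bytes still outstanding */
    bool is_zone_append;
    int status;
};

/*
 * Account nbytes of completed I/O. A zone append that completes with a
 * byte count different from the bio size is failed AND fully consumed,
 * so the remaining size reaches 0 and the bio gets ended right away.
 * Returns true when the bio has no bytes left.
 */
static bool req_bio_complete(struct bio_sketch *bio, unsigned int nbytes,
                             int error)
{
    if (error != STS_OK)
        bio->status = error;

    if (bio->is_zone_append && bio->status == STS_OK &&
        nbytes != bio->size) {
        bio->status = STS_IOERR;
        nbytes = bio->size; /* force full completion */
    }

    if (nbytes > bio->size) /* defensive clamp for the sketch */
        nbytes = bio->size;
    bio->size -= nbytes;    /* stands in for advancing the bio */
    return bio->size == 0;
}
```

Without the forced `nbytes`, the zone append case would leave bytes outstanding with an error status set, and the bio would never be ended.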
Jens Axboe	742e324a06	block/iocost: silence warning on 'last_period' potentially being unused If CONFIG_TRACEPOINTS isn't enabled, we assign this variable but then never use it. This can cause the compiler to complain about that: block/blk-iocost.c:1264:6: warning: variable 'last_period' set but not used [-Wunused-but-set-variable] 1264 | u64 last_period, cur_period; | ^ Rather than add ifdefs to guard this, just mark it __maybe_unused. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202401102335.GiWdeIo9-lkp@intel.com/ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-01-10 08:43:06 -07:00
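The pattern in the iocost entry above, a variable that is only consumed when tracing is compiled in, can be reproduced in userspace with the GCC/Clang `unused` attribute. `MAYBE_UNUSED`, `ENABLE_TRACE_SKETCH`, `trace_period()` and `advance_period()` are hypothetical names for this sketch; the kernel's own macro is `__maybe_unused`.

```c
#include <stdint.h>

/* Userspace equivalent of the annotation, assuming GCC or Clang. */
#define MAYBE_UNUSED __attribute__((__unused__))

#ifdef ENABLE_TRACE_SKETCH
#include <stdio.h>
#define trace_period(last, cur) \
    printf("period %llu -> %llu\n", \
           (unsigned long long)(last), (unsigned long long)(cur))
#else
/* With tracing compiled out, the arguments are never evaluated. */
#define trace_period(last, cur) do { } while (0)
#endif

/* Advance *period to now and return the new period. */
static uint64_t advance_period(uint64_t *period, uint64_t now)
{
    uint64_t MAYBE_UNUSED last_period; /* only read when tracing is on */
    uint64_t cur_period;

    last_period = *period;
    cur_period = now;
    *period = cur_period;

    trace_period(last_period, cur_period);
    return cur_period;
}
```

Annotating the declaration keeps a single code path for both configurations, which is the reason the commit gives for preferring it over extra ifdefs.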
Linus Torvalds	fb46e22a9e	Merge tag 'mm-stable-2024-01-08-15-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "Many singleton patches against the MM code. The patch series which are included in this merge do the following: - Peng Zhang has done some mapletree maintenance work in the series 'maple_tree: add mt_free_one() and mt_attr() helpers' 'Some cleanups of maple tree' - In the series 'mm: use memmap_on_memory semantics for dax/kmem' Vishal Verma has altered the interworking between memory-hotplug and dax/kmem so that newly added 'device memory' can more easily have its memmap placed within that newly added memory. - Matthew Wilcox continues folio-related work (including a few fixes) in the patch series 'Add folio_zero_tail() and folio_fill_tail()' 'Make folio_start_writeback return void' 'Fix fault handler's handling of poisoned tail pages' 'Convert aops->error_remove_page to ->error_remove_folio' 'Finish two folio conversions' 'More swap folio conversions' - Kefeng Wang has also contributed folio-related work in the series 'mm: cleanup and use more folio in page fault' - Jim Cromie has improved the kmemleak reporting output in the series 'tweak kmemleak report format'. - In the series 'stackdepot: allow evicting stack traces' Andrey Konovalov permits clients (in this case KASAN) to cause eviction of no longer needed stack traces. - Charan Teja Kalla has fixed some accounting issues in the page allocator's atomic reserve calculations in the series 'mm: page_alloc: fixes for high atomic reserve caluculations'. - Dmitry Rokosov has added to the samples/ directory some sample code for a userspace memcg event listener application. See the series 'samples: introduce cgroup events listeners'. - Some mapletree maintenance work from Liam Howlett in the series 'maple_tree: iterator state changes'. - Nhat Pham has improved zswap's approach to writeback in the series 'workload-specific and memory pressure-driven zswap writeback'. - DAMON/DAMOS feature and maintenance work from SeongJae Park in the series 'mm/damon: let users feed and tame/auto-tune DAMOS' 'selftests/damon: add Python-written DAMON functionality tests' 'mm/damon: misc updates for 6.8' - Yosry Ahmed has improved memcg's stats flushing in the series 'mm: memcg: subtree stats flushing and thresholds'. - In the series 'Multi-size THP for anonymous memory' Ryan Roberts has added a runtime opt-in feature to transparent hugepages which improves performance by allocating larger chunks of memory during anonymous page faults. - Matthew Wilcox has also contributed some cleanup and maintenance work against the buffer_head code in the series 'More buffer_head cleanups'. - Suren Baghdasaryan has done work on Andrea Arcangeli's series 'userfaultfd move option'. UFFDIO_MOVE permits userspace heap compaction algorithms to move userspace's pages around rather than UFFDIO_COPY's alloc/copy/free. - Stefan Roesch has developed a 'KSM Advisor', in the series 'mm/ksm: Add ksm advisor'. This is a governor which tunes KSM's scanning aggressiveness in response to userspace's current needs. - Chengming Zhou has optimized zswap's temporary working memory use in the series 'mm/zswap: dstmem reuse optimizations and cleanups'. - Matthew Wilcox has performed some maintenance work on the writeback code, both code and within filesystems. The series is 'Clean up the writeback paths'. - Andrey Konovalov has optimized KASAN's handling of alloc and free stack traces for secondary-level allocators, in the series 'kasan: save mempool stack traces'. - Andrey also performed some KASAN maintenance work in the series 'kasan: assorted clean-ups'. - David Hildenbrand has gone to town on the rmap code. Cleanups, more pte batching, folio conversions and more. See the series 'mm/rmap: interface overhaul'. - Kinsey Ho has contributed some maintenance work on the MGLRU code in the series 'mm/mglru: Kconfig cleanup'. - Matthew Wilcox has contributed lruvec page accounting code cleanups in the series 'Remove some lruvec page accounting functions'" * tag 'mm-stable-2024-01-08-15-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (361 commits) mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER mm, treewide: introduce NR_PAGE_ORDERS selftests/mm: add separate UFFDIO_MOVE test for PMD splitting selftests/mm: skip test if application doesn't has root privileges selftests/mm: conform test to TAP format output selftests: mm: hugepage-mmap: conform to TAP format output selftests/mm: gup_test: conform test to TAP format output mm/selftests: hugepage-mremap: conform test to TAP format output mm/vmstat: move pgdemote_* out of CONFIG_NUMA_BALANCING mm: zsmalloc: return -ENOSPC rather than -EINVAL in zs_malloc while size is too large mm/memcontrol: remove __mod_lruvec_page_state() mm/khugepaged: use a folio more in collapse_file() slub: use a folio in __kmalloc_large_node slub: use folio APIs in free_large_kmalloc() slub: use alloc_pages_node() in alloc_slab_page() mm: remove inc/dec lruvec page state functions mm: ratelimit stat flush from workingset shrinker kasan: stop leaking stack trace handles mm/mglru: remove CONFIG_TRANSPARENT_HUGEPAGE mm/mglru: add dummy pmd_dirty() ...	2024-01-09 11:18:47 -08:00
Jens Axboe	3b7cb74547	block: move __get_task_ioprio() into header file We call this once per IO, which can be millions of times per second. Since nobody really uses io priorities, or at least it isn't very common, this is all wasted time and can amount to as much as 3% of the total kernel time. Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-01-08 12:27:39 -07:00
Linus Torvalds	3f6984e730	vfs-6.8.super -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZZUx4wAKCRCRxhvAZXjc osaNAQC/c+xXVfiq/pFbuK9MQLna4RGZaGcG9k312YniXbHq0AD9HAf4aPcZwPy1 /wkD4pauj3UZ3f0xBSyazGBvAXyN0Qc= =iFAQ -----END PGP SIGNATURE----- Merge tag 'vfs-6.8.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs super updates from Christian Brauner: "This contains the super work for this cycle including the long-awaited series by Jan to make it possible to prevent writing to mounted block devices: - Writing to mounted devices is dangerous and can lead to filesystem corruption as well as crashes. Furthermore syzbot comes with more and more involved examples how to corrupt block device under a mounted filesystem leading to kernel crashes and reports we can do nothing about. Add tracking of writers to each block device and a kernel cmdline argument which controls whether other writeable opens to block devices open with BLK_OPEN_RESTRICT_WRITES flag are allowed. Note that this effectively only prevents modification of the particular block device's page cache by other writers. The actual device content can still be modified by other means - e.g. by issuing direct scsi commands, by doing writes through devices lower in the storage stack (e.g. in case loop devices, DM, or MD are involved) etc. But blocking direct modifications of the block device page cache is enough to give filesystems a chance to perform data validation when loading data from the underlying storage and thus prevent kernel crashes. Syzbot can use this cmdline argument option to avoid uninteresting crashes. Also users whose userspace setup does not need writing to mounted block devices can set this option for hardening. We expect that this will be interesting to quite a few workloads. Btrfs is currently opted out of this because they still haven't merged patches we require for this to work from three kernel releases ago. 
- Reimplement block device freezing and thawing as holder operations on the block device. This allows us to extend block device freezing to all devices associated with a superblock and not just the main device. It also allows us to remove get_active_super() and thus another function that scans the global list of superblocks. Freezing via additional block devices only works if the filesystem chooses to use @fs_holder_ops for these additional devices as well. That currently only includes ext4 and xfs. Earlier releases switched get_tree_bdev() and mount_bdev() to use @fs_holder_ops. The remaining nilfs2 open-coded version of mount_bdev() has been converted to rely on @fs_holder_ops as well. So block device freezing for the main block device will continue to work as before. There should be no regressions in functionality. The only special case is btrfs where block device freezing for the main block device never worked because sb->s_bdev isn't set. Block device freezing for btrfs can be fixed once they can switch to @fs_holder_ops but that can happen whenever they're ready" * tag 'vfs-6.8.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (27 commits) block: Fix a memory leak in bdev_open_by_dev() super: don't bother with WARN_ON_ONCE() super: massage wait event mechanism ext4: Block writes to journal device xfs: Block writes to log device fs: Block writes to mounted block devices btrfs: Do not restrict writes to btrfs devices block: Add config option to not allow writing to mounted devices block: Remove blkdev_get_by_*() functions bcachefs: Convert to bdev_open_by_path() fs: handle freezing from multiple devices fs: remove dead check nilfs2: simplify device handling fs: streamline thaw_super_locked ext4: simplify device handling xfs: simplify device handling fs: simplify setup_bdev_super() calls blkdev: comment fs_holder_ops porting: document block device freeze and thaw changes fs: remove unused helper ...	2024-01-08 10:43:51 -08:00
Damien Le Moal	587371ed78	block: Treat sequential write preferred zone type as invalid With the removal of the support for host-aware zoned devices, blk_revalidate_zone_cb() should never see the zone type BLK_ZONE_TYPE_SEQWRITE_PREF (sequential write preferred zones). Treat this zone type as being invalid. Fixes: `7437bb73f0` ("block: remove support for the host aware zone model") Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240107072212.1071080-1-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-01-08 08:34:24 -07:00
Christoph Hellwig	4e33b071bb	block: remove disk_clear_zoned disk_clear_zoned is unused now that the last warts of the host-aware model support in sd are gone. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20231228075141.362560-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-01-08 08:27:22 -07:00
Ming Lei	393cd8ffd8	blk-cgroup: fix rcu lockdep warning in blkg_lookup() blkg_lookup() is called with either queue_lock or rcu read lock, so use rcu_dereference_check(lockdep_is_held(&q->queue_lock)) for retrieving 'blkg', which way models the check exactly for covering queue lock or rcu read lock. Fix lockdep warning of "block/blk-cgroup.h:254 suspicious rcu_dereference_check() usage!" from blkg_lookup(). Tested-by: Changhui Zhong <czhong@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Fixes: `83462a6c97` ("blkcg: Drop unnecessary RCU read [un]locks from blkg_conf_prep/finish()") Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20231219012833.2129540-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-01-04 16:08:54 -07:00
Daniel Vacek	fab4c16c52	blk-cgroup: don't use removal safe list iterators Commit `f1c006f1c6` moved deletion of the list blkg->q_node from blkg_destroy() to blkg_free_workfn(). Switch to using the list iterators, as we don't need removal protection anymore. Signed-off-by: Daniel Vacek <neelx@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20240104180031.148148-1-neelx@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-01-04 16:07:56 -07:00
Christoph Hellwig	458aa1a099	block: floor the discard granularity to the physical block size Discarding less than a physical block doesn't make sense. This fixes the existing behavior for zram before the recent changes to default the discard granularity to the logical block size, and is also a generally useful sanity check. Fixes: `3753039def` ("zram: use the default discard granularity") Reported-by: Sergey Senozhatsky <senozhatsky@chromium.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240103081622.508754-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-01-04 16:05:49 -07:00
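The flooring described in the commit above can be pictured with a small userspace sketch. This is not the kernel's code: `floor_discard_granularity()` and its parameters are hypothetical names chosen for illustration, and the only behavior taken from the commit message is that a discard granularity smaller than the physical block size is raised to it.

```c
#include <assert.h>

/*
 * Hypothetical helper (not a kernel function): return the discard
 * granularity to use, never smaller than the physical block size.
 * Both values are in bytes.
 */
unsigned int floor_discard_granularity(unsigned int discard_granularity,
                                       unsigned int physical_block_size)
{
    /* A physical block size of zero would make the floor meaningless. */
    assert(physical_block_size != 0);

    if (discard_granularity < physical_block_size)
        return physical_block_size;
    return discard_granularity;
}
```

Under these assumptions, a device that defaults to a 512-byte granularity but has 4096-byte physical blocks would end up with a 4096-byte granularity, while a larger driver-provided value is left alone.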
Matthew Wilcox (Oracle)	17bf23a981	fs: convert block_write_full_page to block_write_full_folio Convert the function to be compatible with writepage_t so that it can be passed to write_cache_pages() by blkdev. This removes a call to compound_head(). We can also remove the function export as both callers are built-in. Link: https://lkml.kernel.org/r/20231215200245.748418-14-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-12-29 11:58:35 -08:00
Linus Torvalds	09c57a762e	block-6.7-2023-12-29 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmWO7A0QHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpjqsEAC4IpYMGeEiY2kEBPLKm/vUEFsQtT2YkdN0 isc6h6xepmqeZO9a+6jLZ5zadCT6IbCdORN0zUrO91lRdfVolIGNDWnkArAye7Yj 62FbYf7+G7WebW8XAMkbx43dFik7qneZpN852b/kaGSHF+SZPsj2hTG5yRuuahcA nVKRk+7U8Y0ZvjCpd6E2Ir2R86KU/JspCEWCYvQ/kz/AYyxQzZUElRO4ptPB/NTb EcWexEFaLi770UkNqrKf87YlcnWiWbeNyeTjRrzDsp4nf8MwxB+tgg2ApGf4lMLv B7lTEIjww7JbsVouLf/EWylp+PI+Ebl4tA3oiya/skby/eiO5fhXMM6NsnIxKx/a Lg3uOg4Xs2Se0GYiw63mDZCpf/+joKTike6smG76GQ5cdsDQoHczvGiAPul5Bp5w 6s5whj4ODdxbLsUqZlfToeZnLL91TI8+UbVJt/fIp5lF0+tiICvdpa2b3hBn6rYP WmenW21OX1Pr4S7X1+lAc8MD/hYpEpbY4U5jWKOejxA9KeQNAK9TJOWQQF3Lu6/s EGzJpFJmZd+G6qSMuJHSoHlPHk9WfHXGatQ/T9ecizVfj7FlUPtbFvbyoKzkY4xK 8Z1pRd2JhDS7GB2bgu/+Eagz0LksHnAe4eK52wgy/uvNdcbTsLGUgsQyAWHHaCKY 7OUzMi/ekg== =wPUa -----END PGP SIGNATURE----- Merge tag 'block-6.7-2023-12-29' of git://git.kernel.dk/linux Pull block fixes from Jens Axboe: "Fix for a badly numbered flag, and a regression fix for the badblocks updates from this merge window" * tag 'block-6.7-2023-12-29' of git://git.kernel.dk/linux: block: renumber QUEUE_FLAG_HW_WC badblocks: avoid checking invalid range in badblocks_check()	2023-12-29 11:41:40 -08:00
Christoph Hellwig	3c407dc723	block: default the discard granularity to sector size Currently the discard granularity defaults to 0 and must be initialized by any driver that wants to support discard. Default to the sector size instead, which is the smallest possible value, and a very useful default. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20231228075545.362768-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-12-29 08:44:12 -07:00
Christoph Hellwig	928a5dd3a8	block: remove two comments in bio_split_discard A zero discard_granularity is not treated the same as a single-block one, and not having any segments after taking alignment is perfectly fine and does not need a warning. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20231228075545.362768-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-12-29 08:44:12 -07:00
Christophe JAILLET	8ff363ade3	block: Fix a memory leak in bdev_open_by_dev() If we early exit here, 'handle' needs to be freed, or some memory leaks. Fixes: `ed5cc702d3` ("block: Add config option to not allow writing to mounted devices") Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Link: https://lore.kernel.org/r/8eaec334781e695810aaa383b55de00ca4ab1352.1703439383.git.christophe.jaillet@wanadoo.fr Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org>	2023-12-28 11:48:17 +01:00
Christoph Hellwig	d6b9f4e6f7	block: rename and document BLK_DEF_MAX_SECTORS Give BLK_DEF_MAX_SECTORS a _CAP postfix and document what it is used for. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20231227092305.279567-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-12-27 10:46:01 -07:00
Christoph Hellwig	5d13243820	blk-wbt: remove the separate write cache tracking Use the queue wide write back cache tracking instead of duplicating the value in struct rq_wb. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20231226090747.204969-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-12-26 09:28:10 -07:00
Christoph Hellwig	1c042f8d4b	block: reject invalid operation in submit_bio_noacct submit_bio_noacct allows completely invalid operations, or operations that are not supported in the bio path. Extend the existing switch statement to reject all invalid types. Move the code point for REQ_OP_ZONE_APPEND so that it's not right in the middle of the zone management operations and the switch statement can follow the numerical order of the operations. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20231221070538.1112446-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-12-26 09:27:14 -07:00
Coly Li	146e843f6b	badblocks: avoid checking invalid range in badblocks_check() If prev_badblocks() returns '-1', it means there is no valid badblocks record before the checking range. It doesn't make sense to check whether the input checking range overlaps with the nonexistent, invalid front range. This patch checks whether 'prev >= 0' is true before calling overlap_front(), to avoid such invalid operations. Fixes: `3ea3354cb9` ("badblocks: improve badblocks_check() for multiple ranges handling") Reported-and-tested-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/nvdimm/3035e75a-9be0-4bc3-8d4a-6e52c207f277@leemhuis.info/ Cc: Dan Williams <dan.j.williams@intel.com> Cc: Geliang Tang <geliang.tang@suse.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Jens Axboe <axboe@kernel.dk> Cc: NeilBrown <neilb@suse.de> Cc: Vishal L Verma <vishal.l.verma@intel.com> Cc: Xiao Ni <xni@redhat.com> Link: https://lore.kernel.org/r/20231224002820.20234-1-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-12-23 18:38:08 -07:00
Jens Axboe	5165799f0d	block: export disk_clear_zoned() A previous commit split disk_set_zoned(..., bool) into not taking an argument for whether to set or clear, and instead added disk_clear_zoned() as the counterpart. However, that commit neglected to export the new symbol, causing failures for modular drivers that used it. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Fixes: `d73e93b4df` ("block: simplify disk_set_zoned") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-12-20 20:32:12 -07:00
Christoph Hellwig	d73e93b4df	block: simplify disk_set_zoned Only use disk_set_zoned to actually enable zoned device support. For clearing it, call disk_clear_zoned, which is renamed from disk_clear_zone_settings and now directly clears the zoned flag as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20231217165359.604246-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-12-19 20:17:43 -07:00
Christoph Hellwig	7437bb73f0	block: remove support for the host aware zone model When zones were first added to the SCSI and ATA specs, two different models were supported (in addition to the drive managed one that is invisible to the host): - host managed, where for non-conventional zones there is a strict requirement to write at the write pointer, or else an error is returned - host aware, where a write pointer is maintained if writes always happen at it, otherwise it is left in an under-defined state and the sequential write preferred zones behave like conventional zones (probably very badly performing ones, though) Not surprisingly this lukewarm model didn't prove to be very useful and was finally removed from the ZBC and SBC specs (NVMe never implemented it). Due to the easily disappearing write pointer, host software could never rely on the write pointer to actually be useful for, say, recovery. Fortunately only a few HDD prototypes shipped using this model which never made it to mass production. Drop the support before it is too late. Note that any such host aware prototype HDD can still be used with Linux as we'll now treat it as a conventional HDD. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20231217165359.604246-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-12-19 20:17:43 -07:00
Li Nan	4c434392c4	block: add check of 'minors' and 'first_minor' in device_add_disk() 'first_minor' represents the starting minor number of disks, and 'minors' represents the number of partitions in the device. Neither of them can be greater than MINORMASK + 1. Commit `e338924bd0` ("block: check minor range in device_add_disk()") only added the check of 'first_minor + minors'. However, their sum might be less than MINORMASK but their values are wrong. Complete the checks now. Fixes: `e338924bd0` ("block: check minor range in device_add_disk()") Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20231219075942.840255-1-linan666@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-12-19 08:23:12 -07:00
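The reason a sum-only check is insufficient can be shown with a short userspace sketch. It follows the rule stated in the commit message (neither value may exceed MINORMASK + 1, and neither may their sum); it is not the kernel's implementation. The names `minor_range_valid`, `SKETCH_MINORBITS` and `SKETCH_MINORMASK` are hypothetical, the 20-bit minor space is an assumption, and unsigned parameters are used here only to keep the wraparound well defined.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: a 20-bit minor number space. */
#define SKETCH_MINORBITS 20
#define SKETCH_MINORMASK ((1U << SKETCH_MINORBITS) - 1)

/*
 * Hypothetical helper (not a kernel function): validate a disk's
 * starting minor number and its number of minors.
 */
bool minor_range_valid(unsigned int first_minor, unsigned int minors)
{
    const unsigned int limit = SKETCH_MINORMASK + 1;

    /*
     * Check each value on its own first. Without this, a huge
     * first_minor plus a small minors can wrap around and produce
     * a sum that looks valid.
     */
    if (first_minor > limit || minors > limit)
        return false;

    /* Both values are now at most 2^20, so the sum cannot overflow. */
    return first_minor + minors <= limit;
}
```

With only the sum check, `first_minor = 0xFFFFFFFF` and `minors = 2` would wrap to 1 and pass; the individual checks reject it.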
Kundan Kumar	6c9b97085c	block: skip cgroups for passthrough io Even if BLK_CGROUP is enabled, it does not work for passthrough io. So skip setting up blkg for passthrough bio. Reduced processing gives ~5% hike in peak-performance workload. Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20231218152722.1768-1-joshi.k@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-12-18 09:46:53 -07:00
Christoph Hellwig	6ef02df154	block: support adding less than len in bio_add_hw_page bio_add_hw_page currently always fails or succeeds. This is fine for the existing callers that always add PAGE_SIZE worth given that the max_segment_size and max_sectors must always allow at least a page worth of data. But when we want to add it for bigger amounts of data this means it can also fail when adding the data to a bio, and creating a fallback for that becomes really annoying in the callers. Make use of the existing API design that allows to return a smaller length than the one passed in and add up to max_segment_size worth of data from a larger input. All the existing callers are fine with this - not because they handle this return correctly, but because they never pass more than a page in. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20231204173419.782378-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-12-15 07:34:27 -07:00
Christoph Hellwig	3f034c374a	block: prevent an integer overflow in bvec_try_merge_hw_page Reordered a check to avoid a possible overflow when adding len to bv_len. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20231204173419.782378-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-12-15 07:34:27 -07:00
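The kind of reordering this commit describes can be sketched in userspace as follows. This is an illustration of the general technique (compare against the remaining room instead of computing the sum), not the code from the patch; `can_extend_segment()` and its parameter names are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/*
 * Hypothetical helper (not a kernel function): decide whether `len`
 * more bytes fit into a segment that already holds `bv_len` bytes,
 * given a maximum segment size of `max_seg_size` bytes.
 *
 * A naive `bv_len + len > max_seg_size` test can wrap around for
 * large `len`. Subtracting from the limit instead cannot overflow
 * once bv_len is known to be within the limit.
 */
bool can_extend_segment(unsigned int bv_len, unsigned int len,
                        unsigned int max_seg_size)
{
    if (bv_len > max_seg_size)
        return false;
    return len <= max_seg_size - bv_len;
}
```

For example, with `bv_len = 4096`, `len = UINT_MAX` and `max_seg_size = 65536`, the naive sum wraps to 4095 and would be accepted, whereas the subtraction form rejects it.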
Bart Van Assche	f19d1e3b17	block: Use pr_info() instead of printk(KERN_INFO ...) Switch to the modern style of printing kernel messages. Use %u instead of %d to print unsigned integers. Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: Keith Busch <kbusch@kernel.org> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20231213194702.90381-1-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-12-14 10:28:56 -07:00
Min Li	6f64f866aa	block: add check that partition length needs to be aligned with block size Before calling add partition or resize partition, there is no check on whether the length is aligned with the logical block size. If the logical block size of the disk is larger than 512 bytes, then the partition size may not be a multiple of the logical block size, and when the last sector is read, bio_truncate() will adjust the bio size, resulting in an IO error if the size of the read command is smaller than the logical block size. If integrity data is supported, this will also result in a null pointer dereference when calling bio_integrity_free. Cc: <stable@vger.kernel.org> Signed-off-by: Min Li <min15.li@samsung.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230629142517.121241-1-min15.li@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-12-13 08:19:14 -07:00
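A minimal userspace sketch of the alignment test described above, assuming the length is given in bytes and the logical block size is a power of two. `length_is_block_aligned()` is a hypothetical name, not the function the patch adds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel function): report whether a
 * partition length in bytes is a multiple of the logical block size.
 */
bool length_is_block_aligned(uint64_t length_bytes,
                             unsigned int logical_block_size)
{
    /* Assumption: block sizes are nonzero powers of two. */
    assert(logical_block_size != 0 &&
           (logical_block_size & (logical_block_size - 1)) == 0);

    /* For a power-of-two size, the low bits must all be zero. */
    return (length_bytes & ((uint64_t)logical_block_size - 1)) == 0;
}
```

On a disk with 4096-byte logical blocks, a length of 4608 bytes (4096 + 512) would be rejected under this check, which is the situation the commit message says leads to a truncated final read.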
Li Nan	5fa3d1a00c	block: Set memalloc_noio to false on device_add_disk() error path On the error path of device_add_disk(), device's memalloc_noio flag was set but not cleared. As the comment of pm_runtime_set_memalloc_noio(), "The function should be called between device_add() and device_del()". Clear this flag before device_del() now. Fixes: `25e823c8c3` ("block/genhd.c: apply pm_runtime_set_memalloc_noio on block devices") Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20231211075356.1839282-1-linan666@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-12-13 08:17:02 -07:00
Matthew Wilcox (Oracle)	af7628d6ec	fs: convert error_remove_page to error_remove_folio There were already assertions that we were not passing a tail page to error_remove_page(), so make the compiler enforce that by converting everything to pass and use a folio. Link: https://lkml.kernel.org/r/20231117161447.2461643-7-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-12-10 16:51:42 -08:00
Matthew Wilcox (Oracle)	1b151e2435	block: Remove special-casing of compound pages The special casing was originally added in pre-git history; reproducing the commit log here: > commit a318a92567d77 > Author: Andrew Morton <akpm@osdl.org> > Date: Sun Sep 21 01:42:22 2003 -0700 > > [PATCH] Speed up direct-io hugetlbpage handling > > This patch short-circuits all the direct-io page dirtying logic for > higher-order pages. Without this, we pointlessly bounce BIOs up to > keventd all the time. In the last twenty years, compound pages have become used for more than just hugetlb. Rewrite these functions to operate on folios instead of pages and remove the special case for hugetlbfs; I don't think it's needed any more (and if it is, we can put it back in as a call to folio_test_hugetlb()). This was found by inspection; as far as I can tell, this bug can lead to pages used as the destination of a direct I/O read not being marked as dirty. If those pages are then reclaimed by the MM without being dirtied for some other reason, they won't be written out. Then when they're faulted back in, they will not contain the data they should. It'll take a pretty unusual setup to produce this problem with several races all going the wrong way. This problem predates the folio work; it could for example have been triggered by mmaping a THP in tmpfs and using that as the target of an O_DIRECT read. Fixes: `800d8c63b2` ("shmem: add huge pages support") Cc: <stable@vger.kernel.org> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-12-07 14:02:20 -07:00
Kundan Kumar	847c5bcdfb	block: skip QUEUE_FLAG_STATS and rq-qos for passthrough io Write-back throttling (WBT) enables QUEUE_FLAG_STATS on the request queue. But WBT does not make sense for passthrough io, so skip QUEUE_FLAG_STATS processing. Also skip rq_qos_issue/done for passthrough io. Overall, the change gives ~11% hike in peak performance. Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20231123190331.7934-1-kundan.kumar@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-12-01 18:29:18 -07:00
Keith Busch	492c5d4559	block: bio-integrity: directly map user buffers Passthrough commands that utilize metadata currently need to bounce the user space buffer through the kernel. Add support for mapping user space directly so that we can avoid this costly overhead. This is similar to how the normal bio data payload utilizes user addresses with bio_map_user_iov(). If the user address can't directly be used for some reason, like too many segments or address misalignment, fall back to a copy of the user vec while keeping the user address pinned for the IO duration so that it can safely be copied on completion in any process context. Signed-off-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20231130215309.2923568-2-kbusch@meta.com [axboe: fold in fix from Kanchan Joshi] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-12-01 18:29:00 -07:00
Linus Torvalds	ee0c8a9b34	block-6.7-2023-12-01 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmVqKo4QHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpqL9D/9bPvuA+Oogx+C/kNConjxnuyPBiXcZjb/4 5gO/6N0FC8yu+HQqgscGTyEjJO2FKfLx+YxxBs1UVIt4Tm+jZwC3nPqw9X4W3RCz pK9fxCNlzxey0SZU3ZJQIOtqP3df5Yuas9V/h35GS4m1XaoDE6cPpsIVUrAnoNwg W990L8sOy6y4XzMPzyHJCyoDCay1Qp2ly0Vdlz4/ESRmEp564i42nFN+8zpZ/w7h V+Ekn6JwP1ssqUeY/k43QcfRzYwSvvnTQJ1y9t3erf6HcHtpbCgnL1jTaGEmr4IS 1sw3ffqo23xBSsGP+D2OF4+9pwGI9+xwNpYnRdrpDPxKhCn5EEh+g6+f+m7YEnFV q1swlMTqHtRLFdYbKe8Tl8hPRwEeSpKy8sXph56hwGZY0T/IyB+Pe3aXrh1DYPA5 4+GASZHFQPH82P1ibVNdpMRZe4rPPblw38GZauZ1JbI0m0zXqEveB2AgZeCcw1ky l7KBdMdGBqSWYVmfKcJd3f30vKPyhMSp4eE9/LFp24vmyIIw+dSp6vup0yrM6jk9 taUU6PCHzaxmI1YGz1BzNVa8cfYKB6aiWeQ2OGa4Z7ba4TuksMLkbfVvu21jdi+z PsL/KlqPSPwFL/3XAZagIb3BXUhoQyfwIU8GnAuw2wTU5RJzWnbwF3wXpNaBIJxI 8y5OWsFqIg== =5kb6 -----END PGP SIGNATURE----- Merge tag 'block-6.7-2023-12-01' of git://git.kernel.dk/linux Pull block fixes from Jens Axboe: - NVMe pull request via Keith: - Invalid namespace identification error handling (Marizio Ewan, Keith) - Fabrics keep-alive tuning (Mark) - Fix for a bad error check regression in bcache (Markus) - Fix for a performance regression with O_DIRECT (Ming) - Fix for a flush related deadlock (Ming) - Make the read-only warn on per-partition (Yu) * tag 'block-6.7-2023-12-01' of git://git.kernel.dk/linux: nvme-core: check for too small lba shift blk-mq: don't count completed flush data request as inflight in case of quiesce block: Document the role of the two attribute groups block: warn once for each partition in bio_check_ro() block: move .bd_inode into 1st cacheline of block_device nvme: check for valid nvme_identify_ns() before using it nvme-core: fix a memory leak in nvme_ns_info_from_identify() nvme: fine-tune sending of first keep-alive bcache: revert replacing IS_ERR_OR_NULL with IS_ERR	2023-12-02 06:39:30 +09:00
Ming Lei	0e4237ae8d	blk-mq: don't count completed flush data request as inflight in case of quiesce Request queue quiesce may interrupt a flush sequence: the original request may have been marked as COMPLETE but cannot be finished because of the queue quiesce. This is fine from the driver's viewpoint, because the flush sequence is a block layer concept and is unrelated to the driver. However, a driver (such as dm-rq) can call blk_mq_queue_inflight() to count and drain inflight requests, and then the wait and drain never completes because the completed but unfinished flush request is counted as inflight. Fix this issue by not counting a completed flush data request as inflight in case of quiesce. Cc: Mike Snitzer <snitzer@kernel.org> Cc: David Jeffery <djeffery@redhat.com> Cc: John Pittman <jpittman@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20231201085605.577730-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-12-01 07:34:47 -07:00
Bart Van Assche	3649ff0a0b	block: Document the role of the two attribute groups It is nontrivial to derive the role of the two attribute groups in source file block/blk-sysfs.c. Hence add a comment that explains their roles. See also commit `6d85ebf95c` ("blk-sysfs: add a new attr_group for blk_mq"). Cc: Christoph Hellwig <hch@lst.de> Cc: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20231128194019.72762-1-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-11-29 10:18:38 -07:00
Yu Kuai	67d995e069	block: warn once for each partition in bio_check_ro() Commit `1b0a151c10` ("blk-core: use pr_warn_ratelimited() in bio_check_ro()") fixed the message storm by limiting the rate; however, there will still be lots of messages in the long term. Fix it better by warning once for each partition. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20231128123027.971610-3-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-11-28 12:11:08 -07:00
Linus Torvalds	fa2b906f51	vfs-6.7-rc3.fixes -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZWBq0gAKCRCRxhvAZXjc ot4EAP48O5ExMtQ3/AIkNDo+/9/Iz4g7bE1HYmdyiMPO3Ou/uwEAySwBXRJrFAsS 9omvkEdqrfyguW0xgoYwcxBdATVHnAE= =ScR3 -----END PGP SIGNATURE----- Merge tag 'vfs-6.7-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: - Avoid calling back into LSMs from vfs_getattr_nosec() calls. IMA used to query inode properties accessing raw inode fields without dedicated helpers. That was finally fixed a few releases ago by forcing IMA to use vfs_getattr_nosec() helpers. The goal of the vfs_getattr_nosec() helper is to query for attributes without calling into the LSM layer which would be quite problematic because incredibly IMA is called from __fput()... __fput() -> ima_file_free() What it does is to call back into the filesystem to update the file's IMA xattr. Querying the inode without using vfs_getattr_nosec() meant that IMA didn't handle stacking filesystems such as overlayfs correctly. So the switch to vfs_getattr_nosec() is quite correct. But the switch to vfs_getattr_nosec() revealed another bug when used on stacking filesystems: __fput() -> ima_file_free() -> vfs_getattr_nosec() -> i_op->getattr::ovl_getattr() -> vfs_getattr() -> i_op->getattr::$WHATEVER_UNDERLYING_FS_getattr() -> security_inode_getattr() # calls back into LSMs Now, if that __fput() happens from task_work_run() of an exiting task current->fs and various other pointers could already be NULL. So anything in the LSM layer relying on that not being NULL would be quite surprised. Fix that by passing the information that this is a security request through to the stacking filesystem by adding a new internal AT_GETATTR_NOSEC flag.
Now the callchain becomes: __fput() -> ima_file_free() -> vfs_getattr_nosec() -> i_op->getattr::ovl_getattr() -> if (AT_GETATTR_NOSEC) vfs_getattr_nosec() else vfs_getattr() -> i_op->getattr::$WHATEVER_UNDERLYING_FS_getattr() - Fix a bug introduced with the iov_iter rework from last cycle. This broke /proc/kcore by copying too much and without the correct offset. - Add a missing NULL check when allocating the root inode in autofs_fill_super(). - Fix stable writes for multi-device filesystems (xfs, btrfs etc) and the block device pseudo filesystem. Stable writes used to be a superblock flag only, making it a per filesystem property. Add an additional AS_STABLE_WRITES mapping flag to allow for fine-grained control. - Ensure that offset_iterate_dir() returns 0 after reaching the end of a directory so it adheres to getdents() convention. * tag 'vfs-6.7-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: libfs: getdents() should return 0 after reaching EOD xfs: respect the stable writes flag on the RT device xfs: clean up FS_XFLAG_REALTIME handling in xfs_ioctl_setattr_xflags block: update the stable_writes flag in bdev_add filemap: add a per-mapping stable writes flag autofs: add: new_inode check in autofs_fill_super() iov_iter: fix copy_page_to_iter_nofault() fs: Pass AT_GETATTR_NOSEC flag to getattr interface function	2023-11-24 09:45:40 -08:00
Damien Le Moal	c96b817552	block: Remove blk_set_runtime_active() The function blk_set_runtime_active() is called only from blk_post_runtime_resume(), so there is no need for that function to be exported. Open-code this function directly in blk_post_runtime_resume() and remove it. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20231120070611.33951-1-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-11-20 10:22:40 -07:00
Christoph Hellwig	1898efcdbe	block: update the stable_writes flag in bdev_add Propagate the per-queue stable_write flags into each bdev inode in bdev_add. This makes sure devices that require stable writes have it set for I/O on the block device node as well. Note that this doesn't cover the case of a flag changing on a live device yet. We should handle that as well, but I plan to cover it as part of a more general rework of how changing runtime parameters on block devices works. Fixes: `1cb039f3dc` ("bdi: replace BDI_CAP_STABLE_WRITES with a queue and a sb flag") Reported-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20231025141020.192413-3-hch@lst.de Tested-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2023-11-20 15:05:18 +01:00
Jan Kara	ed5cc702d3	block: Add config option to not allow writing to mounted devices Writing to mounted devices is dangerous and can lead to filesystem corruption as well as crashes. Furthermore syzbot comes with more and more involved examples how to corrupt block device under a mounted filesystem leading to kernel crashes and reports we can do nothing about. Add tracking of writers to each block device and a kernel cmdline argument which controls whether other writeable opens to block devices open with BLK_OPEN_RESTRICT_WRITES flag are allowed. We will make filesystems use this flag for used devices. Note that this effectively only prevents modification of the particular block device's page cache by other writers. The actual device content can still be modified by other means - e.g. by issuing direct scsi commands, by doing writes through devices lower in the storage stack (e.g. in case loop devices, DM, or MD are involved) etc. But blocking direct modifications of the block device page cache is enough to give filesystems a chance to perform data validation when loading data from the underlying storage and thus prevent kernel crashes. Syzbot can use this cmdline argument option to avoid uninteresting crashes. Also users whose userspace setup does not need writing to mounted block devices can set this option for hardening. Link: https://lore.kernel.org/all/60788e5d-5c7c-1142-e554-c21d709acfd9@linaro.org Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20231101174325.10596-3-jack@suse.cz Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org>	2023-11-18 14:59:25 +01:00
Jan Kara	cd34758c52	block: Remove blkdev_get_by_() functions blkdev_get_by_() and blkdev_put() functions are now unused. Remove them. Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20231101174325.10596-2-jack@suse.cz Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org>	2023-11-18 14:59:25 +01:00
Christian Brauner	49ef8832fb	bdev: implement freeze and thaw holder operations The old method of implementing block device freeze and thaw operations required us to rely on get_active_super() to walk the list of all superblocks on the system to find any superblock that might use the block device. This is wasteful and not very pleasant overall. Now that we can finally go straight from block device to owning superblock things become way simpler. Link: https://lore.kernel.org/r/20231024-vfs-super-freeze-v2-5-599c19f4faac@kernel.org Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2023-11-18 14:59:23 +01:00
Christian Brauner	fbcb8f39e9	bdev: surface the error from sync_blockdev() When freeze_super() is called, sync_filesystem() will be called which calls sync_blockdev() and already surfaces any errors. Do the same for block devices that aren't owned by a superblock and also for filesystems that don't call sync_blockdev() internally but implicitly rely on bdev_freeze() to do it. Link: https://lore.kernel.org/r/20231024-vfs-super-freeze-v2-3-599c19f4faac@kernel.org Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2023-11-18 14:59:23 +01:00
Christian Brauner	982c3b3058	bdev: rename freeze and thaw helpers We have bdev_mark_dead() etc and we're going to move block device freezing to holder ops in the next patch. Make the naming consistent: * freeze_bdev() -> bdev_freeze() * thaw_bdev() -> bdev_thaw() Also document the return code. Link: https://lore.kernel.org/r/20231024-vfs-super-freeze-v2-2-599c19f4faac@kernel.org Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2023-11-18 14:59:23 +01:00
Ming Lei	e63a573035	blk-cgroup: bypass blkcg_deactivate_policy after destroying blkcg_deactivate_policy() can be called after blkg_destroy_all() returns, and it isn't necessary since blkg_destroy_all has covered policy deactivation. Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20231117023527.3188627-4-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-11-17 10:48:58 -07:00
Ming Lei	35a99d6557	blk-cgroup: avoid to warn !rcu_read_lock_held() in blkg_lookup() So far, all callers either hold a spin lock or the RCU read lock explicitly, and most of the callers have added WARN_ON_ONCE(!rcu_read_lock_held()) or lockdep_assert_held(&disk->queue->queue_lock). Remove WARN_ON_ONCE(!rcu_read_lock_held()) from blkg_lookup() to kill the false positive warning from blkg_conf_prep(). Reported-by: Changhui Zhong <czhong@redhat.com> Fixes: `83462a6c97` ("blkcg: Drop unnecessary RCU read [un]locks from blkg_conf_prep/finish()") Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20231117023527.3188627-3-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-11-17 10:48:58 -07:00
Ming Lei	27b13e209d	blk-throttle: fix lockdep warning of "cgroup_mutex or RCU read lock required!" Inside blkg_for_each_descendant_pre(), both css_for_each_descendant_pre() and blkg_lookup() require the RCU read lock, and either cgroup_assert_mutex_or_rcu_locked() or rcu_read_lock_held() is called. Fix the warning by taking the RCU read lock. Reported-by: Changhui Zhong <czhong@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20231117023527.3188627-2-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-11-17 10:48:58 -07:00
Christoph Hellwig	b0077e269f	blk-mq: make sure active queue usage is held for bio_integrity_prep() blk_integrity_unregister() can come if queue usage counter isn't held for one bio with integrity prepared, so this request may be completed with calling profile->complete_fn, then kernel panic. Another constraint is that bio_integrity_prep() needs to be called before bio merge. Fix the issue by: - call bio_integrity_prep() with one queue usage counter grabbed reliably - call bio_integrity_prep() before bio merge Fixes: `900e080752` ("block: move queue enter logic into blk_mq_submit_bio()") Reported-by: Yi Zhang <yi.zhang@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Tested-by: Yi Zhang <yi.zhang@redhat.com> Link: https://lore.kernel.org/r/20231113035231.2708053-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-11-13 08:52:52 -07:00
Yu Kuai	1b0a151c10	blk-core: use pr_warn_ratelimited() in bio_check_ro() If one of the underlying disks of a raid or dm device is set to read-only, then each I/O generates a new log message, which causes a message storm. Such an environment is indeed problematic, but we can't make sure a naive customer won't do this, hence use pr_warn_ratelimited() to prevent the message storm in this case. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Fixes: `57e95e4670` ("block: fix and cleanup bio_check_ro") Signed-off-by: Ye Bin <yebin10@huawei.com> Link: https://lore.kernel.org/r/20231107111247.2157820-1-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-11-07 08:15:23 -07:00
Linus Torvalds	8f6f76a6a2	As usual, lots of singleton and doubleton patches all over the tree and there's little I can say which isn't in the individual changelogs. The lengthier patch series are - "kdump: use generic functions to simplify crashkernel reservation in arch", from Baoquan He. This is mainly cleanups and consolidation of the "crashkernel=" kernel parameter handling. - After much discussion, David Laight's "minmax: Relax type checks in min() and max()" is here. Hopefully reduces some typecasting and the use of min_t() and max_t(). - A group of patches from Oleg Nesterov which clean up and slightly fix our handling of reads from /proc/PID/task/... and which remove task_struct.thread_group. -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZUQP9wAKCRDdBJ7gKXxA jmOAAQDh8sxagQYocoVsSm28ICqXFeaY9Co1jzBIDdNesAvYVwD/c2DHRqJHEiS4 63BNcG3+hM9nwGJHb5lyh5m79nBMRg0= =On4u -----END PGP SIGNATURE----- Merge tag 'mm-nonmm-stable-2023-11-02-14-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: "As usual, lots of singleton and doubleton patches all over the tree and there's little I can say which isn't in the individual changelogs. The lengthier patch series are - 'kdump: use generic functions to simplify crashkernel reservation in arch', from Baoquan He. This is mainly cleanups and consolidation of the 'crashkernel=' kernel parameter handling - After much discussion, David Laight's 'minmax: Relax type checks in min() and max()' is here. Hopefully reduces some typecasting and the use of min_t() and max_t() - A group of patches from Oleg Nesterov which clean up and slightly fix our handling of reads from /proc/PID/task/... 
and which remove task_struct.thread_group" * tag 'mm-nonmm-stable-2023-11-02-14-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (64 commits) scripts/gdb/vmalloc: disable on no-MMU scripts/gdb: fix usage of MOD_TEXT not defined when CONFIG_MODULES=n .mailmap: add address mapping for Tomeu Vizoso mailmap: update email address for Claudiu Beznea tools/testing/selftests/mm/run_vmtests.sh: lower the ptrace permissions .mailmap: map Benjamin Poirier's address scripts/gdb: add lx_current support for riscv ocfs2: fix a spelling typo in comment proc: test ProtectionKey in proc-empty-vm test proc: fix proc-empty-vm test with vsyscall fs/proc/base.c: remove unneeded semicolon do_io_accounting: use sig->stats_lock do_io_accounting: use __for_each_thread() ocfs2: replace BUG_ON() at ocfs2_num_free_extents() with ocfs2_error() ocfs2: fix a typo in a comment scripts/show_delta: add __main__ judgement before main code treewide: mark stuff as __ro_after_init fs: ocfs2: check status values proc: test /proc/${pid}/statm compiler.h: move __is_constexpr() to compiler.h ...	2023-11-02 20:53:31 -10:00
Linus Torvalds	90d624af2e	for-6.7/block-2023-10-30 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmU/vjMQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpqVcEADaNf6X7LVKKrdQ4sA38dBZYGM3kNz0SCYV vkjQAs0Fyylbu6EhYOLO/R+UCtpytLlnbr4NmFDbhaEG4OJcwoDLDxpMQ7Gda58v 4RBXAiIlhZX3g99/ebvtNtVEvQa9gF4h8k2n/gKsG+PoS+cbkKAI0Na2duI1d/pL B5nQ31VAHhsyjUv1nIPLrQS6lsL7ZTFvH8L6FLcEVM03poy8PE2H6kN7WoyXwtfo LN3KK0Nu7B0Wx2nDx0ffisxcDhbChGs7G2c9ndPTvxg6/4HW+2XSeNUwTxXYpyi2 ZCD+AHCzMB/w6GNNWFw4xfau5RrZ4c4HdBnmyR6+fPb1u6nGzjgquzFyLyLu5MkA n/NvOHP1Cbd3QIXG1TnBi2kDPkQ5FOIAjFSe9IZAGT4dUkZ63wBoDil1jCgMLuCR C+AFPLhiIg3cFvu9+fdZ6BkCuZYESd3YboBtRKeMionEexrPTKt4QWqIoVJgd/Y7 nwvR8jkIBpVgQZT8ocYqhSycLCYV2lGqEBSq4rlRiEb/W1G9Awmg8UTGuUYFSC1G vGPCwhGi+SBsbo84aPCfSdUkKDlruNWP0GwIFxo0hsiTOoHP+7UWeenJ2Jw5lNPt p0Y72TEDDaSMlE4cJx6IWdWM/B+OWzCyRyl3uVcy7bToEsVhIbBSSth7+sh2n7Cy WgH1lrtMzg== =sace -----END PGP SIGNATURE----- Merge tag 'for-6.7/block-2023-10-30' of git://git.kernel.dk/linux Pull block updates from Jens Axboe: - Improvements to the queue_rqs() support, and adding null_blk support for that as well (Chengming) - Series improving badblocks support (Coly) - Key store support for sed-opal (Greg) - IBM partition string handling improvements (Jan) - Make number of ublk devices supported configurable (Mike) - Cancelation improvements for ublk (Ming) - MD pull requests via Song: - Handle timeout in md-cluster, by Denis Plotnikov - Cleanup pers->prepare_suspend, by Yu Kuai - Rewrite mddev_suspend(), by Yu Kuai - Simplify md_seq_ops, by Yu Kuai - Reduce unnecessary locking array_state_store(), by Mariusz Tkaczyk - Make rdev add/remove independent from daemon thread, by Yu Kuai - Refactor code around quiesce() and mddev_suspend(), by Yu Kuai - NVMe pull request via Keith: - nvme-auth updates (Mark) - nvme-tcp tls (Hannes) - nvme-fc annotations (Kees) - Misc cleanups and improvements (Jiapeng, Joel) * tag 'for-6.7/block-2023-10-30' of git://git.kernel.dk/linux: (95 commits) block: ublk_drv: 
Remove unused function md: cleanup pers->prepare_suspend() nvme-auth: allow mixing of secret and hash lengths nvme-auth: use transformed key size to create resp nvme-auth: alloc nvme_dhchap_key as single buffer nvmet-tcp: use 'spin_lock_bh' for state_lock() powerpc/pseries: PLPKS SED Opal keystore support block: sed-opal: keystore access for SED Opal keys block:sed-opal: SED Opal keystore ublk: simplify aborting request ublk: replace monitor with cancelable uring_cmd ublk: quiesce request queue when aborting queue ublk: rename mm_lock as lock ublk: move ublk_cancel_dev() out of ub->mutex ublk: make sure io cmd handled in submitter task context ublk: don't get ublk device reference in ublk_abort_queue() ublk: Make ublks_max configurable ublk: Limit dev_id/ub_number values md-cluster: check for timeout while a new disk adding nvme: rework NVME_AUTH Kconfig selection ...	2023-11-01 12:30:07 -10:00
Linus Torvalds	d4e175f2c4	vfs-6.7.super -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZT0C2gAKCRCRxhvAZXjc otV8AQCK5F9ONoQ7ISpdrKyUJiswySGXx0CYPfXbSg5gHH87zgEAua3vwVKeGXXF 5iVsdiNzIIQDwGDx7FyxufL4ggcN6gQ= =E1kV -----END PGP SIGNATURE----- Merge tag 'vfs-6.7.super' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs Pull vfs superblock updates from Christian Brauner: "This contains the work to make block device opening functions return a struct bdev_handle instead of just a struct block_device. The same struct bdev_handle is then also passed to block device closing functions. This allows us to propagate context from opening to closing a block device without having to modify all users every time. Sidenote, in the future we might even want to try and have block device opening functions return a struct file directly but that's a series on top of this. These are further preparatory changes to be able to count writable opens and blocking writes to mounted block devices. That's a separate piece of work for next cycle and for that we absolutely need the changes to btrfs that have been quietly dropped somehow. Originally the series contained a patch that removed the old blkdev_() helpers. But since this would've caused needless churn in -next for bcachefs we ended up delaying it. The second piece of work addresses one of the major annoyances about the work last cycle, namely that we required dropping s_umount whenever we used the superblock and fs_holder_ops for a block device. The reason for that requirement had been that in some codepaths s_umount could've been taken under disk->open_mutex (that's always been the case, at least theoretically). For example, on surprise block device removal or media change. And opening and closing block devices required grabbing disk->open_mutex as well. So we did the work and went through the block layer and fixed all those places so that s_umount is never taken under disk->open_mutex. 
This means no more brittle games where we yield and reacquire s_umount during block device opening and closing and no more requirements where block devices need to be closed. Filesystems don't need to care about this. There's a bunch of other follow-up work such as moving block device freezing and thawing to holder operations which makes it work for all block devices and not just the main block device just as we did for surprise removal. But that is for next cycle. Tested with fstests for all major fses, blktests, LTP" tag 'vfs-6.7.super' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (37 commits) porting: update locking requirements fs: assert that open_mutex isn't held over holder ops block: assert that we're not holding open_mutex over blk_report_disk_dead block: move bdev_mark_dead out of disk_check_media_change block: WARN_ON_ONCE() when we remove active partitions block: simplify bdev_del_partition() fs: Avoid grabbing sb->s_umount under bdev->bd_holder_lock jfs: fix log->bdev_handle null ptr deref in lbmStartIO bcache: Fixup error handling in register_cache() xfs: Convert to bdev_open_by_path() reiserfs: Convert to bdev_open_by_dev/path() ocfs2: Convert to use bdev_open_by_dev() nfs/blocklayout: Convert to use bdev_open_by_dev/path() jfs: Convert to bdev_open_by_dev() f2fs: Convert to bdev_open_by_dev/path() ext4: Convert to bdev_open_by_dev() erofs: Convert to use bdev_open_by_path() btrfs: Convert to bdev_open_by_path() fs: Convert to bdev_open_by_dev() mm/swap: Convert to use bdev_open_by_dev() ...	2023-10-30 08:59:05 -10:00
Christian Brauner	f61033390b	block: assert that we're not holding open_mutex over blk_report_disk_dead blk_report_disk_dead() has the following major callers: (1) del_gendisk() (2) blk_mark_disk_dead() Since del_gendisk() acquires disk->open_mutex it's clear that all callers are assumed to be called without disk->open_mutex held. In turn, blk_report_disk_dead() is called without disk->open_mutex held in del_gendisk(). All callers of blk_mark_disk_dead() call it without disk->open_mutex as well. Ensure that it is clear that blk_report_disk_dead() is called without disk->open_mutex on purpose by asserting it and a comment in the code. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20231017184823.1383356-5-hch@lst.de Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org>	2023-10-28 13:29:23 +02:00
Christoph Hellwig	6e57236ed6	block: move bdev_mark_dead out of disk_check_media_change disk_check_media_change is mostly called from ->open where it makes little sense to mark the file system on the device as dead, as we are just opening it. So instead of calling bdev_mark_dead from disk_check_media_change move it into the few callers that are not in an open instance. This avoids calling into bdev_mark_dead and thus taking s_umount with open_mutex held. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20231017184823.1383356-4-hch@lst.de Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org>	2023-10-28 13:29:23 +02:00
Christian Brauner	51b4cb4f3e	block: WARN_ON_ONCE() when we remove active partitions The logic for disk->open_partitions is: blkdev_get_by_*() -> bdev_is_partition() -> blkdev_get_part() -> blkdev_get_whole() // bdev_whole->bd_openers++ -> if (part->bd_openers == 0) disk->open_partitions++ part->bd_openers In other words, when we first claim/open a partition we increment disk->open_partitions and only when all part->bd_openers are closed will disk->open_partitions be zero. That should mean that disk->open_partitions is always > 0 as long as there's anyone that has an open partition. So the check for disk->open_partitions should mean that we can never remove an active partition that has a holder and holder ops set. Assert that in the code. The main disk isn't removed so that check doesn't work for disk->part0 which is what we want. After all we only care about partition not about the main disk. Link: https://lore.kernel.org/r/20231017184823.1383356-3-hch@lst.de Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org>	2023-10-28 13:29:22 +02:00
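The counting rule described in the commit above can be modelled in a few lines of userspace C. This is only a sketch: `struct model_disk`, `struct model_part` and the `model_part_*` helpers are hypothetical stand-ins, not kernel code, and they ignore locking and the whole-device opener count entirely.

```c
#include <assert.h>

/* Hypothetical userspace stand-ins for the kernel structures. */
struct model_disk {
	int open_partitions;	/* partitions that have at least one opener */
};

struct model_part {
	struct model_disk *disk;
	int bd_openers;		/* number of opens of this partition */
};

/* The first opener of a partition bumps the per-disk counter. */
static void model_part_open(struct model_part *part)
{
	if (part->bd_openers == 0)
		part->disk->open_partitions++;
	part->bd_openers++;
}

/* The last closer of a partition drops the per-disk counter again. */
static void model_part_close(struct model_part *part)
{
	assert(part->bd_openers > 0);
	part->bd_openers--;
	if (part->bd_openers == 0)
		part->disk->open_partitions--;
}

/* In this model a partition may only be removed while nobody has it open. */
static int model_part_can_remove(const struct model_part *part)
{
	return part->bd_openers == 0;
}
```

With two partitions on one disk, opening the first twice and the second once leaves `open_partitions` at 2, and it only returns to 0 once every opener has closed.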
Christian Brauner	c30b9787a4	block: simplify bdev_del_partition() BLKPG_DEL_PARTITION refuses to delete partitions that still have openers, i.e., that has an elevated @bdev->bd_openers count. If a device is claimed by setting @bdev->bd_holder and @bdev->bd_holder_ops @bdev->bd_openers and @bdev->bd_holders are incremented. @bdev->bd_openers is effectively guaranteed to be >= @bdev->bd_holders. So as long as @bdev->bd_openers isn't zero we know that this partition is still in active use and that there might still be @bdev->bd_holder and @bdev->bd_holder_ops set. The only current example is @fs_holder_ops for filesystems. But that means bdev_mark_dead() which calls into bdev->bd_holder_ops->mark_dead::fs_bdev_mark_dead() is a nop. As long as there's an elevated @bdev->bd_openers count we can't delete the partition and if there isn't an elevated @bdev->bd_openers count then there's no @bdev->bd_holder or @bdev->bd_holder_ops. So simply open-code what we need to do. This gets rid of one more instance where we acquire s_umount under @disk->open_mutex. Link: https://lore.kernel.org/r/20231016-fototermin-umriss-59f1ea6c1fe6@brauner Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20231017184823.1383356-2-hch@lst.de Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2023-10-28 13:29:22 +02:00
Jan Kara	fd1464105c	fs: Avoid grabbing sb->s_umount under bdev->bd_holder_lock The implementation of bdev holder operations such as fs_bdev_mark_dead() and fs_bdev_sync() grab sb->s_umount semaphore under bdev->bd_holder_lock. This is problematic because it leads to disk->open_mutex -> sb->s_umount lock ordering which is counterintuitive (usually we grab higher level (e.g. filesystem) locks first and lower level (e.g. block layer) locks later) and indeed makes lockdep complain about possible locking cycles whenever we open a block device while holding sb->s_umount semaphore. Implement a function bdev_super_lock_shared() which safely transitions from holding bdev->bd_holder_lock to holding sb->s_umount on a live superblock without introducing the problematic lock dependency. We use this function in fs_bdev_sync() and fs_bdev_mark_dead(). Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20231018152924.3858-1-jack@suse.cz Link: https://lore.kernel.org/r/20231017184823.1383356-1-hch@lst.de Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>	2023-10-28 13:29:22 +02:00
Jan Kara	acb083b555	block: Use bdev_open_by_dev() in disk_scan_partitions() and blkdev_bszset() Convert disk_scan_partitions() and blkdev_bszset() to use bdev_open_by_dev(). Acked-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230927093442.25915-3-jack@suse.cz Signed-off-by: Christian Brauner <brauner@kernel.org>	2023-10-28 13:29:16 +02:00
Jan Kara	841dd789b8	block: Use bdev_open_by_dev() in blkdev_open() Convert blkdev_open() to use bdev_open_by_dev(). To be able to propagate handle from blkdev_open() to blkdev_release() we need to stop using existence of file->private_data to determine exclusive block device opens. Use bdev_handle->mode for this purpose since file->f_flags isn't usable for this (O_EXCL is cleared from the flags during open). Acked-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230927093442.25915-2-jack@suse.cz Signed-off-by: Christian Brauner <brauner@kernel.org>	2023-10-28 13:29:16 +02:00
Jan Kara	e719b4d156	block: Provide bdev_open_* functions Create struct bdev_handle that contains all parameters that need to be passed to blkdev_put() and provide bdev_open_* functions that return this structure instead of plain bdev pointer. This will eventually allow us to pass one more argument to blkdev_put() (renamed to bdev_release()) without too much hassle. Acked-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230927093442.25915-1-jack@suse.cz Signed-off-by: Christian Brauner <brauner@kernel.org>	2023-10-28 13:29:16 +02:00
Khazhismel Kumykov	2dd710d476	blk-throttle: check for overflow in calculate_bytes_allowed Inexact, we may reject some not-overflowing values incorrectly, but they'll be on the order of exabytes allowed anyways. This fixes divide error crash on x86 if bps_limit is not configured or is set too high in the rare case that jiffy_elapsed is greater than HZ. Fixes: `e8368b57c0` ("blk-throttle: use calculate_io/bytes_allowed() for throtl_trim_slice()") Fixes: `8d6bbaada2` ("blk-throttle: prevent overflow while calculating wait time") Signed-off-by: Khazhismel Kumykov <khazhy@google.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20231020223617.2739774-1-khazhy@google.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-10-20 18:38:17 -06:00
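An inexact overflow check of the kind the commit above describes can be sketched in userspace C by comparing bit widths before doing the multiply-and-divide. This is a sketch under assumptions, not the kernel source: `MODEL_HZ`, `floor_log2()` and `model_bytes_allowed()` are hypothetical names, the tick rate of 1000 is chosen only for illustration, and `unsigned __int128` is a GCC/Clang extension used here for the wide intermediate product.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_HZ 1000ULL	/* assumed tick rate, for illustration only */

/* Floor of log2(x); returns 0 for x == 0 and x == 1. */
static int floor_log2(uint64_t x)
{
	int n = 0;

	while (x >>= 1)
		n++;
	return n;
}

/*
 * Bytes allowed in jiffy_elapsed ticks at bps_limit bytes per second.
 * If the quotient might not fit in 64 bits, saturate instead of dividing.
 * The bound is 62 rather than 64 because each floor_log2() truncates, so
 * a few values that would still fit are rejected as well.
 */
static uint64_t model_bytes_allowed(uint64_t bps_limit, uint64_t jiffy_elapsed)
{
	if (floor_log2(bps_limit) + floor_log2(jiffy_elapsed) -
	    floor_log2(MODEL_HZ) > 62)
		return UINT64_MAX;

	/* 128-bit intermediate so the product itself cannot wrap. */
	return (uint64_t)(((unsigned __int128)bps_limit * jiffy_elapsed) /
			  MODEL_HZ);
}
```

An unconfigured limit (modelled as `UINT64_MAX`) combined with more than one second of elapsed ticks takes the saturating branch, while ordinary limits take the exact path.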
Alexey Dobriyan	68279f9c9f	treewide: mark stuff as __ro_after_init __read_mostly predates __ro_after_init. Many variables which are marked __read_mostly should have been __ro_after_init from day 1. Also, mark some stuff as "const" and "__init" while I'm at it. [akpm@linux-foundation.org: revert sysctl_nr_open_min, sysctl_nr_open_max changes due to arm warning] [akpm@linux-foundation.org: coding-style cleanups] Link: https://lkml.kernel.org/r/4f6bb9c0-abba-4ee4-a7aa-89265e886817@p183 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-10-18 14:43:23 -07:00
Greg Joyce	ec8cf230ce	powerpc/pseries: PLPKS SED Opal keystore support Define operations for SED Opal to read/write keys from POWER LPAR Platform KeyStore(PLPKS). This allows non-volatile storage of SED Opal keys. Signed-off-by: Greg Joyce <gjoyce@linux.vnet.ibm.com> Reviewed-by: Jonathan Derrick <jonathan.derrick@linux.dev> Link: https://lore.kernel.org/r/20231004201957.1451669-4-gjoyce@linux.vnet.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-10-17 09:10:06 -06:00
Greg Joyce	5dd339722f	block: sed-opal: keystore access for SED Opal keys Allow for permanent SED authentication keys by reading/writing to the SED Opal non-volatile keystore. Signed-off-by: Greg Joyce <gjoyce@linux.vnet.ibm.com> Reviewed-by: Jonathan Derrick <jonathan.derrick@linux.dev> Link: https://lore.kernel.org/r/20231004201957.1451669-3-gjoyce@linux.vnet.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-10-17 09:10:06 -06:00
Milan Broz	4eaf0932c6	block: Fix regression in sed-opal for a saved key. The commit `3bfeb61256` introduced the use of keyring for sed-opal. Unfortunately, there is also a possibility to save the Opal key used in opal_lock_unlock(). This patch switches the order of operation, so the cached key is used instead of failure for opal_get_key. The problem was found by the cryptsetup Opal test recently added to the cryptsetup tree. Fixes: `3bfeb61256` ("block: sed-opal: keyring support for SED keys") Tested-by: Ondrej Kozina <okozina@redhat.com> Signed-off-by: Milan Broz <gmazyland@gmail.com> Link: https://lore.kernel.org/r/20231003100209.380037-1-gmazyland@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-10-13 08:16:08 -06:00
Sarthak Kukreti	1364a3c391	block: Don't invalidate pagecache for invalid falloc modes Only call truncate_bdev_range() if the fallocate mode is supported. This fixes a bug where data in the pagecache could be invalidated if the fallocate() was called on the block device with an invalid mode. Fixes: `25f4c41415` ("block: implement (some of) fallocate for block devices") Cc: stable@vger.kernel.org Reported-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Sarthak Kukreti <sarthakkukreti@chromium.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Fixes: line? I've never seen those wrapped. Link: https://lore.kernel.org/r/20231011201230.750105-1-sarthakkukreti@chromium.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-10-11 15:53:17 -06:00
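The ordering fix in the commit above (reject an unsupported mode before any side effect happens) can be illustrated with a small userspace model. Everything here is hypothetical: the `MODEL_FL_*` flag values, `struct model_bdev` and `model_fallocate()` are invented for the sketch, the set of accepted modes is an assumption, and the counter merely stands in for the page cache invalidation done by truncate_bdev_range().

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical mode flags for this model only. */
#define MODEL_FL_KEEP_SIZE	0x01
#define MODEL_FL_PUNCH_HOLE	0x02
#define MODEL_FL_ZERO_RANGE	0x10

struct model_bdev {
	int invalidations;	/* how often the cached range was dropped */
};

/*
 * Validate the mode first; only a supported mode may reach the step
 * that invalidates cached data.
 */
static int model_fallocate(struct model_bdev *bdev, int mode)
{
	switch (mode) {
	case MODEL_FL_ZERO_RANGE:
	case MODEL_FL_ZERO_RANGE | MODEL_FL_KEEP_SIZE:
	case MODEL_FL_PUNCH_HOLE | MODEL_FL_KEEP_SIZE:
		break;
	default:
		/* Unsupported: fail without touching the cache. */
		return -EOPNOTSUPP;
	}

	bdev->invalidations++;	/* stands in for truncate_bdev_range() */
	return 0;
}
```

Calling `model_fallocate()` with an unknown mode returns `-EOPNOTSUPP` and leaves `invalidations` unchanged, which is the property the fix restores.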
Jan Höppner	a31281acc4	partitions/ibm: Introduce defines for magic string length values The length values for volume label type and volume label id are hard-coded in several places. Provide defines for those values and replace all occurrences accordingly. Note that the length is defined and used, and not the size since the volume label type string and volume label id string are not nul-terminated. Signed-off-by: Jan Höppner <hoeppner@linux.ibm.com> Reviewed-by: Stefan Haberland <sth@linux.ibm.com> Signed-off-by: Stefan Haberland <sth@linux.ibm.com> Link: https://lore.kernel.org/r/20230915131001.697070-4-sth@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-10-04 08:04:08 -06:00
Jan Höppner	f5f43aae6f	partitions/ibm: Replace strncpy() and improve readability strncpy() is deprecated and needs to be replaced. The volume label information strings are not nul-terminated and strncpy() can simply be replaced with memcpy(). To enhance the readability of find_label() alongside this change, the following improvements are made: - Introduce the array dasd_vollabels[] containing all information necessary for the label detection. - Provide a helper function to obtain an index value corresponding to a volume label type. This allows the use of a switch statement to reduce indentation levels. - The 'temp' variable is used to check against valid volume label types. In the good case, this variable already contains the volume label type making it unnecessary to copy the information again from e.g. label->vol.vollbl. Remove the 'temp' variable and the second copy as all information are already provided. - Remove the 'found' variable and replace it with early returns Signed-off-by: Jan Höppner <hoeppner@linux.ibm.com> Reviewed-by: Stefan Haberland <sth@linux.ibm.com> Signed-off-by: Stefan Haberland <sth@linux.ibm.com> Link: https://lore.kernel.org/r/20230915131001.697070-3-sth@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-10-04 08:04:08 -06:00
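The approach described in the commit above (a length define, a label table, an index helper, and memcpy() instead of strncpy() for strings that are not nul-terminated) can be sketched as follows. The names `MODEL_VOLLBL_LEN`, `model_vollabels`, `model_label_index()` and `model_copy_label()` are hypothetical, and the ASCII label values are placeholders chosen for the example rather than the on-disk encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Length, not size: the label fields carry no terminating nul byte. */
#define MODEL_VOLLBL_LEN 4

/* Placeholder label types, stored without a terminator. */
static const char model_vollabels[][MODEL_VOLLBL_LEN] = {
	{ 'V', 'O', 'L', '1' },
	{ 'L', 'N', 'X', '1' },
	{ 'C', 'M', 'S', '1' },
};

/* Return the table index of the label type in raw, or -1 if unknown. */
static int model_label_index(const char *raw)
{
	size_t i;

	for (i = 0; i < sizeof(model_vollabels) / sizeof(model_vollabels[0]); i++) {
		if (memcmp(raw, model_vollabels[i], MODEL_VOLLBL_LEN) == 0)
			return (int)i;
	}
	return -1;
}

/* Fixed-length copy; memcpy() is enough since no terminator is expected. */
static void model_copy_label(char dst[MODEL_VOLLBL_LEN], const char *raw)
{
	memcpy(dst, raw, MODEL_VOLLBL_LEN);
}
```

A caller can then switch on the returned index and return early for -1, which is the shape the commit message describes for find_label().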
Jan Höppner	d323c1a947	partitions/ibm: Remove unnecessary memset The data holding the volume label information is zeroed in case no valid volume label was found. Since the label information isn't used in that case, zeroing the data doesn't provide any value whatsoever. Remove the unnecessary memset() call accordingly. Signed-off-by: Jan Höppner <hoeppner@linux.ibm.com> Reviewed-by: Stefan Haberland <sth@linux.ibm.com> Signed-off-by: Stefan Haberland <sth@linux.ibm.com> Link: https://lore.kernel.org/r/20230915131001.697070-2-sth@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-10-04 08:04:08 -06:00
Coly Li	aa511ff821	badblocks: switch to the improved badblock handling code This patch removes the old code of badblocks_set(), badblocks_clear() and badblocks_check(), and makes them wrappers that call _badblocks_set(), _badblocks_clear() and _badblocks_check(). With this change the badblock handling switches to the improved algorithms in _badblocks_set(), _badblocks_clear() and _badblocks_check(). This patch only contains the deletion of the old code; the newly added code for the improved algorithms is in previous patches. Signed-off-by: Coly Li <colyli@suse.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Geliang Tang <geliang.tang@suse.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Jens Axboe <axboe@kernel.dk> Cc: NeilBrown <neilb@suse.de> Cc: Vishal L Verma <vishal.l.verma@intel.com> Cc: Xiao Ni <xni@redhat.com> Reviewed-by: Xiao Ni <xni@redhat.com> Acked-by: Geliang Tang <geliang.tang@suse.com> Link: https://lore.kernel.org/r/20230811170513.2300-7-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-09-26 00:44:33 -06:00
Coly Li	3ea3354cb9	badblocks: improve badblocks_check() for multiple ranges handling This patch rewrites badblocks_check() with similar coding style as _badblocks_set() and _badblocks_clear(). The only difference is bad blocks checking may handle multiple ranges in bad tables now. If a checking range covers multiple bad blocks range in bad block table, like the following condition (C is the checking range, E1, E2, E3 are three bad block ranges in bad block table), +------------------------------------+ \| C \| +------------------------------------+ +----+ +----+ +----+ \| E1 \| \| E2 \| \| E3 \| +----+ +----+ +----+ The improved badblocks_check() algorithm will divide checking range C into multiple parts, and handle them in 7 runs of a while-loop, +--+ +----+ +----+ +----+ +----+ +----+ +----+ \|C1\| \| C2 \| \| C3 \| \| C4 \| \| C5 \| \| C6 \| \| C7 \| +--+ +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+ \| E1 \| \| E2 \| \| E3 \| +----+ +----+ +----+ And the start LBA and length of range E1 will be set as first_bad and bad_sectors for the caller. The return value rule is consistent for multiple ranges. For example if there are following bad block ranges in bad block table, Index No. Start Len Ack 0 400 20 1 1 500 50 1 2 650 20 0 the return value, first_bad, bad_sectors by calling badblocks_check() with different checking range can be the following values, Checking Start, Len Return Value first_bad bad_sectors 100, 100 0 N/A N/A 100, 310 1 400 10 100, 440 1 400 10 100, 540 1 400 10 100, 600 -1 400 10 100, 800 -1 400 10 In order to make code review easier, this patch names the improved bad block range checking routine as _badblocks_check() and does not change existing badblocks_check() code yet. Later patch will delete old code of badblocks_check() and make it as a wrapper to call _badblocks_check(). Then the new added code won't mess up with the old deleted code, it will be more clear and easier for code review. 
Signed-off-by: Coly Li <colyli@suse.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Geliang Tang <geliang.tang@suse.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Jens Axboe <axboe@kernel.dk> Cc: NeilBrown <neilb@suse.de> Cc: Vishal L Verma <vishal.l.verma@intel.com> Cc: Xiao Ni <xni@redhat.com> Reviewed-by: Xiao Ni <xni@redhat.com> Acked-by: Geliang Tang <geliang.tang@suse.com> Link: https://lore.kernel.org/r/20230811170513.2300-6-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-09-26 00:44:33 -06:00
Coly Li	db448eb686	badblocks: improve badblocks_clear() for multiple ranges handling With the fundamental ideas and helper routines from the badblocks_set() improvement, clearing bad blocks for multiple ranges is much simpler. With a similar idea from the badblocks_set() improvement, this patch simplifies bad block range clearing into 5 situations. No matter how complicated the clearing condition is, we just look at the head part of the clearing range relative to the already set bad block range from the bad block table. The remaining part will be handled in the next run of the while-loop. Based on existing helpers added from badblocks_set(), this patch adds two more helpers, - front_clear() Clear the bad block range from bad block table which is front overlapped with the clearing range. - front_splitting_clear() Handle the condition that the clearing range hits the middle of an already set bad block range from bad block table. Similar to badblocks_set(), the first part of the clearing range is handled with the relative bad block range which is found by prev_badblocks(). In most cases a valid hint is provided to prev_badblocks() to avoid unnecessary bad block table iteration. This patch also explains the detailed algorithm in code comments at the beginning of badblocks.c, including which five simplified situations are categorized and how all the bad block range clearing conditions are handled by these five situations. Again, in order to make the code review easier and avoid mixing the code changes together, this patch does not modify badblocks_clear() and implements another routine called _badblocks_clear() for the improvement. A later patch will delete the current code of badblocks_clear() and make it a wrapper to _badblocks_clear(), so the code change can be much clearer for review. 
Signed-off-by: Coly Li <colyli@suse.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Geliang Tang <geliang.tang@suse.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Jens Axboe <axboe@kernel.dk> Cc: NeilBrown <neilb@suse.de> Cc: Vishal L Verma <vishal.l.verma@intel.com> Cc: Xiao Ni <xni@redhat.com> Reviewed-by: Xiao Ni <xni@redhat.com> Acked-by: Geliang Tang <geliang.tang@suse.com> Link: https://lore.kernel.org/r/20230811170513.2300-5-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-09-26 00:44:33 -06:00
Coly Li	1726c77467	badblocks: improve badblocks_set() for multiple ranges handling Recently I received a bug report that current badblocks code does not properly handle multiple ranges. For example, badblocks_set(bb, 32, 1, true); badblocks_set(bb, 34, 1, true); badblocks_set(bb, 36, 1, true); badblocks_set(bb, 32, 12, true); Then indeed badblocks_show() reports, 32 3 36 1 But the expected bad blocks table should be, 32 12 Obviously only the first 2 ranges are merged and badblocks_set() returns and ignores the rest setting range. This behavior is improper, if the caller of badblocks_set() wants to set a range of blocks into bad blocks table, all of the blocks in the range should be handled even the previous part encountering failure. The desired way to set bad blocks range by badblocks_set() is, - Set as many as blocks in the setting range into bad blocks table. - Merge the bad blocks ranges and occupy as less as slots in the bad blocks table. - Fast. Indeed the above proposal is complicated, especially with the following restrictions, - The setting bad blocks range can be acknowledged or not acknowledged. - The bad blocks table size is limited. - Memory allocation should be avoided. The basic idea of the patch is to categorize all possible bad blocks range setting combinations into much less simplified and more less special conditions. Inside badblocks_set() there is an implicit loop composed by jumping between labels 're_insert' and 'update_sectors'. No matter how large the setting bad blocks range is, in every loop just a minimized range from the head is handled by a pre-defined behavior from one of the categorized conditions. The logic is simple and code flow is manageable. The different relative layout between the setting range and existing bad block range are checked and handled (merge, combine, overwrite, insert) by the helpers in previous patch. This patch is to make all the helpers work together with the above idea. 
This patch only has the algorithm improvement for badblocks_set(). There are following patches contain improvement for badblocks_clear() and badblocks_check(). But the algorithm in badblocks_set() is fundamental and typical, other improvement in clear and check routines are based on all the helpers and ideas in this patch. In order to make the change to be more clear for code review, this patch does not directly modify existing badblocks_set(), and just add a new one named _badblocks_set(). Later patch will remove current existing badblocks_set() code and make it as a wrapper of _badblocks_set(). So the new added change won't be mixed with deleted code, the code review can be easier. Signed-off-by: Coly Li <colyli@suse.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Geliang Tang <geliang.tang@suse.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Jens Axboe <axboe@kernel.dk> Cc: NeilBrown <neilb@suse.de> Cc: Vishal L Verma <vishal.l.verma@intel.com> Cc: Wols Lists <antlists@youngman.org.uk> Cc: Xiao Ni <xni@redhat.com> Reviewed-by: Xiao Ni <xni@redhat.com> Acked-by: Geliang Tang <geliang.tang@suse.com> Link: https://lore.kernel.org/r/20230811170513.2300-4-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-09-26 00:44:33 -06:00
Coly Li	c3c6a86e9e	badblocks: add helper routines for badblock ranges handling This patch adds several helper routines to improve badblock ranges handling. These helper routines will be used later in the improved version of badblocks_set()/badblocks_clear()/badblocks_check(). - Helpers prev_by_hint() and prev_badblocks() are used to find the bad range from bad table which the searching range starts at or after. - The following helpers are to decide the relative layout between the manipulating range and existing bad block range from bad table. - can_merge_behind() Return 'true' if the manipulating range can backward merge with the bad block range. - can_merge_front() Return 'true' if the manipulating range can forward merge with the bad block range. - can_combine_front() Return 'true' if two adjacent bad block ranges before the manipulating range can be merged. - overlap_front() Return 'true' if the manipulating range exactly overlaps with the bad block range in front of its range. - overlap_behind() Return 'true' if the manipulating range exactly overlaps with the bad block range behind its range. - can_front_overwrite() Return 'true' if the manipulating range can forward overwrite the bad block range in front of its range. - The following helpers are to add the manipulating range into the bad block table. Different routine is called with the specific relative layout between the manipulating range and other bad block range in the bad block table. - behind_merge() Merge the manipulating range with the bad block range behind its range, and return the number of merged length in unit of sector. - front_merge() Merge the manipulating range with the bad block range in front of its range, and return the number of merged length in unit of sector. - front_combine() Combine the two adjacent bad block ranges before the manipulating range into a larger one. - front_overwrite() Overwrite partial of whole bad block range which is in front of the manipulating range. 
The overwrite may split existing bad block range and generate more bad block ranges into the bad block table. - insert_at() Insert the manipulating range at a specific location in the bad block table. All the above helpers are used in later patches to improve the bad block ranges handling for badblocks_set()/badblocks_clear()/badblocks_check(). Signed-off-by: Coly Li <colyli@suse.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Geliang Tang <geliang.tang@suse.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Jens Axboe <axboe@kernel.dk> Cc: NeilBrown <neilb@suse.de> Cc: Vishal L Verma <vishal.l.verma@intel.com> Cc: Xiao Ni <xni@redhat.com> Reviewed-by: Xiao Ni <xni@redhat.com> Acked-by: Geliang Tang <geliang.tang@suse.com> Link: https://lore.kernel.org/r/20230811170513.2300-3-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-09-26 00:44:33 -06:00
Randy Dunlap	a578a25339	block: fix kernel-doc for disk_force_media_change() Drop one function parameter's kernel-doc comment since the parameter was removed. This prevents a kernel-doc warning: block/disk-events.c:300: warning: Excess function parameter 'events' description in 'disk_force_media_change' Fixes: `ab6860f62b` ("block: simplify the disk_force_media_change interface") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reported-by: kernel test robot <lkp@intel.com> Closes: lore.kernel.org/r/202309060957.vfl0mUur-lkp@intel.com Cc: Christoph Hellwig <hch@lst.de> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230926005232.23666-1-rdunlap@infradead.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-09-26 00:43:34 -06:00
Chengming Zhou	217b613a53	blk-mq: update driver tags request table when start request Now we update driver tags request table in blk_mq_get_driver_tag(), so the driver that support queue_rqs() have to update that inflight table by itself. Move it to blk_mq_start_request(), which is a better place where we setup the deadline for request timeout check. And it's just where the request becomes inflight. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230913151616.3164338-5-chengming.zhou@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-09-22 08:52:13 -06:00
Chengming Zhou	434097ee37	blk-mq: support batched queue_rqs() on shared tags queue Since active requests have been accounted when allocate driver tags, we can remove this limit now. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230913151616.3164338-4-chengming.zhou@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-09-22 08:52:13 -06:00
Chengming Zhou	48554df6bf	blk-mq: remove RQF_MQ_INFLIGHT Since the previous patch change to only account active requests when we really allocate the driver tag, the RQF_MQ_INFLIGHT can be removed and no double account problem. 1. none elevator: flush request will use the first pending request's driver tag, won't double account. 2. other elevator: flush request will be accounted when allocate driver tag when issue, and will be unaccounted when it put the driver tag. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230913151616.3164338-3-chengming.zhou@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-09-22 08:52:13 -06:00
Chengming Zhou	b8643d6826	blk-mq: account active requests when get driver tag There is a limit that batched queue_rqs() can't work on shared tags queue, since the account of active requests can't be done there. Now we account the active requests only in blk_mq_get_driver_tag(), which is not the time we get driver tag actually (with none elevator). To support batched queue_rqs() on shared tags queue, we move the account of active requests to where we get the driver tag: 1. none elevator: blk_mq_get_tags() and blk_mq_get_tag() 2. other elevator: __blk_mq_alloc_driver_tag() This is clearer and match with the unaccount side, which just happen when we put the driver tag. The other good point is that we don't need RQF_MQ_INFLIGHT trick anymore, which used to avoid double account of flush request. Now we only account when actually get the driver tag, so all is good. We will remove RQF_MQ_INFLIGHT in the next patch. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230913151616.3164338-2-chengming.zhou@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-09-22 08:52:13 -06:00
Kemeng Shi	e599ed7866	block: correct stale comment in rq_qos_wait The rq_qos_wait calls common wake-up function rq_qos_wake_function to get token. Just replace stale wbt_wake_function with rq_qos_wake_function in comment. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230914091508.36232-1-shikemeng@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-09-18 14:15:28 -06:00
Chengming Zhou	6be6d11241	blk-mq: fix tags UAF when shrinking q->nr_hw_queues When nr_hw_queues shrink, we free the excess tags before realloc'ing hw_ctxs for each queue. During that resize, we may need to access those tags, like blk_mq_tag_idle(hctx) will access queue shared tags. This can cause a slab use-after-free, as reported by KASAN. Fix it by moving the releasing of excess tags to the end. Fixes: `e1dd7bc930` ("blk-mq: fix tags leak when shrink nr_hw_queues") Reported-by: Yi Zhang <yi.zhang@redhat.com> Closes: https://lore.kernel.org/all/CAHj4cs_CK63uoDpGBGZ6DN4OCTpzkR3UaVgK=LX8Owr8ej2ieQ@mail.gmail.com/ Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20230908005702.2183908-1-chengming.zhou@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-09-11 16:17:34 -06:00
Christoph Hellwig	5905afc2c7	block: fix pin count management when merging same-page segments There is no need to unpin the added page when adding it to the bio fails as that is done by the loop below. Instead we want to unpin it when adding a single page to the bio more than once as bio_release_pages will only unpin it once. Fixes: `d1916c86cc` ("block: move same page handling from __bio_add_pc_page to the callers") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230905124731.328255-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-09-06 07:32:27 -06:00
Li Lingfeng	1a721de848	block: don't add or resize partition on the disk with GENHD_FL_NO_PART Commit `a33df75c63` ("block: use an xarray for disk->part_tbl") remove disk_expand_part_tbl() in add_partition(), which means all kinds of devices will support extended dynamic `dev_t`. However, some devices with GENHD_FL_NO_PART are not expected to add or resize partition. Fix this by adding check of GENHD_FL_NO_PART before add or resize partition. Fixes: `a33df75c63` ("block: use an xarray for disk->part_tbl") Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230831075900.1725842-1-lilingfeng@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-08-31 08:00:35 -06:00
Christoph Hellwig	0d997f1de8	block: remove the call to file_remove_privs in blkdev_write_iter file_remove_privs instantly returns 0 when not called for regular files, so don't bother. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20230831121911.280155-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-08-31 08:00:23 -06:00
Yu Kuai	eead005664	blk-throttle: consider 'carryover_ios/bytes' in throtl_trim_slice() Currently, 'carryover_ios/bytes' is not handled in throtl_trim_slice(); as a consequence, 'carryover_ios/bytes' will be used to throttle bio multiple times, for example: 1) set iops limit to 100, and slice start is 0, slice end is 100ms; 2) current time is 0, and 10 ios are dispatched, those io won't be throttled and io_disp is 10; 3) still at current time 0, update iops limit to 1000, carryover_ios is updated to (0 - 10) = -10; 4) in this slice(0 - 100ms), io_allowed = 100 + (-10) = 90, which means only 90 ios can be dispatched without waiting; 5) assume that io is throttled in slice(0 - 100ms), and throtl_trim_slice() updates the slice to (100ms - 200ms). In this case, 'carryover_ios/bytes' is not cleared and still only 90 ios can be dispatched between 100ms - 200ms. Fix this problem by updating 'carryover_ios/bytes' in throtl_trim_slice(). Fixes: `a880ae93e5` ("blk-throttle: fix io hung due to configuration updates") Reported-by: zhuxiaohui <zhuxiaohui.400@bytedance.com> Link: https://lore.kernel.org/all/20230812072116.42321-1-zhuxiaohui.400@bytedance.com/ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230816012708.1193747-5-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-08-30 10:15:01 -06:00
Yu Kuai	e8368b57c0	blk-throttle: use calculate_io/bytes_allowed() for throtl_trim_slice() There are no functional changes, just make the code cleaner. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230816012708.1193747-4-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-08-30 10:15:01 -06:00
Yu Kuai	bb8d5587bd	blk-throttle: fix wrong comparation while 'carryover_ios/bytes' is negative carryover_ios/bytes[] can be negative in the case that ios are dispatched in the slice in advance, and then configuration is updated. For example: 1) set iops limit to 1000, and slice start is 0, slice end is 100ms; 2) current time is 0, and 100 ios are dispatched, those ios will not be throttled, hence io_disp is 100; 3) still at current time 0, update iops limit to 100, then carryover_ios is (0 - 100) = -100; 4) then, dispatch a new io at time 0, the expected result is that this io will wait for 1s. The calculation in tg_within_iops_limit: io_disp = 0; io_allowed = calculate_io_allowed + carryover_ios = 10 + (-100) = -90; io won't be throttled if (io_disp + 1 < io_allowed) passed. Before this patch, in step 4) (io_disp + 1 < io_allowed) is passed, because -90 for unsigned value is very huge, and such io won't be throttled. Fix this problem by checking if 'io/bytes_allowed' is negative first. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230816012708.1193747-3-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-08-30 10:15:01 -06:00
Yu Kuai	ef100397fa	blk-throttle: print signed value 'carryover_bytes/ios' for user 'carryover_bytes/ios' can be negative, indicate that some bio is dispatched in advance within slice while configuration is updated. Print a huge value is not user-friendly. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230816012708.1193747-2-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-08-30 10:15:01 -06:00
Linus Torvalds	3d3dfeb3ae	for-6.6/block-2023-08-28 Merge tag 'for-6.6/block-2023-08-28' of git://git.kernel.dk/linux Pull block updates from Jens Axboe: "Pretty quiet round for this release. 
This contains: - Add support for zoned storage to ublk (Andreas, Ming) - Series improving performance for drivers that mark themselves as needing a blocking context for issue (Bart) - Cleanup the flush logic (Chengming) - sed opal keyring support (Greg) - Fixes and improvements to the integrity support (Jinyoung) - Add some exports for bcachefs that we can hopefully delete again in the future (Kent) - deadline throttling fix (Zhiguo) - Series allowing building the kernel without buffer_head support (Christoph) - Sanitize the bio page adding flow (Christoph) - Write back cache fixes (Christoph) - MD updates via Song: - Fix perf regression for raid0 large sequential writes (Jan) - Fix split bio iostat for raid0 (David) - Various raid1 fixes (Heinz, Xueshi) - raid6test build fixes (WANG) - Deprecate bitmap file support (Christoph) - Fix deadlock with md sync thread (Yu) - Refactor md io accounting (Yu) - Various non-urgent fixes (Li, Yu, Jack) - Various fixes and cleanups (Arnd, Azeem, Chengming, Damien, Li, Ming, Nitesh, Ruan, Tejun, Thomas, Xu)" * tag 'for-6.6/block-2023-08-28' of git://git.kernel.dk/linux: (113 commits) block: use strscpy() to instead of strncpy() block: sed-opal: keyring support for SED keys block: sed-opal: Implement IOC_OPAL_REVERT_LSP block: sed-opal: Implement IOC_OPAL_DISCOVERY blk-mq: prealloc tags when increase tagset nr_hw_queues blk-mq: delete redundant tagset map update when fallback blk-mq: fix tags leak when shrink nr_hw_queues ublk: zoned: support REQ_OP_ZONE_RESET_ALL md: raid0: account for split bio in iostat accounting md/raid0: Fix performance regression for large sequential writes md/raid0: Factor out helper for mapping and submitting a bio md raid1: allow writebehind to work on any leg device set WriteMostly md/raid1: hold the barrier until handle_read_error() finishes md/raid1: free the r1bio before waiting for blocked rdev md/raid1: call free_r1bio() before allow_barrier() in raid_end_bio_io() blk-cgroup: Fix NULL deref caused 
by blkg_policy_data being installed before init drivers/rnbd: restore sysfs interface to rnbd-client md/raid5-cache: fix null-ptr-deref for r5l_flush_stripe_to_raid() raid6: test: only check for Altivec if building on powerpc hosts raid6: test: make sure all intermediate and artifact files are .gitignored ...	2023-08-29 20:21:42 -07:00
Linus Torvalds	511fb5bafe	v6.6-vfs.super Merge tag 'v6.6-vfs.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull superblock updates from Christian Brauner: "This contains the super rework that was ready for this cycle. The first part changes the order of how we open block devices and allocate superblocks, contains various cleanups, simplifications, and a new mechanism to wait on superblock state changes. This unblocks work to ultimately limit the number of writers to a block device. Jan has already scheduled follow-up work that will be ready for v6.7 and allows us to restrict the number of writers to a given block device. That series builds on this work right here. The second part contains filesystem freezing updates. Overview: The generic superblock changes are roughly organized as follows (ignoring additional minor cleanups): (1) Removal of the bd_super member from struct block_device. This was a very odd back pointer to struct super_block with unclear rules. For all relevant places we have other means to get the same information so just get rid of this. (2) Simplify rules for superblock cleanup. Roughly, everything that is allocated during fs_context initialization and that's stored in fs_context->s_fs_info needs to be cleaned up by the fs_context->free() implementation before the superblock allocation function has been called successfully. After sget_fc() returned fs_context->s_fs_info has been transferred to sb->s_fs_info at which point sb->kill_sb() is fully responsible for cleanup. Adhering to these rules means that cleanup of sb->s_fs_info in fill_super() is to be avoided as it's brittle and inconsistent. 
Cleanup shouldn't be duplicated between sb->put_super() as sb->put_super() is only called if sb->s_root has been set aka when the filesystem has been successfully born (SB_BORN). That complexity should be avoided. This also means that block devices are to be closed in sb->kill_sb() instead of sb->put_super(). More details in the lower section. (3) Make it possible to lookup or create a superblock before opening block devices There's a subtle dependency on (2) as some filesystems did rely on fill_super() to be called in order to correctly clean up sb->s_fs_info. All these filesystems have been fixed. (4) Switch most filesystem to follow the same logic as the generic mount code now does as outlined in (3). (5) Use the superblock as the holder of the block device. We can now easily go back from block device to owning superblock. (6) Export and extend the generic fs_holder_ops and use them as holder ops everywhere and remove the filesystem specific holder ops. (7) Call from the block layer up into the filesystem layer when the block device is removed, allowing to shut down the filesystem without risk of deadlocks. (8) Get rid of get_super(). We can now easily go back from the block device to owning superblock and can call up from the block layer into the filesystem layer when the device is removed. 
So no need to wade through all registered superblock to find the owning superblock anymore" Link: https://lore.kernel.org/lkml/20230824-prall-intakt-95dbffdee4a0@brauner/ * tag 'v6.6-vfs.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (47 commits) super: use higher-level helper for {freeze,thaw} super: wait until we passed kill super super: wait for nascent superblocks super: make locking naming consistent super: use locking helpers fs: simplify invalidate_inodes fs: remove get_super block: call into the file system for ioctl BLKFLSBUF block: call into the file system for bdev_mark_dead block: consolidate __invalidate_device and fsync_bdev block: drop the "busy inodes on changed media" log message dasd: also call __invalidate_device when setting the device offline amiflop: don't call fsync_bdev in FDFMTBEG floppy: call disk_force_media_change when changing the format block: simplify the disk_force_media_change interface nbd: call blk_mark_disk_dead in nbd_clear_sock_ioctl xfs use fs_holder_ops for the log and RT devices xfs: drop s_umount over opening the log and RT devices ext4: use fs_holder_ops for the log device ext4: drop s_umount over opening the log device ...	2023-08-28 11:04:18 -07:00
Christian Brauner	3fb5a6562a	New code for 6.6: * Allow the kernel to initiate a freeze of a filesystem. The kernel and userspace can both hold a freeze on a filesystem at the same time; the freeze is not lifted until /both/ holders lift it. This will enable us to fix a longstanding bug in XFS online fsck. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Merge tag 'vfs-6.6-merge-2' of ssh://gitolite.kernel.org/pub/scm/fs/xfs/xfs-linux Pull filesystem freezing updates from Darrick Wong: New code for 6.6: * Allow the kernel to initiate a freeze of a filesystem. The kernel and userspace can both hold a freeze on a filesystem at the same time; the freeze is not lifted until /both/ holders lift it. This will enable us to fix a longstanding bug in XFS online fsck. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Message-Id: <20230822182604.GB11286@frogsfrogsfrogs> Signed-off-by: Christian Brauner <brauner@kernel.org>	2023-08-23 13:06:55 +02:00
Xu Panda	146afeb235	block: use strscpy() to instead of strncpy() The implementation of strscpy() is more robust and safer. That's now the recommended way to copy NUL terminated strings. Signed-off-by: Xu Panda <xu.panda@zte.com.cn> Signed-off-by: Yang Yang <yang.yang29@zte.com> Reviewed-by: Justin Stitt <justinstitt@google.com> Link: https://lore.kernel.org/r/202212031422587503771@zte.com.cn Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-08-22 18:07:50 -06:00
Greg Joyce	3bfeb61256	block: sed-opal: keyring support for SED keys Extend the SED block driver so it can alternatively obtain a key from a sed-opal kernel keyring. The SED ioctls will indicate the source of the key, either directly in the ioctl data or from the keyring. This allows the use of SED commands in scripts such as udev scripts so that drives may be automatically unlocked as they become available. Signed-off-by: Greg Joyce <gjoyce@linux.vnet.ibm.com> Reviewed-by: Jonathan Derrick <jonathan.derrick@linux.dev> Acked-by: Jarkko Sakkinen <jarkko@kernel.org> Link: https://lore.kernel.org/r/20230721211534.3437070-4-gjoyce@linux.vnet.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-08-22 11:10:26 -06:00
Greg Joyce	5c82efc1ae	block: sed-opal: Implement IOC_OPAL_REVERT_LSP This is used in conjunction with IOC_OPAL_REVERT_TPR to return a drive to Original Factory State without erasing the data. If IOC_OPAL_REVERT_LSP is called with opal_revert_lsp.options bit OPAL_PRESERVE set prior to calling IOC_OPAL_REVERT_TPR, the drive global locking range will not be erased. Signed-off-by: Greg Joyce <gjoyce@linux.vnet.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jonathan Derrick <jonathan.derrick@linux.dev> Acked-by: Jarkko Sakkinen <jarkko@kernel.org> Link: https://lore.kernel.org/r/20230721211534.3437070-3-gjoyce@linux.vnet.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-08-22 11:10:26 -06:00
Greg Joyce	9fb10726ec	block: sed-opal: Implement IOC_OPAL_DISCOVERY Add IOC_OPAL_DISCOVERY ioctl to return raw discovery data to a SED Opal application. This allows the application to display drive capabilities and state. Signed-off-by: Greg Joyce <gjoyce@linux.vnet.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jonathan Derrick <jonathan.derrick@linux.dev> Acked-by: Jarkko Sakkinen <jarkko@kernel.org> Link: https://lore.kernel.org/r/20230721211534.3437070-2-gjoyce@linux.vnet.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-08-22 11:10:26 -06:00
Chengming Zhou	7222657e51	blk-mq: prealloc tags when increase tagset nr_hw_queues Just like blk_mq_alloc_tag_set(), it's better to prepare all tags before mapping them to queue ctxs in blk_mq_map_swqueue(), which currently has to handle an empty set->tags[]. The benefit is that we can fall back easily if increasing nr_hw_queues fails, instead of just mapping to hctx[0] on failure in blk_mq_map_swqueue(). The fallback path already frees and cleans up tags, so all is good. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230821095602.70742-3-chengming.zhou@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-08-22 08:58:06 -06:00
Chengming Zhou	2bc4d7a355	blk-mq: delete redundant tagset map update when fallback When increasing nr_hw_queues fails, the fallback path uses blk_mq_update_queue_map() to clear and update all maps. The separate update of only HCTX_TYPE_DEFAULT is therefore not needed, so delete it. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230821095602.70742-2-chengming.zhou@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-08-22 08:58:05 -06:00
Chengming Zhou	e1dd7bc930	blk-mq: fix tags leak when shrink nr_hw_queues Although we don't need to realloc set->tags[] when shrinking nr_hw_queues, we need to free them, or these tags will be leaked. How to reproduce: 1. mount -t configfs configfs /mnt 2. modprobe null_blk nr_devices=0 submit_queues=8 3. mkdir /mnt/nullb/nullb0 4. echo 1 > /mnt/nullb/nullb0/power 5. echo 4 > /mnt/nullb/nullb0/submit_queues 6. rmdir /mnt/nullb/nullb0 In step 4, 9 tags are allocated (8 submit queues and 1 poll queue); then in step 5, new_nr_hw_queues = 5 (4 submit queues and 1 poll queue). Finally, in step 6, only these 5 tags are freed and the other 4 tags are leaked. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230821095602.70742-1-chengming.zhou@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-08-22 08:58:05 -06:00
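The leak arithmetic in the commit above (9 tags allocated, 5 freed, 4 leaked) can be modelled with a toy tag set. Every name here (`toy_tag_set`, `toy_shrink`, and so on) is hypothetical; the sketch only shows the idea of freeing the slots beyond the new count at shrink time, not the real blk-mq code.

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_MAX_HW_QUEUES 16

/* Toy stand-in for a tag set; field names are illustrative only. */
struct toy_tag_set {
	void *tags[TOY_MAX_HW_QUEUES];
	unsigned int nr_hw_queues;
};

/* Count the slots that still own an allocation. */
static unsigned int toy_live_tags(const struct toy_tag_set *set)
{
	unsigned int i, n = 0;

	for (i = 0; i < TOY_MAX_HW_QUEUES; i++)
		if (set->tags[i])
			n++;
	return n;
}

/* Allocate one tags object per hardware queue; returns 0 on success. */
static int toy_alloc_tags(struct toy_tag_set *set, unsigned int nr)
{
	unsigned int i;

	if (nr > TOY_MAX_HW_QUEUES)
		return -1;
	for (i = 0; i < nr; i++) {
		set->tags[i] = malloc(1);
		if (!set->tags[i])
			return -1;
	}
	set->nr_hw_queues = nr;
	return 0;
}

/* Shrink: free the slots in [new_nr, old_nr) so they are not leaked. */
static void toy_shrink(struct toy_tag_set *set, unsigned int new_nr)
{
	unsigned int i;

	for (i = new_nr; i < set->nr_hw_queues; i++) {
		free(set->tags[i]);
		set->tags[i] = NULL;
	}
	set->nr_hw_queues = new_nr;
}
```

If `toy_shrink()` only updated `nr_hw_queues` without the loop, a later teardown that walks `nr_hw_queues` entries would miss the four trailing allocations, which mirrors the leak described in the commit.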
Christoph Hellwig	2142b88c37	block: call into the file system for ioctl BLKFLSBUF BLKFLSBUF is a historic ioctl that is called on a file handle to a block device and syncs either the file system mounted on that block device if there is one, or otherwise just the data on the block device. Replace the get_super based syncing with a holder operation to remove the last usage of get_super, and to also support syncing the file system if the block device is not the main block device stored in s_dev. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Message-Id: <20230811100828.1897174-16-hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>	2023-08-21 14:35:32 +02:00
Christoph Hellwig	d8530de5a6	block: call into the file system for bdev_mark_dead Combine the newly merged bdev_mark_dead helper with the existing mark_dead holder operation so that all operations that invalidate a device that is dead or being removed now go through the holder ops. This allows file systems to explicitly shutdown either ASAP (for a surprise removal) or after writing back data (for an orderly removal), and do so not only for the main device. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Message-Id: <20230811100828.1897174-15-hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>	2023-08-21 14:35:32 +02:00
Christoph Hellwig	560e20e4bf	block: consolidate __invalidate_device and fsync_bdev We currently have two interfaces that take a block_device and then find a mounted file system to flush or invalidate data on it. Both are a bit problematic because they only work for the "main" block device that is used as s_dev for the super_block, and because they don't call into the file system at all. Merge the two into a new bdev_mark_dead helper that does both the syncing and invalidation and which is properly documented. This is in preparation of merging the functionality into the ->mark_dead holder operation so that it will work on additional block devices used by a file system and give us a single entry point for invalidation of dead devices or media. Note that a single standalone fsync_bdev call for an obscure ioctl remains for now, but that one will also be dealt with in a bit. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Message-Id: <20230811100828.1897174-14-hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>	2023-08-21 14:35:31 +02:00
Christoph Hellwig	127a5093c7	block: drop the "busy inodes on changed media" log message This message isn't exactly helpful, and file systems already print way more useful messages when shut down while active. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Message-Id: <20230811100828.1897174-13-hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>	2023-08-21 14:35:31 +02:00
Christoph Hellwig	ab6860f62b	block: simplify the disk_force_media_change interface Hard code the events to DISK_EVENT_MEDIA_CHANGE as that is the only useful use case, and drop the superfluous return value. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Message-Id: <20230811100828.1897174-9-hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>	2023-08-21 14:35:30 +02:00
Linus Torvalds	2383ffc41a	Merge tag 'block-6.5-2023-08-19' of git://git.kernel.dk/linux Pull block fixes from Jens Axboe: "Main thing here is the fix for the regression in flush handling which caused IO hangs/stalls for a few reporters. Hopefully that should all be sorted out now. Outside of that, just a few minor fixes for issues that were introduced in this cycle" * tag 'block-6.5-2023-08-19' of git://git.kernel.dk/linux: blk-mq: release scheduler resource when request completes blk-crypto: dynamically allocate fallback profile blk-cgroup: hold queue_lock when removing blkg->q_node drivers/rnbd: restore sysfs interface to rnbd-client	2023-08-19 17:31:46 +02:00
Chengming Zhou	e5c0ca1365	blk-mq: release scheduler resource when request completes Chuck reported [1] an IO hang problem on NFS exports that reside on SATA devices and bisected to commit `615939a2ae` ("blk-mq: defer to the normal submission path for post-flush requests"). We analysed the IO hang problem, found there are two postflush requests waiting for each other. The first postflush request completed the REQ_FSEQ_DATA sequence, so go to the REQ_FSEQ_POSTFLUSH sequence and added in the flush pending list, but failed to blk_kick_flush() because of the second postflush request which is inflight waiting in scheduler queue. The second postflush waiting in scheduler queue can't be dispatched because the first postflush hasn't released scheduler resource even though it has completed by itself. Fix it by releasing scheduler resource when the first postflush request completed, so the second postflush can be dispatched and completed, then make blk_kick_flush() succeed. While at it, remove the check for e->ops.finish_request, as all schedulers set that. Reaffirm this requirement by adding a WARN_ON_ONCE() at scheduler registration time, just like we do for insert_requests and dispatch_request. [1] https://lore.kernel.org/all/7A57C7AE-A51A-4254-888B-FE15CA21F9E9@oracle.com/ Link: https://lore.kernel.org/linux-block/20230819031206.2744005-1-chengming.zhou@linux.dev/ Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202308172100.8ce4b853-oliver.sang@intel.com Fixes: `615939a2ae` ("blk-mq: defer to the normal submission path for post-flush requests") Reported-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Tested-by: Chuck Lever <chuck.lever@oracle.com> Link: https://lore.kernel.org/r/20230813152325.3017343-1-chengming.zhou@linux.dev [axboe: folded in incremental fix and added tags] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-08-19 07:47:17 -06:00
Sweet Tea Dorminy	c984ff1423	blk-crypto: dynamically allocate fallback profile blk_crypto_profile_init() calls lockdep_register_key(), which warns and does not register if the provided memory is a static object. blk-crypto-fallback currently has a static blk_crypto_profile and calls blk_crypto_profile_init() thereupon, resulting in the warning and failure to register. Fortunately it is simple enough to use a dynamically allocated profile and make lockdep function correctly. Fixes: `2fb48d88e7` ("blk-crypto: use dynamic lock class for blk_crypto_profile::lock") Cc: stable@vger.kernel.org Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20230817141615.15387-1-sweettea-kernel@dorminy.me Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-08-18 15:00:39 -06:00
Ming Lei	c164c7bc97	blk-cgroup: hold queue_lock when removing blkg->q_node When blkg is removed from q->blkg_list from blkg_free_workfn(), queue_lock has to be held, otherwise, all kinds of bugs(list corruption, hard lockup, ..) can be triggered from blkg_destroy_all(). Fixes: `f1c006f1c6` ("blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()") Cc: Yu Kuai <yukuai3@huawei.com> Cc: xiaoli feng <xifeng@redhat.com> Cc: Chunyu Hu <chuhu@redhat.com> Cc: Mike Snitzer <snitzer@kernel.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Ming Lei <ming.lei@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230817141751.1128970-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-08-18 15:00:39 -06:00
Tejun Heo	ec14a87ee1	blk-cgroup: Fix NULL deref caused by blkg_policy_data being installed before init blk-iocost sometimes causes the following crash: BUG: kernel NULL pointer dereference, address: 00000000000000e0 ... RIP: 0010:_raw_spin_lock+0x17/0x30 Code: be 01 02 00 00 e8 79 38 39 ff 31 d2 89 d0 5d c3 0f 1f 00 0f 1f 44 00 00 55 48 89 e5 65 ff 05 48 d0 34 7e b9 01 00 00 00 31 c0 <f0> 0f b1 0f 75 02 5d c3 89 c6 e8 ea 04 00 00 5d c3 0f 1f 84 00 00 RSP: 0018:ffffc900023b3d40 EFLAGS: 00010046 RAX: 0000000000000000 RBX: 00000000000000e0 RCX: 0000000000000001 RDX: ffffc900023b3d20 RSI: ffffc900023b3cf0 RDI: 00000000000000e0 RBP: ffffc900023b3d40 R08: ffffc900023b3c10 R09: 0000000000000003 R10: 0000000000000064 R11: 000000000000000a R12: ffff888102337000 R13: fffffffffffffff2 R14: ffff88810af408c8 R15: ffff8881070c3600 FS: 00007faaaf364fc0(0000) GS:ffff88842fdc0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000000e0 CR3: 00000001097b1000 CR4: 0000000000350ea0 Call Trace: <TASK> ioc_weight_write+0x13d/0x410 cgroup_file_write+0x7a/0x130 kernfs_fop_write_iter+0xf5/0x170 vfs_write+0x298/0x370 ksys_write+0x5f/0xb0 __x64_sys_write+0x1b/0x20 do_syscall_64+0x3d/0x80 entry_SYSCALL_64_after_hwframe+0x46/0xb0 This happens because iocg->ioc is NULL. The field is initialized by ioc_pd_init() and never cleared. The NULL deref is caused by blkcg_activate_policy() installing blkg_policy_data before initializing it. blkcg_activate_policy() was doing the following: 1. Allocate pd's for all existing blkg's and install them in blkg->pd[]. 2. Initialize all pd's. 3. Online all pd's. blkcg_activate_policy() only grabs the queue_lock and may release and re-acquire the lock as allocation may need to sleep. ioc_weight_write() grabs blkcg->lock and iterates all its blkg's. The two can race and if ioc_weight_write() runs during #1 or between #1 and #2, it can encounter a pd which is not initialized yet, leading to crash. 
The crash can be reproduced with the following script: #!/bin/bash echo +io > /sys/fs/cgroup/cgroup.subtree_control systemd-run --unit touch-sda --scope dd if=/dev/sda of=/dev/null bs=1M count=1 iflag=direct echo 100 > /sys/fs/cgroup/system.slice/io.weight bash -c "echo '8:0 enable=1' > /sys/fs/cgroup/io.cost.qos" & sleep .2 echo 100 > /sys/fs/cgroup/system.slice/io.weight with the following patch applied: > diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c > index fc49be622e05..38d671d5e10c 100644 > --- a/block/blk-cgroup.c > +++ b/block/blk-cgroup.c > @@ -1553,6 +1553,12 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol) > pd->online = false; > } > > + if (system_state == SYSTEM_RUNNING) { > + spin_unlock_irq(&q->queue_lock); > + ssleep(1); > + spin_lock_irq(&q->queue_lock); > + } > + > /* all allocated, init in the same order */ > if (pol->pd_init_fn) > list_for_each_entry_reverse(blkg, &q->blkg_list, q_node) I don't see a reason why all pd's should be allocated, initialized and onlined together. The only ordering requirement is that parent blkgs be initialized and onlined before children, which is guaranteed by the walking order. Let's fix the bug by allocating, initializing and onlining pd for each blkg and holding blkcg->lock over initialization and onlining. This ensures that an installed blkg is always fully initialized and onlined, removing the race window. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Breno Leitao <leitao@debian.org> Fixes: `9d179b8654` ("blkcg: Fix multiple bugs in blkcg_activate_policy()") Link: https://lore.kernel.org/r/ZN0p5_W-Q9mAHBVY@slm.duckdns.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-08-17 19:21:05 -06:00
Kent Overstreet	649f070e69	block: Bring back zero_fill_bio_iter This reverts `6f822e1b5d` - this helper is used by bcachefs. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Link: https://lore.kernel.org/r/20230813182636.2966159-4-kent.overstreet@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-08-14 15:40:42 -06:00
Kent Overstreet	168145f617	block: Allow bio_iov_iter_get_pages() with bio->bi_bdev unset bio_iov_iter_get_pages() trims the IO based on the block size of the block device the IO will be issued to. However, bcachefs is a multi device filesystem; when we're creating the bio we don't yet know which block device the bio will be submitted to - we have to handle the alignment checks elsewhere. Thus this is needed to avoid a null ptr deref. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Link: https://lore.kernel.org/r/20230813182636.2966159-3-kent.overstreet@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-08-14 15:40:42 -06:00
Kent Overstreet	7ba3792718	block: Add some exports for bcachefs - bio_set_pages_dirty(), bio_check_pages_dirty() - dio path - blk_status_to_str() - error messages - bio_add_folio() - this should definitely be exported for everyone, it's the modern version of bio_add_page() Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Cc: linux-block@vger.kernel.org Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Link: https://lore.kernel.org/r/20230813182636.2966159-2-kent.overstreet@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-08-14 15:40:42 -06:00
Linus Torvalds	360e694282	Merge tag 'block-6.5-2023-08-11' of git://git.kernel.dk/linux Pull block fixes from Jens Axboe: - NVMe pull request via Keith: - Fixes for request_queue state (Ming) - Another uuid quirk (August) - RCU poll fix for NVMe (Ming) - Fix for an IO stall with polled IO (me) - Fix for blk-iocost stats enable/disable accounting (Chengming) - Regression fix for large pages for zram (Christoph) * tag 'block-6.5-2023-08-11' of git://git.kernel.dk/linux: nvme: core: don't hold rcu read lock in nvme_ns_chr_uring_cmd_iopoll blk-iocost: fix queue stats accounting block: don't make REQ_POLLED imply REQ_NOWAIT block: get rid of unused plug->nowait flag zram: take device and not only bvec offset into account nvme-pci: add NVME_QUIRK_BOGUS_NID for Samsung PM9B1 256G and 512G nvme-rdma: fix potential unbalanced freeze & unfreeze nvme-tcp: fix potential unbalanced freeze & unfreeze nvme: fix possible hang when removing a controller during error recovery	2023-08-11 12:14:08 -07:00
Jens Axboe	18267a0365	block: fix bad lockdep annotation in blk-iolatency A previous commit added a lockdep annotation, but botched it. Use the right type. Fixes: `4eb44d1076` ("block: remove init_mutex and open-code blk_iolatency_try_init") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-08-10 17:24:53 -06:00
Li Lingfeng	4eb44d1076	block: remove init_mutex and open-code blk_iolatency_try_init Commit `a13696b83d` ("blk-iolatency: Make initialization lazy") adds a mutex named "init_mutex" in blk_iolatency_try_init for the race condition of initializing RQ_QOS_LATENCY. Now a new lock has been added to struct request_queue by commit `a13bd91be2` ("block/rq_qos: protect rq_qos apis with a new lock"), and it is held in blkg_conf_open_bdev before calling blk_iolatency_init. So it's not necessary to keep init_mutex in blk_iolatency_try_init; just remove it. Since init_mutex has been removed, blk_iolatency_try_init can be open-coded back into iolatency_set_limit() like ioc_qos_write(). Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Link: https://lore.kernel.org/r/20230810035111.2236335-1-lilingfeng@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-08-10 07:20:31 -06:00
Jinyoung Choi	0ece1d649b	bio-integrity: create multi-page bvecs in bio_integrity_add_page() In general, the bvec data structure consists of one for physically continuous pages. But, in the bvec configuration for bip, physically continuous integrity pages are composed of each bvec. Allow bio_integrity_add_page() to create multi-page bvecs, just like the bio payloads. This simplifies adding larger payloads, and fixes support for non-tiny workloads with nvme, which stopped using scatterlist for metadata a while ago. Cc: Christoph Hellwig <hch@lst.de> Cc: Martin K. Petersen <martin.petersen@oracle.com> Fixes: `783b94bd92` ("nvme-pci: do not build a scatterlist to map metadata") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jinyoung Choi <j-young.choi@samsung.com> Tested-by: "Martin K. Petersen" <martin.petersen@oracle.com> Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20230803025202epcms2p82f57cbfe32195da38c776377b55aed59@epcms2p8 Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-08-09 16:05:35 -06:00
Jinyoung Choi	d1f04c2e23	bio-integrity: cleanup adding integrity pages to bip's bvec. bio_integrity_add_page() returns the added length if successful, else 0, just as bio_add_page does. Simplify the return value checking in bio_integrity_prep to not deal with a > 0 but < len case that can't happen. Cc: Christoph Hellwig <hch@lst.de> Cc: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jinyoung Choi <j-young.choi@samsung.com> Tested-by: "Martin K. Petersen" <martin.petersen@oracle.com> Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20230803025058epcms2p5a4d0db5da2ad967668932d463661c633@epcms2p5 Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-08-09 16:05:35 -06:00
Jinyoung Choi	80814b8e35	bio-integrity: update the payload size in bio_integrity_add_page() Previously, the bip's bi_size was set before any integrity pages were added. If a problem occurred in the process of adding pages for the bip, the bi_size mismatch had to be dealt with. Now bi_size is updated when the page is successfully added to the bvec. The parts affected by the change are also included in this commit. Cc: Christoph Hellwig <hch@lst.de> Cc: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jinyoung Choi <j-young.choi@samsung.com> Tested-by: "Martin K. Petersen" <martin.petersen@oracle.com> Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20230803024956epcms2p38186a17392706650c582d38ef3dbcd32@epcms2p3 Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-08-09 16:05:35 -06:00
Jinyoung Choi	7c8998f75d	block: make bvec_try_merge_hw_page() non-static This will be used for multi-page configuration for integrity payload. Cc: Christoph Hellwig <hch@lst.de> Cc: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jinyoung Choi <j-young.choi@samsung.com> Tested-by: "Martin K. Petersen" <martin.petersen@oracle.com> Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20230803024827epcms2p838d9e9131492c86a159fff25d195658f@epcms2p8 Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-08-09 16:05:35 -06:00
Chengming Zhou	f099a108ca	blk-iocost: fix queue stats accounting The q->stats->accounting counter is not only used by iocost, but iocost only increases this counter and never decreases it. So queue stats accounting will always be enabled after using iocost once. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230804070609.31623-1-chengming.zhou@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-08-09 16:04:14 -06:00
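The accounting problem above is a plain reference-count imbalance: a shared counter that one user increments but never decrements can never return to zero. A minimal sketch follows, with hypothetical names (`toy_queue_stats` and its helpers are not the kernel's API):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a shared "accounting enabled" counter. */
struct toy_queue_stats {
	int accounting;		/* number of users that asked for accounting */
};

/* Each user that needs stats takes one reference. */
static void toy_stats_enable(struct toy_queue_stats *s)
{
	s->accounting++;
}

/* Each user must drop its reference when it no longer needs stats. */
static void toy_stats_disable(struct toy_queue_stats *s)
{
	if (s->accounting > 0)
		s->accounting--;
}

/* Accounting stays on while at least one reference remains. */
static bool toy_stats_enabled(const struct toy_queue_stats *s)
{
	return s->accounting > 0;
}
```

Accounting only turns off once every user that enabled it has also disabled it, so a user that never calls the disable side keeps it on indefinitely.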
Jens Axboe	2bc0576925	block: don't make REQ_POLLED imply REQ_NOWAIT Normally these two flags do go together, as the issuer of polled IO generally cannot wait for resources that will get freed as part of IO completion. This is because that very task is the one that will complete the request and free those resources, hence that would introduce a deadlock. But it is possible to have someone else issue the polled IO, eg via io_uring if the request is punted to io-wq. For that case, it's fine to have the task block on IO submission, as it is not the same task that will be completing the IO. It's completely up to the caller to ask for both polled and nowait IO separately! If we don't allow polled IO where IOCB_NOWAIT isn't set in the kiocb, then we can run into repeated -EAGAIN submissions and not make any progress. Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-08-09 16:04:07 -06:00
Jens Axboe	d74f714896	block: get rid of unused plug->nowait flag This was introduced to add a plug based way of signaling nowait issues, but we have since moved on from that. Kill the old dead code, nobody is setting it anymore. Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-08-08 15:50:37 -06:00
Zhiguo Niu	d47f9717e5	block/mq-deadline: use correct way to throttling write requests The original formula was inaccurate: dd->async_depth = max(1UL, 3 * q->nr_requests / 4); For write requests, when we assign a tag from sched_tags, data->shallow_depth will be passed to sbitmap_find_bit, see the following code: nr = sbitmap_find_bit_in_word(&sb->map[index], min_t (unsigned int, __map_depth(sb, index), depth), alloc_hint, wrap); The smaller of data->shallow_depth and __map_depth(sb, index) will be used as the maximum range when allocating bits. For a mmc device (one hw queue, deadline I/O scheduler): q->nr_requests = sched_tags = 128, so according to the previous calculation method, dd->async_depth = data->shallow_depth = 96, and the platform is 64 bits with 8 cpus, sched_tags.bitmap_tags.sb.shift=5, sb.maps[]=32/32/32/32. 32 is smaller than 96, so whether it is a read or a write I/O, tags can be allocated up to the maximum range each time, which has no throttling effect. In addition, following the methods of the bfq/kyber I/O schedulers, limit ratios are calculated based on sched_tags.bitmap_tags.sb.shift. This patch really throttles write requests. Fixes: `07757588e5` ("block/mq-deadline: Reserve 25% of scheduler tags for synchronous requests") Signed-off-by: Zhiguo Niu <zhiguo.niu@unisoc.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/1691061162-22898-1-git-send-email-zhiguo.niu@unisoc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-08-08 15:46:41 -06:00
Christoph Hellwig	925c86a19b	fs: add CONFIG_BUFFER_HEAD Add a new config option that controls building the buffer_head code, and select it from all file systems and stacking drivers that need it. For the block device nodes an alternative iomap based buffered I/O path is provided when buffer_head support is not enabled, and iomap needs a small tweak to define the IOMAP_F_BUFFER_HEAD flag to 0 to not call into the buffer_head code when it doesn't exist. Otherwise this is just Kconfig and ifdef changes. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20230801172201.1923299-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-08-02 09:13:09 -06:00
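The numbers in the mq-deadline commit above can be checked with a short worked example. The helpers below are hypothetical; in particular, `to_word_depth()` is an assumed way of scaling a whole-queue depth to a single sbitmap word based on the shift, and may differ from the helper the patch actually adds.

```c
#include <assert.h>

static unsigned int min_u(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/* Old limit: 75% of nr_requests, expressed in whole-queue units. */
static unsigned int old_async_depth(unsigned int nr_requests)
{
	unsigned int depth = 3 * nr_requests / 4;

	return depth ? depth : 1;
}

/* Bits actually searchable in one sbitmap word for a given shallow depth. */
static unsigned int effective_word_limit(unsigned int shift,
					 unsigned int shallow_depth)
{
	return min_u(1u << shift, shallow_depth);
}

/* Assumed scaling: convert a whole-queue depth to a per-word depth, rounding up. */
static unsigned int to_word_depth(unsigned int qdepth, unsigned int shift,
				  unsigned int nr_requests)
{
	return ((qdepth << shift) + nr_requests - 1) / nr_requests;
}
```

With nr_requests = 128 and shift = 5, the old depth of 96 exceeds the 32 bits in a word, so the limit never bites; scaled per word it becomes 24 of 32, which does reserve a quarter of each word for synchronous requests.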
Christoph Hellwig	487c607df7	block: use iomap for writes to block devices Use iomap in buffer_head compat mode to write to block devices. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Pankaj Raghav <p.raghav@samsung.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/r/20230801172201.1923299-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-08-02 09:13:09 -06:00
Christoph Hellwig	a05f7bd957	block: stop setting ->direct_IO Direct I/O on block devices now never goes through aops->direct_IO. Stop setting it and set the FMODE_CAN_ODIRECT in ->open instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20230801172201.1923299-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-08-02 09:13:09 -06:00
Christoph Hellwig	727cfe9767	block: open code __generic_file_write_iter for blkdev writes Open code __generic_file_write_iter to remove the indirect call into ->direct_IO and to prepare using the iomap based write code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Link: https://lore.kernel.org/r/20230801172201.1923299-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-08-02 09:13:09 -06:00
Jinyoung Choi	51d74ec9b6	block: cleanup bio_integrity_prep If a problem occurs in the process of creating an integrity payload, the status of bio is always BLK_STS_RESOURCE. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jinyoung Choi <j-young.choi@samsung.com> Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20230725051839epcms2p8e4d20ad6c51326ad032e8406f59d0aaa@epcms2p8 Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-07-25 20:30:54 -06:00
Bart Van Assche	65a558f66c	block: Improve performance for BLK_MQ_F_BLOCKING drivers blk_mq_run_queue() runs the queue asynchronously if BLK_MQ_F_BLOCKING has been set. This is suboptimal since running the queue asynchronously is slower than running the queue synchronously. This patch modifies blk_mq_run_queue() as follows if BLK_MQ_F_BLOCKING has been set: - Run the queue synchronously if it is allowed to sleep. - Run the queue asynchronously if it is not allowed to sleep. Additionally, blk_mq_run_hw_queue(hctx, false) calls are modified into blk_mq_run_hw_queue(hctx, hctx->flags & BLK_MQ_F_BLOCKING) if the caller may be invoked from atomic context. The following caller chains have been reviewed: blk_mq_run_hw_queue(hctx, false) blk_mq_get_tag() /* may sleep, hence the functions it calls may also sleep */ blk_execute_rq() /* may sleep */ blk_mq_run_hw_queues(q, async=false) blk_freeze_queue_start() /* may sleep */ blk_mq_requeue_work() /* may sleep */ scsi_kick_queue() scsi_requeue_run_queue() /* may sleep */ scsi_run_host_queues() scsi_ioctl_reset() /* may sleep */ blk_mq_insert_requests(hctx, ctx, list, run_queue_async=false) blk_mq_dispatch_plug_list(plug, from_sched=false) blk_mq_flush_plug_list(plug, from_schedule=false) __blk_flush_plug(plug, from_schedule=false) blk_add_rq_to_plug() blk_mq_submit_bio() /* may sleep if REQ_NOWAIT has not been set */ blk_mq_plug_issue_direct() blk_mq_flush_plug_list() /* see above */ blk_mq_dispatch_plug_list(plug, from_sched=false) blk_mq_flush_plug_list() /* see above */ blk_mq_try_issue_directly() blk_mq_submit_bio() /* may sleep if REQ_NOWAIT has not been set */ blk_mq_try_issue_list_directly(hctx, list) blk_mq_insert_requests() /* see above */ Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20230721172731.955724-4-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-07-24 20:13:12 -06:00
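The two rules stated in the commit above (run synchronously when sleeping is allowed, asynchronously when it is not) amount to a small decision function. The sketch below is an assumption-laden model: `TOY_F_BLOCKING`, `toy_run_mode`, and the `may_sleep` parameter are hypothetical stand-ins, and the real code infers the calling context rather than taking it as an argument.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_F_BLOCKING (1u << 0)	/* stand-in for a "driver may block" flag */

enum toy_run_mode { TOY_RUN_SYNC, TOY_RUN_ASYNC };

/*
 * Decide how to run the queue.
 *   async     - the caller explicitly asked for asynchronous execution
 *   may_sleep - whether the calling context is allowed to sleep
 */
static enum toy_run_mode toy_pick_run_mode(unsigned int flags, bool async,
					   bool may_sleep)
{
	if (async)
		return TOY_RUN_ASYNC;
	/* A blocking driver may sleep in dispatch, so atomic callers must defer. */
	if ((flags & TOY_F_BLOCKING) && !may_sleep)
		return TOY_RUN_ASYNC;
	return TOY_RUN_SYNC;
}
```

The point of the change is the third case: a blocking driver called from a context that may sleep no longer pays for the asynchronous detour.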
Christoph Hellwig	ae42f0b3bf	block: don't pass a bio to bio_try_merge_hw_seg There is no good reason to pass the bio to bio_try_merge_hw_seg. Just pass the current bvec and rename the function to bvec_try_merge_hw_page. This will allow reusing this function for supporting multi-page integrity payload bvecs. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jinyoung Choi <j-young.choi@samsung.com> Link: https://lore.kernel.org/r/20230724165433.117645-9-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-07-24 19:55:16 -06:00
Christoph Hellwig	858c708d9e	block: move the bi_size update out of __bio_try_merge_page The update of bi_size is the only thing in __bio_try_merge_page that needs a bio. Move it to the callers, and merge __bio_try_merge_page and page_is_mergeable into a single bvec_try_merge_page that only takes the current bvec instead of a full bio. This will allow reusing this function for supporting multi-page integrity payload bvecs. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jinyoung Choi <j-young.choi@samsung.com> Link: https://lore.kernel.org/r/20230724165433.117645-8-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-07-24 19:55:16 -06:00
Christoph Hellwig	80232b5203	block: downgrade a bio_full call in bio_add_page bio_add_page already checks that there is space in bi_size a little earlier. So after we failed to add to an existing segment, just check that there is another one available instead of duplicating the bi_size check. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jinyoung Choi <j-young.choi@samsung.com> Link: https://lore.kernel.org/r/20230724165433.117645-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-07-24 19:55:16 -06:00
Christoph Hellwig	613699050a	block: move the bi_size overflow check in __bio_try_merge_page Checking for availability in bi_size in a function that attempts to merge into an existing segment is a bit odd, as the limit also applies when adding a new segment. This code works fine as we always call __bio_try_merge_page, but contributes to sub-optimal calling conventions and doesn't lead to clear code. Move it to two of the callers instead, the third one already has a more strict check that includes max_hw_segments anyway. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jinyoung Choi <j-young.choi@samsung.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20230724165433.117645-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-07-24 19:55:16 -06:00
Christoph Hellwig	0eca8b6f97	block: move the bi_vcnt check out of __bio_try_merge_page Move the bi_vcnt out of __bio_try_merge_page and into the two callers that don't already have it in preparation for additional changes to __bio_try_merge_page. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jinyoung Choi <j-young.choi@samsung.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20230724165433.117645-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-07-24 19:55:16 -06:00
Christoph Hellwig	939e1a3703	block: move the BIO_CLONED checks out of __bio_try_merge_page __bio_try_merge_page is a way too low-level helper to assert that the bio is not cloned. Move the check into bio_add_page and bio_iov_iter_get_pages instead, which are the high level entry points that should enforce this invariant. bio_add_hw_page already has this check, covering the third (indirect) caller of __bio_try_merge_page. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jinyoung Choi <j-young.choi@samsung.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20230724165433.117645-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-07-24 19:55:16 -06:00
Christoph Hellwig	6850b2dd5c	block: use SECTOR_SHIFT bio_add_hw_page Use the SECTOR_SHIFT magic constant instead of the magic number. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jinyoung Choi <j-young.choi@samsung.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20230724165433.117645-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-07-24 19:55:16 -06:00
Christoph Hellwig	cd1d83e24e	block: tidy up the bio full checks in bio_add_hw_page bio_add_hw_page already checks if the number of bytes trying to be added even fit into max_hw_sectors limit of the queue. Remove the call to bio_full and just do a check for the smaller of the number of segments in the bio and the queue max segments limit, and do this cheap check before the more expensive gap to previous check. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jinyoung Choi <j-young.choi@samsung.com> Link: https://lore.kernel.org/r/20230724165433.117645-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-07-24 19:55:16 -06:00
Linus Torvalds	f036d67c02	block-6.5-2023-07-21 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmS629wQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpv/7D/99ysE5ZszmjxNOmyy1lGfqtQnaTLuToRsl wB16umIPAFfye5r4TV8l9GZuUyI7FU8LySglu0Y0qMKmCp+kJKLh90kB281Co4Dn yp1AbqlTorAlG4ElQJBRaQr4kaqqvI2tzeVmFdUhIE1oX2e9OX/O+YKa8k1JfsKI oecChQgodlPxX3wusItgiyvZKl2q2+mivg5E6cqiGIgP3uF8fmOQCbio4Vm8ZSxb TO8JEfBTiXslR+CvJD3Gi96pzexN1qCUed8/7FDiIUufhETmwqSIOo89GxzGAQ6O 7o/83IkqgXPHjKLYs3R4/jhHPXZmXmvDZHWIiSg+KLOFqxxWmRPNJ6V6igIBP8SG eu5PTA7SDGtvIXePpu38FTPmSiUW7MbGhnjqY8u64Je6MaQ8l28KN7xkFtmxV+n4 hgB0gr6uKBnXMKZHobk0yJeUUI/L/0ESzbVPDHY8JM/rQCsp1eSNQDpZoVjPWZmg lMGYmOq57oPA20LVch7U3gUFhD4CJ7c3e2/EzJdJVjsTveTYieBCEESQErFbMcEr VuRZSAGnPyXQ4yF4wG93x4sDye28ZFS/Q9c6Q3DCUxctDkCz4eY1+vmdX+NJXwDA aYXCyyKzk18udbKvV0QvTuDTb6PrJDPxbFagCveibPTtP4XDMv1LvpdZPUPJ/HGX 4xA1mrsGJA== =e2OR -----END PGP SIGNATURE----- Merge tag 'block-6.5-2023-07-21' of git://git.kernel.dk/linux Pull block fixes from Jens Axboe: - Fix for loop regressions (Mauricio) - Fix a potential stall with batched wakeups in sbitmap (David) - Fix for stall with recursive plug flushes (Ross) - Skip accounting of empty requests for blk-iocost (Chengming) - Remove a dead field in struct blk_mq_hw_ctx (Chengming) * tag 'block-6.5-2023-07-21' of git://git.kernel.dk/linux: loop: do not enforce max_loop hard limit by (new) default loop: deprecate autoloading callback loop_probe() sbitmap: fix batching wakeup blk-iocost: skip empty flush bio in iocost blk-mq: delete dead struct blk_mq_hw_ctx->queued field blk-mq: Fix stall due to recursive flush plug	2023-07-22 11:05:15 -07:00
Nitesh Shetty	8f63fef586	block: refactor to use helper Reduce some code by making use of bio_integrity_bytes(). Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com> Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20230719121608.32105-1-nj.shetty@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-07-20 16:27:52 -06:00
Chengming Zhou	013adcbef1	blk-iocost: skip empty flush bio in iocost The flush bio may have data, may have no data (empty flush), we couldn't calculate cost for empty flush bio. So we'd better just skip it for now. Another side effect is that empty flush bio's bio_end_sector() is 0, cause iocg->cursor reset to 0, may break the cost calculation of other bios. This isn't good enough, since flush bio still consume the device bandwidth, but flush request is special, can be merged randomly in the flush state machine, we don't know how to calculate cost for it for now. Its completion time also has flaws, which may include the pre-flush or post-flush completion time, but I don't know if we need to fix that and how to fix it. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230720121441.1408522-1-chengming.zhou@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-07-20 14:02:17 -06:00
Darrick J. Wong	880b957785	fs: distinguish between user initiated freeze and kernel initiated freeze Userspace can freeze a filesystem using the FIFREEZE ioctl or by suspending the block device; this state persists until userspace thaws the filesystem with the FITHAW ioctl or resuming the block device. Since commit `18e9e5104f` ("Introduce freeze_super and thaw_super for the fsfreeze ioctl") we only allow the first freeze command to succeed. The kernel may decide that it is necessary to freeze a filesystem for its own internal purposes, such as suspends in progress, filesystem fsck activities, or quiescing a device prior to removal. Userspace thaw commands must never break a kernel freeze, and kernel thaw commands shouldn't undo userspace's freeze command. Introduce a couple of freeze holder flags and wire it into the sb_writers state. One kernel and one userspace freeze are allowed to coexist at the same time; the filesystem will not thaw until both are lifted. I wonder if the f2fs/gfs2 code should be using a kernel freeze here, but for now we'll use FREEZE_HOLDER_USERSPACE to preserve existing behaviors. Cc: mcgrof@kernel.org Cc: jack@suse.cz Cc: hch@infradead.org Cc: ruansy.fnst@fujitsu.com Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz>	2023-07-17 09:00:09 -07:00
Chengming Zhou	81ada09cc2	blk-flush: reuse rq queuelist in flush state machine Since we don't need to maintain inflight flush_data requests list anymore, we can reuse rq->queuelist for flush pending list. Note in mq_flush_data_end_io(), we need to re-initialize rq->queuelist before reusing it in the state machine when end, since the rq->rq_next also reuse it, may have corrupted rq->queuelist by the driver. This patch decrease the size of struct request by 16 bytes. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230717040058.3993930-5-chengming.zhou@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-07-17 08:18:21 -06:00
Chengming Zhou	b175c86739	blk-flush: count inflight flush_data requests The flush state machine uses a doubly linked list to link all inflight flush_data requests, to avoid issuing separate post-flushes for flush_data requests that share a PREFLUSH. So we can't reuse rq->queuelist, which is why we need rq->flush.list. In preparation for the next patch, which reuses rq->queuelist for the flush state machine, change the doubly linked list to an unsigned long counter that counts all inflight flush_data requests. This is OK since we only need to know whether there is any inflight flush_data request, so an unsigned long counter is sufficient. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230717040058.3993930-4-chengming.zhou@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-07-17 08:18:21 -06:00
Chengming Zhou	28b2412374	blk-flush: fix rq->flush.seq for post-flush requests If the policy == (REQ_FSEQ_DATA \| REQ_FSEQ_POSTFLUSH), it means that the data sequence and post-flush sequence need to be done for this request. The rq->flush.seq should record what sequences have been done (or don't need to be done). So in this case, pre-flush doesn't need to be done, we should init rq->flush.seq to REQ_FSEQ_PREFLUSH not REQ_FSEQ_POSTFLUSH. Fixes: `615939a2ae` ("blk-mq: defer to the normal submission path for post-flush requests") Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230717040058.3993930-3-chengming.zhou@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-07-17 08:18:21 -06:00
Chengming Zhou	660e802c76	blk-mq: use percpu csd to remote complete instead of per-rq csd If request need to be completed remotely, we insert it into percpu llist, and smp_call_function_single_async() if llist is empty previously. We don't need to use per-rq csd, percpu csd is enough. And the size of struct request is decreased by 24 bytes. This way is cleaner, and looks correct, given block softirq is guaranteed to be scheduled to consume the list if one new request is added to this percpu list, either smp_call_function_single_async() returns -EBUSY or 0. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230717040058.3993930-2-chengming.zhou@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-07-17 08:18:21 -06:00
Christoph Hellwig	43c9835b14	block: don't allow enabling a cache on devices that don't support it Currently the write_cache attribute allows enabling the QUEUE_FLAG_WC flag on devices that never claimed the capability. Fix that by adding a QUEUE_FLAG_HW_WC flag that is set by blk_queue_write_cache and guards re-enabling the cache through sysfs. Note that any rescan that calls blk_queue_write_cache will still re-enable the write cache as in the current code. Fixes: `93e9d8e836` ("block: add ability to flag write back caching on a device") Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230707094239.107968-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-07-17 08:18:18 -06:00
Christoph Hellwig	c4e21bcd0f	block: cleanup queue_wc_store Get rid of the local queue_wc_store variable and handle setting and clearing the QUEUE_FLAG_WC flag directly instead of the if / else if. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230707094239.107968-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-07-17 08:18:17 -06:00
Linus Torvalds	be522ac7cd	SCSI fixes on 20230714 This is a bunch of small driver fixes and a larger rework of zone disk handling (which reaches into blk and nvme). The aacraid array-bounds fix is now critical since the security people turned on -Werror for some build tests, which now fail without it. Signed-off-by: James E.J. Bottomley <jejb@linux.ibm.com> -----BEGIN PGP SIGNATURE----- iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCZLGSiCYcamFtZXMuYm90 dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishd/BAPwO2i4t 5uzhcWihoYaIZ6x07oEhgOP/o1h5n5mM908AyAEA6s2hQKDoIxjJexqvkS7lPjni P8VMcfvOmdsLDCD3nJ4= =+10g -----END PGP SIGNATURE----- Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi Pull SCSI fixes from James Bottomley: "This is a bunch of small driver fixes and a larger rework of zone disk handling (which reaches into blk and nvme). The aacraid array-bounds fix is now critical since the security people turned on -Werror for some build tests, which now fail without it" * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: scsi: storvsc: Handle SRB status value 0x30 scsi: block: Improve checks in blk_revalidate_disk_zones() scsi: block: virtio_blk: Set zone limits before revalidating zones scsi: block: nullblk: Set zone limits before revalidating zones scsi: nvme: zns: Set zone limits before revalidating zones scsi: sd_zbc: Set zone limits before revalidating zones scsi: ufs: core: Add support for qTimestamp attribute scsi: aacraid: Avoid -Warray-bounds warning scsi: ufs: ufs-mediatek: Add dependency for RESET_CONTROLLER scsi: ufs: core: Update contact email for monitor sysfs nodes scsi: scsi_debug: Remove dead code scsi: qla2xxx: Use vmalloc_array() and vcalloc() scsi: fnic: Use vmalloc_array() and vcalloc() scsi: qla2xxx: Fix error code in qla2x00_start_sp() scsi: qla2xxx: Silence a static checker warning scsi: lpfc: Fix a possible data race in lpfc_unregister_fcf_rescan()	2023-07-14 19:57:29 -07:00
Ross Lagerwall	7090426351	blk-mq: Fix stall due to recursive flush plug We have seen rare IO stalls as follows: * blk_mq_plug_issue_direct() is entered with an mq_list containing two requests. * For the first request, it sets last == false and enters the driver's queue_rq callback. * The driver queue_rq callback indirectly calls schedule() which calls blk_flush_plug(). This may happen if the driver has the BLK_MQ_F_BLOCKING flag set and is allowed to sleep in ->queue_rq. * blk_flush_plug() handles the remaining request in the mq_list. mq_list is now empty. * The original call to queue_rq resumes (with last == false). * The loop in blk_mq_plug_issue_direct() terminates because there are no remaining requests in mq_list. The IO is now stalled because the last request submitted to the driver had last == false and there was no subsequent call to commit_rqs(). Fix this by returning early in blk_mq_flush_plug_list() if rq_count is 0 which it will be in the recursive case, rather than checking if the mq_list is empty. At the same time, adjust one of the callers to skip the mq_list empty check as it is not necessary. Fixes: `dc5fc361d8` ("block: attempt direct issue of plug list") Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20230714101106.3635611-1-ross.lagerwall@citrix.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-07-14 13:44:24 -06:00
Christoph Hellwig	9f87fc4d72	block: queue data commands from the flush state machine at the head We used to insert the data commands following a pre-flush to the head of the queue until commit `1e82fadfc6` ("blk-mq: do not do head insertions post-pre-flush commands"). Not doing this seems to cause hangs of such commands on NFS workloads when exported from file systems with SATA SSDs. I have no idea why this would starve these workloads, but doing a semantic revert of this patch (which looks quite different due to various other changes) fixes the hangs. Fixes: `1e82fadfc6` ("blk-mq: do not do head insertions post-pre-flush commands") Reported-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Chuck Lever <chuck.lever@oracle.com> Link: https://lore.kernel.org/r/20230714143014.11879-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-07-14 08:42:58 -06:00
Chengming Zhou	5c17f45e91	blk-mq: fix start_time_ns and alloc_time_ns for pre-allocated rq The iocost rely on rq start_time_ns and alloc_time_ns to tell saturation state of the block device. Most of the time request is allocated after rq_qos_throttle() and its alloc_time_ns or start_time_ns won't be affected. But for plug batched allocation introduced by the commit `47c122e35d` ("block: pre-allocate requests if plug is started and is a batch"), we can rq_qos_throttle() after the allocation of the request. This is what the blk_mq_get_cached_request() does. In this case, the cached request alloc_time_ns or start_time_ns is much ahead if blocked in any qos ->throttle(). Fix it by setting alloc_time_ns and start_time_ns to now when the allocated request is actually used. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230710105516.2053478-1-chengming.zhou@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-07-13 12:30:57 -06:00
Bart Van Assche	f673b4f5bd	block/mq-deadline: Fix a bug in deadline_from_pos() A bug was introduced in deadline_from_pos() while implementing the suggestion to use round_down() in the following code: pos -= bdev_offset_from_zone_start(rq->q->disk->part0, pos); This patch makes deadline_from_pos() use round_down() such that 'pos' is rounded down. Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Closes: https://lore.kernel.org/all/5zthzi3lppvcdp4nemum6qck4gpqbdhvgy4k3qwguhgzxc4quj@amulvgycq67h/ Cc: Christoph Hellwig <hch@lst.de> Cc: Damien Le Moal <dlemoal@kernel.org> Fixes: `0effb390c4` ("block: mq-deadline: Handle requeued requests correctly") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20230712173344.2994513-1-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-07-12 11:37:41 -06:00
Martin K. Petersen	e96277a570	Merge branch '6.5/scsi-staging' into 6.5/scsi-fixes Pull in the currently staged SCSI fixes for 6.5. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>	2023-07-11 12:15:15 -04:00
Damien Le Moal	03e51c4a74	scsi: block: Improve checks in blk_revalidate_disk_zones() blk_revalidate_disk_zones() implements checks of the zones of a zoned block device, verifying that the zone size is a power of 2 number of sectors, that all zones (except possibly the last one) have the same size and that zones cover the entire addressing space of the device. While these checks are appropriate to verify that well tested hardware devices have an adequate zone configurations, they lack in certain areas which may result in issues with emulated devices implemented with user drivers such as ublk or tcmu. Specifically, this function does not check if the device driver indicated support for the mandatory zone append writes, that is, if the device max_zone_append_sectors queue limit is set to a non-zero value. Additionally, invalid zones such as a zero length zone with a start sector equal to the device capacity will not be detected and result in out of bounds use of the zone bitmaps prepared with the callback function blk_revalidate_zone_cb(). Improve blk_revalidate_disk_zones() to address these inadequate checks, relying on the fact that all device drivers supporting zoned block devices must set the device zone size (chunk_sectors queue limit) and the max_zone_append_sectors queue limit before executing this function. The check for a non-zero max_zone_append_sectors value is done in blk_revalidate_disk_zones() before executing the zone report. The zone report callback function blk_revalidate_zone_cb() is also modified to add a check that a zone start is below the device capacity. The check that the zone size is a power of 2 number of sectors is moved to blk_revalidate_disk_zones() as the zone size is already known. Similarly, the number of zones of the device can be calculated in blk_revalidate_disk_zones() before executing the zone report. 
The kdoc comment for blk_revalidate_disk_zones() is also updated to mention that device drivers must set the device zone size and the max_zone_append_sectors queue limit before calling this function. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230703024812.76778-6-dlemoal@kernel.org Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>	2023-07-05 21:58:10 -04:00
Eric Biggers	2fb48d88e7	blk-crypto: use dynamic lock class for blk_crypto_profile::lock When a device-mapper device is passing through the inline encryption support of an underlying device, calls to blk_crypto_evict_key() take the blk_crypto_profile::lock of the device-mapper device, then take the blk_crypto_profile::lock of the underlying device (nested). This isn't a real deadlock, but it causes a lockdep report because there is only one lock class for all instances of this lock. Lockdep subclasses don't really work here because the hierarchy of block devices is dynamic and could have more than 2 levels. Instead, register a dynamic lock class for each blk_crypto_profile, and associate that with the lock. This avoids false-positive lockdep reports like the following: ============================================ WARNING: possible recursive locking detected 6.4.0-rc5 #2 Not tainted -------------------------------------------- fscryptctl/1421 is trying to acquire lock: ffffff80829ca418 (&profile->lock){++++}-{3:3}, at: __blk_crypto_evict_key+0x44/0x1c0 but task is already holding lock: ffffff8086b68ca8 (&profile->lock){++++}-{3:3}, at: __blk_crypto_evict_key+0xc8/0x1c0 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&profile->lock); lock(&profile->lock); *** DEADLOCK *** May be due to missing lock nesting notation Fixes: `1b26283970` ("block: Keyslot Manager for Inline Encryption") Reported-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20230610061139.212085-1-ebiggers@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-07-05 16:36:12 -06:00
Michael Schmitz	7eb1e47696	block/partition: fix signedness issue for Amiga partitions Making 'blk' sector_t (i.e. 64 bit if LBD support is active) fails the 'blk>0' test in the partition block loop if a value of (signed int) -1 is used to mark the end of the partition block list. Explicitly cast 'blk' to signed int to allow use of -1 to terminate the partition block linked list. Fixes: `b6f3f28f60` ("block: add overflow checks for Amiga partition support") Reported-by: Christian Zigotzky <chzigotzky@xenosoft.de> Link: https://lore.kernel.org/r/024ce4fa-cc6d-50a2-9aae-3701d0ebf668@xenosoft.de Signed-off-by: Michael Schmitz <schmitzmic@gmail.com> Reviewed-by: Martin Steigerwald <martin@lichtvoll.de> Tested-by: Christian Zigotzky <chzigotzky@xenosoft.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-07-05 16:34:56 -06:00
Linus Torvalds	e50df24979	block-6.5-2023-07-03 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmSjJ2IQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpsQMEACQiUBw81tXvetYhz3P/4KrrjvUobgqMU0w jtrxqMgPee9FbqCShpj76c+La5wu23DnlrCXoHZxFQuiQLnsX5xFV66NYVi+W1CN k5MHP7f2e9V0T7qJ9UoHFRV1k22LF4X6T8njEZimxsm/uXfpav/knkhI7nUDnB1K wxlu9akD2Bo/X9O2NTS+X6qjoawZ6rDWN15THMXlC45VzJPLmIcs07Ev+mvw21KE XqasoZrxEO0S8dWxmJgJGqnRIOQptTS5U+0OPBZT8H220Qp/1q0pQHPw6iLXNrkc w1a2W1Bge012gjJt7gCMkdDnZb76sKiyGuMbFME7DoRbLCQeaOtoSfmg7NoRI2gp 74TCSr7dPWZUVUy5Tmsy0DCv0552vIbnlQ69W6Xwx8YkplM3FPiMpWrQ5JWEHdvv Zl84mLP6Yyo54JVuk9zi8q/2L0HfyfMDj4UM/mNs8hwmcUSbPO2TKdIWDaq8xPuS Ed+D+kg6XFux8tLnCSDLNbaD5JE+ak9gTVhNdRa/zFE04o/OeidscKEqRSYTkdXL 2p34qtw5kEQocO4Pa3eUGO6KJCDTR36Rms5p6ZFybL4O2oZYrAbRi1TGDxaG2Hag GCr2vaFbmz1zbGuMpFhLha5B7HeDLs+PHOn+B1iUNjEr9RC0EOHV7moJKqjxlnCh 4mBkK/Nlyg== =kSeX -----END PGP SIGNATURE----- Merge tag 'block-6.5-2023-07-03' of git://git.kernel.dk/linux Pull more block updates from Jens Axboe: "Mostly items that came in a bit late for the initial pull request, wanted to make sure they had the appropriate amount of linux-next soak before going upstream. 
Outside of stragglers, just generic fixes for either merge window items, or longer standing bugs" * tag 'block-6.5-2023-07-03' of git://git.kernel.dk/linux: (25 commits) md/raid0: add discard support for the 'original' layout nvme: disable controller on reset state failure nvme: sync timeout work on failed reset nvme: ensure unquiesce on teardown cdrom/gdrom: Fix build error nvme: improved uring polling block: add request polling helper nvme-mpath: fix I/O failure with EAGAIN when failing over I/O nvme: host: fix command name spelling blk-sysfs: add a new attr_group for blk_mq blk-iocost: move wbt_enable/disable_default() out of spinlock blk-wbt: cleanup rwb_enabled() and wbt_disabled() blk-wbt: remove dead code to handle wbt enable/disable with io inflight blk-wbt: don't create wbt sysfs entry if CONFIG_BLK_WBT is disabled blk-mq: fix two misuses on RQF_USE_SCHED blk-throttle: Fix io statistics for cgroup v1 bcache: Fix bcache device claiming bcache: Alloc holder object before async registration raid10: avoid spin_lock from fastpath from raid10_unplug() md: fix 'delete_mutex' deadlock ...	2023-07-03 18:48:38 -07:00
Linus Torvalds	ca7ce08d6a	SCSI misc on 20230629 Updates to the usual drivers (ufs, pm80xx, libata-scsi, smartpqi, lpfc, qla2xxx). We have a couple of major core changes impacting other systems: Command Duration Limits, which spills into block and ATA and block level Persistent Reservation Operations, which touches block, nvme, target and dm (both of which are added with merge commits containing a cover letter explaining what's going on). Signed-off-by: James E.J. Bottomley <jejb@linux.ibm.com> -----BEGIN PGP SIGNATURE----- iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCZJ19cSYcamFtZXMuYm90 dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishfZpAQCQBuWR ELcOhsaG5KzO6xLWcH8mjsOoxffKvazZjTKXlAD5ATEv7++E250oKS3t+yfjae5I Lc195MlDju85ItUQgfk= =U9ik -----END PGP SIGNATURE----- Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi Pull SCSI updates from James Bottomley: "Updates to the usual drivers (ufs, pm80xx, libata-scsi, smartpqi, lpfc, qla2xxx). We have a couple of major core changes impacting other systems: - Command Duration Limits, which spills into block and ATA - block level Persistent Reservation Operations, which touches block, nvme, target and dm Both of these are added with merge commits containing a cover letter explaining what's going on" * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (187 commits) scsi: core: Improve warning message in scsi_device_block() scsi: core: Replace scsi_target_block() with scsi_block_targets() scsi: core: Don't wait for quiesce in scsi_device_block() scsi: core: Don't wait for quiesce in scsi_stop_queue() scsi: core: Merge scsi_internal_device_block() and device_block() scsi: sg: Increase number of devices scsi: bsg: Increase number of devices scsi: qla2xxx: Remove unused nvme_ls_waitq wait queue scsi: ufs: ufs-pci: Add support for Intel Arrow Lake scsi: sd: sd_zbc: Use PAGE_SECTORS_SHIFT scsi: ufs: wb: Add explicit flush_threshold sysfs attribute scsi: ufs: ufs-qcom: 
Switch to the new ICE API scsi: ufs: dt-bindings: qcom: Add ICE phandle scsi: ufs: ufs-mediatek: Set UFSHCD_QUIRK_MCQ_BROKEN_RTC quirk scsi: ufs: ufs-mediatek: Set UFSHCD_QUIRK_MCQ_BROKEN_INTR quirk scsi: ufs: core: Add host quirk UFSHCD_QUIRK_MCQ_BROKEN_RTC scsi: ufs: core: Add host quirk UFSHCD_QUIRK_MCQ_BROKEN_INTR scsi: ufs: core: Remove dedicated hwq for dev command scsi: ufs: core: mcq: Fix the incorrect OCS value for the device command scsi: ufs: dt-bindings: samsung,exynos: Drop unneeded quotes ...	2023-06-30 11:57:07 -07:00
Keith Busch	f6c80cffcd	block: add request polling helper Provide a direct request polling helper for drivers. The interface does not require a bio, and can skip the overhead associated with polling those. The biggest gain is from skipping the relatively expensive xarray lookup, which is unnecessary when you already have the request. With this, the simple rq/qc conversion functions have only one caller each, so open code this and remove the helpers. Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230612190343.2087040-2-kbusch@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-28 16:09:41 -06:00
Jens Axboe	3a08284ff2	Merge branch 'for-6.5/block-late' into block-6.5 * for-6.5/block-late: blk-sysfs: add a new attr_group for blk_mq blk-iocost: move wbt_enable/disable_default() out of spinlock blk-wbt: cleanup rwb_enabled() and wbt_disabled() blk-wbt: remove dead code to handle wbt enable/disable with io inflight blk-wbt: don't create wbt sysfs entry if CONFIG_BLK_WBT is disabled blk-mq: fix two misuses on RQF_USE_SCHED blk-throttle: Fix io statistics for cgroup v1 bcache: Fix bcache device claiming bcache: Alloc holder object before async registration raid10: avoid spin_lock from fastpath from raid10_unplug() md: fix 'delete_mutex' deadlock md: use mddev->external to select holder in export_rdev() md/raid1-10: fix casting from randomized structure in raid1_submit_write() md/raid10: fix the condition to call bio_end_io_acct()	2023-06-28 16:08:19 -06:00
Linus Torvalds	6e17c6de3d	- Yosry Ahmed brought back some cgroup v1 stats in OOM logs. - Yosry has also eliminated cgroup's atomic rstat flushing. - Nhat Pham adds the new cachestat() syscall. It provides userspace with the ability to query pagecache status - a similar concept to mincore() but more powerful and with improved usability. - Mel Gorman provides more optimizations for compaction, reducing the prevalence of page rescanning. - Lorenzo Stoakes has done some maintenance work on the get_user_pages() interface. - Liam Howlett continues with cleanups and maintenance work to the maple tree code. Peng Zhang also does some work on maple tree. - Johannes Weiner has done some cleanup work on the compaction code. - David Hildenbrand has contributed additional selftests for get_user_pages(). - Thomas Gleixner has contributed some maintenance and optimization work for the vmalloc code. - Baolin Wang has provided some compaction cleanups. - SeongJae Park continues maintenance work on the DAMON code. - Huang Ying has done some maintenance on the swap code's usage of device refcounting. - Christoph Hellwig has some cleanups for the filemap/directio code. - Ryan Roberts provides two patch series which yield some rationalization of the kernel's access to pte entries - use the provided APIs rather than open-coding accesses. - Lorenzo Stoakes has some fixes to the interaction between pagecache and directio access to file mappings. - John Hubbard has a series of fixes to the MM selftesting code. - ZhangPeng continues the folio conversion campaign. - Hugh Dickins has been working on the pagetable handling code, mainly with a view to reducing the load on the mmap_lock. - Catalin Marinas has reduced the arm64 kmalloc() minimum alignment from 128 to 8. - Domenico Cerasuolo has improved the zswap reclaim mechanism by reorganizing the LRU management. - Matthew Wilcox provides some fixups to make gfs2 work better with the buffer_head code. 
- Vishal Moola also has done some folio conversion work. - Matthew Wilcox has removed the remnants of the pagevec code - their functionality is migrated over to struct folio_batch. -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZJejewAKCRDdBJ7gKXxA joggAPwKMfT9lvDBEUnJagY7dbDPky1cSYZdJKxxM2cApGa42gEA6Cl8HRAWqSOh J0qXCzqaaN8+BuEyLGDVPaXur9KirwY= =B7yQ -----END PGP SIGNATURE----- Merge tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull mm updates from Andrew Morton: - Yosry Ahmed brought back some cgroup v1 stats in OOM logs - Yosry has also eliminated cgroup's atomic rstat flushing - Nhat Pham adds the new cachestat() syscall. It provides userspace with the ability to query pagecache status - a similar concept to mincore() but more powerful and with improved usability - Mel Gorman provides more optimizations for compaction, reducing the prevalence of page rescanning - Lorenzo Stoakes has done some maintanance work on the get_user_pages() interface - Liam Howlett continues with cleanups and maintenance work to the maple tree code. 
Peng Zhang also does some work on maple tree - Johannes Weiner has done some cleanup work on the compaction code - David Hildenbrand has contributed additional selftests for get_user_pages() - Thomas Gleixner has contributed some maintenance and optimization work for the vmalloc code - Baolin Wang has provided some compaction cleanups, - SeongJae Park continues maintenance work on the DAMON code - Huang Ying has done some maintenance on the swap code's usage of device refcounting - Christoph Hellwig has some cleanups for the filemap/directio code - Ryan Roberts provides two patch series which yield some rationalization of the kernel's access to pte entries - use the provided APIs rather than open-coding accesses - Lorenzo Stoakes has some fixes to the interaction between pagecache and directio access to file mappings - John Hubbard has a series of fixes to the MM selftesting code - ZhangPeng continues the folio conversion campaign - Hugh Dickins has been working on the pagetable handling code, mainly with a view to reducing the load on the mmap_lock - Catalin Marinas has reduced the arm64 kmalloc() minimum alignment from 128 to 8 - Domenico Cerasuolo has improved the zswap reclaim mechanism by reorganizing the LRU management - Matthew Wilcox provides some fixups to make gfs2 work better with the buffer_head code - Vishal Moola also has done some folio conversion work - Matthew Wilcox has removed the remnants of the pagevec code - their functionality is migrated over to struct folio_batch * tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (380 commits) mm/hugetlb: remove hugetlb_set_page_subpool() mm: nommu: correct the range of mmap_sem_read_lock in task_mem() hugetlb: revert use of page_cache_next_miss() Revert "page cache: fix page_cache_next/prev_miss off by one" mm/vmscan: fix root proactive reclaim unthrottling unbalanced node mm: memcg: rename and document global_reclaim() mm: kill [add\|del]_page_to_lru_list() mm: 
compaction: convert to use a folio in isolate_migratepages_block() mm: zswap: fix double invalidate with exclusive loads mm: remove unnecessary pagevec includes mm: remove references to pagevec mm: rename invalidate_mapping_pagevec to mapping_try_invalidate mm: remove struct pagevec net: convert sunrpc from pagevec to folio_batch i915: convert i915_gpu_error to use a folio_batch pagevec: rename fbatch_count() mm: remove check_move_unevictable_pages() drm: convert drm_gem_put_pages() to use a folio_batch i915: convert shmem_sg_free_table() to use a folio_batch scatterlist: add sg_set_folio() ...	2023-06-28 10:28:11 -07:00
Linus Torvalds	a0433f8cae	for-6.5/block-2023-06-23 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmSV8dwQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpilGD/9Yys1oxIXJpRf00fzrylAlBthRxMjFQVWw zAut106hAQiBHvU8IkmGA3MvEFVHxtzwYhHI7IR8K3aZBIqscweCqmVI9JyogJw9 U9Twnzel47VmuKdM94FeoN+hbj1fP8EWTjzmy67/zEEfFCdmHvNlMi3lSrGYIpFy 39LxTB99Y4UarM5PtWbes37GYYljzMSWKuo4AfBkvq1eQa+sZ0Vq2xAABKq3UM7f apqhgHtkJooRePDP0eQp+kAyyVMgW2jIK+oIdJDxNF3CKTu2w40RzaYz6fp+jVSU H4R/xS59GW4/xql+VBJDh/qJg9K62DPPYjlW8BmSR8+IjvfFpsyH3/MacE50CD3P 20fs/Mnj49H79fDrQEHJI53cOOb2EmUitbwLbvOcColNTPpt8loBtdQxjF2RMU8R Nyort9DJPFclYCxky1LYg1CNEC2Ln4Zy/jD47wPvqRmOQphOoVlV/hPnOEqvjaZC 49Vn70W2DeE9cXvYI7ha+XIg6/oj+Gs3iusEbV08Ci7EAtXgI+ZUUsQ97K8UNiUh h2lqSJtuI7lBpYP9sf+BeCch5UCC+xGYyTdoM5f58lehWBBPtbs0g7S9RyRyOYxe n+yxEUo3dAGzJ/xsKAjinbZfeWIpr0b1TkAh4w3Cq/BKzRr9Bp8lBAxYuancbQ+Y 1ADPteUOTA== =zP4Y -----END PGP SIGNATURE----- Merge tag 'for-6.5/block-2023-06-23' of git://git.kernel.dk/linux Pull block updates from Jens Axboe: - NVMe pull request via Keith: - Various cleanups all around (Irvin, Chaitanya, Christophe) - Better struct packing (Christophe JAILLET) - Reduce controller error logs for optional commands (Keith) - Support for >=64KiB block sizes (Daniel Gomez) - Fabrics fixes and code organization (Max, Chaitanya, Daniel Wagner) - bcache updates via Coly: - Fix a race at init time (Mingzhe Zou) - Misc fixes and cleanups (Andrea, Thomas, Zheng, Ye) - use page pinning in the block layer for dio (David) - convert old block dio code to page pinning (David, Christoph) - cleanups for pktcdvd (Andy) - cleanups for rnbd (Guoqing) - use the unchecked __bio_add_page() for the initial single page additions (Johannes) - fix overflows in the Amiga partition handling code (Michael) - improve mq-deadline zoned device support (Bart) - keep passthrough requests out of the IO schedulers (Christoph, Ming) - improve support for flush requests, making them less special to deal with (Christoph) - add bdev 
holder ops and shutdown methods (Christoph) - fix the name_to_dev_t() situation and use cases (Christoph) - decouple the block open flags from fmode_t (Christoph) - ublk updates and cleanups, including adding user copy support (Ming) - BFQ sanity checking (Bart) - convert brd from radix to xarray (Pankaj) - constify various structures (Thomas, Ivan) - more fine grained persistent reservation ioctl capability checks (Jingbo) - misc fixes and cleanups (Arnd, Azeem, Demi, Ed, Hengqi, Hou, Jan, Jordy, Li, Min, Yu, Zhong, Waiman) * tag 'for-6.5/block-2023-06-23' of git://git.kernel.dk/linux: (266 commits) scsi/sg: don't grab scsi host module reference ext4: Fix warning in blkdev_put() block: don't return -EINVAL for not found names in devt_from_devname cdrom: Fix spectre-v1 gadget block: Improve kernel-doc headers blk-mq: don't insert passthrough request into sw queue bsg: make bsg_class a static const structure ublk: make ublk_chr_class a static const structure aoe: make aoe_class a static const structure block/rnbd: make all 'class' structures const block: fix the exclusive open mask in disk_scan_partitions block: add overflow checks for Amiga partition support block: change all __u32 annotations to __be32 in affs_hardblocks.h block: fix signed int overflow in Amiga partition support block: add capacity validation in bdev_add_partition() block: fine-granular CAP_SYS_ADMIN for Persistent Reservation block: disallow Persistent Reservation on partitions reiserfs: fix blkdev_put() warning from release_journal_dev() block: fix wrong mode for blkdev_get_by_dev() from disk_scan_partitions() block: document the holder argument to blkdev_get_by_path ...	2023-06-26 12:47:20 -07:00
Linus Torvalds	0aa69d53ac	for-6.5/io_uring-2023-06-23 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmSV8cEQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpnvZD/0QWstFCe1CSLWaycdC9fhWftFt3hyEIST5 CYEL56UZrDWNkv9xTLe855xvMavjd4sdHlUa8NUPghRQeJyKYgRxHBLXRWmy0uNN l47Zjiwsolmbr3Nt6qViLdCDYmG39ZGNwWo8b6p3ybWYLtzxeOblocOBTPzoCtkS hjo7Z0eMONvsvLX+l0o9IDdWtZIQ2fGo4VYkMIVb6CyxRPpuUuPKbE25qaTx+uBg Fy6Qa3SlaTwzqcg3dttggjIP792L/eETUCWGndg5pJNbrkj/fI4Vm4bEljID76eS HODl+pWHmyM6avVkypr7N3Tp5HKF0OTUa4vJTLIZo1QiRu6zlXphtuvGn+McEmgV hbYmQMYWzqJ22k2iEpCR58pdhmZJC9uB8r4Rwgr/t9GKqt4E+15EzmqkG9cUVMGV rfbBwVLwBUd5+0WHwQ8RzdtaUPt17vSIW/8WhU5zoMGVotqVBHO/H+5BtmKPWWpq fx1etQ8XJVPIxziJvgsEitb1s6KZzJspcONDlLEitmZkflv3gGdVm99KNbXwJpcp m6+FcYQ5d5FivfLPGgpx8go+4M2QuoW2yRGwZHu54buCnpxgNjIk898OjrUrdXCg 3/0m99GXmOWQQl0VrrTr+Fv99nVsQ2hMQzOFJGMYRtHEEc5xiTcJiZmoxmF7T7/C TipyW3czsw== =5Me8 -----END PGP SIGNATURE----- Merge tag 'for-6.5/io_uring-2023-06-23' of git://git.kernel.dk/linux Pull io_uring updates from Jens Axboe: "Nothing major in this release, just a bunch of cleanups and some optimizations around networking mostly. - clean up file request flags handling (Christoph) - clean up request freeing and CQ locking (Pavel) - support for using pre-registering the io_uring fd at setup time (Josh) - Add support for user allocated ring memory, rather than having the kernel allocate it. 
Mostly for packing rings into a huge page (me) - avoid an unnecessary double retry on receive (me) - maintain ordering for task_work, which also improves performance (me) - misc cleanups/fixes (Pavel, me)" * tag 'for-6.5/io_uring-2023-06-23' of git://git.kernel.dk/linux: (39 commits) io_uring: merge conditional unlock flush helpers io_uring: make io_cq_unlock_post static io_uring: inline __io_cq_unlock io_uring: fix acquire/release annotations io_uring: kill io_cq_unlock() io_uring: remove IOU_F_TWQ_FORCE_NORMAL io_uring: don't batch task put on reqs free io_uring: move io_clean_op() io_uring: inline io_dismantle_req() io_uring: remove io_free_req_tw io_uring: open code io_put_req_find_next io_uring: add helpers to decode the fixed file file_ptr io_uring: use io_file_from_index in io_msg_grab_file io_uring: use io_file_from_index in __io_sync_cancel io_uring: return REQ_F_ flags from io_file_get_flags io_uring: remove io_req_ffs_set io_uring: remove a confusing comment above io_file_get_flags io_uring: remove the mode variable in io_file_get_flags io_uring: remove __io_file_supports_nowait io_uring: wait interruptibly for request completions on exit ...	2023-06-26 12:30:26 -07:00
Linus Torvalds	3eccc0c886	for-6.5/splice-2023-06-23 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmSV8QgQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpupIEADKEZvpxDyaxHjYZFFeoSJRkh+AEJHe0Xtr J5vUL8t8zmAV3F7i8XaoAEcR0dC0VQcoTc8fAOty71+5hsc7gvtyyNjqU/YWRVqK Xr+VJuSJ+OGx3MzpRWEkepagfPyqP5cyyCOK6gqIgqzc3IwqkR/3QHVRc6oR8YbY AQd7tqm2fQXK9WDHEy5hcaQeqb9uKZjQQoZejpPPerpJM+9RMgKxpCGtnLLIUhr/ sgl7KyLIQPBmveO2vfOR+dmsJBqsLqneqkXDKMAIfpeVEEkHHAlCH4E5Ne1XUS+s ie4If+reuyn1Ktt5Ry1t7w2wr8cX1fcay3K28tgwjE2Bvremc5YnYgb3pyUDW38f tXXkpg/eTXd/Pn0Crpagoa9zJ927tt5JXIO1/PagPEP1XOqUuthshDFsrVqfqbs+ 36gqX2JWB4NJTg9B9KBHA3+iVCJyZLjUqOqws7hOJOvhQytZVm/IwkGBg1Slhe1a J5WemBlqX8lTgXz0nM7cOhPYTZeKe6hazCcb5VwxTUTj9SGyYtsMfqqTwRJO9kiF j1VzbOAgExDYe+GvfqOFPh9VqZho66+DyOD/Xtca4eH7oYyHSmP66o8nhRyPBPZA maBxQhUkPQn4/V/0fL2TwIdWYKsbj8bUyINKPZ2L35YfeICiaYIctTwNJxtRmItB M3VxWD3GZQ== =KhW4 -----END PGP SIGNATURE----- Merge tag 'for-6.5/splice-2023-06-23' of git://git.kernel.dk/linux Pull splice updates from Jens Axboe: "This kills off ITER_PIPE to avoid a race between truncate, iov_iter_revert() on the pipe and an as-yet incomplete DMA to a bio with unpinned/unref'ed pages from an O_DIRECT splice read. This causes memory corruption. Instead, we either use (a) filemap_splice_read(), which invokes the buffered file reading code and splices from the pagecache into the pipe; (b) copy_splice_read(), which bulk-allocates a buffer, reads into it and then pushes the filled pages into the pipe; or (c) handle it in filesystem-specific code. Summary: - Rename direct_splice_read() to copy_splice_read() - Simplify the calculations for the number of pages to be reclaimed in copy_splice_read() - Turn do_splice_to() into a helper, vfs_splice_read(), so that it can be used by overlayfs and coda to perform the checks on the lower fs - Make vfs_splice_read() jump to copy_splice_read() to handle direct-I/O and DAX - Provide shmem with its own splice_read to handle non-existent pages in the pagecache. 
We don't want a ->read_folio() as we don't want to populate holes, but filemap_get_pages() requires it - Provide overlayfs with its own splice_read to call down to a lower layer as overlayfs doesn't provide ->read_folio() - Provide coda with its own splice_read to call down to a lower layer as coda doesn't provide ->read_folio() - Direct ->splice_read to copy_splice_read() in tty, procfs, kernfs and random files as they just copy to the output buffer and don't splice pages - Provide wrappers for afs, ceph, ecryptfs, ext4, f2fs, nfs, ntfs3, ocfs2, orangefs, xfs and zonefs to do locking and/or revalidation - Make cifs use filemap_splice_read() - Replace pointers to generic_file_splice_read() with pointers to filemap_splice_read() as DIO and DAX are handled in the caller; filesystems can still provide their own alternate ->splice_read() op - Remove generic_file_splice_read() - Remove ITER_PIPE and its paraphernalia as generic_file_splice_read was the only user" * tag 'for-6.5/splice-2023-06-23' of git://git.kernel.dk/linux: (31 commits) splice: kdoc for filemap_splice_read() and copy_splice_read() iov_iter: Kill ITER_PIPE splice: Remove generic_file_splice_read() splice: Use filemap_splice_read() instead of generic_file_splice_read() cifs: Use filemap_splice_read() trace: Convert trace/seq to use copy_splice_read() zonefs: Provide a splice-read wrapper xfs: Provide a splice-read wrapper orangefs: Provide a splice-read wrapper ocfs2: Provide a splice-read wrapper ntfs3: Provide a splice-read wrapper nfs: Provide a splice-read wrapper f2fs: Provide a splice-read wrapper ext4: Provide a splice-read wrapper ecryptfs: Provide a splice-read wrapper ceph: Provide a splice-read wrapper afs: Provide a splice-read wrapper 9p: Add splice_read wrapper net: Make sock_splice_read() use copy_splice_read() by default tty, proc, kernfs, random: Use copy_splice_read() ...	2023-06-26 11:52:12 -07:00
Yu Kuai	6d85ebf95c	blk-sysfs: add a new attr_group for blk_mq Currently wbt sysfs entry is created for bio based device, and wbt can be enabled for such device through sysfs while it doesn't make sense because wbt can only work for rq based device. In the meantime, there are other similar sysfs entries. Fix this by adding a new attr_group for blk_mq, and sysfs entries will only be created when the device is rq based. Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230527010644.647900-6-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-26 09:53:36 -06:00
Yu Kuai	eebc21d12f	blk-iocost: move wbt_enable/disable_default() out of spinlock There is the following smatch warning: block/blk-wbt.c:843 wbt_init() warn: sleeping in atomic context ioc_qos_write() <- disables preempt -> wbt_enable_default() -> wbt_init() wbt_init() will be called from wbt_enable_default() if wbt is not initialized, currently this is only possible in blk_register_queue(), hence wbt_init() will never be called from iocost and this warning is a false positive. However, we might support rq_qos destruction dynamically in the future, and it's better to prevent that, hence move wbt_enable_default() outside 'ioc->lock'. This is safe because the queue is still frozen. Reported-by: Dan Carpenter <error27@gmail.com> Link: https://lore.kernel.org/lkml/Y+Ja5SRs886CEz7a@kadam/ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230527010644.647900-5-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-26 09:53:36 -06:00
Yu Kuai	06257fda83	blk-wbt: cleanup rwb_enabled() and wbt_disabled() 'wb_normal' will be set to 0 if 'min_lat_nsec' is 0, and 'min_lat_nsec' can only be set to 0 through sysfs configuration, where 'WBT_STATE_OFF_MANUAL' is set together with it; in the meantime, they can only be cleared together through sysfs afterwards. Hence 'wb_normal != 0' is the same as 'rwb->enable_state != WBT_STATE_OFF_MANUAL'. The code is redundant, hence replace the check of 'wb_normal' with a check of 'enable_state' in rwb_enabled() and reuse rwb_enabled() for wbt_disabled(). Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230527010644.647900-4-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-26 09:53:36 -06:00
Yu Kuai	71b8642e79	blk-wbt: remove dead code to handle wbt enable/disable with io inflight Enabling or disabling wbt is always done with the queue frozen, so wbt can never be enabled or disabled while io is still inflight, and this behaviour should always hold to avoid io hangs (these have been reported several times). Therefore, the code to handle wbt enable/disable with io inflight is not and never will be used, hence remove such dead code. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230527010644.647900-3-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-26 09:53:36 -06:00
Yu Kuai	645a829e03	blk-wbt: don't create wbt sysfs entry if CONFIG_BLK_WBT is disabled sysfs entry /sys/block/[device]/queue/wbt_lat_usec will be created even if CONFIG_BLK_WBT is disabled, while read and write will always fail. It doesn't make sense to create a sysfs entry that can't be accessed, so don't create such entry. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230527010644.647900-2-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-26 09:53:36 -06:00
Ming Lei	c6b7a3a26e	blk-mq: fix two misuses on RQF_USE_SCHED Request allocated from sched tags can't be issued via ->queue_rqs() directly, since driver tag isn't allocated yet. This is the 1st misuse of RQF_USE_SCHED for figuring out plug->has_elevator. Request allocated from sched tags can't be ended by blk_mq_end_request_batch() too, fix the 2nd RQF_USE_SCHED misuse in blk_mq_add_to_batch(). Without this patch, NVMe uring cmd passthrough IO workload can run into hang easily with real io scheduler. Fixes: `dd6216bb16` ("blk-mq: make sure elevator callbacks aren't called for passthrough request") Reported-by: Guangwu Zhang <guazhang@redhat.com> Closes: https://lore.kernel.org/linux-block/CAGS2=YrBjpLPOKa-gzcKuuOG60AGth5794PNCDwatdnnscB9ug@mail.gmail.com/ Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230624130105.1443879-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-25 08:01:05 -06:00
Jinke Han	ad7c3b41e8	blk-throttle: Fix io statistics for cgroup v1 After commit `f382fb0bce` ("block: remove legacy IO schedulers"), blkio.throttle.io_serviced and blkio.throttle.io_service_bytes become the only stable io stats interface of cgroup v1, and these statistics are done in the blk-throttle code. But the current code only counts the bios that are actually throttled. When the user does not add the throttle limit, the io stats for cgroup v1 has nothing. I fix it according to the statistical method of v2, and made it count all ios accurately. Fixes: `a7b36ee6ba` ("block: move blk-throtl fast path inline") Tested-by: Andrea Righi <andrea.righi@canonical.com> Signed-off-by: Jinke Han <hanjinke.666@bytedance.com> Acked-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230507170631.89607-1-hanjinke.666@bytedance.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-25 08:00:39 -06:00
Christoph Hellwig	648fa60fa7	block: don't return -EINVAL for not found names in devt_from_devname When we didn't find a device and didn't guess it might be a partition, it might still show up later, so don't disable rootwait for it by returning -EINVAL. Fixes: `079caa35f7` ("init: clear root_wait on all invalid root= strings") Reported-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230622150644.600327-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-22 09:09:33 -06:00
Ming Lei	9c39b7a905	block: make sure local irq is disabled when calling __blkcg_rstat_flush When __blkcg_rstat_flush() is called from cgroup_rstat_flush*() code path, interrupt is always disabled. When we start to flush blkcg per-cpu stats list in __blkg_release() for avoiding to leak blkcg_gq's reference in commit `20cb1c2fb7` ("blk-cgroup: Flush stats before releasing blkcg_gq"), local irq isn't disabled yet, then lockdep warning may be triggered because the dependent cgroup locks may be acquired from irq(soft irq) handler. Fix the issue by disabling local irq always. Fixes: `20cb1c2fb7` ("blk-cgroup: Flush stats before releasing blkcg_gq") Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Closes: https://lore.kernel.org/linux-block/pz2wzwnmn5tk3pwpskmjhli6g3qly7eoknilb26of376c7kwxy@qydzpvt6zpis/T/#u Cc: stable@vger.kernel.org Cc: Jay Shin <jaeshin@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: Waiman Long <longman@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Waiman Long <longman@redhat.com> Link: https://lore.kernel.org/r/20230622084249.1208005-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-22 07:44:00 -06:00
Bart Van Assche	017fb83ee0	block: Improve kernel-doc headers Fix the documentation of the devt_from_partuuid() return value. Fix the following two recently introduced kernel-doc warnings: block/bdev.c:570: warning: Function parameter or member 'hops' not described in 'bd_finish_claiming' block/early-lookup.c:46: warning: Function parameter or member 'devt' not described in 'devt_from_partuuid' Cc: Christoph Hellwig <hch@lst.de> Fixes: `0718afd47f` ("block: introduce holder ops") Fixes: `cf056a4312` ("init: improve the name_to_dev_t interface") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20230621165054.743815-1-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-21 13:18:15 -06:00
Ming Lei	2293cae703	blk-mq: don't insert passthrough request into sw queue In case of real io scheduler, q->elevator is set, so blk_mq_run_hw_queue() may just check if scheduler queue has request to dispatch, see __blk_mq_sched_dispatch_requests(). Then IO hang may be caused because all passthrough requests may stay in sw queue. And any passthrough request should have been inserted to hctx->dispatch always. Reported-by: Guangwu Zhang <guazhang@redhat.com> Fixes: `d97217e7f0` ("blk-mq: don't queue plugged passthrough requests into scheduler") Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230621132208.1142318-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-21 07:48:34 -06:00
Ivan Orlov	72ef02b8df	bsg: make bsg_class a static const structure Now that the driver core allows for struct class to be in read-only memory, move the bsg_class structure to be declared at build time placing it into read-only memory, instead of having to be dynamically allocated at boot time. Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-scsi@vger.kernel.org Cc: linux-block@vger.kernel.org Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ivan Orlov <ivan.orlov0322@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: https://lore.kernel.org/r/20230620180129.645646-8-gregkh@linuxfoundation.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-21 07:46:17 -06:00
Christoph Hellwig	56e71bdf32	block: fix the exclusive open mask in disk_scan_partitions FMODE_EXEC has nothing to do with exclusive opens, and even is of the wrong type. We need to check for BLK_OPEN_EXCL here. Fixes: `985958b858` ("block: fix wrong mode for blkdev_get_by_dev() from disk_scan_partitions()") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230621124914.185992-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-21 07:37:52 -06:00
Michael Schmitz	b6f3f28f60	block: add overflow checks for Amiga partition support The Amiga partition parser module uses signed int for partition sector address and count, which will overflow for disks larger than 1 TB. Use u64 as type for sector address and size to allow using disks up to 2 TB without LBD support, and disks larger than 2 TB with LBD. The RDB format allows specifying disk sizes up to 2^128 bytes (though native OS limitations reduce this somewhat, to max 2^68 bytes), so check for u64 overflow carefully to protect against overflowing sector_t. Bail out if sector addresses overflow 32 bits on kernels without LBD support. This bug was reported originally in 2012, and the fix was created by the RDB author, Joanne Dow <jdow@earthlink.net>. A patch had been discussed and reviewed on linux-m68k at that time but never officially submitted (now resubmitted as patch 1 in this series). This patch adds additional error checking and warning messages. Reported-by: Martin Steigerwald <Martin@lichtvoll.de> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=43511 Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Message-ID: <201206192146.09327.Martin@lichtvoll.de> Cc: <stable@vger.kernel.org> # 5.2 Signed-off-by: Michael Schmitz <schmitzmic@gmail.com> Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org> Reviewed-by: Christoph Hellwig <hch@infradead.org> Link: https://lore.kernel.org/r/20230620201725.7020-4-schmitzmic@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-20 14:28:17 -06:00
Michael Schmitz	fc3d092c6b	block: fix signed int overflow in Amiga partition support The Amiga partition parser module uses signed int for partition sector address and count, which will overflow for disks larger than 1 TB. Use sector_t as type for sector address and size to allow using disks up to 2 TB without LBD support, and disks larger than 2 TB with LBD. This bug was reported originally in 2012, and the fix was created by the RDB author, Joanne Dow <jdow@earthlink.net>. A patch had been discussed and reviewed on linux-m68k at that time but never officially submitted. This patch differs from Joanne's patch only in its use of sector_t instead of unsigned int. No checking for overflows is done (see patch 3 of this series for that). Reported-by: Martin Steigerwald <Martin@lichtvoll.de> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=43511 Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Message-ID: <201206192146.09327.Martin@lichtvoll.de> Cc: <stable@vger.kernel.org> # 5.2 Signed-off-by: Michael Schmitz <schmitzmic@gmail.com> Tested-by: Martin Steigerwald <Martin@lichtvoll.de> Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230620201725.7020-2-schmitzmic@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-20 14:28:17 -06:00
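The overflow described in the two Amiga partition commits above can be illustrated with a small user-space sketch. This is not the kernel's parser: the helper names, the `sector_max` parameter (standing in for the largest value the sector type can hold) and the block-to-sector factor are all hypothetical, chosen only to show why a signed 32-bit sector count breaks at 1 TB with 512-byte sectors and how a widened, range-checked computation avoids it.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not kernel code): convert a 32-bit block count to
 * sectors using 64-bit arithmetic, then check that the result fits the
 * sector type in use. A 32x32-bit product always fits in 64 bits, so the
 * multiplication itself cannot wrap.
 */
static bool blocks_to_sectors(uint32_t blocks, uint32_t sectors_per_block,
                              uint64_t sector_max, uint64_t *out)
{
    uint64_t sectors = (uint64_t)blocks * sectors_per_block;

    if (sectors > sector_max)
        return false;           /* caller should bail out with a warning */
    *out = sectors;
    return true;
}

/*
 * The pre-fix behaviour for comparison: the same product squeezed into a
 * signed 32-bit int. The narrowing conversion is implementation-defined
 * in C; GCC and Clang reduce the value modulo 2^32.
 */
static int32_t blocks_to_sectors_int(uint32_t blocks, uint32_t sectors_per_block)
{
    return (int32_t)((uint64_t)blocks * sectors_per_block);
}
```

With 512-byte sectors, 2^31 sectors is exactly 1 TB, which is the first count a signed 32-bit int cannot represent. Passing `UINT32_MAX` as `sector_max` models the "no LBD support" case mentioned in the commit message, where addresses beyond 32 bits must be rejected.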
Min Li	6d4e80db4e	block: add capacity validation in bdev_add_partition() In the function bdev_add_partition(), there is no check whether the start and end sectors exceed the size of the disk before calling add_partition. When we call the block's ioctl interface directly to add a partition, and the capacity of the disk is set to 0 by the driver, the command will continue to execute. Signed-off-by: Min Li <min15.li@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230619091214.31615-1-min15.li@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-20 12:50:51 -06:00
Jingbo Xu	9a72a02456	block: fine-granular CAP_SYS_ADMIN for Persistent Reservation Allow unprivileged Persistent Reservation operations on devices if the write permission check on the device node has passed. brw-rw---- 1 root disk 259, 0 Jun 13 07:09 /dev/nvme0n1 In the example above, the "disk" group of nvme0n1 is also allowed to make reservations on the device even without CAP_SYS_ADMIN. Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230613084008.93795-3-jefflexu@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-20 12:49:23 -06:00
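A minimal sketch of the kind of range check the bdev_add_partition() commit above describes, written as stand-alone C. The function name and parameters are hypothetical and the real kernel change may be structured differently; the point is only that both the start and the end of the partition are compared against the disk capacity, without letting `start + length` wrap.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the validation (not the kernel's code).
 * All values are in sectors. The comparison is arranged so that
 * start + length is never computed and therefore cannot overflow.
 */
static bool partition_fits(uint64_t start, uint64_t length, uint64_t capacity)
{
    if (start >= capacity)
        return false;                   /* also rejects capacity == 0 */
    return length <= capacity - start;  /* end must not pass the disk end */
}
```

A disk whose driver reports a capacity of 0, the case called out in the commit message, fails the first comparison for every possible start sector.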
Jingbo Xu	1262962166	block: disallow Persistent Reservation on partitions Refuse Persistent Reservation operations on partitions as reservation on partitions doesn't make sense. Besides, introduce blkdev_pr_allowed() helper, where more policies could be placed here later. Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230613084008.93795-2-jefflexu@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-20 12:49:23 -06:00
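Taken together, the two Persistent Reservation commits above describe a small policy: refuse partitions outright, allow CAP_SYS_ADMIN holders, and otherwise allow the operation only when the write permission check on the device node has passed. The sketch below models that decision in plain C. The function name and its boolean inputs are hypothetical, and it assumes the write permission check can be summarised as a single flag; it is not the body of blkdev_pr_allowed().

```c
#include <stdbool.h>

/*
 * Hypothetical model of the policy (not kernel code).
 *   is_partition:      the target is a partition, not a whole device
 *   has_cap_sys_admin: the caller holds CAP_SYS_ADMIN
 *   write_permitted:   assumed summary of "the write permission check on
 *                      the device node has passed"
 */
static bool pr_allowed(bool is_partition, bool has_cap_sys_admin,
                       bool write_permitted)
{
    if (is_partition)
        return false;           /* reservations on partitions make no sense */
    if (has_cap_sys_admin)
        return true;            /* privileged callers are always allowed */
    return write_permitted;     /* unprivileged: needs write access */
}
```

In the `/dev/nvme0n1` example from the commit message, a member of the "disk" group has write access to the node, so the last branch would admit the request even without CAP_SYS_ADMIN.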
Yu Kuai	985958b858	block: fix wrong mode for blkdev_get_by_dev() from disk_scan_partitions() After commit `2736e8eeb0` ("block: use the holder as indication for exclusive opens"), blkdev_get_by_dev() will warn if holder is NULL and mode contains 'FMODE_EXCL'. holder from blkdev_get_by_dev() from disk_scan_partitions() is always NULL, hence it should not use 'FMODE_EXCL', which is broken by the commit. As a consequence, WARN_ON_ONCE() will be triggered from blkdev_get_by_dev() if a user scans partitions with the device opened exclusively. Fix this problem by removing 'FMODE_EXCL' from disk_scan_partitions(), as it used to be. Reported-by: syzbot+00cd27751f78817f167b@syzkaller.appspotmail.com Link: https://syzkaller.appspot.com/bug?extid=00cd27751f78817f167b Fixes: `2736e8eeb0` ("block: use the holder as indication for exclusive opens") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230618140402.7556-1-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-20 07:17:46 -06:00
Christoph Hellwig	e89e001f24	block: document the holder argument to blkdev_get_by_path Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230620043536.707249-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-20 07:16:38 -06:00
Demi Marie Obenour	b90ecc0379	block: increment diskseq on all media change events Currently, associating a loop device with a different file descriptor does not increment its diskseq. This allows the following race condition: 1. Program X opens a loop device 2. Program X gets the diskseq of the loop device. 3. Program X associates a file with the loop device. 4. Program X passes the loop device major, minor, and diskseq to something. 5. Program X exits. 6. Program Y detaches the file from the loop device. 7. Program Y attaches a different file to the loop device. 8. The opener finally gets around to opening the loop device and checks that the diskseq is what it expects it to be. Even though the diskseq is the expected value, the result is that the opener is accessing the wrong file. From discussions with Christoph Hellwig, it appears that disk_force_media_change() was supposed to call inc_diskseq(), but in fact it does not. Adding a Fixes: tag to indicate this. Christoph's Reported-by is because he stated that disk_force_media_change() calls inc_diskseq(), which is what led me to discover that it should but does not. Reported-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Demi Marie Obenour <demi@invisiblethingslab.com> Fixes: `e6138dc12d` ("block: add a helper to raise a media changed event") Cc: stable@vger.kernel.org # 5.15+ Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230607170837.1559-1-demi@invisiblethingslab.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-20 07:16:24 -06:00
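The race in the diskseq commit above can be replayed with a toy model. Everything here is hypothetical (the struct, the integer "file" identifiers, the `bump` switch); it only mimics the sequence of steps in the commit message to show why an unchanged diskseq lets the late opener accept the wrong backing file, and why incrementing it on every media change closes the gap.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy loop device: a sequence number plus an id for the backing file. */
struct toy_loop {
    uint64_t diskseq;
    int backing_file;   /* 0 means detached */
};

/*
 * Change the backing file. With bump == false this models the behaviour
 * described as buggy in the commit; with bump == true the sequence number
 * is incremented on every media change.
 */
static void toy_set_backing(struct toy_loop *dev, int file, bool bump)
{
    dev->backing_file = file;
    if (bump)
        dev->diskseq++;
}

/* The late opener trusts the device only if diskseq is what it was told. */
static bool toy_opener_accepts(const struct toy_loop *dev, uint64_t expected)
{
    return dev->diskseq == expected;
}
```

Replaying steps 2 to 8 without the bump leaves the opener's check passing while `backing_file` has changed underneath it; with the bump, the stale sequence number no longer matches and the opener can refuse.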
Martin K. Petersen	af92c02fb2	Merge patch series "scsi: fixes for targets with many LUNs, and scsi_target_block rework" Martin Wilck <mwilck@suse.com> says: This patch series addresses some issues we saw in a test setup with a large number of SCSI LUNs. The first two patches simply increase the number of available sg and bsg devices. 3-5 fix a large delay we encountered between blocking a Fibre Channel remote port and the dev_loss_tmo. 6 renames scsi_target_block() to scsi_block_targets(), and makes additional changes to this API, as suggested in the review of the v2 series. 7 improves a warning message. Link: https://lore.kernel.org/r/20230614103616.31857-1-mwilck@suse.com Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>	2023-06-16 12:21:04 -04:00
Hannes Reinecke	9077fb2ab7	scsi: bsg: Increase number of devices Larger setups may need to allocate more than 32k bsg devices, so increase the number of devices to the full range of minor device numbers. Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Martin Wilck <mwilck@suse.com> Link: https://lore.kernel.org/r/20230614103616.31857-2-mwilck@suse.com Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>	2023-06-16 12:19:59 -04:00
Ming Lei	245165658e	blk-mq: fix NULL dereference on q->elevator in blk_mq_elv_switch_none After grabbing q->sysfs_lock, q->elevator may become NULL because of elevator switch. Fix the NULL dereference on q->elevator by checking it with lock. Reported-by: Guangwu Zhang <guazhang@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230616132354.415109-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-16 10:12:25 -06:00
Christoph Hellwig	e4cc64657b	block: remove BIO_PAGE_REFFED Now that all block direct I/O helpers use page pinning, this flag is unused. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/20230614140341.521331-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-16 10:08:09 -06:00
Damien Le Moal	01584c1e23	scsi: block: Improve ioprio value validity checks The introduction of the macro IOPRIO_PRIO_LEVEL() in commit `eca2040972` ("scsi: block: ioprio: Clean up interface definition") results in an iopriority level to always be masked using the macro IOPRIO_LEVEL_MASK, and thus to the kernel always seeing an acceptable value for an I/O priority level when checked in ioprio_check_cap(). Before this patch, this function would return an error for some (but not all) invalid values for a level valid range of [0..7]. Restore and improve the detection of invalid priority levels by introducing the inline function ioprio_value() to check an ioprio class, level and hint value before combining these fields into a single value to be used with ioprio_set() or AIOs. If an invalid value for the class, level or hint of an ioprio is detected, ioprio_value() returns an ioprio using the class IOPRIO_CLASS_INVALID, indicating an invalid value and causing ioprio_check_cap() to return -EINVAL. Fixes: `6c91325722` ("scsi: block: Introduce ioprio hints") Fixes: `eca2040972` ("scsi: block: ioprio: Clean up interface definition") Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230608095556.124001-1-dlemoal@kernel.org Reviewed-by: Niklas Cassel <niklas.cassel@wdc.com> Reviewed-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>	2023-06-16 12:04:30 -04:00
Yu Kuai	dd7de3704a	block: fix blktrace debugfs entries leakage Commit `99d055b4fd` ("block: remove per-disk debugfs files in blk_unregister_queue") moves blk_trace_shutdown() from blk_release_queue() to blk_unregister_queue(). This is safe if blktrace is created through sysfs; however, there is a regression in a corner case: blktrace can still be enabled after del_gendisk() through ioctl if the disk is opened before del_gendisk(), and if blktrace is not shut down through ioctl before closing the disk, debugfs entries will be leaked. Fix this problem by shutting down blktrace in disk_release(); this is safe because blk_trace_remove() is reentrant. Fixes: `99d055b4fd` ("block: remove per-disk debugfs files in blk_unregister_queue") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230610022003.2557284-4-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-14 20:24:03 -06:00
Ed Tsai	30654614f3	blk-mq: check on cpu id when there is only one ctx mapping commit `f168420c62` ("blk-mq: don't redirect completion for hctx withs only one ctx mapping") When nvme applies a 1:1 mapping of hctx and ctx, there will be no remote request. But for ufs, the submission and completion queues could be asymmetric. (e.g. Multiple SQs share one CQ) Therefore, 1:1 mapping of hctx and ctx won't complete request on the submission cpu. In this situation, this nr_ctx check could violate the QUEUE_FLAG_SAME_FORCE, as a result, check on cpu id when there is only one ctx mapping. Signed-off-by: Ed Tsai <ed.tsai@mediatek.com> Signed-off-by: Po-Wen Kao <powen.kao@mediatek.com> Suggested-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230614002529.6636-1-ed.tsai@mediatek.com [axboe: fixed up indentation] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-14 11:11:25 -06:00
Yu Kuai	4f1731df60	blk-mq: fix potential io hang by wrong 'wake_batch' In __blk_mq_tag_busy/idle(), updating 'active_queues' and calculating 'wake_batch' is not atomic: t1: t2: _blk_mq_tag_busy blk_mq_tag_busy inc active_queues // assume 1->2 inc active_queues // 2 -> 3 blk_mq_update_wake_batch // calculate based on 3 blk_mq_update_wake_batch /* calculate based on 2, while active_queues is actually 3. */ Fix this problem by protecting them with 'tags->lock'; this is not a hot path, so performance should not be a concern. And now that all writers are inside the lock, switch 'active_queues' from atomic to unsigned int. Fixes: `180dccb0db` ("blk-mq: fix tag_get wait task can't be awakened") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230610023043.2559121-1-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-12 09:55:53 -06:00
Christoph Hellwig	ee3249a8ce	block: store the holder in file->private_data Store the file struct used as the holder in file->private_data as an indicator that this file descriptor was opened exclusively to remove the last use of FMODE_EXCL. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20230608110258.189493-30-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-12 08:04:05 -06:00
Christoph Hellwig	4e762d8623	block: always use I_BDEV on file->f_mapping->host to find the bdev Always use I_BDEV(file->f_mapping->host) to find the bdev for a file to free up file->private_data for other uses. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/r/20230608110258.189493-29-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-12 08:04:05 -06:00
Christoph Hellwig	05bdb99653	block: replace fmode_t with a block-specific type for block open flags The only overlap between the block open flags mapped into the fmode_t and other uses of fmode_t are FMODE_READ and FMODE_WRITE. Define a new blk_mode_t instead for use in blkdev_get_by_{dev,path}, ->open and ->ioctl and stop abusing fmode_t. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Jack Wang <jinpu.wang@ionos.com> [rnbd] Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/r/20230608110258.189493-28-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-12 08:04:05 -06:00
Christoph Hellwig	5e4ea83467	block: remove unused fmode_t arguments from ioctl handlers A few ioctl handlers have fmode_t arguments that are entirely unused, remove them. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20230608110258.189493-27-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-12 08:04:04 -06:00
Christoph Hellwig	cfb425761c	block: move a few internal definitions out of blkdev.h All these helpers are only used in core block code, so move them out of the public header. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/r/20230608110258.189493-26-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-12 08:04:04 -06:00
Christoph Hellwig	1991299e49	scsi: replace the fmode_t argument to ->sg_io_fn with a simple bool Instead of passing a fmode_t and only checking it for FMODE_WRITE, pass a bool open_for_write to prepare for callers that won't have the fmode_t. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/r/20230608110258.189493-21-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-12 08:04:04 -06:00
Christoph Hellwig	2736e8eeb0	block: use the holder as indication for exclusive opens The current interface for exclusive opens is rather confusing as it requires both the FMODE_EXCL flag and a holder. Remove the need to pass FMODE_EXCL and just key off the exclusive open off a non-NULL holder. For blkdev_put this requires adding the holder argument, which provides better debug checking that only the holder actually releases the hold, but at the same time allows removing the now superfluous mode argument. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Christian Brauner <brauner@kernel.org> Acked-by: David Sterba <dsterba@suse.com> [btrfs] Acked-by: Jack Wang <jinpu.wang@ionos.com> [rnbd] Link: https://lore.kernel.org/r/20230608110258.189493-16-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-12 08:04:04 -06:00
Christoph Hellwig	7ee34cbc29	block: rename blkdev_close to blkdev_release Make the function name match the method name. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/r/20230608110258.189493-11-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-12 08:04:04 -06:00
Christoph Hellwig	ae220766d8	block: remove the unused mode argument to ->release The mode argument to the ->release block_device_operation is never used, so remove it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Christian Brauner <brauner@kernel.org> Acked-by: Jack Wang <jinpu.wang@ionos.com> [rnbd] Link: https://lore.kernel.org/r/20230608110258.189493-10-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-12 08:04:04 -06:00
Christoph Hellwig	d32e2bf837	block: pass a gendisk to ->open ->open is only called on the whole device. Make that explicit by passing a gendisk instead of the block_device. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Christian Brauner <brauner@kernel.org> Acked-by: Jack Wang <jinpu.wang@ionos.com> [rnbd] Link: https://lore.kernel.org/r/20230608110258.189493-9-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-12 08:04:04 -06:00
Christoph Hellwig	444aa2c58c	block: pass a gendisk on bdev_check_media_change bdev_check_media_change should only ever be called for the whole device. Pass a gendisk to make that explicit and rename the function to disk_check_media_change. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/r/20230608110258.189493-8-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-12 08:04:03 -06:00
Christoph Hellwig	9d1c92872e	block: also call ->open for incremental partition opens For whole devices ->open is called for each open, but for partitions it is only called on the first open of a partition, e.g.: open("/dev/vdb", ...) open("/dev/vdb", ...) - 2 call to ->open open("/dev/vdb1", ...) open("/dev/vdb", ...) - 2 call to ->open open("/dev/vdb", ...) open("/dev/vdb", ...) - just open call to ->open This is problematic as various block drivers look at open flags and might not do all the required setup if the earlier open was with an odd flag like O_NDELAY or the magic 3 ioctl-only open mode. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Phillip Potter <phil@philpotter.co.uk> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/r/20230608110258.189493-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-12 08:04:03 -06:00
Ming Lei	20cb1c2fb7	blk-cgroup: Flush stats before releasing blkcg_gq As noted by Michal, the blkg_iostat_set's in the lockless list hold reference to blkg's to protect against their removal. Those blkg's hold reference to blkcg. When a cgroup is being destroyed, cgroup_rstat_flush() is only called at css_release_work_fn() which is called when the blkcg reference count reaches 0. This circular dependency will prevent blkcg and some blkgs from being freed after they are made offline. It is less a problem if the cgroup to be destroyed also has other controllers like memory that will call cgroup_rstat_flush() which will clean up the reference count. If block is the only controller that uses rstat, these offline blkcg and blkgs may never be freed leaking more and more memory over time. To prevent this potential memory leak: - flush blkcg per-cpu stats list in __blkg_release(), when no new stat can be added - add global blkg_stat_lock for covering concurrent parent blkg stat update - don't grab bio->bi_blkg reference when adding the stats into blkcg's per-cpu stat list since all stats are guaranteed to be consumed before releasing blkg instance, and grabbing blkg reference for stats was the most fragile part of original patch Based on Waiman's patch: https://lore.kernel.org/linux-block/20221215033132.230023-3-longman@redhat.com/ Fixes: `3b8cc62987` ("blk-cgroup: Optimize blkcg_rstat_flush()") Cc: stable@vger.kernel.org Reported-by: Jay Shin <jaeshin@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Waiman Long <longman@redhat.com> Cc: mkoutny@suse.com Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230609234249.1412858-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-11 19:49:29 -06:00
Christoph Hellwig	3c435a0fe3	filemap: add a kiocb_write_and_wait helper Factor out a helper that does filemap_write_and_wait_range for the range covered by a read kiocb, or returns -EAGAIN if the kiocb is marked as nowait and there would be pages to write. Link: https://lkml.kernel.org/r/20230601145904.1385409-5-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Darrick J. Wong <djwong@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Anna Schumaker <anna@kernel.org> Cc: Chao Yu <chao@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Miklos Szeredi <mszeredi@redhat.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Xiubo Li <xiubli@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-06-09 16:25:52 -07:00
Christoph Hellwig	bb91a7d96a	block: fix rootwait= again The previous rootwait fix added an -EINVAL return to a completely bogus superfluous branch, fix this. Fixes: `1341c7d2cc` ("block: fix rootwait=") Reported-by: Mark Brown <broonie@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Fabio Estevam <festevam@gmail.com> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> Tested-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/r/20230609051737.328930-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-09 11:38:10 -06:00
Christoph Hellwig	1341c7d2cc	block: fix rootwait= Failures to look up the gendisk must return -ENODEV so that rootwait retries the lookup instead of -EINVAL which exits early. Fixes: `cf056a4312` ("init: improve the name_to_dev_t interface") Reported-by: Fabio Estevam <festevam@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Fabio Estevam <festevam@gmail.com> Link: https://lore.kernel.org/r/20230607135746.92995-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-07 08:00:14 -06:00
Waiman Long	3d2af77e31	blk-cgroup: Reinit blkg_iostat_set after clearing in blkcg_reset_stats() When blkg_alloc() is called to allocate a blkcg_gq structure with the associated blkg_iostat_set's, there are 2 fields within blkg_iostat_set that require proper initialization - blkg & sync. The former field was introduced by commit `3b8cc62987` ("blk-cgroup: Optimize blkcg_rstat_flush()") while the latter one was introduced by commit `f733164829` ("blk-cgroup: reimplement basic IO stats using cgroup rstat"). Unfortunately those fields in the blkg_iostat_set's are not properly re-initialized when they are cleared in v1's blkcg_reset_stats(). This can lead to a kernel panic due to NULL pointer access of the blkg pointer. The missing initialization of sync is less problematic and can be a problem in a debug kernel due to missing lockdep initialization. Fix these problems by re-initializing them after memory clearing. Fixes: `3b8cc62987` ("blk-cgroup: Optimize blkcg_rstat_flush()") Fixes: `f733164829` ("blk-cgroup: reimplement basic IO stats using cgroup rstat") Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230606180724.2455066-1-longman@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-07 07:51:34 -06:00
Yu Kuai	a7cfa0af0c	blk-ioc: fix recursive spin_lock/unlock_irq() in ioc_clear_queue() Recursive spin_lock/unlock_irq() is not safe, because spin_unlock_irq() will enable irq unconditionally: spin_lock_irq queue_lock -> disable irq spin_lock_irq ioc->lock spin_unlock_irq ioc->lock -> enable irq /* * AA dead lock will be triggered if current context is preempted by irq, * and irq try to hold queue_lock again. */ spin_unlock_irq queue_lock Fix this problem by using spin_lock/unlock() directly for 'ioc->lock'. Fixes: `5a0ac57c48` ("blk-ioc: protect ioc_destroy_icq() by 'queue_lock'") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230606011438.3743440-1-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-07 07:51:00 -06:00
Hou Tao	ddf63516d8	blk-ioprio: Introduce promote-to-rt policy Since commit `a78418e6a0` ("block: Always initialize bio IO priority on submit"), bio->bi_ioprio will never be IOPRIO_CLASS_NONE when calling blkcg_set_ioprio(), so there will be no way to promote the io-priority of one cgroup to IOPRIO_CLASS_RT, because bi_ioprio will always be greater than or equals to IOPRIO_CLASS_RT. It seems possible to call blkcg_set_ioprio() first then try to initialize bi_ioprio later in bio_set_ioprio(), but this doesn't work for bio in which bi_ioprio is already initialized (e.g., direct-io), so introduce a new promote-to-rt policy to promote the iopriority of bio to IOPRIO_CLASS_RT if the ioprio is not already RT. For none-to-rt policy, although it doesn't work now, but considering that its purpose was also to override the io-priority to RT and allowing for a smoother transition, just keep it and treat it as an alias of the promote-to-rt policy. Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Hou Tao <houtao1@huawei.com> Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com> Link: https://lore.kernel.org/r/20230428074404.280532-1-houtao@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-06 22:26:26 -06:00
Li Nan	8d21155467	blk-iocost: use spin_lock_irqsave in adjust_inuse_and_calc_cost adjust_inuse_and_calc_cost() use spin_lock_irq() and IRQ will be enabled when unlock. DEADLOCK might happen if we have held other locks and disabled IRQ before invoking it. Fix it by using spin_lock_irqsave() instead, which can keep IRQ state consistent with before when unlock. ================================ WARNING: inconsistent lock state 5.10.0-02758-g8e5f91fd772f #26 Not tainted -------------------------------- inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage. kworker/2:3/388 [HC0[0]:SC0[0]:HE0:SE1] takes: ffff888118c00c28 (&bfqd->lock){?.-.}-{2:2}, at: spin_lock_irq ffff888118c00c28 (&bfqd->lock){?.-.}-{2:2}, at: bfq_bio_merge+0x141/0x390 {IN-HARDIRQ-W} state was registered at: __lock_acquire+0x3d7/0x1070 lock_acquire+0x197/0x4a0 __raw_spin_lock_irqsave _raw_spin_lock_irqsave+0x3b/0x60 bfq_idle_slice_timer_body bfq_idle_slice_timer+0x53/0x1d0 __run_hrtimer+0x477/0xa70 __hrtimer_run_queues+0x1c6/0x2d0 hrtimer_interrupt+0x302/0x9e0 local_apic_timer_interrupt __sysvec_apic_timer_interrupt+0xfd/0x420 run_sysvec_on_irqstack_cond sysvec_apic_timer_interrupt+0x46/0xa0 asm_sysvec_apic_timer_interrupt+0x12/0x20 irq event stamp: 837522 hardirqs last enabled at (837521): [<ffffffff84b9419d>] __raw_spin_unlock_irqrestore hardirqs last enabled at (837521): [<ffffffff84b9419d>] _raw_spin_unlock_irqrestore+0x3d/0x40 hardirqs last disabled at (837522): [<ffffffff84b93fa3>] __raw_spin_lock_irq hardirqs last disabled at (837522): [<ffffffff84b93fa3>] _raw_spin_lock_irq+0x43/0x50 softirqs last enabled at (835852): [<ffffffff84e00558>] __do_softirq+0x558/0x8ec softirqs last disabled at (835845): [<ffffffff84c010ff>] asm_call_irq_on_stack+0xf/0x20 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&bfqd->lock); <Interrupt> lock(&bfqd->lock); * DEADLOCK * 3 locks held by kworker/2:3/388: #0: ffff888107af0f38 ((wq_completion)kthrotld){+.+.}-{0:0}, at: 
process_one_work+0x742/0x13f0 #1: ffff8881176bfdd8 ((work_completion)(&td->dispatch_work)){+.+.}-{0:0}, at: process_one_work+0x777/0x13f0 #2: ffff888118c00c28 (&bfqd->lock){?.-.}-{2:2}, at: spin_lock_irq #2: ffff888118c00c28 (&bfqd->lock){?.-.}-{2:2}, at: bfq_bio_merge+0x141/0x390 stack backtrace: CPU: 2 PID: 388 Comm: kworker/2:3 Not tainted 5.10.0-02758-g8e5f91fd772f #26 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 Workqueue: kthrotld blk_throtl_dispatch_work_fn Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x107/0x167 print_usage_bug valid_state mark_lock_irq.cold+0x32/0x3a mark_lock+0x693/0xbc0 mark_held_locks+0x9e/0xe0 __trace_hardirqs_on_caller lockdep_hardirqs_on_prepare.part.0+0x151/0x360 trace_hardirqs_on+0x5b/0x180 __raw_spin_unlock_irq _raw_spin_unlock_irq+0x24/0x40 spin_unlock_irq adjust_inuse_and_calc_cost+0x4fb/0x970 ioc_rqos_merge+0x277/0x740 __rq_qos_merge+0x62/0xb0 rq_qos_merge bio_attempt_back_merge+0x12c/0x4a0 blk_mq_sched_try_merge+0x1b6/0x4d0 bfq_bio_merge+0x24a/0x390 __blk_mq_sched_bio_merge+0xa6/0x460 blk_mq_sched_bio_merge blk_mq_submit_bio+0x2e7/0x1ee0 __submit_bio_noacct_mq+0x175/0x3b0 submit_bio_noacct+0x1fb/0x270 blk_throtl_dispatch_work_fn+0x1ef/0x2b0 process_one_work+0x83e/0x13f0 process_scheduled_works worker_thread+0x7e3/0xd80 kthread+0x353/0x470 ret_from_fork+0x1f/0x30 Fixes: `b0853ab4a2` ("blk-iocost: revamp in-period donation snapbacks") Signed-off-by: Li Nan <linan122@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20230527091904.3001833-1-linan666@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-05 12:08:32 -06:00
Christoph Hellwig	2577f53f42	block: mark early_lookup_bdev as __init early_lookup_bdev is now only used during the early boot code as it should, so mark it __init to not waste run time memory on it. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230531125535.676098-25-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-05 11:10:14 -06:00
Christoph Hellwig	7cadcaf1d8	block: move more code to early-lookup.c blk_lookup_devt is only used by code in early-lookup.c, so move it there. printk_all_partitions and its helper bdevt_str are only used by the early init code in init/do_mounts.c, so they should go there as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230531125535.676098-17-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-05 10:57:40 -06:00
Christoph Hellwig	702f3189e4	block: move the code to do early boot lookup of block devices to block/ Create a new block/early-lookup.c to keep the early block device lookup code instead of having this code sit with the early mount code. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230531125535.676098-16-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-05 10:57:40 -06:00
Christoph Hellwig	f55e017c64	block: add a mark_dead holder operation Add a mark_dead method to blk_holder_ops that is called from blk_mark_disk_dead to notify the holder that the block device it is using has been marked dead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Christian Brauner <brauner@kernel.org> Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Link: https://lore.kernel.org/r/20230601094459.1350643-11-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-05 10:53:04 -06:00
Christoph Hellwig	0718afd47f	block: introduce holder ops Add a new blk_holder_ops structure, which is passed to blkdev_get_by_* and installed in the block_device for exclusive claims. It will be used to allow the block layer to call back into the user of the block device for things like notification of a removed device or a device resize. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Link: https://lore.kernel.org/r/20230601094459.1350643-10-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-05 10:53:04 -06:00
Christoph Hellwig	00080f7fb7	block: remove blk_drop_partitions There is only a single caller left, so fold the loop into that. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Link: https://lore.kernel.org/r/20230601094459.1350643-9-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-05 10:53:04 -06:00
Christoph Hellwig	eec1be4c30	block: delete partitions later in del_gendisk Delay dropping the block_devices for partitions in del_gendisk until after the call to blk_mark_disk_dead, so that we can implement notification of removed devices in blk_mark_disk_dead. This requires splitting a lower-level drop_partition helper out of delete_partition and using that from del_gendisk, while having a common loop for the whole device and partitions that calls remove_inode_hash, fsync_bdev and __invalidate_device before the call to blk_mark_disk_dead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Link: https://lore.kernel.org/r/20230601094459.1350643-8-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-05 10:53:04 -06:00
Christoph Hellwig	69f90b70bd	block: unhash the inode earlier in delete_partition Move the call to remove_inode_hash to the beginning of delete_partition, as we want to prevent opening a block_device that is about to be removed ASAP. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Link: https://lore.kernel.org/r/20230601094459.1350643-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-05 10:53:04 -06:00
Christoph Hellwig	a4f75764d1	block: avoid repeated work in blk_mark_disk_dead Check if GD_DEAD is already set in blk_mark_disk_dead, and don't duplicate the work already done. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Christian Brauner <brauner@kernel.org> Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Link: https://lore.kernel.org/r/20230601094459.1350643-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-05 10:53:04 -06:00
Christoph Hellwig	66fddc25fe	block: consolidate the shutdown logic in blk_mark_disk_dead and del_gendisk blk_mark_disk_dead does very similar work as a section of del_gendisk: - set the GD_DEAD flag - set the capacity to zero - start a queue drain but del_gendisk also sets QUEUE_FLAG_DYING on the queue if it is owned by the disk, sets the capacity to zero before starting the drain, and both with sending a uevent and kernel message for this fake capacity change. Move the exact logic from the more heavily used del_gendisk into blk_mark_disk_dead and then call blk_mark_disk_dead from del_gendisk. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Link: https://lore.kernel.org/r/20230601094459.1350643-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-05 10:53:04 -06:00
Christoph Hellwig	74e6464a98	block: turn bdev_lock into a mutex There is no reason for this lock to spin, and being able to sleep under it will come in handy soon. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Christian Brauner <brauner@kernel.org> Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Link: https://lore.kernel.org/r/20230601094459.1350643-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-05 10:53:03 -06:00
Christoph Hellwig	ae5f855ead	block: refactor bd_may_claim The long if/else chain obfuscates the actual logic. Tidy it up to be more structured. Also drop the whole argument, as it can be trivially derived from bdev using bdev_whole, and having the bdev_whole in the function makes it easier to follow. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Link: https://lore.kernel.org/r/20230601094459.1350643-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-05 10:53:03 -06:00
Christoph Hellwig	0783b1a7cb	block: factor out a bd_end_claim helper from blkdev_put Move all the logic to release an exclusive claim into a helper. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Christian Brauner <brauner@kernel.org> Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Link: https://lore.kernel.org/r/20230601094459.1350643-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-05 10:53:03 -06:00
Tian Lan	ddad59331a	blk-mq: fix blk_mq_hw_ctx active request accounting The nr_active counter continues to increase over time which causes the blk_mq_get_tag to hang until the thread is rescheduled to a different core despite there are still tags available. kernel-stack INFO: task inboundIOReacto:3014879 blocked for more than 2 seconds Not tainted 6.1.15-amd64 #1 Debian 6.1.15~debian11 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:inboundIOReacto state:D stack:0 pid:3014879 ppid:4557 flags:0x00000000 Call Trace: <TASK> __schedule+0x351/0xa20 scheduler+0x5d/0xe0 io_schedule+0x42/0x70 blk_mq_get_tag+0x11a/0x2a0 ? dequeue_task_stop+0x70/0x70 __blk_mq_alloc_requests+0x191/0x2e0 kprobe output showing RQF_MQ_INFLIGHT bit is not cleared before __blk_mq_free_request being called. 320 320 kworker/29:1H __blk_mq_free_request rq_flags 0x220c0 in-flight 1 b'__blk_mq_free_request+0x1 [kernel]' b'bt_iter+0x50 [kernel]' b'blk_mq_queue_tag_busy_iter+0x318 [kernel]' b'blk_mq_timeout_work+0x7c [kernel]' b'process_one_work+0x1c4 [kernel]' b'worker_thread+0x4d [kernel]' b'kthread+0xe6 [kernel]' b'ret_from_fork+0x1f [kernel]' Signed-off-by: Tian Lan <tian.lan@twosigma.com> Fixes: `2e315dc07d` ("blk-mq: grab rq->refcount before calling ->fn in blk_mq_tagset_busy_iter") Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230513221227.497327-1-tilan7663@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-03 17:20:00 -06:00
Azeem Shaikh	20d099756b	block: Replace all non-returning strlcpy with strscpy strlcpy() reads the entire source buffer first. This read may exceed the destination size limit. This is both inefficient and can lead to linear read overflows if a source string is not NUL-terminated [1]. In an effort to remove strlcpy() completely [2], replace strlcpy() here with strscpy(). No return values were used, so direct replacement is safe. [1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy [2] https://github.com/KSPP/linux/issues/89 Signed-off-by: Azeem Shaikh <azeemshaikh38@gmail.com> Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20230530155608.272266-1-azeemshaikh38@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-01 09:13:31 -06:00
Yu Kuai	5a0ac57c48	blk-ioc: protect ioc_destroy_icq() by 'queue_lock' Currently, icq is tracked by both request_queue(icq->q_node) and task(icq->ioc_node), and ioc_clear_queue() from elevator exit is not safe because it can access the list without protection: ioc_clear_queue ioc_release_fn lock queue_lock list_splice /* move queue list to a local list */ unlock queue_lock /* lock is released, the local list can be accessed through task exit. */ lock ioc->lock while (!hlist_empty) icq = hlist_entry lock queue_lock ioc_destroy_icq delete icq->ioc_node while (!list_empty) icq = list_entry() list_del icq->q_node /* This is not protected by any lock, list_entry concurrent with list_del is not safe. */ unlock queue_lock unlock ioc->lock Fix this problem by protecting list 'icq->q_node' by queue_lock from ioc_clear_queue(). Reported-and-tested-by: Pradeep Pragallapati <quic_pragalla@quicinc.com> Link: https://lore.kernel.org/lkml/20230517084434.18932-1-quic_pragalla@quicinc.com/ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230531073435.2923422-1-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-01 09:13:31 -06:00
Johannes Thumshirn	7a150f1ed1	block: add bio_add_folio_nofail Just like for bio_add_pages() add a no-fail variant for bio_add_folio(). Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/924dff4077812804398ef84128fb920507fa4be1.1685532726.git.johannes.thumshirn@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-01 09:13:31 -06:00
Thomas Weißschuh	a378f6a40f	block: constify the whole_disk device_attribute The struct is never modified so it can be const. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Link: https://lore.kernel.org/r/20230419-const-partition-v3-4-4e14e48be367@weissschuh.net Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-05-30 11:12:43 -06:00
Thomas Weißschuh	0bd478005c	block: constify struct part_attr_group The struct is never modified so it can be const. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Link: https://lore.kernel.org/r/20230419-const-partition-v3-3-4e14e48be367@weissschuh.net Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-05-30 11:12:43 -06:00
Thomas Weißschuh	cdb37f73cf	block: constify struct part_type part_type The struct is never modified so it can be const. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Link: https://lore.kernel.org/r/20230419-const-partition-v3-2-4e14e48be367@weissschuh.net Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-05-30 11:12:43 -06:00
Thomas Weißschuh	539050f92e	block: constify partition prober array The array is never modified so it can be const. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202304191640.SkNk7kVN-lkp@intel.com/ Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Link: https://lore.kernel.org/r/20230419-const-partition-v3-1-4e14e48be367@weissschuh.net Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-05-30 11:12:43 -06:00
Damien Le Moal	47fe1c3064	block: fix revalidate performance regression The scsi driver function sd_read_block_characteristics() always calls disk_set_zoned() to set a disk zoned model correctly, in case the device model changed. This is done even for regular disks to set the zoned model to BLK_ZONED_NONE and free any zone related resources if the drive previously was zoned. This behavior significantly impacts the time it takes to revalidate disks on a large system as the call to disk_clear_zone_settings() done from disk_set_zoned() for the BLK_ZONED_NONE case results in the device request queue being frozen, even if there are no zone resources to free. Avoid this overhead for non-zoned devices by not calling disk_clear_zone_settings() in disk_set_zoned() if the device model was already set to BLK_ZONED_NONE, which is always the case for regular devices. Reported by: Brian Bunker <brian@purestorage.com> Fixes: `508aebb805` ("block: introduce blk_queue_clear_zone_settings()") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230529073237.1339862-1-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-05-29 08:40:32 -06:00
David Howells	403b6fb8da	block: convert bio_map_user_iov to use iov_iter_extract_pages This will pin pages or leave them unaltered rather than getting a ref on them as appropriate to the iterator. The pages need to be pinned for DIO rather than having refs taken on them to prevent VM copy-on-write from malfunctioning during a concurrent fork() (the result of the I/O could otherwise end up being visible to/affected by the child process). Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Hubbard <jhubbard@nvidia.com> cc: Al Viro <viro@zeniv.linux.org.uk> cc: Jens Axboe <axboe@kernel.dk> cc: Jan Kara <jack@suse.cz> cc: Matthew Wilcox <willy@infradead.org> cc: Logan Gunthorpe <logang@deltatee.com> cc: linux-block@vger.kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230522205744.2825689-7-dhowells@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-05-24 08:42:44 -06:00
David Howells	a7e689dd1c	block: Convert bio_iov_iter_get_pages to use iov_iter_extract_pages This will pin pages or leave them unaltered rather than getting a ref on them as appropriate to the iterator. The pages need to be pinned for DIO rather than having refs taken on them to prevent VM copy-on-write from malfunctioning during a concurrent fork() (the result of the I/O could otherwise end up being affected by/visible to the child process). Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Hubbard <jhubbard@nvidia.com> cc: Al Viro <viro@zeniv.linux.org.uk> cc: Jens Axboe <axboe@kernel.dk> cc: Jan Kara <jack@suse.cz> cc: Matthew Wilcox <willy@infradead.org> cc: Logan Gunthorpe <logang@deltatee.com> cc: linux-block@vger.kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230522205744.2825689-6-dhowells@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-05-24 08:42:44 -06:00
David Howells	fd363244e8	block: Add BIO_PAGE_PINNED and associated infrastructure Add BIO_PAGE_PINNED to indicate that the pages in a bio are pinned (FOLL_PIN) and that the pin will need removing. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Hubbard <jhubbard@nvidia.com> cc: Al Viro <viro@zeniv.linux.org.uk> cc: Jens Axboe <axboe@kernel.dk> cc: Jan Kara <jack@suse.cz> cc: Matthew Wilcox <willy@infradead.org> cc: Logan Gunthorpe <logang@deltatee.com> cc: linux-block@vger.kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230522205744.2825689-5-dhowells@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-05-24 08:42:44 -06:00
Christoph Hellwig	e51bab4e20	block: Replace BIO_NO_PAGE_REF with BIO_PAGE_REFFED with inverted logic Replace BIO_NO_PAGE_REF with a BIO_PAGE_REFFED flag that has the inverted meaning and is only set when a page reference has been acquired that needs to be released by bio_release_pages(). Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> cc: Al Viro <viro@zeniv.linux.org.uk> cc: Jens Axboe <axboe@kernel.dk> cc: Jan Kara <jack@suse.cz> cc: Matthew Wilcox <willy@infradead.org> cc: Logan Gunthorpe <logang@deltatee.com> cc: linux-block@vger.kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230522205744.2825689-4-dhowells@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-05-24 08:42:44 -06:00
Jens Axboe	bbeb087e5a	Merge branch 'for-6.5/splice' into for-6.5/block Merge splice bits as subsequent block cleanups and improvements for DIO depend on them. * for-6.5/splice: (31 commits) splice: kdoc for filemap_splice_read() and copy_splice_read() iov_iter: Kill ITER_PIPE splice: Remove generic_file_splice_read() splice: Use filemap_splice_read() instead of generic_file_splice_read() cifs: Use filemap_splice_read() trace: Convert trace/seq to use copy_splice_read() zonefs: Provide a splice-read wrapper xfs: Provide a splice-read wrapper orangefs: Provide a splice-read wrapper ocfs2: Provide a splice-read wrapper ntfs3: Provide a splice-read wrapper nfs: Provide a splice-read wrapper f2fs: Provide a splice-read wrapper ext4: Provide a splice-read wrapper ecryptfs: Provide a splice-read wrapper ceph: Provide a splice-read wrapper afs: Provide a splice-read wrapper 9p: Add splice_read wrapper net: Make sock_splice_read() use copy_splice_read() by default tty, proc, kernfs, random: Use copy_splice_read() ...	2023-05-24 08:42:22 -06:00
David Howells	2cb1e08985	splice: Use filemap_splice_read() instead of generic_file_splice_read() Replace pointers to generic_file_splice_read() with calls to filemap_splice_read(). Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> cc: Jens Axboe <axboe@kernel.dk> cc: Al Viro <viro@zeniv.linux.org.uk> cc: David Hildenbrand <david@redhat.com> cc: John Hubbard <jhubbard@nvidia.com> cc: linux-mm@kvack.org cc: linux-block@vger.kernel.org cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20230522135018.2742245-29-dhowells@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-05-24 08:42:17 -06:00
Hengqi Chen	5a80bd075f	block: introduce block_io_start/block_io_done tracepoints Currently, several BCC ([0]) tools (biosnoop/biostacks/biotop) use kprobes to blk_account_io_start/blk_account_io_done to implement their functionalities. This is fragile because the target kernel functions may be renamed ([1]) or inlined ([2]). So introduce two new tracepoints for such use cases. [0]: https://github.com/iovisor/bcc [1]: https://github.com/iovisor/bcc/issues/3954 [2]: https://github.com/iovisor/bcc/issues/4261 Tested-by: Francis Laniel <flaniel@linux.microsoft.com> Signed-off-by: Hengqi Chen <hengqi.chen@gmail.com> Tested-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20230520084057.1467003-1-hengqi.chen@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-05-24 08:38:59 -06:00
Christoph Hellwig	3eb96946f0	block: make bio_check_eod work for zero sized devices Since the dawn of time bio_check_eod has a check for a non-zero size of the device. This doesn't really make any sense as we never want to send I/O to a device that's been set to zero size, or never moved out of that. I am a bit surprised we haven't caught this for a long time, but the removal of the extra validation inside of zram caused syzbot to trip over this issue recently. I've added a Fixes tag for that commit, but the issue really goes back way before git history. Fixes: `9fe95babc7` ("zram: remove valid_io_request") Reported-by: syzbot+b8d61a58b7c7ebd2c8e0@syzkaller.appspotmail.com Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230524060538.1593686-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-05-24 08:19:26 -06:00
Yu Kuai	a13bd91be2	block/rq_qos: protect rq_qos apis with a new lock Commit `50e34d7881` ("block: disable the elevator int del_gendisk") moved rq_qos_exit() from disk_release() to del_gendisk(), which introduces some problems: 1) If rq_qos_add() is triggered by enabling iocost/iolatency through cgroupfs, it can run concurrently with del_gendisk(), and it's not safe to write 'q->rq_qos' concurrently. 2) Activating a cgroup policy that relies on rq_qos will call rq_qos_add() and blkcg_activate_policy(), and if rq_qos_exit() is called in the middle, a null-ptr-dereference will be triggered in blkcg_activate_policy(). 3) blkg_conf_open_bdev() can call blkdev_get_no_open() first to find the disk; if rq_qos_exit() from del_gendisk() is then done before rq_qos_add(), memory will be leaked. This patch adds a new disk level mutex 'rq_qos_mutex': 1) The lock will protect rq_qos_exit() directly. 2) For wbt, which doesn't rely on blk-cgroup, rq_qos_add() can only be called from disk initialization for now because wbt can't be destructed until rq_qos_exit(), so it's safe not to protect wbt for now. However, in case dynamic destruction of rq_qos is supported in the future, this patch also protects rq_qos_add() from wbt_init() directly; this is enough because blk-sysfs already synchronizes writers with disk removal. 3) For iocost and iolatency, in order to synchronize disk removal and cgroup configuration, the lock is held after blkdev_get_no_open() from blkg_conf_open_bdev(), and is released in blkg_conf_exit(). In order to fix the above memory leak, disk_live() is checked after holding the new lock. Fixes: `50e34d7881` ("block: disable the elevator int del_gendisk") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230414084008.2085155-1-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-05-23 11:13:19 -06:00
Anuj Gupta	46930b7cc7	block: fix bio-cache for passthru IO commit <8af870aa5b847> ("block: enable bio caching use for passthru IO") introduced bio-cache for passthru IO. In case when nr_vecs are greater than BIO_INLINE_VECS, bio and bvecs are allocated from mempool (instead of percpu cache) and REQ_ALLOC_CACHE is cleared. This causes the side effect of not freeing bio/bvecs into mempool on completion. This patch lets the passthru IO fallback to allocation using bio_kmalloc when nr_vecs are greater than BIO_INLINE_VECS. The corresponding bio is freed during call to blk_mq_map_bio_put during completion. Cc: stable@vger.kernel.org # 6.1 fixes <8af870aa5b847> ("block: enable bio caching use for passthru IO") Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20230523111709.145676-1-anuj20.g@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-05-23 11:11:29 -06:00
Tian Lan	3e94d54e83	blk-mq: fix race condition in active queue accounting If multiple CPUs are sharing the same hardware queue, it can cause leak in the active queue counter tracking when __blk_mq_tag_busy() is executed simultaneously. Fixes: `ee78ec1077` ("blk-mq: blk_mq_tag_busy is no need to return a value") Signed-off-by: Tian Lan <tian.lan@twosigma.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: John Garry <john.g.garry@oracle.com> Link: https://lore.kernel.org/r/20230522210555.794134-1-tilan7663@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-05-23 11:10:16 -06:00
Yu Kuai	8a2b20a997	blk-wbt: fix that wbt can't be disabled by default Commit `b11d31ae01` ("blk-wbt: remove unnecessary check in wbt_enable_default()") removes the checking of CONFIG_BLK_WBT_MQ by mistake, which is used to control whether wbt is enabled or disabled by default. Fix the problem by adding back the check. This patch also does a little cleanup to make related code more readable. Fixes: `b11d31ae01` ("blk-wbt: remove unnecessary check in wbt_enable_default()") Reported-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Link: https://lore.kernel.org/lkml/CAKXUXMzfKq_J9nKHGyr5P5rvUETY4B-fxoQD4sO+NYjFOfVtZA@mail.gmail.com/t/ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230522121854.2928880-1-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-05-23 11:08:53 -06:00
Martin K. Petersen	8b60e2189f	Merge patch series "Add Command Duration Limits support" Niklas Cassel <nks@flawful.org> says: This series adds support for Command Duration Limits. The series is based on linux tag: v6.4-rc1 The series can also be found in git: https://github.com/floatious/linux/commits/cdl-v7 ================= CDL in ATA / SCSI ================= Command Duration Limits is defined in: T13 ATA Command Set - 5 (ACS-5) and T10 SCSI Primary Commands - 6 (SPC-6) respectively (a simpler version of CDL is defined in T10 SPC-5). CDL defines Duration Limits Descriptors (DLD). 7 DLDs for read commands and 7 DLDs for write commands. Simply put, a DLD contains a limit and a policy. A command can specify that a certain limit should be applied by setting the DLD index field (3 bits, so 0-7) in the command itself. The DLD index points to one of the 7 DLDs. DLD index 0 means no descriptor, so no limit. DLD index 1-7 means DLD 1-7. A DLD can have a few different policies, but the two major ones are: -Policy 0xF (abort), command will be completed with command aborted error (ATA) or status CHECK CONDITION (SCSI), with sense data indicating that the command timed out. -Policy 0xD (complete-unavailable), command will be completed without error (ATA) or status GOOD (SCSI), with sense data indicating that the command timed out. Note that the command will not have transferred any data to/from the device when the command timed out, even though the command returned success. Regardless of the CDL policy, in case of a CDL timeout, the I/O will result in a -ETIME error to user-space. The DLDs are defined in the CDL log page(s) and are readable and writable. Reading and writing the CDL DLDs are outside the scope of the kernel. 
If a user wants to read or write the descriptors, they can do so using a user-space application that sends passthrough commands, such as cdl-tools: https://github.com/westerndigitalcorporation/cdl-tools ================================ The introduction of ioprio hints ================================ What the kernel does provide, is a method to let I/O use one of the CDL DLDs defined in the device. Note that the kernel will simply forward the DLD index to the device, so the kernel currently does not know, nor does it need to know, how the DLDs are defined inside the device. The way that the CDL DLD index is supplied to the kernel is by introducing a new 10 bit "ioprio hint" field within the existing 16 bit ioprio definition. Currently, only 6 out of the 16 ioprio bits are in use, the remaining 10 bits are unused, and are currently explicitly disallowed to be set by the kernel. For now, we only add ioprio hints representing CDL DLD index 1-7. Additional ioprio hints for other QoS features could be defined in the future. A theoretical future work could be to make an I/O scheduler aware of these hints. E.g. for CDL, an I/O scheduler could make use of the duration limit in each descriptor, and take that information into account while scheduling commands. Right now, the ioprio hints will be ignored by the I/O schedulers. ============================== How to use CDL from user-space ============================== Since CDL is mutually exclusive with NCQ priority (see ncq_prio_enable and sas_ncq_prio_enable in Documentation/ABI/testing/sysfs-block-device), CDL has to be explicitly enabled using: echo 1 > /sys/block/$bdev/device/cdl_enable Since the ioprio hints are supplied through the existing I/O priority API, it should be simple for an application to make use of the ioprio hints. 
It simply has to reuse one of the new macros defined in include/uapi/linux/ioprio.h: IOPRIO_PRIO_HINT() or IOPRIO_PRIO_VALUE_HINT(), and supply one of the new hints defined in include/uapi/linux/ioprio.h: IOPRIO_HINT_DEV_DURATION_LIMIT_[1-7], which indicates that the I/O should use the corresponding CDL DLD index 1-7. By reusing the I/O priority API, the user can both define a DLD to use per AIO (io_uring sqe->ioprio or libaio iocb->aio_reqprio) or per-thread (ioprio_set()). ======= Testing ======= With the following fio patches: https://github.com/floatious/fio/commits/cdl fio adds support for ioprio hints, such that CDL can be tested using e.g.: fio --ioengine=io_uring --cmdprio_percentage=10 --cmdprio_hint=DLD_index A simple way to test is to use a DLD with a very short duration limit, and send large reads. Regardless of the CDL policy, in case of a CDL timeout, the I/O will result in a -ETIME error to user-space. We also provide a CDL test suite located in the cdl-tools repo, see: https://github.com/westerndigitalcorporation/cdl-tools#testing-a-system-command-duration-limits-support We have tested this patch series using: -real hardware -the following QEMU implementation: https://github.com/floatious/qemu/tree/cdl (NOTE: the QEMU implementation requires you to define the CDL policy at compile time, so you currently need to recompile QEMU when switching between policies.) =================== Further information =================== For further information about CDL, see Damien's slides: Presented at SDC 2021: https://www.snia.org/sites/default/files/SDC/2021/pdfs/SNIA-SDC21-LeMoal-Be-On-Time-command-duration-limits-Feature-Support-in%20Linux.pdf Presented at Lund Linux Con 2022: https://drive.google.com/file/d/1I6ChFc0h4JY9qZdO1bY5oCAdYCSZVqWw/view?usp=sharing ================ Changes since V6 ================ -Rebased series on v6.4-rc1. -Picked up Reviewed-by tags from Hannes (Thank you Hannes!) -Picked up Reviewed-by tag from Christoph (Thank you Christoph!) 
-Changed KernelVersion from 6.4 to 6.5 for new sysfs attributes. For older change logs, see previous patch series versions: https://lore.kernel.org/linux-scsi/20230406113252.41211-1-nks@flawful.org/ https://lore.kernel.org/linux-scsi/20230404182428.715140-1-nks@flawful.org/ https://lore.kernel.org/linux-scsi/20230309215516.3800571-1-niklas.cassel@wdc.com/ https://lore.kernel.org/linux-scsi/20230124190308.127318-1-niklas.cassel@wdc.com/ https://lore.kernel.org/linux-scsi/20230112140412.667308-1-niklas.cassel@wdc.com/ https://lore.kernel.org/linux-scsi/20221208105947.2399894-1-niklas.cassel@wdc.com/ Link: https://lore.kernel.org/r/20230511011356.227789-1-nks@flawful.org Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>	2023-05-22 17:09:51 -04:00
Damien Le Moal	dffc480d2d	scsi: block: Introduce BLK_STS_DURATION_LIMIT Introduce the new block I/O status BLK_STS_DURATION_LIMIT for LLDDs to report command that failed due to a command duration limit being exceeded. This new status is mapped to the ETIME error code to allow users to differentiate "soft" duration limit failures from other more serious hardware related errors. If we compare BLK_STS_DURATION_LIMIT with BLK_STS_TIMEOUT: -BLK_STS_DURATION_LIMIT means that the drive gave a reply indicating that the command duration limit was exceeded before the command could be completed. This I/O status is mapped to ETIME for user space. -BLK_STS_TIMEOUT means that the drive never gave a reply at all. This I/O status is mapped to ETIMEDOUT for user space. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Co-developed-by: Niklas Cassel <niklas.cassel@wdc.com> Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com> Link: https://lore.kernel.org/r/20230511011356.227789-4-nks@flawful.org Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>	2023-05-22 17:05:18 -04:00
Damien Le Moal	eca2040972	scsi: block: ioprio: Clean up interface definition The I/O priority user interface defines the 16-bits ioprio values as the combination of the upper 3-bits for an I/O priority class and the lower 13-bits as priority data. However, the kernel only uses the lower 3-bits of the priority data to define priority levels for the RT and BE priority classes. The data part of an ioprio value is completely ignored for the IDLE and NONE classes. This is enforced by checks done in ioprio_check_cap(), which is called for all paths that allow defining an I/O priority for I/Os: the per-context ioprio_set() system call, aio interface and io_uring interface. Clarify this fact in the uapi ioprio.h header file and introduce the IOPRIO_PRIO_LEVEL_MASK and IOPRIO_PRIO_LEVEL() macros for users to define and get priority levels in an ioprio value. The coarser macro IOPRIO_PRIO_DATA() is retained for backward compatibility with old applications already using it. There is no functional change introduced with this. In-kernel users of the IOPRIO_PRIO_DATA() macro which are explicitly handling I/O priority data as a priority level are modified to use the new IOPRIO_PRIO_LEVEL() macro without any functional change. Since f2fs is the only user of this macro not explicitly using that value as a priority level, it is left unchanged. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com> Link: https://lore.kernel.org/r/20230511011356.227789-2-nks@flawful.org Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>	2023-05-22 17:05:18 -04:00
Martin K. Petersen	7907ad748b	Merge patch series "Use block pr_ops in LIO" Mike Christie <michael.christie@oracle.com> says: The patches in this thread allow us to use the block pr_ops with LIO's target_core_iblock module to support cluster applications in VMs. They were built over Linus's tree. They also apply over linux-next and Martin's tree and Jens's trees. Currently, to use windows clustering or linux clustering (pacemaker + cluster labs scsi fence agents) in VMs with LIO and vhost-scsi, you have to use tcmu or pscsi or use a cluster aware FS/framework for the LIO pr file. Setting up a cluster FS/framework is pain and waste when your real backend device is already a distributed device, and pscsi and tcmu are nice for specific use cases, but iblock gives you the best performance and allows you to use stacked devices like dm-multipath. So these patches allow iblock to work like pscsi/tcmu where they can pass a PR command to the backend module. And then iblock will use the pr_ops to pass the PR command to the real devices similar to what we do for unmap today. The patches are separated in the following groups: Patch 1 - 2: - Add block layer callouts for reading reservations and rename reservation error code. Patch 3 - 5: - SCSI support for new callouts. Patch 6: - DM support for new callouts. Patch 7 - 13: - NVMe support for new callouts. Patch 14 - 18: - LIO support for new callouts. This patchset has been tested with the libiscsi PGR ops and with window's failover cluster verification test. Note that for scsi backend devices we need this patchset: https://lore.kernel.org/linux-scsi/20230123221046.125483-1-michael.christie@oracle.com/T/#m4834a643ffb5bac2529d65d40906d3cfbdd9b1b7 to handle UAs. To reduce the size of this patchset that's being done separately to make reviewing easier. And to make merging easier this patchset and the one above do not have any conflicts so can be merged in different trees. 
Link: https://lore.kernel.org/r/20230407200551.12660-1-michael.christie@oracle.com Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>	2023-05-22 16:35:02 -04:00
Christoph Hellwig	712c736465	block: don't plug in blkdev_write_iter For direct I/O writes that issue more than a single bio, the plugging is already done in __blkdev_direct_IO. For synchronous buffered writes the plugging is done deep down in writeback_inodes_wb / wb_writeback. For the other cases there is no point in plugging as a single bio or no bio at all is submitted. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230520044503.334444-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-05-20 05:42:41 -06:00
Loic Poulain	69baa3a623	block: Deny writable memory mapping if block is read-only User should not be able to write to a block device if it is read-only at block level (e.g. force_ro attribute). This is ensured in the regular fops write operation (blkdev_write_iter) but not when writing via user mapping (mmap), allowing user to actually write to a read-only block device via a PROT_WRITE mapping. Example: This can lead to integrity issues on an eMMC boot partition (e.g. mmcblk0boot0) which is read-only by default. To fix this issue, simply deny shared writable mapping if the block is read-only. Note: Block remains writable if switch to read-only is performed after the initial mapping, but this is expected behavior according to commit `a32e236eb9` ("Partially revert "block: fail op_is_write() requests to read-only partitions""). Signed-off-by: Loic Poulain <loic.poulain@linaro.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230510074223.991297-1-loic.poulain@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-05-19 20:17:10 -06:00
Bart Van Assche	f80dd11dd1	block: BFQ: Move an invariant check Check bfqq->dispatched for each BFQ queue instead of checking it for an invalid bfqq pointer. Fixes: `3e49c1e4a6` ("block: BFQ: Add several invariant checks") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20230519220347.3643295-1-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-05-19 19:59:52 -06:00
Christoph Hellwig	9a67aa52a4	blk-mq: don't use the requeue list to queue flush commands Currently both requeues of commands that were already sent to the driver and flush commands submitted from the flush state machine share the same requeue_list struct request_queue, despite requeues doing head insertions and flushes not. Switch to using two separate lists instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230519044050.107790-8-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-05-19 19:52:42 -06:00
Christoph Hellwig	1e82fadfc6	blk-mq: do not do head insertions post-pre-flush commands blk_flush_complete_seq currently queues requests that write data after a pre-flush from the flush state machine at the head of the queue. This doesn't really make sense, as the original request bypassed all queue lists by directly diverting to blk_insert_flush from blk_mq_submit_bio. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230519044050.107790-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-05-19 19:52:29 -06:00
Christoph Hellwig	615939a2ae	blk-mq: defer to the normal submission path for post-flush requests Requests with the FUA bit on hardware without FUA support need a post flush before returning to the caller, but they can still be sent using the normal I/O path after initializing the flush-related fields and end I/O handler. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20230519044050.107790-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-05-19 19:52:29 -06:00
Bart Van Assche	be4c427809	blk-mq: use the I/O scheduler for writes from the flush state machine Send write requests issued by the flush state machine through the normal I/O submission path including the I/O scheduler (if present) so that I/O scheduler policies are applied to writes with the FUA flag set. Separate the I/O scheduler members from the flush members in struct request since now a request may pass through both an I/O scheduler and the flush machinery. Note that the actual flush requests, which have no bio attached to the request still bypass the I/O schedulers. Signed-off-by: Bart Van Assche <bvanassche@acm.org> [hch: rebased] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230519044050.107790-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-05-19 19:52:29 -06:00
Christoph Hellwig	360f264834	blk-mq: defer to the normal submission path for non-flush flush commands If blk_insert_flush decides that a command does not need to use the flush state machine, return false and let blk_mq_submit_bio handle it the normal way (including using an I/O scheduler) instead of doing a bypass insert. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230519044050.107790-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-05-19 19:52:29 -06:00
Christoph Hellwig	c1075e548c	blk-mq: reflow blk_insert_flush Use a switch statement to decide on the disposition of a flush request instead of multiple if statements, out of which one does checks that are more complex than required. Also warn on a malformed request early on instead of doing a BUG_ON later. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20230519044050.107790-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-05-19 19:52:29 -06:00
Christoph Hellwig	0b573692f1	blk-mq: factor out a blk_rq_init_flush helper Factor out a helper from blk_insert_flush that initializes the flush machine related fields in struct request, and don't bother with the full memset as there's just a few fields to initialize, and all but one already have explicit initializers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20230519044050.107790-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-05-19 19:52:29 -06:00
Bart Van Assche	3e49c1e4a6	block: BFQ: Add several invariant checks If anything goes wrong with the counters that track the number of requests, I/O locks up. Make such scenarios easier to debug by adding invariant checks for the request counters. Additionally, check that BFQ queues are empty before these are freed. Cc: Jan Kara <jack@suse.cz> Cc: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20230516223853.1385255-1-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-05-18 20:20:43 -06:00
Bart Van Assche	a036e698c2	block: mq-deadline: Fix handling of at-head zoned writes Before dispatching a zoned write from the FIFO list, check whether there are any zoned writes in the RB-tree with a lower LBA for the same zone. This patch ensures that zoned writes happen in order even if at_head is set for some writes for a zone and not for others. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20230517174230.897144-12-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-05-18 19:47:49 -06:00
Bart Van Assche	0effb390c4	block: mq-deadline: Handle requeued requests correctly Start dispatching from the start of a zone instead of from the starting position of the most recently dispatched request. If a zoned write is requeued with an LBA that is lower than already inserted zoned writes, make sure that it is submitted first. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Cc: Damien Le Moal <dlemoal@kernel.org> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230517174230.897144-11-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-05-18 19:47:49 -06:00
Bart Van Assche	83c46ed675	block: mq-deadline: Track the dispatch position Track the position (sector_t) of the most recently dispatched request instead of tracking a pointer to the next request to dispatch. This patch is the basis for patch "Handle requeued requests correctly". Without this patch it would be significantly more complicated to make sure that zoned writes are dispatched in LBA order per zone. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20230517174230.897144-10-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-05-18 19:47:49 -06:00
Bart Van Assche	b2097bd24b	block: mq-deadline: Reduce lock contention blk_mq_free_requests() calls dd_finish_request() indirectly. Prevent nested locking of dd->lock and dd->zone_lock by moving the code for freeing requests. Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20230517174230.897144-9-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-05-18 19:47:49 -06:00
Bart Van Assche	3b463cbea9	block: mq-deadline: Simplify deadline_skip_seq_writes() Make the deadline_skip_seq_writes() code shorter without changing its functionality. Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20230517174230.897144-8-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-05-18 19:47:49 -06:00
Bart Van Assche	e0d85cde95	block: mq-deadline: Clean up deadline_check_fifo() Change the return type of deadline_check_fifo() from 'int' into 'bool'. Use time_is_before_eq_jiffies() instead of time_after_eq(). No functionality has been changed. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Cc: Damien Le Moal <dlemoal@kernel.org> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230517174230.897144-7-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-05-18 19:47:49 -06:00
Bart Van Assche	19821fee3e	block: Introduce blk_rq_is_seq_zoned_write() Introduce the function blk_rq_is_seq_zoned_write(). This function will be used in later patches to preserve the order of zoned writes that require write serialization. This patch includes an optimization: instead of using rq->q->disk->part0->bd_queue to check whether or not the queue is associated with a zoned block device, use rq->q->disk->queue. Cc: Christoph Hellwig <hch@lst.de> Cc: Damien Le Moal <dlemoal@kernel.org> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230517174230.897144-6-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-05-18 19:47:49 -06:00
Bart Van Assche	4f51644ccf	block: Simplify blk_req_needs_zone_write_lock() Remove the blk_rq_is_passthrough() check because it is redundant: blk_req_needs_zone_write_lock() also calls bdev_op_is_zoned_write() and the latter function returns false for pass-through requests. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20230517174230.897144-3-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-05-18 19:47:49 -06:00
Bart Van Assche	45b46b6f15	block: mq-deadline: Add a word in a source code comment Add the missing word "and". Cc: Damien Le Moal <dlemoal@kernel.org> Suggested-by: Damien Le Moal <dlemoal@kernel.org> Fixes: `945ffb60c1` ("mq-deadline: add blk-mq adaptation of the deadline IO scheduler") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230517174230.897144-2-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-05-18 19:47:49 -06:00
Christoph Hellwig	dd6216bb16	blk-mq: make sure elevator callbacks aren't called for passthrough request In case of q->elevator, passthrough request can still be marked as RQF_ELV, so some elevator callbacks will be called for them. Fix this by splitting RQF_SCHED_TAGS, which is set for all requests that are issued on a queue that uses an I/O scheduler, and RQF_USE_SCHED for non-flush, non-passthrough requests on such a queue. Roughly based on two different patches from Ming Lei <ming.lei@redhat.com>. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230518053101.760632-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-05-18 19:42:54 -06:00
Christoph Hellwig	fdcab6cdde	blk-mq: remove RQF_ELVPRIV RQF_ELVPRIV is set for all non-flush requests that have RQF_ELV set. Expand this condition in the two users of the flag and remove it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20230518053101.760632-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-05-18 19:41:20 -06:00
Ming Lei	d97217e7f0	blk-mq: don't queue plugged passthrough requests into scheduler Passthrough requests should never be queued to the I/O scheduler, as scheduling these opaque requests doesn't make sense, and I/O schedulers might require req->bio to be always valid. We never let passthrough requests insert into the scheduler before commit `1c2d2fff6d` ("block: wire-up support for passthrough plugging"), restore this behavior even for passthrough requests issued under a plug. [hch: use blk_mq_insert_requests for passthrough requests, fix up the commit message and comments] Reported-by: Guangwu Zhang <guazhang@redhat.com> Closes: https://lore.kernel.org/linux-block/CAGS2=YosaYaUTEMU3uaf+y=8MqSrhL7sYsJn8EwbaM=76p_4Qg@mail.gmail.com/ Investigated-by: Yu Kuai <yukuai1@huaweicloud.com> Fixes: `1c2d2fff6d` ("block: wire-up support for passthrough plugging") Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230518053101.760632-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-05-18 19:40:09 -06:00
Bart Van Assche	d5fb8726f1	block: Decode all flag names in the debugfs output See also: * Commit `4d337cebcb` ("blk-mq: avoid to touch q->elevator without any protection"). * Commit `414dd48e88` ("blk-mq: add tagset quiesce interface"). Cc: Christoph Hellwig <hch@lst.de> Cc: Damien Le Moal <dlemoal@kernel.org> Cc: Ming Lei <ming.lei@redhat.com> Cc: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230518222708.1190867-1-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-05-18 19:34:31 -06:00
Jens Axboe	e9833d8701	block: mark bdev files as FMODE_NOWAIT if underlying device supports it We set this unconditionally, but it really should be dependent on if the underlying device is nowait compliant. Cc: linux-block@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230509151910.183637-3-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-05-15 10:12:27 -06:00
Linus Torvalds	a3b111b046	for-6.4/block-2023-05-06 Merge tag 'for-6.4/block-2023-05-06' of git://git.kernel.dk/linux Pull more block updates from Jens Axboe: - MD pull request via Song: - Improve raid5 sequential IO performance on spinning disks, which fixes a regression since v6.0 (Jan Kara) - Fix bitmap offset types, which fixes an issue introduced in this merge window (Jonathan Derrick) - Cleanup of hweight type used for cgroup writeback (Maxim) - Fix a regression with the "has_submit_bio" changes across partitions (Ming) - Cleanup of QUEUE_FLAG_ADD_RANDOM clearing. We used to set this flag on non blk-mq queues, and hence some drivers clear it unconditionally. Since all of these have since been converted to true blk-mq drivers, drop the useless clear as the bit is not set (Chaitanya) - Fix the flags being set in a bio for a flush for drbd (Christoph) - Cleanup and deduplication of the code handling setting block device capacity (Damien) - Fix for ublk handling IO timeouts (Ming) - Fix for a regression in blk-cgroup teardown (Tao) - NBD documentation and code fixes (Eric) - Convert blk-integrity to using device_attributes rather than a second kobject to manage lifetimes (Thomas) * tag 'for-6.4/block-2023-05-06' of git://git.kernel.dk/linux: ublk: add timeout handler drbd: correctly submit flush bio on barrier mailmap: add mailmap entries for Jens Axboe block: Skip destroyed blkg when restart in blkg_destroy_all() writeback: fix call of incorrect macro md: Fix bitmap offset type in sb writer md/raid5: Improve performance for sequential IO docs nbd: userspace NBD now favors github over sourceforge block nbd: use req.cookie instead of req.handle uapi nbd: add cookie alias to handle uapi nbd: improve doc links to userspace spec blk-integrity: register sysfs attributes on struct device blk-integrity: convert to struct device_attribute blk-integrity: use sysfs_emit block/drivers: remove dead clear of random flag block: sync part's ->bd_has_submit_bio with disk's block: Cleanup set_capacity()/bdev_set_nr_sectors()	2023-05-06 08:28:58 -07:00
Tao Su	8176080d59	block: Skip destroyed blkg when restart in blkg_destroy_all() The kernel hangs in blkg_destroy_all() when the total number of blkgs is greater than BLKG_DESTROY_BATCH_SIZE, because destroyed blkgs are not removed from blkg_list. The size of blkg_list is therefore unchanged after destroying a batch of blkgs, and the 'restart' loops forever. Since a blkg should stay on the queue list until blkg_free_workfn(), skip destroyed blkgs when restarting a new round, which resolves the hang while preserving the original intent of the restart. Reported-by: Xiangfei Ma <xiangfeix.ma@intel.com> Tested-by: Xiangfei Ma <xiangfeix.ma@intel.com> Tested-by: Farrah Chen <farrah.chen@intel.com> Signed-off-by: Tao Su <tao1.su@linux.intel.com> Fixes: `f1c006f1c6` ("blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()") Suggested-and-reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20230428045149.1310073-1-tao1.su@linux.intel.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-04-28 11:23:58 -06:00
Linus Torvalds	556eb8b791	Driver core changes for 6.4-rc1 Here is the large set of driver core changes for 6.4-rc1. Once again, a busy development cycle, with lots of changes happening in the driver core in the quest to be able to move "struct bus" and "struct class" into read-only memory, a task now complete with these changes. This will make the future Rust interactions with the driver core more "provably correct" as well as providing more obvious lifetime rules for all busses and classes in the kernel. The changes required for this did touch many individual classes and busses as many callbacks were changed to take const * parameters instead. All of these changes have been submitted to the various subsystem maintainers, giving them plenty of time to review, and most of them actually did so. Other than those changes, included in here are a small set of other things: - kobject logging improvements - cacheinfo improvements and updates - obligatory fw_devlink updates and fixes - documentation updates - device property cleanups and const * changes - firmware loader dependency fixes. All of these have been in linux-next for a while with no reported problems. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Merge tag 'driver-core-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updates from Greg KH: "Here is the large set of driver core changes for 6.4-rc1. Once again, a busy development cycle, with lots of changes happening in the driver core in the quest to be able to move "struct bus" and "struct class" into read-only memory, a task now complete with these changes. This will make the future Rust interactions with the driver core more "provably correct" as well as providing more obvious lifetime rules for all busses and classes in the kernel. The changes required for this did touch many individual classes and busses as many callbacks were changed to take const * parameters instead. All of these changes have been submitted to the various subsystem maintainers, giving them plenty of time to review, and most of them actually did so. Other than those changes, included in here are a small set of other things: - kobject logging improvements - cacheinfo improvements and updates - obligatory fw_devlink updates and fixes - documentation updates - device property cleanups and const * changes - firmware loader dependency fixes. All of these have been in linux-next for a while with no reported problems" * tag 'driver-core-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (120 commits) device property: make device_property functions take const device * driver core: update comments in device_rename() driver core: Don't require dynamic_debug for initcall_debug probe timing firmware_loader: rework crypto dependencies firmware_loader: Strip off \n from customized path zram: fix up permission for the hot_add sysfs file cacheinfo: Add use_arch[\|_cache]_info field/function arch_topology: Remove early cacheinfo error message if -ENOENT cacheinfo: Check cache properties are present in DT cacheinfo: Check sib_leaf in cache_leaves_are_shared() cacheinfo: Allow early level detection when DT/ACPI info is missing/broken cacheinfo: Add arm64 early level initializer implementation cacheinfo: Add arch specific early level initializer tty: make tty_class a static const structure driver core: class: remove struct class_interface * from callbacks driver core: class: mark the struct class in struct class_interface constant driver core: class: make class_register() take a const * driver core: class: mark class_release() as taking a const * driver core: remove incorrect comment for device_create* MIPS: vpe-cmp: remove module owner pointer from struct class usage. ...	2023-04-27 11:53:57 -07:00
Thomas Weißschuh	ff53cd52d9	blk-integrity: register sysfs attributes on struct device The "integrity" kobject only acted as a holder for static sysfs entries. It also was embedded into struct gendisk without managing it, violating assumptions of the driver core. Instead register the sysfs entries directly onto the struct device. Also drop the now unused member integrity_kobj from struct gendisk. Suggested-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20230309-kobj_release-gendisk_integrity-v3-3-ceccb4493c46@weissschuh.net Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-04-26 18:22:50 -06:00
Thomas Weißschuh	76b8c319f0	blk-integrity: convert to struct device_attribute An upcoming patch will register the integrity attributes directly with the struct device kobject. For this the attributes have to be implemented in terms of struct device_attribute. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20230309-kobj_release-gendisk_integrity-v3-2-ceccb4493c46@weissschuh.net Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-04-26 18:22:50 -06:00
Thomas Weißschuh	3315e169b4	blk-integrity: use sysfs_emit The correct way to emit data into sysfs is via sysfs_emit(), use it. Also perform some trivial syntactic cleanups. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20230309-kobj_release-gendisk_integrity-v3-1-ceccb4493c46@weissschuh.net Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-04-26 18:22:50 -06:00
Linus Torvalds	9dd6956b38	for-6.4/block-2023-04-21 Merge tag 'for-6.4/block-2023-04-21' of git://git.kernel.dk/linux Pull block updates from Jens Axboe: - drbd patches, bringing us closer to unifying the out-of-tree version and the in tree one (Andreas, Christoph) - support for auto-quiesce for the s390 dasd driver (Stefan) - MD pull request via Song: - md/bitmap: Optimal last page size (Jon Derrick) - Various raid10 fixes (Yu Kuai, Li Nan) - md: add error_handlers for raid0 and linear (Mariusz Tkaczyk) - NVMe pull request via Christoph: - Drop redundant pci_enable_pcie_error_reporting (Bjorn Helgaas) - Validate nvmet module parameters (Chaitanya Kulkarni) - Fence TCP socket on receive error (Chris Leech) - Fix async event trace event (Keith Busch) - Minor cleanups (Chaitanya Kulkarni, zhenwei pi) - Fix and cleanup nvmet Identify handling (Damien Le Moal, Christoph Hellwig) - Fix double blk_mq_complete_request race in the timeout handler (Lei Yin) - Fix irq locking in nvme-fcloop (Ming Lei) - Remove queue mapping helper for rdma devices (Sagi Grimberg) - use structured request attribute checks for nbd (Jakub) - fix blk-crypto race conditions between keyslot management (Eric) - add sed-opal support for reading read locking range attributes (Ondrej) - make fault injection configurable for null_blk (Akinobu) - clean up the request insertion API (Christoph) - clean up the queue running API (Christoph) - blkg config helper cleanups (Tejun) - lazy init support for blk-iolatency (Tejun) - various fixes and tweaks to ublk (Ming) - remove hybrid polling. It hasn't really been useful since we got async polled IO support, and these days we don't support sync polled IO at all (Keith) - misc fixes, cleanups, improvements (Zhong, Ondrej, Colin, Chengming, Chaitanya, me) * tag 'for-6.4/block-2023-04-21' of git://git.kernel.dk/linux: (118 commits) nbd: fix incomplete validation of ioctl arg ublk: don't return 0 in case of any failure sed-opal: geometry feature reporting command null_blk: Always check queue mode setting from configfs block: ublk: switch to ioctl command encoding blk-mq: fix the blk_mq_add_to_requeue_list call in blk_kick_flush block, bfq: Fix division by zero error on zero wsum fault-inject: fix build error when FAULT_INJECTION_CONFIGFS=y and CONFIGFS_FS=m block: store bdev->bd_disk->fops->submit_bio state in bdev block: re-arrange the struct block_device fields for better layout md/raid5: remove unused working_disks variable md/raid10: don't call bio_start_io_acct twice for bio which experienced read error md/raid10: fix memleak of md thread md/raid10: fix memleak for 'conf->bio_split' md/raid10: fix leak of 'r10bio->remaining' for recovery md/raid10: don't BUG_ON() in raise_barrier() md: fix soft lockup in status_resync md: add error_handlers for raid0 and linear md: Use optimal I/O size for last bitmap page md: Fix types in sb writer ...	2023-04-26 12:52:58 -07:00
Linus Torvalds	85d7ab2463	for-6.4-tag Merge tag 'for-6.4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "Mostly core changes and cleanups, some notable fixes and two performance improvements in directory logging. The IO path cleanups are removing or refactoring old code, scrub main loop has been completely rewritten also refactoring old code. There are some changes to non-btrfs code, mostly trivial, the cgroup punt bio logic is only moved from generic code. Performance improvements: - improve logging changes in a directory during one transaction, avoid iterating over items and reduce lock contention (fsync time 4x lower) - when logging directory entries during one transaction, reduce locking of subvolume trees by checking tree-log instead (improvement in throughput and latency for concurrent access to a subvolume) Notable fixes: - dev-replace: - properly honor read mode when requested to avoid reading from source device - target device won't be used for eventual read repair, this is unreliable for NODATASUM files - when there are unpaired (and unrepairable) metadata during replace, exit early with error and don't try to finish whole operation - scrub ioctl properly rejects unknown flags - fix global block reserve calculations - fix partial direct io write when there's a page fault in the middle, iomap will try to continue with partial request but the btrfs part did not match that, this can lead to zeros written instead of data Core changes: - io path: - continued cleanups and refactoring around bio handling - extent io submit path simplifications and cleanups - flush write path simplifications and cleanups - rework logic of passing sync mode of bio, with further cleanups - rewrite scrub code flow, restructure how the stripes are enumerated and verified in a more unified way - allow to set lower threshold for block group reclaim in debug mode to aid zoned mode testing - remove obsolete time-based delayed ref throttling logic when truncating items - DREW locks are not using percpu variables anymore - more warning fixes (-Wmaybe-uninitialized) - u64 division simplifications - error handling improvements Non-btrfs code changes: - push cgroup punt bio logic to btrfs code (there was no other user of that), the functionality can be now selected separately by BLK_CGROUP_PUNT_BIO - crc32c_impl removed after removing last uses in btrfs code - add btrfs_assertfail() to objtool table" * tag 'for-6.4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (147 commits) btrfs: mark btrfs_assertfail() __noreturn btrfs: fix uninitialized variable warnings btrfs: use log root when iterating over index keys when logging directory btrfs: avoid iterating over all indexes when logging directory btrfs: dev-replace: error out if we have unrepaired metadata error during btrfs: remove pointless loop at btrfs_get_next_valid_item() btrfs: scrub: reject unsupported scrub flags btrfs: reinterpret async discard iops_limit=0 as no delay btrfs: set default discard iops_limit to 1000 btrfs: remove unused raid56 functions which were dedicated for scrub btrfs: scrub: remove scrub_bio structure btrfs: scrub: remove scrub_block and scrub_sector structures btrfs: scrub: remove the old scrub recheck code btrfs: scrub: remove the old writeback infrastructure btrfs: scrub: remove scrub_parity structure btrfs: scrub: use scrub_stripe to implement RAID56 P/Q scrub btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure btrfs: scrub: introduce helper to queue a stripe for scrub btrfs: scrub: introduce error reporting functionality for scrub_stripe btrfs: scrub: introduce a writeback helper for scrub_stripe ...	2023-04-26 09:13:44 -07:00
Linus Torvalds	0cfcde1faf	There are a number of major cleanups in ext4 this cycle: * The data=journal writepath has been significantly cleaned up and simplified, and reduces a large number of data=journal special cases by Jan Kara. * Ojaswin Mujoo has replaced the linked list used to track extents that have been used for inode preallocation with a red-black tree in the multi-block allocator. This improves performance for workloads which do a large number of random allocating writes. * Thanks to Kemeng Shi for a lot of cleanup and bug fixes in the multi-block allocator. * Matthew Wilcox has converted the code paths for reading and writing ext4 pages to use folios. * Jason Yan has continued to factor out ext4_fill_super() into smaller functions to improve ease of maintenance and comprehension. * Josh Triplett has created an uapi header for ext4 userspace APIs. Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "There are a number of major cleanups in ext4 this cycle: - The data=journal writepath has been significantly cleaned up and simplified, and reduces a large number of data=journal special cases by Jan Kara. - Ojaswin Mujoo has replaced the linked list used to track extents that have been used for inode preallocation with a red-black tree in the multi-block allocator. This improves performance for workloads which do a large number of random allocating writes. - Thanks to Kemeng Shi for a lot of cleanup and bug fixes in the multi-block allocator. - Matthew Wilcox has converted the code paths for reading and writing ext4 pages to use folios. - Jason Yan has continued to factor out ext4_fill_super() into smaller functions to improve ease of maintenance and comprehension. - Josh Triplett has created an uapi header for ext4 userspace APIs" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (105 commits) ext4: Add a uapi header for ext4 userspace APIs ext4: remove useless conditional branch code ext4: remove unneeded check of nr_to_submit ext4: move dax and encrypt checking into ext4_check_feature_compatibility() ext4: factor out ext4_block_group_meta_init() ext4: move s_reserved_gdt_blocks and addressable checking into ext4_check_geometry() ext4: rename two functions with 'check' ext4: factor out ext4_flex_groups_free() ext4: use ext4_group_desc_free() in ext4_put_super() to save some duplicated code ext4: factor out ext4_percpu_param_init() and ext4_percpu_param_destroy() ext4: factor out ext4_hash_info_init() Revert "ext4: Fix warnings when freezing filesystem with journaled data" ext4: Update comment in mpage_prepare_extent_to_map() ext4: Simplify handling of journalled data in ext4_bmap() ext4: Drop special handling of journalled data from ext4_quota_on() ext4: Drop special handling of journalled data from ext4_evict_inode() ext4: Fix special handling of journalled data from extent zeroing ext4: Drop special handling of journalled data from extent shifting operations ext4: Drop special handling of journalled data from ext4_sync_file() ext4: Commit transaction before writing back pages in data=journal mode ...	2023-04-26 08:57:41 -07:00
Ming Lei	38c8e3dfb2	block: sync part's ->bd_has_submit_bio with disk's submit_bio() always uses bio->bi_bdev->bd_has_submit_bio to decide if disk's ->submit_bio() is called, and bio->bi_bdev could point to one partition device. So we have to sync part bdev's ->bd_has_submit_bio with disk's. Reported-by: Changhui Zhong <czhong@redhat.com> Link: https://lore.kernel.org/linux-block/ZEdItaPqif8fp85H@ovpn-8-24.pek2.redhat.com/T/#t Fixes: `9f4107b07b` ("block: store bdev->bd_disk->fops->submit_bio state in bdev") Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230425034154.110099-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-04-25 07:36:02 -06:00
Linus Torvalds	b9dff2195f	iter-ubuf.2-2023-04-21 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmRCvdsQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpg4oD/457EJ21Fm36NuyT/S0Cr8ok9Tdk7t9BeBh V/9CYThoXr5aqAox0Vq23FF+Rhzm81GzwYERN4493LBblliNeNOo2IaXF9/7qrUW 11v9Bkug2J3k3hRGtEa6Zl0EpMu+FRLsNpchjFS2KPuOq+iMDxrvwuy50kidWg7n r25e4UwpExVO9fIoUSmzgWVfRHOTuj9yiG/UsaH2+2BRXerIX0Q1tyElwmcGh25M Ad2hN+yDnuIbNA5gNUpnzY32Dp0zjAsquc//QOvq9mltcNTElokB8idGliismvyd 8qF0lkwQwewOBT/sSD5EY3K0Qd8IJu425bvT/yPUDScHz1chxHUoxo5eisIr2M9l 5AL5KHAf7Zzs8ZuV+IYPzZ5qM6a/vF3mHUisKRNKYVhF46Nmd4cBratfXwWb1MxV clQM2qr0TLOYli9mOeTXph3hg/rBVqKqf90boAZoN8b2tWBKlMykpqRadbepjrgx bmBSwwAF99NxIHEjU3U5DMdUloCSiMZIfMfDxQrPNDrfWAW4xJs5Ym0VeOjEotTt oFEs1fr6c3Mn7KEuPPfOtnDxvs51IP/B8+gDgMt/edf+wHiCU1Zm31u2gxt2dsKh g73Y92i5SHjIf36H5szBTeioyMy1E1VA9HF14xWz2eKdQ+wxQ9VNWoctcJ85k3F4 6AZDYRIrWA== =EaE9 -----END PGP SIGNATURE----- Merge tag 'iter-ubuf.2-2023-04-21' of git://git.kernel.dk/linux Pull ITER_UBUF updates from Jens Axboe: "This turns singe vector imports into ITER_UBUF, rather than ITER_IOVEC. The former is more trivial to iterate and advance, and hence a bit more efficient. From some very unscientific testing, ~60% of all iovec imports are single vector" * tag 'iter-ubuf.2-2023-04-21' of git://git.kernel.dk/linux: iov_iter: Mark copy_compat_iovec_from_user() noinline iov_iter: import single vector iovecs as ITER_UBUF iov_iter: convert import_single_range() to ITER_UBUF iov_iter: overlay struct iovec and ubuf/len iov_iter: set nr_segs = 1 for ITER_UBUF iov_iter: remove iov_iter_iovec() iov_iter: add iter_iov_addr() and iter_iov_len() helpers ALSA: pcm: check for user backed iterator, not specific iterator type IB/qib: check for user backed iterator, not specific iterator type IB/hfi1: check for user backed iterator, not specific iterator type iov_iter: add iter_iovec() helper block: ensure bio_alloc_map_data() deals with ITER_UBUF correctly	2023-04-24 10:29:28 -07:00
Damien Le Moal	83794367dc	block: Cleanup set_capacity()/bdev_set_nr_sectors() The code for setting a block device capacity (bd_nr_sectors field of struct block_device) is duplicated in set_capacity() and bdev_set_nr_sectors(). Clean this up by making bdev_set_nr_sectors() a block layer internal function defined in block/bdev.c instead of having this function statically defined in block/partitions/core.c. With this change, set_capacity() implementation can be simplified to only calling bdev_set_nr_sectors(). Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230424131318.79935-1-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-04-24 08:48:09 -06:00
Ming Lei	81ea1222f2	Revert "block: Merge bio before checking ->cached_rq" This reverts commit `23f3e3272e`. blk-mq sched bio merge still needs request to grab queue usage counter, so we can't simply call blk_mq_attempt_bio_merge() when queue usage counter isn't held. Fixes: `23f3e3272e` ("block: Merge bio before checking ->cached_rq") Cc: Xiao Ni <xni@redhat.com> Reported-by: Yi Zhang <yi.zhang@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230420112018.1108058-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-04-20 06:54:17 -06:00
Ondrej Kozina	9e05a2599a	sed-opal: geometry feature reporting command Locking range start and locking range length attributes may be required to satisfy restrictions exposed by OPAL2 geometry feature reporting. Geometry reporting feature is described in TCG OPAL SSC, section 3.1.1.4 (ALIGN, LogicalBlockSize, AlignmentGranularity and LowestAlignedLBA). 4.3.5.2.1.1 RangeStart Behavior: [ StartAlignment = (RangeStart modulo AlignmentGranularity) - LowestAlignedLBA ] When processing a Set method or CreateRow method on the Locking table for a non-Global Range row, if: a) the AlignmentRequired (ALIGN above) column in the LockingInfo table is TRUE; b) RangeStart is non-zero; and c) StartAlignment is non-zero, then the method SHALL fail and return an error status code INVALID_PARAMETER. 4.3.5.2.1.2 RangeLength Behavior: If RangeStart is zero, then [ LengthAlignment = (RangeLength modulo AlignmentGranularity) - LowestAlignedLBA ] If RangeStart is non-zero, then [ LengthAlignment = (RangeLength modulo AlignmentGranularity) ] When processing a Set method or CreateRow method on the Locking table for a non-Global Range row, if: a) the AlignmentRequired (ALIGN above) column in the LockingInfo table is TRUE; b) RangeLength is non-zero; and c) LengthAlignment is non-zero, then the method SHALL fail and return an error status code INVALID_PARAMETER. In userspace we stick to the logical block size reported by the general block device (via sysfs or ioctl), but we cannot read 'AlignmentGranularity' or 'LowestAlignedLBA' anywhere else, so we need to get those values from the sed-opal interface; otherwise we will not be able to report or avoid the locking range setup INVALID_PARAMETER errors above. 
Signed-off-by: Ondrej Kozina <okozina@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Tested-by: Milan Broz <gmazyland@gmail.com> Link: https://lore.kernel.org/r/20230411090931.9193-2-okozina@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-04-19 14:07:13 -06:00
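The RangeStart and RangeLength rules quoted in the sed-opal commit message above can be sketched as a small userspace check. This is only an illustration of the quoted formulas: `struct opal_geometry`, `range_start_ok()` and `range_length_ok()` are hypothetical names, not the kernel's sed-opal interface, and the fields simply mirror ALIGN, AlignmentGranularity and LowestAlignedLBA.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the reported geometry; not a kernel struct. */
struct opal_geometry {
	bool align_required;            /* ALIGN / AlignmentRequired */
	uint64_t alignment_granularity; /* AlignmentGranularity */
	uint64_t lowest_aligned_lba;    /* LowestAlignedLBA */
};

/*
 * StartAlignment = (RangeStart modulo AlignmentGranularity) - LowestAlignedLBA.
 * The setup is rejected when ALIGN is set, RangeStart is non-zero and
 * StartAlignment is non-zero. Comparing for equality avoids unsigned underflow.
 */
bool range_start_ok(const struct opal_geometry *g, uint64_t range_start)
{
	assert(g->alignment_granularity != 0);
	if (!g->align_required || range_start == 0)
		return true;
	return range_start % g->alignment_granularity == g->lowest_aligned_lba;
}

/*
 * LengthAlignment subtracts LowestAlignedLBA only when RangeStart is zero.
 * The setup is rejected when ALIGN is set, RangeLength is non-zero and
 * LengthAlignment is non-zero.
 */
bool range_length_ok(const struct opal_geometry *g, uint64_t range_start,
		     uint64_t range_length)
{
	uint64_t rem;

	assert(g->alignment_granularity != 0);
	if (!g->align_required || range_length == 0)
		return true;
	rem = range_length % g->alignment_granularity;
	if (range_start == 0)
		return rem == g->lowest_aligned_lba;
	return rem == 0;
}
```

A tool could run such checks before issuing a locking range setup, so that an INVALID_PARAMETER failure is predicted rather than discovered.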
Christoph Hellwig	2c275afeb6	block: make blkcg_punt_bio_submit optional Guard all the code to punt bios to a per-cgroup submission helper by a new CONFIG_BLK_CGROUP_PUNT_BIO symbol that is selected by btrfs. This way non-btrfs kernel builds don't need to have this code. Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-04-17 18:01:22 +02:00
Christoph Hellwig	12be09fe18	block: async_bio_lock does not need to be bh-safe async_bio_lock is only taken from bio submission and workqueue context, both are never in bottom halves. Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>	2023-04-17 18:01:22 +02:00
Christoph Hellwig	3480373ebd	btrfs, block: move REQ_CGROUP_PUNT to btrfs REQ_CGROUP_PUNT is a bit annoying as it is hard to follow and adds a branch to the bio submission hot path. To fix this, export blkcg_punt_bio_submit and let btrfs call it directly. Add a new REQ_FS_PRIVATE flag for btrfs to indicate to its own low-level bio submission code that a punt to the cgroup submission helper is required. Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-04-17 18:01:22 +02:00
Christoph Hellwig	26a42b614e	blk-mq: fix the blk_mq_add_to_requeue_list call in blk_kick_flush Commit `b12e5c6c75` accidentally changes blk_kick_flush to do a head insert into the requeue list, fix this up. Fixes: `b12e5c6c75` ("blk-mq: pass a flags argument to blk_mq_add_to_requeue_list") Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230416073553.966161-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-04-16 13:01:43 -06:00
Colin Ian King	e53413f8de	block, bfq: Fix division by zero error on zero wsum When the weighted sum is zero the calculation of limit causes a division by zero error. Fix this by continuing to the next level. This was discovered by running as root: stress-ng --ioprio 0 Fixes divison by error oops: [ 521.450556] divide error: 0000 [#1] SMP NOPTI [ 521.450766] CPU: 2 PID: 2684464 Comm: stress-ng-iopri Not tainted 6.2.1-1280.native #1 [ 521.451117] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.1-0-g3208b098f51a-prebuilt.qemu.org 04/01/2014 [ 521.451627] RIP: 0010:bfqq_request_over_limit+0x207/0x400 [ 521.451875] Code: 01 48 8d 0c c8 74 0b 48 8b 82 98 00 00 00 48 8d 0c c8 8b 85 34 ff ff ff 48 89 ca 41 0f af 41 50 48 d1 ea 48 98 48 01 d0 31 d2 <48> f7 f1 41 39 41 48 89 85 34 ff ff ff 0f 8c 7b 01 00 00 49 8b 44 [ 521.452699] RSP: 0018:ffffb1af84eb3948 EFLAGS: 00010046 [ 521.452938] RAX: 000000000000003c RBX: 0000000000000000 RCX: 0000000000000000 [ 521.453262] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffb1af84eb3978 [ 521.453584] RBP: ffffb1af84eb3a30 R08: 0000000000000001 R09: ffff8f88ab8a4ba0 [ 521.453905] R10: 0000000000000000 R11: 0000000000000001 R12: ffff8f88ab8a4b18 [ 521.454224] R13: ffff8f8699093000 R14: 0000000000000001 R15: ffffb1af84eb3970 [ 521.454549] FS: 00005640b6b0b580(0000) GS:ffff8f88b3880000(0000) knlGS:0000000000000000 [ 521.454912] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 521.455170] CR2: 00007ffcbcae4e38 CR3: 00000002e46de001 CR4: 0000000000770ee0 [ 521.455491] PKRU: 55555554 [ 521.455619] Call Trace: [ 521.455736] <TASK> [ 521.455837] ? bfq_request_merge+0x3a/0xc0 [ 521.456027] ? 
elv_merge+0x115/0x140 [ 521.456191] bfq_limit_depth+0xc8/0x240 [ 521.456366] __blk_mq_alloc_requests+0x21a/0x2c0 [ 521.456577] blk_mq_submit_bio+0x23c/0x6c0 [ 521.456766] __submit_bio+0xb8/0x140 [ 521.457236] submit_bio_noacct_nocheck+0x212/0x300 [ 521.457748] submit_bio_noacct+0x1a6/0x580 [ 521.458220] submit_bio+0x43/0x80 [ 521.458660] ext4_io_submit+0x23/0x80 [ 521.459116] ext4_do_writepages+0x40a/0xd00 [ 521.459596] ext4_writepages+0x65/0x100 [ 521.460050] do_writepages+0xb7/0x1c0 [ 521.460492] __filemap_fdatawrite_range+0xa6/0x100 [ 521.460979] file_write_and_wait_range+0xbf/0x140 [ 521.461452] ext4_sync_file+0x105/0x340 [ 521.461882] __x64_sys_fsync+0x67/0x100 [ 521.462305] ? syscall_exit_to_user_mode+0x2c/0x1c0 [ 521.462768] do_syscall_64+0x3b/0xc0 [ 521.463165] entry_SYSCALL_64_after_hwframe+0x5a/0xc4 [ 521.463621] RIP: 0033:0x5640b6c56590 [ 521.464006] Code: 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 80 3d 71 70 0e 00 00 74 17 b8 4a 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 48 c3 0f 1f 80 00 00 00 00 48 83 ec 18 89 7c Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Link: https://lore.kernel.org/r/20230413133009.1605335-1-colin.i.king@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-04-16 13:01:43 -06:00
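The bfq fix described above (skip a level when its weighted sum is zero instead of dividing by it) can be sketched in plain C. This is a simplified illustration and not the bfq code: `struct toy_level` and `scale_limit()` are hypothetical names, and the rounded division is an assumption about how the limit is scaled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-level data: an entity weight and the level's weighted sum. */
struct toy_level {
	unsigned int weight;
	unsigned int wsum;
};

/*
 * Scale limit by weight / wsum at every level. A level whose weighted sum
 * is zero is skipped, which is the guard that avoids the divide error.
 */
unsigned int scale_limit(unsigned int limit, const struct toy_level *levels,
			 size_t n)
{
	size_t i;

	assert(levels != NULL || n == 0);
	for (i = 0; i < n; i++) {
		unsigned int wsum = levels[i].wsum;

		if (wsum == 0)
			continue; /* nothing to divide by, move to the next level */

		/* Rounded division, done in 64 bits to avoid overflow. */
		limit = (unsigned int)(((unsigned long long)limit *
					levels[i].weight + wsum / 2) / wsum);
	}
	return limit;
}
```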
Jens Axboe	9f4107b07b	block: store bdev->bd_disk->fops->submit_bio state in bdev We have a long chain of memory dereferencing just to check whether or not this disk has a special submit_bio helper. As that's not necessarily the common case, add a bd_has_submit_bio state in the bdev to avoid traversing this memory dependency chain if we don't need to. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-04-16 13:01:42 -06:00
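The idea in the commit above, caching the answer to "does this disk have its own submit_bio?" next to the block device so the hot path avoids a pointer chain, can be sketched with toy types. All `toy_*` names are hypothetical stand-ins for the kernel structures, and only the `has_submit_bio` field mirrors the `bd_has_submit_bio` state named in the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_bio;

struct toy_fops {
	void (*submit_bio)(struct toy_bio *bio); /* NULL if the disk has none */
};

struct toy_disk {
	const struct toy_fops *fops;
};

struct toy_bdev {
	struct toy_disk *disk;
	bool has_submit_bio; /* cached copy of disk->fops->submit_bio != NULL */
};

/* Placeholder driver hook, used only to populate a toy_fops. */
void toy_noop_submit(struct toy_bio *bio)
{
	(void)bio;
}

/* Follow the pointer chain once, at setup time, and remember the result. */
void toy_bdev_init(struct toy_bdev *bdev, struct toy_disk *disk)
{
	assert(disk != NULL && disk->fops != NULL);
	bdev->disk = disk;
	bdev->has_submit_bio = disk->fops->submit_bio != NULL;
}

/* The hot path then reads one flag instead of bdev->disk->fops->submit_bio. */
bool toy_uses_driver_submit(const struct toy_bdev *bdev)
{
	return bdev->has_submit_bio;
}
```

Every device object that can be reached from a bio has to carry the cached flag, otherwise the cached and real state can disagree.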
Jens Axboe	d2a1d45ced	nvme updates for Linux 6.4 - drop redundant pci_enable_pcie_error_reporting (Bjorn Helgaas) - validate nvmet module parameters (Chaitanya Kulkarni) - fence TCP socket on receive error (Chris Leech) - fix async event trace event (Keith Busch) - minor cleanups (Chaitanya Kulkarni, zhenwei pi) - fix and cleanup nvmet Identify handling (Damien Le Moal, Christoph Hellwig) - fix double blk_mq_complete_request race in the timeout handler (Lei Yin) - fix irq locking in nvme-fcloop (Ming Lei) - remove queue mapping helper for rdma devices (Sagi Grimberg) -----BEGIN PGP SIGNATURE----- iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAmQ44sALHGhjaEBsc3Qu ZGUACgkQD55TZVIEUYMH7hAAxN15Sw3pkk1UBpDQDXjAqzftc1nJ/wBZabsQ9k3s Qoye1TRAnv/qt78wTzl660sj/tNnz8vMXSb9Wh5Vi+y+tNB3IN7JnVDpng9M6bsH /RNxLmUTiaa7sT/IhqU7dq7kxHL1aFWawsQwnnGQnXYOjc3RC/Hf25f59WKRDQic kAjsE55F6fpn/ry+DU8Ia8IPq22IUk56JONO01LpxGrfRgNC4P4hkpQJk7n2CFkd xBKntuCLDiLzRS5RVH8KcNOhhx/L6JRvl1xwkc/CRWt/DvGHfhbnTZ9e4Vn30XF4 3aCpBQu+CiNJPcpdiOD0CH0iOAio0o0klbOLmlo5Bg19Cw+ALqPIZrHU+UivJxw4 U1I4mkmB3ydHQlurVm4KemRih9PT/rw2cgTwogyhfNGw9rKjV/F2Exs6HFHIpP8X SgvomWXFSJ5saYswMoNIYvJHz+CISbq+XsLv0iBCAS7U3ZCqw4U5VkKLHH4hIYXG wjyGdGNwPE6JghCtHVkS4ZwSqkAwAaOWqdX3E4CzHYN6zn9nkPLurcwgfksgrnPP Z/Nzfz3Wwh7NzZlUyyFjUB4Iu80Up5zZZiz0ZQC+QiLVvy89weNPpnpN1vkd8dex hRKa2D0cfUyhpYzZssa/6CTHGOLYgpymUYGNitZtf0LKyhwgBSLOwcfk8XLxrFru U7E= =wI74 -----END PGP SIGNATURE----- Merge tag 'nvme-6.4-2023-04-14' of git://git.infradead.org/nvme into for-6.4/block Pull NVMe updates from Christoph: "nvme updates for Linux 6.4 - drop redundant pci_enable_pcie_error_reporting (Bjorn Helgaas) - validate nvmet module parameters (Chaitanya Kulkarni) - fence TCP socket on receive error (Chris Leech) - fix async event trace event (Keith Busch) - minor cleanups (Chaitanya Kulkarni, zhenwei pi) - fix and cleanup nvmet Identify handling (Damien Le Moal, Christoph Hellwig) - fix double blk_mq_complete_request race in the timeout handler (Lei Yin) - fix irq 
locking in nvme-fcloop (Ming Lei) - remove queue mapping helper for rdma devices (Sagi Grimberg)" * tag 'nvme-6.4-2023-04-14' of git://git.infradead.org/nvme: nvme-fcloop: fix "inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage" blk-mq-rdma: remove queue mapping helper for rdma devices nvme-rdma: minor cleanup in nvme_rdma_create_cq() nvme: fix double blk_mq_complete_request for timeout request with low probability nvme: fix async event trace event nvme-apple: return directly instead of else nvme-apple: return directly instead of else nvmet-tcp: validate idle poll modparam value nvmet-tcp: validate so_priority modparam value nvme-tcp: fence TCP socket on receive error nvmet: remove nvmet_req_cns_error_complete nvmet: rename nvmet_execute_identify_cns_cs_ns nvmet: fix Identify Identification Descriptor List handling nvmet: cleanup nvmet_execute_identify() nvmet: fix I/O Command Set specific Identify Controller nvmet: fix Identify Active Namespace ID list handling nvmet: fix Identify Controller handling nvmet: fix Identify Namespace handling nvmet: fix error handling in nvmet_execute_identify_cns_cs_ns() nvme-pci: drop redundant pci_enable_pcie_error_reporting()	2023-04-14 06:31:29 -06:00
Christoph Hellwig	4d5bba5bee	blk-mq: remove __blk_mq_run_hw_queue __blk_mq_run_hw_queue just contains a WARN_ON_ONCE for calls from interrupt context and a blk_mq_run_dispatch_ops-protected call to blk_mq_sched_dispatch_requests. Open code the call to blk_mq_sched_dispatch_requests in both callers, and move the WARN_ON_ONCE to blk_mq_run_hw_queue where it can be extended to all !async calls, while the other call is from workqueue context and thus obviously does not need the assert. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413060651.694656-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-04-13 06:58:02 -06:00
Christoph Hellwig	1aa8d875b5	blk-mq: move the !async handling out of __blk_mq_delay_run_hw_queue Only blk_mq_run_hw_queue can call __blk_mq_delay_run_hw_queue with async=false, so move the handling there. With this __blk_mq_delay_run_hw_queue can be merged into blk_mq_delay_run_hw_queue. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413060651.694656-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-04-13 06:57:18 -06:00
Christoph Hellwig	cd735e1113	blk-mq: move the blk_mq_hctx_stopped check in __blk_mq_delay_run_hw_queue For the in-context dispatch, blk_mq_hctx_stopped is already checked in blk_mq_sched_dispatch_requests under blk_mq_run_dispatch_ops() protection. For the async dispatch case having a check before scheduling the work still makes sense to avoid needless workqueue scheduling, so just keep it for that case. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413060651.694656-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-04-13 06:57:18 -06:00
Christoph Hellwig	c20a1a2c1a	blk-mq: remove the blk_mq_hctx_stopped check in blk_mq_run_work_fn blk_mq_hctx_stopped is already checked in blk_mq_sched_dispatch_requests under blk_mq_run_dispatch_ops() protection, so remove the duplicate check. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413060651.694656-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-04-13 06:57:18 -06:00
Christoph Hellwig	89ea5ceb53	blk-mq: cleanup __blk_mq_sched_dispatch_requests __blk_mq_sched_dispatch_requests currently has duplicated logic for the cases where requests are on the hctx dispatch list or not. Merge the two with a new need_dispatch variable and remove a few pointless local variables. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413060651.694656-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-04-13 06:57:18 -06:00
Christoph Hellwig	b12e5c6c75	blk-mq: pass a flags argument to blk_mq_add_to_requeue_list Replace the boolean at_head argument with the same flags that are already passed to blk_mq_insert_request. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-21-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-04-13 06:52:30 -06:00
Christoph Hellwig	93fffe16f7	blk-mq: pass a flags argument to elevator_type->insert_requests Instead of passing a bool at_head, pass down the full flags from the blk_mq_insert_request interface. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-20-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-04-13 06:52:30 -06:00
Christoph Hellwig	2b5976134b	blk-mq: pass a flags argument to blk_mq_request_bypass_insert Replace the boolean at_head argument with the same flags that are already passed to blk_mq_insert_request. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-19-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-04-13 06:52:30 -06:00
Christoph Hellwig	710fa3789e	blk-mq: pass a flags argument to blk_mq_insert_request Replace the at_head bool with a flags argument that so far only contains a single BLK_MQ_INSERT_AT_HEAD value. This makes it much easier to grep for head insertions into the blk-mq dispatch queues. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-18-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-04-13 06:52:30 -06:00
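The change from a bool `at_head` parameter to a flags argument, as described in the commit above, can be sketched with a toy queue. Only the idea of a single "insert at head" flag comes from the commit message; `insert_flags_t`, `INSERT_AT_HEAD`, `struct toy_queue` and `toy_insert()` are hypothetical names for illustration.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical flags type with a single "insert at head" bit. */
typedef unsigned int insert_flags_t;
#define INSERT_AT_HEAD 0x01u

#define TOY_QUEUE_CAP 8

struct toy_queue {
	int items[TOY_QUEUE_CAP];
	int count;
};

/*
 * Insert at the head when INSERT_AT_HEAD is set, at the tail otherwise.
 * A call site reads toy_insert(q, x, INSERT_AT_HEAD) rather than
 * toy_insert(q, x, true), so head insertions are easy to grep for.
 */
void toy_insert(struct toy_queue *q, int item, insert_flags_t flags)
{
	assert(q->count < TOY_QUEUE_CAP);
	if (flags & INSERT_AT_HEAD) {
		/* Shift existing entries up by one slot to free index 0. */
		memmove(&q->items[1], &q->items[0],
			(size_t)q->count * sizeof(q->items[0]));
		q->items[0] = item;
	} else {
		q->items[q->count] = item;
	}
	q->count++;
}
```

A flags word also leaves room for further insertion modes without changing every function signature again.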
Christoph Hellwig	214a441805	blk-mq: don't kick the requeue_list in blk_mq_add_to_requeue_list blk_mq_add_to_requeue_list takes a bool parameter to control how to kick the requeue list at the end of the function. Move the call to blk_mq_kick_requeue_list to the callers that want it instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-17-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-04-13 06:52:30 -06:00
Christoph Hellwig	2394395cd5	blk-mq: don't run the hw_queue from blk_mq_request_bypass_insert blk_mq_request_bypass_insert takes a bool parameter to control how to run the queue at the end of the function. Move the blk_mq_run_hw_queue call to the callers that want it instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-16-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-04-13 06:52:30 -06:00
Christoph Hellwig	f0dbe6e88e	blk-mq: don't run the hw_queue from blk_mq_insert_request blk_mq_insert_request takes two bool parameters to control how to run the queue at the end of the function. Move the blk_mq_run_hw_queue call to the callers that want it instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-15-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-04-13 06:52:30 -06:00
Christoph Hellwig	e1f44ac0d7	blk-mq: fold __blk_mq_try_issue_directly into its two callers Due to the wildly different behavior based on the bypass_insert argument, not a whole lot of code in __blk_mq_try_issue_directly is actually shared between blk_mq_try_issue_directly and blk_mq_request_issue_directly. Remove __blk_mq_try_issue_directly and fold the code into the two callers instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-14-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-04-13 06:52:30 -06:00
Christoph Hellwig	2b71b87707	blk-mq: factor out a blk_mq_get_budget_and_tag helper Factor out a helper from __blk_mq_try_issue_directly in preparation of folding that function into its two callers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-13-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-04-13 06:52:30 -06:00
Christoph Hellwig	a1e948b81a	blk-mq: refactor the DONTPREP/SOFTBARRIER handling in blk_mq_requeue_work Split the RQF_DONTPREP and RQF_SOFTBARRIER in separate branches to make the code more readable. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-12-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-04-13 06:52:30 -06:00
Christoph Hellwig	53548d2a94	blk-mq: refactor passthrough vs flush handling in blk_mq_insert_request While both passthrough and flush requests call directly into blk_mq_request_bypass_insert, the parameters aren't the same. Split the handling into two separate conditionals and turn the whole function into an if/elif/elif/else flow instead of the gotos. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-11-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-04-13 06:52:30 -06:00
Christoph Hellwig	a4fa57ffb7	blk-mq: remove blk_flush_queue_rq Just call blk_mq_add_to_requeue_list directly from the two callers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-10-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-04-13 06:52:29 -06:00
Christoph Hellwig	4ec5c0553c	blk-mq: fold __blk_mq_insert_req_list into blk_mq_insert_request Remove this very small helper and fold it into the only caller. Note that this moves the trace_block_rq_insert out of ctx->lock, matching the other calls to this tracepoint. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-9-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-04-13 06:52:29 -06:00
Christoph Hellwig	a88db1e000	blk-mq: fold __blk_mq_insert_request into blk_mq_insert_request There is no good point in keeping the __blk_mq_insert_request around for two function calls and a single caller. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-8-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-04-13 06:52:29 -06:00
Christoph Hellwig	2bd215df79	blk-mq: move blk_mq_sched_insert_request to blk-mq.c blk_mq_sched_insert_request is the main request insert helper and not directly I/O scheduler related. Move blk_mq_sched_insert_request to blk-mq.c, rename it to blk_mq_insert_request and mark it static. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-04-13 06:52:29 -06:00
Christoph Hellwig	05a9311770	blk-mq: fold blk_mq_sched_insert_requests into blk_mq_dispatch_plug_list blk_mq_dispatch_plug_list is the only caller of blk_mq_sched_insert_requests, and it makes sense to just fold it there as blk_mq_sched_insert_requests isn't specific to I/O schedulers despite the name. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-04-13 06:52:29 -06:00
Christoph Hellwig	94aa228c2a	blk-mq: move more logic into blk_mq_insert_requests Move all logic related to the direct insert (including the call to blk_mq_run_hw_queue) into blk_mq_insert_requests to streamline the code flow a bit, and to allow marking blk_mq_try_issue_list_directly static. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-04-13 06:52:29 -06:00
Christoph Hellwig	90110e04f2	blk-mq: include <linux/blk-mq.h> in block/blk-mq.h block/blk-mq.h needs various definitions from <linux/blk-mq.h>, include it there instead of relying on the source files to include both. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-04-13 06:52:29 -06:00
Christoph Hellwig	bebe84ebee	blk-mq: remove blk-mq-tag.h blk-mq-tag.h is always included by blk-mq.h, and causes recursive inclusion hell with further changes. Just merge it into blk-mq.h instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-04-13 06:52:29 -06:00
Christoph Hellwig	50947d7fe9	blk-mq: don't plug for head insertions in blk_execute_rq_nowait Plugs never insert at head, so don't plug for head insertions. Fixes: `1c2d2fff6d` ("block: wire-up support for passthrough plugging") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-04-13 06:52:29 -06:00
Chengming Zhou	8e15dfbd9a	blk-throttle: only enable blk-stat when BLK_DEV_THROTTLING_LOW blk_throtl_register() unconditionally enables blk-stat for the gendisk at registration, even when BLK_DEV_THROTTLING_LOW is not configured. Since the kernel always has only the BLK_DEV_THROTTLING config and the BLK_DEV_THROTTLING_LOW config is still in EXPERIMENTAL state, we can just skip blk-stat when !BLK_DEV_THROTTLING_LOW. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230413062805.2081970-2-chengming.zhou@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-04-13 06:48:11 -06:00
Chengming Zhou	20de765f6d	blk-stat: fix QUEUE_FLAG_STATS clear We need to set QUEUE_FLAG_STATS for two cases: 1. blk_stat_enable_accounting() 2. blk_stat_add_callback() So we should clear it only when ((q->stats->accounting == 0) && list_empty(&q->stats->callbacks)). blk_stat_disable_accounting() only checks whether q->stats->accounting is 0 before clearing the flag; this patch fixes that. Also add a list_empty(&q->stats->callbacks) check when enabling; otherwise the flag is already set. The bug can be reproduced on a kernel without BLK_DEV_THROTTLING (since it unconditionally enables accounting, see the next patch). # cat /sys/block/sr0/queue/scheduler none mq-deadline [bfq] # cat /sys/kernel/debug/block/sr0/state SAME_COMP|IO_STAT|INIT_DONE|STATS|REGISTERED|NOWAIT|30 # echo none > /sys/block/sr0/queue/scheduler # cat /sys/kernel/debug/block/sr0/state SAME_COMP|IO_STAT|INIT_DONE|REGISTERED|NOWAIT # cat /sys/block/sr0/queue/wbt_lat_usec 75000 We can see that after changing the elevator from "bfq" to "none", the "STATS" flag is lost even though the WBT callback still needs it. Fixes: `68497092bd` ("block: make queue stat accounting a reference") Cc: <stable@vger.kernel.org> # v5.17+ Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230413062805.2081970-1-chengming.zhou@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-04-13 06:48:11 -06:00
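The rule stated in the blk-stat commit above, that the flag has two independent users and may be cleared only when the accounting count is zero and no callbacks remain, can be sketched as follows. This is a toy model and not the kernel code: `struct toy_stats` and the `toy_*` helpers are hypothetical, and a callback count stands in for the kernel's callback list.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the two users that keep the STATS flag alive. */
struct toy_stats {
	int accounting;   /* reference count of accounting users */
	int nr_callbacks; /* stands in for the callbacks list */
	bool flag_stats;  /* stands in for QUEUE_FLAG_STATS */
};

void toy_enable_accounting(struct toy_stats *s)
{
	/* With callbacks registered, the flag is already set. */
	if (s->accounting++ == 0 && s->nr_callbacks == 0)
		s->flag_stats = true;
}

void toy_disable_accounting(struct toy_stats *s)
{
	assert(s->accounting > 0);
	/* Clear only if no callback still depends on the flag. */
	if (--s->accounting == 0 && s->nr_callbacks == 0)
		s->flag_stats = false;
}

void toy_add_callback(struct toy_stats *s)
{
	s->nr_callbacks++;
	s->flag_stats = true;
}

void toy_remove_callback(struct toy_stats *s)
{
	assert(s->nr_callbacks > 0);
	/* Clear only if no accounting user still depends on the flag. */
	if (--s->nr_callbacks == 0 && s->accounting == 0)
		s->flag_stats = false;
}
```

In terms of the reproduction above, a registered callback plays the role of WBT and an accounting reference plays the role of bfq: dropping the accounting reference must leave the flag set.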
Tejun Heo	a13696b83d	blk-iolatency: Make initialization lazy Other rq_qos policies such as wbt and iocost are lazy-initialized when they are configured for the first time for the device but iolatency is initialized unconditionally from blkcg_init_disk() during gendisk init. Lazy init is beneficial because rq_qos policies add runtime overhead when initialized as every IO has to walk all registered rq_qos callbacks. This patch switches iolatency to lazy initialization too so that it only registers its rq_qos policy when it is first configured. Note that there is a known race condition between blkcg config file writes and del_gendisk() and this patch makes iolatency susceptible to it by exposing the init path to race against the deletion path. However, that problem already exists in iocost and is being worked on. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/20230413000649.115785-5-tj@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-04-13 06:46:49 -06:00
Tejun Heo	3304918758	blk-iolatency: s/blkcg_rq_qos/iolat_rq_qos/ The name was too generic given that there are multiple blkcg rq-qos policies. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/20230413000649.115785-4-tj@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-04-13 06:46:49 -06:00
Tejun Heo	faffaab289	blkcg: Restructure blkg_conf_prep() and friends We want to support lazy init of rq-qos policies so that iolatency is enabled lazily on configuration instead of gendisk initialization. The way blkg config helpers are structured now is a bit awkward for that. Let's restructure: * blkcg_conf_open_bdev() is renamed to blkg_conf_open_bdev(). The blkcg_ prefix was used because the bdev opening step is blkg-independent. However, the distinction is too subtle and confuses more than helps. Let's switch to blkg prefix so that it's consistent with the type and other helper names. * struct blkg_conf_ctx now remembers the original input string and is always initialized by the new blkg_conf_init(). * blkg_conf_open_bdev() is updated to take a pointer to blkg_conf_ctx like blkg_conf_prep() and can be called multiple times safely. Instead of modifying the double pointer to input string directly, blkg_conf_open_bdev() now sets blkg_conf_ctx->body. * blkg_conf_finish() is renamed to blkg_conf_exit() for symmetry and now must be called on all blkg_conf_ctx's which were initialized with blkg_conf_init(). Combined, this allows the users to either open the bdev first or do it altogether with blkg_conf_prep() which will help implementing lazy init of rq-qos policies. blkg_conf_init/exit() will also be used to implement synchronization against device removal. This is necessary because iolat / iocost are configured through cgroupfs instead of one of the files under /sys/block/DEVICE. As cgroupfs operations aren't synchronized with block layer, the lazy init and other configuration operations may race against device removal. This patch makes blkg_conf_init/exit() used consistently for all cgroup-originating configurations making them a good place to implement explicit synchronization. Users are updated accordingly. No behavior change is intended by this patch. v2: bfq wasn't updated in v1 causing a build error. Fixed. 
v3: Update the description to include future use of blkg_conf_init/exit() as synchronization points. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Yu Kuai <yukuai1@huaweicloud.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230413000649.115785-3-tj@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-04-13 06:46:49 -06:00
Tejun Heo	83462a6c97	blkcg: Drop unnecessary RCU read [un]locks from blkg_conf_prep/finish() Now that all RCU flavors have been combined either holding a spin lock, disabling irq or disabling preemption implies RCU read lock, so there's no need to use rcu_read_[un]lock() explicitly while holding queue_lock. This shouldn't cause any behavior changes. v2: Description updated. Leave __acquires/release on queue_lock alone. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230413000649.115785-2-tj@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-04-13 06:46:48 -06:00
Sagi Grimberg	edde9e70bb	blk-mq-rdma: remove queue mapping helper for rdma devices No rdma device exposes its irq vectors affinity today. So the only mapping that we have left, is the default blk_mq_map_queues, which we fallback to anyways. Also fixup the only consumer of this helper (nvme-rdma). Remove this now dead code. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Acked-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>	2023-04-13 08:59:05 +02:00
Mike Christie	7ba150834b	block: Rename BLK_STS_NEXUS to BLK_STS_RESV_CONFLICT BLK_STS_NEXUS is used for NVMe/SCSI reservation conflicts and DASD's locking feature which works similar to NVMe/SCSI reservations where a host can get a lock on a device and when the lock is taken it will get failures. This patch renames BLK_STS_NEXUS so it better reflects this type of use. Signed-off-by: Mike Christie <michael.christie@oracle.com> Link: https://lore.kernel.org/r/20230407200551.12660-3-michael.christie@oracle.com Acked-by: Stefan Haberland <sth@linux.ibm.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>	2023-04-11 21:55:35 -04:00
Yu Kuai	3723091ea1	block: don't set GD_NEED_PART_SCAN if scan partition failed Currently, if disk_scan_partitions() fails, GD_NEED_PART_SCAN is still set, and the partition scan proceeds again when blkdev_get_by_dev() is called. However, this causes a problem: re-assembling a partitioned raid device creates partitions for the underlying disk. Test procedure: mdadm -CR /dev/md0 -l 1 -n 2 /dev/sda /dev/sdb -e 1.0 sgdisk -n 0:0:+100MiB /dev/md0 blockdev --rereadpt /dev/sda blockdev --rereadpt /dev/sdb mdadm -S /dev/md0 mdadm -A /dev/md0 /dev/sda /dev/sdb Test result: underlying disk partition and raid partition can be observed at the same time. Note that this can still happen in some corner cases where GD_NEED_PART_SCAN is set for the underlying disk while re-assembling the raid device. Fixes: `e5cfefa97b` ("block: fix scan partition for exclusively open device again") Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-04-06 20:41:53 -06:00
Chengming Zhou	650e2cb50f	blk-cgroup: delete cpd_init_fn of blkcg_policy blkcg_policy cpd_init_fn() is used only to initialize some default fields of policy data, which can just as well be done in cpd_alloc_fn(). This patch deletes the only user, bfq_cpd_init(), and removes cpd_init_fn from blkcg_policy. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230406145050.49914-4-zhouchengming@bytedance.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-04-06 16:17:32 -06:00
Chengming Zhou	d1023165ee	blk-cgroup: delete cpd_bind_fn of blkcg_policy cpd_bind_fn is only used to update the default weight when the block subsys is attached to a hierarchy. No policy needs it anymore. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230406145050.49914-3-zhouchengming@bytedance.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-04-06 16:17:32 -06:00
Chengming Zhou	e9f2f3f590	block, bfq: remove BFQ_WEIGHT_LEGACY_DFL BFQ_WEIGHT_LEGACY_DFL is the same as CGROUP_WEIGHT_DFL, which means we don't need the cpd_bind_fn() callback to update the default weight when attached to a hierarchy. This patch removes BFQ_WEIGHT_LEGACY_DFL and cpd_bind_fn(). Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230406145050.49914-2-zhouchengming@bytedance.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-04-06 16:17:32 -06:00
Matthew Wilcox	cd57b77197	ext4: Convert ext4_bio_write_page() to use a folio Remove several calls to compound_head() and the last caller of set_page_writeback_keepwrite(), so remove the wrapper too. Also export bio_add_folio() as this is the first caller from a module. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20230324180129.1220691-4-willy@infradead.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2023-04-06 13:39:50 -04:00
Ondrej Kozina	4c4dd04e75	sed-opal: Add command to read locking range parameters. It returns following attributes: locking range start locking range length read lock enabled write lock enabled lock state (RW, RO or LK) It can be retrieved by user authority provided the authority was added to locking range via prior IOC_OPAL_ADD_USR_TO_LR ioctl command. The command was extended to add user in ACE that allows to read attributes listed above. Signed-off-by: Ondrej Kozina <okozina@redhat.com> Tested-by: Luca Boccassi <bluca@debian.org> Tested-by: Milan Broz <gmazyland@gmail.com> Link: https://lore.kernel.org/r/20230405111223.272816-6-okozina@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-04-05 07:46:26 -06:00
Ondrej Kozina	baf82b679c	sed-opal: add helper to get multiple columns at once. Refactors current code querying single column to use the new helper. Real multi column usage will be added later. Signed-off-by: Ondrej Kozina <okozina@redhat.com> Tested-by: Luca Boccassi <bluca@debian.org> Tested-by: Milan Broz <gmazyland@gmail.com> Acked-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230405111223.272816-5-okozina@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-04-05 07:46:26 -06:00
Ondrej Kozina	8be19a02f1	sed-opal: allow user authority to get locking range attributes. Extend ACE set of locking range attributes accessible to user authority. This patch allows user authority to get the following locking range attributes when the user is added to a locking range via IOC_OPAL_ADD_USR_TO_LR: locking range start locking range end read lock enabled write lock enabled read locked write locked lock on reset active key Note: Admin1 authority always remains in the ACE. Otherwise it breaks current userspace expecting Admin1 in the ACE (sedutils). See TCG OPAL2 s.4.3.1.7 "ACE_Locking_RangeNNNN_Get_RangeStartToActiveKey". Signed-off-by: Ondrej Kozina <okozina@redhat.com> Tested-by: Luca Boccassi <bluca@debian.org> Tested-by: Milan Broz <gmazyland@gmail.com> Acked-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230405111223.272816-4-okozina@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-04-05 07:46:25 -06:00
Ondrej Kozina	175b654402	sed-opal: add helper for adding user authorities in ACE. Move ACE construction away from add_user_to_lr routine and refactor it to be used also in later code. Also adds boolean operators defines from TCG Core specification. Signed-off-by: Ondrej Kozina <okozina@redhat.com> Tested-by: Luca Boccassi <bluca@debian.org> Tested-by: Milan Broz <gmazyland@gmail.com> Link: https://lore.kernel.org/r/20230405111223.272816-3-okozina@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-04-05 07:46:25 -06:00
Ondrej Kozina	2fce95b196	sed-opal: do not add same authority twice in boolean ace. While adding user authority in boolean ace value of uid OPAL_LOCKINGRANGE_ACE_WRLOCKED or OPAL_LOCKINGRANGE_ACE_RDLOCKED, it was added twice. It seemed redundant when only single authority was added in the set method aka { authority1, authority1, OR }: TCG Storage Architecture Core Specification, 5.1.3.3 ACE_expression "This is an alternative type where the options are either a uidref to an Authority object or one of the boolean_ACE (AND = 0 and OR = 1) options. This type is used within the AC_element list to form a postfix Boolean expression of Authorities." Signed-off-by: Ondrej Kozina <okozina@redhat.com> Tested-by: Luca Boccassi <bluca@debian.org> Tested-by: Milan Broz <gmazyland@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/r/20230405111223.272816-2-okozina@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-04-05 07:46:25 -06:00
Keith Busch	38a8c4d1d4	blk-mq: directly poll requests Polling needs a bio with a valid bi_bdev, but neither of those is guaranteed for polled driver requests. Make request based polling directly use blk-mq's polling function instead. When executing a request from a polled hctx, we know the request's cookie, and that it's from a live blk-mq queue that supports polling, so we can safely skip everything that bio_poll provides. Cc: stable@kernel.org Reported-by: Martin Belanger <Martin.Belanger@dell.com> Reported-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Tested-by: Daniel Wagner <dwagner@suse.de> Reviewed-by: Daniel Wagner <dwagner@suse.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Link: https://lore.kernel.org/r/20230331180056.1155862-1-kbusch@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-04-04 16:11:47 -06:00
Greg Kroah-Hartman	cd8fe5b6db	Merge 6.3-rc5 into driver-core-next We need the fixes in here for testing, as well as the driver core changes for documentation updates to build on. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-04-03 09:33:30 +02:00
Greg Kroah-Hartman	e78195d529	driver core: class: remove dev_kobj from struct class The dev_kobj field in struct class is now only written to, but never read from, so it can be removed as it is useless. Acked-by: Rafael J. Wysocki <rafael@kernel.org> Link: https://lore.kernel.org/r/20230331093318.82288-5-gregkh@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-03-31 17:45:11 +02:00
Jens Axboe	de4f5fed3f	iov_iter: add iter_iovec() helper This returns a pointer to the current iovec entry in the iterator. Only useful with ITER_IOVEC right now, but it prepares us to treat ITER_UBUF and ITER_IOVEC identically for the first segment. Rename struct iov_iter->iov to iov_iter->__iov to find any potentially troublesome spots, and also to prevent anyone from adding new code that accesses iter->iov directly. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-03-30 08:12:29 -06:00
Jens Axboe	0a2481cde2	block: ensure bio_alloc_map_data() deals with ITER_UBUF correctly This helper blindly copies the iovec, even if we don't have one. Make this case a bit smarter by only doing so if we have an iovec array to copy. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-03-29 08:55:18 -06:00
Chaitanya Kulkarni	06965037ce	block: open code __blk_account_io_done() There is only one caller for __blk_account_io_done(), and the function is small enough to fit in its caller blk_account_io_done(). Remove the function and open code it in its caller blk_account_io_done(). Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20230327073427.4403-2-kch@nvidia.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-03-27 13:22:58 -06:00
Chaitanya Kulkarni	e165fb4dd6	block: open code __blk_account_io_start() There is only one caller for __blk_account_io_start(), and the function is small enough to fit in its caller blk_account_io_start(). Remove the function and open code it in its caller blk_account_io_start(). Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20230327073427.4403-2-kch@nvidia.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-03-27 13:22:58 -06:00
Keith Busch	54bdd67d0f	blk-mq: remove hybrid polling io_uring provides the only way user space can poll completions, and that always sets BLK_POLL_NOSLEEP. This effectively makes hybrid polling dead code, so remove it and everything supporting it. Hybrid polling was effectively killed off with `9650b453a3`, "block: ignore RWF_HIPRI hint for sync dio", but still potentially reachable through io_uring until `d729cf9acb`, "io_uring: don't sleep when polling for I/O", but hybrid polling probably should not have been reachable through that async interface from the beginning. Fixes: `9650b453a3` ("block: ignore RWF_HIPRI hint for sync dio") Fixes: `d729cf9acb` ("io_uring: don't sleep when polling for I/O") Signed-off-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20230320194926.3353144-1-kbusch@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-03-20 15:30:03 -06:00
Greg Kroah-Hartman	1aaba11da9	driver core: class: remove module * from class_create() The module pointer in class_create() never actually did anything, and it shouldn't have been required to be set as a parameter even if it did something. So just remove it and fix up all callers of the function in the kernel tree at the same time. Cc: "Rafael J. Wysocki" <rafael@kernel.org> Acked-by: Benjamin Tissoires <benjamin.tissoires@redhat.com> Link: https://lore.kernel.org/r/20230313181843.1207845-4-gregkh@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-03-17 15:16:33 +01:00
Lukas Bulwahn	8f0d196e4d	block: remove obsolete config BLOCK_COMPAT Before commit `bdc1ddad3e` ("compat_ioctl: block: move blkdev_compat_ioctl() into ioctl.c"), the config BLOCK_COMPAT was used to include compat_ioctl.c into the kernel build. With this commit, the code is moved into ioctl.c and included with the config COMPAT. So, since then, the config BLOCK_COMPAT has had no effect and no further purpose. Remove this obsolete config BLOCK_COMPAT. Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Arnd Bergmann <arnd@arndb.de> Link: https://lore.kernel.org/r/20230316111630.4897-1-lukas.bulwahn@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-03-16 09:35:44 -06:00
Eric Biggers	4cf2c3ab2c	blk-crypto: drop the NULL check from blk_crypto_put_keyslot() Now that all callers of blk_crypto_put_keyslot() check for NULL before calling it, there is no need for blk_crypto_put_keyslot() to do the NULL check itself. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20230315183907.53675-2-ebiggers@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-03-16 09:35:09 -06:00
Eric Biggers	5b8562f0e8	blk-mq: return actual keyslot error in blk_insert_cloned_request() To avoid hiding information, pass on the error code from blk_crypto_rq_get_keyslot() instead of always using BLK_STS_IOERR. Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230315183907.53675-2-ebiggers@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-03-16 09:35:09 -06:00
Eric Biggers	435c0e9996	blk-crypto: remove blk_crypto_insert_cloned_request() blk_crypto_insert_cloned_request() is the same as blk_crypto_rq_get_keyslot(), so just use that directly. Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230315183907.53675-2-ebiggers@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-03-16 09:35:09 -06:00
Eric Biggers	5c7cb94452	blk-crypto: make blk_crypto_evict_key() more robust If blk_crypto_evict_key() sees that the key is still in-use (due to a bug) or that ->keyslot_evict failed, it currently just returns while leaving the key linked into the keyslot management structures. However, blk_crypto_evict_key() is only called in contexts such as inode eviction where failure is not an option. So actually the caller proceeds with freeing the blk_crypto_key regardless of the return value of blk_crypto_evict_key(). These two assumptions don't match, and the result is that there can be a use-after-free in blk_crypto_reprogram_all_keys() after one of these errors occurs. (Note, these errors shouldn't happen; we're just talking about what happens if they do anyway.) Fix this by making blk_crypto_evict_key() unlink the key from the keyslot management structures even on failure. Also improve some comments. Fixes: `1b26283970` ("block: Keyslot Manager for Inline Encryption") Cc: stable@vger.kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230315183907.53675-2-ebiggers@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-03-16 09:35:09 -06:00
Eric Biggers	70493a63ba	blk-crypto: make blk_crypto_evict_key() return void blk_crypto_evict_key() is only called in contexts such as inode eviction where failure is not an option. So there is nothing the caller can do with errors except log them. (dm-table.c does "use" the error code, but only to pass on to upper layers, so it doesn't really count.) Just make blk_crypto_evict_key() return void and log errors itself. Cc: stable@vger.kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230315183907.53675-2-ebiggers@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-03-16 09:35:09 -06:00
Eric Biggers	9cd1e56667	blk-mq: release crypto keyslot before reporting I/O complete Once all I/O using a blk_crypto_key has completed, filesystems can call blk_crypto_evict_key(). However, the block layer currently doesn't call blk_crypto_put_keyslot() until the request is being freed, which happens after upper layers have been told (via bio_endio()) the I/O has completed. This causes a race condition where blk_crypto_evict_key() can see 'slot_refs != 0' without there being an actual bug. This makes __blk_crypto_evict_key() hit the 'WARN_ON_ONCE(atomic_read(&slot->slot_refs) != 0)' and return without doing anything, eventually causing a use-after-free in blk_crypto_reprogram_all_keys(). (This is a very rare bug and has only been seen when per-file keys are being used with fscrypt.) There are two options to fix this: either release the keyslot before bio_endio() is called on the request's last bio, or make __blk_crypto_evict_key() ignore slot_refs. Let's go with the first solution, since it preserves the ability to report bugs (via WARN_ON_ONCE) where a key is evicted while still in-use. Fixes: `a892c8d52c` ("block: Inline encryption support for blk-mq") Cc: stable@vger.kernel.org Reviewed-by: Nathan Huckleberry <nhuck@google.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20230315183907.53675-2-ebiggers@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-03-16 09:35:09 -06:00
Yu Kuai	5f27571382	block: count 'ios' and 'sectors' when io is done for bio-based device While using iostat for raid, I occasionally observed a very strange 'await', and it turns out this is because 'ios' and 'sectors' are counted in bdev_start_io_acct(), while 'nsecs' is counted in bdev_end_io_acct(). I'm not sure why they are counted like that but I think this behaviour is obviously wrong because the user will get wrong disk stats. Fix the problem by counting 'ios' and 'sectors' when io is done, as rq-based devices do. Fixes: `394ffa503b` ("blk: introduce generic io stat accounting help function") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230223091226.1135678-1-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-03-15 09:25:04 -06:00
Chris Leech	00e885efcf	blk-mq: fix "bad unlock balance detected" on q->srcu in __blk_mq_run_dispatch_ops The 'q' parameter of the macro __blk_mq_run_dispatch_ops may not be a local variable. For example, if it is rq->q, the request queue pointed to by this variable could be changed to another queue in the case of BLK_MQ_F_TAG_QUEUE_SHARED after 'dispatch_ops' returns, and then 'bad unlock balance' is triggered. Fix the issue by adding a local variable for doing the srcu lock/unlock. Fixes: `2a904d0085` ("blk-mq: remove hctx_lock and hctx_unlock") Cc: Marco Patalano <mpatalan@redhat.com> Signed-off-by: Chris Leech <cleech@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230310010913.1014789-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-03-14 19:20:55 -06:00
Jan Kara	34e0a279a9	block: do not reverse request order when flushing plug list Commit `26fed4ac4e` ("block: flush plug based on hardware and software queue order") changed flushing of plug list to submit requests one device at a time. However while doing that it also started using list_add_tail() instead of list_add() used previously thus effectively submitting requests in reverse order. Also when forming a rq_list with remaining requests (in case two or more devices are used), we effectively reverse the ordering of the plug list for each device we process. Submitting requests in reverse order has negative impact on performance for rotational disks (when BFQ is not in use). We observe 10-25% regression in random 4k write throughput, as well as ~20% regression in MariaDB OLTP benchmark on rotational storage on btrfs filesystem. Fix the problem by preserving ordering of the plug list when inserting requests into the queuelist as well as by appending to requeue_list instead of prepending to it. Fixes: `26fed4ac4e` ("block: flush plug based on hardware and software queue order") Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230313093002.11756-1-jack@suse.cz Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-03-14 09:25:50 -06:00
Yu Kuai	e2f2a39452	block, bfq: fix uaf for 'stable_merge_bfqq' Before commit `fd571df0ac` ("block, bfq: turn bfqq_data into an array in bfq_io_cq"), process reference is read before bfq_put_stable_ref(), and it's safe if bfq_put_stable_ref() put the last reference, because process reference will be 0 and 'stable_merge_bfqq' won't be accessed in this case. However, the commit changed the order and will cause uaf for 'stable_merge_bfqq'. In order to emphasize that bfq_put_stable_ref() can drop the last reference, fix the problem by moving bfq_put_stable_ref() to the end of bfq_setup_stable_merge(). Fixes: `fd571df0ac` ("block, bfq: turn bfqq_data into an array in bfq_io_cq") Reported-and-tested-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Link: https://lore.kernel.org/linux-block/20230307071448.rzihxbm4jhbf5krj@shindev/ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-03-08 07:34:50 -07:00
Yu Kuai	428913bce1	block: fix wrong mode for blkdev_put() from disk_scan_partitions() If disk_scan_partitions() is called with 'FMODE_EXCL', blkdev_get_by_dev() will be called without 'FMODE_EXCL'. However, the following blkdev_put() is still called with 'FMODE_EXCL', which will cause the 'bd_holders' counter to leak. Fix the problem by using the right mode for blkdev_put(). Reported-by: syzbot+2bcc0d79e548c4f62a59@syzkaller.appspotmail.com Link: https://lore.kernel.org/lkml/f9649d501bc8c3444769418f6c26263555d9d3be.camel@linux.ibm.com/T/ Tested-by: Julian Ruess <julianr@linux.ibm.com> Fixes: `e5cfefa97b` ("block: fix scan partition for exclusively open device again") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-03-07 07:24:38 -07:00
Greg Kroah-Hartman	721da5cee9	driver core: remove CONFIG_SYSFS_DEPRECATED and CONFIG_SYSFS_DEPRECATED_V2 CONFIG_SYSFS_DEPRECATED was added in commit `88a22c985e` ("CONFIG_SYSFS_DEPRECATED") in 2006 to allow systems with older versions of some tools (i.e. Fedora 3's version of udev) to boot properly. Four years later, in 2010, the option was attempted to be removed as most of userspace should have been fixed up properly by then, but some kernel developers clung to those old systems and refused to update, so we added CONFIG_SYSFS_DEPRECATED_V2 in commit `e52eec13cd` ("SYSFS: Allow boot time switching between deprecated and modern sysfs layout") to allow them to continue to boot properly, and we allowed a boot time parameter to be used to switch back to the old format if needed. Over time, the logic that was covered under these config options was slowly removed from individual driver subsystems successfully, removed, and the only thing that is now left in the kernel are some changes in the block layer's representation in sysfs where real directories are used instead of symlinks like normal. Because the original changes were done to userspace tools in 2006, and all distros that use those tools are long end-of-life, and older non-udev-based systems do not care about the block layer's sysfs representation, it is time to finally remove this old logic and the config entries from the kernel. Cc: Jonathan Corbet <corbet@lwn.net> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: linux-block@vger.kernel.org Cc: linux-doc@vger.kernel.org Acked-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/20230223073326.2073220-1-gregkh@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-03-06 07:46:23 +01:00
Linus Torvalds	9d0281b56b	block-6.3-2023-03-03 Merge tag 'block-6.3-2023-03-03' of git://git.kernel.dk/linux Pull block fixes from Jens Axboe: - NVMe pull request via Christoph: - Don't access released socket during error recovery (Akinobu Mita) - Bring back auto-removal of deleted namespaces during sequential scan (Christoph Hellwig) - Fix an error code in nvme_auth_process_dhchap_challenge (Dan Carpenter) - Show well known discovery name (Daniel Wagner) - Add a missing endianness conversion in effects masking (Keith Busch) - Fix for a regression introduced in blk-rq-qos during init in this merge window (Breno) - Reorder a few fields in struct blk_mq_tag_set, eliminating a few holes and shrinking it (Christophe) - Remove redundant bdev_get_queue() NULL checks (Juhyung) - Add sed-opal single user mode support flag (Luca) - Remove SQE128 check in ublk as it isn't needed, saving some memory (Ming) - Op specific segment checking for cloned requests (Uday) - Exclusive open partition scan fixes (Yu) - Loop offset/size checking before assigning them in the device (Zhong) - Bio polling fixes (me) * tag 'block-6.3-2023-03-03' of 
git://git.kernel.dk/linux: blk-mq: enforce op-specific segment limits in blk_insert_cloned_request nvme-fabrics: show well known discovery name nvme-tcp: don't access released socket during error recovery nvme-auth: fix an error code in nvme_auth_process_dhchap_challenge() nvme: bring back auto-removal of deleted namespaces during sequential scan blk-iocost: Pass gendisk to ioc_refresh_params nvme: fix sparse warning on effects masking block: be a bit more careful in checking for NULL bdev while polling block: clear bio->bi_bdev when putting a bio back in the cache loop: loop_set_status_from_info() check before assignment ublk: remove check IO_URING_F_SQE128 in ublk_ch_uring_cmd block: remove more NULL checks after bdev_get_queue() blk-mq: Reorder fields in 'struct blk_mq_tag_set' block: fix scan partition for exclusively open device again block: Revert "block: Do not reread partition table on exclusively open device" sed-opal: add support flag for SUM in status ioctl	2023-03-03 10:21:39 -08:00
Uday Shankar	49d2439832	blk-mq: enforce op-specific segment limits in blk_insert_cloned_request The block layer might merge together discard requests up until the max_discard_segments limit is hit, but blk_insert_cloned_request checks the segment count against max_segments regardless of the req op. This can result in errors like the following when discards are issued through a DM device and max_discard_segments exceeds max_segments for the queue of the chosen underlying device. blk_insert_cloned_request: over max segments limit. (256 > 129) Fix this by looking at the req_op and enforcing the appropriate segment limit - max_discard_segments for REQ_OP_DISCARDs and max_segments for everything else. Signed-off-by: Uday Shankar <ushankar@purestorage.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230301000655.48112-1-ushankar@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-03-02 21:00:20 -07:00
Breno Leitao	e33b93650f	blk-iocost: Pass gendisk to ioc_refresh_params Current kernel (`d2980d8d82`) crashes when blk_iocost_init for `nvme1` disk. BUG: kernel NULL pointer dereference, address: 0000000000000050 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page blk_iocost_init (include/asm-generic/qspinlock.h:128 include/linux/spinlock.h:203 include/linux/spinlock_api_smp.h:158 include/linux/spinlock.h:400 block/blk-iocost.c:2884) ioc_qos_write (block/blk-iocost.c:3198) ? kretprobe_perf_func (kernel/trace/trace_kprobe.c:1566) ? kernfs_fop_write_iter (include/linux/slab.h:584 fs/kernfs/file.c:311) ? __kmem_cache_alloc_node (mm/slab.h:? mm/slub.c:3452 mm/slub.c:3491) ? _copy_from_iter (arch/x86/include/asm/uaccess_64.h:46 arch/x86/include/asm/uaccess_64.h:52 lib/iov_iter.c:183 lib/iov_iter.c:628) ? kretprobe_dispatcher (kernel/trace/trace_kprobe.c:1693) cgroup_file_write (kernel/cgroup/cgroup.c:4061) kernfs_fop_write_iter (fs/kernfs/file.c:334) vfs_write (include/linux/fs.h:1849 fs/read_write.c:491 fs/read_write.c:584) ksys_write (fs/read_write.c:637) do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120) This happens because ioc_refresh_params() is being called without a properly initialized ioc->rqos, which is happening later in the callee side. ioc_refresh_params() -> ioc_autop_idx() tries to access ioc->rqos.disk->queue but ioc->rqos.disk is NULL, causing the BUG above. Create function, called ioc_refresh_params_disk(), that is similar to ioc_refresh_params() but where the "struct gendisk" could be passed as an explicit argument. This function will be called when ioc->rqos.disk is not initialized. 
Fixes: `ce57b55860` ("blk-rq-qos: make rq_qos_add and rq_qos_del more useful") Signed-off-by: Breno Leitao <leitao@debian.org> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230228111654.1778120-1-leitao@debian.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-28 05:51:19 -07:00
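The pattern in this commit, a variant that takes the disk as an explicit argument plus a wrapper for callers that run after initialization, might look roughly like the sketch below. All type and function names are hypothetical, the integer field merely stands in for reading disk->queue, and the NULL guard returning -1 is an addition for illustration rather than something the commit message describes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily reduced stand-ins for gendisk, rq_qos and ioc. */
struct disk_sketch { int queue_id; };
struct rqos_sketch { struct disk_sketch *disk; };
struct ioc_sketch  { struct rqos_sketch rqos; int autop_idx; };

/*
 * Variant that receives the disk explicitly, so it can be called before
 * ioc->rqos.disk has been initialized.
 */
static int refresh_params_disk(struct ioc_sketch *ioc, struct disk_sketch *disk)
{
    if (disk == NULL)
        return -1;                   /* illustrative guard, see lead-in */
    ioc->autop_idx = disk->queue_id; /* stands in for using disk->queue */
    return 0;
}

/* Wrapper for callers that run once ioc->rqos.disk is set. */
static int refresh_params(struct ioc_sketch *ioc)
{
    return refresh_params_disk(ioc, ioc->rqos.disk);
}
```

An init path would call refresh_params_disk() with the disk it was handed, and every later caller can keep using the shorter wrapper.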
Linus Torvalds	a93e884edf	Merge tag 'driver-core-6.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updates from Greg KH: "Here is the large set of driver core changes for 6.3-rc1. 
There's a lot of changes this development cycle, most of the work falls into two different categories: - fw_devlink fixes and updates. This has gone through numerous review cycles and lots of review and testing by lots of different devices. Hopefully all should be good now, and Saravana will be keeping a watch for any potential regression on odd embedded systems. - driver core changes to work to make struct bus_type able to be moved into read-only memory (i.e. const) The recent work with Rust has pointed out a number of areas in the driver core where we are passing around and working with structures that really do not have to be dynamic at all, and they should be able to be read-only making things safer overall. This is the continuation of that work (started last release with kobject changes) in moving struct bus_type to be constant. We didn't quite make it for this release, but the remaining patches will be finished up for the release after this one, but the groundwork has been laid for this effort. Other than that we have in here: - debugfs memory leak fixes in some subsystems - error path cleanups and fixes for some never-able-to-be-hit codepaths. 
- cacheinfo rework and fixes - Other tiny fixes, full details are in the shortlog All of these have been in linux-next for a while with no reported problems" [ Geert Uytterhoeven points out that that last sentence isn't true, and that there's a pending report that has a fix that is queued up - Linus ] * tag 'driver-core-6.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (124 commits) debugfs: drop inline constant formatting for ERR_PTR(-ERROR) OPP: fix error checking in opp_migrate_dentry() debugfs: update comment of debugfs_rename() i3c: fix device.h kernel-doc warnings dma-mapping: no need to pass a bus_type into get_arch_dma_ops() driver core: class: move EXPORT_SYMBOL_GPL() lines to the correct place Revert "driver core: add error handling for devtmpfs_create_node()" Revert "devtmpfs: add debug info to handle()" Revert "devtmpfs: remove return value of devtmpfs_delete_node()" driver core: cpu: don't hand-override the uevent bus_type callback. devtmpfs: remove return value of devtmpfs_delete_node() devtmpfs: add debug info to handle() driver core: add error handling for devtmpfs_create_node() driver core: bus: update my copyright notice driver core: bus: add bus_get_dev_root() function driver core: bus: constify bus_unregister() driver core: bus: constify some internal functions driver core: bus: constify bus_get_kset() driver core: bus: constify bus_register/unregister_notifier() driver core: remove private pointer from struct bus_type ...	2023-02-24 12:58:55 -08:00
Jens Axboe	310726c33a	block: be a bit more careful in checking for NULL bdev while polling Wei reports a crash with an application using polled IO: PGD 14265e067 P4D 14265e067 PUD 47ec50067 PMD 0 Oops: 0000 [#1] SMP CPU: 0 PID: 21915 Comm: iocore_0 Kdump: loaded Tainted: G S 5.12.0-0_fbk12_clang_7346_g1bb6f2e7058f #1 Hardware name: Wiwynn Delta Lake MP T8/Delta Lake-Class2, BIOS Y3DLM08 04/10/2022 RIP: 0010:bio_poll+0x25/0x200 Code: 0f 1f 44 00 00 0f 1f 44 00 00 55 41 57 41 56 41 55 41 54 53 48 83 ec 28 65 48 8b 04 25 28 00 00 00 48 89 44 24 20 48 8b 47 08 <48> 8b 80 70 02 00 00 4c 8b 70 50 8b 6f 34 31 db 83 fd ff 75 25 65 RSP: 0018:ffffc90005fafdf8 EFLAGS: 00010292 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 74b43cd65dd66600 RDX: 0000000000000003 RSI: ffffc90005fafe78 RDI: ffff8884b614e140 RBP: ffff88849964df78 R08: 0000000000000000 R09: 0000000000000008 R10: 0000000000000000 R11: 0000000000000000 R12: ffff88849964df00 R13: ffffc90005fafe78 R14: ffff888137d3c378 R15: 0000000000000001 FS: 00007fd195000640(0000) GS:ffff88903f400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000270 CR3: 0000000466121001 CR4: 00000000007706f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: iocb_bio_iopoll+0x1d/0x30 io_do_iopoll+0xac/0x250 __se_sys_io_uring_enter+0x3c5/0x5a0 ? 
__x64_sys_write+0x89/0xd0 do_syscall_64+0x2d/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x94f225d Code: 24 cc 00 00 00 41 8b 84 24 d0 00 00 00 c1 e0 04 83 e0 10 41 09 c2 8b 33 8b 53 04 4c 8b 43 18 4c 63 4b 0c b8 aa 01 00 00 0f 05 <85> c0 0f 88 85 00 00 00 29 03 45 84 f6 0f 84 88 00 00 00 41 f6 c7 RSP: 002b:00007fd194ffcd88 EFLAGS: 00000202 ORIG_RAX: 00000000000001aa RAX: ffffffffffffffda RBX: 00007fd194ffcdc0 RCX: 00000000094f225d RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000007 RBP: 00007fd194ffcdb0 R08: 0000000000000000 R09: 0000000000000008 R10: 0000000000000001 R11: 0000000000000202 R12: 00007fd269d68030 R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000000 which is due to bio->bi_bdev being NULL. This can happen if we have two tasks doing polled IO, and task B ends up completing IO from task A if they are sharing a poll queue. If task B completes the IO and puts the bio into our cache, then it can allocate that bio again before task A is done polling for it. As that would necessitate a preempt between the two tasks, it's enough to just be a bit more careful in checking for whether or not bio->bi_bdev is NULL. Reported-and-tested-by: Wei Zhang <wzhang@meta.com> Cc: stable@vger.kernel.org Fixes: `be4d234d7a` ("bio: add allocation cache abstraction") Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-24 13:19:59 -07:00
Jens Axboe	11eb695feb	block: clear bio->bi_bdev when putting a bio back in the cache This isn't strictly needed in terms of correctness, but it does allow polling to know if the bio has been put already by a different task and hence avoid polling something that we don't need to. Cc: stable@vger.kernel.org Fixes: `be4d234d7a` ("bio: add allocation cache abstraction") Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-24 13:19:56 -07:00
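The two polling commits above work as a pair: the put path clears the device pointer, and the poll path reads that pointer once and backs off if it is NULL. A minimal userspace sketch of that idea follows. The struct and function names are hypothetical, and C11 atomics are used here in place of whatever primitives the kernel code relies on.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins for a block device and a bio. */
struct bdev_poll_sketch { int id; };
struct bio_poll_sketch  { _Atomic(struct bdev_poll_sketch *) bdev; };

/* Put path: clear the pointer so a concurrent poller can see the bio is gone. */
static void bio_put_to_cache_sketch(struct bio_poll_sketch *bio)
{
    atomic_store(&bio->bdev, (struct bdev_poll_sketch *)NULL);
}

/*
 * Poll path: load the pointer exactly once and test that local copy, so the
 * value that is checked is the same value that is dereferenced.
 * Returns 1 if the bio was polled, 0 if it was skipped.
 */
static int bio_poll_sketch(struct bio_poll_sketch *bio)
{
    struct bdev_poll_sketch *bdev = atomic_load(&bio->bdev);

    if (bdev == NULL)
        return 0;   /* already completed and recycled by another task */
    (void)bdev->id; /* stands in for reaching the queue through bdev */
    return 1;
}
```

Testing a local copy is the point of the sketch: checking the field and then reading it again would leave a window in which another task could clear it.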
Linus Torvalds	3822a7c409	Merge tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Daniel Verkamp has contributed a memfd series ("mm/memfd: add F_SEAL_EXEC") which permits the setting of the memfd execute bit at memfd creation time, with the option of sealing the state of the X bit. - Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare") which addresses a rare race condition related to PMD unsharing. - Several folioification patch serieses from Matthew Wilcox, Vishal Moola, Sidhartha Kumar and Lorenzo Stoakes - Johannes Weiner has a series ("mm: push down lock_page_memcg()") which does perform some memcg maintenance and cleanup work. - SeongJae Park has added DAMOS filtering to DAMON, with the series "mm/damon/core: implement damos filter". These filters provide users with finer-grained control over DAMOS's actions. SeongJae has also done some DAMON cleanup work. - Kairui Song adds a series ("Clean up and fixes for swap"). - Vernon Yang contributed the series "Clean up and refinement for maple tree". - Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series. 
It adds to MGLRU an LRU of memcgs, to improve the scalability of global reclaim. - David Hildenbrand has added some userfaultfd cleanup work in the series "mm: uffd-wp + change_protection() cleanups". - Christoph Hellwig has removed the generic_writepages() library function in the series "remove generic_writepages". - Baolin Wang has performed some maintenance on the compaction code in his series "Some small improvements for compaction". - Sidhartha Kumar is doing some maintenance work on struct page in his series "Get rid of tail page fields". - David Hildenbrand contributed some cleanup, bugfixing and generalization of pte management and of pte debugging in his series "mm: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with swap PTEs". - Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation flag in the series "Discard __GFP_ATOMIC". - Sergey Senozhatsky has improved zsmalloc's memory utilization with his series "zsmalloc: make zspage chain size configurable". - Joey Gouly has added prctl() support for prohibiting the creation of writeable+executable mappings. The previous BPF-based approach had shortcomings. See "mm: In-kernel support for memory-deny-write-execute (MDWE)". - Waiman Long did some kmemleak cleanup and bugfixing in the series "mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF". - T.J. Alumbaugh has contributed some MGLRU cleanup work in his series "mm: multi-gen LRU: improve". - Jiaqi Yan has provided some enhancements to our memory error statistics reporting, mainly by presenting the statistics on a per-node basis. See the series "Introduce per NUMA node memory error statistics". - Mel Gorman has a second and hopefully final shot at fixing a CPU-hog regression in compaction via his series "Fix excessive CPU usage during compaction". - Christoph Hellwig does some vmalloc maintenance work in the series "cleanup vfree and vunmap". 
- Christoph Hellwig has removed block_device_operations.rw_page() in this series "remove ->rw_page". - We get some maple_tree improvements and cleanups in Liam Howlett's series "VMA tree type safety and remove __vma_adjust()". - Suren Baghdasaryan has done some work on the maintainability of our vm_flags handling in the series "introduce vm_flags modifier functions". - Some pagemap cleanup and generalization work in Mike Rapoport's series "mm, arch: add generic implementation of pfn_valid() for FLATMEM" and "fixups for generic implementation of pfn_valid()" - Baoquan He has done some work to make /proc/vmallocinfo and /proc/kcore better represent the real state of things in his series "mm/vmalloc.c: allow vread() to read out vm_map_ram areas". - Jason Gunthorpe rationalized the GUP system's interface to the rest of the kernel in the series "Simplify the external interface for GUP". - SeongJae Park wishes to migrate people from DAMON's debugfs interface over to its sysfs interface. To support this, we'll temporarily be printing warnings when people use the debugfs interface. See the series "mm/damon: deprecate DAMON debugfs interface". - Andrey Konovalov provided the accurately named "lib/stackdepot: fixes and clean-ups" series. - Huang Ying has provided a dramatic reduction in migration's TLB flush IPI rates with the series "migrate_pages(): batch TLB flushing". - Arnd Bergmann has some objtool fixups in "objtool warning fixes". 
* tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (505 commits) include/linux/migrate.h: remove unneeded externs mm/memory_hotplug: cleanup return value handing in do_migrate_range() mm/uffd: fix comment in handling pte markers mm: change to return bool for isolate_movable_page() mm: hugetlb: change to return bool for isolate_hugetlb() mm: change to return bool for isolate_lru_page() mm: change to return bool for folio_isolate_lru() objtool: add UACCESS exceptions for __tsan_volatile_read/write kmsan: disable ftrace in kmsan core code kasan: mark addr_has_metadata __always_inline mm: memcontrol: rename memcg_kmem_enabled() sh: initialize max_mapnr m68k/nommu: add missing definition of ARCH_PFN_OFFSET mm: percpu: fix incorrect size in pcpu_obj_full_size() maple_tree: reduce stack usage with gcc-9 and earlier mm: page_alloc: call panic() when memoryless node allocation fails mm: multi-gen LRU: avoid futile retries migrate_pages: move THP/hugetlb migration support check to simplify code migrate_pages: batch flushing TLB migrate_pages: share more code between _unmap and _move ...	2023-02-23 17:09:35 -08:00
Linus Torvalds	307e14c039	46 fs/cifs (smb3 client) changesets, 37 in fs/cifs and 9 for related helper functions and cleanup outside from Dave Howells and Willy Merge tag '6.3-rc-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6 Pull cifs client updates from Steve French: "The largest subset of this is from David Howells et al: making the cifs/smb3 driver pass iov_iters down to the lowest layers, directly to the network transport rather than passing lists of pages around, helping multiple areas: - Pin user pages, thereby fixing the race between concurrent DIO read and fork, where the pages containing the DIO read buffer may end up belonging to the child process and not the parent - with the result that the parent might not see the retrieved data. - cifs shouldn't take refs on pages extracted from non-user-backed iterators (eg. KVEC). With these changes, cifs will apply the appropriate cleanup. - Making it easier to transition to using folios in cifs rather than pages by dealing with them through BVEC and XARRAY iterators. 
- Allowing cifs to use the new splice function The remainder are: - fixes for stable, including various fixes for uninitialized memory, wrong length field causing mount issue to very old servers, important directory lease fixes and reconnect fixes - cleanups (unused code removal, change one element array usage, and a change from strtobool to kstrtobool, and Kconfig cleanups) - SMBDIRECT (RDMA) fixes including iov_iter integration and UAF fixes - reconnect fixes - multichannel fixes, including improving channel allocation (to least used channel) - remove the last use of lock_page_killable by moving to folio_lock_killable" * tag '6.3-rc-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: (46 commits) update internal module version number for cifs.ko cifs: update ip_addr for ses only for primary chan setup cifs: use tcon allocation functions even for dummy tcon cifs: use the least loaded channel for sending requests cifs: DIO to/from KVEC-type iterators should now work cifs: Remove unused code cifs: Build the RDMA SGE list directly from an iterator cifs: Change the I/O paths to use an iterator rather than a page list cifs: Add a function to read into an iter from a socket cifs: Add some helper functions cifs: Add a function to Hash the contents of an iterator cifs: Add a function to build an RDMA SGE list from an iterator netfs: Add a function to extract an iterator into a scatterlist netfs: Add a function to extract a UBUF or IOVEC into a BVEC iterator cifs: Implement splice_read to pass down ITER_BVEC not ITER_PIPE splice: Export filemap/direct_splice_read() iov_iter: Add a function to extract a page list from an iterator iov_iter: Define flags to qualify page extraction. splice: Add a func to do a splice from an O_DIRECT file without ITER_PIPE splice: Add a func to do a splice from a buffered file without ITER_PIPE ...	2023-02-22 17:12:44 -08:00
Linus Torvalds	6861eaf791	Merge tag 'ata-6.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata Pull ATA updates from Damien Le Moal: - Small cleanup of the pata_octeon driver to drop a useless platform callback (Uwe) - Simplify ata_scsi_cmd_error_handler() code using the fact that ap->ops->error_handler is NULL most of the time (Wenchao) - Several patches improving libata error handling. 
This is in preparation for supporting the command duration limits (CDL) feature. The changes allow handling corner cases of ATA NCQ errors which do not happen with regular drives but will be triggered with CDL drives (Niklas) - Simplify the qc_fill_rtf operation (me) - Improve SCSI command translation for REPORT_SUPPORTED_OPERATION_CODES command (me) - Cleanup of libata FUA handling. This falls short of enabling FUA for ATA drives that support it by default as there were concerns that old drives would break. The series however fixes several issues with the FUA support to ensure that FUA is reported as being supported only for drives that can handle all possible write cases (NCQ and non-NCQ). A check in the block layer is also added to ensure that we never see read FUA commands (current behavior) (me) - Several patches to move the old PARIDE (parallel port IDE) driver to libata as pata_parport. Given that this driver also needs protocol modules, the driver code resides in its own pata_parport directory under drivers/ata (Ondrej) * tag 'ata-6.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata: ata: pata_parport: Fix ida_alloc return value error check drivers/block: Move PARIDE protocol modules to drivers/ata/pata_parport drivers/block: Remove PARIDE core and high-level protocols ata: pata_parport: add driver (PARIDE replacement) ata: libata: exclude FUA support for known buggy drives ata: libata: Fix FUA handling in ata_build_rw_tf() ata: libata: cleanup fua support detection ata: libata: Rename and cleanup ata_rwcmd_protocol() ata: libata: Introduce ata_ncq_supported() block: add a sanity check for non-write flush/fua bios ata: libata-scsi: improve ata_scsiop_maint_in() ata: libata-scsi: do not overwrite SCSI ML and status bytes ata: libata: move NCQ related ATA_DFLAGs ata: libata: respect successfully completed commands during errors ata: libata: read the shared status for successful NCQ commands once ata: libata: simplify qc_fill_rtf port 
operation interface ata: scsi: rename flag ATA_QCFLAG_FAILED to ATA_QCFLAG_EH ata: libata-eh: Cleanup ata_scsi_cmd_error_handler() ata: octeon: Drop empty platform remove function	2023-02-22 13:35:51 -08:00
Linus Torvalds	9e58df973d	Merge tag 'irq-core-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq updates from Thomas Gleixner: "Updates for the interrupt subsystem: Core: - Move the interrupt affinity spreading mechanism into 
lib/group_cpus so it can be used for similar spreading requirements, e.g. in the block multi-queue code This also contains a first usecase in the block multi-queue code which Jens asked to take along with the librarization - Improve irqdomain locking to close a number of race conditions which can be observed with massive parallel device driver probing - Enforce and document the semantics of disable_irq() which cannot be invoked safely from non-sleepable context - Move the IPI multiplexing code from the Apple AIC driver into the core, so it can be reused by RISCV Drivers: - Plug OF node refcounting leaks in various drivers - Correctly mark level triggered interrupts in the Broadcom L2 drivers - The usual small fixes and improvements - No new drivers for the record!" * tag 'irq-core-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (42 commits) irqchip/irq-bcm7120-l2: Set IRQ_LEVEL for level triggered interrupts irqchip/irq-brcmstb-l2: Set IRQ_LEVEL for level triggered interrupts irqdomain: Switch to per-domain locking irqchip/mvebu-odmi: Use irq_domain_create_hierarchy() irqchip/loongson-pch-msi: Use irq_domain_create_hierarchy() irqchip/gic-v3-mbi: Use irq_domain_create_hierarchy() irqchip/gic-v3-its: Use irq_domain_create_hierarchy() irqchip/gic-v2m: Use irq_domain_create_hierarchy() irqchip/alpine-msi: Use irq_domain_add_hierarchy() x86/uv: Use irq_domain_create_hierarchy() x86/ioapic: Use irq_domain_create_hierarchy() irqdomain: Clean up irq_domain_push/pop_irq() irqdomain: Drop leftover brackets irqdomain: Drop dead domain-name assignment irqdomain: Drop revmap mutex irqdomain: Fix domain registration race irqdomain: Fix mapping-creation race irqdomain: Refactor __irq_domain_alloc_irqs() irqdomain: Look for existing mapping only once irqdomain: Drop bogus fwspec-mapping error handling ...	2023-02-21 10:03:48 -08:00
Juhyung Park	9e0c7efa5e	block: remove more NULL checks after bdev_get_queue() bdev_get_queue() never returns NULL. Several commits [1][2] have been made before to remove such superfluous checks, but some still remained. For places where bdev_get_queue() is called solely for NULL checks, it is removed entirely. [1] commit `ec9fd2a13d` ("blk-lib: don't check bdev_get_queue() NULL check") [2] commit `fea127b36c` ("block: remove superfluous check for request queue in bdev_is_zoned()") Signed-off-by: Juhyung Park <qkrwngud825@gmail.com> Reviewed-by: Pankaj Raghav <p.raghav@samsung.com> Link: https://lore.kernel.org/r/20230203024029.48260-1-qkrwngud825@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-21 09:23:22 -07:00
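The cleanup in this commit rests on an invariant: the accessor always returns a valid queue, so callers need no NULL branch. A small sketch of the resulting shape is shown below, using hypothetical types and names rather than the kernel's; the assert only documents the assumed invariant.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for a request queue and a block device. */
struct queue_sketch { bool zoned; };
struct blkdev_sketch {
    struct queue_sketch *queue; /* assumed set at creation and never NULL */
};

/* Accessor with a "never returns NULL" contract. */
static struct queue_sketch *get_queue_sketch(const struct blkdev_sketch *bdev)
{
    assert(bdev != NULL && bdev->queue != NULL); /* documents the invariant */
    return bdev->queue;
}

/* Caller after the cleanup: no "if (!q) return false;" branch is needed. */
static bool blkdev_is_zoned_sketch(const struct blkdev_sketch *bdev)
{
    return get_queue_sketch(bdev)->zoned;
}
```

Where the accessor had been called only to test the result against NULL, the commit message says the call is dropped altogether.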
David Howells	f62e52d127	iov_iter: Define flags to qualify page extraction. Define flags to qualify page extraction to pass into iov_iter_*pages*() rather than passing in FOLL_* flags. For now only a flag to allow peer-to-peer DMA is supported. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> cc: Al Viro <viro@zeniv.linux.org.uk> cc: Logan Gunthorpe <logang@deltatee.com> cc: linux-fsdevel@vger.kernel.org cc: linux-block@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com>	2023-02-20 17:25:43 -06:00
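The design in this commit, a dedicated flag type for page extraction that is translated into lower-level flags internally, can be illustrated as follows. This is a sketch under assumptions: the typedef, both macro names, and the flag values are hypothetical and are not taken from the kernel headers.

```c
#include <assert.h>

/* Hypothetical dedicated type for extraction flags. */
typedef unsigned int extraction_flags_sketch_t;

/* The only extraction flag in this sketch: allow peer-to-peer DMA pages. */
#define EXTRACT_ALLOW_P2PDMA_SKETCH ((extraction_flags_sketch_t)0x01)

/* Hypothetical internal page-pinning flag that callers no longer pass in. */
#define PIN_PCI_P2PDMA_SKETCH 0x100000u

/* Translate the caller-facing extraction flags into internal pinning flags. */
static unsigned int to_pin_flags_sketch(extraction_flags_sketch_t flags)
{
    unsigned int pin_flags = 0;

    if (flags & EXTRACT_ALLOW_P2PDMA_SKETCH)
        pin_flags |= PIN_PCI_P2PDMA_SKETCH;
    return pin_flags;
}
```

Keeping the caller-facing flags in their own small namespace means the iterator API does not expose the full set of internal pinning flags.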
Linus Torvalds	5b0ed59649	for-6.3/block-2023-02-16 Merge tag 'for-6.3/block-2023-02-16' of git://git.kernel.dk/linux Pull block updates from Jens Axboe: - NVMe updates via Christoph: - Small improvements to the logging functionality (Amit Engel) - Authentication cleanups (Hannes Reinecke) - Cleanup and optimize the DMA mapping code in the PCIe driver (Keith Busch) - Work around the command effects for Format NVM (Keith Busch) - Misc cleanups (Keith Busch, Christoph Hellwig) - Fix and cleanup freeing single sgl (Keith Busch) - MD updates via Song: - Fix a rare crash during the takeover process - Don't update recovery_cp when curr_resync is ACTIVE - Free writes_pending in md_stop - Change active_io to percpu - Updates to drbd, inching us closer to unifying the out-of-tree driver with the in-tree one (Andreas, Christoph, Lars, Robert) - BFQ update adding support for multi-actuator drives (Paolo, Federico, Davide) - Make brd compliant with REQ_NOWAIT (me) - Fix for IOPOLL and queue entering, fixing stalled IO waiting on timeouts (me) - Fix for REQ_NOWAIT with multiple bios (me) - Fix memory leak in blktrace cleanup (Greg) - Clean 
up sbitmap and fix a potential hang (Kemeng) - Clean up some bits in BFQ, and fix a bug in the request injection (Kemeng) - Clean up the request allocation and issue code, and fix some bugs related to that (Kemeng) - ublk updates and fixes: - Add support for unprivileged ublk (Ming) - Improve device deletion handling (Ming) - Misc (Liu, Ziyang) - s390 dasd fixes (Alexander, Qiheng) - Improve utility of request caching and fixes (Anuj, Xiao) - zoned cleanups (Pankaj) - More constification for kobjs (Thomas) - blk-iocost cleanups (Yu) - Remove bio splitting from drivers that don't need it (Christoph) - Switch blk-cgroups to use struct gendisk. Some of this is now incomplete as select late reverts were done. (Christoph) - Add bvec initialization helpers, and convert callers to use that rather than open-coding it (Christoph) - Misc fixes and cleanups (Jinke, Keith, Arnd, Bart, Li, Martin, Matthew, Ulf, Zhong) * tag 'for-6.3/block-2023-02-16' of git://git.kernel.dk/linux: (169 commits) brd: use radix_tree_maybe_preload instead of radix_tree_preload block: use proper return value from bio_failfast() block: bio-integrity: Copy flags when bio_integrity_payload is cloned block: Fix io statistics for cgroup in throttle path brd: mark as nowait compatible brd: check for REQ_NOWAIT and set correct page allocation mask brd: return 0/-error from brd_insert_page() block: sync mixed merged request's failfast with 1st bio's Revert "blk-cgroup: pin the gendisk in struct blkcg_gq" Revert "blk-cgroup: pass a gendisk to blkg_lookup" Revert "blk-cgroup: delay blk-cgroup initialization until add_disk" Revert "blk-cgroup: delay calling blkcg_exit_disk until disk_release" Revert "blk-cgroup: move the cgroup information to struct gendisk" nvme-pci: remove iod use_sgls nvme-pci: fix freeing single sgl block: ublk: check IO buffer based on flag need_get_data s390/dasd: Fix potential memleak in dasd_eckd_init() s390/dasd: sort out physical vs virtual pointers usage block: Remove the 
ALLOC_CACHE_SLACK constant block: make kobj_type structures constant ...	2023-02-20 14:27:21 -08:00
Linus Torvalds	c1ef500307	for-6.3/iter-ubuf-2023-02-16 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmPueOUQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpkEWD/9hOagNSeXfCd1eAJ44E5IemgHKqfU0RXRs kdW1o35eBXwPVAyhhDmcz60hkijm47Pw3IJUdSNaGqdm9uYpLwiatuYY5EOVC4qg BFkVPGCA8ERXStFM/mnWj0gkYDmb/8bzk9bdBU1FQvQOIQgYpomlHdMVfQJ+0tDT 7VTffRaWfcxWd1u+NBMDxmfz47teplxiHJDg38wGlgT6G1kMdEUK+y6hd0SoASPM ocMW8LL2v3wLQhQAOWYd6sw2kFnxx4VOzhSepPAY0U78CR6CYm6zthRd+k+Ro/nt RFKL6Ijt2LRaOZqY3HRnCpUwmhBNft0ZFH4OHh21vPaukB4sjWbQ5SJniucNcoCN rb9jAJDJdS6oy+Uimeig99aQ/yGSLJXG8MQKrC36NdGSwydUfaCLaoLKwfC8zYDC Zr3G7tfOhSJQzQtNSH1H0SqHFvMfc7C2Ra8mYXdHbcREswKOTT73aJUHq5RFfwO+ m10V5rQgCB9rJz0NLbo68GhxDrbTQuueDj+yDWCSoulUdNg3s2BZ3/iBjODJyJNO P3aG4bMYxC5te2JWCBnmR6du//8vnvDHnwWh9yKcUk+l/9OTtAPouAdUCv+r1wkz Ib0aEX3SiJ65LIePQO2kbdvgnweyFCJYduvMW9zjsH9GMgRP0eA6EKZh3mbKhOw4 yw9BcZoNYQ== =+ImB -----END PGP SIGNATURE----- Merge tag 'for-6.3/iter-ubuf-2023-02-16' of git://git.kernel.dk/linux Pull io_uring ITER_UBUF conversion from Jens Axboe: "Since we now have ITER_UBUF available, switch to using it for single ranges as it's more efficient than ITER_IOVEC for that" * tag 'for-6.3/iter-ubuf-2023-02-16' of git://git.kernel.dk/linux: block: use iter_ubuf for single range iov_iter: move iter_ubuf check inside restore WARN io_uring: use iter_ubuf for single range imports io_uring: switch network send/recv to ITER_UBUF iov: add import_ubuf()	2023-02-20 14:03:57 -08:00
Thomas Gleixner	6f3ee0e22b	irqchip updates for 6.3 - New and improved irqdomain locking, closing a number of races that became apparent now that we are able to probe drivers in parallel - A bunch of OF node refcounting bugs have been fixed - We now have a new IPI mux, lifted from the Apple AIC code and made common. It is expected that riscv will eventually benefit from it - Two small fixes for the Broadcom L2 drivers - Various cleanups and minor bug fixes -----BEGIN PGP SIGNATURE----- iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAmPw4OgPHG1hekBrZXJu ZWwub3JnAAoJECPQ0LrRPXpDYVgP/iVFxCPs+DCWUYvyTC8rvNzOj51COHUV/7yD mY5BTIjH3yTQPDhQmFvITCAjKaMYc3eDLml/nF4tTCU0MFig+KsRsWNIEFXtSsI0 wO+S19QhHzj5odUok5IDC+cNTXScp2HV+vFoOhhf0zDzXqwVxRr7lO5i+n37ELMp Mm9g2+EeUt43xTQxzbmNn5Kkpq9PMEnQFU2UkvJleg+KCgzSYThcR8/KUDKySZpk TP+mcR5PevcqGhLt7vYS2lGh8Ye1warzp54C7Je8P8Txg3BM8xBynT1d3fgrlKfm AOAPVW3PV6bPhgVYXZJopH3ykfmYM4ZiIvhRcgLyf6tbZAU6Twpiq823TAOVHyPI SRcW8dehuvgq1VJIpRGZOSB2qIvFrqLhl0B1CtT04gFWJW9bSa2n5Y1h4Gcqy29o SLJiKscx2KqvPmQqarLUUnuOZ5hhIrtYhkhhJuuwqZqzS1Kkz/mSB1MkPQEGxJi1 MpoTfbQ/0KTYXCqqgs/GBnDJ0mYrcvtBoGP7bjnVYnXpANP2bs+ZpQVPVq+17uuQ k0gjxe8iENqXjW6JMlFX5K3dxG5ygXjfECMWsCJ+JdCtJdaIL8I46X/u7wHU2mfY bohhb7xS2+HIPxz6w8aRu3IQG00mMv06vCYPBbPh+W0dUtocdM3U2kpe5gPYm1iz kWx3WLaM =ONcj -----END PGP SIGNATURE----- Merge tag 'irqchip-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms into irq/core Pull irqchip updates from Marc Zyngier: - New and improved irqdomain locking, closing a number of races that became apparent now that we are able to probe drivers in parallel - A bunch of OF node refcounting bugs have been fixed - We now have a new IPI mux, lifted from the Apple AIC code and made common. It is expected that riscv will eventually benefit from it - Two small fixes for the Broadcom L2 drivers - Various cleanups and minor bug fixes Link: https://lore.kernel.org/r/20230218143452.3817627-1-maz@kernel.org	2023-02-19 00:07:56 +01:00
Yu Kuai	e5cfefa97b	block: fix scan partition for exclusively open device again As explained in commit `36369f46e9` ("block: Do not reread partition table on exclusively open device"), rereading the partition table on a device that is exclusively opened by someone else is problematic. This patch makes sure the partition scan only proceeds if the current thread has opened the device exclusively, or the device is not opened exclusively; in the latter case, other scanners and exclusive openers are blocked temporarily until the partition scan is done. Fixes: `10c70d95c0` ("block: remove the bd_openers checks in blk_drop_partitions") Cc: <stable@vger.kernel.org> Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230217022200.3092987-3-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-17 06:15:57 -07:00
Yu Kuai	0f77b29ad1	block: Revert "block: Do not reread partition table on exclusively open device" This reverts commit `36369f46e9`. That patch can't fix the problem in the corner case where the device is opened exclusively after the check and before blkdev_get_by_dev(). We'll use a new solution to fix the problem in the next patch, and the new solution doesn't need to change APIs. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230217022200.3092987-2-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-17 06:15:57 -07:00
Luca Boccassi	9ec041ea40	sed-opal: add support flag for SUM in status ioctl Not every OPAL drive supports SUM (Single User Mode), so report this information to userspace via the get-status ioctl so that we can adjust the formatting options accordingly. Tested on a kingston drive (which supports it) and a samsung one (which does not). Signed-off-by: Luca Boccassi <bluca@debian.org> Link: https://lore.kernel.org/r/20230210010612.28729-1-luca.boccassi@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-17 06:15:53 -07:00
Jens Axboe	f3ca738624	block: use proper return value from bio_failfast() kernel test robot complains about a type mismatch: block/blk-merge.c:984:42: sparse: expected restricted blk_opf_t const [usertype] ff block/blk-merge.c:984:42: sparse: got unsigned int block/blk-merge.c:1010:42: sparse: sparse: incorrect type in initializer (different base types) @@ expected restricted blk_opf_t const [usertype] ff @@ got unsigned int @@ block/blk-merge.c:1010:42: sparse: expected restricted blk_opf_t const [usertype] ff block/blk-merge.c:1010:42: sparse: got unsigned int because bio_failfast() returns an unsigned int rather than the appropriate blk_opf_t type. Fix it up. Fixes: `3ce6a11598` ("block: sync mixed merged request's failfast with 1st bio's") Reported-by: kernel test robot <lkp@intel.com> Link: https://lore.kernel.org/oe-kbuild-all/202302170743.GXypM9Rt-lkp@intel.com/ Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-16 19:39:15 -07:00
Martin K. Petersen	b6a4bdcda4	block: bio-integrity: Copy flags when bio_integrity_payload is cloned Make sure to copy the flags when a bio_integrity_payload is cloned. Otherwise per-I/O properties such as IP checksum flag will not be passed down to the HBA driver. Since the integrity buffer is owned by the original bio, the BIP_BLOCK_INTEGRITY flag needs to be masked off to avoid a double free in the completion path. Fixes: `aae7df5019` ("block: Integrity checksum flag") Fixes: `b1f0138857` ("block: Relocate bio integrity flags") Reported-by: Saurav Kashyap <skashyap@marvell.com> Tested-by: Saurav Kashyap <skashyap@marvell.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20230215171801.21062-1-martin.petersen@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-16 11:05:41 -07:00
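The flag handling described in the bio-integrity entry above can be sketched in plain C. This is only an illustration: `struct bip_sketch`, `bip_clone_flags()` and the flag values are hypothetical stand-ins, not the kernel's definitions. It shows the clone inheriting every per-I/O flag except the one that marks ownership of the integrity buffer.

```c
#include <assert.h>

/* Hypothetical flag values; the kernel's BIP_* bits are defined elsewhere. */
#define BIP_BLOCK_INTEGRITY (1u << 0) /* payload owns the integrity buffer */
#define BIP_IP_CHECKSUM     (1u << 1) /* per-I/O property to pass down */

/* Hypothetical, reduced stand-in for struct bio_integrity_payload. */
struct bip_sketch {
	unsigned int flags;
};

/*
 * Copy the flags to the clone, but mask off the ownership bit so that
 * only the original payload frees the integrity buffer on completion.
 */
static inline void bip_clone_flags(struct bip_sketch *dst,
				   const struct bip_sketch *src)
{
	dst->flags = src->flags & ~BIP_BLOCK_INTEGRITY;
}
```

With this shape, a clone of a payload carrying both bits keeps the checksum flag and drops the ownership flag.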
Jinke Han	0f7c8f0f79	block: Fix io statistics for cgroup in throttle path In the current code, io statistics are missing for the cgroup when a bio is throttled by blk-throttle. Fix it by moving the code that is not reached in that case to submit_bio_noacct_nocheck. Fixes: `3f98c75371` ("block: don't check bio in blk_throtl_dispatch_work_fn") Signed-off-by: Jinke Han <hanjinke.666@bytedance.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Acked-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230216032250.74230-1-hanjinke.666@bytedance.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-16 11:04:11 -07:00
Ming Lei	3ce6a11598	block: sync mixed merged request's failfast with 1st bio's We support mixed merge for requests/bios with different failfast settings. When a request fails, each time we only handle the portion with the same failfast setting, so bios with failfast can be failed immediately, and bios without failfast can be retried. The idea is pretty good, but the current implementation has several defects: 1) initially an RA bio doesn't set failfast, however the bio merge code doesn't consider this point, and just checks its failfast setting when deciding if a mixed merge is required. Fix this issue by adding the helper bio_failfast(). 2) when merging a bio to the request front, if this request is mixed merged, we have to sync the request's failfast setting with the 1st bio's failfast. Fix it by calling blk_update_mixed_merge(). 3) when merging a bio to the request back, if this request is mixed merged, we have to mark the bio as failfast, because blk_update_request simply updates the request failfast with the 1st bio's failfast. Fix it by calling blk_update_mixed_merge(). Fixes one normal EXT4 READ IO failure issue, because it is observed that the normal READ IO is merged with RA IO, and the mixed merged request has a different failfast setting from the 1st bio's, so finally the normal READ IO doesn't get retried. Cc: Tejun Heo <tj@kernel.org> Fixes: `80a761fd33` ("block: implement mixed merge of different failfast requests") Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230209125527.667004-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-16 07:49:27 -07:00
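Point 1) of the entry above can be illustrated with a small userspace sketch. It assumes that the helper treats a read-ahead bio as if all failfast bits were set, which the message implies but does not spell out. `struct bio_sketch`, `bio_failfast_sketch()`, `needs_mixed_merge()` and the flag values are hypothetical names chosen for this example, not the kernel's code.

```c
#include <assert.h>

/* Hypothetical flag values; the kernel's REQ_* bits differ. */
#define REQ_FAILFAST_DEV       (1u << 0)
#define REQ_FAILFAST_TRANSPORT (1u << 1)
#define REQ_FAILFAST_DRIVER    (1u << 2)
#define REQ_FAILFAST_MASK \
	(REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER)
#define REQ_RAHEAD             (1u << 3)

/* Hypothetical, reduced stand-in for struct bio. */
struct bio_sketch {
	unsigned int opf;
};

/* Effective failfast bits: read-ahead counts as fully failfast (assumption). */
static inline unsigned int bio_failfast_sketch(const struct bio_sketch *bio)
{
	if (bio->opf & REQ_RAHEAD)
		return REQ_FAILFAST_MASK;
	return bio->opf & REQ_FAILFAST_MASK;
}

/* A mixed merge is needed when the effective failfast settings differ. */
static inline int needs_mixed_merge(const struct bio_sketch *a,
				    const struct bio_sketch *b)
{
	return bio_failfast_sketch(a) != bio_failfast_sketch(b);
}
```

Comparing the effective bits rather than the raw ones is what lets a read-ahead bio and a normal read be recognised as a mixed pair, so that the normal read can still be retried.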
Christoph Hellwig	fd8f8ede23	block: export bio_split_rw bio_split_rw can be used by file systems to split an incoming write bio into multiple bios fitting the hardware limit for use as ZONE_APPEND bios. Export it for initial use in btrfs. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-02-15 19:38:50 +01:00
Christoph Hellwig	a06377c5d0	Revert "blk-cgroup: pin the gendisk in struct blkcg_gq" This reverts commit `84d7d462b1`. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230214183308.1658775-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-14 14:24:09 -07:00
Christoph Hellwig	9a9c261e6b	Revert "blk-cgroup: pass a gendisk to blkg_lookup" This reverts commit 821e840c08ad83736eced4037cdad864e95e2584. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230214183308.1658775-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-14 14:24:09 -07:00
Christoph Hellwig	b6553bef8c	Revert "blk-cgroup: delay blk-cgroup initialization until add_disk" This reverts commit `178fa7d498`. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230214183308.1658775-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-14 14:24:09 -07:00
Christoph Hellwig	b4e94f9c2c	Revert "blk-cgroup: delay calling blkcg_exit_disk until disk_release" This reverts commit `c43332fe02` as it is not needed without moving to disk references in the blkg. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230214183308.1658775-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-14 14:24:09 -07:00
Christoph Hellwig	1231039db3	Revert "blk-cgroup: move the cgroup information to struct gendisk" This reverts commit `3f13ab7c80` as a patch it depends on caused a few problems. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230214183308.1658775-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-14 14:24:09 -07:00
Bart Van Assche	9af9935494	block: Remove the ALLOC_CACHE_SLACK constant Commit `b99182c501` ("bio: add pcpu caching for non-polling bio_put") removed the code that uses this constant. Hence also remove the constant itself. Cc: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20230209230135.3475829-1-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-09 17:03:36 -07:00
Thomas Weißschuh	5f6224175f	block: make kobj_type structures constant Since commit `ee6d3dd4ed` ("driver core: make kobj_type constant.") the driver core allows the usage of const struct kobj_type. Take advantage of this to constify the structure definitions to prevent modification at runtime. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Link: https://lore.kernel.org/r/20230208-kobj_type-block-v1-1-0b3eafd7d983@weissschuh.net Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-09 09:38:16 -07:00
Xiao Ni	23f3e3272e	block: Merge bio before checking ->cached_rq The code checks whether plug->cached_rq is empty before merging a bio. But the merge action has no relationship with plug->cached_rq; it tries to merge the bio with requests within plug->mq_list. If ->cached_rq is empty, the merge chance is missed. So move the merge function before the ->cached_rq check. Signed-off-by: Xiao Ni <xni@redhat.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230209031930.27354-1-xni@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-09 08:11:25 -07:00
Christoph Hellwig	dcb5220143	Revert "blk-cgroup: simplify blkg freeing from initialization failure paths" It turns out this was too soon. blkg_conf_prep does too funky locking games with the queue lock for this to work properly. This reverts commit `27b642b07a`. Reported-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230209053523.437927-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-09 08:11:11 -07:00
Christoph Hellwig	c43332fe02	blk-cgroup: delay calling blkcg_exit_disk until disk_release While del_gendisk ensures there is no outstanding I/O on the queue, it can't prevent block layer users from building new I/O. This leads to a NULL ->root_blkg reference in bio_associate_blkg when allocating a new bio on a shut down file system. Delay freeing the blk-cgroup subsystems from del_gendisk until disk_release to make sure the blkg and throttle information is still available for bio submitters, even if those bios will immediately fail. This now can cause a case where disk_release is called on a disk that hasn't been added. That's mostly harmless, except for a case in blk_throttl_exit that now needs to check for a NULL ->td pointer. Fixes: `178fa7d498` ("blk-cgroup: delay blk-cgroup initialization until add_disk") Reported-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230208063514.171485-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-09 08:10:45 -07:00
Yu Kuai	f37bf75ca7	block, bfq: cleanup 'bfqg->online' After commit `dfd6200a09` ("blk-cgroup: support to track if policy is online"), there is no need to do this again in bfq. However, 'pd->online' is not protected by 'bfqd->lock'. To make sure bfq won't see that 'pd->online' is still set after bfq_pd_offline(), clear it before bfq_pd_offline() is called. This is fine because other policies don't use 'pd->online' and bfq_pd_offline() will move active bfqq to root cgroup anyway. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230202134913.2364549-1-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-07 10:20:59 -07:00
Kemeng Shi	01542f651a	blk-mq: correct stale comment of .get_budget Commit `88022d7201` ("blk-mq: don't handle failure in .get_budget") remove BLK_STS_RESOURCE return value and we only check if we can get the budget from .get_budget() now. Correct stale comment that ".get_budget() returns BLK_STS_NO_RESOURCE" to ".get_budget() fails to get the budget". Fixes: `88022d7201` ("blk-mq: don't handle failure in .get_budget") Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-06 09:22:29 -07:00
Kemeng Shi	27e8b2bb14	blk-mq: use switch/case to improve readability in blk_mq_try_issue_list_directly Use switch/case to handle errors, as other functions do, to improve readability in blk_mq_try_issue_list_directly. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-06 09:22:29 -07:00
Kemeng Shi	f1ce99f709	blk-mq: remove set of bd->last when get driver tag for next request fails Commit `113285b473` ("blk-mq: ensure that bd->last is always set correctly") will set last if we failed to get driver tag for next request to avoid flush miss as we break the list walk and will not send the last request in the list which will be sent with last set normally. This code seems stale now because the flush introduced is always redundant as: For case tag is really out, we will send an extra flush if we find list is not empty after list walk. For case some tag is freed before retry in blk_mq_prep_dispatch_rq for next, then we can get a tag for next request in retry and flush notified already is not necessary. Just remove this stale code. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-06 09:22:29 -07:00
Kemeng Shi	4ea58fe456	blk-mq: remove unnecessary error count and check in blk_mq_dispatch_rq_list blk_mq_dispatch_rq_list will notify if hctx is busy in return bool. It will return true if we are not busy and can handle more and return false on the opposite. Inside blk_mq_dispatch_rq_list, errors is only used if list is empty and we will return true if list is empty and (errors + queued) != 0. There are three types of status returned from request: -busy error BLK_STS*_RESOURCE: the failed request will be added back to list and list will not be empty. -BLK_STS_OK: We count queued for BLK_STS_OK -rest error: We count errors for rest error If list is empty, there is no request gets busy error then (errors + queued) will be total requests in the list which is checked not empty at beginning of blk_mq_dispatch_rq_list. So (errors + queued) != 0 is always met if list is empty. Then the (errors + queued) != 0 check and errors number count is not needed. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-06 09:22:28 -07:00
Kemeng Shi	e4ef2e05e0	blk-mq: simplify flush check in blk_mq_dispatch_rq_list 1. Remove check of needs_resource and ret == BLK_STS_DEV_RESOURCE. For busy error BLK_STS*_RESOURCE, request will always be added back to list, so need_resource will not be true and ret will not be == BLK_STS_DEV_RESOURCE if list is empty. We can remove these dead checks. 2. Check ret of last request instead of errors. If list is empty, we only need to explicitly commit_rqs if an error happens at the last request, which is stored in ret. So check ret of the last request instead of errors to remove unnecessary commit_rqs triggered by errors returned from previous requests. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-06 09:22:28 -07:00
Kemeng Shi	984ce0a7d7	blk-mq: use blk_mq_commit_rqs helper in blk_mq_try_issue_list_directly Call blk_mq_commit_rqs instead of access ->commit_rqs directly. As you can see in comment of blk_mq_commit_rqs, we only need explicitly call this in two cases: -did not queue everything initially scheduled to queue -the last attempt to queue a request failed Both cases can be checked with ret of last request which breaks list walk. Then we can remove unnecessary error count and unnecessary commit triggered by error besides cases described above. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-06 09:22:28 -07:00
Kemeng Shi	0d617a83e8	blk-mq: remove unncessary error count and commit in blk_mq_plug_issue_direct We need only to explicitly commit in two error cases: -did not queue everything initially scheduled to queue -the last attempt to queue a request failed (see comment of blk_mq_commit_rqs for more details). Both cases can be checked with ret of last request which breaks list walk. Remove unnecessary error count and unnecessary commit triggered by error which is not covered by cases described above. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-06 09:22:28 -07:00
Kemeng Shi	34c9f54740	blk-mq: make blk_mq_commit_rqs a general function for all commits 1. move blk_mq_commit_rqs forward before functions need commits. 2. add queued check and only commits request if any request was queued in blk_mq_commit_rqs to keep commit behavior consistent and remove unnecessary commit. 3. split the queued clearing from blk_mq_plug_commit_rqs as it is not wanted general. 4. sync current caller of blk_mq_commit_rqs with new general blk_mq_commit_rqs. 5. document rule for unusual cases which need explicit commit_rqs. Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-06 09:22:28 -07:00
Kemeng Shi	3e368fb023	blk-mq: remove unncessary from_schedule parameter in blk_mq_plug_issue_direct Function blk_mq_plug_issue_direct tries to issue batch requests in plug list to driver directly. We will only issue plug request to driver if we are not from scheduler, so from_scheduler parameter of blk_mq_plug_issue_direct is always false. Remove unnecessary from_scheduler of blk_mq_plug_issue_direct. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-06 09:22:28 -07:00
Kemeng Shi	08e3599e74	blk-mq: remove unnecessary list_empty check in blk_mq_try_issue_list_directly We only break the list walk if we get 'BLK_STS_RESOURCE'. We also count errors for 'BLK_STS_RESOURCE' error. If list is not empty, errors will always be non-zero. So we can remove unnecessary list_empty check. This will remove redundant list_empty check for case that error happened at sending last request in list. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-06 09:22:28 -07:00
Kemeng Shi	47df9ce95c	blk-mq: Fix potential io hung for shared sbitmap per tagset Commit `f906a6a0f4` ("blk-mq: improve tag waiting setup for non-shared tags") marks restart for unshared tags as an improvement. At that time, tags were only shared between queues, and we could check whether tags are shared by testing BLK_MQ_F_TAG_SHARED. Afterwards, commit `32bc15afed` ("blk-mq: Facilitate a shared sbitmap per tagset") enabled tag sharing between hctxs inside a queue. We only mark restart for shared hctxs inside a queue, which may cause an I/O hang if no tag is currently allocated by the hctxs about to be marked restart. Wait on the sbitmap_queue instead of marking restart in the shared hctxs case to fix this. Fixes: `32bc15afed` ("blk-mq: Facilitate a shared sbitmap per tagset") Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-06 09:22:28 -07:00
Kemeng Shi	98b99e9412	blk-mq: wait on correct sbitmap_queue in blk_mq_mark_tag_wait In the shared queues case, we only wait on bitmap_tags if we fail to get a driver tag. However, rq could be from breserved_tags, in which case two problems occur: 1. an I/O hang if no tag is currently allocated from bitmap_tags. 2. an unnecessary wakeup when a tag is freed to bitmap_tags while no tag is freed to breserved_tags. Wait on the bitmap that rq came from to fix this. Fixes: `f906a6a0f4` ("blk-mq: improve tag waiting setup for non-shared tags") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-06 09:22:28 -07:00
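The selection rule in the entry above (wait on the bitmap the request's tag came from) can be sketched as follows. The structures and the `wait_queue_for()` helper are hypothetical and greatly simplified; only the field names `bitmap_tags` and `breserved_tags` are taken from the message.

```c
#include <assert.h>

/* Hypothetical stand-in for struct sbitmap_queue. */
struct sbq_sketch {
	int waiters;
};

/* Hypothetical stand-in for the per-hctx tag set with its two bitmaps. */
struct tags_sketch {
	struct sbq_sketch bitmap_tags;    /* normal tags */
	struct sbq_sketch breserved_tags; /* reserved tags */
};

/*
 * Pick the queue to wait on based on where the request's tag comes from,
 * instead of always waiting on bitmap_tags.
 */
static inline struct sbq_sketch *wait_queue_for(struct tags_sketch *tags,
						int rq_is_reserved)
{
	return rq_is_reserved ? &tags->breserved_tags : &tags->bitmap_tags;
}
```

Waiting on the matching bitmap means a reserved request is woken when a reserved tag is freed, rather than depending on activity in the normal tag pool.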
Kemeng Shi	c31e76bcc3	blk-mq: remove stale comment for blk_mq_sched_mark_restart_hctx Commit `97889f9ac2` ("blk-mq: remove synchronize_rcu() from blk_mq_del_queue_tag_set()") remove handle of TAG_SHARED in restart, then shared_hctx_restart counted for how many hardware queues are marked for restart is removed too. Remove the stale comment that we still count hardware queues need restart. Fixes: `97889f9ac2` ("blk-mq: remove synchronize_rcu() from blk_mq_del_queue_tag_set()") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-06 09:22:28 -07:00
Kemeng Shi	6ee858a3d3	blk-mq: avoid sleep in blk_mq_alloc_request_hctx Commit `1f5bd336b9` ("blk-mq: add blk_mq_alloc_request_hctx") add blk_mq_alloc_request_hctx to send commands to a specific queue. If BLK_MQ_REQ_NOWAIT is not set in tag allocation, we may change to different hctx after sleep and get tag from unexpected hctx. So BLK_MQ_REQ_NOWAIT must be set in flags for blk_mq_alloc_request_hctx. After commit `600c3b0cea` ("blk-mq: open code __blk_mq_alloc_request in blk_mq_alloc_request_hctx"), blk_mq_alloc_request_hctx return -EINVAL if both BLK_MQ_REQ_NOWAIT and BLK_MQ_REQ_RESERVED are not set instead of if BLK_MQ_REQ_NOWAIT is not set. So if BLK_MQ_REQ_NOWAIT is not set and BLK_MQ_REQ_RESERVED is set, blk_mq_alloc_request_hctx could alloc tag from unexpected hctx. I guess what we need here is that return -EINVAL if either BLK_MQ_REQ_NOWAIT or BLK_MQ_REQ_RESERVED is not set. Currently both BLK_MQ_REQ_NOWAIT and BLK_MQ_REQ_RESERVED will be set if specific hctx is needed in nvme_auth_submit, nvmf_connect_io_queue and nvmf_connect_admin_queue. Fix the potential BLK_MQ_REQ_NOWAIT missed case in future. Fixes: `600c3b0cea` ("blk-mq: open code __blk_mq_alloc_request in blk_mq_alloc_request_hctx") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-06 09:22:28 -07:00
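The difference between the two flag checks discussed in the entry above is easy to miss in prose, so here is a minimal userspace sketch. The flag values and both function names are hypothetical stand-ins; the real check lives in blk_mq_alloc_request_hctx and uses the kernel's BLK_MQ_REQ_* definitions.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-ins for BLK_MQ_REQ_NOWAIT and BLK_MQ_REQ_RESERVED. */
#define REQ_NOWAIT   (1u << 0)
#define REQ_RESERVED (1u << 1)

/*
 * Behaviour the message attributes to the code after commit 600c3b0cea:
 * reject only when neither flag is set.
 */
static inline int check_flags_old(unsigned int flags)
{
	if (!(flags & (REQ_NOWAIT | REQ_RESERVED)))
		return -EINVAL;
	return 0;
}

/* Behaviour the message asks for: reject unless both flags are set. */
static inline int check_flags_new(unsigned int flags)
{
	if (!(flags & REQ_NOWAIT) || !(flags & REQ_RESERVED))
		return -EINVAL;
	return 0;
}
```

The case that matters is RESERVED without NOWAIT: the older check lets it through, so the allocation may sleep and end up on a different hctx, while the stricter check rejects it.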
Christoph Hellwig	e81cd5a983	block: stub out and deprecated the capability attribute on the gendisk The capability attribute was added in 2017 to expose the kernel internal GENHD_FL_MEDIA_CHANGE_NOTIFY to userspace without ever adding a value to an UAPI header, and without ever setting it in any driver until it was finally removed in Linux 5.7. Deprecate the file and always return 0 instead of exposing the other internal and frequently renumbered gendisk flags. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20230203150209.3199115-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-06 08:44:55 -07:00
Christoph Hellwig	28e538a309	blk-cgroup: fix freeing NULL blkg in blkg_create new_blkg can be NULL if the caller didn't pass in a pre-allocated blkg. Don't try to free it in that case. Fixes: `27b642b07a` ("blk-cgroup: simplify blkg freeing from initialization failure paths") Reported-by: Yi Zhang <yi.zhang@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20230206150201.3438972-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-06 08:43:23 -07:00
Linus Torvalds	0136d86b78	block-6.2-2023-02-03 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmPdRq8QHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpjcqEADcWlRjkcLzRpEMD9g3IyDShasT1JVeSvV6 xqDuA0kRF6DyObu82jE2wiZ49FRpeCUw6S6ZdVhvwGHgPpfLBuPWonFnTqxYAnSz XCYnt4QdZHGiydIHVxkyP8Raz6d24kZawlUmbE7dcfksNziyGR5UjbCsk1HNJhmf EvnLZ2EozZwsZLW/RRYZrh9Q8ccB8kJeX+JuUVw7sboNyJ+bW+x+7prlm3CKgopX IiP69E6qIPe6RHkyLRdKgYgxRdcgeq6uJk/nuZ/6uPCcyrz+0QEtge3CkTe7zLkF CPmbWlqngmNfNsS93nPTK2kHWTz8P2spo+UTkXIegSYBA8CIr9lDxazSFKT0B6zH yIWzmQoE7YXRI5B21rlPvNGE/gPSy48mSn1ym/MCf+UyWGneRypeU/K//2Ww3UJK F1Xl2c1v/EEr28qPuC8VQbAsQ56GOcZ6zW4Q0grxTYm0KzzJ2O5B3FEHdCWlS/x9 KY5v3a8a3nXg9rNio0ruXiyD5l7PE5nFESNrBFDS4kEfxk4cx50ZfgDH68d515/W //EnNjx9nN20yF+LcKD70KJHxPdWaUXGT2c1+E/tdbrgUKReCpER+5hQc8+YxQML DCbzr7LJjX5mmDQ5YI6Y09/L6luzFMjrnxpmXkL7nyWQlSYkMqus3vPtDcJ5Xk2J shHBlzIcuw== =/+rE -----END PGP SIGNATURE----- Merge tag 'block-6.2-2023-02-03' of git://git.kernel.dk/linux Pull block fixes from Jens Axboe: "A bit bigger than I'd like at this point, but mostly a bunch of little fixes. 
In detail: - NVMe pull request via Christoph: - Fix a missing queue put in nvmet_fc_ls_create_association (Amit Engel) - Clear queue pointers on tag_set initialization failure (Maurizio Lombardi) - Use workqueue dedicated to authentication (Shin'ichiro Kawasaki) - Fix for an overflow in ublk (Liu) - Fix for leaking a queue reference in block cgroups (Ming) - Fix for a use-after-free in BFQ (Yu)" * tag 'block-6.2-2023-02-03' of git://git.kernel.dk/linux: blk-cgroup: don't update io stat for root cgroup nvme-auth: use workqueue dedicated to authentication nvme: clear the request_queue pointers on failure in nvme_alloc_io_tag_set nvme: clear the request_queue pointers on failure in nvme_alloc_admin_tag_set nvme-fc: fix a missing queue put in nvmet_fc_ls_create_association block: Fix the blk_mq_destroy_queue() documentation block: ublk: extending queue_size to fix overflow block, bfq: fix uaf for bfqq in bic_set_bfqq()	2023-02-03 11:35:42 -08:00
Christoph Hellwig	d58cdfae6a	block: factor out a bvec_set_page helper Add a helper to initialize a bvec based on a page pointer. This will help remove various open-coded bvec initializations. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20230203150634.3199647-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-03 08:20:54 -07:00
Christoph Hellwig	3f13ab7c80	blk-cgroup: move the cgroup information to struct gendisk cgroup information only makes sense on a live gendisk that allows file system I/O (which includes the raw block device). So move over the cgroup related members. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andreas Herrmann <aherrmann@suse.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230203150400.3199230-20-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-03 08:20:05 -07:00
Christoph Hellwig	479664cee1	blk-cgroup: pass a gendisk to blkg_lookup Pass a gendisk to blkg_lookup and use that to find the match as part of phasing out usage of the request_queue in the blk-cgroup code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andreas Herrmann <aherrmann@suse.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230203150400.3199230-19-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-03 08:20:05 -07:00
Christoph Hellwig	0a0b4f79db	blk-cgroup: pass a gendisk to pd_alloc_fn No need for the request_queue here, pass a gendisk and extract the node ids from that. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andreas Herrmann <aherrmann@suse.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230203150400.3199230-18-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-03 08:20:05 -07:00
Christoph Hellwig	40e4996ec0	blk-cgroup: pass a gendisk to blkcg_{de,}activate_policy Prepare for storing the blkcg information in the gendisk instead of the request_queue. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andreas Herrmann <aherrmann@suse.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230203150400.3199230-17-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-03 08:20:05 -07:00
Christoph Hellwig	ba91c849fa	blk-rq-qos: store a gendisk instead of request_queue in struct rq_qos This is what about half of the users already want, and it's only going to grow more. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andreas Herrmann <aherrmann@suse.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230203150400.3199230-16-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-03 08:20:05 -07:00
Christoph Hellwig	3963d84df7	blk-rq-qos: constify rq_qos_ops These op vectors are constant, so mark them const. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andreas Herrmann <aherrmann@suse.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230203150400.3199230-15-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-03 08:20:05 -07:00
Christoph Hellwig	ce57b55860	blk-rq-qos: make rq_qos_add and rq_qos_del more useful Switch to passing a gendisk, and make rq_qos_add initialize all required fields and drop the not required q argument from rq_qos_del. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andreas Herrmann <aherrmann@suse.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230203150400.3199230-14-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-03 08:20:05 -07:00
Christoph Hellwig	b494f9c566	blk-rq-qos: move rq_qos_add and rq_qos_del out of line These two functions are rather large and not in a fast path, so move them out of line. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230203150400.3199230-13-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-03 08:20:05 -07:00
Christoph Hellwig	4e1d91ae87	blk-wbt: open code wbt_queue_depth_changed in wbt_init wbt_queue_depth_changed just updates a field and calls another function. Open code it in wbt_init, so that the local queue variable can be used instead of the one stored in the rq_qos. This will allow delaying that rq_qos->queue assignment in a subsequent patch. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andreas Herrmann <aherrmann@suse.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230203150400.3199230-12-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-03 08:20:05 -07:00
Christoph Hellwig	0bc65bd41d	blk-wbt: move private information from blk-wbt.h to blk-wbt.c A large part of blk-wbt.h is only used in blk-wbt.c, so move it there. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230203150400.3199230-11-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-03 08:20:05 -07:00
Christoph Hellwig	958f296547	blk-wbt: pass a gendisk to wbt_init Pass a gendisk to wbt_init to prepare for phasing out usage of the request_queue in the blk-cgroup code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andreas Herrmann <aherrmann@suse.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230203150400.3199230-10-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-03 08:20:05 -07:00
Christoph Hellwig	04aad37be1	blk-wbt: pass a gendisk to wbt_{enable,disable}_default Pass a gendisk to wbt_enable_default and wbt_disable_default to prepare for phasing out usage of the request_queue in the blk-cgroup code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andreas Herrmann <aherrmann@suse.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230203150400.3199230-9-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-03 08:20:05 -07:00
Christoph Hellwig	f05837ed73	blk-cgroup: store a gendisk to throttle in struct task_struct Switch from a request_queue pointer and reference to a gendisk one for the throttle information in struct task_struct. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andreas Herrmann <aherrmann@suse.de> Link: https://lore.kernel.org/r/20230203150400.3199230-8-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-03 08:20:05 -07:00
Christoph Hellwig	84d7d462b1	blk-cgroup: pin the gendisk in struct blkcg_gq Currently each blkcg_gq holds a request_queue reference, which is what is used in the policies. But a lot of these interfaces will move over to use a gendisk, so store a disk in struct blkcg_gq and hold a reference to it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andreas Herrmann <aherrmann@suse.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230203150400.3199230-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-03 08:20:05 -07:00
Christoph Hellwig	180b04d450	blk-cgroup: remove the !bdi->dev check in blkg_dev_name bdi_dev_name already performs the same check. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230203150400.3199230-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-03 08:20:05 -07:00
Christoph Hellwig	27b642b07a	blk-cgroup: simplify blkg freeing from initialization failure paths There is no need to delay freeing a blkg to a workqueue when freeing it after an initialization failure. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230203150400.3199230-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-03 08:20:05 -07:00
Christoph Hellwig	0b6f93bdf0	blk-cgroup: improve error unwinding in blkg_alloc Unwind only the previous initialization steps that happened in blkg_alloc using goto based unwinding. This avoids the need for the !queue special case in blkg_free and thus ensures that any blkg seen outside of blkg_alloc is always fully constructed. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230203150400.3199230-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-03 08:20:04 -07:00
Christoph Hellwig	178fa7d498	blk-cgroup: delay blk-cgroup initialization until add_disk There is no need to initialize the cgroup code before the disk is marked live. Moving the cgroup initialization later will help to have a fully initialized struct device in the gendisk for the cgroup code to use in the future. Similarly tear the cgroup information down in del_gendisk to be symmetric and because none of the cgroup tracking is needed once non-passthrough I/O stops. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andreas Herrmann <aherrmann@suse.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230203150400.3199230-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-03 08:20:04 -07:00
Christoph Hellwig	a886001c2d	block: don't call blk_throtl_stat_add for non-READ/WRITE commands blk_throtl_stat_add is called from blk_stat_add explicitly, unlike the other stats that go through q->stats->callbacks. To prepare for cgroup data moving to the gendisk, ensure blk_throtl_stat_add is only called for the plain READ and WRITE commands that it actually handles internally, as blk_stat_add can also be called for passthrough commands on queues that do not have a gendisk associated with them. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andreas Herrmann <aherrmann@suse.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230203150400.3199230-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-03 08:20:04 -07:00
Christoph Hellwig	3222d8c2a7	block: remove ->rw_page The ->rw_page method is a special purpose bypass of the usual bio handling path that is limited to single-page reads and writes and is synchronous, which causes a lot of extra code in the drivers, callers and the block layer. The only remaining user is the MM swap code. Switch that swap code to simply submit a single-vec on-stack bio and synchronously wait on it based on a newly added QUEUE_FLAG_SYNCHRONOUS flag set by the drivers that currently implement ->rw_page instead. While this touches one extra cache line and executes extra code, it simplifies the block layer and drivers and ensures that all features are properly supported by all drivers, e.g. right now ->rw_page bypassed cgroup writeback entirely. [akpm@linux-foundation.org: fix comment typo, per Dan] Link: https://lkml.kernel.org/r/20230125133436.447864-8-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Keith Busch <kbusch@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-02-02 22:33:34 -08:00
Ming Lei	0416f3be58	blk-cgroup: don't update io stat for root cgroup We source root cgroup stats from the system-wide stats, see blkcg_print_stat and blkcg_rstat_flush, so don't update io stat for root cgroup. Fixes blkg leak issue introduced in commit `3b8cc62987` ("blk-cgroup: Optimize blkcg_rstat_flush()") which starts to grab blkg's reference when adding iostat_cpu into percpu blkcg list, but this state won't be consumed by blkcg_rstat_flush() where the blkg reference is dropped. Tested-by: Bart van Assche <bvanassche@acm.org> Reported-by: Bart van Assche <bvanassche@acm.org> Fixes: `3b8cc62987` ("blk-cgroup: Optimize blkcg_rstat_flush()") Cc: Tejun Heo <tj@kernel.org> Cc: Waiman Long <longman@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230202021804.278582-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-02-01 19:26:41 -07:00
Bart Van Assche	81ea42b9c3	block: Fix the blk_mq_destroy_queue() documentation Commit `2b3f056f72` moved a blk_put_queue() call from blk_mq_destroy_queue() into its callers. Reflect this change in the documentation block above blk_mq_destroy_queue(). Cc: Christoph Hellwig <hch@lst.de> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Chaitanya Kulkarni <kch@nvidia.com> Cc: Keith Busch <kbusch@kernel.org> Fixes: `2b3f056f72` ("blk-mq: move the call to blk_put_queue out of blk_mq_destroy_queue") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230130211233.831613-1-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-31 11:46:15 -07:00
Ulf Hansson	4a6a7bc21d	block: Default to use cgroup support for BFQ Assuming that both Kconfig options, BLK_CGROUP and IOSCHED_BFQ are set, we most likely want cgroup support for BFQ too (BFQ_GROUP_IOSCHED), so let's make it default y. Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> Reviewed-by: Linus Walleij <linus.walleij@linaro.org> Link: https://lore.kernel.org/r/20230130121240.159456-1-ulf.hansson@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-30 09:42:42 -07:00
Kemeng Shi	323745a3aa	block, bfq: remove unused bfq_wr_max_time in struct bfq_data bfqd->bfq_wr_max_time is set to 0 in bfq_init_queue and is never changed. It is only used in bfq_wr_duration when bfq_wr_max_time > 0, a condition that is never met, so bfqd->bfq_wr_max_time is not actually used. Just remove it. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230116095153.3810101-9-shikemeng@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 20:03:49 -07:00
Kemeng Shi	87c971de81	block, bfq: remove unnecessary goto tag in bfq_dispatch_rq_from_bfqq We jump to the tag only to return the current rq. Return directly to remove this tag. Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Link: https://lore.kernel.org/r/20230116095153.3810101-8-shikemeng@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 20:03:49 -07:00


7906 Commits