Add a new blk_rq_dma_map / blk_rq_dma_unmap pair that does away with
the wasteful scatterlist structure. Instead it uses the mapping iterator
to either add segments to the IOVA for IOMMU operations, or just map
them one by one for the direct mapping. For the IOMMU case, instead of
a scatterlist with an entry for each segment, only a single [dma_addr,len]
pair needs to be stored for processing a request, and for the direct
mapping the per-segment allocation shrinks from
[page,offset,len,dma_addr,dma_len] to just [dma_addr,len].
One big difference from the scatterlist API, which could be considered a
downside, is that the IOVA collapsing only works when the driver sets
a virt_boundary that matches the IOMMU granule. For NVMe this is done
already so it works perfectly.
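For illustration, driver-side consumption could look roughly like the
sketch below. The helper names, the dma_iova_state/blk_dma_iter types
and the nvme_map_segment() callee are assumptions based on the
description above, not necessarily the final API:

  struct dma_iova_state state;
  struct blk_dma_iter iter;

  if (!blk_rq_dma_map_iter_start(req, dma_dev, &state, &iter))
      return iter.status;    /* nothing mapped or mapping failed */
  do {
      /*
       * Each iteration yields one [dma_addr, len] pair; with a
       * virt_boundary matching the IOMMU granule the IOVA ranges
       * collapse into a single pair.
       */
      nvme_map_segment(cmd, iter.addr, iter.len);
  } while (blk_rq_dma_map_iter_next(req, dma_dev, &state, &iter));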
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/20250625113531.522027-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
To get out of the DMA mapping helpers having to check every segment for
its P2P status, ensure that bios either contain only P2P transfers or
only non-P2P transfers, and that a P2P bio only contains ranges from a
single device.
This means we do the page zone access in the bio add path, where the
page should still be cache hot, and will only have to do the fairly
expensive P2P topology lookup once per bio down in the DMA mapping path,
and only for already marked bios.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/20250625113531.522027-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In preparation for fixing device mapper zone write handling, introduce
the inline helper function bio_needs_zone_write_plugging() to test if a
BIO requires handling through zone write plugging using the function
blk_zone_plug_bio(). This function returns true for any write
(op_is_write(bio) == true) operation directed at a zoned block device
using zone write plugging, that is, a block device with a disk that has
a zone write plug hash table.
This helper allows simplifying the check on entry to blk_zone_plug_bio()
and is used to protect calls to it for blk-mq devices and DM devices.
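A minimal sketch of such an inline helper, assuming the zone write plug
hash table is reachable as disk->zone_wplugs_hash (the real helper may
special-case more operation types):

  static inline bool bio_needs_zone_write_plugging(struct bio *bio)
  {
      /* only writes to a zoned disk using zone write plugging */
      return op_is_write(bio_op(bio)) &&
             bdev_is_zoned(bio->bi_bdev) &&
             bio->bi_bdev->bd_disk->zone_wplugs_hash;
  }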
Fixes: f211268ed1 ("dm: Use the block layer zone append emulation")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250625093327.548866-3-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Back in 2015, commit d2be537c3b ("block: bump BLK_DEF_MAX_SECTORS to
2560") increased the default maximum size of a block device I/O to 2560
sectors (1280 KiB) to "accommodate a 10-data-disk stripe write with
chunk size 128k"). This choice is rather arbitrary and since then,
improvements to the block layer have let software RAID drivers correctly
advertise their stripe width through chunk_sectors, and abuses of
BLK_DEF_MAX_SECTORS_CAP by drivers (to set the HW limit rather than the
default user-controlled maximum I/O size) have been fixed.
Since many block devices, in particular HDDs, can benefit from a larger
value of BLK_DEF_MAX_SECTORS_CAP, increase this value to 4 MiB, or 8192
sectors.
And given that BLK_DEF_MAX_SECTORS_CAP is only used in the block layer
and should not be used by drivers directly, move this macro definition
to the block layer internal header file block/blk.h.
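The resulting definition would be along these lines (sketch):

  /* block/blk.h: default user-visible max I/O size, not a HW limit */
  #define BLK_DEF_MAX_SECTORS_CAP (SZ_4M >> SECTOR_SHIFT) /* 8192 sectors */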
Suggested-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20250618060045.37593-1-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge tag 'block-6.16-20250626' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
- Fixes for ublk:
- fix C++ narrowing warnings in the uapi header
- update/improve UBLK_F_SUPPORT_ZERO_COPY comment in uapi header
- fix for the ublk ->queue_rqs() implementation, limiting a batch
to just the specific task AND ring
- ublk_get_data() error handling fix
- sanity check more arguments in ublk_ctrl_add_dev()
- selftest addition
- NVMe pull request via Christoph:
- reset delayed remove_work after reconnect
- fix atomic write size validation
- Fix for a warning introduced in bdev_count_inflight_rw() in this
merge window
* tag 'block-6.16-20250626' of git://git.kernel.dk/linux:
block: fix false warning in bdev_count_inflight_rw()
ublk: sanity check add_dev input for underflow
nvme: fix atomic write size validation
nvme: refactor the atomic write unit detection
nvme: reset delayed remove_work after reconnect
ublk: setup ublk_io correctly in case of ublk_get_data() failure
ublk: update UBLK_F_SUPPORT_ZERO_COPY comment in UAPI header
ublk: fix narrowing warnings in UAPI header
selftests: ublk: don't take the same backing file for more than one ublk device
ublk: build batch from IOs in same io_ring_ctx and io task
block_write_end() looks like it can be used as a ->write_end()
implementation. However, it can't, as it neither unlocks nor puts
the folio. Since it does not use the 'file', 'mapping' or 'fsdata'
arguments, remove them.
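The trimmed signature then reduces to the arguments actually used
(sketch):

  int block_write_end(loff_t pos, unsigned len, unsigned copied,
                      struct folio *folio);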
Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Link: https://lore.kernel.org/20250624132130.1590285-1-willy@infradead.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
Add support for FALLOC_FL_WRITE_ZEROES: if the block device enables the
unmap write zeroes operation, this will issue a write zeroes command.
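From userspace this is reachable through fallocate(2), for example
(assuming installed headers that define the new flag):

  /* zero a range using the device's fast unmap write zeroes path */
  if (fallocate(fd, FALLOC_FL_WRITE_ZEROES, offset, length) < 0)
      perror("fallocate");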
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://lore.kernel.org/20250619111806.3546162-9-yi.zhang@huaweicloud.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Only the flags passed to blkdev_issue_zeroout() differ between the two
zeroing branches in blkdev_fallocate(). Therefore, clean this up by
factoring them out.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://lore.kernel.org/20250619111806.3546162-8-yi.zhang@huaweicloud.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Currently, disks primarily implement the write zeroes command (aka
REQ_OP_WRITE_ZEROES) through two mechanisms: the first involves
physically writing zeros to the disk media (e.g., HDDs), while the
second performs an unmap operation on the logical blocks, effectively
putting them into a deallocated state (e.g., SSDs). The first method is
generally slow, while the second method is typically very fast.
For example, on certain NVMe SSDs that support NVME_NS_DEAC, submitting
REQ_OP_WRITE_ZEROES requests with the NVME_WZ_DEAC bit can accelerate
the write zeroes operation by placing disk blocks into a deallocated
state, which opportunistically avoids writing zeroes to media while
still guaranteeing that subsequent reads from the specified block range
will return zeroed data. This is a best-effort optimization, not a
mandatory requirement; some devices may partially fall back to writing
physical zeroes due to factors such as misalignment or being asked to
clear a block range smaller than the device's internal allocation unit.
Therefore, the speed of this operation is not guaranteed.
It is difficult to determine whether the storage device supports unmap
write zeroes operation. We cannot determine this by only querying
bdev_limits(bdev)->max_write_zeroes_sectors. Therefore, first, add a new
hardware queue limit parameter, max_hw_wzeroes_unmap_sectors, to
indicate whether a device supports this unmap write zeroes operation.
Then, add two new counterpart software queue limits,
max_wzeroes_unmap_sectors and max_user_wzeroes_unmap_sectors, which
allow users to disable this operation if the speed is very slow on some
special devices.
Finally, for the stacked device case, initialize these two parameters
to UINT_MAX; the operation is enabled only if both the stacking driver
and all underlying devices enable it.
Thanks to Martin K. Petersen for optimizing the documentation of the
write_zeroes_unmap sysfs interface.
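Sketched against struct queue_limits, the new parameters would look
roughly like this (placement and comments approximate):

  /* hardware capability, set by the driver */
  unsigned int max_hw_wzeroes_unmap_sectors;
  /* user override via sysfs */
  unsigned int max_user_wzeroes_unmap_sectors;
  /* effective limit: min of the two above */
  unsigned int max_wzeroes_unmap_sectors;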
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://lore.kernel.org/20250619111806.3546162-2-yi.zhang@huaweicloud.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Update nearly all generic_file_mmap() and generic_file_readonly_mmap()
callers to use generic_file_mmap_prepare() and
generic_file_readonly_mmap_prepare() respectively.
We update blkdev, 9p, afs, erofs, ext2, nfs, ntfs3, smb, ubifs and vboxsf
file systems this way.
Remaining users we cannot yet update are ecryptfs, fuse and cramfs. The
former two are nested file systems that must support any underlying file
system, and cramfs inserts a mixed mapping which currently requires a VMA.
Once all file systems have been converted to mmap_prepare(), we can then
update nested file systems.
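The conversion itself is mechanical; for each affected file_operations
instance it is essentially (illustrative diff):

  -     .mmap           = generic_file_mmap,
  +     .mmap_prepare   = generic_file_mmap_prepare,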
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Link: https://lore.kernel.org/08db85970d89b17a995d2cffae96fb4cc462377f.1750099179.git.lorenzo.stoakes@oracle.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Merge tag 'block-6.16-20250614' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
- Fix for a deadlock on queue freeze with zoned writes
- Fix for zoned append emulation
- Two bio folio fixes, for sparsemem and for very large folios
- Fix for a performance regression introduced in 6.13 when plug
insertion was changed
- Fix for NVMe passthrough handling for polled IO
- Document the ublk auto registration feature
- loop lockdep warning fix
* tag 'block-6.16-20250614' of git://git.kernel.dk/linux:
nvme: always punt polled uring_cmd end_io work to task_work
Documentation: ublk: Separate UBLK_F_AUTO_BUF_REG fallback behavior sublists
block: Fix bvec_set_folio() for very large folios
bio: Fix bio_first_folio() for SPARSEMEM without VMEMMAP
block: use plug request list tail for one-shot backmerge attempt
block: don't use submit_bio_noacct_nocheck in blk_zone_wplug_bio_work
block: Clear BIO_EMULATES_ZONE_APPEND flag on BIO completion
ublk: document auto buffer registration (UBLK_F_AUTO_BUF_REG)
loop: move lo_set_size() out of queue freeze
Previously, the block layer stored the requests in the plug list in
LIFO order. For this reason, blk_attempt_plug_merge() would check
just the head entry for a back merge attempt, and abort after that
unless requests for multiple queues existed in the plug list. If more
than one request is present in the plug list, this makes the one-shot
back merging less useful than before, as it'll always fail to find a
quick merge candidate.
Use the tail entry for the one-shot merge attempt, which is the last
added request in the list. If that fails, abort immediately unless
there are multiple queues available. If multiple queues are available,
then scan the list. Ideally the latter scan would be a backwards scan
of the list, but as it currently stands, the plug list is singly linked
and hence this isn't easily feasible.
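A sketch of the resulting flow (field and helper names approximate):

  /* one-shot attempt against the most recently added request */
  rq = plug->mq_list.tail;
  if (rq && rq->q == q &&
      blk_attempt_bio_merge(q, rq, bio, nr_segs, false) == BIO_MERGE_OK)
      return true;
  if (!plug->multiple_queues)
      return false;    /* single queue: abort immediately */
  /* multiple queues: fall back to scanning the whole plug list */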
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/linux-block/20250611121626.7252-1-abuehaze@amazon.com/
Reported-by: Hazem Mohamed Abuelfotoh <abuehaze@amazon.com>
Fixes: e70c301fae ("block: don't reorder requests in blk_add_rq_to_plug")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bios queued up in the zone write plug have already gone through all
preparation in the submit_bio path, including the freeze protection.
Submitting them through submit_bio_noacct_nocheck duplicates the work
and can cause deadlocks when freezing a queue with pending bio
write plugs.
Go straight to ->submit_bio or blk_mq_submit_bio to bypass the
superfluous extra freeze protection and checks.
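A sketch of the resubmission, mirroring the description above:

  /* the bio already went through submit_bio checks and freeze protection */
  if (bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO))
      bio->bi_bdev->bd_disk->fops->submit_bio(bio);
  else
      blk_mq_submit_bio(bio);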
Fixes: 9b1ce7f0c6 ("block: Implement zone append emulation")
Reported-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Tested-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20250611044416.2351850-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When blk_zone_write_plug_bio_endio() is called for a regular write BIO
used to emulate a zone append operation, that is, a BIO flagged with
BIO_EMULATES_ZONE_APPEND, the BIO operation code is restored to the
original REQ_OP_ZONE_APPEND but the BIO_EMULATES_ZONE_APPEND flag is not
cleared. Clear it to fully return the BIO to its original definition.
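The fix amounts to clearing the flag where the operation code is
restored (sketch):

  if (bio_flagged(bio, BIO_EMULATES_ZONE_APPEND)) {
      /* restore the original zone append operation code... */
      bio->bi_opf &= ~REQ_OP_MASK;
      bio->bi_opf |= REQ_OP_ZONE_APPEND;
      /* ...and fully return the BIO to its original definition */
      bio_clear_flag(bio, BIO_EMULATES_ZONE_APPEND);
  }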
Fixes: 9b1ce7f0c6 ("block: Implement zone append emulation")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250611005915.89843-1-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move this API to the canonical timer_*() namespace.
[ tglx: Redone against pre rc1 ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/aB2X0jCKQO56WdMt@gmail.com
Merge tag 'block-6.16-20250606' of git://git.kernel.dk/linux
Pull more block updates from Jens Axboe:
- NVMe pull request via Christoph:
- TCP error handling fix (Shin'ichiro Kawasaki)
- TCP I/O stall handling fixes (Hannes Reinecke)
- fix command limits status code (Keith Busch)
- support vectored buffers also for passthrough (Pavel Begunkov)
- spelling fixes (Yi Zhang)
- MD pull request via Yu:
- fix REQ_RAHEAD and REQ_NOWAIT IO err handling for raid1/10
- fix max_write_behind setting for dm-raid
- some minor cleanups
- Integrity data direction fix and cleanup
- bcache NULL pointer fix
- Fix for loop missing write start/end handling
- Decouple hardware queues and IO threads in ublk
- Slew of ublk selftests additions and updates
* tag 'block-6.16-20250606' of git://git.kernel.dk/linux: (29 commits)
nvme: spelling fixes
nvme-tcp: fix I/O stalls on congested sockets
nvme-tcp: sanitize request list handling
nvme-tcp: remove tag set when second admin queue config fails
nvme: enable vectored registered bufs for passthrough cmds
nvme: fix implicit bool to flags conversion
nvme: fix command limits status code
selftests: ublk: kublk: improve behavior on init failure
block: flip iter directions in blk_rq_integrity_map_user()
block: drop direction param from bio_integrity_copy_user()
selftests: ublk: cover PER_IO_DAEMON in more stress tests
Documentation: ublk: document UBLK_F_PER_IO_DAEMON
selftests: ublk: add stress test for per io daemons
selftests: ublk: add functional test for per io daemons
selftests: ublk: kublk: decouple ublk_queues from ublk server threads
selftests: ublk: kublk: move per-thread data out of ublk_queue
selftests: ublk: kublk: lift queue initialization out of thread
selftests: ublk: kublk: tie sqe allocation to io instead of queue
selftests: ublk: kublk: plumb q_id in io_uring user_data
ublk: have a per-io daemon instead of a per-queue daemon
...
blk_rq_integrity_map_user() creates the ubuf iter with ITER_DEST for
write-direction operations and ITER_SOURCE for read-direction ones.
This is backwards; writes use the user buffer as a source for metadata
and reads use it as a destination. Switch to the rq_data_dir() helper,
which maps writes to ITER_SOURCE (WRITE) and reads to ITER_DEST (READ).
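Sketch of the corrected iter setup (ITER_SOURCE == WRITE and
ITER_DEST == READ, so rq_data_dir() can be used directly):

  /* writes read metadata from the user buffer, reads fill it */
  ret = import_ubuf(rq_data_dir(rq), ubuf, bytes, &iter);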
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Fixes: fe8f4ca710 ("block: modify bio_integrity_map_user to accept iov_iter as argument")
Link: https://lore.kernel.org/r/20250603184752.1185676-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
- dm-delay: don't busy-wait in kthread
- dm: use generic disable_* functions instead of open coding them
- dm: lock queue limits when reading them
- dm-verity: use softirq context only when !need_resched()
- dm-bufio: remove maximum age based eviction
- dm: remove unneeded kvfree from alloc_targets
- dm-flakey: various fixes
- dm-mpath: interface for explicit probing of active paths
- dm: fix BLK_FEAT_ATOMIC_WRITES
- dm: pass through operations on wrapped inline crypto keys
- dm vdo indexer: don't read request structure after enqueuing
- dm-zone: Use bdev_*() helper functions where applicable
- dm-mpath: replace spin_lock_irqsave with spin_lock_irq
- dm-mirror: fix a tiny race condition
- dm-verity: fix a memory leak if some arguments are specified multiple times
- dm-stripe: small code cleanup
Merge tag 'for-6.16/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper updates from Mikulas Patocka:
- better error handling when reloading a table
- use generic disable_* functions instead of open coding them
- lock queue limits when reading them
- remove unneeded kvfree from alloc_targets
- fix BLK_FEAT_ATOMIC_WRITES
- pass through operations on wrapped inline crypto keys
- dm-verity:
- use softirq context only when !need_resched()
- fix a memory leak if some arguments are specified multiple times
- dm-mpath:
- interface for explicit probing of active paths
- replace spin_lock_irqsave with spin_lock_irq
- dm-delay: don't busy-wait in kthread
- dm-bufio: remove maximum age based eviction
- dm-flakey: various fixes
- vdo indexer: don't read request structure after enqueuing
- dm-zone: Use bdev_*() helper functions where applicable
- dm-mirror: fix a tiny race condition
- dm-stripe: small code cleanup
* tag 'for-6.16/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (29 commits)
dm-stripe: small code cleanup
dm-verity: fix a memory leak if some arguments are specified multiple times
dm-mirror: fix a tiny race condition
dm-table: check BLK_FEAT_ATOMIC_WRITES inside limits_lock
dm mpath: replace spin_lock_irqsave with spin_lock_irq
dm-mpath: Don't grab work_mutex while probing paths
dm-zone: Use bdev_*() helper functions where applicable
dm vdo indexer: don't read request structure after enqueuing
dm: pass through operations on wrapped inline crypto keys
blk-crypto: export wrapped key functions
dm-table: Set BLK_FEAT_ATOMIC_WRITES for target queue limits
dm mpath: Interface for explicit probing of active paths
dm: Allow .prepare_ioctl to handle ioctls directly
dm-flakey: make corrupting read bios work
dm-flakey: remove useless ERROR_READS check in flakey_end_io
dm-flakey: error all IOs when num_features is absent
dm-flakey: Clean up parsing messages
dm: remove unneeded kvfree from alloc_targets
dm-bufio: remove maximum age based eviction
dm-verity: use softirq context only when !need_resched()
...
The direction is determined from the bio, which is already passed in.
Compute op_is_write(bio_op(bio)) directly instead of converting it to an
iter direction and back to a bool.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Link: https://lore.kernel.org/r/20250603183133.1178062-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge tag 'cgroup-for-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
- cgroup rstat shared the tracking tree across all controllers with the
rationale being that a cgroup which is using one resource is likely
to be using other resources at the same time (ie. if something is
allocating memory, it's probably consuming CPU cycles).
However, this turned out to not scale very well especially with memcg
using rstat for internal operations which made memcg stat read and
flush patterns substantially different from other controllers. JP
Kobryn split the rstat tree per controller.
- cgroup BPF support was hooking into cgroup init/exit paths directly.
Convert them to use a notifier chain instead so that other usages can
be added easily. The two of the patches which implement this are
mislabeled as belonging to sched_ext instead of cgroup. Sorry.
- Relatively minor cpuset updates
- Documentation updates
* tag 'cgroup-for-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (23 commits)
sched_ext: Convert cgroup BPF support to use cgroup_lifetime_notifier
sched_ext: Introduce cgroup_lifetime_notifier
cgroup: Minor reorganization of cgroup_create()
cgroup, docs: cpu controller's interaction with various scheduling policies
cgroup, docs: convert space indentation to tab indentation
cgroup: avoid per-cpu allocation of size zero rstat cpu locks
cgroup, docs: be specific about bandwidth control of rt processes
cgroup: document the rstat per-cpu initialization
cgroup: helper for checking rstat participation of css
cgroup: use subsystem-specific rstat locks to avoid contention
cgroup: use separate rstat trees for each subsystem
cgroup: compare css to cgroup::self in helper for distinguishing css
cgroup: warn on rstat usage by early init subsystems
cgroup/cpuset: drop useless cpumask_empty() in compute_effective_exclusive_cpumask()
cgroup/rstat: Improve cgroup_rstat_push_children() documentation
cgroup: fix goto ordering in cgroup_init()
cgroup: fix pointer check in css_rstat_init()
cgroup/cpuset: Add warnings to catch inconsistency in exclusive CPUs
cgroup/cpuset: Fix obsolete comment in cpuset_css_offline()
cgroup/cpuset: Always use cpu_active_mask
...
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Merge tag 'xfs-merge-6.16' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs updates from Carlos Maiolino:
- Atomic writes for XFS
- Remove experimental warnings for pNFS, scrub and parent pointers
* tag 'xfs-merge-6.16' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (26 commits)
xfs: add inode to zone caching for data placement
xfs: free the item in xfs_mru_cache_insert on failure
xfs: remove the EXPERIMENTAL warning for pNFS
xfs: remove some EXPERIMENTAL warnings
xfs: Remove deprecated xfs_bufd sysctl parameters
xfs: stop using set_blocksize
xfs: allow sysadmins to specify a maximum atomic write limit at mount time
xfs: update atomic write limits
xfs: add xfs_calc_atomic_write_unit_max()
xfs: add xfs_file_dio_write_atomic()
xfs: commit CoW-based atomic writes atomically
xfs: add large atomic writes checks in xfs_direct_write_iomap_begin()
xfs: add xfs_atomic_write_cow_iomap_begin()
xfs: refine atomic write size check in xfs_file_write_iter()
xfs: refactor xfs_reflink_end_cow_extent()
xfs: allow block allocator to take an alignment hint
xfs: ignore HW which cannot atomic write a single block
xfs: add helpers to compute transaction reservation for finishing intent items
xfs: add helpers to compute log item overhead
xfs: separate out setting buftarg atomic writes limits
...
Merge tag 'for-6.16/block-20250523' of git://git.kernel.dk/linux
Pull block updates from Jens Axboe:
- ublk updates:
- Add support for updating the size of a ublk instance
- Zero-copy improvements
- Auto-registering of buffers for zero-copy
- Series simplifying and improving GET_DATA and request lookup
- Series adding quiesce support
- Lots of selftests additions
- Various cleanups
- NVMe updates via Christoph:
- add per-node DMA pools and use them for PRP/SGL allocations
(Caleb Sander Mateos, Keith Busch)
- nvme-fcloop refcounting fixes (Daniel Wagner)
- support delayed removal of the multipath node and optionally
support the multipath node for private namespaces (Nilay Shroff)
- support shared CQs in the PCI endpoint target code (Wilfred
Mallawa)
- support admin-queue only authentication (Hannes Reinecke)
- use the crc32c library instead of the crypto API (Eric Biggers)
- misc cleanups (Christoph Hellwig, Marcelo Moreira, Hannes
Reinecke, Leon Romanovsky, Gustavo A. R. Silva)
- MD updates via Yu:
- Fix that normal IO can be starved by sync IO, found by mkfs on
newly created large raid5, with some clean up patches for bdev
inflight counters
- Clean up brd, getting rid of atomic kmaps and bvec poking
- Add loop driver specifically for zoned IO testing
- Eliminate blk-rq-qos calls with a static key, if not enabled
- Improve hctx locking for when a plug has IO for multiple queues
pending
- Remove block layer bouncing support, which in turn means we can
remove the per-node bounce stat as well
- Improve blk-throttle support
- Improve delay support for blk-throttle
- Improve brd discard support
- Unify IO scheduler switching. This should also fix a bunch of lockdep
warnings we've been seeing, after enabling lockdep support for queue
freezing/unfreezeing
- Add support for block write streams via FDP (flexible data placement)
on NVMe
- Add a bunch of block helpers, facilitating the removal of a bunch of
duplicated boilerplate code
- Remove obsolete BLK_MQ pci and virtio Kconfig options
- Add atomic/untorn write support to blktrace
- Various little cleanups and fixes
* tag 'for-6.16/block-20250523' of git://git.kernel.dk/linux: (186 commits)
selftests: ublk: add test for UBLK_F_QUIESCE
ublk: add feature UBLK_F_QUIESCE
selftests: ublk: add test case for UBLK_U_CMD_UPDATE_SIZE
traceevent/block: Add REQ_ATOMIC flag to block trace events
ublk: run auto buf unregistering in same io_ring_ctx with registering
io_uring: add helper io_uring_cmd_ctx_handle()
ublk: remove io argument from ublk_auto_buf_reg_fallback()
ublk: handle ublk_set_auto_buf_reg() failure correctly in ublk_fetch()
selftests: ublk: add test for covering UBLK_AUTO_BUF_REG_FALLBACK
selftests: ublk: support UBLK_F_AUTO_BUF_REG
ublk: support UBLK_AUTO_BUF_REG_FALLBACK
ublk: register buffer to local io_uring with provided buf index via UBLK_F_AUTO_BUF_REG
ublk: prepare for supporting to register request buffer automatically
ublk: convert to refcount_t
selftests: ublk: make IO & device removal test more stressful
nvme: rename nvme_mpath_shutdown_disk to nvme_mpath_remove_disk
nvme: introduce multipath_always_on module param
nvme-multipath: introduce delayed removal of the multipath head node
nvme-pci: derive and better document max segments limits
nvme-pci: use struct_size for allocation struct nvme_dev
...
Merge tag 'vfs-6.16-rc1.writepage' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull final writepage conversion from Christian Brauner:
"This converts vboxfs from ->writepage() to ->writepages().
This was the last user of the ->writepage() method. So remove
->writepage() completely and all references to it"
* tag 'vfs-6.16-rc1.writepage' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: Remove aops->writepage
mm: Remove swap_writepage() and shmem_writepage()
ttm: Call shmem_writeout() from ttm_backup_backup_page()
i915: Use writeback_iter()
shmem: Add shmem_writeout()
writeback: Remove writeback_use_writepage()
migrate: Remove call to ->writepage
vboxsf: Convert to writepages
9p: Add a migrate_folio method
It is possible to eliminate contention between subsystems when
updating/flushing stats by using subsystem-specific locks. Let the existing
rstat locks be dedicated to the cgroup base stats and rename them to
reflect that. Add similar locks to the cgroup_subsys struct for use with
individual subsystems.
Lock initialization is done in the new function ss_rstat_init(ss) which
replaces cgroup_rstat_boot(void). If NULL is passed to this function, the
global base stat locks will be initialized. Otherwise, the subsystem locks
will be initialized.
Change the existing lock helper functions to accept a reference to a css.
Then within these functions, conditionally select the appropriate locks
based on the subsystem affiliation of the given css. Add helper functions
for this selection routine to avoid repeated code.
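A condensed sketch of that initialization, with the lock field names
assumed from the description above:

  static void __init ss_rstat_init(struct cgroup_subsys *ss)
  {
      int cpu;

      /* NULL means: initialize the global base stat locks */
      spin_lock_init(ss ? &ss->rstat_ss_lock : &cgroup_rstat_base_lock);
      for_each_possible_cpu(cpu)
          raw_spin_lock_init(ss ?
              per_cpu_ptr(ss->rstat_ss_cpu_lock, cpu) :
              per_cpu_ptr(&cgroup_rstat_base_cpu_lock, cpu));
  }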
Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Fix to zone block devices to make the maximum segment count match what
the block layer is capable of.
Signed-off-by: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fix from James Bottomley:
"Fix to zone block devices to make the maximum segment count match what
the block layer is capable of"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: sd_zbc: block: Respect bio vector limits for REPORT ZONES buffer
Merge tag 'block-6.15-20250515' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
- NVMe pull request via Christoph:
- fixes for atomic writes (Alan Adamson)
- fixes for polled CQs in nvmet-epf (Damien Le Moal)
- fix for polled CQs in nvme-pci (Keith Busch)
- fix compile on odd configs that need to be forced to inline
(Kees Cook)
- one more quirk (Ilya Guterman)
- Fix for missing allocation of an integrity buffer for some cases
- Fix for a regression with ublk command cancelation
* tag 'block-6.15-20250515' of git://git.kernel.dk/linux:
ublk: fix dead loop when canceling io command
nvme-pci: add NVME_QUIRK_NO_DEEPEST_PS quirk for SOLIDIGM P44 Pro
nvme: all namespaces in a subsystem must adhere to a common atomic write size
nvme: multipath: enable BLK_FEAT_ATOMIC_WRITES for multipathing
nvmet: pci-epf: remove NVMET_PCI_EPF_Q_IS_SQ
nvmet: pci-epf: improve debug message
nvmet: pci-epf: cleanup nvmet_pci_epf_raise_irq()
nvmet: pci-epf: do not fall back to using INTX if not supported
nvmet: pci-epf: clear completion queue IRQ flag on delete
nvme-pci: acquire cq_poll_lock in nvme_poll_irqdisable
nvme-pci: make nvme_pci_npages_prp() __always_inline
block: always allocate integrity buffer when required
blk-mq-dma.c was split from blk-merge.c, which has no copyright notice,
but except for some boilerplate code and comments left from the old
version this is all my code, so add my copyright.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250513071433.836797-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
While working on the new DMA API I kept getting annoyed how it was placed
right in the middle of the bio splitting code in blk-merge.c.
Split it out into a separate file.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250513071433.836797-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When nr_hw_queues is updated, the elevator needs to be switched so that
we exit the elevator and reattach it, ensuring that hctx->sched_tags is
correctly allocated for the new hardware queues.
However, elv_update_nr_hw_queues() currently only switches the
elevator if the queue is not registered. This is incorrect, as it
prevents reattaching the elevator after updating nr_hw_queues, which
in turn inhibits allocation of sched_tags.
Fix this by allowing the elevator switch if the queue is registered,
ensuring proper reattachment and resource allocation.
Fixes: 596dce110b ("block: simplify elevator reattachment for updating nr_hw_queues")
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Link: https://lore.kernel.org/r/20250515134511.548270-1-nilay@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If blk-throttle is enabled but blktrace is not, then the compiler will
notice that the following two variables are unused:
../block/blk-throttle.c: In function 'throtl_pending_timer_fn':
../block/blk-throttle.c:1153:30: warning: unused variable 'bio_cnt_w' [-Wunused-variable]
1153 | unsigned int bio_cnt_w = sq_queued(sq, WRITE);
| ^~~~~~~~~
../block/blk-throttle.c:1152:30: warning: unused variable 'bio_cnt_r' [-Wunused-variable]
1152 | unsigned int bio_cnt_r = sq_queued(sq, READ);
| ^~~~~~~~~
Silence that by annotating them with __maybe_unused.
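That is, along the lines of:

  unsigned int bio_cnt_r __maybe_unused = sq_queued(sq, READ);
  unsigned int bio_cnt_w __maybe_unused = sq_queued(sq, WRITE);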
Fixes: 28ad83b774 ("blk-throttle: Split the service queue")
Link: https://lore.kernel.org/all/20250515130830.9671-1-aishwarya.tcv@arm.com/
Reported-by: Aishwarya <aishwarya.tcv@arm.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit 9bc1e897a8 ("blk-mq: remove unused queue mapping helpers") makes
the two config options, BLK_MQ_PCI and BLK_MQ_VIRTIO, have no remaining
effect.
Remove the two obsolete config options.
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@redhat.com>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20250514065513.463941-1-lukas.bulwahn@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
bvec_try_merge_page currently returns whether the added page fragment is
within the same page as the last page in the current last bio_vec.
This information is used by __bio_iov_iter_get_pages so that we always
have a single folio pin per page even when the page is split over
multiple __bio_iov_iter_get_pages calls.
Threading this through the entire low-level add-page-to-bio logic is
annoying and inefficient and leads to less code sharing than otherwise
possible. Instead add code to __bio_iov_iter_get_pages that checks if
the bio_vecs did not change and thus a merge into the last segment must
have happened, and if there is an offset into the page for the currently
added fragment, because if yes we must have already had a previous
fragment of the same page in the last bio_vec. While this is still a bit
ugly, it keeps the logic in the one place that needs it and allows for
more code sharing.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250512042354.514329-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
[BUG]
There is an issue of delayed IO dispatch caused by IO splitting. Consider
the following scenario:
1) If we set a BPS limit of 1MB/s and restrict the maximum IO size per
dispatch to 4KB, submitting -two- 1MB IO requests results in completion
times of 1s and 2s, which is expected.
2) However, if we additionally set an IOPS limit of 1,000,000/s with the
same BPS limit of 1MB/s, submitting -two- 1MB IO requests again results in
both completing in 2s, even though the IOPS constraint is being met.
[CAUSE]
This issue arises because BPS and IOPS currently share the same queue in
the blkthrotl mechanism:
1) This issue does not occur when only BPS is limited because the split IOs
return false in blk_should_throtl() and do not go through to throtl again.
2) For split IOs, even if they have been tagged with BIO_BPS_THROTTLED,
they still get queued alternately in the same list due to continuous
splitting and reordering. As a result, the two IO requests are both
completed at the 2-second mark, causing an unintended delay.
3) It is not difficult to imagine that in this scenario, if N 1MB IOs are
issued at once, all IOs will eventually complete together in N seconds.
[FIX]
With the queue separation introduced in the previous patches, we now have
separate BPS and IOPS queues. IOs that have already passed the BPS
limitation do not need to re-enter the BPS queue and can be placed
directly into the IOPS queue.
Since we have split the queues, when the IOPS queue is previously empty
and a new bio is added to the first qnode->bios_iops list in the
service_queue, we also need to update the disptime. This patch introduces
"THROTL_TG_IOPS_WAS_EMPTY" flag to mark it.
Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Zizhi Wo <wozizhi@huaweicloud.com>
Link: https://lore.kernel.org/r/20250506020935.655574-8-wozizhi@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch splits throtl_service_queue->nr_queued into "nr_queued_bps" and
"nr_queued_iops", allowing separate accounting of BPS and IOPS queued bios.
This prepares for future changes that need to check whether the BPS or IOPS
queues are empty.
To facilitate updating the number of IOs in the BPS and IOPS queues, the
addition logic will be moved from throtl_add_bio_tg() to
throtl_qnode_add_bio(), and similarly, the removal logic will be moved from
tg_dispatch_one_bio() to throtl_pop_queued().
And introduce sq_queued() to calculate the total sum of sq->nr_queued.
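The new helper is conceptually just the sum of the two counters (sketch,
assuming the per-direction indexing of the old nr_queued):

  static unsigned int sq_queued(struct throtl_service_queue *sq, int rw)
  {
      return sq->nr_queued_bps[rw] + sq->nr_queued_iops[rw];
  }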
Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Zizhi Wo <wozizhi@huaweicloud.com>
Link: https://lore.kernel.org/r/20250506020935.655574-7-wozizhi@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch splits the single queue into separate bps and iops queues. Now,
an IO request must first pass through the bps queue, then the iops queue,
and finally be dispatched. Due to the queue splitting, we need to modify
the throtl add/peek/pop function.
Additionally, the patch modifies the logic related to tg_dispatch_time().
If bio needs to wait for bps, function directly returns the bps wait time;
otherwise, it charges bps and returns the iops wait time so that bio can be
directly placed into the iops queue afterward. Note that this may lead to
more frequent updates to disptime, but the overhead is negligible for the
slow path.
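The resulting structure of tg_dispatch_time() is then roughly:

  static unsigned long tg_dispatch_time(struct throtl_grp *tg,
                                        struct bio *bio)
  {
      unsigned long wait = tg_dispatch_bps_time(tg, bio);

      if (wait)
          return wait;    /* still throttled by bps */
      /*
       * bps is satisfied and charged, so the bio can be placed
       * directly into the iops queue afterwards.
       */
      return tg_dispatch_iops_time(tg, bio);
  }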
Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Zizhi Wo <wozizhi@huaweicloud.com>
Link: https://lore.kernel.org/r/20250506020935.655574-6-wozizhi@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Subsequent patches will split the single queue into separate bps and iops
queues. To prevent IO that has already passed through the bps queue at a
single tg level from being counted toward bps wait time again, we introduce
"BIO_TG_BPS_THROTTLED" flag. Since throttle and QoS operate at different
levels, we reuse the value as "BIO_QOS_THROTTLED".
We set this flag when charge bps and clear it when charge iops, as the bio
will move to the upper-level tg or be dispatched.
This patch does not involve functional changes.
Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Zizhi Wo <wozizhi@huaweicloud.com>
Link: https://lore.kernel.org/r/20250506020935.655574-5-wozizhi@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Split throtl_charge_bio() to facilitate subsequent patches that will
separately charge bps and iops after queue separation.
Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Zizhi Wo <wozizhi@huaweicloud.com>
Link: https://lore.kernel.org/r/20250506020935.655574-4-wozizhi@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
tg_dispatch_time() contained both bps and iops throttling logic. We now
split its internal logic into tg_dispatch_bps/iops_time() to improve code
consistency for future separation of the bps and iops queues.
Besides, merge the time_before() check from the caller into
throtl_extend_slice() to make the code cleaner.
Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Zizhi Wo <wozizhi@huaweicloud.com>
Link: https://lore.kernel.org/r/20250506020935.655574-3-wozizhi@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
tg_may_dispatch() can directly indicate whether bio can be dispatched by
returning the time to wait, without the need for the redundant "wait"
parameter. Remove it and modify the function's return type accordingly.
Since we have determined by the return time whether bio can be dispatched,
rename tg_may_dispatch() to tg_dispatch_time().
Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Zizhi Wo <wozizhi@huaweicloud.com>
Link: https://lore.kernel.org/r/20250506020935.655574-2-wozizhi@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull MD changes from Yu Kuai:
- Fix that normal IO can be starved by sync IO, found by mkfs on newly
created large raid5, with some clean up patches for bdev inflight
counters.
* tag 'md-6.16-20250513' of https://git.kernel.org/pub/scm/linux/kernel/git/mdraid/linux:
md: clean up accounting for issued sync IO
md: fix is_mddev_idle()
md: add a new api sync_io_depth
md: record dm-raid gendisk in mddev
block: export API to get the number of bdev inflight IO
block: clean up blk_mq_in_flight_rw()
block: WARN if bdev inflight counter is negative
block: reuse part_in_flight_rw for part_in_flight
blk-mq: remove blk_mq_in_flight()
The REPORT ZONES buffer size is currently limited by the HBA's maximum
segment count to ensure the buffer can be mapped. However, the block
layer further limits the number of iovec entries to 1024 when allocating
a bio.
To avoid allocation of buffers too large to be mapped, further restrict
the maximum buffer size to BIO_MAX_INLINE_VECS.
Replace the UIO_MAXIOV symbolic name with the more contextually
appropriate BIO_MAX_INLINE_VECS.
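That is, the buffer size gets an additional clamp along these lines
(sketch):

  /* a single bio cannot map more than BIO_MAX_INLINE_VECS pages */
  bufsize = min_t(size_t, bufsize, BIO_MAX_INLINE_VECS << PAGE_SHIFT);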
Fixes: b091ac6168 ("sd_zbc: Fix report zones buffer allocation")
Cc: stable@vger.kernel.org
Signed-off-by: Steve Siwinski <ssiwinski@atto.com>
Link: https://lore.kernel.org/r/20250508200122.243129-1-ssiwinski@atto.com
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
In __blk_mq_update_nr_hw_queues(), the current sequence involves:
1. unregistering sysfs/debugfs attributes
2. freezing the queue
3. reallocating the tag set
4. updating the queue map
5. reallocating hardware contexts
6. updating the elevator (which unfreezes the queue again)
7. re-registering sysfs/debugfs attributes
If tag set reallocation fails at step 3, the function skips steps 4–6
and proceeds directly to step 7, re-registering the sysfs/debugfs
attributes without unfreezing the queue first. This is incorrect and
can lead to a system hang or lockdep splat, as the queue remains frozen
and is never properly unfrozen.
This patch addresses the issue by explicitly unfreezing the queue before
re-registering the sysfs/debugfs attributes in the event of a tag set
reallocation failure.
Fixes: 9dc7a882ce ("block: move hctx debugfs/sysfs registering out of freezing queue")
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250512092952.135887-1-nilay@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Many nvme metadata formats cannot strip or generate the metadata on the
controller side. For these formats, a host-provided integrity buffer is
mandatory even if it isn't checked.
The block integrity read_verify and write_generate attributes prevent
allocating the metadata buffer, but we need it when the format requires
it, otherwise reads and writes will be rejected by the driver with IO
errors.
Assume the integrity buffer can be offloaded to the controller if the
metadata size is the same as the protection information size. Otherwise
provide an unchecked host buffer when the read verify or write
generation attributes are disabled. This fixes the following nvme
warning:
------------[ cut here ]------------
WARNING: CPU: 1 PID: 371 at drivers/nvme/host/core.c:1036 nvme_setup_rw+0x122/0x210
...
RIP: 0010:nvme_setup_rw+0x122/0x210
...
Call Trace:
<TASK>
nvme_setup_cmd+0x1b4/0x280
nvme_queue_rqs+0xc4/0x1f0 [nvme]
blk_mq_dispatch_queue_requests+0x24a/0x430
blk_mq_flush_plug_list+0x50/0x140
__blk_flush_plug+0xc1/0x100
__submit_bio+0x1c1/0x360
? submit_bio_noacct_nocheck+0x2d6/0x3c0
submit_bio_noacct_nocheck+0x2d6/0x3c0
? submit_bio_noacct+0x47/0x4c0
submit_bio_wait+0x48/0xa0
__blkdev_direct_IO_simple+0xee/0x210
? current_time+0x1d/0x100
? current_time+0x1d/0x100
? __bio_clone+0xb0/0xb0
blkdev_read_iter+0xbb/0x140
vfs_read+0x239/0x310
ksys_read+0x58/0xc0
do_syscall_64+0x6c/0x180
entry_SYSCALL_64_after_hwframe+0x4b/0x53
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250509153802.3482493-1-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
- rename part_in_{flight, flight_rw} to bdev_count_{inflight, inflight_rw}
- export bdev_count_inflight, to fix a problem in mdraid that foreground
IO can be starved by background sync IO in later patches
Link: https://lore.kernel.org/linux-raid/20250506124903.2540268-6-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
A negative inflight counter means there is a bug in the related bio-based
disk driver, or in blk-mq for rq-based disks; it's better not to hide the
bug.
Link: https://lore.kernel.org/linux-raid/20250506124903.2540268-3-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
After commit 7be835694d ("block: fix that util can be greater than
100%"), blk_mq_in_flight() is not used and can be removed.
Link: https://lore.kernel.org/linux-raid/20250506124903.2540268-1-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Merge tag 'block-6.15-20250509' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
- Fix for a regression in this series for loop and read/write iterator
handling
- zone append block update tweak
- remove a broken IO priority test
- NVMe pull request via Christoph:
- unblock ctrl state transition for firmware update (Daniel
Wagner)
* tag 'block-6.15-20250509' of git://git.kernel.dk/linux:
block: remove test of incorrect io priority level
nvme: unblock ctrl state transition for firmware update
block: only update request sector if needed
loop: Add sanity check for read/write_iter
Ever since commit eca2040972b4 ("scsi: block: ioprio: Clean up interface
definition"), the macro IOPRIO_PRIO_LEVEL() will mask the level value to
something between 0 and 7, so necessarily, level will always be lower
than IOPRIO_NR_LEVELS (8).
Remove this obsolete check.
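For reference, the uapi definitions behind this (sketch):

  #define IOPRIO_LEVEL_NR_BITS        3
  #define IOPRIO_NR_LEVELS            (1 << IOPRIO_LEVEL_NR_BITS)
  #define IOPRIO_LEVEL_MASK           (IOPRIO_NR_LEVELS - 1)
  #define IOPRIO_PRIO_LEVEL(ioprio)   ((ioprio) & IOPRIO_LEVEL_MASK)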
Reported-by: Kexin Wei <ys.weikexin@h3c.com>
Cc: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20250508083018.GA769554@bytedance
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When blk_unregister_queue() is called from the add_disk() failure path,
there is a race in registering/unregistering the elevator queue kobject
between the two code paths, because commit 559dc11143 ("block: move
elv_register[unregister]_queue out of elevator_lock") moved elevator
queue register/unregister out of the elevator lock.
Fix the race by removing the elevator after deleting disk->queue_kobj,
because kobject_del(&disk->queue_kobj) drains in-progress sysfs
show()/store() of all attributes.
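Sketch of the fixed teardown order:

  /* drains in-progress sysfs show()/store() of all queue attributes */
  kobject_del(&disk->queue_kobj);
  /* now safe: no sysfs access to the elevator kobject can race */
  elv_unregister_queue(q);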
Fixes: 559dc11143 ("block: move elv_register[unregister]_queue out of elevator_lock")
Reported-by: Nilay Shroff <nilay@linux.ibm.com>
Suggested-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Link: https://lore.kernel.org/r/20250508085807.3175112-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk_mq_freeze_queue() can't be called on a quiesced queue, otherwise it
may never return if there are any queued requests.
Fix it by removing the quiesce/unquiesce around elevator_set_none(),
because elevator_switch() already quiesces the queue when we really need
to switch to none.
Fixes: 1e44bedbc9 ("block: unifying elevator change")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Link: https://lore.kernel.org/r/20250508085807.3175112-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
XFS will be able to support large atomic writes (atomic write > 1x block)
in future. This will be achieved by using different operating methods,
depending on the size of the write.
Specifically a new method of operation based on FS atomic extent remapping
will be supported in addition to the current HW offload-based method.
The FS method will generally be appreciably slower performing than the
HW-offload method. However the FS method will be typically able to
contribute to achieving a larger atomic write unit max limit.
XFS will support a hybrid mode, where the HW offload method will be used
when possible, i.e. when the length of the write is supported, and
otherwise FS-based atomic writes will be used.
As such, there is an atomic write length at which the user may experience
appreciably slower performance.
Advertise this limit in a new statx field, stx_atomic_write_unit_max_opt.
When zero, it means that there is no such performance boundary.
Masks STATX{_ATTR}_WRITE_ATOMIC can be used to get this new field. This is
ok for older kernels which don't support this new field, as they would
report 0 in this field (from zeroing in cp_statx()) already. Furthermore
those older kernels don't support large atomic writes - apart from block
fops, but there would be consistent performance there for atomic writes
in range [unit min, unit max].
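Userspace can query the new limit like this (example; needs headers that
define STATX_WRITE_ATOMIC and the new field):

  struct statx stx;

  if (statx(AT_FDCWD, path, 0, STATX_WRITE_ATOMIC, &stx) == 0 &&
      (stx.stx_mask & STATX_WRITE_ATOMIC)) {
      /* 0 means no performance boundary below unit max */
      printf("optimal atomic write max: %u\n",
             stx.stx_atomic_write_unit_max_opt);
  }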
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Rewrite bio_map_kern using the new bio_add_* helpers and drop the
kerneldoc comment that is superfluous for an internal helper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250507120451.4000627-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
That way the bio can be allocated with the right operation already
set, and there is no need to pass the separate 'reading' argument.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250507120451.4000627-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Remove the q argument from blk_rq_map_kern and the internal helpers
called by it as the queue can trivially be derived from the request.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250507120451.4000627-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a helper to add a vmalloc region to a bio, abstracting away the
vmalloc addresses from the underlying pages and another one wrapping
it for the simple case where all data fits into a single bio.
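The core idea can be sketched as follows (illustrative only, with a
hypothetical name; the real helper also backs the single-bio wrapper
mentioned above):

    /*
     * Walk the vmalloc range page by page, translating each virtual
     * page to its backing struct page before adding it to the bio.
     */
    static bool sketch_bio_add_vmalloc(struct bio *bio, void *vaddr,
                                       unsigned int len)
    {
        while (len) {
            unsigned int off = offset_in_page(vaddr);
            unsigned int todo = min_t(unsigned int, len, PAGE_SIZE - off);

            if (bio_add_page(bio, vmalloc_to_page(vaddr), todo, off) != todo)
                return false;
            vaddr += todo;
            len -= todo;
        }
        return true;
    }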
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250507120451.4000627-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a helper to perform synchronous I/O on a kernel direct map range.
Currently this is implemented in various places in usually not very
efficient ways, so provide a generic helper instead.
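At a call site such a helper condenses the usual open-coded pattern,
which looks roughly like this (a sketch of the pattern being replaced,
using the bio_add_virt_nofail helper added elsewhere in this series):

    struct bio_vec bv;
    struct bio bio;
    int error;

    /* One on-stack bio for a single direct-map data range. */
    bio_init(&bio, bdev, &bv, 1, REQ_OP_READ);
    bio.bi_iter.bi_sector = sector;
    bio_add_virt_nofail(&bio, data, len);
    error = submit_bio_wait(&bio);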
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250507120451.4000627-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a helper to add a directly mapped kernel virtual address to a
bio so that callers don't have to convert to pages or folios.
For now only the _nofail variant is provided as that is what all the
obvious callers want.
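Conceptually this is a thin wrapper over the page-based API (a sketch
under that assumption, with a hypothetical name):

    /*
     * For direct-map addresses the backing page is recoverable with
     * virt_to_page(), so callers can pass the virtual address directly.
     */
    static inline void sketch_bio_add_virt_nofail(struct bio *bio,
                                                  void *vaddr,
                                                  unsigned int len)
    {
        __bio_add_page(bio, virt_to_page(vaddr), len,
                       offset_in_page(vaddr));
    }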
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250507120451.4000627-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Export blk_crypto_derive_sw_secret(), blk_crypto_import_key(),
blk_crypto_generate_key(), and blk_crypto_prepare_key() so that they can
be used by device-mapper when passing through wrapped key support.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Use the per-kiocb write stream if provided, or map temperature hints to
write streams (which is a bit questionable, but this shows how it is
done).
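The hint-to-stream mapping can be pictured like this (purely
illustrative, with hypothetical names; the commit itself calls the
mapping questionable):

    /* Collapse lifetime hints onto the streams the device exposes;
     * WRITE_LIFE_NOT_SET and WRITE_LIFE_NONE get no stream. */
    static u16 sketch_hint_to_stream(enum rw_hint hint, u16 max_streams)
    {
        if (hint < WRITE_LIFE_SHORT || !max_streams)
            return 0;
        return min_t(u16, hint - WRITE_LIFE_NONE, max_streams);
    }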
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
[kbusch: removed statx reporting]
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20250506121732.8211-6-joshi.k@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Export the granularity that write streams should be discarded with,
as it is essential for making good use of them.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20250506121732.8211-5-joshi.k@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Drivers with hardware that supports write streams need a way to export
how many are available, so applications can query this generically.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
[hch: renamed hints to streams, removed stacking]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20250506121732.8211-4-joshi.k@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add the ability to pass a write stream for placement control in the bio.
The new field fits in an existing hole, so does not change the size of
the struct.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20250506121732.8211-3-joshi.k@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In case of a ZONE APPEND write, regardless of whether it is a native
ZONE APPEND or one handled by the emulation layer in the zone write
plugging code, the sector the data got written to by the device needs
to be updated in the bio.
At the moment, this is done for every native ZONE APPEND write and every
request that is flagged with 'BIO_ZONE_WRITE_PLUGGING'. But that
superfluously updates the sector for regular writes to a zoned block
device.
Check if a bio is a native ZONE APPEND write or if the bio is flagged as
'BIO_EMULATES_ZONE_APPEND', meaning the block layer's zone write plugging
code handles the ZONE APPEND and translates it into a regular write and
back. Only if one of these two criteria is met, update the sector in the
bio upon completion.
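On the completion path the check then boils down to something like this
sketch:

    /* Update the sector only for true ZONE APPEND writes, whether
     * native or emulated by zone write plugging. */
    if (bio_op(bio) == REQ_OP_ZONE_APPEND ||
        bio_flagged(bio, BIO_EMULATES_ZONE_APPEND))
        bio->bi_iter.bi_sector = rq->__sector;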
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/dea089581cb6b777c1cd1500b38ac0b61df4b2d1.1746530748.git.jth@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A scheduler's ->exit() is called with the queue frozen and the elevator
lock held, and wbt_enable_default() can't be called with the queue
frozen, otherwise the following lockdep warning is triggered:
#6 (&q->rq_qos_mutex){+.+.}-{4:4}:
#5 (&eq->sysfs_lock){+.+.}-{4:4}:
#4 (&q->elevator_lock){+.+.}-{4:4}:
#3 (&q->q_usage_counter(io)#3){++++}-{0:0}:
#2 (fs_reclaim){+.+.}-{0:0}:
#1 (&sb->s_type->i_mutex_key#3){+.+.}-{4:4}:
#0 (&q->debugfs_mutex){+.+.}-{4:4}:
Fix the issue by moving wbt_enable_default() out of bfq's exit(), and
calling it from elevator_change_done().
Meanwhile, add disk->rqos_state_mutex to cover wbt state changes, which
matches that purpose better than ->elevator_lock does.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250505141805.2751237-26-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move the hctx cpuhp add/del out of queue freezing so the freeze lock
doesn't get connected with the cpuhp locks; the lockdep warning can
then be avoided.
This is safe because neither operation needs the queue to be frozen,
and a scheduler switch isn't allowed at that point - the same reason
hctx debugfs/sysfs registration was moved out of queue freezing.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250505141805.2751237-25-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Both blk_mq_map_swqueue() and blk_mq_realloc_hw_ctxs() are called before
the request queue is added to the tagset list, so the two won't run
concurrently with blk_mq_update_nr_hw_queues().
Given the two functions are only called from queue initialization or
blk_mq_update_nr_hw_queues(), an elevator switch can't happen.
So remove the ->elevator_lock uses from the two functions.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250505141805.2751237-24-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move hctx debugfs/sysfs registration out of queue freezing in
__blk_mq_update_nr_hw_queues(), so that the following lockdep dependency
can be killed:
#2 (&q->q_usage_counter(io)#16){++++}-{0:0}:
#1 (fs_reclaim){+.+.}-{0:0}:
#0 (&sb->s_type->i_mutex_key#3){+.+.}-{4:4}: //debugfs
And registering/unregistering hctx debugfs/sysfs does not require the
queue to be frozen:
- hctx sysfs attributes' show() are drained when removing the kobject,
and there is no store() implementation for hctx sysfs attributes
- debugfs entry read() is drained too when removing the debugfs
directory, and there is no write() implementation for hctx debugfs
either
- so it is safe to register/unregister hctx sysfs/debugfs without
freezing the queue, because these code paths change nothing and we just
need to keep the hctx alive
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250505141805.2751237-23-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move elv_register[unregister]_queue out of ->elevator_lock & queue
freezing, so we can kill many lockdep warnings.
elv_register[unregister]_queue() is serialized and only deals with
sysfs/debugfs, so it doesn't need to run with the queue frozen:
- when it is called from adding a disk, an elevator switch isn't
possible because ->queue_kobj isn't added yet
- when it is called from deleting a disk, disable_elv_switch() is
responsible for preventing new elevator switches and draining the old
elevator switch
- when it is called from blk_mq_update_nr_hw_queues(), adding/removing
a disk and elevator switches can't be allowed or in progress
With this change, the elevator's ->exit() is called before calling
elv_unregister_queue(), so users may call into ->show()/->store() of the
elevator's sysfs attributes; this issue is covered by the added
`ELEVATOR_FLAG_DYING` flag.
For blk-mq debugfs, hctx->sched_tags is always checked with
->elevator_lock by the debugfs code, and hctx->sched_tags is updated
with ->elevator_lock held, so there isn't such an issue.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250505141805.2751237-22-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a new helper disable_elv_switch() and a new flag
QUEUE_FLAG_NO_ELV_SWITCH for disabling elevator switches before deleting
a disk:
- originally the flag QUEUE_FLAG_REGISTERED was added to prevent
elevator switches while removing a disk, but this flag has since been
used widely for other purposes, so add a new flag dedicated to disabling
elevator switches
- to avoid deadlock risk, we have to move elevator queue
register/unregister out of the elevator lock and queue freezing, which
will be done in the next patch. However, this adds a small race window
between an elevator switch and deleting ->queue_kobj, in which elevator
queue register/unregister could run concurrently. The added helper will
be used to avoid the race in the following patch.
- drain in-progress elevator switches before deleting the disk
Suggested-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Link: https://lore.kernel.org/r/20250505141805.2751237-21-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Prepare for moving elv_register[unregister]_queue out of elevator_lock
& queue freezing: we may have to call elv_unregister_queue() after the
elevator's ->exit() is called, leaving a small window for users to call
into ->show()/->store(), which can cause a use-after-free.
Fail to show/store an elevator sysfs attribute if the elevator is dying,
by adding one new flag, ELEVATOR_FLAG_DYING, which is protected by the
elevator's ->sysfs_lock.
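The guard then looks roughly like this in the sysfs entry points (a
simplified sketch; elv_attr_real_show() is a hypothetical stand-in for
the normal attribute handling):

    static ssize_t elv_attr_show(struct kobject *kobj,
                                 struct attribute *attr, char *page)
    {
        struct elevator_queue *eq =
            container_of(kobj, struct elevator_queue, kobj);
        ssize_t ret = -ENODEV;

        mutex_lock(&eq->sysfs_lock);
        /* Refuse access once the elevator has started dying. */
        if (!test_bit(ELEVATOR_FLAG_DYING, &eq->flags))
            ret = elv_attr_real_show(eq, attr, page); /* hypothetical */
        mutex_unlock(&eq->sysfs_lock);
        return ret;
    }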
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20250505141805.2751237-20-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The elevator queue's type is assigned at allocation time and never gets
cleared until the queue is released.
So its ->type is never NULL; remove the unnecessary check.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250505141805.2751237-19-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pass an elevator_queue reference to elv_register_queue() &
elv_unregister_queue().
No functional change; this prepares for moving the two out of the
elevator lock & queue freezing. At that point we need to store the old
& new elevator queues in a `struct elv_change_ctx` instance, and the
two can co-exist for a short while, so we have to pass the exact
elevator_queue instance to elv_register_queue() & elv_unregister_queue().
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250505141805.2751237-18-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Elevator change is one well-defined behavior:
- tear down the current elevator if it exists
- set up the new elevator
It is supposed to cover every case of changing the elevator via a
single internal API, typically the following:
- set up the default elevator in add_disk()
- switch to none in del_disk()
- reset the elevator in blk_mq_update_nr_hw_queues()
- switch elevators in the sysfs `store` of the elevator attribute
This patch uses elevator_change() to cover all of the above cases:
- every elevator switch is serialized with the others: add_disk/del_disk/
store elevator are serialized already, and blk_mq_update_nr_hw_queues()
uses srcu to sync with the other three cases
- for both add_disk()/del_disk(), queue freezing works in atomic mode
or the queue has already been frozen, so the freeze in elevator_change()
won't add extra delay
- a `struct elv_change_ctx` instance holds all info for changing the
elevator
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250505141805.2751237-17-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add `struct elv_change_ctx` and prepare for unifying elevator changes
via elevator_change(). This way, any input & output parameters can be
provided & observed in the top-level helper.
This helps to move kobject add/delete & debugfs register/unregister
out of ->elevator_lock & queue freezing.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250505141805.2751237-16-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move queue freezing & elevator_lock into elevator_change(), and prepare
for using elevator_change() to set up & tear down the default elevator
too.
Also add lockdep_assert_held() in __elevator_change(), because either
the read or the write lock is required for changing the elevator.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250505141805.2751237-15-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In blk_mq_update_nr_hw_queues(), nr_hw_queues changes and the elevator
data depends on it, so the elevator has to be reattached; call
elevator_switch() to force the attachment.
Add elv_update_nr_hw_queues() simply for blk_mq_update_nr_hw_queues()
to reattach the elevator, since an elevator switch isn't likely while
blk_mq_update_nr_hw_queues() is running. This removes the current
switch-to-none-and-back code.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250505141805.2751237-14-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move the blk_queue_registered() check into elv_iosched_store() and
prepare for using elevator_change() to cover any kind of elevator change
when adding/deleting a disk and updating nr_hw_queues.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250505141805.2751237-13-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This removes duplicate code, and keeps the callers tidy.
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250505141805.2751237-12-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
That makes the function nicely self-contained, so it can be used
to avoid code duplication.
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250505141805.2751237-11-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Both the adding and deleting disk code paths are readers of
`nr_hw_queues`, so we can't allow them to be in progress while
nr_hw_queues is being updated; a kernel panic and a KASAN splat have
been reported in [1].
Prevent adding/deleting disks while updating nr_hw_queues by adding a
rw_semaphore to the tagset: the write lock is grabbed in
blk_mq_update_nr_hw_queues(), and the read lock is acquired when
adding/deleting a disk.
Also mark a GFP_NOIO allocation scope for adding/deleting a disk,
because blk_mq_update_nr_hw_queues() is part of some drivers' error
handlers.
This avoids a lot of trouble.
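The locking shape is roughly the following (a fragment sketch; the
semaphore name on the tag_set is assumed here):

    /* Writer: updating nr_hw_queues excludes adding/deleting disks. */
    down_write(&set->update_nr_hwq_lock);
    __blk_mq_update_nr_hw_queues(set, nr_hw_queues);
    up_write(&set->update_nr_hwq_lock);

    /* Reader: adding (or deleting) a disk, in GFP_NOIO scope because
     * the update may run from a driver's error handler. */
    down_read(&set->update_nr_hwq_lock);
    memflags = memalloc_noio_save();
    /* ... add_disk work ... */
    memalloc_noio_restore(memflags);
    up_read(&set->update_nr_hwq_lock);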
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Suggested-by: Nilay Shroff <nilay@linux.ibm.com>
Reported-by: Nilay Shroff <nilay@linux.ibm.com>
Closes: https://lore.kernel.org/linux-block/a5896cdb-a59a-4a37-9f99-20522f5d2987@linux.ibm.com/
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250505141805.2751237-9-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add the helper add_disk_final() for scanning partitions, announcing the
disk and handling the remaining steps of adding a disk.
No functional change; this prepares for preventing disk addition from
happening while updating nr_hw_queues.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250505141805.2751237-8-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The sched debugfs entries share the same lifetime as the scheduler's
kobject, and the same lock (the elevator lock), so move sched debugfs
register/unregister into elevator_register_queue() and
elevator_unregister_queue().
Then we no longer need blk_mq_debugfs_register() to register the sched
debugfs entries.
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250505141805.2751237-7-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add blk_mq_sched_reg_debugfs()/blk_mq_sched_unreg_debugfs() to clean up
the sched init/exit code a bit.
The order of registering & unregistering debugfs for sched & sched_hctx
is changed a bit, but it is safe because sched & sched_hctx are
guaranteed to be ready when exported via debugfs.
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250505141805.2751237-6-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use q->elevator with ->elevator_lock held in elv_iosched_show(), since
the locally cached elevator reference may have become stale by the time
->elevator_lock is acquired.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250505141805.2751237-5-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Both elevator_switch() and elevator_disable() are only called from two
code paths, in which the queue is guaranteed to be frozen.
So don't freeze the queue in the two functions, and add asserts that
the queue is frozen.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20250505141805.2751237-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
ELEVATOR_FLAG_DISABLE_WBT is only used by BFQ to disallow wbt when BFQ
is in use. The flag is set in BFQ's init() and cleared in BFQ's exit().
Make it a request queue flag, so that we can avoid dealing with the
elevator switch race. It also isn't graceful to check a scheduler flag
in wbt_enable_default().
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250505141805.2751237-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move blk_mq_add_queue_tag_set() after blk_mq_map_swqueue(), and publish
the request queue to the tagset only after everything is set up.
This is safe because BLK_MQ_F_TAG_QUEUE_SHARED isn't used by
blk_mq_map_swqueue(), and the flag is mainly checked in the fast I/O
code path.
This prepares for removing ->elevator_lock from blk_mq_map_swqueue(),
which is supposed to be called only when an elevator switch can't
happen.
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reported-by: Nilay Shroff <nilay@linux.ibm.com>
Closes: https://lore.kernel.org/linux-block/567cb7ab-23d6-4cee-a915-c8cdac903ddd@linux.ibm.com/
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250505141805.2751237-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The tg->[bytes/io]_disp type is signed, while calculate_bytes/io_allowed()
return an unsigned type. Even if the bps/iops limit is not set to max,
the return value of the function may still exceed INT_MAX or LLONG_MAX,
which can overflow the outer variables. Add additional checks for such
cases.
Also, in throtl_trim_slice(), if the bps/iops limit is set to max,
there's no need to call calculate_bytes/io_allowed() at all. Introduce
the helper functions throtl_trim_bps/iops to simplify the process. For
cases when the calculated trim value exceeds INT_MAX (causing an
overflow), we reset tg->[bytes/io]_disp to zero and return the original
tg->[bytes/io]_disp, because that is the size that was actually trimmed.
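The trim logic reduces to something like this standalone illustration
(plain C types instead of the kernel's, names hypothetical):

    #include <limits.h>

    /* disp is the signed budget already dispatched; "allowed" comes
     * from an unsigned calculation and may not fit the signed type. */
    static long long trim_disp(long long *disp, unsigned long long allowed)
    {
        if (allowed >= LLONG_MAX || (long long)allowed >= *disp) {
            /* Overflow or full trim: reset the budget and report the
             * original disp, which is what was actually trimmed. */
            long long trimmed = *disp;

            *disp = 0;
            return trimmed;
        }
        *disp -= (long long)allowed;
        return (long long)allowed;
    }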
Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20250417132054.2866409-4-wozizhi@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We no longer need carryover_[bytes/ios] in tg, so remove them. The
related comments about carryover in tg are merged into those for
[bytes/io]_disp, and other related comments are updated.
Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20250417132054.2866409-3-wozizhi@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In commit 6cc477c368 ("blk-throttle: carry over directly"), the
carryover bytes/ios were carried over to [bytes/io]_disp. However, the
update mechanism has some issues.
In __tg_update_carryover(), we calculate "bytes" and "ios" to represent
the carryover, but the computation when updating [bytes/io]_disp is
incorrect. Also, if sq->nr_queued is empty, we may not update
tg->[bytes/io]_disp to 0 in tg_update_carryover(); it should be set to 0
in the non-carryover case.
This patch fixes the issue.
Fixes: 6cc477c368 ("blk-throttle: carry over directly")
Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250417132054.2866409-2-wozizhi@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The block layer bounce buffering support is unused now, remove it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250505081138.3435992-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use writeback_iter instead of the deprecated write_cache_pages wrapper
in blkdev_writepages. This removes an indirect call per folio.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20250424082752.1967679-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk_mq_flush_plug_list() has a fast path if all requests in the plug
are destined for the same request_queue. It calls ->queue_rqs() with the
whole batch of requests, falling back on ->queue_rq() for any requests
not handled by ->queue_rqs(). However, if the requests are destined for
multiple queues, blk_mq_flush_plug_list() has a slow path that calls
blk_mq_dispatch_list() repeatedly to filter the requests by ctx/hctx.
Each queue's requests are inserted into the hctx's dispatch list under a
spinlock, then __blk_mq_sched_dispatch_requests() takes them out of the
dispatch list (taking the spinlock again), and finally
blk_mq_dispatch_rq_list() calls ->queue_rq() on each request.
Acquiring the hctx spinlock twice and calling ->queue_rq() instead of
->queue_rqs() makes the slow path significantly more expensive. Thus,
batching more requests into a single plug (e.g. io_uring_enter syscall)
can counterintuitively hurt performance by causing the plug to span
multiple queues. We have observed 2-3% of CPU time spent acquiring the
hctx spinlock alone on workloads issuing requests to multiple NVMe
devices in the same io_uring SQE batches.
Add a medium path in blk_mq_flush_plug_list() for plugs that don't have
elevators or come from a schedule, but do span multiple queues. Filter
the requests by queue and call ->queue_rqs()/->queue_rq() on the list of
requests destined to each request_queue.
With this change, we no longer see any CPU time spent in _raw_spin_lock
from blk_mq_flush_plug_list and throughput increases accordingly.
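The medium path amounts to repeatedly partitioning the plug list by
queue and issuing each partition as a batch, roughly like this sketch
(simplified; submit_batch() is a hypothetical stand-in for the
->queue_rqs() dispatch with its ->queue_rq() fallback):

    static void sketch_flush_plug_per_queue(struct rq_list *plug)
    {
        while (!rq_list_empty(plug)) {
            struct request_queue *q = rq_list_peek(plug)->q;
            struct rq_list submit = {};
            struct rq_list rest = {};
            struct request *rq;

            /* Peel off every request destined for q; keep the rest
             * for the next iteration. */
            while ((rq = rq_list_pop(plug)))
                rq_list_add_tail(rq->q == q ? &submit : &rest, rq);

            *plug = rest;
            submit_batch(q, &submit); /* hypothetical */
        }
    }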
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250426011728.4189119-4-csander@purestorage.com
[axboe: fix whitespace damage]
Signed-off-by: Jens Axboe <axboe@kernel.dk>