linux-loongson

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson synced 2025-08-27 06:50:37 +00:00

Author	SHA1	Message	Date
Yu Kuai	c151919080	blk-mq: remove blk_mq_in_flight() After commit `7be835694d` ("block: fix that util can be greater than 100%"), it's not used and can be removed. Link: https://lore.kernel.org/linux-raid/20250506124903.2540268-1-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de>	2025-05-10 16:04:38 +08:00
Linus Torvalds	cc9f0629ca	block-6.15-20250509 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmgeCdgQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgprW3EACoZmPgLRnt19f/u+veGKnJknGMkxgyph44 MryIe6kmJPnZ8CUAq+RloHtObAsMDbNmvcfwvp4Aeg1he7D6yNgCTHyamGRhtwR0 ydGS/eawYUibp5757TIegj26jZOEwWKkVaytX60N0rnRLLkr6IFXgA9JBhxpkF95 YYDW5+6nr7Z95ZNprSpvA5Q59tbv+HAGoKX2mmmj6sWwXCwx9eHEK0//bnOe6HzA 3mYGpsFOIgOZ/DFj33jscSygsuyQRevh/i4TxuNZjDm+AD9TZZwAbspxKPVml+pT aVQ6oRQoGzcPDaoMTTUAwxrMmMaVI3oDFibTDofldDjRbvPI96wR4Cs1orCBHrsE 5xOynjyp6Yrr+RBCbA0i6vIdhb5P/oDsXQ4/qxLgdz46aWwXl7kxd6JhdjDXrmXm 3k9Kou/GHNGbB2qhYU+qASRAWj+hieEEgS31ppCu44q0hh8pbynlUCnl3Di0hHsJ CIgjzOtdLb9sCJm68mpZGCkx71B2RX2TTZRhtCUHZNuAr2vqynPcza2+gtTIYJkN uxtYeguKwKzWTJkB5l4hvscH8GVAKY4hFpILT7CjAMs6Rn/i9mCtnuWlDdGbxQdb JQjxbKM5C/XNdcMG/3oDWaTrg1beFPSk87N9dCNx2A87zLxtLi71OpIScPFdlP9y XN6QgJbh2g== =sdd1 -----END PGP SIGNATURE----- Merge tag 'block-6.15-20250509' of git://git.kernel.dk/linux Pull block fixes from Jens Axboe: - Fix for a regression in this series for loop and read/write iterator handling - zone append block update tweak - remove a broken IO priority test - NVMe pull request via Christoph: - unblock ctrl state transition for firmware update (Daniel Wagner) * tag 'block-6.15-20250509' of git://git.kernel.dk/linux: block: remove test of incorrect io priority level nvme: unblock ctrl state transition for firmware update block: only update request sector if needed loop: Add sanity check for read/write_iter	2025-05-09 10:34:50 -07:00
Aaron Lu	c0d0a9ff6d	block: remove test of incorrect io priority level Ever since commit eca2040972b4("scsi: block: ioprio: Clean up interface definition"), the macro IOPRIO_PRIO_LEVEL() will mask the level value to something between 0 and 7 so necessarily, level will always be lower than IOPRIO_NR_LEVELS(8). Remove this obsolete check. Reported-by: Kexin Wei <ys.weikexin@h3c.com> Cc: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Aaron Lu <ziqianlu@bytedance.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20250508083018.GA769554@bytedance Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-08 09:04:12 -06:00
Ming Lei	824afb9b04	block: move removing elevator after deleting disk->queue_kobj When blk_unregister_queue() is called from add_disk() failure path, there is race in registering/unregistering elevator queue kobject from the two code paths, because commit `559dc11143` ("block: move elv_register[unregister]_queue out of elevator_lock") moves elevator queue register/unregister out of elevator lock. Fix the race by removing elevator after deleting disk->queue_kobj, because kobject_del(&disk->queue_kobj) drains in-progress sysfs show()/store() of all attributes. Fixes: `559dc11143` ("block: move elv_register[unregister]_queue out of elevator_lock") Reported-by: Nilay Shroff <nilay@linux.ibm.com> Suggested-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Link: https://lore.kernel.org/r/20250508085807.3175112-3-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-08 09:03:44 -06:00
Ming Lei	8336d18c6b	block: don't quiesce queue for calling elevator_set_none() blk_mq_freeze_queue() can't be called on quiesced queue, otherwise it may never return if there is any queued requests. Fix it by removing quiesce queue around elevator_set_none() because elevator_switch() does quiesce queue in case that we need to switch to none really. Fixes: `1e44bedbc9` ("block: unifying elevator change") Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Link: https://lore.kernel.org/r/20250508085807.3175112-2-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-08 09:03:44 -06:00
John Garry	5d894321c4	fs: add atomic write unit max opt to statx XFS will be able to support large atomic writes (atomic write > 1x block) in future. This will be achieved by using different operating methods, depending on the size of the write. Specifically a new method of operation based in FS atomic extent remapping will be supported in addition to the current HW offload-based method. The FS method will generally be appreciably slower performing than the HW-offload method. However the FS method will be typically able to contribute to achieving a larger atomic write unit max limit. XFS will support a hybrid mode, where HW offload method will be used when possible, i.e. HW offload is used when the length of the write is supported, and for other times FS-based atomic writes will be used. As such, there is an atomic write length at which the user may experience appreciably slower performance. Advertise this limit in a new statx field, stx_atomic_write_unit_max_opt. When zero, it means that there is no such performance boundary. Masks STATX{_ATTR}_WRITE_ATOMIC can be used to get this new field. This is ok for older kernels which don't support this new field, as they would report 0 in this field (from zeroing in cp_statx()) already. Furthermore those older kernels don't support large atomic writes - apart from block fops, but there would be consistent performance there for atomic writes in range [unit min, unit max]. Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: John Garry <john.g.garry@oracle.com>	2025-05-07 14:25:30 -07:00
Christoph Hellwig	6ff54f4566	block: simplify bio_map_kern Rewrite bio_map_kern using the new bio_add_* helpers and drop the kerneldoc comment that is superfluous for an internal helper. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250507120451.4000627-8-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-07 07:31:07 -06:00
Christoph Hellwig	fddbc51dc2	block: pass the operation to bio_{map,copy}_kern That way the bio can be allocated with the right operation already set and there is no need to pass the separated 'reading' argument. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250507120451.4000627-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-07 07:31:07 -06:00
Christoph Hellwig	af78428ed3	block: remove the q argument from blk_rq_map_kern Remove the q argument from blk_rq_map_kern and the internal helpers called by it as the queue can trivially be derived from the request. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250507120451.4000627-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-07 07:31:07 -06:00
Christoph Hellwig	8dd16f5e34	block: add a bio_add_vmalloc helpers Add a helper to add a vmalloc region to a bio, abstracting away the vmalloc addresses from the underlying pages and another one wrapping it for the simple case where all data fits into a single bio. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250507120451.4000627-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-07 07:31:07 -06:00
Christoph Hellwig	10b1e59cda	block: add a bdev_rw_virt helper Add a helper to perform synchronous I/O on a kernel direct map range. Currently this is implemented in various places in usually not very efficient ways, so provide a generic helper instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250507120451.4000627-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-07 07:31:07 -06:00
Christoph Hellwig	850e210d5a	block: add a bio_add_virt_nofail helper Add a helper to add a directly mapped kernel virtual address to a bio so that callers don't have to convert to pages or folios. For now only the _nofail variant is provided as that is what all the obvious callers want. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250507120451.4000627-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-07 07:31:07 -06:00
Eric Biggers	025e138eeb	blk-crypto: export wrapped key functions Export blk_crypto_derive_sw_secret(), blk_crypto_import_key(), blk_crypto_generate_key(), and blk_crypto_prepare_key() so that they can be used by device-mapper when passing through wrapped key support. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2025-05-06 19:08:08 +02:00
Christoph Hellwig	c27683da64	block: expose write streams for block device nodes Use the per-kiocb write stream if provided, or map temperature hints to write streams (which is a bit questionable, but this shows how it is done). Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Signed-off-by: Christoph Hellwig <hch@lst.de> [kbusch: removed statx reporting] Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20250506121732.8211-6-joshi.k@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-06 07:46:43 -06:00
Christoph Hellwig	c23acfac10	block: introduce a write_stream_granularity queue limit Export the granularity that write streams should be discarded with, as it is essential for making good use of them. Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20250506121732.8211-5-joshi.k@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-06 07:46:43 -06:00
Keith Busch	d2f526ba27	block: introduce max_write_streams queue limit Drivers with hardware that support write streams need a way to export how many are available so applications can generically query this. Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Signed-off-by: Keith Busch <kbusch@kernel.org> [hch: renamed hints to streams, removed stacking] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20250506121732.8211-4-joshi.k@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-06 07:46:43 -06:00
Christoph Hellwig	5006f85ea2	block: add a bi_write_stream field Add the ability to pass a write stream for placement control in the bio. The new field fits in an existing hole, so does not change the size of the struct. Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20250506121732.8211-3-joshi.k@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-06 07:46:43 -06:00
Johannes Thumshirn	db492e24f9	block: only update request sector if needed In case of a ZONE APPEND write, regardless of native ZONE APPEND or the emulation layer in the zone write plugging code, the sector the data got written to by the device needs to be updated in the bio. At the moment, this is done for every native ZONE APPEND write and every request that is flagged with 'BIO_ZONE_WRITE_PLUGGING'. But thus superfluously updates the sector for regular writes to a zoned block device. Check if a bio is a native ZONE APPEND write or if the bio is flagged as 'BIO_EMULATES_ZONE_APPEND', meaning the block layer's zone write plugging code handles the ZONE APPEND and translates it into a regular write and back. Only if one of these two criterion is met, update the sector in the bio upon completion. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/dea089581cb6b777c1cd1500b38ac0b61df4b2d1.1746530748.git.jth@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-06 07:45:59 -06:00
Johannes Thumshirn	3bb6e35632	block: only update request sector if needed In case of a ZONE APPEND write, regardless of native ZONE APPEND or the emulation layer in the zone write plugging code, the sector the data got written to by the device needs to be updated in the bio. At the moment, this is done for every native ZONE APPEND write and every request that is flagged with 'BIO_ZONE_WRITE_PLUGGING'. But thus superfluously updates the sector for regular writes to a zoned block device. Check if a bio is a native ZONE APPEND write or if the bio is flagged as 'BIO_EMULATES_ZONE_APPEND', meaning the block layer's zone write plugging code handles the ZONE APPEND and translates it into a regular write and back. Only if one of these two criterion is met, update the sector in the bio upon completion. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/dea089581cb6b777c1cd1500b38ac0b61df4b2d1.1746530748.git.jth@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-06 07:45:31 -06:00
Ming Lei	78c271344b	block: move wbt_enable_default() out of queue freezing from sched ->exit() scheduler's ->exit() is called with queue frozen and elevator lock is held, and wbt_enable_default() can't be called with queue frozen, otherwise the following lockdep warning is triggered: #6 (&q->rq_qos_mutex){+.+.}-{4:4}: #5 (&eq->sysfs_lock){+.+.}-{4:4}: #4 (&q->elevator_lock){+.+.}-{4:4}: #3 (&q->q_usage_counter(io)#3){++++}-{0:0}: #2 (fs_reclaim){+.+.}-{0:0}: #1 (&sb->s_type->i_mutex_key#3){+.+.}-{4:4}: #0 (&q->debugfs_mutex){+.+.}-{4:4}: Fix the issue by moving wbt_enable_default() out of bfq's exit(), and call it from elevator_change_done(). Meantime add disk->rqos_state_mutex for covering wbt state change, which matches the purpose more than ->elevator_lock. Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250505141805.2751237-26-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-06 07:43:43 -06:00
Ming Lei	7ed7fa561c	block: move hctx cpuhp add/del out of queue freezing Move hctx cpuhp add/del out of queue freezing for not connecting freeze lock with cpuhp locks, then lockdep warning can be avoided. This way is safe because both needn't queue to be frozen and scheduler switch isn't allowed, with same reason for moving hctx debugfs/sysfs register out of queue freeze. Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250505141805.2751237-25-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-06 07:43:43 -06:00
Ming Lei	0a47d2b433	block: don't acquire ->elevator_lock in blk_mq_map_swqueue and blk_mq_realloc_hw_ctxs Both blk_mq_map_swqueue() and blk_mq_realloc_hw_ctxs() are called before the request queue is added to tagset list, so the two won't run concurrently with blk_mq_update_nr_hw_queues(). When the two functions are only called from queue initialization or blk_mq_update_nr_hw_queues(), elevator switch can't happen. So remove ->elevator_lock uses from the two functions. Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250505141805.2751237-24-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-06 07:43:43 -06:00
Ming Lei	9dc7a882ce	block: move hctx debugfs/sysfs registering out of freezing queue Move hctx debugfs/sysfs register out of freezing queue in __blk_mq_update_nr_hw_queues(), so that the following lockdep dependency can be killed: #2 (&q->q_usage_counter(io)#16){++++}-{0:0}: #1 (fs_reclaim){+.+.}-{0:0}: #0 (&sb->s_type->i_mutex_key#3){+.+.}-{4:4}: //debugfs And registering/un-registering hctx debugfs/sysfs does not require queue to be frozen: - hctx sysfs attributes show() are drained when removing kobject, and there isn't store() implementation for hctx sysfs attributes - debugfs entry read() is drained too when removing debugfs directory, and there isn't write() implementation for hctx debugfs too - so it is safe to register/unregister hctx sysfs/debugfs without freezing queue because the cod paths changes nothing, and we just need to keep hctx live Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250505141805.2751237-23-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-06 07:43:43 -06:00
Ming Lei	559dc11143	block: move elv_register[unregister]_queue out of elevator_lock Move elv_register[unregister]_queue out of ->elevator_lock & queue freezing, so we can kill many lockdep warnings. elv_register[unregister]_queue() is serialized, and just dealing with sysfs/ debugfs things, no need to be done with queue frozen: - when it is called from adding disk, elevator switch isn't possible because ->queue_kobj isn't added yet - when it is called from deleting disk, disable_elv_switch() is responsible for preventing new elevator switch and draining old elevator switch. - when it is called from blk_mq_update_nr_hw_queues(), adding/removing disk and elevator switch can't be allowed or in-progress With this change, elevator's ->exit() is called before calling elv_unregister_queue, then user may call into ->show()/store() of elevator's sysfs attributes, and we have covered this issue by adding `ELEVATOR_FLAG_DYNG`. For blk-mq debugfs, hctx->sched_tags is always checked with ->elevator_lock by debugfs code, meantime hctx->sched_tags is updated with ->elevator_lock, so there isn't such issue. Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250505141805.2751237-22-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-06 07:43:43 -06:00
Ming Lei	21eed794ab	block: add new helper for disabling elevator switch when deleting disk Add new helper disable_elv_switch() and new flag QUEUE_FLAG_NO_ELV_SWITCH for disabling elevator switch before deleting disk: - originally flag QUEUE_FLAG_REGISTERED is added for preventing elevator switch during removing disk, but this flag has been used widely for other purposes, so add one new flag for disabling elevator switch only - for avoiding deadlock risk, we have to move elevator queue register/unregister out of elevator lock and queue freeze, which will be done in next patch. However, this way adds small race window between elevator switch and deleting ->queue_kobj, in which elevator queue register/unregister could be run concurrently. The added helper will be used for avoiding the race in the following patch. - drain in-progress elevator switch before deleting disk Suggested-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Link: https://lore.kernel.org/r/20250505141805.2751237-21-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-06 07:43:43 -06:00
Ming Lei	5c3d858cdc	block: fail to show/store elevator sysfs attribute if elevator is dying Prepare for moving elv_register[unregister]_queue out of elevator_lock & queue freezing, so we may have to call elv_unregister_queue() after elevator ->exit() is called, then there is small window for user to call into ->show()/store(), and user-after-free can be caused. Fail to show/store elevator sysfs attribute if elevator is dying by adding one new flag of ELEVATOR_FLAG_DYNG, which is protected by elevator ->sysfs_lock. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20250505141805.2751237-20-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-06 07:43:43 -06:00
Ming Lei	e25ee50dfa	block: remove elevator queue's type check in elv_attr_show/store() elevatore queue's type is assigned since its allocation, and never get cleared until it is released. So its ->type is always not NULL, remove the unnecessary check. Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250505141805.2751237-19-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-06 07:43:43 -06:00
Ming Lei	a3dc6279c2	block: pass elevator_queue to elv_register_queue & unregister_queue Pass elevator_queue reference to elv_register_queue() & elv_unregister_queue(). No functional change, and prepare for moving the two out of elevator lock & freezing queue, when we need to store the old & new elevator queue in `struct elv_change_ctx` instance, then both two can co-exist for short while, so we have to pass the exact elevator_queue instance to elv_register_queue & unregister_queue. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250505141805.2751237-18-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-06 07:43:43 -06:00
Ming Lei	1e44bedbc9	block: unifying elevator change Elevator change is one well-define behavior: - tear down current elevator if it exists - setup new elevator It is supposed to cover any case for changing elevator by single internal API, typically the following cases: - setup default elevator in add_disk() - switch to none in del_disk() - reset elevator in blk_mq_update_nr_hw_queues() - switch elevator in sysfs `store` elevator attribute This patch uses elevator_change() to cover all above cases: - every elevator switch is serialized with each other: add_disk/del_disk/ store elevator is serialized already, blk_mq_update_nr_hw_queues() uses srcu for syncing with the other three cases - for both add_disk()/del_disk(), queue freeze works at atomic mode or has been froze, so the freeze in elevator_change() won't add extra delay - `struct elev_change_ctx` instance holds any info for changing elevator Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250505141805.2751237-17-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-06 07:43:43 -06:00
Ming Lei	1e9db5c427	block: add `struct elv_change_ctx` for unifying elevator change Add `struct elv_change_ctx` and prepare for unifying elevator change by elevator_change(). With this way, any input & output parameter can be provided & observed in top helper. This way helps to move kobject add/delete & debugfs register/unregister out of ->elevator_lock & freezing queue. Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250505141805.2751237-16-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-06 07:43:43 -06:00
Ming Lei	20117b5a4b	block: move queue freezing & elevator_lock into elevator_change() Move queue freezing & elevator_lock into elevator_change(), and prepare for using elevator_change() for setting up & tearing down default elevator too. Also add lockdep_assert_held() in __elevator_change() because either read or write lock is required for changing elevator. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250505141805.2751237-15-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-06 07:43:43 -06:00
Ming Lei	596dce110b	block: simplify elevator reattachment for updating nr_hw_queues In blk_mq_update_nr_hw_queues(), nr_hw_queues changes and elevator data depends on it, and elevator has to be reattached, so call elevator_switch() to force attachment. Add elv_update_nr_hw_queues() simply for blk_mq_update_nr_hw_queues() to reattach elevator, since elevator switch isn't likely when running blk_mq_update_nr_hw_queues(). This way removes the current switch none and switch back code. Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250505141805.2751237-14-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-06 07:43:43 -06:00
Ming Lei	ac55b71a31	block: move blk_queue_registered() check into elv_iosched_store() Move blk_queue_registered() check into elv_iosched_store() and prepare for using elevator_change() for covering any kind of elevator change in adding/deleting disk and updating nr_hw_queue. Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250505141805.2751237-13-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-06 07:43:43 -06:00
Christoph Hellwig	1bb7fba0e2	block: fold elevator_disable into elevator_switch This removes duplicate code, and keeps the callers tidy. Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250505141805.2751237-12-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-06 07:43:43 -06:00
Christoph Hellwig	a11abb9838	block: look up the elevator type in elevator_switch That makes the function nicely self-contained and can be used to avoid code duplication. Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250505141805.2751237-11-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-06 07:43:43 -06:00
Ming Lei	b126d9d747	block: don't allow to switch elevator if updating nr_hw_queues is in-progress Elevator switch code is another `nr_hw_queue` reader in non-fast-IO code path, so it can't be done if updating `nr_hw_queues` is in-progress. Take same approach with not allowing add/del disk when updating nr_hw_queues is in-progress, by grabbing read lock of set->update_nr_hwq_sema. Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/linux-block/aAWv3NPtNIKKvJZc@fedora/ [1] Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Closes: https://lore.kernel.org/linux-block/mz4t4tlwiqjijw3zvqnjb7ovvvaegkqganegmmlc567tt5xj67@xal5ro544cnc/ Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250505141805.2751237-10-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-06 07:43:42 -06:00
Ming Lei	98e68f6702	block: prevent adding/deleting disk during updating nr_hw_queues Both adding/deleting disk code are reader of `nr_hw_queues`, so we can't allow them in-progress when updating nr_hw_queues, kernel panic and kasan has been reported in [1]. Prevent adding/deleting disk during updating nr_hw_queues by adding rw_semaphore to tagset, write lock is grabbed in blk_mq_update_nr_hw_queues(), and read lock is acquired when adding/deleting disk. Also mark GFP_NOIO allocation scope for adding/deleting disk because blk_mq_update_nr_hw_queues() is part of some driver's error handler. This way avoids lot of trouble. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Suggested-by: Nilay Shroff <nilay@linux.ibm.com> Reported-by: Nilay Shroff <nilay@linux.ibm.com> Closes: https://lore.kernel.org/linux-block/a5896cdb-a59a-4a37-9f99-20522f5d2987@linux.ibm.com/ Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250505141805.2751237-9-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-06 07:43:42 -06:00
Ming Lei	5fad1490ef	block: add helper add_disk_final() Add helper add_disk_final() for scanning partitions, announcing disk and handling the last thing for adding disk. No functional change, and prepare for prevent adding disk from happening when updating nr_hw_queues. Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250505141805.2751237-8-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-06 07:43:42 -06:00
Ming Lei	92c22d7efc	block: move sched debugfs register into elvevator_register_queue sched debugfs shares same lifetime with scheduler's kobject, and same lock(elevator lock), so move sched debugfs register/unregister into elevator_register_queue() and elevator_unregister_queue(). Then we needn't blk_mq_debugfs_register() for us to register sched debugfs any more. Reviewed-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250505141805.2751237-7-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-06 07:43:42 -06:00
Ming Lei	ed3896acdc	block: add two helpers for registering/un-registering sched debugfs Add blk_mq_sched_reg_debugfs()/blk_mq_sched_unreg_debugfs() to clean up sched init/exit code a bit. Register & unregister debugfs for sched & sched_hctx order is changed a bit, but it is safe because sched & sched_hctx is guaranteed to be ready when exporting via debugfs. Reviewed-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250505141805.2751237-6-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-06 07:43:42 -06:00
Ming Lei	94209d27d1	block: use q->elevator with ->elevator_lock held in elv_iosched_show() Use q->elevator with ->elevator_lock held in elv_iosched_show(), since the local cached elevator reference may become stale after getting ->elevator_lock. Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250505141805.2751237-5-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-06 07:43:42 -06:00
Ming Lei	f8e111c859	block: don't call freeze queue in elevator_switch() and elevator_disable() Both elevator_switch() and elevator_disable() are only called from the two code paths, in which queue is guaranteed to be frozen. So don't call freeze queue in the two functions, also add asserts for queue freeze. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20250505141805.2751237-4-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-06 07:43:42 -06:00
Ming Lei	56dee46ff4	block: move ELEVATOR_FLAG_DISABLE_WBT a request queue flag ELEVATOR_FLAG_DISABLE_WBT is only used by BFQ to disallow wbt when BFQ is in use. The flag is set in BFQ's init(), and cleared in BFQ's exit(). Making it as request queue flag, so that we can avoid to deal with elevator switch race. Also it isn't graceful to checking one scheduler flag in wbt_enable_default(). Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250505141805.2751237-3-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-06 07:43:42 -06:00
Ming Lei	f24d47edd1	block: move blk_mq_add_queue_tag_set() after blk_mq_map_swqueue() Move blk_mq_add_queue_tag_set() after blk_mq_map_swqueue(), and publish this request queue to tagset after everything is setup. This way is safe because BLK_MQ_F_TAG_QUEUE_SHARED isn't used by blk_mq_map_swqueue(), and this flag is mainly checked in fast IO code path. Prepare for removing ->elevator_lock from blk_mq_map_swqueue() which is supposed to be called when elevator switch can't be done. Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reported-by: Nilay Shroff <nilay@linux.ibm.com> Closes: https://lore.kernel.org/linux-block/567cb7ab-23d6-4cee-a915-c8cdac903ddd@linux.ibm.com/ Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250505141805.2751237-2-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-06 07:43:42 -06:00
Zizhi Wo	18b8144a1b	blk-throttle: Add an additional overflow check to the call calculate_bytes/io_allowed Now the tg->[bytes/io]_disp type is signed, and calculate_bytes/io_allowed return type is unsigned. Even if the bps/iops limit is not set to max, the return value of the function may still exceed INT_MAX or LLONG_MAX, which can cause overflow in outer variables. In such cases, we can add additional checks accordingly. And in throtl_trim_slice(), if the BPS/IOPS limit is set to max, there's no need to call calculate_bytes/io_allowed(). Introduces the helper functions throtl_trim_bps/iops to simplifies the process. For cases when the calculated trim value exceeds INT_MAX (causing an overflow), we reset tg->[bytes/io]_disp to zero, so return original tg->[bytes/io]_disp because it is the size that is actually trimmed. Signed-off-by: Zizhi Wo <wozizhi@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20250417132054.2866409-4-wozizhi@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-05 19:08:34 -06:00
Zizhi Wo	7b89d46051	blk-throttle: Delete unnecessary carryover-related fields from throtl_grp We no longer need carryover_[bytes/ios] in tg, so it is removed. The related comments about carryover in tg are also merged into [bytes/io]_disp, and modify other related comments. Signed-off-by: Zizhi Wo <wozizhi@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20250417132054.2866409-3-wozizhi@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-05 19:08:34 -06:00
Zizhi Wo	f66cf69eb8	blk-throttle: Fix wrong tg->[bytes/io]_disp update in __tg_update_carryover() In commit `6cc477c368` ("blk-throttle: carry over directly"), the carryover bytes/ios was be carried to [bytes/io]_disp. However, its update mechanism has some issues. In __tg_update_carryover(), we calculate "bytes" and "ios" to represent the carryover, but the computation when updating [bytes/io]_disp is incorrect. And if the sq->nr_queued is empty, we may not update tg->[bytes/io]_disp to 0 in tg_update_carryover(). We should set it to 0 in non carryover case. This patch fixes the issue. Fixes: `6cc477c368` ("blk-throttle: carry over directly") Signed-off-by: Zizhi Wo <wozizhi@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250417132054.2866409-2-wozizhi@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-05 19:08:34 -06:00
Christoph Hellwig	eeadd68e2a	block: remove bounce buffering support The block layer bounce buffering support is unused now, remove it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250505081138.3435992-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-05 13:22:39 -06:00
Christoph Hellwig	00ef5c728e	block: use writeback_iter Use writeback_iter instead of the deprecated write_cache_pages wrapper in blkdev_writepages. This removes an indirect call per folio. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: John Garry <john.g.garry@oracle.com> Link: https://lore.kernel.org/r/20250424082752.1967679-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-02 09:23:00 -06:00
Caleb Sander Mateos	9712c57ec1	block: avoid hctx spinlock for plug with multiple queues blk_mq_flush_plug_list() has a fast path if all requests in the plug are destined for the same request_queue. It calls ->queue_rqs() with the whole batch of requests, falling back on ->queue_rq() for any requests not handled by ->queue_rqs(). However, if the requests are destined for multiple queues, blk_mq_flush_plug_list() has a slow path that calls blk_mq_dispatch_list() repeatedly to filter the requests by ctx/hctx. Each queue's requests are inserted into the hctx's dispatch list under a spinlock, then __blk_mq_sched_dispatch_requests() takes them out of the dispatch list (taking the spinlock again), and finally blk_mq_dispatch_rq_list() calls ->queue_rq() on each request. Acquiring the hctx spinlock twice and calling ->queue_rq() instead of ->queue_rqs() makes the slow path significantly more expensive. Thus, batching more requests into a single plug (e.g. io_uring_enter syscall) can counterintuitively hurt performance by causing the plug to span multiple queues. We have observed 2-3% of CPU time spent acquiring the hctx spinlock alone on workloads issuing requests to multiple NVMe devices in the same io_uring SQE batches. Add a medium path in blk_mq_flush_plug_list() for plugs that don't have elevators or come from a schedule, but do span multiple queues. Filter the requests by queue and call ->queue_rqs()/->queue_rq() on the list of requests destined to each request_queue. With this change, we no longer see any CPU time spent in _raw_spin_lock from blk_mq_flush_plug_list and throughput increases accordingly. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250426011728.4189119-4-csander@purestorage.com [axboe: fix whitespace damage] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-02 09:21:36 -06:00
Caleb Sander Mateos	a5728a1d1e	block: factor out blk_mq_dispatch_queue_requests() helper Factor out the logic from blk_mq_flush_plug_list() that calls ->queue_rqs() with a fallback to ->queue_rq() into a helper function blk_mq_dispatch_queue_requests(). This is in preparation for using this code with other lists of requests. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250426011728.4189119-3-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-02 09:21:08 -06:00
Caleb Sander Mateos	0aeb7ebfc7	block: take rq_list instead of plug in dispatch functions blk_mq_plug_issue_direct(), __blk_mq_flush_plug_list(), and blk_mq_dispatch_plug_list() take a struct blk_plug * but only use its mq_list. Pass the struct rq_list * instead in preparation for calling them with other lists of requests. Drop "plug" from the function names as they are no longer plug-specific. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250426011728.4189119-2-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-02 09:21:08 -06:00
Carlos Maiolino	d0d7f1813d	Merge remote-tracking branch 'linux-block/block-6.15' into xfs tree We need two patches inside linux-block tree as dependencies of the patch which will follow this merge. Specifically, we need: block: fix race between set_blocksize and read paths block: hoist block size validation code to a separate function Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-04-28 11:32:06 +02:00
Linus Torvalds	7deea5634a	block-6.15-20250424 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmgK7wMQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgppNwD/46vpEWhwGLeNXxFic5CCNMbUBUl04+Rc9I p22BY9rp1+7ooXJGJJTGaQAjwTFKP/kxyaQWZrJFXK8t2wYhJ8E2PWPVolk8jsOT 0wSYaF9iW4kw5twcmWq+VqPM+joLGKxkwojDTvz4CiorKrq2J14yHkrtfp81R3d3 rR7VzeggglSxEJAKkIBkbRWtMwTQ6WvImm4uufccI3AwfPJcM3qxSXGqq3wryA0O PyqFlkOdjDIbNP3Zu0QvqQ0xyefGCyGyAfKEPNEAn1oOpD8Y/SUvdMdlPzA9pJ93 9+8F9pAg6fo8vgBEMavVGNjFOw4OrxNBNL9St3vlz+VMpid+HMyflolLwCTdQXCz HEZ+H75uwMwh3mskHp5paitdE4Y70tqXW6LWgr/5wXOsl8Lh5p1A7Ll0tP27gUe1 vV1Yh+nwbg5TQ1qi+NmjhUThivT96hop+5nK9p5r7GHSZP1xiJdb7RsQqOhDXBmP I5sjc5Dny8S9b87ehX6b4VfpTbk3aRVhOaEJ4l6k4dwFFkTRP0ODa0/bWPf3Reb0 4HI2/NYRsdLNyut2896P19RxcpcXz5PKEKvcCt5pAehwv4urIdpLoeNLgMdgwfcc qbtNbTkV2V/fOKG7pS6yUQmWGR/XXQZDFJ4gCFZemiYYVAXJCqjBB6dliMy8roFz p5Rc31AEMw== =s0NF -----END PGP SIGNATURE----- Merge tag 'block-6.15-20250424' of git://git.kernel.dk/linux Pull block fixes from Jens Axboe: - Fix autoloading of drivers from stat(2) - Fix losing read-ahead setting one suspend/resume, when a device is re-probed. - Fix race between setting the block size and page cache updates. Includes a helper that a coming XFS fix will use as well. - ublk cancelation fixes. - ublk selftest additions and fixes. - NVMe pull via Christoph: - fix an out-of-bounds access in nvmet_enable_port (Richard Weinberger) tag 'block-6.15-20250424' of git://git.kernel.dk/linux: ublk: fix race between io_uring_cmd_complete_in_task and ublk_cancel_cmd ublk: call ublk_dispatch_req() for handling UBLK_U_IO_NEED_GET_DATA block: don't autoload drivers on blk-cgroup configuration block: don't autoload drivers on stat block: remove the backing_inode variable in bdev_statx block: move blkdev_{get,put} _no_open prototypes out of blkdev.h block: never reduce ra_pages in blk_apply_bdi_limits selftests: ublk: common: fix _get_disk_dev_t for pre-9.0 coreutils selftests: ublk: remove useless 'delay_us' from 'struct dev_ctx' selftests: ublk: fix recover test block: hoist block size validation code to a separate function block: fix race between set_blocksize and read paths nvmet: fix out-of-bounds access in nvmet_enable_port	2025-04-25 11:34:39 -07:00
Jens Axboe	bf4b8794de	Merge branch 'block-6.15' into for-6.16/block Merge 6.15 block fixes - both to get the fixes causing issues with XFS testing, but also to make it easier for 6.16 ublk patches to avoid conflicts. * block-6.15: ublk: fix race between io_uring_cmd_complete_in_task and ublk_cancel_cmd ublk: call ublk_dispatch_req() for handling UBLK_U_IO_NEED_GET_DATA block: don't autoload drivers on blk-cgroup configuration block: don't autoload drivers on stat block: remove the backing_inode variable in bdev_statx block: move blkdev_{get,put} _no_open prototypes out of blkdev.h block: never reduce ra_pages in blk_apply_bdi_limits selftests: ublk: common: fix _get_disk_dev_t for pre-9.0 coreutils selftests: ublk: remove useless 'delay_us' from 'struct dev_ctx' selftests: ublk: fix recover test block: hoist block size validation code to a separate function block: fix race between set_blocksize and read paths nvmet: fix out-of-bounds access in nvmet_enable_port	2025-04-24 20:41:11 -06:00
Christoph Hellwig	c4d2519c6a	block: don't autoload drivers on blk-cgroup configuration Loading a driver just to configure blk-cgroup doesn't make sense, as that assumes and already existing device. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20250423053810.1683309-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-24 07:35:23 -06:00
Christoph Hellwig	5f33b5226c	block: don't autoload drivers on stat blkdev_get_no_open can trigger the legacy autoload of block drivers. A simple stat of a block device has not historically done that, so disable this behavior again. Fixes: `9abcfbd235` ("block: Add atomic write support for statx") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20250423053810.1683309-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-24 07:35:23 -06:00
Christoph Hellwig	d13b7090b2	block: remove the backing_inode variable in bdev_statx backing_inode is only used once, so remove it and update the comment describing the bdev lookup to be a bit more clear. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20250423053810.1683309-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-24 07:35:09 -06:00
Christoph Hellwig	c63202140d	block: move blkdev_{get,put} _no_open prototypes out of blkdev.h These are only to be used by block internal code. Remove the comment as we grew more users due to reworking block device node opening. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20250423053810.1683309-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-24 07:33:38 -06:00
Christoph Hellwig	7b720c7202	block: never reduce ra_pages in blk_apply_bdi_limits When the user increased the read-ahead size through sysfs this value currently get lost if the device is reprobe, including on a resume from suspend. As there is no hardware limitation for the read-ahead size there is no real need to reset it or track a separate hardware limitation like for max_sectors. This restores the pre-atomic queue limit behavior in the sd driver as sd did not use blk_queue_io_opt and thus never updated the read ahead size to the value based of the optimal I/O, but changes behavior for all other drivers. As the new behavior seems useful and sd is the driver for which the readahead size tweaks are most useful that seems like a worthwhile trade off. Fixes: `804e498e04` ("sd: convert to the atomic queue limits API") Reported-by: Holger Hoffstätte <holger@applied-asynchrony.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20250424082521.1967286-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-24 07:32:17 -06:00
Darrick J. Wong	e03463d247	block: hoist block size validation code to a separate function Hoist the block size validation code to bdev_validate_blocksize so that we can call it from filesystems that don't care about the bdev pagecache manipulations of set_blocksize. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/174543795720.4139148.840349813093799165.stgit@frogsfrogsfrogs Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-23 13:58:06 -06:00
Darrick J. Wong	c0e473a0d2	block: fix race between set_blocksize and read paths With the new large sector size support, it's now the case that set_blocksize can change i_blksize and the folio order in a manner that conflicts with a concurrent reader and causes a kernel crash. Specifically, let's say that udev-worker calls libblkid to detect the labels on a block device. The read call can create an order-0 folio to read the first 4096 bytes from the disk. But then udev is preempted. Next, someone tries to mount an 8k-sectorsize filesystem from the same block device. The filesystem calls set_blksize, which sets i_blksize to 8192 and the minimum folio order to 1. Now udev resumes, still holding the order-0 folio it allocated. It then tries to schedule a read bio and do_mpage_readahead tries to create bufferheads for the folio. Unfortunately, blocks_per_folio == 0 because the page size is 4096 but the blocksize is 8192 so no bufferheads are attached and the bh walk never sets bdev. We then submit the bio with a NULL block device and crash. Therefore, truncate the page cache after flushing but before updating i_blksize. However, that's not enough -- we also need to lock out file IO and page faults during the update. Take both the i_rwsem and the invalidate_lock in exclusive mode for invalidations, and in shared mode for read/write operations. I don't know if this is the correct fix, but xfs/259 found it. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Link: https://lore.kernel.org/r/174543795699.4139148.2086129139322431423.stgit@frogsfrogsfrogs Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-23 13:58:06 -06:00
Jens Axboe	033b667a82	block: blk-rq-qos: guard rq-qos helpers by static key Even if blk-rq-qos isn't used or configured, dipping into the queue to fetch ->rq_qos is a noticeable slowdown and visible in profiles. Add an unlikely static key around blk-rq-qos, to avoid fetching this cacheline if blk-iolatency or blk-wbt isn't configured or used. Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-21 05:07:03 -06:00
Jens Axboe	9b79f86e06	block: ensure that struct blk_mq_alloc_data is fully initialized On x86, rep stos will be emitted to clear the the blk_mq_alloc_data struct, as not all members are being explicitly initialied. Depending on the type of CPU, this is a noticeable slowdown compared to just ensuring that the struct is fully initialized when setup. For the 4 spots that setup a struct blk_mq_alloc_data on the stack, ensure all members are being initialized. Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-21 05:07:02 -06:00
Bart Van Assche	e093b784ab	block: Simplify blk_mq_dispatch_rq_list() and its callers The 'nr_budgets' argument of blk_mq_dispatch_rq_list() is either the number of elements in the 'list' argument or zero. Instead of passing the number of list elements to blk_mq_dispatch_rq_list(), pass a boolean argument that indicates whether or not blk_mq_dispatch_rq_list() should request the block driver for a budget for each request in 'list'. Remove the code for counting list elements from blk_mq_dispatch_rq_list() callers where possible. Remove the code that decrements nr_budgets from blk_mq_dispatch_rq_list() because it is superfluous. Each request that is processed by blk_mq_dispatch_rq_list() is in one of these two states if 'get_budget' is false: * Either the request is on 'list' and the budget for the request has to be released from the error path. * Or the request is not on 'list' and q->mq_ops->queue_rq() has already released the budget (ret != BLK_STS_OK) or q->mq_ops->queue_rq() will release the budget asynchronously (ret == BLK_STS_OK). Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: John Garry <john.g.garry@oracle.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20250415205134.3650042-1-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-21 05:07:02 -06:00
Linus Torvalds	119009db26	vfs-6.15-rc3.fixes.2 -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaAQM5QAKCRCRxhvAZXjc olcwAP0RETZn15Jkt5+mKjcx99fuVE7je3lp56UH4Y4XjZmthgEA1n65RDr4Tq6E 548A2/9Hnt4NWdvoi9VhrG4+5dNRowM= =cFFa -----END PGP SIGNATURE----- Merge tag 'vfs-6.15-rc3.fixes.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: - Revert the hfs{plus} deprecation warning that's also included in this pull request. The commit introducing the deprecation warning resides rather early in this branch. So simply dropping it would've rebased all other commits which I decided to avoid. Hence the revert in the same branch [ Background - the deprecation warning discussion resulted in people stepping up, and so hfs{plus} will have a maintainer taking care of it after all.. - Linus ] - Switch CONFIG_SYSFS_SYCALL default to n and decouple from CONFIG_EXPERT - Fix an audit bug caused by changes to our kernel path lookup helpers this cycle. Audit needs the parent path even if the dentry it tried to look up is negative - Ensure that the kernel path lookup helpers leave the passed in path argument clean when they return an error. This is consistent with all our other helpers - Ensure that vfs_getattr_nosec() calls bdev_statx() so the relevant information is available to kernel consumers as well - Don't set a timer and call schedule() if the timer will expire immediately in epoll - Make netfs lookup tables with __nonstring * tag 'vfs-6.15-rc3.fixes.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: Revert "hfs{plus}: add deprecation warning" fs: move the bdex_statx call to vfs_getattr_nosec netfs: Mark __nonstring lookup tables eventpoll: Set epoll timeout if it's in the future fs: ensure that path_locked() helpers leave passed path pristine fs: add kern_path_locked_negative() hfs{plus}: add deprecation warning Kconfig: switch CONFIG_SYSFS_SYCALL default to n	2025-04-19 14:31:08 -07:00
Linus Torvalds	f7c2ca2584	block-6.15-20250417 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmgBYN8QHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgprhiD/9LnoTEqPdDGaE/f/1yUgYcJL8IUSMTfpyB JajQp5klWNHmIyD/GdzFq+6SL/XDAllO/NgVdQlI+78s5GRn7A/fy3unmB3kYhs/ Spz9reD7/wH6lp2/u5jKD0Dk3Wz9LCGAUxQ2QYtd9lXJo/Dem3roBpty6/GYvLQy 3kSwa3e4dekd9jBZ+lbnSaFcvQg3Xc/x+SoP3r60wrMIEOyJrHLHWhLMolC/ZkGw sl1nvj4dnRAK77G7KPctYIu6K7ryJwQLJhBre7t5Fd4Dzn46l/sNwOkBn7hhdaTR e3+F7C1D22zIHFrknkm1+9KkZA/9tIz1CUYRlYCxGPsH1XP4dy78uTk7sgGORV9C 0gFJ3nqzSu0LP3Mk06e2DH+Oqq0wtdnggxmAXjJhah9JFrP7H9bEi4lTEsJ6XjLV PCL4PYGEkrJp7faD0p2icq6GKwx/EINlCob6Cx0h+lNo/Crz0FjkPNZkLTiYLahc S8Wlc6xMiMtRxdH3LX8ptAGot2s3uTQiNIKmkPkzIiSDoUoZbao1oknm8tpmXa1x Wg6bmOj5Jbd1K+Gyu24rIxW7RVkXtfB63o5ScRu+YGXhulsnV2mCPXZ2qxlW3s51 zZcHUNQPAgmBdf/qzkNbk4fPS2G1rC6eJOLn84B4E5PWbP0xFjv6FdEwPF/ovdb8 aIyR3vSjyA== =YCi8 -----END PGP SIGNATURE----- Merge tag 'block-6.15-20250417' of git://git.kernel.dk/linux Pull block fixes from Jens Axboe: - MD pull via Yu: - fix raid10 missing discard IO accounting (Yu Kuai) - fix bitmap stats for bitmap file (Zheng Qixing) - fix oops while reading all member disks failed during check/repair (Meir Elisha) - NVMe pull via Christoph: - fix scan failure for non-ANA multipath controllers (Hannes Reinecke) - fix multipath sysfs links creation for some cases (Hannes Reinecke) - PCIe endpoint fixes (Damien Le Moal) - use NULL instead of 0 in the auth code (Damien Le Moal) - Various ublk fixes: - Slew of selftest additions - Improvements and fixes for IO cancelation - Tweak to Kconfig verbiage - Fix for page dirtying for blk integrity mapped pages - loop fixes: - buffered IO fix - uevent fixes - request priority inheritance fix - Various little fixes * tag 'block-6.15-20250417' of git://git.kernel.dk/linux: (38 commits) selftests: ublk: add generic_06 for covering fault inject ublk: simplify aborting ublk request ublk: remove __ublk_quiesce_dev() ublk: improve detection and handling of ublk server exit ublk: move device reset into ublk_ch_release() ublk: rely on ->canceling for dealing with ublk_nosrv_dev_should_queue_io ublk: add ublk_force_abort_dev() ublk: properly serialize all FETCH_REQs selftests: ublk: move creating UBLK_TMP into _prep_test() selftests: ublk: add test_stress_05.sh selftests: ublk: support user recovery selftests: ublk: support target specific command line selftests: ublk: increase max nr_queues and queue depth selftests: ublk: set queue pthread's cpu affinity selftests: ublk: setup ring with IORING_SETUP_SINGLE_ISSUER/IORING_SETUP_DEFER_TASKRUN selftests: ublk: add two stress tests for zero copy feature selftests: ublk: run stress tests in parallel selftests: ublk: make sure _add_ublk_dev can return in sub-shell selftests: ublk: cleanup backfile automatically selftests: ublk: add io_uring uapi header ...	2025-04-18 09:21:14 -07:00
Christoph Hellwig	777d0961ff	fs: move the bdex_statx call to vfs_getattr_nosec Currently bdex_statx is only called from the very high-level vfs_statx_path function, and thus bypassing it for in-kernel calls to vfs_getattr or vfs_getattr_nosec. This breaks querying the block ѕize of the underlying device in the loop driver and also is a pitfall for any other new kernel caller. Move the call into the lowest level helper to ensure all callers get the right results. Fixes: `2d985f8c6b` ("vfs: support STATX_DIOALIGN on block devices") Fixes: `f4774e92aa` ("loop: take the file system minimum dio alignment into account") Reported-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/20250417064042.712140-1-hch@lst.de Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-04-17 10:14:34 +02:00
Martin K. Petersen	39e1605051	block: integrity: Do not call set_page_dirty_lock() Placing multiple protection information buffers inside the same page can lead to oopses because set_page_dirty_lock() can't be called from interrupt context. Since a protection information buffer is not backed by a file there is no point in setting its page dirty, there is nothing to synchronize. Drop the call to set_page_dirty_lock() and remove the last argument to bio_integrity_unpin_bvec(). Cc: stable@vger.kernel.org Fixes: `492c5d4559` ("block: bio-integrity: directly map user buffers") Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/yq1v7r3ev9g.fsf@ca-mkp.ca.oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-16 14:16:48 -06:00
Zheng Qixing	40f2eb9b53	block: fix resource leak in blk_register_queue() error path When registering a queue fails after blk_mq_sysfs_register() is successful but the function later encounters an error, we need to clean up the blk_mq_sysfs resources. Add the missing blk_mq_sysfs_unregister() call in the error path to properly clean up these resources and prevent a memory leak. Fixes: `320ae51fee` ("blk-mq: new multi-queue block IO queueing mechanism") Signed-off-by: Zheng Qixing <zhengqixing@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20250412092554.475218-1-zhengqixing@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-14 08:28:26 -06:00
Bird, Tim	1b4194053f	block: add SPDX header line to blk-throttle.h Add an SPDX license identifier line to blk-throttle.h Use 'GPL-2.0' as the identifier, since blk-throttle.c uses that, and blk.h (from which some material was copied when blk-throttle.h was created) also uses that identifier. Signed-off-by: Tim Bird <tim.bird@sony.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/MW5PR13MB5632EE4645BCA24ED111EC0EFDB62@MW5PR13MB5632.namprd13.prod.outlook.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-14 08:28:09 -06:00
Matthew Wilcox (Oracle)	84798514db	mm: Remove swap_writepage() and shmem_writepage() Call swap_writeout() and shmem_writeout() from pageout() instead. Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org> Link: https://lore.kernel.org/r/20250402150005.2309458-9-willy@infradead.org Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-04-07 09:36:50 +02:00
Thomas Gleixner	8fa7292fee	treewide: Switch/rename to timer_delete[_sync]() timer_delete[_sync]() replaces del_timer[_sync](). Convert the whole tree over and remove the historical wrapper inlines. Conversion was done with coccinelle plus manual fixups where necessary. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>	2025-04-05 10:30:12 +02:00
JP Kobryn	a97915559f	cgroup: change rstat function signatures from cgroup-based to css-based This non-functional change serves as preparation for moving to subsystem-based rstat trees. To simplify future commits, change the signatures of existing cgroup-based rstat functions to become css-based and rename them to reflect that. Though the signatures have changed, the implementations have not. Within these functions use the css->cgroup pointer to obtain the associated cgroup and allow code to function the same just as it did before this patch. At applicable call sites, pass the subsystem-specific css pointer as an argument or pass a pointer to cgroup::self if not in subsystem context. Note that cgroup_rstat_updated_list() and cgroup_rstat_push_children() are not altered yet since there would be a larger amount of css to cgroup conversions which may overcomplicate the code at this intermediate phase. Signed-off-by: JP Kobryn <inwardvessel@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-04-04 10:06:25 -10:00
Ming Lei	01b91bf14f	block: don't grab elevator lock during queue initialization ->elevator_lock depends on queue freeze lock, see block/blk-sysfs.c. queue freeze lock depends on fs_reclaim. So don't grab elevator lock during queue initialization which needs to call kmalloc(GFP_KERNEL), and we can cut the dependency between ->elevator_lock and fs_reclaim, then the lockdep warning can be killed. This way is safe because elevator setting isn't ready to run during queue initialization. There isn't such issue in __blk_mq_update_nr_hw_queues() because memalloc_noio_save() is called before acquiring elevator lock. Fixes the following lockdep warning: https://lore.kernel.org/linux-block/67e6b425.050a0220.2f068f.007b.GAE@google.com/ Reported-by: syzbot+4c7e0f9b94ad65811efb@syzkaller.appspotmail.com Cc: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250403105402.1334206-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-03 08:32:03 -06:00
Nitesh Shetty	e3e68311ea	block: remove unused nseg parameter We are no longer using nr_segs, after blk_mq_attempt_bio_merge was moved out of blk_mq_get_new_request. Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com> Link: https://lore.kernel.org/r/20250401044348.15588-1-nj.shetty@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-04-01 07:21:35 -06:00
Linus Torvalds	9b960d8cd6	for-6.15/block-20250322 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmfe8BkQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpvTqD/4pOeGi/QfLyocn4TcJcidRGZAvBxecTVuM upeyr+dCyCi9Wk+EJKeAFooGe15upzxDxKj06HhCixaLx4etDK78uGV4FMM1Z4oa 2dtchz1Zd0HyBPgQIUY8OuOgbS7tstMS/KdvL+gr5IjfapeTF+54WVLCD8eVyvO/ vUIppgJBhrqy2qui4xF2lw4t2COt+/PqinGQuYALn4V4Po9NWA7lSh3ZI4F/byj1 v68jXyt2fqCAyxwkzRDv4GxhN8c6W+TPJpzivrEAuSkLacovESKztinOrafrBnLR zdyO4n0V0yGOXbAcxRbADVA4HUsqhLl4JRnvE5P5zIaD7rkE0UqggF7vrSeCvVA1 hsi1BhkAMNimKX7CZMnT3dJpxRQj1eDJxpwUAusLHWjMyQbNFhV7WAtthMtVJon8 lAS4e5+xzjqKhF15GpVg5Lzy8SAwdqgNXwwq2zbM8OaPKG0FpajG8DXAqqcj4fpy WXnwg72KZDmRcSNJhVZK6B9xSAwIMXPgH4ClCMP9/xlw8EDpM38MDmzrs35TAVtI HGE3Qv9CjFjVj/OG3el+bTGIQJFVgYEVPV5TYfNCpKoxpj5cLn5OQY5u6MJawtgK HeDgKv3jw3lHatDALMVfwJqqVlUht0R6SIxtP9WHV+CcFrqN1LJKmdhDQbm7b4XK EbbawIsdxw== =Ci5m -----END PGP SIGNATURE----- Merge tag 'for-6.15/block-20250322' of git://git.kernel.dk/linux Pull block updates from Jens Axboe: - Fixes for integrity handling - NVMe pull request via Keith: - Secure concatenation for TCP transport (Hannes) - Multipath sysfs visibility (Nilay) - Various cleanups (Qasim, Baruch, Wang, Chen, Mike, Damien, Li) - Correct use of 64-bit BARs for pci-epf target (Niklas) - Socket fix for selinux when used in containers (Peijie) - MD pull request via Yu: - fix recovery can preempt resync (Li Nan) - fix md-bitmap IO limit (Su Yue) - fix raid10 discard with REQ_NOWAIT (Xiao Ni) - fix raid1 memory leak (Zheng Qixing) - fix mddev uaf (Yu Kuai) - fix raid1,raid10 IO flags (Yu Kuai) - some refactor and cleanup (Yu Kuai) - Series cleaning up and fixing bugs in the bad block handling code - Improve support for write failure simulation in null_blk - Various lock ordering fixes - Fixes for locking for debugfs attributes - Various ublk related fixes and improvements - Cleanups for blk-rq-qos wait handling - blk-throttle fixes - Fixes for loop dio and sync handling - Fixes and cleanups for the auto-PI code - Block side support for hardware encryption keys in blk-crypto - Various cleanups and fixes * tag 'for-6.15/block-20250322' of git://git.kernel.dk/linux: (105 commits) nvmet: replace max(a, min(b, c)) by clamp(val, lo, hi) nvme-tcp: fix selinux denied when calling sock_sendmsg nvmet: pci-epf: Always configure BAR0 as 64-bit nvmet: Remove duplicate uuid_copy nvme: zns: Simplify nvme_zone_parse_entry() nvmet: pci-epf: Remove redundant 'flush_workqueue()' calls nvmet-fc: Remove unused functions nvme-pci: remove stale comment nvme-fc: Utilise min3() to simplify queue count calculation nvme-multipath: Add visibility for queue-depth io-policy nvme-multipath: Add visibility for numa io-policy nvme-multipath: Add visibility for round-robin io-policy nvmet: add tls_concat and tls_key debugfs entries nvmet-tcp: support secure channel concatenation nvmet: Add 'sq' argument to alloc_ctrl_args nvme-fabrics: reset admin connection for secure concatenation nvme-tcp: request secure channel concatenation nvme-keyring: add nvme_tls_psk_refresh() nvme: add nvme_auth_derive_tls_psk() nvme: add nvme_auth_generate_digest() ...	2025-03-26 18:08:55 -07:00
Linus Torvalds	ee6740fd34	CRC updates for 6.15 Another set of improvements to the kernel's CRC (cyclic redundancy check) code: - Rework the CRC64 library functions to be directly optimized, like what I did last cycle for the CRC32 and CRC-T10DIF library functions. - Rewrite the x86 PCLMULQDQ-optimized CRC code, and add VPCLMULQDQ support and acceleration for crc64_be and crc64_nvme. - Rewrite the riscv Zbc-optimized CRC code, and add acceleration for crc_t10dif, crc64_be, and crc64_nvme. - Remove crc_t10dif and crc64_rocksoft from the crypto API, since they are no longer needed there. - Rename crc64_rocksoft to crc64_nvme, as the old name was incorrect. - Add kunit test cases for crc64_nvme and crc7. - Eliminate redundant functions for calculating the Castagnoli CRC32, settling on just crc32c(). - Remove unnecessary prompts from some of the CRC kconfig options. - Further optimize the x86 crc32c code. -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCZ+CGGhQcZWJpZ2dlcnNA Z29vZ2xlLmNvbQAKCRDzXCl4vpKOK3wRAP4tbnzawUmlIHIF0hleoADXehUgAhMt NZn15mGvyiuwIQEA8W9qvnLdFXZkdxhxAEvDDFjyrRauL6eGtr/GvCx4AQY= =wmKG -----END PGP SIGNATURE----- Merge tag 'crc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux Pull CRC updates from Eric Biggers: "Another set of improvements to the kernel's CRC (cyclic redundancy check) code: - Rework the CRC64 library functions to be directly optimized, like what I did last cycle for the CRC32 and CRC-T10DIF library functions - Rewrite the x86 PCLMULQDQ-optimized CRC code, and add VPCLMULQDQ support and acceleration for crc64_be and crc64_nvme - Rewrite the riscv Zbc-optimized CRC code, and add acceleration for crc_t10dif, crc64_be, and crc64_nvme - Remove crc_t10dif and crc64_rocksoft from the crypto API, since they are no longer needed there - Rename crc64_rocksoft to crc64_nvme, as the old name was incorrect - Add kunit test cases for crc64_nvme and crc7 - Eliminate redundant functions for calculating the Castagnoli CRC32, settling on just crc32c() - Remove unnecessary prompts from some of the CRC kconfig options - Further optimize the x86 crc32c code" * tag 'crc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux: (36 commits) x86/crc: drop the avx10_256 functions and rename avx10_512 to avx512 lib/crc: remove unnecessary prompt for CONFIG_CRC64 lib/crc: remove unnecessary prompt for CONFIG_LIBCRC32C lib/crc: remove unnecessary prompt for CONFIG_CRC8 lib/crc: remove unnecessary prompt for CONFIG_CRC7 lib/crc: remove unnecessary prompt for CONFIG_CRC4 lib/crc7: unexport crc7_be_syndrome_table lib/crc_kunit.c: update comment in crc_benchmark() lib/crc_kunit.c: add test and benchmark for crc7_be() x86/crc32: optimize tail handling for crc32c short inputs riscv/crc64: add Zbc optimized CRC64 functions riscv/crc-t10dif: add Zbc optimized CRC-T10DIF function riscv/crc32: reimplement the CRC32 functions using new template riscv/crc: add "template" for Zbc optimized CRC functions x86/crc: add ANNOTATE_NOENDBR to suppress objtool warnings x86/crc32: improve crc32c_arch() code generation with clang x86/crc64: implement crc64_be and crc64_nvme using new template x86/crc-t10dif: implement crc_t10dif using new template x86/crc32: implement crc32_le using new template x86/crc: add "template" for [V]PCLMULQDQ based CRC functions ...	2025-03-25 18:33:04 -07:00
Linus Torvalds	a50b4fe095	A treewide hrtimer timer cleanup hrtimers are initialized with hrtimer_init() and a subsequent store to the callback pointer. This turned out to be suboptimal for the upcoming Rust integration and is obviously a silly implementation to begin with. This cleanup replaces the hrtimer_init(T); T->function = cb; sequence with hrtimer_setup(T, cb); The conversion was done with Coccinelle and a few manual fixups. Once the conversion has completely landed in mainline, hrtimer_init() will be removed and the hrtimer::function becomes a private member. -----BEGIN PGP SIGNATURE----- iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmff5jQTHHRnbHhAbGlu dXRyb25peC5kZQAKCRCmGPVMDXSYoVvRD/wKtuwmiA66NJFgXC0qVq82A6fO3bY8 GBdbfysDJIbqGu5PTcULTbJ8qkqv3jeLUv6CcXvS4sZ7y/uJQl2lzf8yrD/0bbwc rLI6sHiPSZmK93kNVN4X5H7kvt7cE/DYC9nnEOgK3BY5FgKc4n9887d4aVBhL8Lv ODwVXvZ+xi351YCj7qRyPU24zt/p4tkkT1o2k4a0HBluqLI0D+V20fke9IERUL8r d1uWKlcn0TqYDesE8HXKIhbst3gx52rMJrXBJDHwFmG6v8Pj1fkTXCVpPo8QcBz8 OTVkpomN9f/Tx4+GZwhZOF86LhLL3OhxD6pT7JhFCXdmSGv+Ez8uyk1YZysM/XpV Juy/1yAcBpDIDkmhMFGdAAn48Nn9Fotty0r4je60zSEp1d/4QMXcFme29qr2JTUE iWnQ/HD6DxUjVHqy7CYvvo26Xegg1C7qgyOVt4PYZwAM1VKF5P3kzYTb4SAdxtop Tpji1sfW9QV08jqMNo6XntD32DSP9S2HqjO9LwBw700jnx2jjJ35fcJs6iodMOUn gckIZLMn3L0OoglPdyA5O7SNTbKE7aFiRKdnT/cJtR3Fa39Qu27CwC5gfiyuie9I Q+LG8GLuYSBHXAR+PBK4GWlzJ7Dn8k3eqmbnLeKpRMsU6ZzcttgA64xhaviN2wN0 iJbvLJeisXr3GA== =bYAX -----END PGP SIGNATURE----- Merge tag 'timers-cleanups-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer cleanups from Thomas Gleixner: "A treewide hrtimer timer cleanup hrtimers are initialized with hrtimer_init() and a subsequent store to the callback pointer. This turned out to be suboptimal for the upcoming Rust integration and is obviously a silly implementation to begin with. This cleanup replaces the hrtimer_init(T); T->function = cb; sequence with hrtimer_setup(T, cb); The conversion was done with Coccinelle and a few manual fixups. Once the conversion has completely landed in mainline, hrtimer_init() will be removed and the hrtimer::function becomes a private member" * tag 'timers-cleanups-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (100 commits) wifi: rt2x00: Switch to use hrtimer_update_function() io_uring: Use helper function hrtimer_update_function() serial: xilinx_uartps: Use helper function hrtimer_update_function() ASoC: fsl: imx-pcm-fiq: Switch to use hrtimer_setup() RDMA: Switch to use hrtimer_setup() virtio: mem: Switch to use hrtimer_setup() drm/vmwgfx: Switch to use hrtimer_setup() drm/xe/oa: Switch to use hrtimer_setup() drm/vkms: Switch to use hrtimer_setup() drm/msm: Switch to use hrtimer_setup() drm/i915/request: Switch to use hrtimer_setup() drm/i915/uncore: Switch to use hrtimer_setup() drm/i915/pmu: Switch to use hrtimer_setup() drm/i915/perf: Switch to use hrtimer_setup() drm/i915/gvt: Switch to use hrtimer_setup() drm/i915/huc: Switch to use hrtimer_setup() drm/amdgpu: Switch to use hrtimer_setup() stm class: heartbeat: Switch to use hrtimer_setup() i2c: Switch to use hrtimer_setup() iio: Switch to use hrtimer_setup() ...	2025-03-25 10:54:15 -07:00
Linus Torvalds	94dc216ad8	cgroup: Changes for v6.15 - Add deprecation info messages to cgroup1-only features. - rstat updates including a bug fix and breaking up a critical section to reduce interrupt latency impact. - Other misc and doc updates. -----BEGIN PGP SIGNATURE----- iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZ9xO2g4cdGpAa2VybmVs Lm9yZwAKCRCxYfJx3gVYGQz4AQDeWKmngRsnddEMkqOV1ArwXSr+8xUQrvCBx0RL vcjOQQEAusGCTeGXWJ96kw+N9BXvGwFsfSeoxjOqAnvrBS1EgAc= =WvJg -----END PGP SIGNATURE----- Merge tag 'cgroup-for-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: - Add deprecation info messages to cgroup1-only features - rstat updates including a bug fix and breaking up a critical section to reduce interrupt latency impact - Other misc and doc updates * tag 'cgroup-for-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: rstat: Cleanup flushing functions and locking cgroup/rstat: avoid disabling irqs for O(num_cpu) mm: Fix a build breakage in memcontrol-v1.c blk-cgroup: Simplify policy files registration cgroup: Update file naming comment cgroup: Add deprecation message to legacy freezer controller mm: Add transformation message for per-memcg swappiness RFC cgroup/cpuset-v1: Add deprecation messages to sched_relax_domain_level cgroup/cpuset-v1: Add deprecation messages to memory_migrate cgroup/cpuset-v1: Add deprecation messages to mem_exclusive and mem_hardwall cgroup: Print message when /proc/cgroups is read on v2-only system cgroup/blkio: Add deprecation messages to reset_stats cgroup/cpuset-v1: Add deprecation messages to memory_spread_page and memory_spread_slab cgroup/cpuset-v1: Add deprecation messages to sched_load_balance and memory_pressure_enabled cgroup, docs: Be explicit about independence of RT_GROUP_SCHED and non-cpu controllers cgroup/rstat: Fix forceidle time in cpu.stat cgroup/misc: Remove unused misc_cg_res_total_usage cgroup/cpuset: Move procfs cpuset attribute under cgroup-v1.c cgroup: update comment about dropping cgroup kn refs	2025-03-24 16:49:40 -07:00
Linus Torvalds	e41170cc5e	vfs-6.15-rc1.pagesize -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZ90rxAAKCRCRxhvAZXjc ooIPAQCwMjDjtWegvBy8kefiRw+fa4z3ZWHrwRT9DJrD/K9WyAD+JVd0ou27SVpQ jKpRSRct2eTbyxdYiGydHQGm5F5sLg4= =0FyQ -----END PGP SIGNATURE----- Merge tag 'vfs-6.15-rc1.pagesize' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs pagesize updates from Christian Brauner: "This enables block sizes greater than the page size for block devices. With this we can start supporting block devices with logical block sizes larger than 4k. It also allows to lift the device cache sector size support to 64k. This allows filesystems which can use larger sector sizes up to 64k to ensure that the filesystem will not generate writes that are smaller than the specified sector size" * tag 'vfs-6.15-rc1.pagesize' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: bdev: add back PAGE_SIZE block size validation for sb_set_blocksize() bdev: use bdev_io_min() for statx block size block/bdev: lift block size restrictions to 64k block/bdev: enable large folio support for large logical block sizes fs/buffer fs/mpage: remove large folio restriction fs/mpage: use blocks_per_folio instead of blocks_per_page fs/mpage: avoid negative shift for large blocksize fs/buffer: remove batching from async read fs/buffer: simplify block_read_full_folio() with bh_offset()	2025-03-24 12:01:29 -07:00
Jens Axboe	03c90afb21	block/blk-iocost: ensure 'ret' is set on error In case blkg_conf_open_bdev_frozen() fails, ioc_qos_write() jumps to the error path without assigning a value to 'ret'. Ensure that it inherits the error from the passed back error value. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202503200454.QWpwKeJu-lkp@intel.com/ Fixes: `9730763f47` ("block: correct locking order for protecting blk-wbt parameters") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-19 14:51:36 -06:00
Nilay Shroff	9730763f47	block: correct locking order for protecting blk-wbt parameters The commit '245618f8e45f ("block: protect wbt_lat_usec using q-> elevator_lock")' introduced q->elevator_lock to protect updates to blk-wbt parameters when writing to the sysfs attribute wbt_ lat_usec and the cgroup attribute io.cost.qos. However, both these attributes also acquire q->rq_qos_mutex, leading to the following lockdep warning: ====================================================== WARNING: possible circular locking dependency detected 6.14.0-rc5+ #138 Not tainted ------------------------------------------------------ bash/5902 is trying to acquire lock: c000000085d495a0 (&q->rq_qos_mutex){+.+.}-{4:4}, at: wbt_init+0x164/0x238 but task is already holding lock: c000000085d498c8 (&q->elevator_lock){+.+.}-{4:4}, at: queue_wb_lat_store+0xb0/0x20c which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (&q->elevator_lock){+.+.}-{4:4}: __mutex_lock+0xf0/0xa58 ioc_qos_write+0x16c/0x85c cgroup_file_write+0xc4/0x32c kernfs_fop_write_iter+0x1b8/0x29c vfs_write+0x410/0x584 ksys_write+0x84/0x140 system_call_exception+0x134/0x360 system_call_vectored_common+0x15c/0x2ec -> #0 (&q->rq_qos_mutex){+.+.}-{4:4}: __lock_acquire+0x1b6c/0x2ae0 lock_acquire+0x140/0x430 __mutex_lock+0xf0/0xa58 wbt_init+0x164/0x238 queue_wb_lat_store+0x1dc/0x20c queue_attr_store+0x12c/0x164 sysfs_kf_write+0x6c/0xb0 kernfs_fop_write_iter+0x1b8/0x29c vfs_write+0x410/0x584 ksys_write+0x84/0x140 system_call_exception+0x134/0x360 system_call_vectored_common+0x15c/0x2ec other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&q->elevator_lock); lock(&q->rq_qos_mutex); lock(&q->elevator_lock); lock(&q->rq_qos_mutex); * DEADLOCK * 6 locks held by bash/5902: #0: c000000051122400 (sb_writers#3){.+.+}-{0:0}, at: ksys_write+0x84/0x140 #1: c00000007383f088 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x174/0x29c #2: c000000008550428 (kn->active#182){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x180/0x29c #3: c000000085d493a8 (&q->q_usage_counter(io)#5){++++}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0x28/0x40 #4: c000000085d493e0 (&q->q_usage_counter(queue)#5){++++}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0x28/0x40 #5: c000000085d498c8 (&q->elevator_lock){+.+.}-{4:4}, at: queue_wb_lat_store+0xb0/0x20c stack backtrace: CPU: 17 UID: 0 PID: 5902 Comm: bash Kdump: loaded Not tainted 6.14.0-rc5+ #138 Hardware name: IBM,9043-MRX POWER10 (architected) 0x800200 0xf000006 of:IBM,FW1060.00 (NM1060_028) hv:phyp pSeries Call Trace: [c0000000721ef590] [c00000000118f8a8] dump_stack_lvl+0x108/0x18c (unreliable) [c0000000721ef5c0] [c00000000022563c] print_circular_bug+0x448/0x604 [c0000000721ef670] [c000000000225a44] check_noncircular+0x24c/0x26c [c0000000721ef740] [c00000000022bf28] __lock_acquire+0x1b6c/0x2ae0 [c0000000721ef870] [c000000000229240] lock_acquire+0x140/0x430 [c0000000721ef970] [c0000000011cfbec] __mutex_lock+0xf0/0xa58 [c0000000721efaa0] [c00000000096c46c] wbt_init+0x164/0x238 [c0000000721efaf0] [c0000000008f8cd8] queue_wb_lat_store+0x1dc/0x20c [c0000000721efb50] [c0000000008f8fa0] queue_attr_store+0x12c/0x164 [c0000000721efc60] [c0000000007c11cc] sysfs_kf_write+0x6c/0xb0 [c0000000721efca0] [c0000000007bfa4c] kernfs_fop_write_iter+0x1b8/0x29c [c0000000721efcf0] [c0000000006a281c] vfs_write+0x410/0x584 [c0000000721efdc0] [c0000000006a2cc8] ksys_write+0x84/0x140 [c0000000721efe10] [c000000000031b64] system_call_exception+0x134/0x360 [c0000000721efe50] [c00000000000cedc] system_call_vectored_common+0x15c/0x2ec >From the above log it's apparent that method which writes to sysfs attr wbt_lat_usec acquires q->elevator_lock first, and then acquires q->rq_ qos_mutex. However the another method which writes to io.cost.qos, acquires q->rq_qos_mutex first, and then acquires q->rq_qos_mutex. So this could potentially cause the deadlock. A closer look at ioc_qos_write shows that correcting the lock order is non-trivial because q->rq_qos_mutex is acquired in blkg_conf_open_bdev and released in blkg_conf_exit. The function blkg_conf_open_bdev is responsible for parsing user input and finding the corresponding block device (bdev) from the user provided major:minor number. Since we do not know the bdev until blkg_conf_open_bdev completes, we cannot simply move q->elevator_lock acquisition before blkg_conf_open_ bdev. So to address this, we intoduce new helpers blkg_conf_open_bdev_ frozen and blkg_conf_exit_frozen which are just wrappers around blkg_ conf_open_bdev and blkg_conf_exit respectively. The helper blkg_conf_ open_bdev_frozen is similar to blkg_conf_open_bdev, but additionally freezes the queue, acquires q->elevator_lock and ensures the correct locking order is followed between q->elevator_lock and q->rq_qos_mutex. Similarly another helper blkg_conf_exit_frozen in addition to unfreezing the queue ensures that we release the locks in correct order. By using these helpers, now we maintain the same locking order in all code paths where we update blk-wbt parameters. Fixes: `245618f8e4` ("block: protect wbt_lat_usec using q->elevator_lock") Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202503171650.cc082b66-lkp@intel.com Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Link: https://lore.kernel.org/r/20250319105518.468941-3-nilay@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-19 11:35:45 -06:00
Nilay Shroff	89ed5fa3b5	block: release q->elevator_lock in ioc_qos_write The ioc_qos_write method acquires q->elevator_lock to protect updates to blk-wbt parameters. Once these updates are complete, the lock should be released before returning from ioc_qos_write. However, in one code path, the release of q->elevator_lock was mistakenly omitted, potentially leading to a lock leak. This commit fixes the issue by ensuring that q->elevator_lock is properly released in all return paths of ioc_qos_write. Fixes: `245618f8e4` ("block: protect wbt_lat_usec using q->elevator_lock") Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202503171650.cc082b66-lkp@intel.com Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250319105518.468941-2-nilay@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-19 11:35:45 -06:00
Chen Linxuan	e1a0202c6b	blk-cgroup: improve policy registration error handling This patch improve the returned error code of blkcg_policy_register(). 1. Move the validation check for cpd/pd_alloc_fn and cpd/pd_free_fn function pairs to the start of blkcg_policy_register(). This ensures we immediately return -EINVAL if the function pairs are not correctly provided, rather than returning -ENOSPC after locking and unlocking mutexes unnecessarily. Those locks should not contention any problems, as error of policy registration is a super cold path. 2. Return -ENOMEM when cpd_alloc_fn() failed. Co-authored-by: Wen Tao <wentao@uniontech.com> Signed-off-by: Wen Tao <wentao@uniontech.com> Signed-off-by: Chen Linxuan <chenlinxuan@uniontech.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/3E333A73B6B6DFC0+20250317022924.150907-1-chenlinxuan@uniontech.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-18 12:32:09 -06:00
Thomas Hellström	ffa1e7ada4	block: Make request_queue lockdep splats show up earlier In recent kernels, there are lockdep splats around the struct request_queue::io_lockdep_map, similar to [1], but they typically don't show up until reclaim with writeback happens. Having multiple kernel versions released with a known risc of kernel deadlock during reclaim writeback should IMHO be addressed and backported to -stable with the highest priority. In order to have these lockdep splats show up earlier, preferrably during system initialization, prime the struct request_queue::io_lockdep_map as GFP_KERNEL reclaim- tainted. This will instead lead to lockdep splats looking similar to [2], but without the need for reclaim + writeback happening. [1]: [ 189.762244] ====================================================== [ 189.762432] WARNING: possible circular locking dependency detected [ 189.762441] 6.14.0-rc6-xe+ #6 Tainted: G U [ 189.762450] ------------------------------------------------------ [ 189.762459] kswapd0/119 is trying to acquire lock: [ 189.762467] ffff888110ceb710 (&q->q_usage_counter(io)#26){++++}-{0:0}, at: __submit_bio+0x76/0x230 [ 189.762485] but task is already holding lock: [ 189.762494] ffffffff834c97c0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0xbe/0xb00 [ 189.762507] which lock already depends on the new lock. [ 189.762519] the existing dependency chain (in reverse order) is: [ 189.762529] -> #2 (fs_reclaim){+.+.}-{0:0}: [ 189.762540] fs_reclaim_acquire+0xc5/0x100 [ 189.762548] kmem_cache_alloc_lru_noprof+0x4a/0x480 [ 189.762558] alloc_inode+0xaa/0xe0 [ 189.762566] iget_locked+0x157/0x330 [ 189.762573] kernfs_get_inode+0x1b/0x110 [ 189.762582] kernfs_get_tree+0x1b0/0x2e0 [ 189.762590] sysfs_get_tree+0x1f/0x60 [ 189.762597] vfs_get_tree+0x2a/0xf0 [ 189.762605] path_mount+0x4cd/0xc00 [ 189.762613] __x64_sys_mount+0x119/0x150 [ 189.762621] x64_sys_call+0x14f2/0x2310 [ 189.762630] do_syscall_64+0x91/0x180 [ 189.762637] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 189.762647] -> #1 (&root->kernfs_rwsem){++++}-{3:3}: [ 189.762659] down_write+0x3e/0xf0 [ 189.762667] kernfs_remove+0x32/0x60 [ 189.762676] sysfs_remove_dir+0x4f/0x60 [ 189.762685] __kobject_del+0x33/0xa0 [ 189.762709] kobject_del+0x13/0x30 [ 189.762716] elv_unregister_queue+0x52/0x80 [ 189.762725] elevator_switch+0x68/0x360 [ 189.762733] elv_iosched_store+0x14b/0x1b0 [ 189.762756] queue_attr_store+0x181/0x1e0 [ 189.762765] sysfs_kf_write+0x49/0x80 [ 189.762773] kernfs_fop_write_iter+0x17d/0x250 [ 189.762781] vfs_write+0x281/0x540 [ 189.762790] ksys_write+0x72/0xf0 [ 189.762798] __x64_sys_write+0x19/0x30 [ 189.762807] x64_sys_call+0x2a3/0x2310 [ 189.762815] do_syscall_64+0x91/0x180 [ 189.762823] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 189.762833] -> #0 (&q->q_usage_counter(io)#26){++++}-{0:0}: [ 189.762845] __lock_acquire+0x1525/0x2760 [ 189.762854] lock_acquire+0xca/0x310 [ 189.762861] blk_mq_submit_bio+0x8a2/0xba0 [ 189.762870] __submit_bio+0x76/0x230 [ 189.762878] submit_bio_noacct_nocheck+0x323/0x430 [ 189.762888] submit_bio_noacct+0x2cc/0x620 [ 189.762896] submit_bio+0x38/0x110 [ 189.762904] __swap_writepage+0xf5/0x380 [ 189.762912] swap_writepage+0x3c7/0x600 [ 189.762920] shmem_writepage+0x3da/0x4f0 [ 189.762929] pageout+0x13f/0x310 [ 189.762937] shrink_folio_list+0x61c/0xf60 [ 189.763261] evict_folios+0x378/0xcd0 [ 189.763584] try_to_shrink_lruvec+0x1b0/0x360 [ 189.763946] shrink_one+0x10e/0x200 [ 189.764266] shrink_node+0xc02/0x1490 [ 189.764586] balance_pgdat+0x563/0xb00 [ 189.764934] kswapd+0x1e8/0x430 [ 189.765249] kthread+0x10b/0x260 [ 189.765559] ret_from_fork+0x44/0x70 [ 189.765889] ret_from_fork_asm+0x1a/0x30 [ 189.766198] other info that might help us debug this: [ 189.767089] Chain exists of: &q->q_usage_counter(io)#26 --> &root->kernfs_rwsem --> fs_reclaim [ 189.767971] Possible unsafe locking scenario: [ 189.768555] CPU0 CPU1 [ 189.768849] ---- ---- [ 189.769136] lock(fs_reclaim); [ 189.769421] lock(&root->kernfs_rwsem); [ 189.769714] lock(fs_reclaim); [ 189.770016] rlock(&q->q_usage_counter(io)#26); [ 189.770305] * DEADLOCK * [ 189.771167] 1 lock held by kswapd0/119: [ 189.771453] #0: ffffffff834c97c0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0xbe/0xb00 [ 189.771770] stack backtrace: [ 189.772351] CPU: 4 UID: 0 PID: 119 Comm: kswapd0 Tainted: G U 6.14.0-rc6-xe+ #6 [ 189.772353] Tainted: [U]=USER [ 189.772354] Hardware name: ASUS System Product Name/PRIME B560M-A AC, BIOS 2001 02/01/2023 [ 189.772354] Call Trace: [ 189.772355] <TASK> [ 189.772356] dump_stack_lvl+0x6e/0xa0 [ 189.772359] dump_stack+0x10/0x18 [ 189.772360] print_circular_bug.cold+0x17a/0x1b7 [ 189.772363] check_noncircular+0x13a/0x150 [ 189.772365] ? __pfx_stack_trace_consume_entry+0x10/0x10 [ 189.772368] __lock_acquire+0x1525/0x2760 [ 189.772368] ? ret_from_fork_asm+0x1a/0x30 [ 189.772371] lock_acquire+0xca/0x310 [ 189.772372] ? __submit_bio+0x76/0x230 [ 189.772375] ? lock_release+0xd5/0x2c0 [ 189.772376] blk_mq_submit_bio+0x8a2/0xba0 [ 189.772378] ? __submit_bio+0x76/0x230 [ 189.772380] __submit_bio+0x76/0x230 [ 189.772382] ? trace_hardirqs_on+0x1e/0xe0 [ 189.772384] submit_bio_noacct_nocheck+0x323/0x430 [ 189.772386] ? submit_bio_noacct_nocheck+0x323/0x430 [ 189.772387] ? __might_sleep+0x58/0xa0 [ 189.772390] submit_bio_noacct+0x2cc/0x620 [ 189.772391] ? count_memcg_events+0x68/0x90 [ 189.772393] submit_bio+0x38/0x110 [ 189.772395] __swap_writepage+0xf5/0x380 [ 189.772396] swap_writepage+0x3c7/0x600 [ 189.772397] shmem_writepage+0x3da/0x4f0 [ 189.772401] pageout+0x13f/0x310 [ 189.772406] shrink_folio_list+0x61c/0xf60 [ 189.772409] ? isolate_folios+0xe80/0x16b0 [ 189.772410] ? mark_held_locks+0x46/0x90 [ 189.772412] evict_folios+0x378/0xcd0 [ 189.772414] ? evict_folios+0x34a/0xcd0 [ 189.772415] ? lock_is_held_type+0xa3/0x130 [ 189.772417] try_to_shrink_lruvec+0x1b0/0x360 [ 189.772420] shrink_one+0x10e/0x200 [ 189.772421] shrink_node+0xc02/0x1490 [ 189.772423] ? shrink_node+0xa08/0x1490 [ 189.772424] ? shrink_node+0xbd8/0x1490 [ 189.772425] ? mem_cgroup_iter+0x366/0x480 [ 189.772427] balance_pgdat+0x563/0xb00 [ 189.772428] ? balance_pgdat+0x563/0xb00 [ 189.772430] ? trace_hardirqs_on+0x1e/0xe0 [ 189.772431] ? finish_task_switch.isra.0+0xcb/0x330 [ 189.772433] ? __switch_to_asm+0x33/0x70 [ 189.772437] kswapd+0x1e8/0x430 [ 189.772438] ? __pfx_autoremove_wake_function+0x10/0x10 [ 189.772440] ? __pfx_kswapd+0x10/0x10 [ 189.772441] kthread+0x10b/0x260 [ 189.772443] ? __pfx_kthread+0x10/0x10 [ 189.772444] ret_from_fork+0x44/0x70 [ 189.772446] ? __pfx_kthread+0x10/0x10 [ 189.772447] ret_from_fork_asm+0x1a/0x30 [ 189.772450] </TASK> [2]: [ 8.760253] ====================================================== [ 8.760254] WARNING: possible circular locking dependency detected [ 8.760255] 6.14.0-rc6-xe+ #7 Tainted: G U [ 8.760256] ------------------------------------------------------ [ 8.760257] (udev-worker)/674 is trying to acquire lock: [ 8.760259] ffff888100e39148 (&root->kernfs_rwsem){++++}-{3:3}, at: kernfs_remove+0x32/0x60 [ 8.760265] but task is already holding lock: [ 8.760266] ffff888110dc7680 (&q->q_usage_counter(io)#27){++++}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0x12/0x30 [ 8.760272] which lock already depends on the new lock. [ 8.760272] the existing dependency chain (in reverse order) is: [ 8.760273] -> #2 (&q->q_usage_counter(io)#27){++++}-{0:0}: [ 8.760276] blk_alloc_queue+0x30a/0x350 [ 8.760279] blk_mq_alloc_queue+0x6b/0xe0 [ 8.760281] scsi_alloc_sdev+0x276/0x3c0 [ 8.760284] scsi_probe_and_add_lun+0x22a/0x440 [ 8.760286] __scsi_scan_target+0x109/0x230 [ 8.760288] scsi_scan_channel+0x65/0xc0 [ 8.760290] scsi_scan_host_selected+0xff/0x140 [ 8.760292] do_scsi_scan_host+0xa7/0xc0 [ 8.760293] do_scan_async+0x1c/0x160 [ 8.760295] async_run_entry_fn+0x32/0x150 [ 8.760299] process_one_work+0x224/0x5f0 [ 8.760302] worker_thread+0x1d4/0x3e0 [ 8.760304] kthread+0x10b/0x260 [ 8.760306] ret_from_fork+0x44/0x70 [ 8.760309] ret_from_fork_asm+0x1a/0x30 [ 8.760312] -> #1 (fs_reclaim){+.+.}-{0:0}: [ 8.760315] fs_reclaim_acquire+0xc5/0x100 [ 8.760317] kmem_cache_alloc_lru_noprof+0x4a/0x480 [ 8.760319] alloc_inode+0xaa/0xe0 [ 8.760322] iget_locked+0x157/0x330 [ 8.760323] kernfs_get_inode+0x1b/0x110 [ 8.760325] kernfs_get_tree+0x1b0/0x2e0 [ 8.760327] sysfs_get_tree+0x1f/0x60 [ 8.760329] vfs_get_tree+0x2a/0xf0 [ 8.760332] path_mount+0x4cd/0xc00 [ 8.760334] __x64_sys_mount+0x119/0x150 [ 8.760336] x64_sys_call+0x14f2/0x2310 [ 8.760338] do_syscall_64+0x91/0x180 [ 8.760340] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 8.760342] -> #0 (&root->kernfs_rwsem){++++}-{3:3}: [ 8.760345] __lock_acquire+0x1525/0x2760 [ 8.760347] lock_acquire+0xca/0x310 [ 8.760348] down_write+0x3e/0xf0 [ 8.760350] kernfs_remove+0x32/0x60 [ 8.760351] sysfs_remove_dir+0x4f/0x60 [ 8.760353] __kobject_del+0x33/0xa0 [ 8.760355] kobject_del+0x13/0x30 [ 8.760356] elv_unregister_queue+0x52/0x80 [ 8.760358] elevator_switch+0x68/0x360 [ 8.760360] elv_iosched_store+0x14b/0x1b0 [ 8.760362] queue_attr_store+0x181/0x1e0 [ 8.760364] sysfs_kf_write+0x49/0x80 [ 8.760366] kernfs_fop_write_iter+0x17d/0x250 [ 8.760367] vfs_write+0x281/0x540 [ 8.760370] ksys_write+0x72/0xf0 [ 8.760372] __x64_sys_write+0x19/0x30 [ 8.760374] x64_sys_call+0x2a3/0x2310 [ 8.760376] do_syscall_64+0x91/0x180 [ 8.760377] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 8.760380] other info that might help us debug this: [ 8.760380] Chain exists of: &root->kernfs_rwsem --> fs_reclaim --> &q->q_usage_counter(io)#27 [ 8.760384] Possible unsafe locking scenario: [ 8.760384] CPU0 CPU1 [ 8.760385] ---- ---- [ 8.760385] lock(&q->q_usage_counter(io)#27); [ 8.760387] lock(fs_reclaim); [ 8.760388] lock(&q->q_usage_counter(io)#27); [ 8.760390] lock(&root->kernfs_rwsem); [ 8.760391] * DEADLOCK * [ 8.760391] 6 locks held by (udev-worker)/674: [ 8.760392] #0: ffff8881209ac420 (sb_writers#4){.+.+}-{0:0}, at: ksys_write+0x72/0xf0 [ 8.760398] #1: ffff88810c80f488 (&of->mutex#2){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x136/0x250 [ 8.760402] #2: ffff888125d1d330 (kn->active#101){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x13f/0x250 [ 8.760406] #3: ffff888110dc7bb0 (&q->sysfs_lock){+.+.}-{3:3}, at: queue_attr_store+0x148/0x1e0 [ 8.760411] #4: ffff888110dc7680 (&q->q_usage_counter(io)#27){++++}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0x12/0x30 [ 8.760416] #5: ffff888110dc76b8 (&q->q_usage_counter(queue)#27){++++}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0x12/0x30 [ 8.760421] stack backtrace: [ 8.760422] CPU: 7 UID: 0 PID: 674 Comm: (udev-worker) Tainted: G U 6.14.0-rc6-xe+ #7 [ 8.760424] Tainted: [U]=USER [ 8.760425] Hardware name: ASUS System Product Name/PRIME B560M-A AC, BIOS 2001 02/01/2023 [ 8.760426] Call Trace: [ 8.760427] <TASK> [ 8.760428] dump_stack_lvl+0x6e/0xa0 [ 8.760431] dump_stack+0x10/0x18 [ 8.760433] print_circular_bug.cold+0x17a/0x1b7 [ 8.760437] check_noncircular+0x13a/0x150 [ 8.760441] ? save_trace+0x54/0x360 [ 8.760445] __lock_acquire+0x1525/0x2760 [ 8.760446] ? irqentry_exit+0x3a/0xb0 [ 8.760448] ? sysvec_apic_timer_interrupt+0x57/0xc0 [ 8.760452] lock_acquire+0xca/0x310 [ 8.760453] ? kernfs_remove+0x32/0x60 [ 8.760457] down_write+0x3e/0xf0 [ 8.760459] ? kernfs_remove+0x32/0x60 [ 8.760460] kernfs_remove+0x32/0x60 [ 8.760462] sysfs_remove_dir+0x4f/0x60 [ 8.760464] __kobject_del+0x33/0xa0 [ 8.760466] kobject_del+0x13/0x30 [ 8.760467] elv_unregister_queue+0x52/0x80 [ 8.760470] elevator_switch+0x68/0x360 [ 8.760472] elv_iosched_store+0x14b/0x1b0 [ 8.760475] queue_attr_store+0x181/0x1e0 [ 8.760479] ? lock_acquire+0xca/0x310 [ 8.760480] ? kernfs_fop_write_iter+0x13f/0x250 [ 8.760482] ? lock_is_held_type+0xa3/0x130 [ 8.760485] sysfs_kf_write+0x49/0x80 [ 8.760487] kernfs_fop_write_iter+0x17d/0x250 [ 8.760489] vfs_write+0x281/0x540 [ 8.760494] ksys_write+0x72/0xf0 [ 8.760497] __x64_sys_write+0x19/0x30 [ 8.760499] x64_sys_call+0x2a3/0x2310 [ 8.760502] do_syscall_64+0x91/0x180 [ 8.760504] ? trace_hardirqs_off+0x5d/0xe0 [ 8.760506] ? handle_softirqs+0x479/0x4d0 [ 8.760508] ? hrtimer_interrupt+0x13f/0x280 [ 8.760511] ? irqentry_exit_to_user_mode+0x8b/0x260 [ 8.760513] ? clear_bhb_loop+0x15/0x70 [ 8.760515] ? clear_bhb_loop+0x15/0x70 [ 8.760516] ? clear_bhb_loop+0x15/0x70 [ 8.760518] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 8.760520] RIP: 0033:0x7aa3bf2f5504 [ 8.760522] Code: c7 00 16 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 80 3d c5 8b 10 00 00 74 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 55 48 89 e5 48 83 ec 20 48 89 [ 8.760523] RSP: 002b:00007ffc1e3697d8 EFLAGS: 00000202 ORIG_RAX: 0000000000000001 [ 8.760526] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007aa3bf2f5504 [ 8.760527] RDX: 0000000000000003 RSI: 00007ffc1e369ae0 RDI: 000000000000001c [ 8.760528] RBP: 00007ffc1e369800 R08: 00007aa3bf3f51c8 R09: 00007ffc1e3698b0 [ 8.760528] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000003 [ 8.760529] R13: 00007ffc1e369ae0 R14: 0000613ccf21f2f0 R15: 00007aa3bf3f4e80 [ 8.760533] </TASK> v2: - Update a code comment to increase readability (Ming Lei). Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250318095548.5187-1-thomas.hellstrom@linux.intel.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-18 07:57:33 -06:00
Christoph Hellwig	b0d4258119	block: fix a comment in the queue_attrs[] array queue_ra_entry uses limits_lock just like the attributes above it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Link: https://lore.kernel.org/r/20250312150127.703534-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-18 07:54:50 -06:00
Nilay Shroff	0e94ed3368	block: protect debugfs attribute method hctx_busy_show The hctx_busy_show method in debugfs is currently unprotected. This method iterates over all started requests in a tagset and prints them. However, the tags can be updated concurrently via the sysfs attributes 'nr_requests' or 'scheduler' (elevator switch), leading to potential race conditions. Since sysfs attributes 'nr_requests' and 'scheduler' are already protected using q->elevator_lock, extend this protection to the debugfs 'busy' attribute as well to ensure consistency. Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250313115235.3707600-4-nilay@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-13 07:23:43 -06:00
Nilay Shroff	78800f5997	block: remove unnecessary goto labels in debugfs attribute read methods In some debugfs attribute read methods, failure to acquire the mutex lock results in jumping to a label before returning an error code. However this is unnecessary, as we can return the failure code directly, improving code readability and reducing complexity. This commit removes the goto labels and ensures that the method returns immediately upon failing to acquire the mutex lock. Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250313115235.3707600-3-nilay@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-13 07:23:14 -06:00
Nilay Shroff	a3996d11f3	block: protect debugfs attrs using elevator_lock instead of sysfs_lock Currently, the block debugfs attributes (tags, tags_bitmap, sched_tags, and sched_tags_bitmap) are protected using q->sysfs_lock. However, these attributes are updated in multiple scenarios: - During driver probe method - During an elevator switch/update - During an nr_hw_queues update - When writing to the sysfs attribute nr_requests All these update paths (except driver probe method, which doesn't require any protection) are already protected using q->elevator_lock. To ensure consistency and proper synchronization, replace q->sysfs_lock with q->elevator_lock for protecting these debugfs attributes. Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250313115235.3707600-2-nilay@linux.ibm.com [axboe: some commit message rewording/fixes] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-13 07:22:13 -06:00
Anuj Gupta	75618ac6e9	block: remove unused parameter 'q' parameter in __blk_rq_map_sg() request_queue param is no longer used by blk_rq_map_sg and __blk_rq_map_sg. Remove it. Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250313035322.243239-1-anuj20.g@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-13 05:46:19 -06:00
Ming Lei	26064d3e2b	block: fix adding folio to bio >4GB folio is possible on some ARCHs, such as aarch64, 16GB hugepage is supported, then 'offset' of folio can't be held in 'unsigned int', cause warning in bio_add_folio_nofail() and IO failure. Fix it by adjusting 'page' & trimming 'offset' so that `->bi_offset` won't be overflow, and folio can be added to bio successfully. Fixes: `ed9832bc08` ("block: introduce folio awareness and add a bigger size from folio") Cc: Kundan Kumar <kundan.kumar@samsung.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Gavin Shan <gshan@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20250312145136.2891229-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-12 14:07:11 -06:00
Guixin Liu	61667cb664	block: remove unused parameter The blk_mq_map_queue()'s request_queue param is not used anymore, remove it, same with blk_get_flush_queue(). Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Link: https://lore.kernel.org/r/20250312084722.129680-1-kanie@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-12 08:25:28 -06:00
Michal Koutný	4a893bdc18	blk-cgroup: Simplify policy files registration Use one set of files when there is no difference between default and legacy files, similar to regular subsys files registration. No functional change. Signed-off-by: Michal Koutný <mkoutny@suse.com> Acked-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-03-11 09:22:55 -10:00
Michal Koutný	77bbb259db	cgroup/blkio: Add deprecation messages to reset_stats It is difficult to sync with stat updaters, stats are (should be) monotonic so users can calculate differences from a reference. Signed-off-by: Michal Koutný <mkoutny@suse.com> Acked-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-03-11 09:22:54 -10:00
Coly Li	7e76336e14	badblocks: Fix a nonsense WARN_ON() which checks whether a u64 variable < 0 In _badblocks_check(), there are lines of code like this, 1246 sectors -= len; [snipped] 1251 WARN_ON(sectors < 0); The WARN_ON() at line 1257 doesn't make sense because sectors is unsigned long long type and never to be <0. Fix it by checking directly checking whether sectors is less than len. Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Coly Li <colyli@kernel.org> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20250309160556.42854-1-colyli@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-10 07:41:58 -06:00
Ming Lei	fc0e982b8a	block: make sure ->nr_integrity_segments is cloned in blk_rq_prep_clone Make sure ->nr_integrity_segments is cloned in blk_rq_prep_clone(), otherwise requests cloned by device-mapper multipath will not have the proper nr_integrity_segments values set, then BUG() is hit from sg_alloc_table_chained(). Fixes: `b0fd271d5f` ("block: add request clone interface (v2)") Cc: stable@vger.kernel.org Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250310115453.2271109-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-10 07:41:25 -06:00
Nilay Shroff	5abba4cebe	block: protect hctx attributes/params using q->elevator_lock Currently, hctx attributes (nr_tags, nr_reserved_tags, and cpu_list) are protected using `q->sysfs_lock`. However, these attributes can be updated in multiple scenarios: - During the driver's probe method. - When updating nr_hw_queues. - When writing to the sysfs attribute nr_requests, which can modify nr_tags. The nr_requests attribute is already protected using q->elevator_lock, but none of the update paths actually use q->sysfs_lock to protect hctx attributes. So to ensure proper synchronization, replace q->sysfs_lock with q->elevator_lock when reading hctx attributes through sysfs. Additionally, blk_mq_update_nr_hw_queues allocates and updates hctx. The allocation of hctx is protected using q->elevator_lock, however, updating hctx params happens without any protection, so safeguard hctx param update path by also using q->elevator_lock. Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Link: https://lore.kernel.org/r/20250306093956.2818808-1-nilay@linux.ibm.com [axboe: wrap comment at 80 chars] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-10 07:31:06 -06:00
Nilay Shroff	5e40f4452d	block: protect read_ahead_kb using q->limits_lock The bdi->ra_pages could be updated under q->limits_lock because it's usually calculated from the queue limits by queue_limits_commit_update. So protect reading/writing the sysfs attribute read_ahead_kb using q->limits_lock instead of q->sysfs_lock. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Link: https://lore.kernel.org/r/20250304102551.2533767-8-nilay@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-10 07:30:19 -06:00
Nilay Shroff	245618f8e4	block: protect wbt_lat_usec using q->elevator_lock The wbt latency and state could be updated while initializing the elevator or exiting the elevator. It could be also updated while configuring IO latency QoS parameters using cgroup. The elevator code path is now protected with q->elevator_lock. So we should protect the access to sysfs attribute wbt_lat_usec using q->elevator _lock instead of q->sysfs_lock. White we're at it, also protect ioc_qos_write(), which configures wbt parameters via cgroup, using q->elevator_lock. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Link: https://lore.kernel.org/r/20250304102551.2533767-7-nilay@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-10 07:30:18 -06:00
Nilay Shroff	3efe7571c3	block: protect nr_requests update using q->elevator_lock The sysfs attribute nr_requests could be simultaneously updated from elevator switch/update or nr_hw_queue update code path. The update to nr_requests for each of those code paths runs holding q->elevator_lock. So we should protect access to sysfs attribute nr_requests using q-> elevator_lock instead of q->sysfs_lock. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Link: https://lore.kernel.org/r/20250304102551.2533767-6-nilay@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-10 07:30:18 -06:00
Nilay Shroff	1bf70d08cc	block: introduce a dedicated lock for protecting queue elevator updates A queue's elevator can be updated either when modifying nr_hw_queues or through the sysfs scheduler attribute. Currently, elevator switching/ updating is protected using q->sysfs_lock, but this has led to lockdep splats[1] due to inconsistent lock ordering between q->sysfs_lock and the freeze-lock in multiple block layer call sites. As the scope of q->sysfs_lock is not well-defined, its (mis)use has resulted in numerous lockdep warnings. To address this, introduce a new q->elevator_lock, dedicated specifically for protecting elevator switches/updates. And we'd now use this new q->elevator_lock instead of q->sysfs_lock for protecting elevator switches/updates. While at it, make elv_iosched_load_module() a static function, as it is only called from elv_iosched_store(). Also, remove redundant parameters from elv_iosched_load_module() function signature. [1] https://lore.kernel.org/all/67637e70.050a0220.3157ee.000c.GAE@google.com/ Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Link: https://lore.kernel.org/r/20250304102551.2533767-5-nilay@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-10 07:30:18 -06:00
Nilay Shroff	d23977fee1	block: remove q->sysfs_lock for attributes which don't need it There're few sysfs attributes in block layer which don't really need acquiring q->sysfs_lock while accessing it. The reason being, reading/ writing a value from/to such attributes are either atomic or could be easily protected using READ_ONCE()/WRITE_ONCE(). Moreover, sysfs attributes are inherently protected with sysfs/kernfs internal locking. So this change help segregate all existing sysfs attributes for which we could avoid acquiring q->sysfs_lock. For all read-only attributes we removed the q->sysfs_lock from show method of such attributes. In case attribute is read/write then we removed the q->sysfs_lock from both show and store methods of these attributes. We audited all block sysfs attributes and found following list of attributes which shouldn't require q->sysfs_lock protection: 1. io_poll: Write to this attribute is ignored. So, we don't need q->sysfs_lock. 2. io_poll_delay: Write to this attribute is NOP, so we don't need q->sysfs_lock. 3. io_timeout: Write to this attribute updates q->rq_timeout and read of this attribute returns the value stored in q->rq_timeout Moreover, the q->rq_timeout is set only once when we init the queue (under blk_mq_ init_allocated_queue()) even before disk is added. So that means that we don't need to protect it with q->sysfs_lock. As this attribute is not directly correlated with anything else simply using READ_ONCE/WRITE_ONCE should be enough. 4. nomerges: Write to this attribute file updates two q->flags : QUEUE_FLAG_ NOMERGES and QUEUE_FLAG_NOXMERGES. These flags are accessed during bio-merge which anyways doesn't run with q->sysfs_lock held. Moreover, the q->flags are updated/accessed with bitops which are atomic. So, protecting it with q->sysfs_lock is not necessary. 5. rq_affinity: Write to this attribute file makes atomic updates to q->flags: QUEUE_FLAG_SAME_COMP and QUEUE_FLAG_SAME_FORCE. These flags are also accessed from blk_mq_complete_need_ipi() using test_bit macro. As read/write to q->flags uses bitops which are atomic, protecting it with q->stsys_lock is not necessary. 6. nr_zones: Write to this attribute happens in the driver probe method (except nvme) before disk is added and outside of q->sysfs_lock or any other lock. Moreover nr_zones is defined as "unsigned int" and so reading this attribute, even when it's simultaneously being updated on other cpu, should not return torn value on any architecture supported by linux. So we can avoid using q->sysfs_lock or any other lock/ protection while reading this attribute. 7. discard_zeroes_data: Reading of this attribute always returns 0, so we don't require holding q->sysfs_lock. 8. write_same_max_bytes Reading of this attribute always returns 0, so we don't require holding q->sysfs_lock. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Link: https://lore.kernel.org/r/20250304102551.2533767-4-nilay@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-10 07:30:18 -06:00
Nilay Shroff	b07a889e83	block: move q->sysfs_lock and queue-freeze under show/store method In preparation to further simplify and group sysfs attributes which don't require locking or require some form of locking other than q-> limits_lock, move acquire/release of q->sysfs_lock and queue freeze/ unfreeze under each attributes' respective show/store method. While we are at it, also remove ->load_module() as it's used to load the module before queue is freezed. Now as we moved queue-freeze under ->store(), we could load module directly from the attributes' store method before we actually start freezing the queue. Currently, the ->load_module() is only used by "scheduler" attribute, so we now load the relevant elevator module before we start freezing the queue in elv_iosched_store(). Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Link: https://lore.kernel.org/r/20250304102551.2533767-3-nilay@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-10 07:30:18 -06:00
Nilay Shroff	6e51a1279c	block: acquire q->limits_lock while reading sysfs attributes There're few sysfs attributes(RW) whose store method is protected with q->limits_lock, however the corresponding show method of these attributes run holding q->sysfs_lock and that doesn't make sense as ideally the show method of these attributes should also run holding q->limits_lock instead of q->sysfs_lock. Hence update the show method of these sysfs attributes so that reading of these attributes acquire q->limits_lock instead of q->sysfs_lock. Similarly, there're few sysfs attributes(RO) whose show method is currently protected with q->sysfs_lock however updates to these attributes could occur using atomic limit update APIs such as queue_ limits_start_update() and queue_limits_commit_update() which run holding q->limits_lock. So that means that reading these attributes holding q->sysfs_lock doesn't make sense. Hence update the show method of these sysfs attributes(RO) such that they run with holding q-> limits_lock instead of q->sysfs_lock. We have defined a new macro QUEUE_LIM_RO_ENTRY() which uses new ->show_ limit() method and it runs holding q->limits_lock. All existing sysfs attributes(RO) which needs protection using q->limits_lock while reading have been now updated to use this new macro for initialization. Also, the existing QUEUE_LIM_RW_ENTRY() is updated to use new ->show_ limit() method for reading attributes instead of existing ->show() method. As ->show_limit() runs holding q->limits_lock, the existing sysfs attributes(RW) requiring protection are now inherently protected using q->limits_lock instead of q->sysfs_lock. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Link: https://lore.kernel.org/r/20250304102551.2533767-2-nilay@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-10 07:30:18 -06:00
Linus Torvalds	381af8d9f4	block-6.14-20250306 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmfKQvsQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpnBCD/9bVSGHNnXakVwdpQmtU5zy54cyWd7VaYsz qeM+Vrl1m5nf8q5ZdEXcM11Ruib3YJiW0GN9d9sWpTwt8C5n8g+8F63koS7GordZ jcv77nO6FlnwWpm3YlNxAeLuxkl15e4MQIKj/jb540iFygzT8H2lygE816K4kpCX XuMxNxdSMksntovZufzxo3Sfkm6e6GChCkkqvBxuXiEWFhvbFQ/ZLEsEMtoH4hkI 3Nj1VB3B3pLVCZhWr2uVvcZCiYUDyBslu+SA3RRoX0W6beK1cVI4OQdS8GtnkJf3 qFnLQz0Ib3EVDtugqg7ZGSAAov6Z8waA2MrFeZkG8uIfl4WT3kBfoan7jRX3Mknl VnFkThyJOzB83OKqlZKjCzYmEzBhKJrRJVtneIrxT+gvEpevFvAQil6SQfyPDwno 4YcUD+IfU/daTdVR58QQ/iLzkQ7stQWYCtZSrICKfcAGy6zswKM5P5uoWltMBwQh aHsyz9xbmsMrxch1DPRb0T2GD2h9BsiL6rT8JCrOgucMuOYeZL9pNRgz16D/hael wBCxPcanSdap0N9kiMX8fLYYdmRxpJHzTbeNRsPhZe8HKUPu1sYTbisOou1XSdAW Dv7zeQWVlw+1cn/S1Y6Oc4mdlPzPTA9szuBXVpbe9Gd7ZqO7sbbKEkGu5w6MGSZ1 oubnZKCNvA== =jKDe -----END PGP SIGNATURE----- Merge tag 'block-6.14-20250306' of git://git.kernel.dk/linux Pull block fixes from Jens Axboe: - NVMe pull request via Keith: - TCP use after free fix on polling (Sagi) - Controller memory buffer cleanup fixes (Icenowy) - Free leaking requests on bad user passthrough commands (Keith) - TCP error message fix (Maurizio) - TCP corruption fix on partial PDU (Maurizio) - TCP memory ordering fix for weakly ordered archs (Meir) - Type coercion fix on message error for TCP (Dan) - Name the RQF flags enum, fixing issues with anon enums and BPF import of it - ublk parameter setting fix - GPT partition 7-bit conversion fix * tag 'block-6.14-20250306' of git://git.kernel.dk/linux: block: Name the RQF flags enum nvme-tcp: fix signedness bug in nvme_tcp_init_connection() block: fix conversion of GPT partition name to 7-bit ublk: set_params: properly check if parameters can be applied nvmet-tcp: Fix a possible sporadic response drops in weakly ordered arch nvme-tcp: fix potential memory corruption in nvme_tcp_recv_pdu() nvme-tcp: Fix a C2HTermReq error message nvmet: remove old function prototype nvme-ioctl: fix leaked requests on mapping error nvme-pci: skip CMB blocks incompatible with PCI P2P DMA nvme-pci: clean up CMBMSC when registering CMB fails nvme-tcp: fix possible UAF in nvme_tcp_poll	2025-03-07 11:12:33 -10:00
Luis Chamberlain	a64e5a5960	bdev: add back PAGE_SIZE block size validation for sb_set_blocksize() The commit titled "block/bdev: lift block size restrictions to 64k" lifted the block layer's max supported block size to 64k inside the helper blk_validate_block_size() now that we support large folios. However in lifting the block size we also removed the silly use cases many filesystems have to use sb_set_blocksize() to verify that the block size <= PAGE_SIZE. The call to sb_set_blocksize() was used to check the block size <= PAGE_SIZE since historically we've always supported userspace to create for example 64k block size filesystems even on 4k page size systems, but what we didn't allow was mounting them. Older filesystems have been using the check with sb_set_blocksize() for years. While, we could argue that such checks should be filesystem specific, there are much more users of sb_set_blocksize() than LBS enabled filesystem on upstream, so just do the easier thing and bring back the PAGE_SIZE check for sb_set_blocksize() users and only skip it for LBS enabled filesystems. This will ensure that tests such as generic/466 when run in a loop against say, ext4, won't try to try to actually mount a filesystem with a block size larger than your filesystem supports given your PAGE_SIZE and in the worst case crash. Cc: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Link: https://lore.kernel.org/r/20250307020403.3068567-1-mcgrof@kernel.org Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-03-07 12:56:05 +01:00
Zheng Qixing	d301f164c3	badblocks: use sector_t instead of int to avoid truncation of badblocks length There is a truncation of badblocks length issue when set badblocks as follow: echo "2055 4294967299" > bad_blocks cat bad_blocks 2055 3 Change 'sectors' argument type from 'int' to 'sector_t'. This change avoids truncation of badblocks length for large sectors by replacing 'int' with 'sector_t' (u64), enabling proper handling of larger disk sizes and ensuring compatibility with 64-bit sector addressing. Fixes: `9e0e252a04` ("badblocks: Add core badblock management code") Signed-off-by: Zheng Qixing <zhengqixing@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Coly Li <colyli@kernel.org> Link: https://lore.kernel.org/r/20250227075507.151331-13-zhengqixing@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-06 08:04:52 -07:00
Zheng Qixing	c8775aefba	badblocks: return boolean from badblocks_set() and badblocks_clear() Change the return type of badblocks_set() and badblocks_clear() from int to bool, indicating success or failure. Specifically: - _badblocks_set() and _badblocks_clear() functions now return true for success and false for failure. - All calls to these functions are updated to handle the new boolean return type. - This change improves code clarity and ensures a more consistent handling of success and failure states. Signed-off-by: Zheng Qixing <zhengqixing@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Coly Li <colyli@kernel.org> Acked-by: Ira Weiny <ira.weiny@intel.com> Link: https://lore.kernel.org/r/20250227075507.151331-11-zhengqixing@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-06 08:03:28 -07:00
Zheng Qixing	5236f041fa	badblocks: fix missing bad blocks on retry in _badblocks_check() The bad blocks check would miss bad blocks when retrying under contention, as checking parameters are not reset. These stale values from the previous attempt could lead to incorrect scanning in the subsequent retry. Move seqlock to outer function and reinitialize checking state for each retry. This ensures a clean state for each check attempt, preventing any missed bad blocks. Fixes: `3ea3354cb9` ("badblocks: improve badblocks_check() for multiple ranges handling") Signed-off-by: Zheng Qixing <zhengqixing@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Coly Li <colyli@kernel.org> Link: https://lore.kernel.org/r/20250227075507.151331-10-zhengqixing@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-06 08:03:28 -07:00
Li Nan	9ec65dec63	badblocks: fix merge issue when new badblocks align with pre+1 There is a merge issue when adding badblocks as follow: echo 0 10 > bad_blocks echo 30 10 > bad_blocks echo 20 10 > bad_blocks cat bad_blocks 0 10 20 10 //should be merged with (30 10) 30 10 In this case, if new badblocks does not intersect with prev, it is added by insert_at(). If there is an intersection with prev+1, the merge will be processed in the next re_insert loop. However, when the end of the new badblocks is exactly equal to the offset of prev+1, no further re_insert loop occurs, and the two badblocks are not merge. Fix it by inc prev, badblocks can be merged during the subsequent code. Fixes: `aa511ff821` ("badblocks: switch to the improved badblock handling code") Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20250227075507.151331-9-zhengqixing@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-06 08:03:28 -07:00
Li Nan	3a23d05f9c	badblocks: try can_merge_front before overlap_front Regardless of whether overlap_front() returns true or false, can_merge_front() will be executed first. Therefore, move can_merge_front() in front of can_merge_front() to simplify code. Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20250227075507.151331-8-zhengqixing@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-06 08:03:28 -07:00
Li Nan	37446680df	badblocks: fix the using of MAX_BADBLOCKS The number of badblocks cannot exceed MAX_BADBLOCKS, but it should be allowed to equal MAX_BADBLOCKS. Fixes: `aa511ff821` ("badblocks: switch to the improved badblock handling code") Fixes: `c3c6a86e9e` ("badblocks: add helper routines for badblock ranges handling") Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Coly Li <colyli@kernel.org> Link: https://lore.kernel.org/r/20250227075507.151331-7-zhengqixing@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-06 08:03:28 -07:00
Li Nan	7f500f0a59	badblocks: return error if any badblock set fails _badblocks_set() returns success if at least one badblock is set successfully, even if others fail. This can lead to data inconsistencies in raid, where a failed badblock set should trigger the disk to be kicked out to prevent future reads from failed write areas. _badblocks_set() should return error if any badblock set fails. Instead of relying on 'rv', directly returning 'sectors' for clearer logic. If all badblocks are successfully set, 'sectors' will be 0, otherwise it indicates the number of badblocks that have not been set yet, thus signaling failure. By the way, it can also fix an issue: when a newly set unack badblock is included in an existing ack badblock, the setting will return an error. ··· echo "0 100" /sys/block/md0/md/dev-loop1/bad_blocks echo "0 100" /sys/block/md0/md/dev-loop1/unacknowledged_bad_blocks -bash: echo: write error: No space left on device ``` After fix, it will return success. Fixes: `aa511ff821` ("badblocks: switch to the improved badblock handling code") Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Coly Li <colyli@kernel.org> Link: https://lore.kernel.org/r/20250227075507.151331-6-zhengqixing@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-06 08:03:28 -07:00
Li Nan	28243dcd1f	badblocks: return error directly when setting badblocks exceeds 512 In the current handling of badblocks settings, a lot of processing has been done for scenarios where the number of badblocks exceeds 512. This makes the code look quite complex and also introduces some issues, For example, if there is 512 badblocks already: for((i=0; i<510; i++)); do ((sector=i*2)); echo "$sector 1" > bad_blocks; done echo 2100 10 > bad_blocks echo 2200 10 > bad_blocks Set new one, exceed 512: echo 2000 500 > bad_blocks Expected: 2000 500 Actual: 2100 400 In fact, a disk shouldn't have too many badblocks, and for disks with 512 badblocks, attempting to set more bad blocks doesn't make much sense. At that point, the more appropriate action would be to replace the disk. Therefore, to resolve these issues and simplify the code somewhat, return error directly when setting badblocks exceeds 512. Fixes: `aa511ff821` ("badblocks: switch to the improved badblock handling code") Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20250227075507.151331-5-zhengqixing@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-06 08:03:28 -07:00
Li Nan	32e9ad4d11	badblocks: attempt to merge adjacent badblocks during ack_all_badblocks If ack and unack badblocks are adjacent, they will not be merged and will remain as two separate badblocks. Even after the bad blocks are written to disk and both become ack, they will still remain as two independent bad blocks. This is not ideal as it wastes the limited space for badblocks. Therefore, during ack_all_badblocks(), attempt to merge badblocks if they are adjacent. Fixes: `aa511ff821` ("badblocks: switch to the improved badblock handling code") Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Coly Li <colyli@kernel.org> Link: https://lore.kernel.org/r/20250227075507.151331-4-zhengqixing@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-06 08:03:27 -07:00
Li Nan	270b68fee9	badblocks: factor out a helper try_adjacent_combine Factor out try_adjacent_combine(), and it will be used in the later patch. Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20250227075507.151331-3-zhengqixing@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-06 08:03:27 -07:00
Li Nan	7d83c5d73c	badblocks: Fix error shitf ops 'bb->shift' is used directly in badblocks. It is wrong, fix it. Fixes: `3ea3354cb9` ("badblocks: improve badblocks_check() for multiple ranges handling") Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Coly Li <colyli@kernel.org> Link: https://lore.kernel.org/r/20250227075507.151331-2-zhengqixing@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-06 08:03:27 -07:00
Anuj Gupta	85f7292500	block: Correctly initialize BLK_INTEGRITY_NOGENERATE and BLK_INTEGRITY_NOVERIFY Currently, BLK_INTEGRITY_NOGENERATE and BLK_INTEGRITY_NOVERIFY are not explicitly set during integrity initialization. This can lead to incorrect reporting of read_verify and write_generate sysfs values, particularly when a device does not support integrity. Ensure that these flags are correctly initialized by default. Reported-by: M Nikhil <nikh1092@linux.ibm.com> Link: https://lore.kernel.org/linux-block/f6130475-3ccd-45d2-abde-3ccceada0f0a@linux.ibm.com/ Fixes: `9f4aa46f2a` ("block: invert the BLK_INTEGRITY_{GENERATE,VERIFY} flags") Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250305063033.1813-3-anuj20.g@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-06 08:01:37 -07:00
Anuj Gupta	677e332e48	block: ensure correct integrity capability propagation in stacked devices queue_limits_stack_integrity() incorrectly sets BLK_INTEGRITY_DEVICE_CAPABLE for a DM device even when none of its underlying devices support integrity. This happens because the flag is inherited unconditionally. Ensure that integrity capabilities are correctly propagated only when the underlying devices actually support integrity. Reported-by: M Nikhil <nikh1092@linux.ibm.com> Link: https://lore.kernel.org/linux-block/f6130475-3ccd-45d2-abde-3ccceada0f0a@linux.ibm.com/ Fixes: `c6e56cf6b2` ("block: move integrity information into queue_limits") Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250305063033.1813-2-anuj20.g@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-06 08:01:37 -07:00
Ming Lei	6cc477c368	blk-throttle: carry over directly Now ->carryover_bytes[] and ->carryover_ios[] only covers limit/config update. Actually the carryover bytes/ios can be carried to ->bytes_disp[] and ->io_disp[] directly, since the carryover is one-shot thing and only valid in current slice. Then we can remove the two fields and simplify code much. Type of ->bytes_disp[] and ->io_disp[] has to change as signed because the two fields may become negative when updating limits or config, but both are big enough for holding bytes/ios dispatched in single slice Cc: Tejun Heo <tj@kernel.org> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20250305043123.3938491-4-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-05 16:24:40 -07:00
Ming Lei	a9fc8868b3	blk-throttle: don't take carryover for prioritized processing of metadata Commit `29390bb566` ("blk-throttle: support prioritized processing of metadata") takes bytes/ios carryover for prioritized processing of metadata. Turns out we can support it by charging it directly without trimming slice, and the result is same with carryover. Cc: Tejun Heo <tj@kernel.org> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20250305043123.3938491-3-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-05 16:24:40 -07:00
Ming Lei	483a393e7e	blk-throttle: remove last_bytes_disp and last_ios_disp The two fields are not used any more, so remove them. Cc: Tejun Heo <tj@kernel.org> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20250305043123.3938491-2-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-05 16:24:40 -07:00
Yu Kuai	29cb955934	blk-throttle: fix lower bps rate by throtl_trim_slice() The bio submission time may be a few jiffies more than the expected waiting time, due to 'extra_bytes' can't be divided in tg_within_bps_limit(), and also due to timer wakeup delay. In this case, adjust slice_start to jiffies will discard the extra wait time, causing lower rate than expected. Current in-tree code already covers deviation by rounddown(), but turns out it is not enough, because jiffies - slice_start can be a multiple of throtl_slice. For example, assume bps_limit is 1000bytes, 1 jiffes is 10ms, and slice is 20ms(2 jiffies), expected rate is 1000 / 1000 * 20 = 20 bytes per slice. If user issues two 21 bytes IO, then wait time will be 30ms for the first IO: bytes_allowed = 20, extra_bytes = 1; jiffy_wait = 1 + 2 = 3 jiffies and consider extra 1 jiffies by timer, throtl_trim_slice() will be called at: jiffies = 40ms slice_start = 0ms, slice_end= 40ms bytes_disp = 21 In this case, before the patch, real rate in the first two slices is 10.5 bytes per slice, and slice will be updated to: jiffies = 40ms slice_start = 40ms, slice_end = 60ms, bytes_disp = 0; Hence the second IO will have to wait another 30ms; With the patch, the real rate in the first slice is 20 bytes per slice, which is the same as expected, and slice will be updated: jiffies=40ms, slice_start = 20ms, slice_end = 60ms, bytes_disp = 1; And now, there is still 19 bytes allowed in the second slice, and the second IO will only have to wait 10ms; This problem will cause blktests throtl/001 failure in case of CONFIG_HZ_100=y, fix it by preserving one extra finished slice in throtl_trim_slice(). Fixes: `e43473b7f2` ("blkio: Core implementation of throttle policy") Reported-by: Ming Lei <ming.lei@redhat.com> Closes: https://lore.kernel.org/linux-block/20250222092823.210318-3-yukuai1@huaweicloud.com/ Reviewed-by: Ming Lei <ming.lei@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20250227120645.812815-1-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-05 16:24:30 -07:00
Olivier Gayot	e06472bab2	block: fix conversion of GPT partition name to 7-bit The utf16_le_to_7bit function claims to, naively, convert a UTF-16 string to a 7-bit ASCII string. By naively, we mean that it: * drops the first byte of every character in the original UTF-16 string * checks if all characters are printable, and otherwise replaces them by exclamation mark "!". This means that theoretically, all characters outside the 7-bit ASCII range should be replaced by another character. Examples: * lower-case alpha (ɒ) 0x0252 becomes 0x52 (R) * ligature OE (œ) 0x0153 becomes 0x53 (S) * hangul letter pieup (ㅂ) 0x3142 becomes 0x42 (B) * upper-case gamma (Ɣ) 0x0194 becomes 0x94 (not printable) so gets replaced by "!" The result of this conversion for the GPT partition name is passed to user-space as PARTNAME via udev, which is confusing and feels questionable. However, there is a flaw in the conversion function itself. By dropping one byte of each character and using isprint() to check if the remaining byte corresponds to a printable character, we do not actually guarantee that the resulting character is 7-bit ASCII. This happens because we pass 8-bit characters to isprint(), which in the kernel returns 1 for many values > 0x7f - as defined in ctype.c. This results in many values which should be replaced by "!" to be kept as-is, despite not being valid 7-bit ASCII. Examples: * e with acute accent (é) 0x00E9 becomes 0xE9 - kept as-is because isprint(0xE9) returns 1. * euro sign (€) 0x20AC becomes 0xAC - kept as-is because isprint(0xAC) returns 1. This way has broken pyudev utility[1], fixes it by using a mask of 7 bits instead of 8 bits before calling isprint. Link: https://github.com/pyudev/pyudev/issues/490#issuecomment-2685794648 [1] Link: https://lore.kernel.org/linux-block/4cac90c2-e414-4ebb-ae62-2a4589d9dc6e@canonical.com/ Cc: Mulhern <amulhern@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: stable@vger.kernel.org Signed-off-by: Olivier Gayot <olivier.gayot@canonical.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250305022154.3903128-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-05 07:40:24 -07:00
Christoph Hellwig	105ca2a2c2	block: split struct bio_integrity_payload Many of the fields in struct bio_integrity_payload are only needed for the default integrity buffer in the block layer, and the variable sized array at the end of the structure makes it very hard to embed into caller allocated structures. Reduce struct bio_integrity_payload to the minimal structure needed in common code and create two separate containing structures for the automatically generated payload and the caller allocated payload. The latter is a simple wrapper for struct bio_integrity_payload and the bvecs, while the former contains the additional fields moved out of struct bio_integrity_payload. Always use a dedicated mempool for automatic integrity metadata instead of depending on bio_set that is submitter controlled and thus often doesn't have the mempool initialized and stop using mempools for the submitter buffers as they aren't in the NOIO I/O submission path where we need to guarantee forward progress. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Tested-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20250225154449.422989-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-03 11:17:52 -07:00
Christoph Hellwig	e51679112c	block: move the block layer auto-integrity code into a new file The code that automatically creates a integrity payload and generates and verifies the checksums for bios that don't have submitter-provided integrity payload currently sits right in the middle of the block integrity metadata infrastructure. Split it into a separate file to make the different layers clear. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250225154449.422989-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-03 11:17:52 -07:00
Christoph Hellwig	5fd0268a88	block: mark bounce buffering as incompatible with integrity None of the few drivers still using the legacy block layer bounce buffering support integrity metadata. Explicitly mark the features as incompatible and stop creating the slab and mempool for integrity buffers for the bounce bio_set. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250225154449.422989-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-03 11:17:52 -07:00
Linus Torvalds	276f98efb6	block-6.14-20250228 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmfBwygQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpuyVD/9kem557zNDkps/+2k8Q86FGZ/XmD+GPu1H l30qlar1XubeC/AE/bxgyI8G6rWY9li3PPn0tu/LeTgTVW5noIZCyvrtxl8g6yKV Gptm3H5AJypMU9cDz1/KTYTgrEypDJ22092/V1cuoeJxUS3srIEx6rlBp1wXzoG6 WdEIBhk9hM3hwXghyEarJeacHFe6xzd9lJM9ZODXBMkKtee85zXDLSAEPJsnjCcH t2tU/EAa6O0MLuYorG4Lkfs0ggDP+UDRdwh2MbANZXZdUCG2SwBS3pKDYtn684A1 gSsPnJGVZjLTog9jzaGkw64ebZ8tdLU4szjzroAJYkIbz9kO3QxT+H4TfW5UMoip TVPdNDqvypqs8ENKUvv3XuGsKuOfYjpBEiU2oGUUuioHJnWlh6CPnt8V8t3YKnbP xreqnIOjRJni1/OOZOMcWfRLlIRMG2dGFwhskWBWY8dmt4eHoge3RQzPZtAFelcG eM+Gkczz+GAXAnFHt5JQIPnfmcVmXqkbX12uoxUyuoa4AFaDLT+7nVtu3Gj5/beJ bcvk8q6ww8oXGVvJ0sYwic9tOX4XoxHsdr8u80Wd0uvHUB6uU/HTAxQxUO3uMSD5 0pk9l/zGDjDcEcuOUiIAUldl2M1eIyoBIOK3svMq6TKiC13j7+xkGI1uSA9cKws6 /+OsNMd9JQ== =tcA2 -----END PGP SIGNATURE----- Merge tag 'block-6.14-20250228' of git://git.kernel.dk/linux Pull block fixes from Jens Axboe: - Fix plugging for native zone writes - Fix segment limit settings for != 4K page size archs - Fix for slab names overflowing * tag 'block-6.14-20250228' of git://git.kernel.dk/linux: block: fix 'kmem_cache of name 'bio-108' already exists' block: Remove zone write plugs when handling native zone append writes block: make segment size limit workable for > 4K PAGE_SIZE	2025-02-28 09:43:46 -08:00
Ming Lei	b654f7a51f	block: fix 'kmem_cache of name 'bio-108' already exists' Device mapper bioset often has big bio_slab size, which can be more than 1000, then 8byte can't hold the slab name any more, cause the kmem_cache allocation warning of 'kmem_cache of name 'bio-108' already exists'. Fix the warning by extending bio_slab->name to 12 bytes, but fix output of /proc/slabinfo Reported-by: Guangwu Zhang <guazhang@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250228132656.2838008-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-02-28 07:06:42 -07:00
Damien Le Moal	a6aa36e957	block: Remove zone write plugs when handling native zone append writes For devices that natively support zone append operations, REQ_OP_ZONE_APPEND BIOs are not processed through zone write plugging and are immediately issued to the zoned device. This means that there is no write pointer offset tracking done for these operations and that a zone write plug is not necessary. However, when receiving a zone append BIO, we may already have a zone write plug for the target zone if that zone was previously partially written using regular write operations. In such case, since the write pointer offset of the zone write plug is not incremented by the amount of sectors appended to the zone, 2 issues arise: 1) we risk leaving the plug in the disk hash table if the zone is fully written using zone append or regular write operations, because the write pointer offset will never reach the "zone full" state. 2) Regular write operations that are issued after zone append operations will always be failed by blk_zone_wplug_prepare_bio() as the write pointer alignment check will fail, even if the user correctly accounted for the zone append operations and issued the regular writes with a correct sector. Avoid these issues by immediately removing the zone write plug of zones that are the target of zone append operations when blk_zone_plug_bio() is called. The new function blk_zone_wplug_handle_native_zone_append() implements this for devices that natively support zone append. The removal of the zone write plug using disk_remove_zone_wplug() requires aborting all plugged regular write using disk_zone_wplug_abort() as otherwise the plugged write BIOs would never be executed (with the plug removed, the completion path will never see again the zone write plug as disk_get_zone_wplug() will return NULL). Rate-limited warnings are added to blk_zone_wplug_handle_native_zone_append() and to disk_zone_wplug_abort() to signal this. Since blk_zone_wplug_handle_native_zone_append() is called in the hot path for operations that will not be plugged, disk_get_zone_wplug() is optimized under the assumption that a user issuing zone append operations is not at the same time issuing regular writes and that there are no hashed zone write plugs. The struct gendisk atomic counter nr_zone_wplugs is added to check this, with this counter incremented in disk_insert_zone_wplug() and decremented in disk_remove_zone_wplug(). To be consistent with this fix, we do not need to fill the zone write plug hash table with zone write plugs for zones that are partially written for a device that supports native zone append operations. So modify blk_revalidate_seq_zone() to return early to avoid allocating and inserting a zone write plug for partially written sequential zones if the device natively supports zone append. Reported-by: Jorgen Hansen <Jorgen.Hansen@wdc.com> Fixes: `9b1ce7f0c6` ("block: Implement zone append emulation") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Tested-by: Jorgen Hansen <Jorgen.Hansen@wdc.com> Link: https://lore.kernel.org/r/20250214041434.82564-1-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-02-25 19:45:21 -07:00
Tang Yizhou	8ac17e6ae1	blk-wbt: Cleanup a comment in wb_timer_fn The original comment contains a grammatical error. Rewrite it into a more easily understandable sentence. Signed-off-by: Tang Yizhou <yizhou.tang@shopee.com> Link: https://lore.kernel.org/r/20250213100611.209997-3-yizhou.tang@shopee.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-02-25 08:43:52 -07:00
Tang Yizhou	5d01d2df85	blk-wbt: Fix some comments wbt_wait() no longer uses a spinlock as a parameter. Update the function comments accordingly. RWB_UNKNOWN_BUMP is used when we gradually adjust scale_steps toward the center state, which is a value of 0. Signed-off-by: Tang Yizhou <yizhou.tang@shopee.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20250213100611.209997-2-yizhou.tang@shopee.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-02-25 08:43:52 -07:00
Ming Lei	889c57066c	block: make segment size limit workable for > 4K PAGE_SIZE Using PAGE_SIZE as a minimum expected DMA segment size in consideration of devices which have a max DMA segment size of < 64k when used on 64k PAGE_SIZE systems leads to devices not being able to probe such as eMMC and Exynos UFS controller [0] [1] you can end up with a probe failure as follows: WARNING: CPU: 2 PID: 397 at block/blk-settings.c:339 blk_validate_limits+0x364/0x3c0 Ensure we use min(max_seg_size, seg_boundary_mask + 1) as the new min segment size when max segment size is < PAGE_SIZE for 16k and 64k base page size systems. If anyone need to backport this patch, the following commits are depended: commit `6aeb4f8364` ("block: remove bio_add_pc_page") commit `02ee5d69e3` ("block: remove blk_rq_bio_prep") commit `b7175e24d6` ("block: add a dma mapping iterator") Link: https://lore.kernel.org/linux-block/20230612203314.17820-1-bvanassche@acm.org/ # [0] Link: https://lore.kernel.org/linux-block/1d55e942-5150-de4c-3a02-c3d066f87028@acm.org/ # [1] Cc: Yi Zhang <yi.zhang@redhat.com> Cc: John Garry <john.g.garry@oracle.com> Cc: Keith Busch <kbusch@kernel.org> Tested-by: Paul Bunyan <pbunyan@redhat.com> Reviewed-by: Daniel Gomez <da.gomez@kernel.org> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250225022141.2154581-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-02-25 08:41:32 -07:00
Luis Chamberlain	425fbcd62d	bdev: use bdev_io_min() for statx block size You can use lsblk to query for a block device block device block size: lsblk -o MIN-IO /dev/nvme0n1 MIN-IO 4096 The min-io is the minimum IO the block device prefers for optimal performance. In turn we map this to the block device block size. The current block size exposed even for block devices with an LBA format of 16k is 4k. Likewise devices which support 4k LBA format but have a larger Indirection Unit of 16k have an exposed block size of 4k. This incurs read-modify-writes on direct IO against devices with a min-io larger than the page size. To fix this, use the block device min io, which is the minimal optimal IO the device prefers. With this we now get: lsblk -o MIN-IO /dev/nvme0n1 MIN-IO 16384 And so userspace gets the appropriate information it needs for optimal performance. This is verified with blkalgn against mkfs against a device with LBA format of 4k but an NPWG of 16k (min io size) mkfs.xfs -f -b size=16k /dev/nvme3n1 blkalgn -d nvme3n1 --ops Write Block size : count distribution 0 -> 1 : 0 \| \| 2 -> 3 : 0 \| \| 4 -> 7 : 0 \| \| 8 -> 15 : 0 \| \| 16 -> 31 : 0 \| \| 32 -> 63 : 0 \| \| 64 -> 127 : 0 \| \| 128 -> 255 : 0 \| \| 256 -> 511 : 0 \| \| 512 -> 1023 : 0 \| \| 1024 -> 2047 : 0 \| \| 2048 -> 4095 : 0 \| \| 4096 -> 8191 : 0 \| \| 8192 -> 16383 : 0 \| \| 16384 -> 32767 : 66 \|***************************************\| 32768 -> 65535 : 0 \| \| 65536 -> 131071 : 0 \| \| 131072 -> 262143 : 2 \| \| Block size: 14 - 66 Block size: 17 - 2 Algn size : count distribution 0 -> 1 : 0 \| \| 2 -> 3 : 0 \| \| 4 -> 7 : 0 \| \| 8 -> 15 : 0 \| \| 16 -> 31 : 0 \| \| 32 -> 63 : 0 \| \| 64 -> 127 : 0 \| \| 128 -> 255 : 0 \| \| 256 -> 511 : 0 \| \| 512 -> 1023 : 0 \| \| 1024 -> 2047 : 0 \| \| 2048 -> 4095 : 0 \| \| 4096 -> 8191 : 0 \| \| 8192 -> 16383 : 0 \| \| 16384 -> 32767 : 66 \|***************************************\| 32768 -> 65535 : 0 \| \| 65536 -> 131071 : 0 \| \| 131072 -> 262143 : 2 \| \| Algn size: 14 - 66 Algn size: 17 - 2 Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Link: https://lore.kernel.org/r/20250221223823.1680616-9-mcgrof@kernel.org Reviewed-by: John Garry <john.g.garry@oracle.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-02-24 11:44:44 +01:00
Luis Chamberlain	47dd675323	block/bdev: lift block size restrictions to 64k We now can support blocksizes larger than PAGE_SIZE, so in theory we should be able to lift the restriction up to the max supported page cache order. However bound ourselves to what we can currently validate and test. Through blktests and fstest we can validate up to 64k today. Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: "Matthew Wilcox (Oracle)" <willy@infradead.org> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Link: https://lore.kernel.org/r/20250221223823.1680616-8-mcgrof@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-02-24 11:44:44 +01:00
Hannes Reinecke	3c20917120	block/bdev: enable large folio support for large logical block sizes Call mapping_set_folio_min_order() when modifying the logical block size to ensure folios are allocated with the correct size. Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: "Matthew Wilcox (Oracle)" <willy@infradead.org> Signed-off-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20250221223823.1680616-7-mcgrof@kernel.org Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-02-24 11:44:44 +01:00
Thorsten Blum	8985c42987	block: Remove commented out code Remove commented out code. Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250219205328.28462-2-thorsten.blum@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-02-21 17:12:21 -07:00
Linus Torvalds	8a61cb6e15	block-6.14-20250221 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAme4rUUQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpj9+EADLPOFPa9hT1PGBbpnj74vBayoTO/M+w2Gp +k2b8if3eGlY43WO2k+ytceWbA901iyvLPRqt1M1Ez8+BrNBg4NKcLv7q4O9NA3i nDPLggugSc5sdRbLRimxiwHkkpSOBenkdb7R9XGmMXTCfSbRKl0kK01ivpgkbiG4 pbyPWYcoMyHaECBfPhazrJig4+rugXOYbkYoOM4NHsLqlTNfmowcMRPu+6czXt7q ITHW2RTWK3ue8q+c3nwGPDk2ZKM8X/49rA/6bvD3voLNs+jQ8KFg2KULENf0Xaq6 1ZGrhLcr45iEHP0/+RORMzx27PqbTCSGIOTMZtwZNqh5+V+ybrGJq/F/T5rkrA3F QqHld/WSSKWJ10RVAyjDP7NQ5vNZTwwGAEVagjyIFEfk7G7RTY2kIpSZiUgrZ9oD 4CkOKUGmVkUsKQW6gb0JQObtYyXXoNtmg8wQU2WwhISjFDkoYWw53LHwH/LnxLyi Vg182amVBmERk4I5nTUiIML/7TzS69srb0Q7yaQS3eTwzLorDaB+3tPAxQmCTkGq KeBfuBtbP3LTOy2Oek4YbKl8CA2KDYtK7FbCE6PECUbdTjpNrcAAgA/ZcgoKV8s7 EHWZFx7dZyFS6LFNWzT9VhTtgSZS92JIsZwgnjSJPV2UazyxmqHFChzVVGWJbaB3 agkMor3nVg== =L+Lr -----END PGP SIGNATURE----- Merge tag 'block-6.14-20250221' of git://git.kernel.dk/linux Pull block fixes from Jens Axboe: - NVMe pull request via Keith: - FC controller state check fixes (Daniel) - PCI Endpoint fixes (Damien) - TCP connection failure fixe (Caleb) - TCP handling C2HTermReq PDU (Maurizio) - RDMA queue state check (Ruozhu) - Apple controller fixes (Hector) - Target crash on disbaled namespace (Hannes) - MD pull request via Yu: - Fix queue limits error handling for raid0, raid1 and raid10 - Fix for a NULL pointer deref in request data mapping - Code cleanup for request merging * tag 'block-6.14-20250221' of git://git.kernel.dk/linux: nvme: only allow entering LIVE from CONNECTING state nvme-fc: rely on state transitions to handle connectivity loss apple-nvme: Support coprocessors left idle apple-nvme: Release power domains when probe fails nvmet: Use enum definitions instead of hardcoded values nvme: Cleanup the definition of the controller config register fields nvme/ioctl: add missing space in err message nvme-tcp: fix connect failure on receiving partial ICResp PDU nvme: tcp: Fix compilation warning with W=1 nvmet: pci-epf: Avoid RCU stalls under heavy workload nvmet: pci-epf: Do not uselessly write the CSTS register nvmet: pci-epf: Correctly initialize CSTS when enabling the controller nvmet-rdma: recheck queue state is LIVE in state lock in recv done nvmet: Fix crash when a namespace is disabled nvme-tcp: add basic support for the C2HTermReq PDU nvme-pci: quirk Acer FA100 for non-uniqueue identifiers block: fix NULL pointer dereferenced within __blk_rq_map_sg block/merge: remove unnecessary min() with UINT_MAX md/raid*: Fix the set_queue_limits implementations	2025-02-21 09:36:28 -08:00
Nam Cao	cab0e0a056	blk_iocost: Switch to use hrtimer_setup() hrtimer_setup() takes the callback function pointer as argument and initializes the timer completely. Replace hrtimer_init() and the open coded initialization of hrtimer::function with the new setup mechanism. Patch was created by using Coccinelle. Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/196d487c925411923a2d59d4bf5e366b9dac2747.1738746821.git.namcao@linutronix.de	2025-02-18 10:32:34 +01:00
Nam Cao	2414f15910	block, bfq: Switch to use hrtimer_setup() hrtimer_setup() takes the callback function pointer as argument and initializes the timer completely. Replace hrtimer_init() and the open coded initialization of hrtimer::function with the new setup mechanism. Patch was created by using Coccinelle. Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/d0d57e1dab46b617856dfb93c721d221cc31ab0b.1738746821.git.namcao@linutronix.de	2025-02-18 10:32:33 +01:00
Ming Lei	dd8b0582e2	block: fix NULL pointer dereferenced within __blk_rq_map_sg The block layer internal flush request may not have bio attached, so the request iterator has to be initialized from valid req->bio, otherwise NULL pointer dereferenced is triggered. Cc: Christoph Hellwig <hch@lst.de> Reported-and-tested-by: Cheyenne Wills <cheyenne.wills@gmail.com> Fixes: `b7175e24d6` ("block: add a dma mapping iterator") Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250217031626.461977-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-02-17 09:04:07 -07:00
Caleb Sander Mateos	43c70b1040	block/merge: remove unnecessary min() with UINT_MAX In bvec_split_segs(), max_bytes is an unsigned, so it must be less than or equal to UINT_MAX. Remove the unnecessary min(). Prior to commit `67927d2201` ("block/merge: count bytes instead of sectors"), the min() was with UINT_MAX >> 9, so it did have an effect. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Link: https://lore.kernel.org/r/20250214193637.234702-1-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-02-14 15:40:17 -07:00
Linus Torvalds	1b8c8cdad1	block-6.14-20250214 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmevfDEQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgptaHEACEqo12wWNcYklms/oy9DxsVEFM7d5waYRR NZy1+i3wbUAGfYl0marBh484kDr7Uyko4YJa0O0LyMKdW7wZOEk36MRUU2+7FeSp 4bFiFlSGyds9kIqjem4dR0ACCL/NW+PS5T79Xh1PnWDEByMH7wbRtPWVT6JJl4r6 PVdi4FB1aV6+C2DayjKbFqR0kDbFnl8INaGw8mg5PpI32A9mCQtl6XU2G/Pw8WVZ 3UJR+DWzfK/lSeVvPiZgOvLHWzi1UB0rKKuWjzbIq7dTtMy241Tox0YRnLsPiNxR ncRHftgEIjgkHjpCT4qQZ/joQfLop6MSkRixWUaORjTRqHHTqhLpj5SzjNlfn0Cb qhb/jf4VoBYD/04NEwvBzNmwyX6xohD07boM2SlnpiPNzBo0pcHzD4YuYzmsUCO4 gE2DeI9NAtDLMB5987Heb2zbvNtWgSM4g9t5zZuKtBEfNPnQwzYKFWeOIbSxmcbN Y5FW+sLXmXLT+li17BeJFzOXp882Lp4oZtSdX1ibTkmdj4P/IcNYuB3Z/VYvF1NO ZY2mBFRdUrii5oBh7iVSkwGIJM/TUwBgjoPlG84F7CoaxK6wQDHovFhkLHUVd7mx JfzTDfbsC/7R934IgLcLDR8uCaLmbMnJNYJqdvGQdR2NVy4azM52zopkHX6ereby DqicWc+Ekg== =zguJ -----END PGP SIGNATURE----- Merge tag 'block-6.14-20250214' of git://git.kernel.dk/linux Pull block fixes from Jens Axboe: - Fix for request rejection for batch addition - Fix a few issues for bogus mac partition tables * tag 'block-6.14-20250214' of git://git.kernel.dk/linux: partitions: mac: fix handling of bogus partition table block: cleanup and fix batch completion adding conditions	2025-02-14 11:40:59 -08:00
Jann Horn	80e648042e	partitions: mac: fix handling of bogus partition table Fix several issues in partition probing: - The bailout for a bad partoffset must use put_dev_sector(), since the preceding read_part_sector() succeeded. - If the partition table claims a silly sector size like 0xfff bytes (which results in partition table entries straddling sector boundaries), bail out instead of accessing out-of-bounds memory. - We must not assume that the partition table contains proper NUL termination - use strnlen() and strncmp() instead of strlen() and strcmp(). Cc: stable@vger.kernel.org Signed-off-by: Jann Horn <jannh@google.com> Link: https://lore.kernel.org/r/20250214-partition-mac-v1-1-c1c626dffbd5@google.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-02-14 08:38:28 -07:00
Muchun Song	a052bfa636	block: refactor rq_qos_wait() When rq_qos_wait() is first introduced, it is easy to understand. But with some bug fixes applied, it is not easy for newcomers to understand the whole logic under those fixes. In this patch, rq_qos_wait() is refactored and more comments are added for better understanding. There are 3 points for the improvement: 1) Use waitqueue_active() instead of wq_has_sleeper() to eliminate unnecessary memory barrier in wq_has_sleeper() which is supposed to be used in waker side. In this case, we do need the barrier. So use the cheaper one to locklessly test for waiters on the queue. 2) Remove acquire_inflight_cb() logic for the first waiter out of the while loop to make the code clear. 3) Add more comments to explain how to sync with different waiters and the waker. Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20250208090416.38642-2-songmuchun@bytedance.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-02-11 13:04:11 -07:00
Muchun Song	36d03cb327	block: introduce init_wait_func() There is already a macro DEFINE_WAIT_FUNC() to declare a wait_queue_entry with a specified waking function. But there is not a counterpart for initializing one wait_queue_entry with a specified waking function. So introducing init_wait_func() for this, which also could be used in iocost and rq-qos. Using default_wake_function() in rq_qos_wait() to wake up waiters, which could remove ->task field from rq_qos_wait_data. Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20250208090416.38642-1-songmuchun@bytedance.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-02-11 13:04:11 -07:00
Eric Biggers	1ebd4a3c09	blk-crypto: add ioctls to create and prepare hardware-wrapped keys Until this point, the kernel can use hardware-wrapped keys to do encryption if userspace provides one -- specifically a key in ephemerally-wrapped form. However, no generic way has been provided for userspace to get such a key in the first place. Getting such a key is a two-step process. First, the key needs to be imported from a raw key or generated by the hardware, producing a key in long-term wrapped form. This happens once in the whole lifetime of the key. Second, the long-term wrapped key needs to be converted into ephemerally-wrapped form. This happens each time the key is "unlocked". In Android, these operations are supported in a generic way through KeyMint, a userspace abstraction layer. However, that method is Android-specific and can't be used on other Linux systems, may rely on proprietary libraries, and also misleads people into supporting KeyMint features like rollback resistance that make sense for other KeyMint keys but don't make sense for hardware-wrapped inline encryption keys. Therefore, this patch provides a generic kernel interface for these operations by introducing new block device ioctls: - BLKCRYPTOIMPORTKEY: convert a raw key to long-term wrapped form. - BLKCRYPTOGENERATEKEY: have the hardware generate a new key, then return it in long-term wrapped form. - BLKCRYPTOPREPAREKEY: convert a key from long-term wrapped form to ephemerally-wrapped form. These ioctls are implemented using new operations in blk_crypto_ll_ops. Signed-off-by: Eric Biggers <ebiggers@google.com> Tested-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org> # sm8650 Link: https://lore.kernel.org/r/20250204060041.409950-4-ebiggers@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-02-10 09:54:19 -07:00
Eric Biggers	e35fde43e2	blk-crypto: show supported key types in sysfs Add sysfs files that indicate which type(s) of keys are supported by the inline encryption hardware associated with a particular request queue: /sys/block/$disk/queue/crypto/hw_wrapped_keys /sys/block/$disk/queue/crypto/raw_keys Userspace can use the presence or absence of these files to decide what encyption settings to use. Don't use a single key_type file, as devices might support both key types at the same time. Signed-off-by: Eric Biggers <ebiggers@google.com> Tested-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org> # sm8650 Link: https://lore.kernel.org/r/20250204060041.409950-3-ebiggers@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-02-10 09:54:19 -07:00
Eric Biggers	ebc4176551	blk-crypto: add basic hardware-wrapped key support To prevent keys from being compromised if an attacker acquires read access to kernel memory, some inline encryption hardware can accept keys which are wrapped by a per-boot hardware-internal key. This avoids needing to keep the raw keys in kernel memory, without limiting the number of keys that can be used. Such hardware also supports deriving a "software secret" for cryptographic tasks that can't be handled by inline encryption; this is needed for fscrypt to work properly. To support this hardware, allow struct blk_crypto_key to represent a hardware-wrapped key as an alternative to a raw key, and make drivers set flags in struct blk_crypto_profile to indicate which types of keys they support. Also add the ->derive_sw_secret() low-level operation, which drivers supporting wrapped keys must implement. For more information, see the detailed documentation which this patch adds to Documentation/block/inline-encryption.rst. Signed-off-by: Eric Biggers <ebiggers@google.com> Tested-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org> # sm8650 Link: https://lore.kernel.org/r/20250204060041.409950-2-ebiggers@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-02-10 09:54:19 -07:00

1 2 3 4 5 ...

7899 Commits