linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git synced 2025-08-25 03:43:55 +00:00

Author	SHA1	Message	Date
Mohamed Khalfella	80f21806b8	nvmet: exit debugfs after discovery subsystem exits Commit `528589947c` ("nvmet: initialize discovery subsys after debugfs is initialized") changed nvmet_init() to initialize nvme discovery after "nvmet" debugfs directory is initialized. The change broke nvmet_exit() because discovery subsystem now depends on debugfs. Debugfs should be destroyed after discovery subsystem. Fix nvmet_exit() to do that. Reported-by: Yi Zhang <yi.zhang@redhat.com> Closes: https://lore.kernel.org/all/CAHj4cs96AfFQpyDKF_MdfJsnOEo=2V7dQgqjFv+k3t7H-=yGhA@mail.gmail.com/ Fixes: `528589947c` ("nvmet: initialize discovery subsys after debugfs is initialized") Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Daniel Wagner <dwagner@suse.de> Link: https://lore.kernel.org/r/20250807053507.2794335-1-mkhalfella@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-08-07 06:27:58 -06:00
Bjorn Helgaas	367c240b0a	nvme: fix various comment typos Fix typos in comments. Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2025-07-31 06:35:58 -07:00
Jiapeng Chong	b6160cd2c4	nvme-auth: remove unneeded semicolon No functional modification involved. ./drivers/nvme/host/auth.c:745:2-3: Unneeded semicolon. ./drivers/nvme/host/auth.c:755:2-3: Unneeded semicolon. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=22937 Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>	2025-07-31 06:35:55 -07:00
Keith Busch	4e6e151cf9	nvme-pci: fix leak on sgl setup error We need to free the descriptor that was allocated. We also don't necessarily need to unmap each sgl entry, which was previously being attempted unconditionally. Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>	2025-07-31 06:35:51 -07:00
Mohamed Khalfella	528589947c	nvmet: initialize discovery subsys after debugfs is initialized During nvme target initialization discovery subsystem is initialized before "nvmet" debugfs directory is created. This results in discovery subsystem debugfs directory to be created in debugfs root directory. nvmet_init() -> nvmet_init_discovery() -> nvmet_subsys_alloc() -> nvmet_debugfs_subsys_setup() In other words, the codepath above is exeucted before nvmet_debugfs is created. We get /sys/kernel/debug/nqn.2014-08.org.nvmexpress.discovery instead of /sys/kernel/debug/nvmet/nqn.2014-08.org.nvmexpress.discovery. Move nvmet_init_discovery() call after nvmet_init_debugfs() to fix it. Fixes: `649fd41420` ("nvmet: add debugfs support") Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>	2025-07-31 06:35:46 -07:00
Kamaljit Singh	e715b8733d	nvme: add capability to connect to an administrative controller Add capability to connect to an administrative controller by preventing ioq creation for admin-controllers. Add a nvme_admin_ctrl() to check if a controller's CNTRLTYPE indicates that it is an administrative controller and override ctrl->queue_count to 1 for admin controllers, so that only the admin queue and no I/O queues are created for an administrative controller. This override is done in nvme_init_ctrl_finish() after ctrl->cntrltype has been initialized in nvme_init_identify() so nvme_admin_ctrl() will work correctly. Doing this override in generic code (nvme_init_ctrl_finish) makes it transport agnostic and will work properly for nvme/tcp as well as for nvme/rdma. Suggested-by: Niklas Cassel <cassel@kernel.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Niklas Cassel <cassel@kernel.org> Signed-off-by: Kamaljit Singh <kamaljit.singh1@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2025-07-31 06:35:44 -07:00
Nitesh Shetty	c71fc0f457	nvmet: add support for FDP in fabrics passthru path Add support for admin_get_feature FDP(0x1d) feature id, thus enabling FDP at the initiator side for the target controller and namespaces attached to it. Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2025-07-31 06:35:43 -07:00
Linus Torvalds	6e11664f14	for-6.17/block-20250728 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmiHdZ8QHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgptRED/9o3dQ1QHL5yNM/AyCCGox0V4zra8qGS/Vc cBWpAVrmPGRw0IYlLZENtN9PdwKcbMzJq3l6cxeC7dBnAZP0AxTzP4YYJYUNVsqo WtJ3d/k5+cVp0OyOp4uabaqNeMeLoPk9/JXe1Ml2KxtDmHtj5yee0JRh7zlPZmZj tsrpIUTeHgAPn6yR1EI+0ybx/mjCb05Mv2Y8gF5hkUPA2PuON+MTFixJmqoy2ySh n+22mz/prqlyOSYh/VVv1+9jcQ94wMjcW0JIpg9lM3Kg8BCPU4IetvO1UiX6X33v 154zEh2aJJDBx+yORS4BM4JMXjRZI7lYea2dkHM8Cajctu1Wpja9bNwnK9ibXvEc WtyBwztleLbAZef25fA/W87JE23fGa/r3nwIb2cF4QqkAFslCvhjA93WkOzNJCgQ qsWOrlCh3IK2NUu4b1Ncs3ZHOPvc51+zzjMzC6SUr54xhrxDK+gngDPhRy7XDqWJ DTMpIlr366o8GdJqnib0/e/CPBrThS6Vl6u0tgLnNbwdpK1svgo/uHW5ksKvDqHX kGEIhyRRJJC+4wyl4dsYKXa2twcyFrlWdAE+pZguEC2nZRYqYl9uXftOtvfp1x0y /skDX0FIDjvyjRqCLcqF03FSGqwCGS8WuWXZjPhVhcfz47NvbHeFDh1G/jMzsbpj S9zrPve/DQ== =e86T -----END PGP SIGNATURE----- Merge tag 'for-6.17/block-20250728' of git://git.kernel.dk/linux Pull block updates from Jens Axboe: - MD pull request via Yu: - call del_gendisk synchronously (Xiao) - cleanup unused variable (John) - cleanup workqueue flags (Ryo) - fix faulty rdev can't be removed during resync (Qixing) - NVMe pull request via Christoph: - try PCIe function level reset on init failure (Keith Busch) - log TLS handshake failures at error level (Maurizio Lombardi) - pci-epf: do not complete commands twice if nvmet_req_init() fails (Rick Wertenbroek) - misc cleanups (Alok Tiwari) - Removal of the pktcdvd driver This has been more than a decade coming at this point, and some recently revealed breakages that had it causing issues even for cases where it isn't required made me re-pull the trigger on this one. It's known broken and nobody has stepped up to maintain the code - Series for ublk supporting batch commands, enabling the use of multishot where appropriate - Speed up ublk exit handling - Fix for the two-stage elevator fixing which could leak data - Convert NVMe to use the new IOVA based API - Increase default max transfer size to something more reasonable - Series fixing write operations on zoned DM devices - Add tracepoints for zoned block device operations - Prep series working towards improving blk-mq queue management in the presence of isolated CPUs - Don't allow updating of the block size of a loop device that is currently under exclusively ownership/open - Set chunk sectors from stacked device stripe size and use it for the atomic write size limit - Switch to folios in bcache read_super() - Fix for CD-ROM MRW exit flush handling - Various tweaks, fixes, and cleanups * tag 'for-6.17/block-20250728' of git://git.kernel.dk/linux: (94 commits) block: restore two stage elevator switch while running nr_hw_queue update cdrom: Call cdrom_mrw_exit from cdrom_release function sunvdc: Balance device refcount in vdc_port_mpgroup_check nvme-pci: try function level reset on init failure dm: split write BIOs on zone boundaries when zone append is not emulated block: use chunk_sectors when evaluating stacked atomic write limits dm-stripe: limit chunk_sectors to the stripe size md/raid10: set chunk_sectors limit md/raid0: set chunk_sectors limit block: sanitize chunk_sectors for atomic write limits ilog2: add max_pow_of_two_factor() nvmet: pci-epf: Do not complete commands twice if nvmet_req_init() fails nvme-tcp: log TLS handshake failures at error level docs: nvme: fix grammar in nvme-pci-endpoint-target.rst nvme: fix typo in status code constant for self-test in progress nvmet: remove redundant assignment of error code in nvmet_ns_enable() nvme: fix incorrect variable in io cqes error message nvme: fix multiple spelling and grammar issues in host drivers block: fix blk_zone_append_update_request_bio() kernel-doc md/raid10: fix set but not used variable in sync_request_write() ...	2025-07-28 16:43:54 -07:00
Linus Torvalds	cec40a7c80	vfs-6.17-rc1.integrity -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaINCngAKCRCRxhvAZXjc ogAMAP9LqNHFf7JfDIvF/PJBxzYa0ToWwPsWACERknwkvtBRCwEAhkmscIcIMQ4t LPGLGha17dfpaE4RurRhBYgS9x2/1Ao= =jSnJ -----END PGP SIGNATURE----- Merge tag 'vfs-6.17-rc1.integrity' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs 'protection info' updates from Christian Brauner: "This adds the new FS_IOC_GETLBMD_CAP ioctl() to query metadata and protection info (PI) capabilities. This ioctl returns information about the files integrity profile. This is useful for userspace applications to understand a files end-to-end data protection support and configure the I/O accordingly. For now this interface is only supported by block devices. However the design and placement of this ioctl in generic FS ioctl space allows us to extend it to work over files as well. This maybe useful when filesystems start supporting PI-aware layouts. A new structure struct logical_block_metadata_cap is introduced, which contains the following fields: - lbmd_flags: bitmask of logical block metadata capability flags - lbmd_interval: the amount of data described by each unit of logical block metadata - lbmd_size: size in bytes of the logical block metadata associated with each interval - lbmd_opaque_size: size in bytes of the opaque block tag associated with each interval - lbmd_opaque_offset: offset in bytes of the opaque block tag within the logical block metadata - lbmd_pi_size: size in bytes of the T10 PI tuple associated with each interval - lbmd_pi_offset: offset in bytes of T10 PI tuple within the logical block metadata - lbmd_pi_guard_tag_type: T10 PI guard tag type - lbmd_pi_app_tag_size: size in bytes of the T10 PI application tag - lbmd_pi_ref_tag_size: size in bytes of the T10 PI reference tag - lbmd_pi_storage_tag_size: size in bytes of the T10 PI storage tag The internal logic to fetch the capability is encapsulated in a helper function blk_get_meta_cap(), which uses the blk_integrity profile associated with the device. The ioctl returns -EOPNOTSUPP, if CONFIG_BLK_DEV_INTEGRITY is not enabled" * tag 'vfs-6.17-rc1.integrity' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: block: fix lbmd_guard_tag_type assignment in FS_IOC_GETLBMD_CAP block: fix FS_IOC_GETLBMD_CAP parsing in blkdev_common_ioctl() fs: add ioctl to query metadata and protection info capabilities nvme: set pi_offset only when checksum type is not BLK_INTEGRITY_CSUM_NONE block: introduce pi_tuple_size field in blk_integrity block: rename tuple_size field in blk_integrity to metadata_size	2025-07-28 15:12:00 -07:00
Linus Torvalds	278c7d9b5e	vfs-6.17-rc1.fallocate -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaINCeQAKCRCRxhvAZXjc otqEAP9bWFExQtnzrNR+1s4UBfPVDAaTJzDnBWj6z0+Idw9oegEAoxF2ifdCPnR4 t/xWiM4FmSA+9pwvP3U5z3sOReDDsgo= =WMMB -----END PGP SIGNATURE----- Merge tag 'vfs-6.17-rc1.fallocate' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull fallocate updates from Christian Brauner: "fallocate() currently supports creating preallocated files efficiently. However, on most filesystems fallocate() will preallocate blocks in an unwriten state even if FALLOC_FL_ZERO_RANGE is specified. The extent state must later be converted to a written state when the user writes data into this range, which can trigger numerous metadata changes and journal I/O. This may leads to significant write amplification and performance degradation in synchronous write mode. At the moment, the only method to avoid this is to create an empty file and write zero data into it (for example, using 'dd' with a large block size). However, this method is slow and consumes a considerable amount of disk bandwidth. Now that more and more flash-based storage devices are available it is possible to efficiently write zeros to SSDs using the unmap write zeroes command if the devices do not write physical zeroes to the media. For example, if SCSI SSDs support the UMMAP bit or NVMe SSDs support the DEAC bit[1], the write zeroes command does not write actual data to the device, instead, NVMe converts the zeroed range to a deallocated state, which works fast and consumes almost no disk write bandwidth. This series implements the BLK_FEAT_WRITE_ZEROES_UNMAP feature and BLK_FLAG_WRITE_ZEROES_UNMAP_DISABLED flag for SCSI, NVMe and device-mapper drivers, and add the FALLOC_FL_WRITE_ZEROES and STATX_ATTR_WRITE_ZEROES_UNMAP support for ext4 and raw bdev devices. fallocate() is subsequently extended with the FALLOC_FL_WRITE_ZEROES flag. FALLOC_FL_WRITE_ZEROES zeroes a specified file range in such a way that subsequent writes to that range do not require further changes to the file mapping metadata. This flag is beneficial for subsequent pure overwriting within this range, as it can save on block allocation and, consequently, significant metadata changes" * tag 'vfs-6.17-rc1.fallocate' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: ext4: add FALLOC_FL_WRITE_ZEROES support block: add FALLOC_FL_WRITE_ZEROES support block: factor out common part in blkdev_fallocate() fs: introduce FALLOC_FL_WRITE_ZEROES to fallocate dm: clear unmap write zeroes limits when disabling write zeroes scsi: sd: set max_hw_wzeroes_unmap_sectors if device supports SD_ZERO_*_UNMAP nvmet: set WZDS and DRB if device enables unmap write zeroes operation nvme: set max_hw_wzeroes_unmap_sectors if device supports DEAC bit block: introduce max_{hw\|user}_wzeroes_unmap_sectors to queue limits	2025-07-28 13:36:49 -07:00
Linus Torvalds	e5ac874257	block-6.16-20250718 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmh6ZU8QHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpmRdD/9Q6I5VC13uVjbrXLA3R4d+gLsDzcVv3lIp ps9HBz1s5yXIP9hb68pnIu6H+SGKyzd83Uqst/74+NzQWAuDaWO9ydT1DLu5bpHS Q1qjA1seIbhPRi184wXSqjr3OgaX0rNdzOkWL/PKQ0dHFx54adrXiu3qoSWvBQYg YUrMvFFmNN7gQdTagburM+g4RXRWqhqcn0FJfyb1IX90gQNVCv8JKY2NkJbF9SIM rlAQoZefoiX+5Fo8dGIutaZRZ1X04lIv9S5oXxzgw/4xhUtGrVfL1mwSCS9twQtp 5r2v7dcUqCxZ1pwHJazMene/Y5540ycZR3KgMsh8Ggxs9is1GbzbrXMe3gdDogTR 10k9X1C0NLkeQ6h12kcX9TlMuN4jbBRbXsNQQnTd0XEvMUVxggRcg3j/TQ/+W5Uj eEMmWKbZD1PZsxqxqKJ8T0NzNY5JdZYdRLo+4lrp3Lw2b3o1cyUQ2pONGNRzEClj 4iHWQuopbB5AV3jo9lxOrZD8tZywDNjNFYFz7aTQ5OXIA98lbAM5/0NXcExvk047 5FAjzo0dbfHVFX3jfPwTUifxFXZ3nDJSBBO2y6tKvllwZU6f/gIMSPCbeP5yQsdW jwGv5IBRBvLj7RDChSFo14KQdupNhBmIfru6huZwtT8vj4IHXRkaRAgMFqE23owx 4HLPoGR5Ww== =hRsE -----END PGP SIGNATURE----- Merge tag 'block-6.16-20250718' of git://git.kernel.dk/linux Pull block fixes from Jens Axboe: - NVMe changes via Christoph: - revert the cross-controller atomic write size validation that caused regressions (Christoph Hellwig) - fix endianness of command word printout in nvme_log_err_passthru() (John Garry) - fix callback lock for TLS handshake (Maurizio Lombardi) - fix misaccounting of nvme-mpath inflight I/O (Yu Kuai) - fix inconsistent RCU list manipulation in nvme_ns_add_to_ctrl_list() (Zheng Qixing) - Fix for a kobject leak in queue unregistration - Fix for loop async file write start/end handling * tag 'block-6.16-20250718' of git://git.kernel.dk/linux: loop: use kiocb helpers to fix lockdep warning nvmet-tcp: fix callback lock for TLS handshake nvme: fix misaccounting of nvme-mpath inflight I/O nvme: revert the cross-controller atomic write size validation nvme: fix endianness of command word prints in nvme_log_err_passthru() nvme: fix inconsistent RCU list manipulation in nvme_ns_add_to_ctrl_list() block: fix kobject leak in blk_unregister_queue	2025-07-18 12:16:13 -07:00
Keith Busch	5b2c214a95	nvme-pci: try function level reset on init failure NVMe devices from multiple vendors appear to get stuck in a reset state that we can't get out of with an NVMe level Controller Reset. The kernel would report these with messages that look like: Device not ready; aborting reset, CSTS=0x1 These have historically required a power cycle to make them usable again, but in many cases, a PCIe FLR is sufficient to restart operation without a power cycle. Try it if the initial controller reset fails during any nvme reset attempt. Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2025-07-17 17:46:33 +02:00
Rick Wertenbroek	746d0ac5a0	nvmet: pci-epf: Do not complete commands twice if nvmet_req_init() fails Have nvmet_req_init() and req->execute() complete failed commands. Description of the problem: nvmet_req_init() calls __nvmet_req_complete() internally upon failure, e.g., unsupported opcode, which calls the "queue_response" callback, this results in nvmet_pci_epf_queue_response() being called, which will call nvmet_pci_epf_complete_iod() if data_len is 0 or if dma_dir is different from DMA_TO_DEVICE. This results in a double completion as nvmet_pci_epf_exec_iod_work() also calls nvmet_pci_epf_complete_iod() when nvmet_req_init() fails. Steps to reproduce: On the host send a command with an unsupported opcode with nvme-cli, For example the admin command "security receive" $ sudo nvme security-recv /dev/nvme0n1 -n1 -x4096 This triggers a double completion as nvmet_req_init() fails and nvmet_pci_epf_queue_response() is called, here iod->dma_dir is still in the default state of "DMA_NONE" as set by default in nvmet_pci_epf_alloc_iod(), so nvmet_pci_epf_complete_iod() is called. Because nvmet_req_init() failed nvmet_pci_epf_complete_iod() is also called in nvmet_pci_epf_exec_iod_work() leading to a double completion. This not only sends two completions to the host but also corrupts the state of the PCI NVMe target leading to kernel oops. This patch lets nvmet_req_init() and req->execute() complete all failed commands, and removes the double completion case in nvmet_pci_epf_exec_iod_work() therefore fixing the edge cases where double completions occurred. Fixes: `0faa0fe6f9` ("nvmet: New NVMe PCI endpoint function target driver") Signed-off-by: Rick Wertenbroek <rick.wertenbroek@gmail.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2025-07-17 13:39:57 +02:00
Maurizio Lombardi	5a58ac9bfc	nvme-tcp: log TLS handshake failures at error level Update the nvme_tcp_start_tls() function to use dev_err() instead of dev_dbg() when a TLS error is detected. This ensures that handshake failures are visible by default, aiding in debugging. Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> Reviewed-by: Laurence Oberman <loberman@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>	2025-07-17 13:38:07 +02:00
Alok Tiwari	b5cd5f1e50	nvme: fix typo in status code constant for self-test in progress Correct a typo error in the NVMe status code constant from NVME_SC_SELT_TEST_IN_PROGRESS to NVME_SC_SELF_TEST_IN_PROGRESS to accurately reflect its meaning. Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2025-07-17 13:38:07 +02:00
Alok Tiwari	2e7dd5c1a8	nvmet: remove redundant assignment of error code in nvmet_ns_enable() Remove the unnecessary ret = -EMFILE; assignment since it is immediately overwritten by the result of nvmet_bdev_ns_enable() The initial value (-EMFILE) is redundant because it has no effect on the code logic or outcome. Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2025-07-17 13:38:07 +02:00
Alok Tiwari	3b1eabed27	nvme: fix incorrect variable in io cqes error message Correct the error log to print ctrl->io_cqes instead of incorrectly using ctrl->io_sqes for the io cqes size check. Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2025-07-17 13:38:07 +02:00
Alok Tiwari	164c187d25	nvme: fix multiple spelling and grammar issues in host drivers This commit fixes several typos and grammatical issues across various nvme host driver files: - correct "glace" to "glance" in a comment in apple.c - fix "Idependent" to "Independent" in core.c - change "unsucceesful" to "unsuccessful", "they blk-mq" to "the blk-mq", - fix "terminaed" to "terminated" and other grammar in fc.c - update "O's" to "0's" to clarify meaning in nvme.h - fix a function name reference in a comment in zns.c: _transter_len() -> _transfer_len(). - fix sysfs_emit() output format in pci.c (replace x%08x with 0x%08x) These changes improve the code readability and documentation consistency across the NVMe driver. Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2025-07-17 13:38:06 +02:00
Maurizio Lombardi	0523c6cc87	nvmet-tcp: fix callback lock for TLS handshake When restoring the default socket callbacks during a TLS handshake, we need to acquire a write lock on sk_callback_lock. Previously, a read lock was used, which is insufficient for modifying sk_user_data and sk_data_ready. Fixes: `675b453e02` ("nvmet-tcp: enable TLS handshake upcall") Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2025-07-15 09:49:13 +02:00
Yu Kuai	71257925e8	nvme: fix misaccounting of nvme-mpath inflight I/O Procedures for nvme-mpath IO accounting: 1) initialize nvme_request and clear flags; 2) set NVME_MPATH_IO_STATS and increase inflight counter when IO started; 3) check NVME_MPATH_IO_STATS and decrease inflight counter when IO is done; However, for the case nvme_fail_nonready_command(), both step 1) and 2) are skipped, and if old nvme_request set NVME_MPATH_IO_STATS and then request is reused, step 3) will still be executed, causing inflight I/O counter to be negative. Fix the problem by clearing nvme_request in nvme_fail_nonready_command(). Fixes: `ea5e5f42cd` ("nvme-fabrics: avoid double completions in nvmf_fail_nonready_command") Reported-by: Yi Zhang <yi.zhang@redhat.com> Closes: https://lore.kernel.org/all/CAHj4cs_+dauobyYyP805t33WMJVzOWj=7+51p4_j9rA63D9sog@mail.gmail.com/ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2025-07-15 09:49:13 +02:00
Christoph Hellwig	1fc09f2961	nvme: revert the cross-controller atomic write size validation This was originally added by commit `8695f060a0` ("nvme: all namespaces in a subsystem must adhere to a common atomic write size") to check the all controllers in a subsystem report the same atomic write size, but the check wasn't quite correct and caused problems for devices with multiple namespaces that report different LBA sizes. Commit `f46d273449` ("nvme: fix atomic write size validation") tried to fix this, but then caused problems for namespace rediscovery after a format with an LBA size change that changes the AWUPF value. This drops the validation and essentially reverts those two commits while keeping the cleanup that went in between the two. We'll need to figure out how to properly check for the mouse trap that nvme left us, but for now revert the check to keep devices working for users who couldn't care less about the atomic write feature. Fixes: `8695f060a0` ("nvme: all namespaces in a subsystem must adhere to a common atomic write size") Fixes: `f46d273449` ("nvme: fix atomic write size validation") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alan Adamson <alan.adamson@oracle.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Tested-by: Alan Adamson <alan.adamson@oracle.com>	2025-07-15 09:49:13 +02:00
John Garry	dd8e34afd6	nvme: fix endianness of command word prints in nvme_log_err_passthru() The command word members of struct nvme_common_command are __le32 type, so use helper le32_to_cpu() to read them properly. Fixes: `9f079dda14` ("nvme: allow passthru cmd error logging") Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Alan Adamson <alan.adamson@oracle.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>	2025-07-14 15:51:06 +02:00
Zheng Qixing	80d7762e0a	nvme: fix inconsistent RCU list manipulation in nvme_ns_add_to_ctrl_list() When inserting a namespace into the controller's namespace list, the function uses list_add_rcu() when the namespace is inserted in the middle of the list, but falls back to a regular list_add() when adding at the head of the list. This inconsistency could lead to race conditions during concurrent access, as users might observe a partially updated list. Fix this by consistently using list_add_rcu() in both code paths to ensure proper RCU protection throughout the entire function. Fixes: `be647e2c76` ("nvme: use srcu for iterating namespace list") Signed-off-by: Zheng Qixing <zhengqixing@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2025-07-14 15:50:32 +02:00
Christoph Hellwig	1bb94ff5ab	nvme-pci: don't allocate dma_vec for IOVA mappings Not only do IOVA mappings no need the separate dma_vec tracking, it also won't free it and thus leak the allocations. Fixes: `b8b7570a7e` ("nvme-pci: fix dma unmapping when using PRPs and not using the IOVA mapping") Reported-by: Klara Modin <klarasmodin@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Klara Modin <klarasmodin@gmail.com> Link: https://lore.kernel.org/r/20250711112250.633269-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-07-11 07:46:15 -06:00
Christoph Hellwig	b8b7570a7e	nvme-pci: fix dma unmapping when using PRPs and not using the IOVA mapping The current version of the blk_rq_dma_map support in nvme-pci tries to reconstruct the DMA mappings from the on the wire descriptors if they are needed for unmapping. While this is not the case for the direct mapping fast path and the IOVA path, it is needed for the non-IOVA slow path, e.g. when using the interconnect is not dma coherent, when using swiotlb bounce buffering, or a IOMMU mapping that can't coalesce. While the reconstruction is easy and works fine for the SGL path, where the on the wire representation maps 1:1 to DMA mappings, the code to reconstruct the DMA mapping ranges from PRPs can't always work, as a given PRP layout can come from different DMA mappings, and the current code doesn't even always get that right. Give up on this approach and track the actual DMA mapping when actually needed again. Fixes: `7ce3c1dd78` ("nvme-pci: convert the data mapping to blk_rq_dma_map") Reported-by: Ben Copeland <ben.copeland@linaro.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Tested-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/20250707125223.3022531-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-07-08 06:54:52 -06:00
Linus Torvalds	1880df2cf4	block-6.16-20250704 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmhn80AQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgphdSD/93OEB7MwxhhzhaU9U0eiYRPlXcV9+nRMKI kSjPM/JFdGsiUGcEBvNvSNqJCpxQTytv+1JTPO4KhQ4hjiGDnuuaw51h7Ro3uRlp 75Up2uWnh9RaVRCABJQnHVd6zizij0RFHJYwlYlIXkGVQ6vqmaGz1Y4GAeGD4Jw+ iokVENz4uH9n5Zn3oruvufZk+uffZ++Sr4Vqtq3hVJ78ZWOV+iLXzHJSCmEnWSQL QptFP+MDSd9o0ej5bKLDP6kG4xIvMkBl9JY+Y2QH+Rev5Jroc26GmTcgwbRTkXDi hHQgilwmq4LkMyTGDaH2M7BlXoJlAhnWt7/2da9yr6ygLwHoD9LU2ALgGBKgb0r9 E/YrM2ioEC8lkKUGgalX9JReXTExGBvNeaKixi+CoNKDXMauEbJUNkSOH6kfstRo 5QCdn5g9l0Bf6qKBBmAnfty5mDtw9F3mowefxv2DFAPebXD+2I2FyIuafC5LedlE llsC77t2vBBKOAqL+WXypyYKTKAxMSk9NRO4FFkF9OFDdJIruofHXy0Nsi8aHLV7 defzDrr9y1plYHqjMzJy8VfLvv+2YDrmkldBgcfxMRBWfetD3XIOGCmpBFmdOcgx FUqviNDc7Yr2LyDwMdIPfS8ZqmAdmB198/c7UrRdiZe/QyB7tMeeo1vzeCw3XF3n srEJ1bJLxA== =1VG9 -----END PGP SIGNATURE----- Merge tag 'block-6.16-20250704' of git://git.kernel.dk/linux Pull block fixes from Jens Axboe: - NVMe fixes via Christoph: - fix incorrect cdw15 value in passthru error logging (Alok Tiwari) - fix memory leak of bio integrity in nvmet (Dmitry Bogdanov) - refresh visible attrs after being checked (Eugen Hristev) - fix suspicious RCU usage warning in the multipath code (Geliang Tang) - correctly account for namespace head reference counter (Nilay Shroff) - Fix for a regression introduced in ublk in this cycle, where it would attempt to queue a canceled request. - brd RCU sleeping fix, also introduced in this cycle. Bare bones fix, should be improved upon for the next release. * tag 'block-6.16-20250704' of git://git.kernel.dk/linux: brd: fix sleeping function called from invalid context in brd_insert_page() ublk: don't queue request if the associated uring_cmd is canceled nvme-multipath: fix suspicious RCU usage warning nvme-pci: refresh visible attrs after being checked nvmet: fix memory leak of bio integrity nvme: correctly account for namespace head reference counter nvme: Fix incorrect cdw15 value in passthru error logging	2025-07-04 09:33:59 -07:00
Daniel Wagner	4082c98c1f	nvme-pci: use block layer helpers to calculate num of queues The calculation of the upper limit for queues does not depend solely on the number of possible CPUs; for example, the isolcpus kernel command-line option must also be considered. To account for this, the block layer provides a helper function to retrieve the maximum number of queues. Use it to set an appropriate upper queue number limit. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Daniel Wagner <wagi@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20250617-isolcpus-queue-counters-v1-3-13923686b54b@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-07-01 10:24:19 -06:00
Anuj Gupta	f3ee506591	nvme: set pi_offset only when checksum type is not BLK_INTEGRITY_CSUM_NONE protection information is treated as opaque when checksum type is BLK_INTEGRITY_CSUM_NONE. In order to maintain the right metadata semantics, set pi_offset only in cases where checksum type is not BLK_INTEGRITY_CSUM_NONE. Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Link: https://lore.kernel.org/20250630090548.3317-4-anuj20.g@samsung.com Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-07-01 14:00:15 +02:00
Anuj Gupta	76e45252a4	block: introduce pi_tuple_size field in blk_integrity Introduce a new pi_tuple_size field in struct blk_integrity to explicitly represent the size (in bytes) of the protection information (PI) tuple. This is a prep patch. Add validation in blk_validate_integrity_limits() to ensure that pi size matches the expected size for known checksum types and never exceeds the pi_tuple_size. Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Link: https://lore.kernel.org/20250630090548.3317-3-anuj20.g@samsung.com Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-07-01 14:00:15 +02:00
Anuj Gupta	c6603b1d65	block: rename tuple_size field in blk_integrity to metadata_size The tuple_size field in blk_integrity currently represents the total size of metadata associated with each data interval. To make the meaning more explicit, rename tuple_size to metadata_size. This is a purely mechanical rename with no functional changes. Suggested-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Link: https://lore.kernel.org/20250630090548.3317-2-anuj20.g@samsung.com Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-07-01 14:00:14 +02:00
Geliang Tang	d681107420	nvme-multipath: fix suspicious RCU usage warning When I run the NVME over TCP test in virtme-ng, I get the following "suspicious RCU usage" warning in nvme_mpath_add_sysfs_link(): ''' [ 5.024557][ T44] nvmet: Created nvm controller 1 for subsystem nqn.2025-06.org.nvmexpress.mptcp for NQN nqn.2014-08.org.nvmexpress:uuid:f7f6b5e0-ff97-4894-98ac-c85309e0bc77. [ 5.027401][ T183] nvme nvme0: creating 2 I/O queues. [ 5.029017][ T183] nvme nvme0: mapped 2/0/0 default/read/poll queues. [ 5.032587][ T183] nvme nvme0: new ctrl: NQN "nqn.2025-06.org.nvmexpress.mptcp", addr 127.0.0.1:4420, hostnqn: nqn.2014-08.org.nvmexpress:uuid:f7f6b5e0-ff97-4894-98ac-c85309e0bc77 [ 5.042214][ T25] [ 5.042440][ T25] ============================= [ 5.042579][ T25] WARNING: suspicious RCU usage [ 5.042705][ T25] 6.16.0-rc3+ #23 Not tainted [ 5.042812][ T25] ----------------------------- [ 5.042934][ T25] drivers/nvme/host/multipath.c:1203 RCU-list traversed in non-reader section!! [ 5.043111][ T25] [ 5.043111][ T25] other info that might help us debug this: [ 5.043111][ T25] [ 5.043341][ T25] [ 5.043341][ T25] rcu_scheduler_active = 2, debug_locks = 1 [ 5.043502][ T25] 3 locks held by kworker/u9:0/25: [ 5.043615][ T25] #0: ffff888008730948 ((wq_completion)async){+.+.}-{0:0}, at: process_one_work+0x7ed/0x1350 [ 5.043830][ T25] #1: ffffc900001afd40 ((work_completion)(&entry->work)){+.+.}-{0:0}, at: process_one_work+0xcf3/0x1350 [ 5.044084][ T25] #2: ffff888013ee0020 (&head->srcu){.+.+}-{0:0}, at: nvme_mpath_add_sysfs_link.part.0+0xb4/0x3a0 [ 5.044300][ T25] [ 5.044300][ T25] stack backtrace: [ 5.044439][ T25] CPU: 0 UID: 0 PID: 25 Comm: kworker/u9:0 Not tainted 6.16.0-rc3+ #23 PREEMPT(full) [ 5.044441][ T25] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 5.044442][ T25] Workqueue: async async_run_entry_fn [ 5.044445][ T25] Call Trace: [ 5.044446][ T25] <TASK> [ 5.044449][ T25] dump_stack_lvl+0x6f/0xb0 [ 5.044453][ T25] lockdep_rcu_suspicious.cold+0x4f/0xb1 [ 5.044457][ T25] nvme_mpath_add_sysfs_link.part.0+0x2fb/0x3a0 [ 5.044459][ T25] ? queue_work_on+0x90/0xf0 [ 5.044461][ T25] ? lockdep_hardirqs_on+0x78/0x110 [ 5.044466][ T25] nvme_mpath_set_live+0x1e9/0x4f0 [ 5.044470][ T25] nvme_mpath_add_disk+0x240/0x2f0 [ 5.044472][ T25] ? __pfx_nvme_mpath_add_disk+0x10/0x10 [ 5.044475][ T25] ? add_disk_fwnode+0x361/0x580 [ 5.044480][ T25] nvme_alloc_ns+0x81c/0x17c0 [ 5.044483][ T25] ? kasan_quarantine_put+0x104/0x240 [ 5.044487][ T25] ? __pfx_nvme_alloc_ns+0x10/0x10 [ 5.044495][ T25] ? __pfx_nvme_find_get_ns+0x10/0x10 [ 5.044496][ T25] ? rcu_read_lock_any_held+0x45/0xa0 [ 5.044498][ T25] ? validate_chain+0x232/0x4f0 [ 5.044503][ T25] nvme_scan_ns+0x4c8/0x810 [ 5.044506][ T25] ? __pfx_nvme_scan_ns+0x10/0x10 [ 5.044508][ T25] ? find_held_lock+0x2b/0x80 [ 5.044512][ T25] ? ktime_get+0x16d/0x220 [ 5.044517][ T25] ? kvm_clock_get_cycles+0x18/0x30 [ 5.044520][ T25] ? __pfx_nvme_scan_ns_async+0x10/0x10 [ 5.044522][ T25] async_run_entry_fn+0x97/0x560 [ 5.044523][ T25] ? rcu_is_watching+0x12/0xc0 [ 5.044526][ T25] process_one_work+0xd3c/0x1350 [ 5.044532][ T25] ? __pfx_process_one_work+0x10/0x10 [ 5.044536][ T25] ? assign_work+0x16c/0x240 [ 5.044539][ T25] worker_thread+0x4da/0xd50 [ 5.044545][ T25] ? __pfx_worker_thread+0x10/0x10 [ 5.044546][ T25] kthread+0x356/0x5c0 [ 5.044548][ T25] ? __pfx_kthread+0x10/0x10 [ 5.044549][ T25] ? ret_from_fork+0x1b/0x2e0 [ 5.044552][ T25] ? __lock_release.isra.0+0x5d/0x180 [ 5.044553][ T25] ? ret_from_fork+0x1b/0x2e0 [ 5.044555][ T25] ? rcu_is_watching+0x12/0xc0 [ 5.044557][ T25] ? __pfx_kthread+0x10/0x10 [ 5.044559][ T25] ret_from_fork+0x218/0x2e0 [ 5.044561][ T25] ? __pfx_kthread+0x10/0x10 [ 5.044562][ T25] ret_from_fork_asm+0x1a/0x30 [ 5.044570][ T25] </TASK> ''' This patch uses sleepable RCU version of helper list_for_each_entry_srcu() instead of list_for_each_entry_rcu() to fix it. Fixes: `4dbd2b2ebe` ("nvme-multipath: Add visibility for round-robin io-policy") Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2025-07-01 08:17:02 +02:00
Christoph Hellwig	ba83e321cc	nvme-pci: rework the build time assert for NVME_MAX_NR_DESCRIPTORS The current use of an always_inline helper is a bit convoluted. Instead use macros that represent the arithmetics used for building up the PRP chain. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Daniel Gomez <da.gomez@samsung.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Link: https://lore.kernel.org/r/20250625113531.522027-9-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-06-30 15:50:53 -06:00
Christoph Hellwig	16353f1b0e	nvme-pci: replace NVME_MAX_KB_SZ with NVME_MAX_BYTE Having a define in kiB units is a bit weird. Also update the comment now that there is not scatterlist limit. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Daniel Gomez <da.gomez@samsung.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Link: https://lore.kernel.org/r/20250625113531.522027-8-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-06-30 15:50:53 -06:00
Christoph Hellwig	7ce3c1dd78	nvme-pci: convert the data mapping to blk_rq_dma_map Use the blk_rq_dma_map API to DMA map requests instead of scatterlists. This removes the need to allocate a scatterlist covering every segment, and thus the overall transfer length limit based on the scatterlist allocation. Instead the DMA mapping is done by iterating the bio_vec chain in the request directly. The unmap is handled differently depending on how we mapped: - when using an IOMMU only a single IOVA is used, and it is stored in iova_state - for direct mappings that don't use swiotlb and are cache coherent, unmap is not needed at all - for direct mappings that are not cache coherent or use swiotlb, the physical addresses are rebuild from the PRPs or SGL segments The latter unfortunately adds a fair amount of code to the driver, but it is code not used in the fast path. The conversion only covers the data mapping path, and still uses a scatterlist for the multi-segment metadata case. I plan to convert that as soon as we have good test coverage for the multi-segment metadata path. Thanks to Chaitanya Kulkarni for an initial attempt at a new DMA API conversion for nvme-pci, Kanchan Joshi for bringing back the single segment optimization, Leon Romanovsky for shepherding this through a gazillion rebases and Nitesh Shetty for various improvements. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Link: https://lore.kernel.org/r/20250625113531.522027-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-06-30 15:50:53 -06:00
Christoph Hellwig	deecd1c49c	nvme-pci: remove superfluous arguments The call chain in the prep_rq and completion paths passes around a lot of nvme_dev, nvme_queue and nvme_command arguments that can be trivially derived from the passed in struct request. Remove them. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Link: https://lore.kernel.org/r/20250625113531.522027-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-06-30 15:50:53 -06:00
Christoph Hellwig	cd71b52a55	nvme-pci: merge the simple PRP and SGL setup into a common helper nvme_setup_prp_simple and nvme_setup_sgl_simple share a lot of logic. Merge them into a single helper that makes use of the previously added use_sgl tristate. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20250625113531.522027-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-06-30 15:50:42 -06:00
Christoph Hellwig	de769c846a	nvme-pci: refactor nvme_pci_use_sgls Move the average segment size into a separate helper, and return a tristate to distinguish the case where can use SGL vs where we have to use SGLs. This will allow the simplify the code and make more efficient decisions in follow on changes. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20250625113531.522027-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-06-30 15:50:32 -06:00
Eugen Hristev	14005c96d6	nvme-pci: refresh visible attrs after being checked The sysfs attributes are registered early, but the driver does not know whether they are needed or not at that moment. For the CMB attributes, commit `e917a849c3` ("nvme-pci: refresh visible attrs for cmb attributes") solved this problem by calling nvme_update_attrs after mapping the CMB. However the issue persists for the HMB attributes. To solve the problem, moved the call to nvme_update_attrs after nvme_setup_host_mem, which sets up the HMB. Fixes: `e917a849c3` ("nvme-pci: refresh visible attrs for cmb attributes") Fixes: `86adbf0cdb` ("nvme: simplify transport specific device attribute handling") Signed-off-by: Eugen Hristev <eugen.hristev@collabora.com> Signed-off-by: André Almeida <andrealmeid@igalia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2025-06-30 08:42:47 +02:00
Dmitry Bogdanov	190f4c2c86	nvmet: fix memory leak of bio integrity If nvmet receives commands with metadata there is a continuous memory leak of kmalloc-128 slab or more precisely bio->bi_integrity. Since commit `bf4c89fc87` ("block: don't call bio_uninit from bio_endio") each user of bio_init has to use bio_uninit as well. Otherwise the bio integrity is not getting free. Nvmet uses bio_init for inline bios. Uninit the inline bio to complete deallocation of integrity in bio. Fixes: `bf4c89fc87` ("block: don't call bio_uninit from bio_endio") Signed-off-by: Dmitry Bogdanov <d.bogdanov@yadro.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2025-06-30 08:32:16 +02:00
Nilay Shroff	ba806c9003	nvme: correctly account for namespace head reference counter The blktests nvme/058 manifests an issue where the NVMe subsystem kobject entry remains stale in sysfs, causing a failure during subsequent NVMe module reloads[1]. Specifically, when attempting to register a new NVMe subsystem, the driver encounters a kobejct name collision because a stale kobject still exists. Though, please note that nvme/058 doesn't report any failure and test case passes and it's only during subsequent NVMe module reloads, the stale nvme sub- system kobject entry in sysfs causes the observed symptom[1]. This issue stems from an imbalance in the get/put usage of the namespace head (nshead) reference counter. The nshead holds a reference to the associated NVMe subsystem. If the nshead reference is not properly released, it prevents the cleanup of the subsystem's kobject, leaving nvme subsystem stale entry behind in sysfs. During the failure case, the last namespace path referencing a nshead is removed, but the nshead reference was not released. This occurs because the release logic currently only puts the nshead reference when its state is LIVE. However, in configurations where ANA (Asymmetric Namespace Access) is enabled, a namespace may be associated with an ANA state that is neither optimized nor non-optimized. In this case, the nshead may never transition to LIVE, and the corresponding nshead reference is then never dropped. In fact nvme/058 associates some of nvme namespaces to an inaccessible ANA state and with that nshead is created but it's state is not transitioned to LIVE. So the current logic would then causes nshead reference to be leaked for non-LIVE states. Another scenario, during namespace allocation, the driver first allocates a nshead and then issues an Identify Namespace command. If this command fails — which can happen in tests like nvme/058 that rapidly enables and disables namespaces — we must release the reference to the newly allocated nshead. However this reference release is currently missing in the failure, causing a nshead reference leak. To fix this, we now unconditionally release the nshead reference when the last nvme path referencing to the nshead is removed, regardless of the head’s state. Also during identify namespace failure case we now properly release the nshead reference. So this ensures proper cleanup of the nshead, and consequently, the NVMe subsystem and its associated kobject. This change prevents stale kobject entries from lingering in sysfs and eliminates the module reload failures observed just after running nvme/058. [1] https://lore.kernel.org/all/CAHj4cs8fOBS-eSjsd5LUBzy7faKXJtgLkCN+mDy_-ezCLLLq+Q@mail.gmail.com/ Reported-by: yi.zhang@redhat.com Closes: https://lore.kernel.org/all/CAHj4cs8fOBS-eSjsd5LUBzy7faKXJtgLkCN+mDy_-ezCLLLq+Q@mail.gmail.com/ Fixes: `62188639ec` ("nvme-multipath: introduce delayed removal of the multipath head node") Tested-by: yi.zhang@redhat.com Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>	2025-06-30 08:31:49 +02:00
Alok Tiwari	2e96d2d8c2	nvme: Fix incorrect cdw15 value in passthru error logging Fix an error in nvme_log_err_passthru() where cdw14 was incorrectly printed twice instead of cdw15. This fix ensures accurate logging of the full passthrough command payload. Fixes: `9f079dda14` ("nvme: allow passthru cmd error logging") Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2025-06-30 08:31:45 +02:00
Linus Torvalds	e540341508	block-6.16-20250626 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmhd4zsQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpmZgEACOk81RNf8WGNQf4/parSENzebWNj9W+fKD RDhWxwBAquT2VzkF8Iu6wbteVbP9A8yq4BagbD079OWrr0iV8NgWA5y1GyqdER6N upe2ZtBlY7RR4F1FerpSGqRBbhWYejNojSr073ea8mmx5Yl0BbHz5aKKmzWGbUYO lveYPgCeL4dD7kfPeINiamhicLudyAGdqqYpG+/wriefwaVhTgCe+4aQ6pEwftRT utqCzrpUnxrmXS4TFXiWd4u3iVNwPhzcMyUrgkK1yTM7mWIqp8QyHzfF4Acbh/T3 RN/8d5OCfYmamlRvDUCl3FXWukkdGtBrA4m51mhUIzRJ9Np9IiSHdd2UTDgGqSeG 2NSjLtmdDQvtVXeuqBs56os7e3DFx42LZuceqbGWaTQ4VC4QE+Xz+n2ZENx/hWFZ /lixcIBdxt6iqjveJuBJeXW6UqaR+Hz4hpSigZU69DMQzrKm65bSoMdOvyn5b0bU GtlPusSnfgpsSe/H41Lm7SLBePiGXMJvhujzlkWW5cnUUl+yRUQhTO206kQJkbV1 XUMs8Syow15gjQaXI9KiAq+MMUuUwOvXmptMyYQ1NjFy16yzhJ8QOhJilJLWfLdT SqsLyXn1kG2EdcPmXHJRthIgVmQ+uORy2JB1wAomyjJj9a16wJYhgCGDjrl4mocl 9LpjfnyMsA== =ln4w -----END PGP SIGNATURE----- Merge tag 'block-6.16-20250626' of git://git.kernel.dk/linux Pull block fixes from Jens Axboe: - Fixes for ublk: - fix C++ narrowing warnings in the uapi header - update/improve UBLK_F_SUPPORT_ZERO_COPY comment in uapi header - fix for the ublk ->queue_rqs() implementation, limiting a batch to just the specific task AND ring - ublk_get_data() error handling fix - sanity check more arguments in ublk_ctrl_add_dev() - selftest addition - NVMe pull request via Christoph: - reset delayed remove_work after reconnect - fix atomic write size validation - Fix for a warning introduced in bdev_count_inflight_rw() in this merge window * tag 'block-6.16-20250626' of git://git.kernel.dk/linux: block: fix false warning in bdev_count_inflight_rw() ublk: sanity check add_dev input for underflow nvme: fix atomic write size validation nvme: refactor the atomic write unit detection nvme: reset delayed remove_work after reconnect ublk: setup ublk_io correctly in case of ublk_get_data() failure ublk: update UBLK_F_SUPPORT_ZERO_COPY comment in UAPI header ublk: fix narrowing warnings in UAPI header selftests: ublk: don't take same backing file for more than one ublk devices ublk: build batch from IOs in same io_ring_ctx and io task	2025-06-27 09:02:33 -07:00
Christoph Hellwig	f46d273449	nvme: fix atomic write size validation Don't mix the namespace and controller values, and validate the per-controller limit when probing the controller. This avoid spurious failures for controllers with namespaces that have different namespaces with different logical block sizes, or report the per-namespace values only for some namespaces. It also fixes a missing queue_limits_cancel_update in an error path by removing that error path. Fixes: `8695f060a0` ("nvme: all namespaces in a subsystem must adhere to a common atomic write size") Reported-by: Yi Zhang <yi.zhang@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: John Garry <john.g.garry@oracle.com> Tested-by: Yi Zhang <yi.zhang@redhat.com>	2025-06-26 13:04:37 +02:00
Christoph Hellwig	b2e607feca	nvme: refactor the atomic write unit detection Move all the code out of nvme_update_disk_info into the helper, and rename the helper to have a somewhat less clumsy name. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: John Garry <john.g.garry@oracle.com>	2025-06-26 13:04:37 +02:00
Keith Busch	dd2c185489	nvme: reset delayed remove_work after reconnect The remove_work will proceed with permanently disconnecting on the initial final path failure if the head shows no paths after the delay. If a new path connects while the remove_work is pending, and if that new path happens to disconnect before that remove_work executes, the delayed removal should reset based on the most recent path disconnect time, but queue_delayed_work() won't do anything if the work is already pending. Attempt to cancel the delayed work when a new path connects, and use mod_delayed_work() in case the remove_work remains pending anyway. Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2025-06-26 13:04:35 +02:00
Zhang Yi	50634366de	nvmet: set WZDS and DRB if device enables unmap write zeroes operation Set the WZDS and DRB bits to the namespace dlfeat if the underlying block device enables the unmap write zeroes operation, make the nvme target device supports the unmap write zeroes command. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://lore.kernel.org/20250619111806.3546162-4-yi.zhang@huaweicloud.com Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-06-23 12:45:13 +02:00
Zhang Yi	545fb46e5b	nvme: set max_hw_wzeroes_unmap_sectors if device supports DEAC bit When the device supports the Write Zeroes command and the DEAC bit, it indicates that the deallocate bit in the Write Zeroes command is supported, and the bytes read from a deallocated logical block are zeroes. This means the device supports unmap Write Zeroes operation, so set the max_hw_wzeroes_unmap_sectors to max_write_zeroes_sectors on the device's queue limit. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://lore.kernel.org/20250619111806.3546162-3-yi.zhang@huaweicloud.com Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-06-23 12:45:13 +02:00
Linus Torvalds	f713ffa363	block-6.16-20250614 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmhNaUIQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpvUOD/0WlwBN8WxAA+rzUXo42QJ3W+XruQ+VhQdx Hs/DBEH6KZji86ZVzoJOIwsdlSL2/6PxRIZVqwr3Q8aYnNedUnsjcD4frNBl76EA wFfPttjL7DcaOvVhY0n37IrQNmaeQC1R1O2JxhiWTBzNoNf2iWj84vSSgbgcfDVR trfhRvEwRgmAy037/72pUFYN+JRlv80D03SGfWTQtp6/qq+AA/z5XqWdg9I/opVM 7+H5GoWHfPSG0wQo+Dms3mHV4zm5tOOfMmGIR2o4DoueKgMNgnUXRT8dc7DDBsqV 0moKRHKbTbeN1fz3zqcko0Mp1gq+62hF/eXppQSeJMpMuAbcxaA+/ZFv7Ho9ZwYF jJwcp0O5e8XbRHFqrYWysKGKSvYfvTjr08X70+QFzm9ZJaGCtJYd2ceUNmyO2p6s m54gUnPq5d3nABbpCkAdP5sAv0yVV5idIoezCHIaBYQv8qPpKDrdHHXTQY/VX05x VBGmg9hUZSDMiGkR1d4oKTBayehuWVIpyczhy65KbAfoBA62hAl+aAldkpvLRo1r gKsrMSGP/H6zBU/IRaMGc/bnEnP6zFkn5vxnGwpDcD2tdJn0g+yEjIvJSXrmGJ0w lwzqYd3/vhFPmaEDxE3PyOOGBVCOPqGic+Y6OEIuHA3p2HFO3bsh6+64+iqls/so EmiHPp7n5g== =N1zM -----END PGP SIGNATURE----- Merge tag 'block-6.16-20250614' of git://git.kernel.dk/linux Pull block fixes from Jens Axboe: - Fix for a deadlock on queue freeze with zoned writes - Fix for zoned append emulation - Two bio folio fixes, for sparsemem and for very large folios - Fix for a performance regression introduced in 6.13 when plug insertion was changed - Fix for NVMe passthrough handling for polled IO - Document the ublk auto registration feature - loop lockdep warning fix * tag 'block-6.16-20250614' of git://git.kernel.dk/linux: nvme: always punt polled uring_cmd end_io work to task_work Documentation: ublk: Separate UBLK_F_AUTO_BUF_REG fallback behavior sublists block: Fix bvec_set_folio() for very large folios bio: Fix bio_first_folio() for SPARSEMEM without VMEMMAP block: use plug request list tail for one-shot backmerge attempt block: don't use submit_bio_noacct_nocheck in blk_zone_wplug_bio_work block: Clear BIO_EMULATES_ZONE_APPEND flag on BIO completion ublk: document auto buffer registration(UBLK_F_AUTO_BUF_REG) loop: move lo_set_size() out of queue freeze	2025-06-14 09:25:22 -07:00
Jens Axboe	9ce6c9875f	nvme: always punt polled uring_cmd end_io work to task_work Currently NVMe uring_cmd completions will complete locally, if they are polled. This is done because those completions are always invoked from task context. And while that is true, there's no guarantee that it's invoked under the right ring context, or even task. If someone does NVMe passthrough via multiple threads and with a limited number of poll queues, then ringA may find completions from ringB. For that case, completing the request may not be sound. Always just punt the passthrough completions via task_work, which will redirect the completion, if needed. Cc: stable@vger.kernel.org Fixes: `585079b6e4` ("nvme: wire up async polling for io passthrough commands") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-06-13 15:18:34 -06:00
Ingo Molnar	41cb08555c	treewide, timers: Rename from_timer() to timer_container_of() Move this API to the canonical timer_*() namespace. [ tglx: Redone against pre rc1 ] Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/aB2X0jCKQO56WdMt@gmail.com	2025-06-08 09:07:37 +02:00

1 2 3 4 5 ...

3912 Commits