Commit Graph

3912 Commits

Author SHA1 Message Date
Shin'ichiro Kawasaki
9bce6b5f89 block: change blk_mq_add_to_batch() third argument type to bool
Commit 1f47ed294a ("block: cleanup and fix batch completion adding
conditions") modified the evaluation criteria for the third argument,
'ioerror', in the blk_mq_add_to_batch() function. Initially, the
function had checked if 'ioerror' equals zero. Following the commit, it
started checking for negative error values, with the presumption that
such values, for instance -EIO, would be passed in.

However, blk_mq_add_to_batch() callers do not pass negative error
values. Instead, they pass status codes defined in various ways:

- NVMe PCI and Apple drivers pass NVMe status code
- virtio_blk driver passes the virtblk request header status byte
- null_blk driver passes blk_status_t

These codes are either zero or positive, therefore the revised check
fails to function as intended. Specifically, with the NVMe PCI driver,
this modification led to the failure of the blktests test case nvme/039.
In this test scenario, errors are artificially injected to the NVMe
driver, resulting in positive NVMe status codes passed to
blk_mq_add_to_batch(), which unexpectedly processes the failed I/O in a
batch. Hence the failure.

To correct the ioerror check within blk_mq_add_to_batch(), make all
callers to uniformly pass the argument as boolean. Modify the callers to
check their specific status codes and pass the boolean value 'is_error'.
Also describe the arguments of blK_mq_add_to_batch as kerneldoc.

Fixes: 1f47ed294a ("block: cleanup and fix batch completion adding conditions")
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Link: https://lore.kernel.org/r/20250311104359.1767728-3-shinichiro.kawasaki@wdc.com
[axboe: fold in documentation update]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-12 08:26:36 -06:00
Shin'ichiro Kawasaki
e5c2bcc0cd nvme: move error logging from nvme_end_req() to __nvme_end_req()
Before the Commit 1f47ed294a ("block: cleanup and fix batch completion
adding conditions"), blk_mq_add_to_batch() did not add failed
passthrough requests to batch, and returned false. After the commit,
blk_mq_add_to_batch() always adds passthrough requests to batch
regardless of whether the request failed or not, and returns true. This
affected error logging feature in the NVME driver.

Before the commit, the call chain of failed passthrough request was as
follows:

nvme_handle_cqe()
 blk_mq_add_to_batch() .. false is returned, then call nvme_pci_complete_rq()
 nvme_pci_complete_rq()
  nvme_complete_rq()
   nvme_end_req()
    nvme_log_err_passthru() .. error logging
    __nvme_end_req()        .. end of the rqeuest

After the commit, the call chain is as follows:

nvme_handle_cqe()
 blk_mq_add_to_batch() .. true is returned, then set nvme_pci_complete_batch()
 ..
 nvme_pci_complete_batch()
  nvme_complete_batch()
   nvme_complete_batch_req()
    __nvme_end_req() .. end of the request, without error logging

To make the error logging feature work again for passthrough requests, move the
nvme_log_err_passthru() call from nvme_end_req() to __nvme_end_req().

While at it, move nvme_log_error() call for non-passthrough requests together
with nvme_log_err_passthru(). Even though the trigger commit does not affect
non-passthrough requests, move it together for code simplicity.

Fixes: 1f47ed294a ("block: cleanup and fix batch completion adding conditions")
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250311104359.1767728-2-shinichiro.kawasaki@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-11 07:49:35 -06:00
Damien Le Moal
39393f5c5c nvmet: pci-epf: Do not add an IRQ vector if not needed
The function nvmet_pci_epf_create_cq() always unconditionally calls
nvmet_pci_epf_add_irq_vector() to add an IRQ vector for a completion
queue. But this is not correct if the host requested the creation of a
completion queue for polling, without an IRQ vector specified (i.e. the
flag NVME_CQ_IRQ_ENABLED is not set).

Fix this by calling nvmet_pci_epf_add_irq_vector() and setting the queue
flag NVMET_PCI_EPF_Q_IRQ_ENABLED for the cq only if NVME_CQ_IRQ_ENABLED
is set. While at it, also fix the error path to add the missing removal
of the added IRQ vector if nvmet_cq_create() fails.

Fixes: 0faa0fe6f9 ("nvmet: New NVMe PCI endpoint function target driver")
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-03-10 10:12:16 -07:00
Damien Le Moal
bf9b8020a8 nvmet: pci-epf: Set NVMET_PCI_EPF_Q_LIVE when a queue is fully created
The function nvmet_pci_epf_create_sq() use test_and_set_bit() to check
that a submission queue is not already live and if not, set the
NVMET_PCI_EPF_Q_LIVE queue flag to declare the sq live (ready to use).
However, this is done on entry to the function, before the submission
queue is actually fully initialized and ready to use. This creates a
race situation with the function nvmet_pci_epf_poll_sqs_work() which
looks at the NVMET_PCI_EPF_Q_LIVE queue flag to poll the submission
queue when it is live. This race can lead to invalid DMA transfers if
nvmet_pci_epf_poll_sqs_work() runs after the NVMET_PCI_EPF_Q_LIVE flag
is set but before setting the sq pci address and doorbell ofset.

Avoid this race by only testing the NVMET_PCI_EPF_Q_LIVE flag on entry
to nvmet_pci_epf_create_sq() and setting it after the submission queue
is fully setup before nvmet_pci_epf_create_sq() returns success.
Since the function nvmet_pci_epf_create_cq() also has the same racy flag
setting pattern, also make a similar change in that function.

Fixes: 0faa0fe6f9 ("nvmet: New NVMe PCI endpoint function target driver")
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-03-10 10:12:13 -07:00
Keith Busch
3f674e7b67 nvme-pci: fix stuck reset on concurrent DPC and HP
The PCIe error handling has the nvme driver quiesce the device, attempt
to restart it, then wait for that restart to complete.

A PCIe DPC event also toggles the PCIe link. If the slot doesn't have
out-of-band presence detection, this will trigger a pciehp
re-enumeration.

The error handling that calls nvme_error_resume is holding the device
lock while this happens. This lock blocks pciehp's request to disconnect
the driver from proceeding.

Meanwhile the nvme's reset can't make forward progress because its
device isn't there anymore with outstanding IO, and the timeout handler
won't do anything to fix it because the device is undergoing error
handling.

End result: deadlocked.

Fix this by having the timeout handler short cut the disabling for a
disconnected PCIe device. The downside is that we're relying on an IO
timeout to clean up this mess, which could be a minute by default.

Tested-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-03-10 09:15:48 -07:00
Arnd Bergmann
c339956727 Apple SoC RTKit IPC library updates for 6.15:
- Additional logging for errors
 - A few minor improvements and bugfixes required for drivers that are
   yet to be upstreamed
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQS3vz815OHsEaWy0u9EEX0kKnUe6QUCZ8RBIAAKCRBEEX0kKnUe
 6emXAQCKv1KH1fh1U3mUmqKFj6GNd3Ezq5uQqSwQi5gU+AgiHwD6A+vgMJ322tZ2
 aHxf2QfdO+RRiJulac1kcvhkBDnp2AI=
 =05ee
 -----END PGP SIGNATURE-----
gpgsig -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEiK/NIGsWEZVxh/FrYKtH/8kJUicFAmfJygcACgkQYKtH/8kJ
 UicY+w/+M+B5pdTN7GbbdXolV1/oZ9p+1MsQKa1aQRsUzZSu/6ALtQ36O1fJLe98
 AsxP07x+PTCmROGFn9XqeLKvJ9e1k6w9kCj60aNo6Rt7FtR9VjGOS5muDws2zib0
 Mu3MssXJY22M9Saf3X3lM0n1aJ0kOXd1owuOnGOx3ecZlVuUPg1JI/90R751zmxJ
 Qg3bMiEpRJfIdKCUgMmM1Y50XHv6QhF/RDjA/f3xl8uEW7BurRgGKrkCCoKkjNpZ
 znNgW6z4TRRS/zYL+0E3bJepk4rPZLLHndioCzDrUimxuf/TYAstcDTj+RepxJcc
 JymVi/noTLxCVDwrqvSSD4roC78sT5clbclbZo4huPhWCkL0CDG0WLQC/EfcRYF0
 SatsFvLVuaKTFQ0U4NxdfpSZz0oqNyfDSGbjMdSESeIaCqRvc5UjHUL5O2ZHiI8J
 Th9c+pMp70ipNPdjdB7P63gGK4LDm8DkPfuu7BvXF4c1f80O5uiU1hJIPnl5eDYM
 DnJG9m3NhDz4KtzLFcTUvhpDoLCNxMrDhUlril9M9zl2E/zxqljdUOTGWWFnBtKk
 0n1PWPXUstTFgLQ9AfHNE2WyJmxc0+WckKkxtK+JU29YswIXXRsn1N7JM+9X6BXG
 rojto59eevcsYbnpXBIpSFeOs2NeSZMLu1kFidqQSRebbCburNk=
 =QUAr
 -----END PGP SIGNATURE-----

Merge tag 'asahi-soc-rtkit-6.15' of https://github.com/AsahiLinux/linux into soc/drivers

Apple SoC RTKit IPC library updates for 6.15:

- Additional logging for errors
- A few minor improvements and bugfixes required for drivers that are
  yet to be upstreamed

* tag 'asahi-soc-rtkit-6.15' of https://github.com/AsahiLinux/linux:
  soc: apple: rtkit: Cut syslog messages after the first '\0'
  soc: apple: rtkit: Use high prio work queue
  soc: apple: rtkit: Implement OSLog buffers properly
  soc: apple: rtkit: Add and use PWR_STATE_INIT instead of _ON
  soc: apple: rtkit: Fix use-after-free in apple_rtkit_crashlog_rx()
  soc: apple: rtkit: Pass the crashlog to the crashed() callback
  soc: apple: rtkit: Check & log more failures

Link: https://lore.kernel.org/r/20250302113842.58092-1-sven@svenpeter.dev
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2025-03-06 17:15:03 +01:00
Dan Carpenter
528361c499 nvme-tcp: fix signedness bug in nvme_tcp_init_connection()
The kernel_recvmsg() function returns an int which could be either
negative error codes or the number of bytes received.  The problem is
that the condition:

        if (ret < sizeof(*icresp)) {

is type promoted to type unsigned long and negative values are treated
as high positive values which is success, when they should be treated as
failure.  Handle invalid positive returns separately from negative
error codes to avoid this problem.

Fixes: 578539e096 ("nvme-tcp: fix connect failure on receiving partial ICResp PDU")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-03-05 10:37:01 -08:00
Xinyu Zhang
99fde895ff nvme: map uring_cmd data even if address is 0
When using kernel registered bvec fixed buffers, the "address" is
actually the offset into the bvec rather than userspace address.
Therefore it can be 0.

We can skip checking whether the address is NULL before mapping
uring_cmd data. Bad userspace address will be handled properly later when
the user buffer is imported.

With this patch, we will be able to use the kernel registered bvec fixed
buffers in io_uring NVMe passthru with ublk zero-copy support.

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Xinyu Zhang <xizhang@purestorage.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20250227223916.143006-4-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-28 07:05:31 -07:00
Meir Elisha
a16f88964c nvmet-tcp: Fix a possible sporadic response drops in weakly ordered arch
The order in which queue->cmd and rcv_state are updated is crucial.
If these assignments are reordered by the compiler, the worker might not
get queued in nvmet_tcp_queue_response(), hanging the IO. to enforce the
the correct reordering, set rcv_state using smp_store_release().

Fixes: bdaf132791 ("nvmet-tcp: fix a segmentation fault during io parsing error")

Signed-off-by: Meir Elisha <meir.elisha@volumez.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-02-27 18:20:27 -08:00
Maurizio Lombardi
ad95bab0cd nvme-tcp: fix potential memory corruption in nvme_tcp_recv_pdu()
nvme_tcp_recv_pdu() doesn't check the validity of the header length.
When header digests are enabled, a target might send a packet with an
invalid header length (e.g. 255), causing nvme_tcp_verify_hdgst()
to access memory outside the allocated area and cause memory corruptions
by overwriting it with the calculated digest.

Fix this by rejecting packets with an unexpected header length.

Fixes: 3f2304f8c6 ("nvme-tcp: add NVMe over TCP host driver")
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-02-27 18:19:08 -08:00
Maurizio Lombardi
afb41b08c4 nvme-tcp: Fix a C2HTermReq error message
In H2CTermReq, a FES with value 0x05 means "R2T Limit Exceeded"; but
in C2HTermReq the same value has a different meaning (Data Transfer Limit
Exceeded).

Fixes: 84e009042d ("nvme-tcp: add basic support for the C2HTermReq PDU")
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-02-27 15:13:23 -08:00
Maurizio Lombardi
0979ff3676 nvmet: remove old function prototype
nvmet_subsys_nsid_exists() doesn't exist anymore

Fixes: 74d16965d7 ("nvmet-loop: avoid using mutex in IO hotpath")
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-02-27 15:12:58 -08:00
Pavel Begunkov
69d483d5f4 io_uring/nvme: pass issue_flags to io_uring_cmd_import_fixed()
io_uring_cmd_import_fixed() will need to know the io_uring execution
state in following commits, for now just pass issue_flags into it
without actually using.

Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250224213116.3509093-5-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-27 07:17:39 -07:00
Keith Busch
00817f0f1c nvme-ioctl: fix leaked requests on mapping error
All the callers assume nvme_map_user_request() frees the request on a
failure. This wasn't happening on invalid metadata or io_uring command
flags, so we've been leaking those requests.

Fixes: 23fd22e55b ("nvme: wire up fixed buffer support for nvme passthrough")
Fixes: 7c2fd76048 ("nvme: fix metadata handling in nvme-passthrough")
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-02-25 09:09:18 -08:00
Icenowy Zheng
56cf7ef0d4 nvme-pci: skip CMB blocks incompatible with PCI P2P DMA
The PCI P2PDMA code will register the CMB block to the memory
hot-plugging subsystem, which have an alignment requirement. Memory
blocks that do not satisfy this alignment requirement (usually 2MB) will
lead to a WARNING from memory hotplugging.

Verify the CMB block's address and size against the alignment and only
try to send CMB blocks compatible with it to prevent this warning.

Tested on Intel DC D4502 SSD, which has a 512K CMB block that is too
small for memory hotplugging (thus PCI P2PDMA).

Signed-off-by: Icenowy Zheng <uwu@icenowy.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-02-24 16:23:57 -08:00
Icenowy Zheng
6a3572e10f nvme-pci: clean up CMBMSC when registering CMB fails
CMB decoding should get disabled when the CMB block isn't successfully
registered to P2P DMA subsystem.

Clean up the CMBMSC register in this error handling codepath to disable
CMB decoding (and CMBLOC/CMBSZ registers).

Signed-off-by: Icenowy Zheng <uwu@icenowy.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-02-24 16:21:24 -08:00
Sagi Grimberg
8c1624b63a nvme-tcp: fix possible UAF in nvme_tcp_poll
nvme_tcp_poll() may race with the send path error handler because
it may complete the request while it is actively being polled for
completion, resulting in a UAF panic [1]:

We should make sure to stop polling when we see an error when
trying to read from the socket. Hence make sure to propagate the
error so that the block layer breaks the polling cycle.

[1]:
--
[35665.692310] nvme nvme2: failed to send request -13
[35665.702265] nvme nvme2: unsupported pdu type (3)
[35665.702272] BUG: kernel NULL pointer dereference, address: 0000000000000000
[35665.702542] nvme nvme2: queue 1 receive failed:  -22
[35665.703209] #PF: supervisor write access in kernel mode
[35665.703213] #PF: error_code(0x0002) - not-present page
[35665.703214] PGD 8000003801cce067 P4D 8000003801cce067 PUD 37e6f79067 PMD 0
[35665.703220] Oops: 0002 [#1] SMP PTI
[35665.703658] nvme nvme2: starting error recovery
[35665.705809] Hardware name: Inspur aaabbb/YZMB-00882-104, BIOS 4.1.26 09/22/2022
[35665.705812] Workqueue: kblockd blk_mq_requeue_work
[35665.709172] RIP: 0010:_raw_spin_lock+0xc/0x30
[35665.715788] Call Trace:
[35665.716201]  <TASK>
[35665.716613]  ? show_trace_log_lvl+0x1c1/0x2d9
[35665.717049]  ? show_trace_log_lvl+0x1c1/0x2d9
[35665.717457]  ? blk_mq_request_bypass_insert+0x2c/0xb0
[35665.717950]  ? __die_body.cold+0x8/0xd
[35665.718361]  ? page_fault_oops+0xac/0x140
[35665.718749]  ? blk_mq_start_request+0x30/0xf0
[35665.719144]  ? nvme_tcp_queue_rq+0xc7/0x170 [nvme_tcp]
[35665.719547]  ? exc_page_fault+0x62/0x130
[35665.719938]  ? asm_exc_page_fault+0x22/0x30
[35665.720333]  ? _raw_spin_lock+0xc/0x30
[35665.720723]  blk_mq_request_bypass_insert+0x2c/0xb0
[35665.721101]  blk_mq_requeue_work+0xa5/0x180
[35665.721451]  process_one_work+0x1e8/0x390
[35665.721809]  worker_thread+0x53/0x3d0
[35665.722159]  ? process_one_work+0x390/0x390
[35665.722501]  kthread+0x124/0x150
[35665.722849]  ? set_kthread_struct+0x50/0x50
[35665.723182]  ret_from_fork+0x1f/0x30

Reported-by: Zhang Guanghui <zhang.guanghui@cestc.cn>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-02-24 14:29:00 -08:00
Daniel Wagner
d2fe192348 nvme: only allow entering LIVE from CONNECTING state
The fabric transports and also the PCI transport are not entering the
LIVE state from NEW or RESETTING. This makes the state machine more
restrictive and allows to catch not supported state transitions, e.g.
directly switching from RESETTING to LIVE.

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Daniel Wagner <wagi@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-02-20 09:14:25 -08:00
Daniel Wagner
f13409bb3f nvme-fc: rely on state transitions to handle connectivity loss
It's not possible to call nvme_state_ctrl_state with holding a spin
lock, because nvme_state_ctrl_state calls cancel_delayed_work_sync
when fastfail is enabled.

Instead syncing the ASSOC_FLAG and state transitions using a lock, it's
possible to only rely on the state machine transitions. That means
nvme_fc_ctrl_connectivity_loss should unconditionally call
nvme_reset_ctrl which avoids the read race on the ctrl state variable.
Actually, it's not necessary to test in which state the ctrl is, the
reset work will only scheduled when the state machine is in LIVE state.

In nvme_fc_create_association, the LIVE state can only be entered if it
was previously CONNECTING. If this is not possible then the reset
handler got triggered. Thus just error out here.

Fixes: ee59e3820c ("nvme-fc: do not ignore connectivity loss during connecting")
Closes: https://lore.kernel.org/all/denqwui6sl5erqmz2gvrwueyxakl5txzbbiu3fgebryzrfxunm@iwxuthct377m/
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Daniel Wagner <wagi@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-02-20 09:13:55 -08:00
Asahi Lina
bf8b4e4977 soc: apple: rtkit: Pass the crashlog to the crashed() callback
Client drivers might want a copy of the crashlog to stash into a
devcoredump blob. Since device memory management can be very variable,
the actual devcoredump implementation is left to client drivers. Pass
the raw crashlog buffer to the client callback so it can use it if
desired.

Signed-off-by: Asahi Lina <lina@asahilina.net>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20250202-rtkit-crashdump-v1-1-9d38615b4e12@asahilina.net
Signed-off-by: Sven Peter <sven@svenpeter.dev>
2025-02-18 17:59:11 +01:00
Hector Martin
3f22421f6a apple-nvme: Support coprocessors left idle
iBoot on at least some firmwares/machines leaves ANS2 running, requiring
a wake command instead of a CPU boot (and if we reset ANS2 in that
state, everything breaks).

Only stop the CPU if RTKit was running, and only do the reset dance if
the CPU is stopped.

Normal shutdown handoff:
- RTKit not yet running
- CPU detected not running
- Reset
- CPU powerup
- RTKit boot wait

ANS2 left running/idle:
- RTKit not yet running
- CPU detected running
- RTKit wake message

Sleep/resume cycle:
- RTKit shutdown
- CPU stopped
- (sleep here)
- CPU detected not running
- Reset
- CPU powerup
- RTKit boot wait

Shutdown or device removal:
- RTKit shutdown
- CPU stopped

Therefore, the CPU running bit serves as a consistent flag of whether
the coprocessor is fully stopped or just idle.

Signed-off-by: Hector Martin <marcan@marcan.st>
Reviewed-by: Neal Gompa <neal@gompa.dev>
Reviewed-by: Sven Peter <sven@svenpeter.dev>
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-02-18 07:41:10 -08:00
Hector Martin
eefa72a15e apple-nvme: Release power domains when probe fails
Signed-off-by: Hector Martin <marcan@marcan.st>
Reviewed-by: Neal Gompa <neal@gompa.dev>
Reviewed-by: Sven Peter <sven@svenpeter.dev>
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-02-18 07:41:10 -08:00
Damien Le Moal
2ba8cf918f nvmet: Use enum definitions instead of hardcoded values
Change the definition of the inline functions nvmet_cc_en(),
nvmet_cc_css(), nvmet_cc_mps(), nvmet_cc_ams(), nvmet_cc_shn(),
nvmet_cc_iosqes(), and nvmet_cc_iocqes() to use the enum difinitions in
include/linux/nvme.h instead of hardcoded values.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-02-18 07:39:42 -08:00
Caleb Sander Mateos
487a3ea7b1 nvme/ioctl: add missing space in err message
nvme_validate_passthru_nsid() logs an err message whose format string is
split over 2 lines. There is a missing space between the two pieces,
resulting in log lines like "... does not match nsid (1)of namespace".
Add the missing space between ")" and "of". Also combine the format
string pieces onto a single line to make the err message easier to grep.

Fixes: e7d4b5493a ("nvme: factor out a nvme_validate_passthru_nsid helper")
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-02-18 07:37:31 -08:00
Caleb Sander Mateos
578539e096 nvme-tcp: fix connect failure on receiving partial ICResp PDU
nvme_tcp_init_connection() attempts to receive an ICResp PDU but only
checks that the return value from recvmsg() is non-negative. If the
sender closes the TCP connection or sends fewer than 128 bytes, this
check will pass even though the full PDU wasn't received.

Ensure the full ICResp PDU is received by checking that recvmsg()
returns the expected 128 bytes.

Additionally set the MSG_WAITALL flag for recvmsg(), as a sender could
split the ICResp over multiple TCP frames. Without MSG_WAITALL,
recvmsg() could return prematurely with only part of the PDU.

Fixes: 3f2304f8c6 ("nvme-tcp: add NVMe over TCP host driver")
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-02-18 07:37:02 -08:00
Damien Le Moal
cd513e0434 nvme: tcp: Fix compilation warning with W=1
When compiling with W=1, a warning result for the function
nvme_tcp_set_queue_io_cpu():

host/tcp.c:1578: warning: Function parameter or struct member 'queue'
not described in 'nvme_tcp_set_queue_io_cpu'
host/tcp.c:1578: warning: expecting prototype for Track the number of
queues assigned to each cpu using a global per(). Prototype was for
nvme_tcp_set_queue_io_cpu() instead

Avoid this warning by using the regular comment format for the function
nvme_tcp_set_queue_io_cpu() instead of the kdoc comment format.

Fixes: 3219378987 ("nvme-tcp: Fix I/O queue cpu spreading for multiple controllers")
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-02-18 07:25:15 -08:00
Damien Le Moal
01ef7ff7dd nvmet: pci-epf: Avoid RCU stalls under heavy workload
The delayed work item function nvmet_pci_epf_poll_sqs_work() polls all
submission queues and keeps running in a loop as long as commands are
being submitted by the host. Depending on the preemption configuration
of the kernel, under heavy command workload, this function can thus run
for more than RCU_CPU_STALL_TIMEOUT seconds, leading to a RCU stall:

 rcu: INFO: rcu_sched self-detected stall on CPU
 rcu:   5-....: (20998 ticks this GP) idle=4244/1/0x4000000000000000 softirq=301/301 fqs=5132
 rcu:   (t=21000 jiffies g=-443 q=12 ncpus=8)
 CPU: 5 UID: 0 PID: 82 Comm: kworker/5:1 Not tainted 6.14.0-rc2 #1
 Hardware name: Radxa ROCK 5B (DT)
 Workqueue: events nvmet_pci_epf_poll_sqs_work [nvmet_pci_epf]
 pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
 pc : dw_edma_device_tx_status+0xb8/0x130
 lr : dw_edma_device_tx_status+0x9c/0x130
 sp : ffff800080b5bbb0
 x29: ffff800080b5bbb0 x28: ffff0331c5c78400 x27: ffff0331c1cd1960
 x26: ffff0331c0e39010 x25: ffff0331c20e4000 x24: ffff0331c20e4a90
 x23: 0000000000000000 x22: 0000000000000001 x21: 00000000005aca33
 x20: ffff800080b5bc30 x19: ffff0331c123e370 x18: 000000000ab29e62
 x17: ffffb2a878c9c118 x16: ffff0335bde82040 x15: 0000000000000000
 x14: 000000000000017b x13: 00000000ee601780 x12: 0000000000000018
 x11: 0000000000000000 x10: 0000000000000001 x9 : 0000000000000040
 x8 : 00000000ee601780 x7 : 0000000105c785c0 x6 : ffff0331c1027d80
 x5 : 0000000001ee7ad6 x4 : ffff0335bdea16c0 x3 : ffff0331c123e438
 x2 : 00000000005aca33 x1 : 0000000000000000 x0 : ffff0331c123e410
 Call trace:
  dw_edma_device_tx_status+0xb8/0x130 (P)
  dma_sync_wait+0x60/0xbc
  nvmet_pci_epf_dma_transfer+0x128/0x264 [nvmet_pci_epf]
  nvmet_pci_epf_poll_sqs_work+0x2a0/0x2e0 [nvmet_pci_epf]
  process_one_work+0x144/0x390
  worker_thread+0x27c/0x458
  kthread+0xe8/0x19c
  ret_from_fork+0x10/0x20

The solution for this is simply to explicitly allow rescheduling using
cond_resched(). However, since doing so for every loop of
nvmet_pci_epf_poll_sqs_work() significantly degrades performance
(for 4K random reads using 4 I/O queues, the maximum IOPS goes down from
137 KIOPS to 110 KIOPS), call cond_resched() every second to avoid the
RCU stalls.

Fixes: 0faa0fe6f9 ("nvmet: New NVMe PCI endpoint function target driver")
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-02-18 07:25:15 -08:00
Damien Le Moal
ffa3556763 nvmet: pci-epf: Do not uselessly write the CSTS register
The function nvmet_pci_epf_poll_cc_work() will do nothing if there are
no changes to the controller configuration (CC) register. However, even
for such case, this function still calls nvmet_update_cc() and uselessly
writes the CSTS register. Avoid this by simply rescheduling the poll_cc
work if the CC register has not changed.

Also reschedule the poll_cc work if the function
nvmet_pci_epf_enable_ctrl() fails to allow the host the chance to try
again enabling the controller.

While at it, since there is no point in trying to handle the CC register
as quickly as possible, change the poll_cc work scheduling interval to
10 ms (from 5ms), to avoid excessive read accesses to that register.

Fixes: 0faa0fe6f9 ("nvmet: New NVMe PCI endpoint function target driver")
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-02-18 07:25:15 -08:00
Damien Le Moal
68a5c91f01 nvmet: pci-epf: Correctly initialize CSTS when enabling the controller
The function nvmet_pci_epf_poll_cc_work() sets the NVME_CSTS_RDY bit of
the controller status register (CSTS) when nvmet_pci_epf_enable_ctrl()
returns success. However, since this function can be called several
times (e.g. if the host reboots), instead of setting the bit in
ctrl->csts, initialize this field to only have NVME_CSTS_RDY set.
Conversely, if nvmet_pci_epf_enable_ctrl() fails, make sure to clear all
bits from ctrl->csts.

To simplify nvmet_pci_epf_poll_cc_work(), initialize ctrl->csts to
NVME_CSTS_RDY directly inside nvmet_pci_epf_enable_ctrl() and clear this
field in that function as well in case of a failure. To be consistent,
move clearing the NVME_CSTS_RDY bit from ctrl->csts when the controller
is being disabled from nvmet_pci_epf_poll_cc_work() into
nvmet_pci_epf_disable_ctrl().

Fixes: 0faa0fe6f9 ("nvmet: New NVMe PCI endpoint function target driver")
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-02-18 07:25:14 -08:00
Ruozhu Li
3988ac1c67 nvmet-rdma: recheck queue state is LIVE in state lock in recv done
The queue state checking in nvmet_rdma_recv_done is not in queue state
lock.Queue state can transfer to LIVE in cm establish handler between
state checking and state lock here, cause a silent drop of nvme connect
cmd.
Recheck queue state whether in LIVE state in state lock to prevent this
issue.

Signed-off-by: Ruozhu Li <david.li@jaguarmicro.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-02-18 07:22:55 -08:00
Hannes Reinecke
4082326807 nvmet: Fix crash when a namespace is disabled
The namespace percpu counter protects pending I/O, and we can
only safely diable the namespace once the counter drop to zero.
Otherwise we end up with a crash when running blktests/nvme/058
(eg for loop transport):

[ 2352.930426] [  T53909] Oops: general protection fault, probably for non-canonical address 0xdffffc0000000005: 0000 [#1] PREEMPT SMP KASAN PTI
[ 2352.930431] [  T53909] KASAN: null-ptr-deref in range [0x0000000000000028-0x000000000000002f]
[ 2352.930434] [  T53909] CPU: 3 UID: 0 PID: 53909 Comm: kworker/u16:5 Tainted: G        W          6.13.0-rc6 #232
[ 2352.930438] [  T53909] Tainted: [W]=WARN
[ 2352.930440] [  T53909] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-3.fc41 04/01/2014
[ 2352.930443] [  T53909] Workqueue: nvmet-wq nvme_loop_execute_work [nvme_loop]
[ 2352.930449] [  T53909] RIP: 0010:blkcg_set_ioprio+0x44/0x180

as the queue is already torn down when calling submit_bio();

So we need to init the percpu counter in nvmet_ns_enable(), and
wait for it to drop to zero in nvmet_ns_disable() to avoid having
I/O pending after the namespace has been disabled.

Fixes: 74d16965d7 ("nvmet-loop: avoid using mutex in IO hotpath")

Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-02-18 07:21:52 -08:00
Maurizio Lombardi
84e009042d nvme-tcp: add basic support for the C2HTermReq PDU
Previously, the NVMe/TCP host driver did not handle the C2HTermReq PDU,
instead printing "unsupported pdu type (3)" when received. This patch adds
support for processing the C2HTermReq PDU, allowing the driver
to print the Fatal Error Status field.

Example of output:
nvme nvme4: Received C2HTermReq (FES = Invalid PDU Header Field)

Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-02-18 07:13:26 -08:00
Christopher Lentocha
fcd8754458 nvme-pci: quirk Acer FA100 for non-uniqueue identifiers
In order for two Acer FA100 SSDs to work in one PC (in the case of
myself, a Lenovo Legion T5 28IMB05), and not show one drive and not
the other, and sometimes mix up what drive shows up (randomly), these
two lines of code need to be added, and then both of the SSDs will
show up and not conflict when booting off of one of them. If you boot
up your computer with both SSDs installed without this patch, you may
also randomly get into a kernel panic (if the initrd is not set up) or
stuck in the initrd "/init" process, it is set up, however, if you do
apply this patch, there should not be problems with booting or seeing
both contents of the drive. Tested with the btrfs filesystem with a
RAID configuration of having the root drive '/' combined to make two
256GB Acer FA100 SSDs become 512GB in total storage.

Kernel Logs with patch applied (`dmesg -t | grep -i nvm`):

```
...
nvme 0000:04:00.0: platform quirk: setting simple suspend
nvme nvme0: pci function 0000:04:00.0
nvme 0000:05:00.0: platform quirk: setting simple suspend
nvme nvme1: pci function 0000:05:00.0
nvme nvme1: missing or invalid SUBNQN field.
nvme nvme1: allocated 64 MiB host memory buffer.
nvme nvme0: missing or invalid SUBNQN field.
nvme nvme0: allocated 64 MiB host memory buffer.
nvme nvme1: 8/0/0 default/read/poll queues
nvme nvme1: Ignoring bogus Namespace Identifiers
nvme nvme0: 8/0/0 default/read/poll queues
nvme nvme0: Ignoring bogus Namespace Identifiers
nvme0n1: p1 p2
...
```

Kernel Logs with patch not applied (`dmesg -t | grep -i nvm`):

```
...
nvme 0000:04:00.0: platform quirk: setting simple suspend
nvme nvme0: pci function 0000:04:00.0
nvme 0000:05:00.0: platform quirk: setting simple suspend
nvme nvme1: pci function 0000:05:00.0
nvme nvme0: missing or invalid SUBNQN field.
nvme nvme1: missing or invalid SUBNQN field.
nvme nvme0: allocated 64 MiB host memory buffer.
nvme nvme1: allocated 64 MiB host memory buffer.
nvme nvme0: 8/0/0 default/read/poll queues
nvme nvme1: 8/0/0 default/read/poll queues
nvme nvme1: globally duplicate IDs for nsid 1
nvme nvme1: VID:DID 1dbe:5216 model:Acer SSD FA100 256GB firmware:1.Z.J.2X
nvme0n1: p1 p2
...
```

Signed-off-by: Christopher Lentocha <christopherericlentocha@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-02-18 07:11:43 -08:00
Jens Axboe
457bf49627 nvme fixes for Linux 6.14
- Connection fixes for fibre channel transport (Daniel)
  - Endian fixes (Keith, Christoph)
  - Cleanup fix for host memory buffer (Francis)
  - Platform specific power quirks (Georg)
  - Target memory leak (Sagi)
  - Use appropriate controller state accessor (Daniel)
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE3Fbyvv+648XNRdHTPe3zGtjzRgkFAmedND8ACgkQPe3zGtjz
 RgmWIQ/9HM2bzxlVxtxv7EMUpmw8u+zVXhyUfS/4esWxAjpVc7VlQ7+TKlqt6vhx
 2HPU1cWF6D3dVm2+NikHGXWp7UwESHsev+DME1/ygd0TiLEJlMmzaejyTljBSPtm
 kbpJIPgwM3UoiiuAYfVasAXEVID9qDkEzNsr+XV8M1k4+JC2mbjWhq5Pv/1sIvny
 YaVxOtZM81LIDVFPdYQWtderWdanindc5MIGYk8okvV/eBeTTkR5jrBe8B4tDHr1
 +QsgeDYBzTE082qD1NTLEQnNgbCeAEtlOtoQHv1gyD1oIxtNjUgGWbKDy5Tnvr/H
 9j75wkWDyAR+jBS/rLPB9K3ars1IW6Cs43NAagHEn55x7y2MV86KPYe/VwKygCeV
 tQzlM4hLSbYpgs5DEn1L/R+FfacFlLnxeepMnqe+CZ0l9n9AOJab1kzR+joNNCxv
 RMZqxmQE71ax78JrwVvf/ixZFvxB/2ls0rcFtcOLZE7gQm9lhE56IAKyxpOMwxjo
 hkn+/TWc2sIAVxSMa3EW3rwXnToAEQpf3mRFgSEXpP5ZNlhJSLe5w1Rj/rdTIFZ5
 a+LFcB5b8ZbWzq9iXeA9k4qqEQvdiwQoGayGjGGe3EjQM4g9jEQvZa2H0p+vExqa
 anwpdF0OAYnDgn7eNVLUnBVlpJtZ5hUha3zFBYlQh+o0xO649jw=
 =uYd9
 -----END PGP SIGNATURE-----

Merge tag 'nvme-6.14-2025-01-31' of git://git.infradead.org/nvme into block-6.14

Pull NVMe fixes from Keith:

"nvme fixes for Linux 6.14

 - Connection fixes for fibre channel transport (Daniel)
 - Endian fixes (Keith, Christoph)
 - Cleanup fix for host memory buffer (Francis)
 - Platform specific power quirks (Georg)
 - Target memory leak (Sagi)
 - Use appropriate controller state accessor (Daniel)"

* tag 'nvme-6.14-2025-01-31' of git://git.infradead.org/nvme:
  nvme-fc: use ctrl state getter
  nvme: make nvme_tls_attrs_group static
  nvmet: add a missing endianess conversion in nvmet_execute_admin_connect
  nvmet: the result field in nvmet_alloc_ctrl_args is little endian
  nvmet: fix a memory leak in controller identify
  nvme-fc: do not ignore connectivity loss during connecting
  nvme: handle connectivity loss in nvme_set_queue_count
  nvme-fc: go straight to connecting state when initializing
  nvme-pci: Add TUXEDO IBP Gen9 to Samsung sleep quirk
  nvme-pci: Add TUXEDO InfinityFlex to Samsung sleep quirk
  nvme-pci: remove redundant dma frees in hmb
  nvmet: fix rw control endian access
2025-02-03 09:19:03 -07:00
Daniel Wagner
c8ed6cb5d3 nvme-fc: use ctrl state getter
Do not access the state variable directly, instead use proper
synchronization so not stale data is read.

Fixes: e6e7f7ac03 ("nvme: ensure reset state check ordering")
Signed-off-by: Daniel Wagner <wagi@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-31 10:14:36 -08:00
Keith Busch
2d1a2dab95 nvme: make nvme_tls_attrs_group static
To suppress the compiler "warning: symbol 'nvme_tls_attrs_group' was not
declared. Should it be static?"

Fixes: 1e48b34c9b ("nvme: split off TLS sysfs attributes into a separate group")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-31 10:14:27 -08:00
Christoph Hellwig
1e1a9cecfa block: force noio scope in blk_mq_freeze_queue
When block drivers or the core block code perform allocations with a
frozen queue, this could try to recurse into the block device to
reclaim memory and deadlock.  Thus all allocations done by a process
that froze a queue need to be done without __GFP_IO and __GFP_FS.
Instead of tying to track all of them down, force a noio scope as
part of freezing the queue.

Note that nvme is a bit of a mess here due to the non-owner freezes,
and they will be addressed separately.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250131120352.1315351-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-31 07:20:08 -07:00
Christoph Hellwig
cc3d4671a0 nvmet: add a missing endianess conversion in nvmet_execute_admin_connect
The kato field is little endian on the wire, but native endian in
the in-core structure, add the missing byte swap.

Fixes: 6202783184 ("nvmet: Improve nvmet_alloc_ctrl() interface and implementation")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-28 07:07:21 -08:00
Christoph Hellwig
7bf6b497a7 nvmet: the result field in nvmet_alloc_ctrl_args is little endian
So use the __le32 type for it.

Fixes: 6202783184 ("nvmet: Improve nvmet_alloc_ctrl() interface and implementation")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-28 07:07:21 -08:00
Sagi Grimberg
58f5c8d5ca nvmet: fix a memory leak in controller identify
Simply free an allocated buffer once we copied its content
to the request sgl.

kmemleak complaint:
unreferenced object 0xffff8cd40c388000 (size 4096):
  comm "kworker/2:2H", pid 14739, jiffies 4401313113
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace (crc 0):
    [<ffffffff9e01087a>] kmemleak_alloc+0x4a/0x90
    [<ffffffff9d30324a>] __kmalloc_cache_noprof+0x35a/0x420
    [<ffffffffc180b0e2>] nvmet_execute_identify+0x912/0x9f0 [nvmet]
    [<ffffffffc181a72c>] nvmet_tcp_try_recv_pdu+0x84c/0xc90 [nvmet_tcp]
    [<ffffffffc181ac02>] nvmet_tcp_io_work+0x82/0x8b0 [nvmet_tcp]
    [<ffffffff9cfa7158>] process_one_work+0x178/0x3e0
    [<ffffffff9cfa8e9c>] worker_thread+0x2ec/0x420
    [<ffffffff9cfb2140>] kthread+0xf0/0x120
    [<ffffffff9cee36a4>] ret_from_fork+0x44/0x70
    [<ffffffff9ce7fdda>] ret_from_fork_asm+0x1a/0x30

Fixes: 84909f7dec ("nvmet: use kzalloc instead of ZERO_PAGE in nvme_execute_identify_ns_nvm()")
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-24 07:34:56 -08:00
Daniel Wagner
ee59e3820c nvme-fc: do not ignore connectivity loss during connecting
When a connectivity loss occurs while nvme_fc_create_assocation is
being executed, it's possible that the ctrl ends up stuck in the LIVE
state:

  1) nvme nvme10: NVME-FC{10}: create association : ...
  2) nvme nvme10: NVME-FC{10}: controller connectivity lost.
                  Awaiting Reconnect
     nvme nvme10: queue_size 128 > ctrl maxcmd 32, reducing to maxcmd
  3) nvme nvme10: Could not set queue count (880)
     nvme nvme10: Failed to configure AEN (cfg 900)
  4) nvme nvme10: NVME-FC{10}: controller connect complete
  5) nvme nvme10: failed nvme_keep_alive_end_io error=4

A connection attempt starts 1) and the ctrl is in state CONNECTING.
Shortly after the LLDD driver detects a connection lost event and calls
nvme_fc_ctrl_connectivity_loss 2). Because we are still in CONNECTING
state, this event is ignored.

nvme_fc_create_association continues to run in parallel and tries to
communicate with the controller and these commands will fail. Though
these errors are filtered out, e.g in 3) setting the I/O queues numbers
fails which leads to an early exit in nvme_fc_create_io_queues. Because
the number of IO queues is 0 at this point, there is nothing left in
nvme_fc_create_association which could detected the connection drop.
Thus the ctrl enters LIVE state 4).

Eventually the keep alive handler times out 5) but because nothing is
being done, the ctrl stays in LIVE state.

There is already the ASSOC_FAILED flag to track connectivity loss event
but this bit is set too late in the recovery code path. Move this into
the connectivity loss event handler and synchronize it with the state
change. This ensures that the ASSOC_FAILED flag is seen by
nvme_fc_create_io_queues and it does not enter the LIVE state after a
connectivity loss event. If the connectivity loss event happens after we
entered the LIVE state the normal error recovery path is executed.

Signed-off-by: Daniel Wagner <wagi@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-23 10:04:26 -08:00
Daniel Wagner
294b2b7516 nvme: handle connectivity loss in nvme_set_queue_count
When the set feature attempts fails with any NVME status code set in
nvme_set_queue_count, the function still report success. Though the
numbers of queues set to 0. This is done to support controllers in
degraded state (the admin queue is still up and running but no IO
queues).

Though there is an exception. When nvme_set_features reports an host
path error, nvme_set_queue_count should propagate this error as the
connectivity is lost, which means also the admin queue is not working
anymore.

Fixes: 9a0be7abb6 ("nvme: refactor set_queue_count")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Daniel Wagner <wagi@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-23 10:04:26 -08:00
Daniel Wagner
d3d380eded nvme-fc: go straight to connecting state when initializing
The initial controller initialization mimiks the reconnect loop
behavior by switching from NEW to RESETTING and then to CONNECTING.

The transition from NEW to CONNECTING is a valid transition, so there is
no point entering the RESETTING state. TCP and RDMA also transition
directly to CONNECTING state.

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Daniel Wagner <wagi@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-23 10:04:25 -08:00
Linus Torvalds
a312e1706c for-6.14/io_uring-20250119
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmeNDEUQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpl5hD/4t7kWWNQDeQG9CiA3QStMJ5Yow2AgYtK8f
 sJBr5/6PGEsbTreX//Kh8DtPZPRGcjG9elCo58QxWaPZ2mg3fTOR3/QYLMlaGXU2
 hSht58lj32utpuzMjMo9bG3aesi03bLf+buaq7V1FaMlcTV8rXqK1s/HGtphDBRo
 8tNLEk3JDJDs3vlWbNp/5Hqh9+Ro6DU8df1zWWH4Vbu8RXaGIPyJyjKvvcbfuuCf
 k7Ay45XNAmTZg+rSNGv1H3Yn1LNzPMVFLWBfzRahPCzlKy2+mJMWz1PWu9naaUK+
 WTM+kgiBLF24k59G/9xuxC5bYtsTjTbr4GsEE5ZvFBnhKPzLzzaJj7iQHRj83vtv
 tqxNmAbA3wJoNk48Zr8+cYbfDX9Q9Pl32wIaS/LxRgF9MT4lem6pyKY7Skd12oK3
 rnQ8moGtnOBxp3QUU6BZ7IX3ipb+Bgw7FhZbtVYJdlqKeKyi1QO0MuITwGXpMwk/
 EWDDTsspIf+QaTu+fmO8byJavugKljW8t7hM1JpvlfOLl+rsh6/+AYz42fCvcaA0
 Tu4bpUk8SuwALvZfU2R6bLkorGG6MFuGI8g3eixOcGir3YAcHBMfdg6ItpZi5qVt
 ToM87BMaezOZZvSwX1JBaQ0AR5HBQYmHaiLWgPsORf3PjJ0kz+u21SK9D+yJkUtU
 rT6+HvoVXA==
 =ufpE
 -----END PGP SIGNATURE-----

Merge tag 'for-6.14/io_uring-20250119' of git://git.kernel.dk/linux

Pull io_uring updates from Jens Axboe:
 "Not a lot in terms of features this time around, mostly just cleanups
  and code consolidation:

   - Support for PI meta data read/write via io_uring, with NVMe and
     SCSI covered

   - Cleanup the per-op structure caching, making it consistent across
     various command types

   - Consolidate the various user mapped features into a concept called
     regions, making the various users of that consistent

   - Various cleanups and fixes"

* tag 'for-6.14/io_uring-20250119' of git://git.kernel.dk/linux: (56 commits)
  io_uring/fdinfo: fix io_uring_show_fdinfo() misuse of ->d_iname
  io_uring: reuse io_should_terminate_tw() for cmds
  io_uring: Factor out a function to parse restrictions
  io_uring/rsrc: require cloned buffers to share accounting contexts
  io_uring: simplify the SQPOLL thread check when cancelling requests
  io_uring: expose read/write attribute capability
  io_uring/rw: don't gate retry on completion context
  io_uring/rw: handle -EAGAIN retry at IO completion time
  io_uring/rw: use io_rw_recycle() from cleanup path
  io_uring/rsrc: simplify the bvec iter count calculation
  io_uring: ensure io_queue_deferred() is out-of-line
  io_uring/rw: always clear ->bytes_done on io_async_rw setup
  io_uring/rw: use NULL for rw->free_iovec assigment
  io_uring/rw: don't mask in f_iocb_flags
  io_uring/msg_ring: Drop custom destructor
  io_uring: Move old async data allocation helper to header
  io_uring/rw: Allocate async data through helper
  io_uring/net: Allocate msghdr async data through helper
  io_uring/uring_cmd: Allocate async data through generic helper
  io_uring/poll: Allocate apoll with generic alloc_cache helper
  ...
2025-01-20 20:27:33 -08:00
Linus Torvalds
1cbfb828e0 for-6.14/block-20250118
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmeL6hoQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgppw2EADQV8nDgLRggZR+il4U03yKHXcQEdAX1GrB
 Erowx+dasIJuh6kp3n6qRe9QD/pRqt1DKyLvXoWF8Qfuwq85j7oDnDDYxutNYT27
 hDgrLJriJ3VeKYtTu+andHWt8P29b5h57UayInDOUJurEPA6rXyFZ5YVIti8n21K
 uDOrQXiACG3qRWS2+p2f3UNhX0MkFNFdN/lxi13WMIJtRWF5bXAP+JOgIWCID4Ze
 QuSY6rQD4dp4Q6M2erpX6tn0YZb7Hvw3rPjsd91n6jvYfTUVLH375zg8jCBpi6Wi
 Syufbb8xcTtriVPTDRNu0ekjebkc8wD8ax/h86g0z9v3Ua4DlNmsx9eXrtv6r5nu
 YXqDODOad6stI0+owFquW2vas0gHmfNSfyfGdlk2g24PMtP5Yx0V6FIEvwIeqnje
 ghgxQvBuKUsdhqakByfNnc+XvXi3+RUJek8kvMeUSUQWT1IyMQqPOOk0yp9WdyWD
 bY1f2ECP5BR1b37zYOyawewsI5xTupHUswn5a4r4qtGn3O15rGDkX98Nab5aLCnR
 rW/DvX7+wT6gW9EwrRHiwjwfNDZbsJ9Ggu3lMhtUl5GUWdk58yTiVgKaHJLnlX9/
 CKFKfyyIR1Vl8+gYIpemyFhhcoN+dCSf06ISkrg0jeS0/tYwydaAaCBPL5J4kxZA
 h3Rtbh+Pgg==
 =EXYs
 -----END PGP SIGNATURE-----

Merge tag 'for-6.14/block-20250118' of git://git.kernel.dk/linux

Pull block updates from Jens Axboe:

 - NVMe pull requests via Keith:
      - Target support for PCI-Endpoint transport (Damien)
      - TCP IO queue spreading fixes (Sagi, Chaitanya)
      - Target handling for "limited retry" flags (Guixen)
      - Poll type fix (Yongsoo)
      - Xarray storage error handling (Keisuke)
      - Host memory buffer free size fix on error (Francis)

 - MD pull requests via Song:
      - Reintroduce md-linear (Yu Kuai)
      - md-bitmap refactor and fix (Yu Kuai)
      - Replace kmap_atomic with kmap_local_page (David Reaver)

 - Quite a few queue freeze and debugfs deadlock fixes

   Ming introduced lockdep support for this in the 6.13 kernel, and it
   has (unsurprisingly) uncovered quite a few issues

 - Use const attributes for IO schedulers

 - Remove bio ioprio wrappers

 - Fixes for stacked device atomic write support

 - Refactor queue affinity helpers, in preparation for better supporting
   isolated CPUs

 - Cleanups of loop O_DIRECT handling

 - Cleanup of BLK_MQ_F_* flags

 - Add rotational support for null_blk

 - Various fixes and cleanups

* tag 'for-6.14/block-20250118' of git://git.kernel.dk/linux: (106 commits)
  block: Don't trim an atomic write
  block: Add common atomic writes enable flag
  md/md-linear: Fix a NULL vs IS_ERR() bug in linear_add()
  block: limit disk max sectors to (LLONG_MAX >> 9)
  block: Change blk_stack_atomic_writes_limits() unit_min check
  block: Ensure start sector is aligned for stacking atomic writes
  blk-mq: Move more error handling into blk_mq_submit_bio()
  block: Reorder the request allocation code in blk_mq_submit_bio()
  nvme: fix bogus kzalloc() return check in nvme_init_effects_log()
  md/md-bitmap: move bitmap_{start, end}write to md upper layer
  md/raid5: implement pers->bitmap_sector()
  md: add a new callback pers->bitmap_sector()
  md/md-bitmap: remove the last parameter for bimtap_ops->endwrite()
  md/md-bitmap: factor behind write counters out from bitmap_{start/end}write()
  md: Replace deprecated kmap_atomic() with kmap_local_page()
  md: reintroduce md-linear
  partitions: ldm: remove the initial kernel-doc notation
  blk-cgroup: rwstat: fix kernel-doc warnings in header file
  blk-cgroup: fix kernel-doc warnings in header file
  nbd: fix partial sending
  ...
2025-01-20 19:38:46 -08:00
John Garry
6a7e17b220 block: Add common atomic writes enable flag
Currently only stacked devices need to explicitly enable atomic writes by
setting BLK_FEAT_ATOMIC_WRITES_STACKED flag.

This does not work well for device mapper stacking devices, as there many
sets of limits are stacked and what is the 'bottom' and 'top' device can
swapped. This means that BLK_FEAT_ATOMIC_WRITES_STACKED needs to be set
for many queue limits, which is messy.

Generalize enabling atomic writes enabling by ensuring that all devices
must explicitly set a flag - that includes NVMe, SCSI sd, and md raid.

Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Link: https://lore.kernel.org/r/20250116170301.474130-2-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-17 13:13:54 -07:00
Georg Gottleuber
11cb3529d1 nvme-pci: Add TUXEDO IBP Gen9 to Samsung sleep quirk
On the TUXEDO InfinityBook Pro Gen9 Intel, a Samsung 990 Evo NVMe leads to
a high power consumption in s2idle sleep (4 watts).

This patch applies 'Force No Simple Suspend' quirk to achieve a sleep with
a lower power consumption, typically around 1.2 watts.

Signed-off-by: Georg Gottleuber <ggo@tuxedocomputers.com>
Cc: stable@vger.kernel.org
Signed-off-by: Werner Sembach <wse@tuxedocomputers.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-17 11:18:09 -08:00
Georg Gottleuber
dbf2bb1a13 nvme-pci: Add TUXEDO InfinityFlex to Samsung sleep quirk
On the TUXEDO InfinityFlex, a Samsung 990 Evo NVMe leads to a high power
consumption in s2idle sleep (4 watts).

This patch applies 'Force No Simple Suspend' quirk to achieve a sleep with
a lower power consumption, typically around 1.4 watts.

Signed-off-by: Georg Gottleuber <ggo@tuxedocomputers.com>
Cc: stable@vger.kernel.org
Signed-off-by: Werner Sembach <wse@tuxedocomputers.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-17 11:18:09 -08:00
Francis Pravin
d68fc95a77 nvme-pci: remove redundant dma frees in hmb
The value of size is 0 when there is no dma buffer allocated. The value
of i also remains 0. So, no need to free the dma buffer in out_free_bufs.
Hence, remove the redundant dma frees.

Signed-off-by: Francis Pravin <francis.p@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-17 11:16:34 -08:00
Keith Busch
3c47c2ccd5 nvmet: fix rw control endian access
Fixes: 3ec5c62cfc ("nvmet: handle rw's limited retry flag")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202501142128.WexgyMTv-lkp@intel.com/
Cc: Guixin Liu <kanie@linux.alibaba.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-14 07:35:08 -08:00
Jens Axboe
170e086ad3 nvme: fix bogus kzalloc() return check in nvme_init_effects_log()
nvme_init_effects_log() returns failure when kzalloc() is successful,
which is obviously wrong and causes failures to boot. Correct the
check.

Fixes: d4a95adeab ("nvme: Add error path for xa_store in nvme_init_effects")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-13 10:27:54 -07:00
Jens Axboe
9752b55035 nvme updates for Linux 6.14
- Target support for PCI-Endpoint transport (Damien)
  - TCP IO queue spreading fixes (Sagi, Chaitanya)
  - Target handling for "limited retry" flags (Guixen)
  - Poll type fix (Yongsoo)
  - Xarray storage error handling (Keisuke)
  - Host memory buffer free size fix on error (Francis)
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE3Fbyvv+648XNRdHTPe3zGtjzRgkFAmeEP3AACgkQPe3zGtjz
 RgnXLw/7B8liECkx93+Ceg1JCuXX2WbbYkgyjjggEFYFNbuZdlVMoAjUsTaJLDLs
 V9Werl314aY+RKui7hGBiQ/F9ozAHLjwbi2k3/hmd6mVekZEWVTXWTQvcRb94YQX
 PJLq9ihuQb4cwuzk6MY0yZDX0cgLMb2brCQ+E4/fbfNWX8VKgRkKwEsejYciFXCz
 k8b45wY9ytW+svCcE/6Shmr0oWjcpr8F/5KASlOgmpZHuYKXBZwXErDk7ZndxoUL
 gjnELYnu1l62Ki3khkhk84ap5OH9WPRzOEBn3ab66yqyZNHpFQ4eaakzpEh7SBIO
 eZWMniSTIH1p+9evZCKd1pMKgoi+vzFm+i0mgKWpClmI0vjLgwDc3prj9xex+Wbx
 w3QKuRqlqIT0az3e64MAZcJiFzPWs1851NnI4Wb5jH9SblutJ4DtU5//aBgjTI2W
 wsgZfN9TmNBLek3YafDz1hsI5rxGTolJcYykC1VmojVbCXkhH05KjL0jGgOnvafo
 HSF6ezkEZMeB4G2sHIFsO82P0DQcdpx1zZA68X91jSDHwlGlBoMZjlLf0wYxbxBk
 iGtz/uJQeVb/PsRaP5nVlfXmT0QJLGdGtqLliJFnN6xIO8hqSW072lkHh4IdYbzG
 2CD1kwMh3Bso/HKdyjSaB3TcSRyAagVHlXDj6mAp8ZntetB6mmY=
 =37g5
 -----END PGP SIGNATURE-----

Merge tag 'nvme-6.14-2025-01-12' of git://git.infradead.org/nvme into for-6.14/block

Pull NVMe updates from Keith:

"nvme updates for Linux 6.14

 - Target support for PCI-Endpoint transport (Damien)
 - TCP IO queue spreading fixes (Sagi, Chaitanya)
 - Target handling for "limited retry" flags (Guixen)
 - Poll type fix (Yongsoo)
 - Xarray storage error handling (Keisuke)
 - Host memory buffer free size fix on error (Francis)"

* tag 'nvme-6.14-2025-01-12' of git://git.infradead.org/nvme: (25 commits)
  nvme-pci: use correct size to free the hmb buffer
  nvme: Add error path for xa_store in nvme_init_effects
  nvme-pci: fix comment typo
  Documentation: Document the NVMe PCI endpoint target driver
  nvmet: New NVMe PCI endpoint function target driver
  nvmet: Implement arbitration feature support
  nvmet: Implement interrupt config feature support
  nvmet: Implement interrupt coalescing feature support
  nvmet: Implement host identifier set feature support
  nvmet: Introduce get/set_feature controller operations
  nvmet: Do not require SGL for PCI target controller commands
  nvmet: Add support for I/O queue management admin commands
  nvmet: Introduce nvmet_sq_create() and nvmet_cq_create()
  nvmet: Introduce nvmet_req_transfer_len()
  nvmet: Improve nvmet_alloc_ctrl() interface and implementation
  nvme: Add PCI transport type
  nvmet: Add drvdata field to struct nvmet_ctrl
  nvmet: Introduce nvmet_get_cmd_effects_admin()
  nvmet: Export nvmet_update_cc() and nvmet_cc_xxx() helpers
  nvmet: Add vendor_id and subsys_vendor_id subsystem attributes
  ...
2025-01-13 07:12:15 -07:00
Francis Pravin
4a324970fa nvme-pci: use correct size to free the hmb buffer
dev->host_mem_size value is updated only after the successful buffer
allocation of hmb descriptor. Otherwise, it may have some undefined value.
So, use the correct size to free the hmb buffer when the hmb descriptor
buffer allocation failed.

Signed-off-by: Francis Pravin <francis.p@samsung.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-12 14:11:29 -08:00
Keisuke Nishimura
d4a95adeab nvme: Add error path for xa_store in nvme_init_effects
The xa_store() may fail due to memory allocation failure because there
is no guarantee that the index NVME_CSI_NVM is already used. This fix
introduces a new function to handle the error path.

Fixes: cc115cbe12 ("nvme: always initialize known command effects")
Signed-off-by: Keisuke Nishimura <keisuke.nishimura@inria.fr>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-12 14:11:29 -08:00
Baruch Siach
e4a0a3058d nvme-pci: fix comment typo
envent -> event.

Signed-off-by: Baruch Siach <baruch@tkos.co.il>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-12 14:11:29 -08:00
Damien Le Moal
0faa0fe6f9 nvmet: New NVMe PCI endpoint function target driver
Implement a PCI target driver using the PCI endpoint framework. This
requires hardware with a PCI controller capable of executing in endpoint
mode.

The PCI endpoint framework is used to set up a PCI endpoint function
and its BAR compatible with a NVMe PCI controller. The framework is also
used to map local memory to the PCI address space to execute MMIO
accesses for retrieving NVMe commands from submission queues and posting
completion entries to completion queues. If supported, DMA is used for
command retreival and command data transfers, based on the PCI address
segments indicated by the command using either PRPs or SGLs.

The NVMe target driver relies on the NVMe target core code to execute
all commands isssued by the host. The PCI target driver is mainly
responsible for the following:
 - Initialization and teardown of the endpoint device and its backend
   PCI target controller. The PCI target controller is created using a
   subsystem and a port defined through configfs. The port used must be
   initialized with the "pci" transport type. The target controller is
   allocated and initialized when the PCI endpoint is started by binding
   it to the endpoint PCI device (nvmet_pci_epf_epc_init() function).

 - Manage the endpoint controller state according to the PCI link state
   and the actions of the host (e.g. checking the CC.EN register) and
   propagate these actions to the PCI target controller. Polling of the
   controller enable/disable is done using a delayed work scheduled
   every 5ms (nvmet_pci_epf_poll_cc() function). This work is started
   whenever the PCI link comes up (nvmet_pci_epf_link_up() notifier
   function) and stopped when the PCI link comes down
   (nvmet_pci_epf_link_down() notifier function).
   nvmet_pci_epf_poll_cc() enables and disables the PCI controller using
   the functions nvmet_pci_epf_enable_ctrl() and
   nvmet_pci_epf_disable_ctrl(). The controller admin queue is created
   using nvmet_pci_epf_create_cq(), which calls nvmet_cq_create(), and
   nvmet_pci_epf_create_sq() which uses nvmet_sq_create().
   nvmet_pci_epf_disable_ctrl() always resets the PCI controller to its
   initial state so that nvmet_pci_epf_enable_ctrl() can be called
   again. This ensures correct operation if, for instance, the host
   reboots causing the PCI link to be temporarily down.

 - Manage the controller admin and I/O submission queues using local
   memory. Commands are obtained from submission queues using a work
   item that constantly polls the doorbells of all submissions queues
   (nvmet_pci_epf_poll_sqs() function). This work is started whenever
   the controller is enabled (nvmet_pci_epf_enable_ctrl() function) and
   stopped when the controller is disabled (nvmet_pci_epf_disable_ctrl()
   function). When new commands are submitted by the host, DMA transfers
   are used to retrieve the commands.

 - Initiate the execution of all admin and I/O commands using the target
   core code, by calling a requests execute() function. All commands are
   individually handled using a per-command work item
   (nvmet_pci_epf_iod_work() function). A command overall execution
   includes: initializing a struct nvmet_req request for the command,
   using nvmet_req_transfer_len() to get a command data transfer length,
   parse the command PRPs or SGLs to get the PCI address segments of
   the command data buffer, retrieve data from the host (if the command
   is a write command), call req->execute() to execute the command and
   transfer data to the host (for read commands).

 - Handle the completions of commands as notified by the
   ->queue_response() operation of the PCI target controller
   (nvmet_pci_epf_queue_response() function). Completed commands are
   added to a list of completed command for their CQ. Each CQ list of
   completed command is processed using a work item
   (nvmet_pci_epf_cq_work() function) which posts entries for the
   completed commands in the CQ memory and raise an IRQ to the host to
   signal the completion. IRQ coalescing is supported as mandated by the
   NVMe base specification for PCI controllers. Of note is that
   completion entries are transmitted to the host using MMIO, after
   mapping the completion queue memory to the host PCI address space.
   Unlike for retrieving commands from SQs, DMA is not used as it
   degrades performance due to the transfer serialization needed (which
   delays completion entries transmission).

The configuration of a NVMe PCI endpoint controller is done using
configfs. First the NVMe PCI target controller configuration must be
done to set up a subsystem and a port with the "pci" addr_trtype
attribute. The subsystem can be setup using a file or block device
backed namespace or using a passthrough NVMe device. After this, the
PCI endpoint can be configured and bound to the PCI endpoint controller
to start the NVMe endpoint controller.

In order to not overcomplicate this initial implementation of an
endpoint PCI target controller driver, protection information is not
for now supported. If the PCI controller port and namespace are
configured with protection information support, an error will be
returned when the controller is created and initialized when the
endpoint function is started. Protection information support will be
added in a follow-up patch series.

Using a Rock5B board (Rockchip RK3588 SoC, PCI Gen3x4 endpoint
controller) with a target PCI controller setup with 4 I/O queues and a
null_blk block device as a namespace, the maximum performance using fio
was measured at 131 KIOPS for random 4K reads and up to 2.8 GB/S
throughput. Some data points are:

Rnd read,   4KB,  QD=1, 1 job : IOPS=16.9k, BW=66.2MiB/s (69.4MB/s)
Rnd read,   4KB, QD=32, 1 job : IOPS=78.5k, BW=307MiB/s (322MB/s)
Rnd read,   4KB, QD=32, 4 jobs: IOPS=131k, BW=511MiB/s (536MB/s)
Seq read, 512KB, QD=32, 1 job : IOPS=5381, BW=2691MiB/s (2821MB/s)

The NVMe PCI endpoint target driver is not intended for production use.
It is a tool for learning NVMe, exploring existing features and testing
implementations of new NVMe features.

Co-developed-by: Rick Wertenbroek <rick.wertenbroek@gmail.com>
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Tested-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Reviewed-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-10 19:30:49 -08:00
Damien Le Moal
a0ed77d4c9 nvmet: Implement arbitration feature support
NVMe base specification v2.1 mandates support for the arbitration
feature (NVME_FEAT_ARBITRATION). Introduce the data structure
struct nvmet_feat_arbitration to define the high, medium and low
priority weight fields and the arbitration burst field of this feature
and implement the functions nvmet_get_feat_arbitration() and
nvmet_set_feat_arbitration() functions to get and set these fields.

Since there is no generic way to implement support for the arbitration
feature, these functions respectively use the controller get_feature()
and set_feature() operations to process the feature with the help of
the controller driver. If the controller driver does not implement these
operations and a get feature command or a set feature command for this
feature is received, the command is failed with an invalid field error.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Rick Wertenbroek <rick.wertenbroek@gmail.com>
Tested-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-10 19:30:49 -08:00
Damien Le Moal
f1ecd491b6 nvmet: Implement interrupt config feature support
The NVMe base specifications v2.1 mandate supporting the interrupt
config feature (NVME_FEAT_IRQ_CONFIG) for PCI controllers. Introduce the
data structure struct nvmet_feat_irq_config to define the coalescing
disabled (cd) and interrupt vector (iv) fields of this feature and
implement the functions nvmet_get_feat_irq_config() and
nvmet_set_feat_irq_config() functions to get and set these fields. These
functions respectively use the controller get_feature() and
set_feature() operations to fill and handle the fields of struct
nvmet_feat_irq_config.

Support for this feature is prohibited for fabrics controllers. If a get
feature command or a set feature command for this feature is received
for a fabrics controller, the command is failed with an invalid field
error.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Rick Wertenbroek <rick.wertenbroek@gmail.com>
Tested-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-10 19:30:49 -08:00
Damien Le Moal
89b94a6cbe nvmet: Implement interrupt coalescing feature support
The NVMe base specifications v2.1 mandate Supporting the interrupt
coalescing feature (NVME_FEAT_IRQ_COALESCE) for PCI controllers.
Introduce the data structure struct nvmet_feat_irq_coalesce to define
the time and threshold (thr) fields of this feature and implement the
functions nvmet_get_feat_irq_coalesce() and
nvmet_set_feat_irq_coalesce() to get and set this feature. These
functions respectively use the controller get_feature() and
set_feature() operations to fill and handle the fields of struct
nvmet_feat_irq_coalesce.

While the Linux kernel nvme driver does not use this feature and thus
will not complain if it is not implemented, other major OSes fail
initializing the NVMe device if this feature support is missing.

Support for this feature is prohibited for fabrics controllers. If a get
feature or set feature command for this feature is received for a
fabrics controller, the command is failed with an invalid field error.

Suggested-by: Rick Wertenbroek <rick.wertenbroek@gmail.com>
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Rick Wertenbroek <rick.wertenbroek@gmail.com>
Tested-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-10 19:30:48 -08:00
Damien Le Moal
2f2b20fad9 nvmet: Implement host identifier set feature support
The NVMe specifications mandate support for the host identifier
set_features for controllers that also supports reservations. Satisfy
this requirement by implementing handling of the NVME_FEAT_HOST_ID
feature for the nvme_set_features command. This implementation is for
now effective only for PCI target controllers. For other controller
types, the set features command is failed with a NVME_SC_CMD_SEQ_ERROR
status as before.

As noted in the code, 128 bits host identifiers are supported since the
NVMe base specifications version 2.1 indicate in section 5.1.25.1.28.1
that "The controller may support a 64-bit Host Identifier...".

The RHII (Reservations and Host Identifier Interaction) bit of the
controller attribute (ctratt) field of the identify controller data is
also set to indicate that a host ID of "0" is supported but that the
host ID must be a non-zero value to use reservations.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Rick Wertenbroek <rick.wertenbroek@gmail.com>
Tested-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-10 19:30:48 -08:00
Damien Le Moal
08461535a9 nvmet: Introduce get/set_feature controller operations
The implementation of some features cannot always be done generically by
the target core code. Arbitraion and IRQ coalescing features are
examples of such features: their implementation must be provided (at
least partially) by the target controller driver.

Introduce the set_feature() and get_feature() controller fabrics
operations (in struct nvmet_fabrics_ops) to allow supporting such
features.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Rick Wertenbroek <rick.wertenbroek@gmail.com>
Tested-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-10 19:30:48 -08:00
Damien Le Moal
1ad8630ffa nvmet: Do not require SGL for PCI target controller commands
Support for SGL is optional for the PCI transport. Modify
nvmet_req_init() to not require the NVME_CMD_SGL_METABUF command flag to
be set if the target controller transport type is NVMF_TRTYPE_PCI.
In addition to this, the NVMe base specification v2.1 mandate that all
admin commands use PRP, that is, have CDW0.PSDT cleared to 0. Modify
nvmet_parse_admin_cmd() to check this.

Finally, modify nvmet_check_transfer_len() and
nvmet_check_data_len_lte() to return the appropriate error status
depending on the command using SGL or PRP. Since for fabrics
nvmet_req_init() checks that a command uses SGL, always, this change
affects only PCI target controllers.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Rick Wertenbroek <rick.wertenbroek@gmail.com>
Tested-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-10 19:30:48 -08:00
Damien Le Moal
60d3cd8561 nvmet: Add support for I/O queue management admin commands
The I/O submission queue management admin commands
(nvme_admin_delete_sq, nvme_admin_create_sq, nvme_admin_delete_cq,
and nvme_admin_create_cq) are mandatory admin commands for I/O
controllers using the PCI transport, that is, support for these commands
is mandatory for a a PCI target I/O controller.

Implement support for these commands by adding the functions
nvmet_execute_delete_sq(), nvmet_execute_create_sq(),
nvmet_execute_delete_cq() and nvmet_execute_create_cq() to set as the
execute method of requests for these commands. These functions will
return an invalid opcode error for any controller that is not a PCI
target controller. Support for the I/O queue management commands is also
reported in the command effect log  of PCI target controllers (using
nvmet_get_cmd_effects_admin()).

Each management command is backed by a controller fabric operation
that can be defined by a PCI target controller driver to setup I/O
queues using nvmet_sq_create() and nvmet_cq_create() or delete I/O
queues using nvmet_sq_destroy().

As noted in a comment in nvmet_execute_create_sq(), we do not yet
support sharing a single CQ between multiple SQs.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Rick Wertenbroek <rick.wertenbroek@gmail.com>
Tested-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-10 19:30:48 -08:00
Damien Le Moal
1eb380caf5 nvmet: Introduce nvmet_sq_create() and nvmet_cq_create()
Introduce the new functions nvmet_sq_create() and nvmet_cq_create() to
allow a target driver to initialize and setup admin and IO queues
directly, without needing to execute connect fabrics commands.
The helper functions nvmet_check_cqid() and nvmet_check_sqid() are
implemented to check the correctness of SQ and CQ IDs when
nvmet_sq_create() and nvmet_cq_create() are called.

nvmet_sq_create() and nvmet_cq_create() are primarily intended for use
with PCI target controller drivers and thus are not well integrated
with the current queue creation of fabrics controllers using the connect
command. These fabrices drivers are not modified to use these functions.
This simple implementation of SQ and CQ management for PCI target
controller drivers does not allow multiple SQs to share the same CQ,
similarly to other fabrics transports. This is a specification
violation. A more involved set of changes will follow to add support for
this required completion queue sharing feature.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Rick Wertenbroek <rick.wertenbroek@gmail.com>
Tested-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-10 19:30:48 -08:00
Damien Le Moal
43043c9b97 nvmet: Introduce nvmet_req_transfer_len()
Add the new function nvmet_req_transfer_len() to parse a request command
to extract the transfer length of the command. This function
implementation relies on multiple helper functions for parsing I/O
commands (nvmet_io_cmd_transfer_len()), admin commands
(nvmet_admin_cmd_data_len()) and fabrics connect commands
(nvmet_connect_cmd_data_len).

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Rick Wertenbroek <rick.wertenbroek@gmail.com>
Tested-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-10 19:30:47 -08:00
Damien Le Moal
6202783184 nvmet: Improve nvmet_alloc_ctrl() interface and implementation
Introduce struct nvmet_alloc_ctrl_args to define the arguments for
the function nvmet_alloc_ctrl() to avoid the need for passing a pointer
to a struct nvmet_req as an argument. This new data structure aggregates
together the arguments that were passed to nvmet_alloc_ctrl()
(subsysnqn, hostnqn and kato), together with the struct nvmet_req fields
used by nvmet_alloc_ctrl(), that is, the fields port, p2p_client, and
ops as input and the result and error_loc fields as output, as well as a
status field. nvmet_alloc_ctrl() is also changed to return a pointer
to the allocated and initialized controller structure instead of a
status code, as the status is now returned through the status field of
struct nvmet_alloc_ctrl_args.

The function nvmet_setup_p2p_ns_map() is changed to not take a pointer
to a struct nvmet_req as argument, instead, directly specify the
p2p_client device pointer needed as argument.

The code in nvmet_execute_admin_connect() that initializes a new target
controller after allocating it is moved into nvmet_alloc_ctrl().
The code that sets up an admin queue for the controller (and the call
to nvmet_install_queue()) remains in nvmet_execute_admin_connect().

Finally, nvmet_alloc_ctrl() is also exported to allow target drivers to
use this function directly to allocate and initialize a new controller
structure without the need to rely on a fabrics connect command request.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Rick Wertenbroek <rick.wertenbroek@gmail.com>
Tested-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-10 19:30:47 -08:00
Damien Le Moal
200adac758 nvme: Add PCI transport type
Define the transport type NVMF_TRTYPE_PCI for PCI endpoint targets.
This transport type is defined using the value 0 which is reserved in
the NVMe base specifications v2.1 (Figure 294). Given that struct
nvmet_port are zeroed out on creation, to avoid having this transsport
type becoming the new default, nvmet_referral_make() and
nvmet_ports_make() are modified to initialize a port discovery address
transport type field (disc_addr.trtype) to NVMF_TRTYPE_MAX.

Any port using this transport type is also skipped and not reported in
the discovery log page (nvmet_execute_disc_get_log_page()).

The helper function nvmet_is_pci_ctrl() is also introduced to check if
a target controller uses the PCI transport.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Rick Wertenbroek <rick.wertenbroek@gmail.com>
Tested-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-10 19:30:47 -08:00
Damien Le Moal
35c593e530 nvmet: Add drvdata field to struct nvmet_ctrl
Allow a target driver to attach private data to a target controller by
adding the new field drvdata to struct nvmet_ctrl.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Rick Wertenbroek <rick.wertenbroek@gmail.com>
Tested-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-10 19:30:47 -08:00
Damien Le Moal
1ee4531054 nvmet: Introduce nvmet_get_cmd_effects_admin()
In order to have a logically better organized implementation of the
effects log page, split out reporting the supported admin commands from
nvmet_get_cmd_effects_nvm() into the new function
nvmet_get_cmd_effects_admin().

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Rick Wertenbroek <rick.wertenbroek@gmail.com>
Tested-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-10 19:30:47 -08:00
Damien Le Moal
15e9d26445 nvmet: Export nvmet_update_cc() and nvmet_cc_xxx() helpers
Make the function nvmet_update_cc() available to target drivers by
exporting it. To also facilitate the manipulation of the cc register
bits, move the inline helper functions nvmet_cc_en(), nvmet_cc_css(),
nvmet_cc_mps(), nvmet_cc_ams(), nvmet_cc_shn(), nvmet_cc_iosqes(), and
nvmet_cc_iocqes() from core.c to nvmet.h so that these functions can be
reused in target controller drivers.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Rick Wertenbroek <rick.wertenbroek@gmail.com>
Tested-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-10 19:30:47 -08:00
Damien Le Moal
5d4f4ea8fa nvmet: Add vendor_id and subsys_vendor_id subsystem attributes
Define the new vendor_id and subsys_vendor_id configfs attribute for
target subsystems. These attributes are respectively reported as the
vid field and as the ssvid field of the identify controller data of
a target controllers using the subsystem for which these attributes
are set.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Rick Wertenbroek <rick.wertenbroek@gmail.com>
Tested-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-10 19:30:46 -08:00
Damien Le Moal
30e77e0fbe nvme: Move opcode string helper functions declarations
Move the declaration of all helper functions converting NVMe command
opcodes and status codes into strings from drivers/nvme/host/nvme.h
into include/linux/nvme.h, together with the commands definitions.
This allows NVMe target drivers to call these functions without having
to include a host header file.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Rick Wertenbroek <rick.wertenbroek@gmail.com>
Tested-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-10 19:30:46 -08:00
Yongsoo Joo
002bb02729 nvme: change return type of nvme_poll_cq() to bool
The nvme_poll_cq() function currently returns the number of CQEs
found, However, only one caller, nvme_poll(), requires a boolean
value to check whether any CQE was completed. The other callers do
not use the return value at all.

To better reflect its usage, update the return type of nvme_poll_cq()
from int to bool.

Signed-off-by: Yongsoo Joo <ysjoo@kookmin.ac.kr>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-10 19:30:46 -08:00
Keisuke Nishimura
ac32057acc nvme: Add error check for xa_store in nvme_get_effects_log
The xa_store() may fail due to memory allocation failure because there
is no guarantee that the index csi is already used. This fix adds an
error check of the return value of xa_store() in nvme_get_effects_log().

Fixes: 1cf7a12e09 ("nvme: use an xarray to lookup the Commands Supported and Effects log")
Signed-off-by: Keisuke Nishimura <keisuke.nishimura@inria.fr>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-10 19:30:46 -08:00
Sagi Grimberg
3219378987 nvme-tcp: Fix I/O queue cpu spreading for multiple controllers
Since day-1 we are assigning the queue io_cpu very naively. We always
base the queue id (controller scope) and assign it its matching cpu
from the online mask. This works fine when the number of queues match
the number of cpu cores.

The problem starts when we have less queues than cpu cores. First, we
should take into account the mq_map and select a cpu within the cpus
that are assigned to this queue by the mq_map in order to minimize cross
numa cpu bouncing.

Second, even worse is that we don't take into account multiple
controllers may have assigned queues to a given cpu. As a result we may
simply compund more and more queues on the same set of cpus, which is
suboptimal.

We fix this by introducing global per-cpu counters that tracks the
number of queues assigned to each cpu, and we select the least used cpu
based on the mq_map and the per-cpu counters, and assign it as the queue
io_cpu.

The behavior for a single controller is slightly optimized by selecting
better cpu candidates by consulting with the mq_map, and multiple
controllers are spreading queues among cpu cores much better, resulting
in lower average cpu load, and less likelihood to hit hotspots.

Note that the accounting is not 100% perfect, but we don't need to be,
we're simply putting our best effort to select the best candidate cpu
core that we find at any given point.

Another byproduct is that every controller reset/reconnect may change
the queues io_cpu mapping, based on the current LRU accounting scheme.

Here is the baseline queue io_cpu assignment for 4 controllers, 2 queues
per controller, and 4 cpus on the host:
nvme1: queue 0: using cpu 0
nvme1: queue 1: using cpu 1
nvme2: queue 0: using cpu 0
nvme2: queue 1: using cpu 1
nvme3: queue 0: using cpu 0
nvme3: queue 1: using cpu 1
nvme4: queue 0: using cpu 0
nvme4: queue 1: using cpu 1

And this is the fixed io_cpu assignment:
nvme1: queue 0: using cpu 0
nvme1: queue 1: using cpu 2
nvme2: queue 0: using cpu 1
nvme2: queue 1: using cpu 3
nvme3: queue 0: using cpu 0
nvme3: queue 1: using cpu 2
nvme4: queue 0: using cpu 1
nvme4: queue 1: using cpu 3

Fixes: 3f2304f8c6 ("nvme-tcp: add NVMe over TCP host driver")
Suggested-by: Hannes Reinecke <hare@kernel.org>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
[fixed kbuild reported errors]
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-10 19:30:04 -08:00
Christoph Hellwig
473106dd3a nvme: fix queue freeze vs limits lock order
Match the locking order used by the core block code by only freezing
the queue after taking the limits lock.

Unlike most queue updates this does not use the
queue_limits_commit_update_frozen helper as the nvme driver want the
queue frozen for more than just the limits update.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250110054726.1499538-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-10 07:29:24 -07:00
Guixin Liu
3ec5c62cfc nvmet: handle rw's limited retry flag
In some scenarios, some multipath software setup places the
REQ_FAILFAST_DEV flag on I/O to prevent retries and immediately
switch to other paths for issuing I/O commands. This will reflect
on the NVMe read and write commands with the limited retry flag.

However, the current NVMe target side does not handle the limited
retry flag, and the target's underlying driver still retries the
I/O. This will result in the I/O not being quickly switched to
other paths, ultimately leading to increased I/O latency.

When the nvme target receive an rw command with limited retry flag,
handle it in block backend by setting the REQ_FAILFAST_DEV flag to
bio.

Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-07 08:05:19 -08:00
Christoph Hellwig
e7602bb4f3 block: remove BLK_MQ_F_NO_SCHED
The only queues that really can't support a scheduler are those that
do not have a gendisk associated with them, and thus can't be used for
non-passthrough commands.  In addition to those null_blk can optionally
set the flag, which is a bad odd.  Replace the null_blk usage with
BLK_MQ_F_NO_SCHED_BY_DEFAULT to keep the expected semantics and then
remove BLK_MQ_F_NO_SCHED as the non-disk queues never call into
elevator_init_mq or blk_register_queue which adds the sysfs attributes.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250106083531.799976-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-06 07:37:41 -07:00
Christoph Hellwig
6aeb4f8364 block: remove bio_add_pc_page
Lift bio_split_rw_at into blk_rq_append_bio so that it validates the
hardware limits.  With this all passthrough callers can simply add
bio_add_page to build the bio and delay checking for exceeding of limits
to this point instead of doing it for each page.

While this looks like adding a new expensive loop over all bio_vecs,
blk_rq_append_bio is already doing that just to counter the number of
segments.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Link: https://lore.kernel.org/r/20250103073417.459715-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-04 15:27:35 -07:00
Jens Axboe
cc0331e29f nvme fixes for Linux 6.13
- Fix device specific quirk for PRP list alignment (Robert)
  - Fix target name overflow (Leo)
  - Fix target write granularity (Luis)
  - Fix target sleeping in atomic context (Nilay)
  - Remove unnecessary tcp queue teardown (Chunguang)
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE3Fbyvv+648XNRdHTPe3zGtjzRgkFAmd0K5cACgkQPe3zGtjz
 Rgm5jQ/9EdJb8y3eFetsAs7P5JyeZKdOWOBLgm1fdP5kPeQrbPgOqtXag2JQbdLY
 rFI7fm9wTsSnrz+v+4iRTwS0MPd/WTJw9cA+lTFlFBoc871Obg+aniiswxW+lnl2
 1KLzVSRFU3LbSSSBCNi+op+MVIgbVmiLZq+mKI7JqG4WrDXumEUSlNV4gLtDPiJo
 Z8fYoEtZIgsfmm9p8ySs6nmyrqsyM567ISoaxhAcisfAIXz20ul3fHkeLI4wa0xD
 gQPWdrz9Yz5aijr40FuiFBwKHU1Zg/vlqTl8o5gGZvKsx/epJpnQNoDTagmmYR8u
 oGU+c1R9LHVnxJjDitP6uyseafCYBJVfCdZoVXFdcDvc9aY2Pe2Sgo5y9IZGuLlP
 Vis7PHN/vFpTF1SRVFDALxXSkGR67zYSVB58CpWyxIFv3y1B212yicvrPfT5xcNE
 SbZglQd9qVaGuzXwKKHf80NOucEgagxYluCDKBOXCZj+u0S92ZAHuWv88WvzIFdL
 oK/GmFNxE3hAUfENf6FJ8Rfzx9+a+bN+QpaI1SCGFAS7dsQ8qlOBDPfQnYT8Q2T7
 yN1LHvnBEqx59/yCSN0FdWTpTEne8TgGJkj+EZN9sblbJaSfOr3f5rQbz59SXHHh
 Zap5KIlaPsPdukcWwrZp34meq3BpUuP3ELPxxcdINS4busOLP98=
 =SspQ
 -----END PGP SIGNATURE-----

Merge tag 'nvme-6.13-2024-12-31' of git://git.infradead.org/nvme into block-6.13

Pull NVMe fixes from Keith:

"nvme fixes for Linux 6.13

 - Fix device specific quirk for PRP list alignment (Robert)
 - Fix target name overflow (Leo)
 - Fix target write granularity (Luis)
 - Fix target sleeping in atomic context (Nilay)
 - Remove unnecessary tcp queue teardown (Chunguang)"

* tag 'nvme-6.13-2024-12-31' of git://git.infradead.org/nvme:
  nvme-tcp: remove nvme_tcp_destroy_io_queues()
  nvmet-loop: avoid using mutex in IO hotpath
  nvmet: propagate npwg topology
  nvmet: Don't overflow subsysnqn
  nvme-pci: 512 byte aligned dma pool segment quirk
2024-12-31 10:41:58 -07:00
Chunguang.xu
36e3b1f9ab nvme-tcp: remove nvme_tcp_destroy_io_queues()
Now when destroying the IO queue we call nvme_tcp_stop_io_queues()
twice, nvme_tcp_destroy_io_queues() has an unnecessary call. Here we
try to remove nvme_tcp_destroy_io_queues() and merge it into
nvme_tcp_teardown_io_queues(), simplify the code and align with
nvme-rdma, make it easy to maintaince.

Signed-off-by: Chunguang.xu <chunguang.xu@shopee.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-12-27 13:33:48 -08:00
Nilay Shroff
74d16965d7 nvmet-loop: avoid using mutex in IO hotpath
Using mutex lock in IO hot path causes the kernel BUG sleeping while
atomic. Shinichiro[1], first encountered this issue while running blktest
nvme/052 shown below:

BUG: sleeping function called from invalid context at kernel/locking/mutex.c:585
in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 996, name: (udev-worker)
preempt_count: 0, expected: 0
RCU nest depth: 1, expected: 0
2 locks held by (udev-worker)/996:
 #0: ffff8881004570c8 (mapping.invalidate_lock){.+.+}-{3:3}, at: page_cache_ra_unbounded+0x155/0x5c0
 #1: ffffffff8607eaa0 (rcu_read_lock){....}-{1:2}, at: blk_mq_flush_plug_list+0xa75/0x1950
CPU: 2 UID: 0 PID: 996 Comm: (udev-worker) Not tainted 6.12.0-rc3+ #339
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0x6a/0x90
 __might_resched.cold+0x1f7/0x23d
 ? __pfx___might_resched+0x10/0x10
 ? vsnprintf+0xdeb/0x18f0
 __mutex_lock+0xf4/0x1220
 ? nvmet_subsys_nsid_exists+0xb9/0x150 [nvmet]
 ? __pfx_vsnprintf+0x10/0x10
 ? __pfx___mutex_lock+0x10/0x10
 ? snprintf+0xa5/0xe0
 ? xas_load+0x1ce/0x3f0
 ? nvmet_subsys_nsid_exists+0xb9/0x150 [nvmet]
 nvmet_subsys_nsid_exists+0xb9/0x150 [nvmet]
 ? __pfx_nvmet_subsys_nsid_exists+0x10/0x10 [nvmet]
 nvmet_req_find_ns+0x24e/0x300 [nvmet]
 nvmet_req_init+0x694/0xd40 [nvmet]
 ? blk_mq_start_request+0x11c/0x750
 ? nvme_setup_cmd+0x369/0x990 [nvme_core]
 nvme_loop_queue_rq+0x2a7/0x7a0 [nvme_loop]
 ? __pfx___lock_acquire+0x10/0x10
 ? __pfx_nvme_loop_queue_rq+0x10/0x10 [nvme_loop]
 __blk_mq_issue_directly+0xe2/0x1d0
 ? __pfx___blk_mq_issue_directly+0x10/0x10
 ? blk_mq_request_issue_directly+0xc2/0x140
 blk_mq_plug_issue_direct+0x13f/0x630
 ? lock_acquire+0x2d/0xc0
 ? blk_mq_flush_plug_list+0xa75/0x1950
 blk_mq_flush_plug_list+0xa9d/0x1950
 ? __pfx_blk_mq_flush_plug_list+0x10/0x10
 ? __pfx_mpage_readahead+0x10/0x10
 __blk_flush_plug+0x278/0x4d0
 ? __pfx___blk_flush_plug+0x10/0x10
 ? lock_release+0x460/0x7a0
 blk_finish_plug+0x4e/0x90
 read_pages+0x51b/0xbc0
 ? __pfx_read_pages+0x10/0x10
 ? lock_release+0x460/0x7a0
 page_cache_ra_unbounded+0x326/0x5c0
 force_page_cache_ra+0x1ea/0x2f0
 filemap_get_pages+0x59e/0x17b0
 ? __pfx_filemap_get_pages+0x10/0x10
 ? lock_is_held_type+0xd5/0x130
 ? __pfx___might_resched+0x10/0x10
 ? find_held_lock+0x2d/0x110
 filemap_read+0x317/0xb70
 ? up_write+0x1ba/0x510
 ? __pfx_filemap_read+0x10/0x10
 ? inode_security+0x54/0xf0
 ? selinux_file_permission+0x36d/0x420
 blkdev_read_iter+0x143/0x3b0
 vfs_read+0x6ac/0xa20
 ? __pfx_vfs_read+0x10/0x10
 ? __pfx_vm_mmap_pgoff+0x10/0x10
 ? __pfx___seccomp_filter+0x10/0x10
 ksys_read+0xf7/0x1d0
 ? __pfx_ksys_read+0x10/0x10
 do_syscall_64+0x93/0x180
 ? lockdep_hardirqs_on_prepare+0x16d/0x400
 ? do_syscall_64+0x9f/0x180
 ? lockdep_hardirqs_on+0x78/0x100
 ? do_syscall_64+0x9f/0x180
 ? lockdep_hardirqs_on_prepare+0x16d/0x400
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f565bd1ce11
Code: 00 48 8b 15 09 90 0d 00 f7 d8 64 89 02 b8 ff ff ff ff eb bd e8 d0 ad 01 00 f3 0f 1e fa 80 3d 35 12 0e 00 00 74 13 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 4f c3 66 0f 1f 44 00 00 55 48 89 e5 48 83 ec
RSP: 002b:00007ffd6e7a20c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007f565bd1ce11
RDX: 0000000000001000 RSI: 00007f565babb000 RDI: 0000000000000014
RBP: 00007ffd6e7a2130 R08: 00000000ffffffff R09: 0000000000000000
R10: 0000556000bfa610 R11: 0000000000000246 R12: 000000003ffff000
R13: 0000556000bfa5b0 R14: 0000000000000e00 R15: 0000556000c07328
 </TASK>

Apparently, the above issue is caused due to using mutex lock while
we're in IO hot path. It's a regression caused with commit 505363957f
("nvmet: fix nvme status code when namespace is disabled"). The mutex
->su_mutex is used to find whether a disabled nsid exists in the config
group or not. This is to differentiate between a nsid that is disabled
vs non-existent.

To mitigate the above issue, we've worked upon a fix[2] where we now
insert nsid in subsys Xarray as soon as it's created under config group
and later when that nsid is enabled, we add an Xarray mark on it and set
ns->enabled to true. The Xarray mark is useful while we need to loop
through all enabled namepsaces under a subsystem using xa_for_each_marked()
API. If later a nsid is disabled then we clear Xarray mark from it and also
set ns->enabled to false. It's only when nsid is deleted from the config
group we delete it from the Xarray.

So with this change, now we could easily differentiate a nsid is disabled
(i.e. Xarray entry for ns exists but ns->enabled is set to false) vs non-
existent (i.e.Xarray entry for ns doesn't exist).

Link: https://lore.kernel.org/linux-nvme/20241022070252.GA11389@lst.de/ [2]
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Closes: https://lore.kernel.org/linux-nvme/tqcy3sveity7p56v7ywp7ssyviwcb3w4623cnxj3knoobfcanq@yxgt2mjkbkam/ [1]
Fixes: 505363957f ("nvmet: fix nvme status code when namespace is disabled")
Fix-suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-12-27 13:24:00 -08:00
Luis Chamberlain
b579d6fdc3 nvmet: propagate npwg topology
Ensure we propagate npwg to the target as well instead
of assuming its the same logical blocks per physical block.

This ensures devices with large IUs information properly
propagated on the target.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-12-27 13:18:01 -08:00
Leo Stone
4db3d750ac nvmet: Don't overflow subsysnqn
nvmet_root_discovery_nqn_store treats the subsysnqn string like a fixed
size buffer, even though it is dynamically allocated to the size of the
string.

Create a new string with kstrndup instead of using the old buffer.

Reported-by: syzbot+ff4aab278fa7e27e0f9e@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=ff4aab278fa7e27e0f9e
Fixes: 95409e277d ("nvmet: implement unique discovery NQN")
Signed-off-by: Leo Stone <leocstone@gmail.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-12-27 13:14:31 -08:00
Christoph Hellwig
cc76ace465 block: remove BLK_MQ_F_SHOULD_MERGE
BLK_MQ_F_SHOULD_MERGE is set for all tag_sets except those that purely
process passthrough commands (bsg-lib, ufs tmf, various nvme admin
queues) and thus don't even check the flag.  Remove it to simplify the
driver interface.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241219060214.1928848-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-23 08:17:23 -07:00
Daniel Wagner
4425f6492a nvme: replace blk_mq_pci_map_queues with blk_mq_map_hw_queues
Replace all users of blk_mq_pci_map_queues with the more generic
blk_mq_map_hw_queues. This in preparation to retire
blk_mq_pci_map_queues.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Daniel Wagner <wagi@kernel.org>
Link: https://lore.kernel.org/r/20241202-refactor-blk-affinity-helpers-v6-6-27211e9c2cd5@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-23 08:17:23 -07:00
Kanchan Joshi
472292cd8c nvme: add support for passing on the application tag
With user integrity buffer, there is a way to specify the app_tag.
Set the corresponding protocol specific flags and send the app_tag down.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20241128112240.8867-9-anuj20.g@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-23 08:17:16 -07:00
Anuj Gupta
2c0487d8b1 block: introduce BIP_CHECK_GUARD/REFTAG/APPTAG bip_flags
This patch introduces BIP_CHECK_GUARD/REFTAG/APPTAG bip_flags which
indicate how the hardware should check the integrity payload.
BIP_CHECK_GUARD/REFTAG are conversion of existing semantics, while
BIP_CHECK_APPTAG is a new flag. The driver can now just rely on block
layer flags, and doesn't need to know the integrity source. Submitter
of PI decides which tags to check. This would also give us a unified
interface for user and kernel generated integrity.

Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20241128112240.8867-8-anuj20.g@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-23 08:17:16 -07:00
Luis Chamberlain
51588b1b77 nvme: use blk_validate_block_size() for max LBA check
The block layer already has support to validates proper block sizes
with blk_validate_block_size(), we can leverage that as well.

No functional changes.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241218020212.3657139-3-mcgrof@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-18 07:22:30 -07:00
Robert Beckett
ebefac5647 nvme-pci: 512 byte aligned dma pool segment quirk
We initially introduced a quick fix limiting the queue depth to 1 as
experimentation showed that it fixed data corruption on 64GB steamdecks.

Further experimentation revealed corruption only happens when the last
PRP data element aligns to the end of the page boundary. The device
appears to treat this as a PRP chain to a new list instead of the data
element that it actually is. This implementation is in violation of the
spec. Encountering this errata with the Linux driver requires the host
request a 128k transfer and coincidently be handed the last small pool
dma buffer within a page.

The QD1 quirk effectly works around this because the last data PRP
always was at a 248 byte offset from the page start, so it never
appeared at the end of the page, but comes at the expense of throttling
IO and wasting the remainder of the PRP page beyond 256 bytes. Also to
note, the MDTS on these devices is small enough that the "large" prp
pool can hold enough PRP elements to never reach the end, so that pool
is not a problem either.

Introduce a new quirk to ensure the small pool is always aligned such
that the last PRP element can't appear a the end of the page. This comes
at the expense of wasting 256 bytes per small pool page allocated.

Link: https://lore.kernel.org/linux-nvme/20241113043151.GA20077@lst.de/T/#u
Fixes: 83bdfcbdbe ("nvme-pci: qdepth 1 quirk")
Cc: Paweł Anikiel <panikiel@google.com>
Signed-off-by: Robert Beckett <bob.beckett@collabora.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-12-11 14:46:23 -08:00
Jens Axboe
d64fd5f777 nvme fixes for Linux 6.13
- Target fix using incorrect zero buffer (Nilay)
  - Device specifc deallocate quirk fixes (Christoph, Keith)
  - Fabrics fix for handling max command target bugs (Maurizio)
  - Cocci fix usage for kzalloc (Yu-Chen)
  - DMA size fix for host memory buffer feature (Christoph)
  - Fabrics queue cleanup fixes (Chunguang)
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE3Fbyvv+648XNRdHTPe3zGtjzRgkFAmdR3fkACgkQPe3zGtjz
 RgkAbQ/+IblDuMKO6pzZkzMVECwjtJz8E195weCBVWsc5FE80AXX9lz3I5w0N1I5
 5etZlXoQ9YDyPWIzlkg2WKkXtGI4vKUUiJ02dXbre7xBMbRA1Qb4C9hH2KmvnxA4
 Qb52717FSMDxzjQKz4e3cz34cS/RaDENypSvAu1SDGvJhKSNrkEjz/lVXy6eFWvI
 DELL7yDOdl4H91G4vGlC7eKV0BI6rCAThyAnivkKS1P+UWy9tJja7UNKix2cuh6Y
 PhEdSynH+JgfKO873j45qEPWaa/htjXSsa0gp6yEd2ZdO5lIeoDRbLYCX8gE6Vcs
 yw4xbuQwexhnJi4LVeAR3fUNA2RdPy8zSjVM0gwpQCI161QEocl/KaWNmBcjcwLp
 0dZGRftQoLGN+Ugmv7eTBBN3qb60nYwfCDxHrLv6ffQyai2++2b1PqLnsd+oHGOQ
 vjmFxBSDbmXybbb5h47lJY+EU6Cr3HRKq136Ypmwx7t3d+1cLCzVPsal///5AuYq
 /Srg1CENmAWEiM4LnuZYqs45BGu+/HN4TWjbi1ddAcPmc+YpOz89hdHZ/TnS9Gjl
 nVN8CHrCBuSnG9ZVyla+nFaqAr3A8YFWfDNp05AVFUsLW3hWrWrj2m0wmGegwoIE
 JVgC+ebhQg6UWTAdq4gBnSwz8bbgj/YwRYbqqzfJTioOXbgkZBY=
 =niio
 -----END PGP SIGNATURE-----

Merge tag 'nvme-6.13-2024-12-05' of git://git.infradead.org/nvme into block-6.13

Pull NVMe fixess from Keith:

"nvme fixes for Linux 6.13

 - Target fix using incorrect zero buffer (Nilay)
 - Device specifc deallocate quirk fixes (Christoph, Keith)
 - Fabrics fix for handling max command target bugs (Maurizio)
 - Cocci fix usage for kzalloc (Yu-Chen)
 - DMA size fix for host memory buffer feature (Christoph)
 - Fabrics queue cleanup fixes (Chunguang)"

* tag 'nvme-6.13-2024-12-05' of git://git.infradead.org/nvme:
  nvme-tcp: simplify nvme_tcp_teardown_io_queues()
  nvme-tcp: no need to quiesce admin_q in nvme_tcp_teardown_io_queues()
  nvme-rdma: unquiesce admin_q before destroy it
  nvme-tcp: fix the memleak while create new ctrl failed
  nvme-pci: don't use dma_alloc_noncontiguous with 0 merge boundary
  nvmet: replace kmalloc + memset with kzalloc for data allocation
  nvme-fabrics: handle zero MAXCMD without closing the connection
  nvme-pci: remove two deallocate zeroes quirks
  nvme: don't apply NVME_QUIRK_DEALLOCATE_ZEROES when DSM is not supported
  nvmet: use kzalloc instead of ZERO_PAGE in nvme_execute_identify_ns_nvm()
2024-12-05 10:14:36 -07:00
Chunguang.xu
b4e12f5728 nvme-tcp: simplify nvme_tcp_teardown_io_queues()
As nvme_tcp_teardown_io_queues() is the only one caller of
nvme_tcp_destroy_admin_queue(), so we can merge it into
nvme_tcp_teardown_io_queues() to simplify the code.

Signed-off-by: Chunguang.xu <chunguang.xu@shopee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-12-04 10:15:46 -08:00
Chunguang.xu
fdc5664c69 nvme-tcp: no need to quiesce admin_q in nvme_tcp_teardown_io_queues()
As we quiesce admin_q in nvme_tcp_teardown_admin_queue(), so we should no
need to quiesce it in nvme_tcp_reaardown_io_queues(), make things simple.

Signed-off-by: Chunguang.xu <chunguang.xu@shopee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-12-04 10:15:46 -08:00
Chunguang.xu
5858b68755 nvme-rdma: unquiesce admin_q before destroy it
Kernel will hang on destroy admin_q while we create ctrl failed, such
as following calltrace:

PID: 23644    TASK: ff2d52b40f439fc0  CPU: 2    COMMAND: "nvme"
 #0 [ff61d23de260fb78] __schedule at ffffffff8323bc15
 #1 [ff61d23de260fc08] schedule at ffffffff8323c014
 #2 [ff61d23de260fc28] blk_mq_freeze_queue_wait at ffffffff82a3dba1
 #3 [ff61d23de260fc78] blk_freeze_queue at ffffffff82a4113a
 #4 [ff61d23de260fc90] blk_cleanup_queue at ffffffff82a33006
 #5 [ff61d23de260fcb0] nvme_rdma_destroy_admin_queue at ffffffffc12686ce
 #6 [ff61d23de260fcc8] nvme_rdma_setup_ctrl at ffffffffc1268ced
 #7 [ff61d23de260fd28] nvme_rdma_create_ctrl at ffffffffc126919b
 #8 [ff61d23de260fd68] nvmf_dev_write at ffffffffc024f362
 #9 [ff61d23de260fe38] vfs_write at ffffffff827d5f25
    RIP: 00007fda7891d574  RSP: 00007ffe2ef06958  RFLAGS: 00000202
    RAX: ffffffffffffffda  RBX: 000055e8122a4d90  RCX: 00007fda7891d574
    RDX: 000000000000012b  RSI: 000055e8122a4d90  RDI: 0000000000000004
    RBP: 00007ffe2ef079c0   R8: 000000000000012b   R9: 000055e8122a4d90
    R10: 0000000000000000  R11: 0000000000000202  R12: 0000000000000004
    R13: 000055e8122923c0  R14: 000000000000012b  R15: 00007fda78a54500
    ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b

This due to we have quiesced admi_q before cancel requests, but forgot
to unquiesce before destroy it, as a result we fail to drain the
pending requests, and hang on blk_mq_freeze_queue_wait() forever. Here
try to reuse nvme_rdma_teardown_admin_queue() to fix this issue and
simplify the code.

Fixes: 958dc1d32c ("nvme-rdma: add clean action for failed reconnection")
Reported-by: Yingfu.zhou <yingfu.zhou@shopee.com>
Signed-off-by: Chunguang.xu <chunguang.xu@shopee.com>
Signed-off-by: Yue.zhao <yue.zhao@shopee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-12-04 10:15:46 -08:00
Chunguang.xu
fec55c29e5 nvme-tcp: fix the memleak while create new ctrl failed
Now while we create new ctrl failed, we have not free the
tagset occupied by admin_q, here try to fix it.

Fixes: fd1418de10 ("nvme-tcp: avoid open-coding nvme_tcp_teardown_admin_queue()")
Signed-off-by: Chunguang.xu <chunguang.xu@shopee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-12-04 10:15:46 -08:00
Christoph Hellwig
ad0cf42e1f nvme-pci: don't use dma_alloc_noncontiguous with 0 merge boundary
Only call into nvme_alloc_host_mem_single which uses
dma_alloc_noncontiguous when there is non-null dma merge boundary.
Without this we'll call into dma_alloc_noncontiguous for device using
dma-direct, which can work fine as long as the preferred size is below the
MAX_ORDER of the page allocator, but blows up with a warning if it is
too large.

Fixes: 63a5c7a4b4 ("nvme-pci: use dma_alloc_noncontigous if possible")
Reported-by: Leon Romanovsky <leon@kernel.org>
Reported-by: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Leon Romanovsky <leon@kernel.org>
Tested-by: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-12-04 09:23:11 -08:00
Yu-Chun Lin
41d826c8a9 nvmet: replace kmalloc + memset with kzalloc for data allocation
cocci warnings: (new ones prefixed by >>)
>> drivers/nvme/target/pr.c:831:8-15: WARNING: kzalloc should be used for data, instead of kmalloc/memset

The pattern of using 'kmalloc' followed by 'memset' is replaced with
'kzalloc', which is functionally equivalent to 'kmalloc' + 'memset',
but more efficient. 'kzalloc' automatically zeroes the allocated
memory, making it a faster and more streamlined solution.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202411301434.LEckbcWx-lkp@intel.com/
Reviewed-by: Kuan-Wei Chiu <visitorckw@gmail.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Yu-Chun Lin <eleanor15x@gmail.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-12-04 09:20:00 -08:00
Maurizio Lombardi
88c23a32b8 nvme-fabrics: handle zero MAXCMD without closing the connection
The NVMe specification states that MAXCMD is mandatory
for NVMe-over-Fabrics implementations. However, some NVMe/TCP
and NVMe/FC arrays from major vendors have buggy firmware
that reports MAXCMD as zero in the Identify Controller data structure.

Currently, the implementation closes the connection in such cases,
completely preventing the host from connecting to the target.

Fix the issue by printing a clear error message about the firmware bug
and allowing the connection to proceed. It assumes that the
target supports a MAXCMD value of SQSIZE + 1. If any issues arise,
the user can manually adjust SQSIZE to mitigate them.

Fixes: 4999568184 ("nvme-fabrics: check max outstanding commands")
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Reviewed-by: Laurence Oberman <loberman@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-12-04 09:19:54 -08:00
Keith Busch
b0de5456e2 nvme-pci: remove two deallocate zeroes quirks
The quirk was initially used as a signal to set the discard_zeroes_data
queue limit because there were some use cases that relied on that
behavior. The queue limit no longer exists as every user of it has been
converted to use the write zeroes operation instead.

The quirk now means to use a discard command as an alias to a write
zeroes request. Two of the devices previously using the quirk support
the write zeroes command directly, so these don't need or want to use
discard when the desired operation is to write zeroes.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-12-03 10:56:27 -08:00
Peter Zijlstra
cdd30ebb1b module: Convert symbol namespace to string literal
Clean up the existing export namespace code along the same lines of
commit 33def8498f ("treewide: Convert macro and uses of __section(foo)
to __section("foo")") and for the same reason, it is not desired for the
namespace argument to be a macro expansion itself.

Scripted using

  git grep -l -e MODULE_IMPORT_NS -e EXPORT_SYMBOL_NS | while read file;
  do
    awk -i inplace '
      /^#define EXPORT_SYMBOL_NS/ {
        gsub(/__stringify\(ns\)/, "ns");
        print;
        next;
      }
      /^#define MODULE_IMPORT_NS/ {
        gsub(/__stringify\(ns\)/, "ns");
        print;
        next;
      }
      /MODULE_IMPORT_NS/ {
        $0 = gensub(/MODULE_IMPORT_NS\(([^)]*)\)/, "MODULE_IMPORT_NS(\"\\1\")", "g");
      }
      /EXPORT_SYMBOL_NS/ {
        if ($0 ~ /(EXPORT_SYMBOL_NS[^(]*)\(([^,]+),/) {
  	if ($0 !~ /(EXPORT_SYMBOL_NS[^(]*)\(([^,]+), ([^)]+)\)/ &&
  	    $0 !~ /(EXPORT_SYMBOL_NS[^(]*)\(\)/ &&
  	    $0 !~ /^my/) {
  	  getline line;
  	  gsub(/[[:space:]]*\\$/, "");
  	  gsub(/[[:space:]]/, "", line);
  	  $0 = $0 " " line;
  	}

  	$0 = gensub(/(EXPORT_SYMBOL_NS[^(]*)\(([^,]+), ([^)]+)\)/,
  		    "\\1(\\2, \"\\3\")", "g");
        }
      }
      { print }' $file;
  done

Requested-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://mail.google.com/mail/u/2/#inbox/FMfcgzQXKWgMmjdFwwdsfgxzKpVHWPlc
Acked-by: Greg KH <gregkh@linuxfoundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-12-02 11:34:44 -08:00
Christoph Hellwig
58a0c875ce nvme: don't apply NVME_QUIRK_DEALLOCATE_ZEROES when DSM is not supported
Commit 63dfa10043 ("nvme: move NVME_QUIRK_DEALLOCATE_ZEROES out of
nvme_config_discard") started applying the NVME_QUIRK_DEALLOCATE_ZEROES
quirk even then the Dataset Management is not supported.  It turns out
that there versions of these old Intel SSDs that have DSM support
disabled in the firmware, which will now lead to errors everytime
a Write Zeroes command is issued.  Fix this by checking for DSM support
before applying the quirk.

Reported-by: Saeed Mirzamohammadi <saeed.mirzamohammadi@oracle.com>
Fixes: 63dfa10043 ("nvme: move NVME_QUIRK_DEALLOCATE_ZEROES out of nvme_config_discard")
Tested-by: Saeed Mirzamohammadi <saeed.mirzamohammadi@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-12-02 10:03:19 -08:00
Nilay Shroff
84909f7dec nvmet: use kzalloc instead of ZERO_PAGE in nvme_execute_identify_ns_nvm()
The nvme_execute_identify_ns_nvm function uses ZERO_PAGE for copying
SG list with all zeros. As ZERO_PAGE would not necessarily return the
virtual-address of the zero page, we need to first convert the page
address to kernel virtual-address and then use it as source address
for copying the data to SG list with all zeros. Using return address
of ZERO_PAGE(0) as source address for copying data to SG list would
fill the target buffer with random/garbage value and causes the
undesired side effect.

As other identify implemenations uses kzalloc for allocating a zero
filled buffer, we decided use kzalloc for allocating a zero filled
buffer in nvme_execute_identify_ns_nvm function and then use this
buffer for copying all zeros to SG list buffers. So esentially, we
now avoid using ZERO_PAGE.

Reported-by: Yi Zhang <yi.zhang@redhat.com>
Fixes: 64a51080ea ("nvmet: implement id ns for nvm command set")
Link: https://lore.kernel.org/all/CAHj4cs8OVyxmn4XTvA=y4uQ3qWpdw-x3M3FSUYr-KpE-nhaFEA@mail.gmail.com/
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-12-02 10:02:47 -08:00
Linus Torvalds
e70140ba0d Get rid of 'remove_new' relic from platform driver struct
The continual trickle of small conversion patches is grating on me, and
is really not helping.  Just get rid of the 'remove_new' member
function, which is just an alias for the plain 'remove', and had a
comment to that effect:

  /*
   * .remove_new() is a relic from a prototype conversion of .remove().
   * New drivers are supposed to implement .remove(). Once all drivers are
   * converted to not use .remove_new any more, it will be dropped.
   */

This was just a tree-wide 'sed' script that replaced '.remove_new' with
'.remove', with some care taken to turn a subsequent tab into two tabs
to make things line up.

I did do some minimal manual whitespace adjustment for places that used
spaces to line things up.

Then I just removed the old (sic) .remove_new member function, and this
is the end result.  No more unnecessary conversion noise.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-12-01 15:12:43 -08:00
Linus Torvalds
cfd47302ac block-6.13-20242901
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmdJ6jwQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpvgeEADaPL+qmsRyp070K1OI/yTA9Jhf6liEyJ31
 0GPVar5Vt6ObH6/POObrJqbBtAo5asQanFvwyVyLztKYxPHU7sdQaRcD+vvj7q3+
 EhmQZSKM7Spp77awWhRWbeQfUBvdhTeGHjQH/0e60eIrF9KtEL9sM9hVqc8hBD9F
 YtDNWPCk7Rz1PPNYlGEkQ2JmYmaxh3Gn29c/k1cSSo3InEQOFj6x+0Cgz6RjbTx3
 9HfpLhVG3WV5MlZCCwp7KG36aJzlc0nq53x/sC9cg+F17RvL2EwNAOUfLl75/Kp/
 t7PCQSd2ODciiDN9qZW71KGtVtlJ07W048Rk0nB+ogneC0uh4fuIYTidP9D7io7D
 bBMrhDuUpnlPzlOqg0aeedXePQL7TRfT3CTentol6xldqg14n7C4QTQFQMSJCgJf
 gr4YCTwl0RTknXo0A3ja16XwsUq5+2xsSoCTU25TY+wgKiAcc5lN9fhbvPRzbCQC
 u9EQ9I9IFAMqEdnE51sw0x16fLtN2w4/zOkvTF+gD/KooEjSn9lcfeNue7jt1O0/
 gFvFJCdXK/2GgxwHihvsEVdcNeaS8JowNafKUsfOM2G0qWQbY+l2vl/b5PfwecWi
 0knOaqNWlGMwrQ+z+fgsEeFG7X98ninC7tqVZpzoZ7j0x65anH+Jq4q1Egongj0H
 90zclclxjg==
 =6cbB
 -----END PGP SIGNATURE-----

Merge tag 'block-6.13-20242901' of git://git.kernel.dk/linux

Pull more block updates from Jens Axboe:

 - NVMe pull request via Keith:
      - Use correct srcu list traversal (Breno)
      - Scatter-gather support for metadata (Keith)
      - Fabrics shutdown race condition fix (Nilay)
      - Persistent reservations updates (Guixin)

 - Add the required bits for MD atomic write support for raid0/1/10

 - Correct return value for unknown opcode in ublk

 - Fix deadlock with zone revalidation

 - Fix for the io priority request vs bio cleanups

 - Use the correct unsigned int type for various limit helpers

 - Fix for a race in loop

 - Cleanup blk_rq_prep_clone() to prevent uninit-value warning and make
   it easier for actual humans to read

 - Fix potential UAF when iterating tags

 - A few fixes for bfq-iosched UAF issues

 - Fix for brd discard not decrementing the allocated page count

 - Various little fixes and cleanups

* tag 'block-6.13-20242901' of git://git.kernel.dk/linux: (36 commits)
  brd: decrease the number of allocated pages which discarded
  block, bfq: fix bfqq uaf in bfq_limit_depth()
  block: Don't allow an atomic write be truncated in blkdev_write_iter()
  mq-deadline: don't call req_get_ioprio from the I/O completion handler
  block: Prevent potential deadlock in blk_revalidate_disk_zones()
  block: Remove extra part pointer NULLify in blk_rq_init()
  nvme: tuning pr code by using defined structs and macros
  nvme: introduce change ptpl and iekey definition
  block: return bool from get_disk_ro and bdev_read_only
  block: remove a duplicate definition for bdev_read_only
  block: return bool from blk_rq_aligned
  block: return unsigned int from blk_lim_dma_alignment_and_pad
  block: return unsigned int from queue_dma_alignment
  block: return unsigned int from bdev_io_opt
  block: req->bio is always set in the merge code
  block: don't bother checking the data direction for merges
  block: blk-mq: fix uninit-value in blk_rq_prep_clone and refactor
  Revert "block, bfq: merge bfq_release_process_ref() into bfq_put_cooperator()"
  md/raid10: Atomic write support
  md/raid1: Atomic write support
  ...
2024-11-30 15:47:29 -08:00
Guixin Liu
029cc98dec nvme: tuning pr code by using defined structs and macros
All the modifications are simply to make the code more readable,
and this patch does not include any functional changes.

Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-21 08:57:42 -08:00
Nilay Shroff
e9869c85c8 nvme-fabrics: fix kernel crash while shutting down controller
The nvme keep-alive operation, which executes at a periodic interval,
could potentially sneak in while shutting down a fabric controller.
This may lead to a race between the fabric controller admin queue
destroy code path (invoked while shutting down controller) and hw/hctx
queue dispatcher called from the nvme keep-alive async request queuing
operation. This race could lead to the kernel crash shown below:

Call Trace:
    autoremove_wake_function+0x0/0xbc (unreliable)
    __blk_mq_sched_dispatch_requests+0x114/0x24c
    blk_mq_sched_dispatch_requests+0x44/0x84
    blk_mq_run_hw_queue+0x140/0x220
    nvme_keep_alive_work+0xc8/0x19c [nvme_core]
    process_one_work+0x200/0x4e0
    worker_thread+0x340/0x504
    kthread+0x138/0x140
    start_kernel_thread+0x14/0x18

While shutting down fabric controller, if nvme keep-alive request sneaks
in then it would be flushed off. The nvme_keep_alive_end_io function is
then invoked to handle the end of the keep-alive operation which
decrements the admin->q_usage_counter and assuming this is the last/only
request in the admin queue then the admin->q_usage_counter becomes zero.
If that happens then blk-mq destroy queue operation (blk_mq_destroy_
queue()) which could be potentially running simultaneously on another
cpu (as this is the controller shutdown code path) would forward
progress and deletes the admin queue. So, now from this point onward
we are not supposed to access the admin queue resources. However the
issue here's that the nvme keep-alive thread running hw/hctx queue
dispatch operation hasn't yet finished its work and so it could still
potentially access the admin queue resource while the admin queue had
been already deleted and that causes the above crash.

The above kernel crash is regression caused due to changes implemented
in commit a54a93d0e3 ("nvme: move stopping keep-alive into
nvme_uninit_ctrl()"). Ideally we should stop keep-alive before destroyin
g the admin queue and freeing the admin tagset so that it wouldn't sneak
in during the shutdown operation. However we removed the keep alive stop
operation from the beginning of the controller shutdown code path in commit
a54a93d0e3 ("nvme: move stopping keep-alive into nvme_uninit_ctrl()")
and added it under nvme_uninit_ctrl() which executes very late in the
shutdown code path after the admin queue is destroyed and its tagset is
removed. So this change created the possibility of keep-alive sneaking in
and interfering with the shutdown operation and causing observed kernel
crash.

To fix the observed crash, we decided to move nvme_stop_keep_alive() from
nvme_uninit_ctrl() to nvme_remove_admin_tag_set(). This change would ensure
that we don't forward progress and delete the admin queue until the keep-
alive operation is finished (if it's in-flight) or cancelled and that would
help contain the race condition explained above and hence avoid the crash.

Moving nvme_stop_keep_alive() to nvme_remove_admin_tag_set() instead of
adding nvme_stop_keep_alive() to the beginning of the controller shutdown
code path in nvme_stop_ctrl(), as was the case earlier before commit
a54a93d0e3 ("nvme: move stopping keep-alive into nvme_uninit_ctrl()"),
would help save one callsite of nvme_stop_keep_alive().

Fixes: a54a93d0e3 ("nvme: move stopping keep-alive into nvme_uninit_ctrl()")
Link: https://lore.kernel.org/all/1a21f37b-0f2a-4745-8c56-4dc8628d3983@linux.ibm.com/
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-19 07:49:48 -08:00
Nilay Shroff
8448828216 Revert "nvme: make keep-alive synchronous operation"
This reverts commit d06923670b.

It was realized that the fix implemented to contain the race condition
among the keep alive task and the fabric shutdown code path in the commit
d06923670b5ia ("nvme: make keep-alive synchronous operation") is not
optimal. The reason being keep-alive runs under the workqueue and making
it synchronous would waste a workqueue context.
Furthermore, we later found that the above race condition is a regression
caused due to the changes implemented in commit a54a93d0e3 ("nvme: move
stopping keep-alive into nvme_uninit_ctrl()"). So we decided to revert the
commit d06923670b ("nvme: make keep-alive synchronous operation") and
then fix the regression.

Link: https://lore.kernel.org/all/196f4013-3bbf-43ff-98b4-9cb2a96c20c2@grimberg.me/
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-19 07:49:48 -08:00
Linus Torvalds
77a0cfafa9 for-6.13/block-20241118
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmc7S40QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpjHVD/43rDZ8ehs+IAAr6S0RemNX1SRG0mK2UOEb
 kMoNogS7StO/c4JYW3JuzCyLRn5ZsgeWV/muqxwDEWQrmTGrvi+V45KikrZPwm3k
 p0ump33qV9EU2jiR1MKZjtwK2P0CI7/DD3W8ww6IOvKbTT7RcqQcdHznvXArFBtc
 xCuQPpayFG7ZasC+N9VaBwtiUEVgU3Ek9AFT7UVZRWajjHPNalQwaooJWayO0rEG
 KdoW5yG0ryLrgCY2ACSvRLS+2s14EJtb8hgT08WKHTNgd5LxhSKxfsTapamua+7U
 FdVS6Ij0tEkgu2jpvgj7QKO0Uw10Cnep2gj7RHts/LVewvkliS6XcheOzqRS1jWU
 I2EI+UaGOZ11OUiw52VIveEVS5zV/NWhgy5BSP9LYEvXw0BUAHRDYGMem8o5G1V1
 SWqjIM1UWvcQDlAnMF9FDVzojvjVUmYWvcAlFFztO8J0B7SavHR3NcfHwEf57reH
 rNoUbi/9c4/wjJJF33gejiR5pU+ewy/Mk75GrtX3xpEqlztfRbf9/FbPCMEAO1KR
 DF/b3lkUV9i2/BRW6a0SpZ5RDSmSYMnateel6TrPyVSRnpiSSFO8FrbynwUOa17b
 6i49YDFWzzXOrR1YWDg6IEtTrcmBEmvi7F6aoDs020qUnL0hwLn1ZuoIxuiFEpor
 Z0iFF1B/nw==
 =PWTH
 -----END PGP SIGNATURE-----

Merge tag 'for-6.13/block-20241118' of git://git.kernel.dk/linux

Pull block updates from Jens Axboe:

 - NVMe updates via Keith:
      - Use uring_cmd helper (Pavel)
      - Host Memory Buffer allocation enhancements (Christoph)
      - Target persistent reservation support (Guixin)
      - Persistent reservation tracing (Guixen)
      - NVMe 2.1 specification support (Keith)
      - Rotational Meta Support (Matias, Wang, Keith)
      - Volatile cache detection enhancment (Guixen)

 - MD updates via Song:
      - Maintainers update
      - raid5 sync IO fix
      - Enhance handling of faulty and blocked devices
      - raid5-ppl atomic improvement
      - md-bitmap fix

 - Support for manually defining embedded partition tables

 - Zone append fixes and cleanups

 - Stop sending the queued requests in the plug list to the driver
   ->queue_rqs() handle in reverse order.

 - Zoned write plug cleanups

 - Cleanups disk stats tracking and add support for disk stats for
   passthrough IO

 - Add preparatory support for file system atomic writes

 - Add lockdep support for queue freezing. Already found a bunch of
   issues, and some fixes for that are in here. More will be coming.

 - Fix race between queue stopping/quiescing and IO queueing

 - ublk recovery improvements

 - Fix ublk mmap for 64k pages

 - Various fixes and cleanups

* tag 'for-6.13/block-20241118' of git://git.kernel.dk/linux: (118 commits)
  MAINTAINERS: Update git tree for mdraid subsystem
  block: make struct rq_list available for !CONFIG_BLOCK
  block/genhd: use seq_put_decimal_ull for diskstats decimal values
  block: don't reorder requests in blk_mq_add_to_batch
  block: don't reorder requests in blk_add_rq_to_plug
  block: add a rq_list type
  block: remove rq_list_move
  virtio_blk: reverse request order in virtio_queue_rqs
  nvme-pci: reverse request order in nvme_queue_rqs
  btrfs: validate queue limits
  block: export blk_validate_limits
  nvmet: add tracing of reservation commands
  nvme: parse reservation commands's action and rtype to string
  nvmet: report ns's vwc not present
  md/raid5: Increase r5conf.cache_name size
  block: remove the ioprio field from struct request
  block: remove the write_hint field from struct request
  nvme: check ns's volatile write cache not present
  nvme: add rotational support
  nvme: use command set independent id ns if available
  ...
2024-11-18 16:50:08 -08:00
Keith Busch
6fad84a4d6 nvme-pci: use sgls for all user requests if possible
If the device supports SGLs, use these for all user requests. This
format encodes the expected transfer length so it can catch short buffer
errors in a user command, whether it occurred accidently or maliciously.

For controllers that support SGL data mode, this is a viable mitigation
to CVE-2023-6238. For controllers that don't support SGLs, log a warning
in the passthrough path since not having the capability can corrupt
data if the interface is not used correctly.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-18 09:27:47 -08:00
Keith Busch
6399a0db8c nvme: define the remaining used sgls constants
This provides a little more context when reading the code than hardcoded
magic numbers.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-18 09:17:26 -08:00
Keith Busch
979c6342f9 nvme-pci: add support for sgl metadata
Supporting this mode allows creating and merging multi-segment metadata
requests that wouldn't be possible otherwise. It also allows directly
using user space requests that straddle physically discontiguous pages.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-18 09:17:25 -08:00
Breno Leitao
5dd18f09ce nvme/multipath: Fix RCU list traversal to use SRCU primitive
The code currently uses list_for_each_entry_rcu() while holding an SRCU
lock, triggering false positive warnings with CONFIG_PROVE_RCU=y
enabled:

	drivers/nvme/host/multipath.c:168 RCU-list traversed in non-reader section!!
	drivers/nvme/host/multipath.c:227 RCU-list traversed in non-reader section!!
	drivers/nvme/host/multipath.c:260 RCU-list traversed in non-reader section!!

While the list is properly protected by SRCU lock, the code uses the
wrong list traversal primitive. Replace list_for_each_entry_rcu() with
list_for_each_entry_srcu() to correctly indicate SRCU-based protection
and eliminate the false warning.

Signed-off-by: Breno Leitao <leitao@debian.org>
Fixes: be647e2c76 ("nvme: use srcu for iterating namespace list")
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-18 09:11:24 -08:00
Christoph Hellwig
e70c301fae block: don't reorder requests in blk_add_rq_to_plug
Add requests to the tail of the list instead of the front so that they
are queued up in submission order.

Remove the re-reordering in blk_mq_dispatch_plug_list, virtio_queue_rqs
and nvme_queue_rqs now that the list is ordered as expected.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241113152050.157179-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-13 12:04:58 -07:00
Christoph Hellwig
a3396b9999 block: add a rq_list type
Replace the semi-open coded request list helpers with a proper rq_list
type that mirrors the bio_list and has head and tail pointers.  Besides
better type safety this actually allows to insert at the tail of the
list, which will be useful soon.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241113152050.157179-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-13 12:04:58 -07:00
Christoph Hellwig
beadf00885 nvme-pci: reverse request order in nvme_queue_rqs
blk_mq_flush_plug_list submits requests in the reverse order that they
were submitted, which leads to a rather suboptimal I/O pattern especially
in rotational devices.  Fix this by rewriting nvme_queue_rqs so that it
always pops the requests from the passed in request list, and then adds
them to the head of a local submit list.  This actually simplifies the
code a bit as it removes the complicated list splicing, at the cost of
extra updates of the rq_next pointer.  As that should be cache hot
anyway it should be an easy price to pay.

Fixes: d62cbcf62f ("nvme: add support for mq_ops->queue_rqs()")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241113152050.157179-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-13 11:40:33 -07:00
Jens Axboe
15da3dd3f5 nvme updates for Linux 6.13
- Use uring_cmd helper (Pavel)
  - Host Memory Buffer allocation enhancements (Christoph)
  - Target persistent reservation support (Guixin)
  - Persistent reservation tracing (Guixen)
  - NVMe 2.1 specification support (Keith)
  - Rotational Meta Support (Matias, Wang, Keith)
  - Volatile cache detection enhancment (Guixen)
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE3Fbyvv+648XNRdHTPe3zGtjzRgkFAmc0264ACgkQPe3zGtjz
 RgmhvhAAzEVbniR/OBlJbqZ+rdwEHOj181XJIWUD72yZUVl2akikYq88JpiMCfcS
 pwdVAdFDEfvMjyIGpWXqE/G2NIYzb2qdGC0D3q5e/CgH/mxJ+5zJKKjj+6pqtWBt
 BJnoJ0YZcTnLXQWOrY6NxUOVn2LxxtvrKArCbh467GnDxWF7WJbwv+wkbPZZ78YR
 6IYRQU0La/uAvdpZ+ijHEOdieHtN3uJtu1AxxCFOK9gMpbHq92tm4Ya6bF09VDbG
 F+Ywhuu/gZkglTL5jEUvtt1Jd4VlhtGzBC2BhCFeSI54IwjhV3UFCajQeBh0zT/V
 Ca2VkFMAO1/Z3gRuK1QtEYkAf6Bwv591zpsoUEYvvlolXDL2aRKT5Jggwe/SMYYI
 ZA/3dSW/gRAV+bny2htVMK2n+hcn+VXhFaJlpZ7kSySK0b89wMlQ96BupTnmfyMD
 PdgVVaWVQ4onQcEu7/ItD9uFVe9tvTCH12MXRqlgJx4iM0w4ucpBh8QdOdHxMorD
 0bVCE4oLSbw6XJrfKmlytHJs4ZMdmNEoXzaJuBMsPDAlCvZiaihzTusIY7dWq4xi
 xNt6mQOOriONNpYRlaBrBGsinmQx6Ysz8q60RT9mLmGAwwI/nY9r1oxAd4ZknhKv
 c9clP0F20uO3se8vKUMbXOeGe8ZETD+S94hcGtHp9uxF8w6DfnQ=
 =Vi8L
 -----END PGP SIGNATURE-----

Merge tag 'nvme-6.13-2024-11-13' of git://git.infradead.org/nvme into for-6.13/block

Pull NVMe updates from Keith:

"nvme updates for Linux 6.13

 - Use uring_cmd helper (Pavel)
 - Host Memory Buffer allocation enhancements (Christoph)
 - Target persistent reservation support (Guixin)
 - Persistent reservation tracing (Guixen)
 - NVMe 2.1 specification support (Keith)
 - Rotational Meta Support (Matias, Wang, Keith)
 - Volatile cache detection enhancment (Guixen)"

* tag 'nvme-6.13-2024-11-13' of git://git.infradead.org/nvme: (22 commits)
  nvmet: add tracing of reservation commands
  nvme: parse reservation commands's action and rtype to string
  nvmet: report ns's vwc not present
  nvme: check ns's volatile write cache not present
  nvme: add rotational support
  nvme: use command set independent id ns if available
  nvmet: support for csi identify ns
  nvmet: implement rotational media information log
  nvmet: implement endurance groups
  nvmet: declare 2.1 version compliance
  nvmet: implement crto property
  nvmet: implement supported features log
  nvmet: implement supported log pages
  nvmet: implement active command set ns list
  nvmet: implement id ns for nvm command set
  nvmet: support reservation feature
  nvme: add reservation command's defines
  nvme-core: remove repeated wq flags
  nvmet: make nvmet_wq visible in sysfs
  nvme-pci: use dma_alloc_noncontigous if possible
  ...
2024-11-13 10:43:11 -07:00
Guixin Liu
50bee3857d nvmet: add tracing of reservation commands
Add tracing of reservation commands, including register, acquire,
release and report, and also parse the action and rtype to string
to make the trace log more human-readable.

Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-13 08:51:24 -08:00
Guixin Liu
8a502b5c16 nvme: parse reservation commands's action and rtype to string
Parse reservation commands's action(including rrega, racqa and rrela)
and rtype to string to make the trace log more human-readable.

Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-13 08:51:24 -08:00
Guixin Liu
609e60a3a9 nvmet: report ns's vwc not present
Currently, we report that controller has vwc even though the ns may
not have vwc. Report ns's vwc not present when not buffered_io or
backdev doesn't have vwc.

Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-13 08:51:17 -08:00
Guixin Liu
8a825d22a7 nvme: check ns's volatile write cache not present
When the VWC of a namespace does not exist, the BLK_FEAT_WRITE_CACHE
flag should not be set when registering the block device, regardless
of whether the controller supports VWC.

Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-11 09:49:50 -08:00
Wang Yugui
1d81143885 nvme: add rotational support
Rotational devices, such as hard-drives, can be detected using
the rotational bit in the namespace independent identify namespace
data structure. Make the bit visible to the block layer through the
rotational queue setting.

Signed-off-by: Wang Yugui <wangyugui@e16-tech.com>
Reviewed-by: Matias Bjørling <matias.bjorling@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-11 09:49:50 -08:00
Matias Bjørling
ee9f36db1f nvme: use command set independent id ns if available
The NVMe 2.0 specification adds an independent identify namespace
data structure that contains generic attributes that apply to all
namespace types. Some attributes carry over from the NVM command set
identify namespace data structure, and others are new.

Currently, the data structure only considered when CRIMS is enabled or
when the namespace type is key-value.

However, the independent namespace data structure is mandatory for
devices that implement features from the 2.0+ specification. Therefore,
we can check this data structure first. If unavailable, retrieve the
generic attributes from the NVM command set identify namespace data
structure.

Signed-off-by: Matias Bjørling <matias.bjorling@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-11 09:49:50 -08:00
Keith Busch
e2758c76a0 nvmet: support for csi identify ns
Implements reporting the I/O Command Set Independent Identify Namespace
command.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-11 09:49:49 -08:00
Keith Busch
5fd075cdaf nvmet: implement rotational media information log
Most of the information is stubbed. Supporting these commands is a
requirement for supporting rotational media.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-11 09:49:49 -08:00
Keith Busch
266b652c65 nvmet: implement endurance groups
Most of the returned information is just stubbed data. The target must
support these in order to report rotational media. Since this driver
doesn't know any better, each namespace is its own endurance group with
the engid value matching the nsid.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-11 09:49:49 -08:00
Keith Busch
81ee2f2811 nvmet: declare 2.1 version compliance
The target driver implements all the mandatory logs, identifications,
features, and properties up to nvme sepcification 2.1.

Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Matias Bjørling <matias.bjorling@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-11 09:49:49 -08:00
Keith Busch
1e058089d2 nvmet: implement crto property
This property is required for nvme 2.1. The target only supports ready
with media, so this is just the same value as CAP.TO.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Matias Bjørling <matias.bjorling@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-11 09:49:49 -08:00
Keith Busch
e973c91727 nvmet: implement supported features log
This log is required for nvme 2.1.

Reviewed-by: Matias Bjørling <matias.bjorling@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-11 09:49:49 -08:00
Keith Busch
83acb24e6d nvmet: implement supported log pages
This log is required for nvme 2.1.

Reviewed-by: Matias Bjørling <matias.bjorling@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-11 09:49:48 -08:00
Keith Busch
61c9967cd6 nvmet: implement active command set ns list
This is required for nvme 2.1 for targets that support multiple command
sets. We support NVM and ZNS, so are required to support this
identification.

Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Matias Bjørling <matias.bjorling@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-11 09:49:48 -08:00
Keith Busch
64a51080ea nvmet: implement id ns for nvm command set
We don't report anything here, but it's a mandatory identification for
nvme 2.1.

Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Matias Bjørling <matias.bjorling@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-11 09:49:48 -08:00
Guixin Liu
5a47c2080a nvmet: support reservation feature
This patch implements the reservation feature, including:
  1. reservation register(register, unregister and replace).
  2. reservation acquire(acquire, preempt, preempt and abort).
  3. reservation release(release and clear).
  4. reservation report.
  5. set feature and get feature of reservation notify mask.
  6. get log page of reservation event.

Not supported:
  1. persistent reservation through power loss.

Test cases:
  Use nvme-cli and fio to test all implemented sub features:
  1. use nvme resv-register to register host a registrant or
     unregister or replace a new key.
  2. use nvme resv-acquire to set host to the holder, and use fio
     to send read and write io in all reservation type. And also
     test preempt and "preempt and abort".
  3. use nvme resv-report to show all registrants and reservation
     status.
  4. use nvme resv-release to release all registrants.
  5. use nvme get-log to get events generated by the preceding
     operations.

In addition, make reservation configurable, one can set ns to
support reservation before enable ns. The default of resv_enable
is false.

Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Reviewed-by: Dmitry Bogdanov <d.bogdanov@yadro.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-11 09:49:48 -08:00
Christoph Hellwig
0b4ace9da5 nvme-multipath: don't bother clearing max_hw_zone_append_sectors
The limits stacking now properly zeroes it if at least one of the
underlying limits clears it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20241108154657.845768-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-11 09:20:36 -07:00
Christoph Hellwig
559218d43e block: pre-calculate max_zone_append_sectors
max_zone_append_sectors differs from all other queue limits in that the
final value used is not stored in the queue_limits but needs to be
obtained using queue_limits_max_zone_append_sectors helper.  This not
only adds (tiny) extra overhead to the I/O path, but also can be easily
forgotten in file system code.

Add a new max_hw_zone_append_sectors value to queue_limits which is
set by the driver, and calculate max_zone_append_sectors from that and
the other inputs in blk_validate_zoned_limits, similar to how
max_sectors is calculated to fix this.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241104073955.112324-3-hch@lst.de
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20241108154657.845768-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-11 09:20:36 -07:00
Linus Torvalds
a58f4dd952 block-6.12-20241108
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmcuxukQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpqzrEACLE/7MG7YmKL5gz8fPRXN2OFFwSJNCwuY+
 q3Y3aD3YaBn37EgwG6NsHMGfNLkyrbhEyWdfJjHbgA/mtYIRCMRQ2Yp7dINLhjq7
 giW/NQEeWzr/QJaD18GDFy5NARGL7a36/pYvU4RdPfGR0o0N8++OZ7JJ/YY3f0CO
 4Oo3uglzDdtHJnVOBM/ciknXNPAwY7oRqkBa5heuexF2YR7m1u5YOT0voEdLR6cv
 2DCgHDyDn5I4sz708ALvHEpXFHPrfK8+OcvSK34+OEFlphT1jB1udMG/jzZlj5ZR
 v5YHPITHpTUddyDI4kP00LIKavqiq5lRfFQyqh6Q8f+Xp66+15apXYitDZcfPUic
 Zq6wabOhEhJX6v18AafM4Wq9W+sTAVj3oxg1GN/edBpOFMi4fKkQHsaqfL8fiBve
 eR1kk5gLDs7DS2PYvXX5pPNUdlxyraWlXsExr19tLgeO7YdZzwZ+takWVqSnzXep
 apxI7xnenewsnJ9t1f5ttamrHxXexj63Mnbc0eJCFPJaWPktqxQbe9vKhi/20evp
 ZI36PLDOfCcchKorzuayj23RBxB8HRSh9f/JRhgGtoLVguzvzI6KTXQ/h01cYzNl
 mTot/mgNLyVO0Ij1Wr54SCFW/hP3okz8G5NBOqGv/h2n0UlW8GUD+XTbVT3hiQ9X
 nJyHwGC26Q==
 =v8nG
 -----END PGP SIGNATURE-----

Merge tag 'block-6.12-20241108' of git://git.kernel.dk/linux

Pull block fix from Jens Axboe:
 "Single fix for an issue triggered with PROVE_RCU=y, with nvme using
  the wrong iterators for an SRCU protected list"

* tag 'block-6.12-20241108' of git://git.kernel.dk/linux:
  nvme/host: Fix RCU list traversal to use SRCU primitive
2024-11-09 12:55:32 -08:00
Jens Axboe
ab9bc81c1c Revert "block: pre-calculate max_zone_append_sectors"
This causes issue on, at least, nvme-mpath where my boot fails with:

WARNING: CPU: 354 PID: 2729 at block/blk-settings.c:75 blk_validate_limits+0x356/0x380
Modules linked in: tg3(+) nvme usbcore scsi_mod ptp i2c_piix4 libphy nvme_core crc32c_intel scsi_common usb_common pps_core i2c_smbus
CPU: 354 UID: 0 PID: 2729 Comm: kworker/u2061:1 Not tainted 6.12.0-rc6+ #181
Hardware name: Dell Inc. PowerEdge R7625/06444F, BIOS 1.8.3 04/02/2024
Workqueue: async async_run_entry_fn
RIP: 0010:blk_validate_limits+0x356/0x380
Code: f6 47 01 04 75 28 83 bf 94 00 00 00 00 75 39 83 bf 98 00 00 00 00 75 34 83 7f 68 00 75 32 31 c0 83 7f 5c 00 0f 84 9b fd ff ff <0f> 0b eb 13 0f 0b eb 0f 48 c7 c0 74 12 58 92 48 89 c7 e8 13 76 46
RSP: 0018:ffffa8a1dfb93b30 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff9232829c8388 RCX: 0000000000000088
RDX: 0000000000000080 RSI: 0000000000000200 RDI: ffffa8a1dfb93c38
RBP: 000000000000000c R08: 00000000ffffffff R09: 000000000000ffff
R10: 0000000000000000 R11: 0000000000000000 R12: ffff9232829b9000
R13: ffff9232829b9010 R14: ffffa8a1dfb93c38 R15: ffffa8a1dfb93c38
FS:  0000000000000000(0000) GS:ffff923867c80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055c1b92480a8 CR3: 0000002484ff0002 CR4: 0000000000370ef0
Call Trace:
 <TASK>
 ? __warn+0xca/0x1a0
 ? blk_validate_limits+0x356/0x380
 ? report_bug+0x11a/0x1a0
 ? handle_bug+0x5e/0x90
 ? exc_invalid_op+0x16/0x40
 ? asm_exc_invalid_op+0x16/0x20
 ? blk_validate_limits+0x356/0x380
 blk_alloc_queue+0x7a/0x250
 __blk_alloc_disk+0x39/0x80
 nvme_mpath_alloc_disk+0x13d/0x1b0 [nvme_core]
 nvme_scan_ns+0xcc7/0x1010 [nvme_core]
 async_run_entry_fn+0x27/0x120
 process_scheduled_works+0x1a0/0x360
 worker_thread+0x2bc/0x350
 ? pr_cont_work+0x1b0/0x1b0
 kthread+0x111/0x120
 ? kthread_unuse_mm+0x90/0x90
 ret_from_fork+0x30/0x40
 ? kthread_unuse_mm+0x90/0x90
 ret_from_fork_asm+0x11/0x20
 </TASK>
---[ end trace 0000000000000000 ]---

presumably due to max_zone_append_sectors not being cleared to zero,
resulting in blk_validate_zoned_limits() complaining and failing.

This reverts commit 2a8f6153e1.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-07 05:45:34 -07:00
Chaitanya Kulkarni
43d5d3b417 nvme-core: remove repeated wq flags
In nvme_core_init() nvme_wq, nvme_reset_wq, nvme_delete_wq share same
flags :- WQ_UNBOUND | WQ_MEM_RECLAIM | WQ_SYSFS.

Insated of repeating these flags in each call use the common variable.

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-05 08:43:24 -08:00
Guixin Liu
c74649b6e4 nvmet: make nvmet_wq visible in sysfs
In some complex scenarios, we deploy multiple tasks on a single machine
(hybrid deployment), such as Docker containers for function computation
(background processing), real-time tasks, monitoring, event handling,
and management, along with an NVMe target server.

Each of these components is restricted to its own CPU cores to prevent
mutual interference and ensure strict isolation. To achieve this level
of isolation for nvmet_wq we need to  use sysfs tunables such as
cpumask that are currently not accessible.

Add WQ_SYSFS flag to alloc_workqueue() when creating nvmet_wq so
workqueue tunables are exported in the userspace via sysfs.

with this patch :-

  nvme (nvme-6.13) # ls /sys/devices/virtual/workqueue/nvmet-wq/
  affinity_scope  affinity_strict  cpumask  max_active  nice per_cpu
  power  subsystem  uevent

Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-05 08:36:18 -08:00
Christoph Hellwig
63a5c7a4b4 nvme-pci: use dma_alloc_noncontigous if possible
Use dma_alloc_noncontigous to allocate a single IOVA-contigous segment
when backed by an IOMMU.  This allow to easily use bigger segments and
avoids running into segment limits if we can avoid it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-05 07:54:34 -08:00
Christoph Hellwig
3c2fb1ca80 nvme-pci: fix freeing of the HMB descriptor table
The HMB descriptor table is sized to the maximum number of descriptors
that could be used for a given device, but __nvme_alloc_host_mem could
break out of the loop earlier on memory allocation failure and end up
using less descriptors than planned for, which leads to an incorrect
size passed to dma_free_coherent.

In practice this was not showing up because the number of descriptors
tends to be low and the dma coherent allocator always allocates and
frees at least a page.

Fixes: 87ad72a59a ("nvme-pci: implement host memory buffer support")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-05 07:54:34 -08:00
Breno Leitao
6d1c69945c nvme/host: Fix RCU list traversal to use SRCU primitive
The code currently uses list_for_each_entry_rcu() while holding an SRCU
lock, triggering false positive warnings with CONFIG_PROVE_RCU=y
enabled:

  drivers/nvme/host/core.c:3770 RCU-list traversed in non-reader section!!

While the list is properly protected by SRCU lock, the code uses the wrong
list traversal primitive. Replace list_for_each_entry_rcu() with
list_for_each_entry_srcu() to correctly indicate SRCU-based protection
and eliminate the false warning.

Fixes: be647e2c76 ("nvme: use srcu for iterating namespace list")
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-04 13:25:41 -08:00
Christoph Hellwig
2a8f6153e1 block: pre-calculate max_zone_append_sectors
max_zone_append_sectors differs from all other queue limits in that the
final value used is not stored in the queue_limits but needs to be
obtained using queue_limits_max_zone_append_sectors helper.  This not
only adds (tiny) extra overhead to the I/O path, but also can be easily
forgotten in file system code.

Add a new max_hw_zone_append_sectors value to queue_limits which is
set by the driver, and calculate max_zone_append_sectors from that and
the other inputs in blk_validate_zoned_limits, similar to how
max_sectors is calculated to fix this.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241104073955.112324-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-04 10:34:07 -07:00
Linus Torvalds
f4a1e8e369 block-6.12-20241101
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmclGQUQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpkKsEADWz2dU1Edfbg5lez9Kf2SZwLKUbh9VVBQY
 kuIeGI/lQ1eeIdIgEJtCmx6ePlbk9D5WYu7lJvXfODAh5gUBLMpI4t6p4L1TkU2o
 yZ9UDoy5SK8o0UAOZhrHw4fE70EdI31Olm2mz3ToIevmpKOabNr/joVEfpEYaoPV
 t9FIGH1DvUZQSWUgw6G2lxWA7gTwfzvAfOegXF0VlFyhCEXjLXwJSUq+lWaShL33
 oDxETRYGaoBVfNVLIaZ6LFVUrvmgzFQ1Q6q1iY+DAUgMh9NXV2mzQ+/17HZjWaeU
 jsEeeOMnvtxhvQQpRyEvGA93Fh+gDkx/wljB4EmK2oqETs+oDVzts+L+YJk2ef7K
 n71dhuj04BRfduu6DFNg6ufAcMJOTgd0odgN9h9hPa0z3y0sx9hOSW6XZfepABl1
 +XDDyI4p/lVTMH/vYlS2ay2RUo7KrxfX26qSYbz6UItzo6dMl/3YjtKbzYpP0mZM
 4+Yu5YxDXirbnxpk9uhzOA4CdwiXXWxLIVIX1KeUWkJZ5wzmH2rFqmpzH8OL91uH
 J4SIMLgMCm4e45nDOLH5ZSLIc1MO2df0sJ1gZZD6MoT0xbxULyIQuj19sBszeHwH
 YgV4sQlPead4NT2zoSDeXSpeIWmv5dd15To7wVpeInyj/DHyXSHzycHYSehgdmv1
 FBQ+LMffyA==
 =e3uK
 -----END PGP SIGNATURE-----

Merge tag 'block-6.12-20241101' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:

 - Fixup for a recent blk_rq_map_user_bvec() patch

 - NVMe pull request via Keith:
     - Spec compliant identification fix (Keith)
     - Module parameter to enable backward compatibility on unusual
       namespace formats (Keith)
     - Target double free fix when using keys (Vitaliy)
     - Passthrough command error handling fix (Keith)

* tag 'block-6.12-20241101' of git://git.kernel.dk/linux:
  nvme: re-fix error-handling for io_uring nvme-passthrough
  nvmet-auth: assign dh_key to NULL after kfree_sensitive
  nvme: module parameter to disable pi with offsets
  block: fix queue limits checks in blk_rq_map_user_bvec for real
  nvme: enhance cns version checking
2024-11-01 13:41:55 -10:00
Christoph Hellwig
f187b9bf1a block: remove bio_add_zone_append_page
This is only used by the nvmet zns passthrough code, which can trivially
just use bio_add_pc_page and do the sanity check for the max zone append
limit itself.

All future zoned file systems should follow the btrfs lead and let the
upper layers fill up bios unlimited by hardware constraints and split
them to the limits in the I/O submission handler.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20241030051859.280923-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-31 10:54:25 -06:00
Keith Busch
5eed4fb274 nvme: re-fix error-handling for io_uring nvme-passthrough
This was previously fixed with commit 1147dd0503
("nvme: fix error-handling for io_uring nvme-passthrough"), but the
change was mistakenly undone in a later commit.

Fixes: d6aacee925 ("nvme: use bio_integrity_map_user")
Cc: stable@vger.kernel.org
Reported-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-10-30 07:19:18 -07:00
Vitaliy Shevtsov
d2f551b1f7 nvmet-auth: assign dh_key to NULL after kfree_sensitive
ctrl->dh_key might be used across multiple calls to nvmet_setup_dhgroup()
for the same controller. So it's better to nullify it after release on
error path in order to avoid double free later in nvmet_destroy_auth().

Found by Linux Verification Center (linuxtesting.org) with Svace.

Fixes: 7a277c37d3 ("nvmet-auth: Diffie-Hellman key exchange support")
Cc: stable@vger.kernel.org
Signed-off-by: Vitaliy Shevtsov <v.shevtsov@maxima.ru>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-10-30 07:19:18 -07:00
Keith Busch
42ab37eaad nvme: module parameter to disable pi with offsets
A recent commit enables integrity checks for formats the previous kernel
versions registered with the "nop" integrity profile. This means
namespaces using that format become unreadable when upgrading the kernel
past that commit.

Introduce a module parameter to restore the "nop" integrity profile so
that storage can be readable once again. This could be a boot device, so
the setting needs to happen at module load time.

Fixes: 921e81db52 ("nvme: allow integrity when PI is not in first bytes")
Reported-by: David Wei <dw@davidwei.uk>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-10-30 07:19:18 -07:00
Keith Busch
133008e84b blk-integrity: remove seed for user mapped buffers
The seed is only used for kernel generation and verification. That
doesn't happen for user buffers, so passing the seed around doesn't
accomplish anything.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20241016201309.1090320-1-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-30 07:49:32 -06:00
Ming Lei
6b6f6c41c8 nvme: core: switch to non_owner variant of start_freeze/unfreeze queue
nvme_start_freeze() and nvme_unfreeze() may be called from same context,
so switch them to call non_owner variant of start_freeze/unfreeze queue.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241025003722.3630252-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-26 07:14:53 -06:00
Keith Busch
f54f0d0e2b nvme: enhance cns version checking
The number of CNS bits in the command is specific to the nvme spec
version compliance. The existing check is not sufficient for possible
CNS values the driver uses that may create confusion between host and
device, so enhance the check to consider the version and desired CNS
value.

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-10-22 09:11:03 -07:00
Pavel Begunkov
5e52f71f85 nvme: use helpers to access io_uring cmd space
Command implementations shouldn't be directly looking into io_uring_cmd
to carve free space. Use an io_uring helper, which will also do build
time size sanitisation.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-10-22 07:28:36 -07:00
Linus Torvalds
f8eacd8ad7 block-6.12-20241018
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmcSk4AQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpuuXD/0UERdP+djJNoXBW5Mv7U5a4rJ7ZgfPL7ku
 z3ZfdnNYGitZhYkVjNQ60TLzXRQyUaIIxMVBWzkb59I6ixmuQbzm/lC55B6s/FIR
 bfT3afe1WRgLCaFbStu91qRs/44Mq4yK6wXcIU7LwutRT/5cqZwelqRZLK7DMFln
 zlGX4zNrCMRUDTr6PLa6CvyY4dmQSL17Ib1ypcKXjGs5YjDntzSrIsKVT1Wayans
 WroGGPG6W7r2c2kn8pe4uPIjZVfMUF2vrdIs0KEYaAQOC7ppEucCgDZMEWRs7kdH
 63hheudJjVSwLF/qYnXNHe/Bz12QCZohPp6UsqRpC8o96Ralgo6Q+FxkXsVelMXW
 JKhtDqYGBDHOQrjrEWN1rnYw/DauEQAgvOtdVfEx2IBzPsG07cB8yv8MNA90H9QH
 KStI7h9qnBEMMNcXX8prOymCHNWAeuF4mbitVrRfSfEVm/0BbQ19qoyGrvwNFgEf
 6T+4Xj/P+FsiLVe8vsgBZDaxEEU5Ifd/rki/QFVk/2z72BBZxmdf2nm51SOM28V7
 HGMHwJI3H8rdmPXvt5Q/ve6GWNOYLO5PSAJgSSe96UStvtsAHGB4eM+LykdnE7cI
 SoytU5KfAM8DD6wnyHIgYuvJyZWrmLoVDrRjym8emc2KrJOe7qg+Ah4ERcNTCnhl
 nw50f27G4w==
 =waNY
 -----END PGP SIGNATURE-----

Merge tag 'block-6.12-20241018' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:

 - NVMe pull request via Keith:
     - Fix target passthrough identifier (Nilay)
     - Fix tcp locking (Hannes)
     - Replace list with sbitmap for tracking RDMA rsp tags (Guixen)
     - Remove unnecessary fallthrough statements (Tokunori)
     - Remove ready-without-media support (Greg)
     - Fix multipath partition scan deadlock (Keith)
     - Fix concurrent PCI reset and remove queue mapping (Maurizio)
     - Fabrics shutdown fixes (Nilay)

 - Fix for a kerneldoc warning (Keith)

 - Fix a race with blk-rq-qos and wakeups (Omar)

 - Cleanup of checking for always-set tag_set (SurajSonawane2415)

 - Fix for a crash with CPU hotplug notifiers (Ming)

 - Don't allow zero-copy ublk on unprivileged device (Ming)

 - Use array_index_nospec() for CDROM (Josh)

 - Remove dead code in drbd (David)

 - Tweaks to elevator loading (Breno)

* tag 'block-6.12-20241018' of git://git.kernel.dk/linux:
  cdrom: Avoid barrier_nospec() in cdrom_ioctl_media_changed()
  nvme: use helper nvme_ctrl_state in nvme_keep_alive_finish function
  nvme: make keep-alive synchronous operation
  nvme-loop: flush off pending I/O while shutting down loop controller
  nvme-pci: fix race condition between reset and nvme_dev_disable()
  ublk: don't allow user copy for unprivileged device
  blk-rq-qos: fix crash on rq_qos_wait vs. rq_qos_wake_function race
  nvme-multipath: defer partition scanning
  blk-mq: setup queue ->tag_set before initializing hctx
  elevator: Remove argument from elevator_find_get
  elevator: do not request_module if elevator exists
  drbd: Remove unused conn_lowest_minor
  nvme: disable CC.CRIME (NVME_CC_CRIME)
  nvme: delete unnecessary fallthru comment
  nvmet-rdma: use sbitmap to replace rsp free list
  block: Fix elevator_get_default() checking for NULL q->tag_set
  nvme: tcp: avoid race between queue_lock lock and destroy
  nvmet-passthru: clear EUID/NGUID/UUID while using loop target
  block: fix blk_rq_map_integrity_sg kernel-doc
2024-10-18 15:53:00 -07:00
Nilay Shroff
599d9f3a10 nvme: use helper nvme_ctrl_state in nvme_keep_alive_finish function
We no more need acquiring ctrl->lock before accessing the
NVMe controller state and instead we can now use the helper
nvme_ctrl_state. So replace the use of ctrl->lock from
nvme_keep_alive_finish function with nvme_ctrl_state call.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-10-17 11:07:37 -07:00
Nilay Shroff
d06923670b nvme: make keep-alive synchronous operation
The nvme keep-alive operation, which executes at a periodic interval,
could potentially sneak in while shutting down a fabric controller.
This may lead to a race between the fabric controller admin queue
destroy code path (invoked while shutting down controller) and hw/hctx
queue dispatcher called from the nvme keep-alive async request queuing
operation. This race could lead to the kernel crash shown below:

Call Trace:
    autoremove_wake_function+0x0/0xbc (unreliable)
    __blk_mq_sched_dispatch_requests+0x114/0x24c
    blk_mq_sched_dispatch_requests+0x44/0x84
    blk_mq_run_hw_queue+0x140/0x220
    nvme_keep_alive_work+0xc8/0x19c [nvme_core]
    process_one_work+0x200/0x4e0
    worker_thread+0x340/0x504
    kthread+0x138/0x140
    start_kernel_thread+0x14/0x18

While shutting down fabric controller, if nvme keep-alive request sneaks
in then it would be flushed off. The nvme_keep_alive_end_io function is
then invoked to handle the end of the keep-alive operation which
decrements the admin->q_usage_counter and assuming this is the last/only
request in the admin queue then the admin->q_usage_counter becomes zero.
If that happens then blk-mq destroy queue operation (blk_mq_destroy_
queue()) which could be potentially running simultaneously on another
cpu (as this is the controller shutdown code path) would forward
progress and deletes the admin queue. So, now from this point onward
we are not supposed to access the admin queue resources. However the
issue here's that the nvme keep-alive thread running hw/hctx queue
dispatch operation hasn't yet finished its work and so it could still
potentially access the admin queue resource while the admin queue had
been already deleted and that causes the above crash.

This fix helps avoid the observed crash by implementing keep-alive as a
synchronous operation so that we decrement admin->q_usage_counter only
after keep-alive command finished its execution and returns the command
status back up to its caller (blk_execute_rq()). This would ensure that
fabric shutdown code path doesn't destroy the fabric admin queue until
keep-alive request finished execution and also keep-alive thread is not
running hw/hctx queue dispatch operation.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-10-17 11:07:37 -07:00
Nilay Shroff
c199fac88f nvme-loop: flush off pending I/O while shutting down loop controller
While shutting down loop controller, we first quiesce the admin/IO queue,
delete the admin/IO tag-set and then at last destroy the admin/IO queue.
However it's quite possible that during the window between quiescing and
destroying of the admin/IO queue, some admin/IO request might sneak in
and if that happens then we could potentially encounter a hung task
because shutdown operation can't forward progress until any pending I/O
is flushed off.

This commit helps ensure that before destroying the admin/IO queue, we
unquiesce the admin/IO queue so that any outstanding requests, which are
added after the admin/IO queue is quiesced, are now flushed to its
completion.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-10-17 11:07:37 -07:00
Maurizio Lombardi
26bc0a81f6 nvme-pci: fix race condition between reset and nvme_dev_disable()
nvme_dev_disable() modifies the dev->online_queues field, therefore
nvme_pci_update_nr_queues() should avoid racing against it, otherwise
we could end up passing invalid values to blk_mq_update_nr_hw_queues().

 WARNING: CPU: 39 PID: 61303 at drivers/pci/msi/api.c:347
          pci_irq_get_affinity+0x187/0x210
 Workqueue: nvme-reset-wq nvme_reset_work [nvme]
 RIP: 0010:pci_irq_get_affinity+0x187/0x210
 Call Trace:
  <TASK>
  ? blk_mq_pci_map_queues+0x87/0x3c0
  ? pci_irq_get_affinity+0x187/0x210
  blk_mq_pci_map_queues+0x87/0x3c0
  nvme_pci_map_queues+0x189/0x460 [nvme]
  blk_mq_update_nr_hw_queues+0x2a/0x40
  nvme_reset_work+0x1be/0x2a0 [nvme]

Fix the bug by locking the shutdown_lock mutex before using
dev->online_queues. Give up if nvme_dev_disable() is running or if
it has been executed already.

Fixes: 949928c1c7 ("NVMe: Fix possible queue use after freed")
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-10-17 09:55:22 -07:00
Keith Busch
1f021341ee nvme-multipath: defer partition scanning
We need to suppress the partition scan from occuring within the
controller's scan_work context. If a path error occurs here, the IO will
wait until a path becomes available or all paths are torn down, but that
action also occurs within scan_work, so it would deadlock. Defer the
partion scan to a different context that does not block scan_work.

Reported-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-10-15 08:32:07 -07:00
Greg Joyce
0ce96a6708 nvme: disable CC.CRIME (NVME_CC_CRIME)
Disable NVME_CC_CRIME so that CSTS.RDY indicates that the media
is ready and able to handle commands without returning
NVME_SC_ADMIN_COMMAND_MEDIA_NOT_READY.

Signed-off-by: Greg Joyce <gjoyce@linux.ibm.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Tested-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-10-09 14:45:19 -07:00
Tokunori Ikegami
9c7072df53 nvme: delete unnecessary fallthru comment
Signed-off-by: Tokunori Ikegami <ikegami.t@gmail.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-10-08 13:50:44 -07:00
Guixin Liu
40f0e5dc2f nvmet-rdma: use sbitmap to replace rsp free list
We can use sbitmap to manage all the nvmet_rdma_rsp instead of using
free lists and spinlock, and we can use an additional tag to
determine whether the nvmet_rdma_rsp is extra allocated.

In addition, performance has improved:
1. testing environment is local rxe rdma devie and mem-based
backstore device.
2. fio command, test the average 5 times:
fio -filename=/dev/nvme0n1 --ioengine=libaio -direct=1
-size=1G -name=1 -thread -runtime=60 -time_based -rw=read -numjobs=16
-iodepth=128 -bs=4k -group_reporting
3. Before: 241k IOPS, After: 256k IOPS, an increase of about 5%.

Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
2024-10-08 13:45:36 -07:00
Hannes Reinecke
782373ba27 nvme: tcp: avoid race between queue_lock lock and destroy
Commit 76d54bf20c ("nvme-tcp: don't access released socket during
error recovery") added a mutex_lock() call for the queue->queue_lock
in nvme_tcp_get_address(). However, the mutex_lock() races with
mutex_destroy() in nvme_tcp_free_queue(), and causes the WARN below.

DEBUG_LOCKS_WARN_ON(lock->magic != lock)
WARNING: CPU: 3 PID: 34077 at kernel/locking/mutex.c:587 __mutex_lock+0xcf0/0x1220
Modules linked in: nvmet_tcp nvmet nvme_tcp nvme_fabrics iw_cm ib_cm ib_core pktcdvd nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables qrtr sunrpc ppdev 9pnet_virtio 9pnet pcspkr netfs parport_pc parport e1000 i2c_piix4 i2c_smbus loop fuse nfnetlink zram bochs drm_vram_helper drm_ttm_helper ttm drm_kms_helper xfs drm sym53c8xx floppy nvme scsi_transport_spi nvme_core nvme_auth serio_raw ata_generic pata_acpi dm_multipath qemu_fw_cfg [last unloaded: ib_uverbs]
CPU: 3 UID: 0 PID: 34077 Comm: udisksd Not tainted 6.11.0-rc7 #319
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014
RIP: 0010:__mutex_lock+0xcf0/0x1220
Code: 08 84 d2 0f 85 c8 04 00 00 8b 15 ef b6 c8 01 85 d2 0f 85 78 f4 ff ff 48 c7 c6 20 93 ee af 48 c7 c7 60 91 ee af e8 f0 a7 6d fd <0f> 0b e9 5e f4 ff ff 48 b8 00 00 00 00 00 fc ff df 4c 89 f2 48 c1
RSP: 0018:ffff88811305f760 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff88812c652058 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000001
RBP: ffff88811305f8b0 R08: 0000000000000001 R09: ffffed1075c36341
R10: ffff8883ae1b1a0b R11: 0000000000010498 R12: 0000000000000000
R13: 0000000000000000 R14: dffffc0000000000 R15: ffff88812c652058
FS:  00007f9713ae4980(0000) GS:ffff8883ae180000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fcd78483c7c CR3: 0000000122c38000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 ? __warn.cold+0x5b/0x1af
 ? __mutex_lock+0xcf0/0x1220
 ? report_bug+0x1ec/0x390
 ? handle_bug+0x3c/0x80
 ? exc_invalid_op+0x13/0x40
 ? asm_exc_invalid_op+0x16/0x20
 ? __mutex_lock+0xcf0/0x1220
 ? nvme_tcp_get_address+0xc2/0x1e0 [nvme_tcp]
 ? __pfx___mutex_lock+0x10/0x10
 ? __lock_acquire+0xd6a/0x59e0
 ? nvme_tcp_get_address+0xc2/0x1e0 [nvme_tcp]
 nvme_tcp_get_address+0xc2/0x1e0 [nvme_tcp]
 ? __pfx_nvme_tcp_get_address+0x10/0x10 [nvme_tcp]
 nvme_sysfs_show_address+0x81/0xc0 [nvme_core]
 dev_attr_show+0x42/0x80
 ? __asan_memset+0x1f/0x40
 sysfs_kf_seq_show+0x1f0/0x370
 seq_read_iter+0x2cb/0x1130
 ? rw_verify_area+0x3b1/0x590
 ? __mutex_lock+0x433/0x1220
 vfs_read+0x6a6/0xa20
 ? lockdep_hardirqs_on+0x78/0x100
 ? __pfx_vfs_read+0x10/0x10
 ksys_read+0xf7/0x1d0
 ? __pfx_ksys_read+0x10/0x10
 ? __x64_sys_openat+0x105/0x1d0
 do_syscall_64+0x93/0x180
 ? lockdep_hardirqs_on_prepare+0x16d/0x400
 ? do_syscall_64+0x9f/0x180
 ? lockdep_hardirqs_on+0x78/0x100
 ? do_syscall_64+0x9f/0x180
 ? __pfx_ksys_read+0x10/0x10
 ? lockdep_hardirqs_on_prepare+0x16d/0x400
 ? do_syscall_64+0x9f/0x180
 ? lockdep_hardirqs_on+0x78/0x100
 ? do_syscall_64+0x9f/0x180
 ? lockdep_hardirqs_on_prepare+0x16d/0x400
 ? do_syscall_64+0x9f/0x180
 ? lockdep_hardirqs_on+0x78/0x100
 ? do_syscall_64+0x9f/0x180
 ? lockdep_hardirqs_on_prepare+0x16d/0x400
 ? do_syscall_64+0x9f/0x180
 ? lockdep_hardirqs_on+0x78/0x100
 ? do_syscall_64+0x9f/0x180
 ? lockdep_hardirqs_on_prepare+0x16d/0x400
 ? do_syscall_64+0x9f/0x180
 ? lockdep_hardirqs_on+0x78/0x100
 ? do_syscall_64+0x9f/0x180
 ? do_syscall_64+0x9f/0x180
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f9713f55cfa
Code: 55 48 89 e5 48 83 ec 20 48 89 55 e8 48 89 75 f0 89 7d f8 e8 e8 74 f8 ff 48 8b 55 e8 48 8b 75 f0 41 89 c0 8b 7d f8 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 2e 44 89 c7 48 89 45 f8 e8 42 75 f8 ff 48 8b
RSP: 002b:00007ffd7f512e70 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
RAX: ffffffffffffffda RBX: 000055c38f316859 RCX: 00007f9713f55cfa
RDX: 0000000000000fff RSI: 00007ffd7f512eb0 RDI: 0000000000000011
RBP: 00007ffd7f512e90 R08: 0000000000000000 R09: 00000000ffffffff
R10: 0000000000000000 R11: 0000000000000246 R12: 000055c38f317148
R13: 0000000000000000 R14: 00007f96f4004f30 R15: 000055c3b6b623c0
 </TASK>

The WARN is observed when the blktests test case nvme/014 is repeated
with tcp transport. It is rare, and 200 times repeat is required to
recreate in some test environments.

To avoid the WARN, check the NVME_TCP_Q_LIVE flag before locking
queue->queue_lock. The flag is cleared long time before the lock gets
destroyed.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-10-03 12:14:35 -07:00
Al Viro
5f60d5f6bb move asm/unaligned.h to linux/unaligned.h
asm/unaligned.h is always an include of asm-generic/unaligned.h;
might as well move that thing to linux/unaligned.h and include
that - there's nothing arch-specific in that header.

auto-generated by the following:

for i in `git grep -l -w asm/unaligned.h`; do
	sed -i -e "s/asm\/unaligned.h/linux\/unaligned.h/" $i
done
for i in `git grep -l -w asm-generic/unaligned.h`; do
	sed -i -e "s/asm-generic\/unaligned.h/linux\/unaligned.h/" $i
done
git mv include/asm-generic/unaligned.h include/linux/unaligned.h
git mv tools/include/asm-generic/unaligned.h tools/include/linux/unaligned.h
sed -i -e "/unaligned.h/d" include/asm-generic/Kbuild
sed -i -e "s/__ASM_GENERIC/__LINUX/" include/linux/unaligned.h tools/include/linux/unaligned.h
2024-10-02 17:23:23 -04:00
Nilay Shroff
e38dad438f nvmet-passthru: clear EUID/NGUID/UUID while using loop target
When nvme passthru is configured using loop target, the clear_ids
attribute is, by default, set to true. This attribute would ensure that
EUID/NGUID/UUID is cleared for the loop passthru target.

The newer NVMe disk supporting the NVMe spec 1.3 or higher, typically,
implements the support for "Namespace Identification Descriptor list"
command. This command when issued from host returns EUID/NGUID/UUID
assigned to the inquired namespace. Not clearing these values, while
using nvme passthru using loop target, would result in NVMe host driver
rejecting the namespace. This check was implemented in the commit
2079f41ec6 ("nvme: check that EUI/GUID/UUID are globally unique").

The fix implemented in this commit ensure that when host issues ns-id
descriptor list command, the EUID/NGUID/UUID are cleared by passthru
target. In fact, the function nvmet_passthru_override_id_descs() which
clears those unique ids already exits, so we just need to ensure that
ns-id descriptor list command falls through the corretc code path. And
while we're at it, we also combines the three passthru admin command
cases together which shares the same code.

Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-10-01 11:08:40 -07:00
Linus Torvalds
11a299a793 for-6.12/block-20240925
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmb0T5AQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpnfHEADCXqmqZC+xr3sHZH9T1lz9KaFp1FjuBhCw
 bGpUgXQ9aLcqQUWJxmYVer8N2x2+Ds+xq4fm/rP1BfvNgRupqheHBwuLxSrz14EX
 lYmKZ+krMIPTDaLFewmEWflDwmZX0WFgV6nKTMLiO5BMeI4zXCkFGtwYFys2+Cdd
 9zYCFPgGDZUR77Ws5PpyqPVz2MoiNtsjrGmHpEmNZ+rIDzlpVOYgYk27X9ZbvNxC
 /l0KTc9+ayAeG0Kx5jO+m6Hrj3I6ehvM9JZMgpS/tF/jtccD2oVkJFJDlU+Jciv6
 BwVzgyDPGV7sXFT1fnSqDBYYwr/73nzNH0Gk8wn4Jg2LhjmVANVo9eQSOXDTYZI+
 O4HfIHGTIrk75TQd4bhq3dqaylS78pKBI/eQJUli2UNoyLWMrMyE88yh2YJam2Fs
 vJ/MHGxvFRurYbAlqLr33nb3ajvpg+D7XuAYfqHPMc2ZUe28Kza50Dj+luNjfVCu
 3qfR6qBlsdWuABtUS3vneB9jZp5jDnOpVfuBgtcAqIboUjehTXsI7If09Ex/mxLq
 O0KqNwBMfunPOKd5kGXlAgY8LRMfOhNaAAFBlXYUZB2eAadQnqVselTFvHMZkXo7
 wH/l6trd+/Tf+7Rav0YduNIlpVr7IctC+A7ph4zPdIjQxFEySCrC7cvAjel29LyV
 zgWW0Mw/sA==
 =yiWu
 -----END PGP SIGNATURE-----

Merge tag 'for-6.12/block-20240925' of git://git.kernel.dk/linux

Pull more block updates from Jens Axboe:

 - Improve blk-integrity segment counting and merging (Keith)

 - NVMe pull request via Keith:
      - Multipath fixes (Hannes)
      - Sysfs attribute list NULL terminate fix (Shin'ichiro)
      - Remove problematic read-back (Keith)

 - Fix for a regression with the IO scheduler switching freezing from
   6.11 (Damien)

 - Use a raw spinlock for sbitmap, as it may get called from preempt
   disabled context (Ming)

 - Cleanup for bd_claiming waiting, using var_waitqueue() rather than
   the bit waitqueues, as that more accurately describes that it does
   (Neil)

 - Various cleanups (Kanchan, Qiu-ji, David)

* tag 'for-6.12/block-20240925' of git://git.kernel.dk/linux:
  nvme: remove CC register read-back during enabling
  nvme: null terminate nvme_tls_attrs
  nvme-multipath: avoid hang on inaccessible namespaces
  nvme-multipath: system fails to create generic nvme device
  lib/sbitmap: define swap_lock as raw_spinlock_t
  block: Remove unused blk_limits_io_{min,opt}
  drbd: Fix atomicity violation in drbd_uuid_set_bm()
  block: Fix elv_iosched_local_module handling of "none" scheduler
  block: remove bogus union
  block: change wait on bd_claiming to use a var_waitqueue
  blk-integrity: improved sg segment mapping
  block: unexport blk_rq_count_integrity_sg
  nvme-rdma: use request to get integrity segments
  scsi: use request to get integrity segments
  block: provide a request helper for user integrity segments
  blk-integrity: consider entire bio list for merging
  blk-integrity: properly account for segments
  blk-mq: set the nr_integrity_segments from bio
  blk-mq: unconditional nr_integrity_segments
2024-09-25 14:56:40 -07:00
Keith Busch
9064610348 nvme: remove CC register read-back during enabling
Any non-posted read should flush the previous write, so we don't
necessarily need to read back the value we just wrote. I've found at
least some controllers that respond with 0 for short moments after
writing the CC register with EN (enable) cleared, so the read-back is
overwriting our valid ctrl_config value and ends up breaking on the
subsequent enabling.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-09-24 23:35:10 -07:00
Shin'ichiro Kawasaki
83340d9c61 nvme: null terminate nvme_tls_attrs
Commit 1e48b34c9b ("nvme: split off TLS sysfs attributes into a
separate group") introduced the struct attribute array nvme_tls_attrs.
However, the array was not null terminated and caused BUG KASAN global-
out-of-bounds. To avoid the BUG, null terminate the array.

Reported-by: Yi Zhang <yi.zhang@redhat.com>
Closes: https://lore.kernel.org/linux-nvme/jhllwfxcedrcxcnbajwl4x2l2ujcqowqcd4ps574zrafrqhjna@f4icvecutekm/
Fixes: 1e48b34c9b ("nvme: split off TLS sysfs attributes into a separate group")
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-09-24 23:34:13 -07:00
Hannes Reinecke
3b97f5a05c nvme-multipath: avoid hang on inaccessible namespaces
During repetitive namespace remapping operations on the target the
namespace might have changed between the time the initial scan
was performed, and partition scan was invoked by device_add_disk()
in nvme_mpath_set_live(). We then end up with a stuck scanning process:

[<0>] folio_wait_bit_common+0x12a/0x310
[<0>] filemap_read_folio+0x97/0xd0
[<0>] do_read_cache_folio+0x108/0x390
[<0>] read_part_sector+0x31/0xa0
[<0>] read_lba+0xc5/0x160
[<0>] efi_partition+0xd9/0x8f0
[<0>] bdev_disk_changed+0x23d/0x6d0
[<0>] blkdev_get_whole+0x78/0xc0
[<0>] bdev_open+0x2c6/0x3b0
[<0>] bdev_file_open_by_dev+0xcb/0x120
[<0>] disk_scan_partitions+0x5d/0x100
[<0>] device_add_disk+0x402/0x420
[<0>] nvme_mpath_set_live+0x4f/0x1f0 [nvme_core]
[<0>] nvme_mpath_add_disk+0x107/0x120 [nvme_core]
[<0>] nvme_alloc_ns+0xac6/0xe60 [nvme_core]
[<0>] nvme_scan_ns+0x2dd/0x3e0 [nvme_core]
[<0>] nvme_scan_work+0x1a3/0x490 [nvme_core]

This happens when we have several paths, some of which are inaccessible,
and the active paths are removed first. Then nvme_find_path() will requeue
I/O in the ns_head (as paths are present), but the requeue list is never
triggered as all remaining paths are inactive.

This patch checks for NVME_NSHEAD_DISK_LIVE in nvme_available_path(),
and requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared once
the last path has been removed to properly terminate pending I/O.

Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-09-24 23:30:28 -07:00
Hannes Reinecke
63bcf9014e nvme-multipath: system fails to create generic nvme device
NVME_NSHEAD_DISK_LIVE is a flag for struct nvme_ns_head, not nvme_ns.
The current code has a typo causing NVME_NSHEAD_DISK_LIVE never to
be cleared once device_add_disk_fails, causing the system never to
create the 'generic' character device. Even several rescan attempts
will change the situation and the system has to be rebooted to fix
the issue.

Fixes: 11384580e3 ("nvme-multipath: add error handling support for add_disk()")
Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-09-24 23:30:28 -07:00
Linus Torvalds
7856a56541 Many singleton patches - please see the various changelogs for details.
Quite a lot of nilfs2 work this time around.
 
 Notable patch series in this pull request are:
 
 "mul_u64_u64_div_u64: new implementation" by Nicolas Pitre, with
 assistance from Uwe Kleine-König.  Reimplement mul_u64_u64_div_u64() to
 provide (much) more accurate results.  The current implementation was
 causing Uwe some issues in the PWM drivers.
 
 "xz: Updates to license, filters, and compression options" from Lasse
 Collin.  Miscellaneous maintenance and kinor feature work to the xz
 decompressor.
 
 "Fix some GDB command error and add some GDB commands" from Kuan-Ying Lee.
 Fixes and enhancements to the gdb scripts.
 
 "treewide: add missing MODULE_DESCRIPTION() macros" from Jeff Johnson.
 Adds lots of MODULE_DESCRIPTIONs, thus fixing lots of warnings about this.
 
 "nilfs2: add support for some common ioctls" from Ryusuke Konishi.  Adds
 various commonly-available ioctls to nilfs2.
 
 "This series fixes a number of formatting issues in kernel doc comments"
 from Ryusuke Konishi does that.
 
 "nilfs2: prevent unexpected ENOENT propagation" from Ryusuke Konishi.  Fix
 issues where -ENOENT was being unintentionally and inappropriately
 returned to userspace.
 
 "nilfs2: assorted cleanups" from Huang Xiaojia.
 
 "nilfs2: fix potential issues with empty b-tree nodes" from Ryusuke
 Konishi fixes some issues which can occur on corrupted nilfs2 filesystems.
 
 "scripts/decode_stacktrace.sh: improve error reporting and usability" from
 Luca Ceresoli does those things.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZu7dpAAKCRDdBJ7gKXxA
 jsPqAPwMDEZyKlfSw7QioEHNHDkmkbP7VYCYR0CbUnppbztwpAD8D37aVbWQ+UzM
 3nnOq3W2Pc2o/20zqi8Upf1mnvUrygQ=
 =/NWE
 -----END PGP SIGNATURE-----

Merge tag 'mm-nonmm-stable-2024-09-21-07-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull non-MM updates from Andrew Morton:
 "Many singleton patches - please see the various changelogs for
  details.

  Quite a lot of nilfs2 work this time around.

  Notable patch series in this pull request are:

   - "mul_u64_u64_div_u64: new implementation" by Nicolas Pitre, with
     assistance from Uwe Kleine-König. Reimplement mul_u64_u64_div_u64()
     to provide (much) more accurate results. The current implementation
     was causing Uwe some issues in the PWM drivers.

   - "xz: Updates to license, filters, and compression options" from
     Lasse Collin. Miscellaneous maintenance and kinor feature work to
     the xz decompressor.

   - "Fix some GDB command error and add some GDB commands" from
     Kuan-Ying Lee. Fixes and enhancements to the gdb scripts.

   - "treewide: add missing MODULE_DESCRIPTION() macros" from Jeff
     Johnson. Adds lots of MODULE_DESCRIPTIONs, thus fixing lots of
     warnings about this.

   - "nilfs2: add support for some common ioctls" from Ryusuke Konishi.
     Adds various commonly-available ioctls to nilfs2.

   - "This series fixes a number of formatting issues in kernel doc
     comments" from Ryusuke Konishi does that.

   - "nilfs2: prevent unexpected ENOENT propagation" from Ryusuke
     Konishi. Fix issues where -ENOENT was being unintentionally and
     inappropriately returned to userspace.

   - "nilfs2: assorted cleanups" from Huang Xiaojia.

   - "nilfs2: fix potential issues with empty b-tree nodes" from Ryusuke
     Konishi fixes some issues which can occur on corrupted nilfs2
     filesystems.

   - "scripts/decode_stacktrace.sh: improve error reporting and
     usability" from Luca Ceresoli does those things"

* tag 'mm-nonmm-stable-2024-09-21-07-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (103 commits)
  list: test: increase coverage of list_test_list_replace*()
  list: test: fix tests for list_cut_position()
  proc: use __auto_type more
  treewide: correct the typo 'retun'
  ocfs2: cleanup return value and mlog in ocfs2_global_read_info()
  nilfs2: remove duplicate 'unlikely()' usage
  nilfs2: fix potential oob read in nilfs_btree_check_delete()
  nilfs2: determine empty node blocks as corrupted
  nilfs2: fix potential null-ptr-deref in nilfs_btree_insert()
  user_namespace: use kmemdup_array() instead of kmemdup() for multiple allocation
  tools/mm: rm thp_swap_allocator_test when make clean
  squashfs: fix percpu address space issues in decompressor_multi_percpu.c
  lib: glob.c: added null check for character class
  nilfs2: refactor nilfs_segctor_thread()
  nilfs2: use kthread_create and kthread_stop for the log writer thread
  nilfs2: remove sc_timer_task
  nilfs2: do not repair reserved inode bitmap in nilfs_new_inode()
  nilfs2: eliminate the shared counter and spinlock for i_generation
  nilfs2: separate inode type information from i_state field
  nilfs2: use the BITS_PER_LONG macro
  ...
2024-09-21 08:20:50 -07:00
Jens Axboe
42b16d3ac3 Linux 6.11
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmbm9fQeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGXcwH/A8+IXnrGv+VzYgD
 +mE4hgGGHt4dClcUZ31gQetkkT6xktEVp6pB6JkFO7oEgBiTkJBbYGl6VZtsAIOd
 Fi3jic8ik0uhZLFcxDJcHTceh6Pw8bkhWoh0tkF3bkDRwbppJdG7Khyk8DxTl24w
 ldqh9om2cC7w9IPVx93xTgKgMMZ63qiJyUdTvxEZI3BG8F70smlgZSPskLp2Iktd
 FIJZPcyKM0bhJYwZOpXK0vx5C2cA4oIW4xriHUw4aklv646OBxNKevB2JJAft2uA
 6LyvuLgnYn/OpdFGZ8slvdmhm6hLWft5B1/bWKorUkz7p5YGiySFzpkMVAkNJ6mS
 cRwHJNc=
 =flw3
 -----END PGP SIGNATURE-----

Merge tag 'v6.11' into for-6.12/block

Merge in 6.11 final to get the fix for preventing deadlocks on an
elevator switch, as there's a fixup for that patch.

* tag 'v6.11': (1788 commits)
  Linux 6.11
  Revert "KVM: VMX: Always honor guest PAT on CPUs that support self-snoop"
  pinctrl: pinctrl-cy8c95x0: Fix regcache
  cifs: Fix signature miscalculation
  mm: avoid leaving partial pfn mappings around in error case
  drm/xe/client: add missing bo locking in show_meminfo()
  drm/xe/client: fix deadlock in show_meminfo()
  drm/xe/oa: Enable Xe2+ PES disaggregation
  drm/xe/display: fix compat IS_DISPLAY_STEP() range end
  drm/xe: Fix access_ok check in user_fence_create
  drm/xe: Fix possible UAF in guc_exec_queue_process_msg
  drm/xe: Remove fence check from send_tlb_invalidation
  drm/xe/gt: Remove double include
  net: netfilter: move nf flowtable bpf initialization in nf_flow_table_module_init()
  PCI: Fix potential deadlock in pcim_intx()
  workqueue: Clear worker->pool in the worker thread context
  net: tighten bad gso csum offset check in virtio_net_hdr
  netlink: specs: mptcp: fix port endianness
  net: dpaa: Pad packets to ETH_ZLEN
  mptcp: pm: Fix uaf in __timer_delete_sync
  ...
2024-09-17 08:32:53 -06:00
Linus Torvalds
26bb0d3f38 for-6.12/block-20240913
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmbkZhQQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpjOKD/0fzd4yOcqxSI9W3OLGd04VrOTJIQa4CRbV
 GmoTq39pOeIDVGug5ekkTpqqHHnuGk+nQhCzD9vsN/eTmC7yZOIr847O2aWzvYEn
 PzFRgmJpoo2E9sr/IsTR5LnJjbaIZhQVkqLH6ZOj9tpKlVwN2SK0nIRVNrAi5zgT
 MaDrto/2OUld+vmA99Rgb23jxM6UBdCPIjuiVa+11Vg9Z3D1tWbBmrsG7OMysyIf
 FbASBeKHqFSO61/ipFCZv6VV1X8zoWEVyT8n4A1yUbbN5rLzPgoQJVbfSqQRXIdr
 cdrKeCbKxl+joSgKS6LKpvnfwRgGF+hgAfpZg4c0vrbZGTQcRhhLFECyh/aVI08F
 p5TOMArhVaX59664gHgSPq4KnGTXOO29dot9N3Jya/ZQnxinjY9r+GVOfLuduPPy
 1B04vab8oAsk4zK7fZbkDxgYUyifwzK/vQ6OqYq2mYdpdIS/AE7T2ou61Bz5mI7I
 /BuucNV0Z96OKlyLEXwXXZjZgNu1TFcq6ARIBJ8L08PY64Fesj5BXabRyXkeNH26
 0exyz9heeJs6OwRGfngXmS24tDSS0k74CeZX3KoePNj69u6KCn346KiU1qgntwwD
 E5F7AEHqCl5FjUEIWB4M1EPlfA8U0MzOL+tkx2xKJAjsU60wAy7jRSyOIcqodpMs
 6UlPcJzgYg==
 =uuLl
 -----END PGP SIGNATURE-----

Merge tag 'for-6.12/block-20240913' of git://git.kernel.dk/linux

Pull block updates from Jens Axboe:

 - MD changes via Song:
      - md-bitmap refactoring (Yu Kuai)
      - raid5 performance optimization (Artur Paszkiewicz)
      - Other small fixes (Yu Kuai, Chen Ni)
      - Add a sysfs entry 'new_level' (Xiao Ni)
      - Improve information reported in /proc/mdstat (Mateusz Kusiak)

 - NVMe changes via Keith:
      - Asynchronous namespace scanning (Stuart)
      - TCP TLS updates (Hannes)
      - RDMA queue controller validation (Niklas)
      - Align field names to the spec (Anuj)
      - Metadata support validation (Puranjay)
      - A syntax cleanup (Shen)
      - Fix a Kconfig linking error (Arnd)
      - New queue-depth quirk (Keith)

 - Add missing unplug trace event (Keith)

 - blk-iocost fixes (Colin, Konstantin)

 - t10-pi modular removal and fixes (Alexey)

 - Fix for potential BLKSECDISCARD overflow (Alexey)

 - bio splitting cleanups and fixes (Christoph)

 - Deal with folios rather than rather than pages, speeding up how the
   block layer handles bigger IOs (Kundan)

 - Use spinlocks rather than bit spinlocks in zram (Sebastian, Mike)

 - Reduce zoned device overhead in ublk (Ming)

 - Add and use sendpages_ok() for drbd and nvme-tcp (Ofir)

 - Fix regression in partition error pointer checking (Riyan)

 - Add support for write zeroes and rotational status in nbd (Wouter)

 - Add Yu Kuai as new BFQ maintainer. The scheduler has been
   unmaintained for quite a while.

 - Various sets of fixes for BFQ (Yu Kuai)

 - Misc fixes and cleanups (Alvaro, Christophe, Li, Md Haris, Mikhail,
   Yang)

* tag 'for-6.12/block-20240913' of git://git.kernel.dk/linux: (120 commits)
  nvme-pci: qdepth 1 quirk
  block: fix potential invalid pointer dereference in blk_add_partition
  blk_iocost: make read-only static array vrate_adj_pct const
  block: unpin user pages belonging to a folio at once
  mm: release number of pages of a folio
  block: introduce folio awareness and add a bigger size from folio
  block: Added folio-ized version of bio_add_hw_page()
  block, bfq: factor out a helper to split bfqq in bfq_init_rq()
  block, bfq: remove local variable 'bfqq_already_existing' in bfq_init_rq()
  block, bfq: remove local variable 'split' in bfq_init_rq()
  block, bfq: remove bfq_log_bfqg()
  block, bfq: merge bfq_release_process_ref() into bfq_put_cooperator()
  block, bfq: fix procress reference leakage for bfqq in merge chain
  block, bfq: fix uaf for accessing waker_bfqq after splitting
  blk-throttle: support prioritized processing of metadata
  blk-throttle: remove last_low_overflow_time
  drbd: Add NULL check for net_conf to prevent dereference in state validation
  nvme-tcp: fix link failure for TCP auth
  blk-mq: add missing unplug trace event
  mtip32xx: Remove redundant null pointer checks in mtip_hw_debugfs_init()
  ...
2024-09-16 13:33:06 +02:00
Keith Busch
76c313f658 blk-integrity: improved sg segment mapping
Make the integrity mapping more like data mapping, blk_rq_map_sg. Use
the request to validate the segment count, and update the callers so
they don't have to.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20240913191746.2628196-1-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-13 13:22:09 -06:00
Keith Busch
f4330766bc nvme-rdma: use request to get integrity segments
The request tracks the integrity segments already, so no need to recount
the segments again.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20240913182854.2445457-8-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-13 12:31:45 -06:00
Keith Busch
d2c5b1facc block: provide a request helper for user integrity segments
Provide a helper to keep the request flags and nr_integrity_segments in
sync with the bio's integrity payload. This is an integrity equivalent
to the normal data helper function, 'blk_rq_map_user()'.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20240913182854.2445457-6-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-13 12:31:45 -06:00
Keith Busch
83bdfcbdbe nvme-pci: qdepth 1 quirk
Another device has been reported to be unreliable if we have more than
one outstanding command. In this new case, data corruption may occur.
Since we have two devices now needing this quirky behavior, make a
generic quirk flag.

The same Apple quirk is clearly not "temporary", so update the comment
while moving it.

Link: https://lore.kernel.org/linux-nvme/191d810a4e3.fcc6066c765804.973611676137075390@collabora.com/
Reported-by: Robert Beckett <bob.beckett@collabora.com>
Reviewed-by: Christoph Hellwig hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-09-13 07:29:30 -07:00
Arnd Bergmann
2d5a333e09 nvme-tcp: fix link failure for TCP auth
The nvme fabric driver calls the nvme_tls_key_lookup() function from
nvmf_parse_key() when the keyring is enabled, but this is broken in a
configuration with CONFIG_NVME_FABRICS=y and CONFIG_NVME_TCP=m because
this leads to the function definition being in a loadable module:

x86_64-linux-ld: vmlinux.o: in function `nvmf_parse_key':
fabrics.c:(.text+0xb1bdec): undefined reference to `nvme_tls_key_lookup'

Move the 'select' up to CONFIG_NVME_FABRICS itself to force this
part to be built-in as well if needed.

Fixes: 5bc46b49c8 ("nvme-tcp: check for invalidated or revoked key")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-09-10 07:41:07 -07:00
Shen Lichuan
389e72c5d1 nvme: Convert comma to semicolon
To ensure code clarity and prevent potential errors, it's advisable
to employ the ';' as a statement separator, except when ',' are
intentionally used for specific purposes.

Signed-off-by: Shen Lichuan <shenlichuan@vivo.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-09-06 14:20:38 -07:00
Maurizio Lombardi
899d2e5a4e nvmet: Identify-Active Namespace ID List command should reject invalid nsid
nsid values of 0xFFFFFFFE and 0XFFFFFFFF should be rejected with
a status code of "Invalid Namespace or Format".
See NVMe Base Specification, Active Namespace ID list (CNS 02h).

Fixes: a07b4970f4 ("nvmet: add a generic NVMe target")
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-09-03 10:05:40 -07:00
Christoph Hellwig
28982ad73d nvme: set BLK_FEAT_ZONED for ZNS multipath disks
The new stricter limits validation doesn't like a max_append_sectors value
to be set without BLK_FEAT_ZONED.  Set it before allocation the disk to
fix this instead of just inheriting it later.

Fixes: d690cb8ae1 ("block: add an API to atomically update queue limits")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-09-03 10:00:10 -07:00
Jani Nikula
6ce2082fd3 fault-inject: improve build for CONFIG_FAULT_INJECTION=n
The fault-inject.h users across the kernel need to add a lot of #ifdef
CONFIG_FAULT_INJECTION to cater for shortcomings in the header.  Make
fault-inject.h self-contained for CONFIG_FAULT_INJECTION=n, and add stubs
for DECLARE_FAULT_ATTR(), setup_fault_attr(), should_fail_ex(), and
should_fail() to allow removal of conditional compilation.

[akpm@linux-foundation.org: repair fallout from no longer including debugfs.h into fault-inject.h]
[akpm@linux-foundation.org: fix drivers/misc/xilinx_tmr_inject.c]
[akpm@linux-foundation.org: Add debugfs.h inclusion to more files, per Stephen]
Link: https://lkml.kernel.org/r/20240813121237.2382534-1-jani.nikula@intel.com
Fixes: 6ff1cb355e ("[PATCH] fault-injection capabilities infrastructure")
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Abhinav Kumar <quic_abhinavk@quicinc.com>
Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Rob Clark <robdclark@gmail.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01 20:43:33 -07:00
Puranjay Mohan
7c2fd76048 nvme: fix metadata handling in nvme-passthrough
On an NVMe namespace that does not support metadata, it is possible to
send an IO command with metadata through io-passthru. This allows issues
like [1] to trigger in the completion code path.
nvme_map_user_request() doesn't check if the namespace supports metadata
before sending it forward. It also allows admin commands with metadata to
be processed as it ignores metadata when bdev == NULL and may report
success.

Reject an IO command with metadata when the NVMe namespace doesn't
support it and reject an admin command if it has metadata.

[1] https://lore.kernel.org/all/mb61pcylvnym8.fsf@amazon.com/

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Puranjay Mohan <pjy@amazon.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-08-30 07:50:29 -07:00
Georg Gottleuber
61aa894e7a nvme-pci: Add sleep quirk for Samsung 990 Evo
On some TUXEDO platforms, a Samsung 990 Evo NVMe leads to a high
power consumption in s2idle sleep (2-3 watts).

This patch applies 'Force No Simple Suspend' quirk to achieve a
sleep with a lower power consumption, typically around 0.5 watts.

Signed-off-by: Georg Gottleuber <ggo@tuxedocomputers.com>
Signed-off-by: Werner Sembach <wse@tuxedocomputers.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-08-27 12:06:43 -07:00
Keith Busch
6f01bdbfef nvme-pci: allocate tagset on reset if necessary
If a drive is unable to create IO queues on the initial probe, a
subsequent reset will need to allocate the tagset if IO queue creation
is successful. Without this, blk_mq_update_nr_hw_queues will crash on a
bad pointer due to the invalid tagset.

Fixes: eac3ef2629 ("nvme-pci: split the initial probe from the rest path")
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-08-27 07:55:40 -07:00
Maurizio Lombardi
5572a55a6f nvmet-tcp: fix kernel crash if commands allocation fails
If the commands allocation fails in nvmet_tcp_alloc_cmds()
the kernel crashes in nvmet_tcp_release_queue_work() because of
a NULL pointer dereference.

  nvmet: failed to install queue 0 cntlid 1 ret 6
  Unable to handle kernel NULL pointer dereference at
         virtual address 0000000000000008

Fix the bug by setting queue->nr_cmds to zero in case
nvmet_tcp_alloc_cmd() fails.

Fixes: 872d26a391 ("nvmet-tcp: add NVMe over TCP target driver")
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-08-26 16:00:52 -07:00
Anuj Gupta
cead0b8991 nvme: rename apptag and appmask to lbat and lbatm
Rename apptag and appmask to lbat and lbatm so that it matches the field
names used in NVMe spec.

Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-08-26 09:51:32 -07:00
Niklas Cassel
03c3d7c743 nvme-rdma: send cntlid in the RDMA_CM_REQUEST Private Data
When sending a RDMA_CM_REQUEST, the NVMe RDMA Transport Specification
allows you to populate the cntlid field in the RDMA_CM_REQUEST Private
Data.

The cntlid is returned by the target on completion of the first
RDMA_CM_REQUEST command (which creates the admin queue).

The cntlid field can then be populated by the host when the I/O queues
are created (using additional RDMA_CM_REQUEST commands), such that the
target can perform extra validation for additional RDMA_CM_REQUEST
commands.

This additional error code and error message is also added, such that
nvme_rdma_cm_msg() will display the proper error message if the target
fails the RDMA_CM_REQUEST command because of this extra validation.

Signed-off-by: Niklas Cassel <cassel@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-08-26 08:11:19 -07:00
Keith Busch
5a6d3a638c nvme: use better description for async reset reason
The NVMe AER notification of a persistent internal error triggers a
reset. The existing warning message just says "due to AER", which can be
confused with the unrelated PCIe AER condition. Just say what the event
was instead of the generic overloaded acronym.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-08-23 09:52:04 -07:00
Jinjie Ruan
f4bd313993 nvmet: Make nvmet_debugfs static
The sparse tool complains as follows:

drivers/nvme/target/debugfs.c:16:15: warning:
	symbol 'nvmet_debugfs' was not declared. Should it be static?

This symbol is not used outside debugfs.c, so marks it static.

Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-08-23 09:50:16 -07:00
Nilay Shroff
fe01751347 nvme: Remove unused field
The "name" field in struct nvme_ctrl is unsued so removing it.
This would help save 12 bytes of space for each nvme_ctrl instance
created.

Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-08-22 13:28:40 -07:00
Ming Lei
a54a93d0e3 nvme: move stopping keep-alive into nvme_uninit_ctrl()
Commit 4733b65d82 ("nvme: start keep-alive after admin queue setup")
moves starting keep-alive from nvme_start_ctrl() into
nvme_init_ctrl_finish(), but don't move stopping keep-alive into
nvme_uninit_ctrl(), so keep-alive work can be started and keep pending
after failing to start controller, finally use-after-free is triggered if
nvme host driver is unloaded.

This patch fixes kernel panic when running nvme/004 in case that connection
failure is triggered, by moving stopping keep-alive into nvme_uninit_ctrl().

This way is reasonable because keep-alive is now started in
nvme_init_ctrl_finish().

Fixes: 3af755a468 ("nvme: move nvme_stop_keep_alive() back to original position")
Cc: Hannes Reinecke <hare@suse.de>
Cc: Mark O'Donovan <shiftee@posteo.net>
Reported-by: Changhui Zhong <czhong@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-08-22 13:28:40 -07:00
Hannes Reinecke
ff4a0a4088 nvme-target: do not check authentication status for admin commands twice
nvmet_check_ctrl_status() checks the authentication status, so
we don't need to do that prior to calling it.

Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-08-22 13:25:11 -07:00
Hannes Reinecke
bb2df18958 nvmet-auth: allow to clear DH-HMAC-CHAP keys
As we can set DH-HMAC-CHAP keys, we should also be
able to unset them.

Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-08-22 13:25:11 -07:00
Hannes Reinecke
02a3688c53 nvme-sysfs: add 'tls_keyring' attribute
Add a 'tls_keyring' attribute to display the contents of the
--keyring option from the connect string. Adding this attribute
allows us to recreate the original connect string from sysfs
settings.

Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-08-22 13:25:11 -07:00
Hannes Reinecke
f5eb739747 nvme-sysfs: add 'tls_configured_key' sysfs attribute
There is a difference between the negotiated TLS key (which is
always present for a TLS encrypted connection) and the configured
TLS key (which is specified with the --tls_key command line option).
To differentate between these two add a new sysfs attribute
'tls_configured_key' to hold the specified on the command line.

Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-08-22 13:25:11 -07:00
Hannes Reinecke
1e48b34c9b nvme: split off TLS sysfs attributes into a separate group
Split off TLS sysfs attributes into a separate group to improve
readability and to keep all TLS related handling in one section.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-08-22 13:25:11 -07:00
Hannes Reinecke
c5f2ca52d0 nvme: add a newline to the 'tls_key' sysfs attribute
Print a newline for easier userspace handling.

Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-08-22 13:25:11 -07:00
Hannes Reinecke
5bc46b49c8 nvme-tcp: check for invalidated or revoked key
key_lookup() will always return a key, even if that key is revoked
or invalidated. So check for invalid keys before continuing.

Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-08-22 13:25:07 -07:00
Hannes Reinecke
363895767f nvme-tcp: sanitize TLS key handling
There is a difference between TLS configured (ie the user has
provisioned/requested a key) and TLS enabled (ie the connection
is encrypted with TLS). This becomes important for secure concatenation,
where the initial authentication is run on an unencrypted connection
(ie with TLS configured, but not enabled), and then the queue is reset to
run over TLS (ie TLS configured _and_ enabled).
So to differentiate between those two states store the generated
key in opts->tls_key (as we're using the same TLS key for all queues),
the key serial of the resulting TLS handshake in ctrl->tls_pskid
(to signal that TLS on the admin queue is enabled), and a simple
flag for the queues to indicated that TLS has been enabled.

Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-08-22 13:22:41 -07:00
Hannes Reinecke
79559c7533 nvme-keyring: restrict match length for version '1' identifiers
TP8018 introduced a new TLS PSK identifier version (version 1), which appended
a PSK hash value to the existing identifier (cf NVMe TCP specification v1.1,
section 3.6.1.3 'TLS PSK and PSK Identity Derivation').
An original (version 0) identifier has the form:

NVMe0<type><hmac> <hostnqn> <subsysnqn>

and a version 1 identifier has the form:

NVMe1<type><hmac> <hostnqn> <subsysnqn> <hash>

This patch modifies the lookup algorthm to compare only the first part
of the identifier (excluding the hash value) to handle both version 0 and
version 1 identifiers.
And the spec declares 'version 0' identifiers obsolete, so the lookup
algorithm is modified to prever v1 identifiers.

Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-08-22 13:22:41 -07:00
Stuart Hayes
4e893ca811 nvme_core: scan namespaces asynchronously
Use async function calls to make namespace scanning happen in parallel.

Without the patch, NVME namespaces are scanned serially, so it can take
a long time for all of a controller's namespaces to become available,
especially with a slower (TCP) interface with large number of
namespaces.

It is not uncommon to have large numbers (hundreds or thousands) of
namespaces on nvme-of with storage servers.

The time it took for all namespaces to show up after connecting (via
TCP) to a controller with 1002 namespaces was measured on one system:

network latency   without patch   with patch
     0                 6s            1s
    50ms             210s           10s
   100ms             417s           18s

Measurements taken on another system show the effect of the patch on the
time nvme_scan_work() took to complete, when connecting to a linux
nvme-of target with varying numbers of namespaces, on a network of
400us.

namespaces    without patch   with patch
     1            16ms           14ms
     2            24ms           16ms
     4            49ms           22ms
     8           101ms           33ms
    16           207ms           56ms
   100           1.4s           0.6s
  1000          12.9s           2.0s

On the same system, connecting to a local PCIe NVMe drive (a Samsung
PM1733) instead of a network target:

namespaces    without patch   with patch
     1            13ms           12ms
     2            41ms           13ms

Signed-off-by: Stuart Hayes <stuart.w.hayes@gmail.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2024-08-22 10:05:22 -07:00
Kanchan Joshi
b4c1f33a5d nvme: reorganize nvme_ns_head fields
shuffle few fields to reduce the holes within nvme_ns_head.
On x86_64, the size is reduced to 1104 bytes from 1120 bytes.

Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-07-31 07:40:10 -07:00
Kanchan Joshi
73d148ccb9 nvme: change data type of lba_shift
u8 fits the need, so stop using int for it.

Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-07-31 07:40:10 -07:00
Kanchan Joshi
6339b7edad nvme: remove a field from nvme_ns_head
pi_offset field is not required to be present in nvme_ns_head.

Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-07-31 07:40:10 -07:00
Kanchan Joshi
7ec5bd247a nvme: remove unused parameter
First parameter of nvme_init_integrity() is unused.
Remove it, and modify the callers.

Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-07-29 07:27:58 -07:00
Ofir Gal
6af7331a70 nvme-tcp: use sendpages_ok() instead of sendpage_ok()
Currently nvme_tcp_try_send_data() use sendpage_ok() in order to disable
MSG_SPLICE_PAGES, it check the first page of the iterator, the iterator
may represent contiguous pages.

MSG_SPLICE_PAGES enables skb_splice_from_iter() which checks all the
pages it sends with sendpage_ok().

When nvme_tcp_try_send_data() sends an iterator that the first page is
sendable, but one of the other pages isn't skb_splice_from_iter() warns
and aborts the data transfer.

Using the new helper sendpages_ok() in order to disable MSG_SPLICE_PAGES
solves the issue.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Ofir Gal <ofir.gal@volumez.com>
Link: https://lore.kernel.org/r/20240718084515.3833733-3-ofir.gal@volumez.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-28 16:47:52 -06:00
Jens Axboe
f6bb5254b7 nvme fixes for Linux 6.11
- Fix request without payloads cleanup  (Leon)
  - Use new protection information format (Francis)
  - Improved debug message for lost pci link (Bart)
  - Another apst quirk (Wang)
  - Use appropriate sysfs api for printing chars (Markus)
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE3Fbyvv+648XNRdHTPe3zGtjzRgkFAmajrFcACgkQPe3zGtjz
 Rgk4QRAA357EtsOgDKAbodaePFsGhmfhVEVhizCkw+yj+l73jGWMRMDfbpNsnkii
 OTJaXhQgVPrCBGnmbgHn5GehEKJ/VZwW4fFbF42wvetFXWSYFDPL2yqC3YpaPjNp
 aSJvUM0rw4f+JE52tTPQvVzLQ/M9PZFsJ6sUmoozV5MoPLt4eKKUEHKigCXpXV/p
 AOwiejVVk035WbKNq4R8DoQsa05Yk0Tv5zKsFgmXEZjrnorC0dpqWQjT5HH6V9pt
 eHTA2cxKq9qAHBN1Zm/3HUOmxmJZ1GW3AKLxYM+k0ornnfnO7inlQwNJDsQItXXS
 ZNBELiYIIObVoy6COB03NWMSCcS/TrpfSKJ9s+JOdJt/T+AOVCwQkqpIff6aJTaH
 k4ppVjChmaY3+taIkLQ5nC1zecCZr7hY+xL0ZkUGhlznKn9x2a1zOBZ6tUuabul6
 57JztkeXPyTNZ/t1WhYQQpGQ4MCLXnu81gMRzVKfJcmtMOSrBOO1p8eNjb/LgM4M
 Qpu9VKS33OBrmMEBlhnBvhkFIXxHUU1CjZJkQ2MYm4YLLnaO4VmmOk3tcX1mpID4
 R6GrsXAOBlqjtAyqsTGQJ/+Z5BRp1nhOk/E0APiD+sbgJPiRSJ2HNZpDEbOroYOy
 4IiaQwZbyGSjieBWWe0fMMCBbazk2S9ws57eAbVh5/r6L+DSWTc=
 =1Rzv
 -----END PGP SIGNATURE-----

Merge tag 'nvme-6.11-2024-07-26' of git://git.infradead.org/nvme into block-6.11

Pull NVMe fixes from Keith:

"nvme fixes for Linux 6.11

 - Fix request without payloads cleanup  (Leon)
 - Use new protection information format (Francis)
 - Improved debug message for lost pci link (Bart)
 - Another apst quirk (Wang)
 - Use appropriate sysfs api for printing chars (Markus)"

* tag 'nvme-6.11-2024-07-26' of git://git.infradead.org/nvme:
  nvme-pci: add missing condition check for existence of mapped data
  nvme-core: choose PIF from QPIF if QPIFS supports and PIF is QTYPE
  nvme-pci: Fix the instructions for disabling power management
  nvme: remove redundant bdev local variable
  nvme-fabrics: Use seq_putc() in __nvmf_concat_opt_tokens()
  nvme/pci: Add APST quirk for Lenovo N60z laptop
2024-07-26 08:06:15 -06:00
Leon Romanovsky
c31fad1470 nvme-pci: add missing condition check for existence of mapped data
nvme_map_data() is called when request has physical segments, hence
the nvme_unmap_data() should have same condition to avoid dereference.

Fixes: 4aedb70543 ("nvme-pci: split metadata handling from nvme_map_data / nvme_unmap_data")
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-07-25 07:20:54 -07:00
Linus Torvalds
0256994887 for-6.11/block-post-20240722
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmaeY00QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpjPGD/9CPo93+V/ztfzY1J18KhA2CCUh1uuxZIjx
 dLfi07Bo+gyLwB1vaSf0bNy9gM8SzGFSMszSIDTErNq9/F6RvWjXN0CchyQf1Wii
 o2UyQg8JLjT2o1pJSsdJySZQRsG/daWUHzHaX1kD343Cd6OBV2YaVFdYTaXUGg4v
 G1AVh7qFvQhAIg1jV8q2z7QC7PSeuTnvyvY65Z8/iVJe95FayOrtGmDPTaJab8r2
 7uEFiWZk23erzNygVdcSoNIrwWFmRARz5o3IvwJJfEL08hkdoAqu6vD2oCUZspKU
 3g4wU6JrN0QYQpVwIJ9WcwYcoOm6iMm9xwCVMsp8R3KRUU107HjaiEazFDGk4HW4
 ozZTa7leTXnrRqnjVhcQpUvC+1uVLCFN8sSElNY7m2dg0IojnlMz+t3lMiTtaR9N
 Rt6wy5alVQFlb2uhzALuUh6HM1zA98swWySNoP0arTkOT9kjXwwAgn0I+M1s9Uxo
 FaQvM0YnAsb2C8LSpNtZWLaTlRSLTzUsGThLSJMBZueIJ9+BF23i7W7euklCNxjj
 Jl6CykEkEkacOxU6b9PG6qSnUq9JJ+W7gcJVing+ugAFrZDutxy6eJZXVv8wuvCC
 EOxaADpSs2xAaH9V0BMmwO51w0NDWySyGPHB5UBkhNjqOji/oG3FvAITiboQArgS
 FES4jtU1TA==
 =dn4l
 -----END PGP SIGNATURE-----

Merge tag 'for-6.11/block-post-20240722' of git://git.kernel.dk/linux

Pull block integrity mapping updates from Jens Axboe:
 "A set of cleanups and fixes for the block integrity support.

  Sent separately from the main block changes from last week, as they
  depended on later fixes in the 6.10-rc cycle"

* tag 'for-6.11/block-post-20240722' of git://git.kernel.dk/linux:
  block: don't free the integrity payload in bio_integrity_unmap_free_user
  block: don't free submitter owned integrity payload on I/O completion
  block: call bio_integrity_unmap_free_user from blk_rq_unmap_user
  block: don't call bio_uninit from bio_endio
  block: also return bio_integrity_payload * from stubs
  block: split integrity support out of bio.h
2024-07-22 11:04:09 -07:00
Francis Pravin
415fb383ec nvme-core: choose PIF from QPIF if QPIFS supports and PIF is QTYPE
As per TP4141a:
"If the Qualified Protection Information Format Support(QPIFS) bit is
set to 1 and the Protection Information Format(PIF) field is set to 11b
(i.e., Qualified Type), then the pif is as defined in the Qualified
Protection Information Format (QPIF) field."
So, choose PIF from QPIF if QPIFS supports and PIF is QTYPE.

Signed-off-by: Francis Pravin <francis.p@samsung.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-07-16 07:55:31 -07:00
Linus Torvalds
3e78198862 for-6.11/block-20240710
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmaOTd8QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgppqIEACUr8Vv2FtezvT3OfVSlYWHHLXzkRhwEG5s
 vdk0o7Ow6U54sMjfymbHTgLD0ZOJf3uJ6BI95FQuW41jPzDFVbx4Hy8QzqonMkw9
 1D/YQ4zrVL2mOKBzATbKpoGJzMOzGeoXEueFZ1AYPAX7RrDtP4xPQNfrcfkdE2zF
 LycJN70Vp6lrZZMuI9yb9ts1tf7TFzK0HJANxOAKTgSiPmBmxesjkJlhrdUrgkAU
 qDVyjj7u/ssndBJAb9i6Bl95Do8s9t4DeJq5/6wgKqtf5hClMXzPVB8Wy084gr6E
 rTRsCEhOug3qEZSqfAgAxnd3XFRNc/p2KMUe5YZ4mAqux4hpSmIQQDM/5X5K9vEv
 f4MNqUGlqyqntZx+KPyFpf7kLHFYS1qK4ub0FojWJEY4GrbBPNjjncLJ9+ozR0c8
 kNDaFjMNAjalBee1FxNNH8LdVcd28rrCkPxRLEfO/gvBMUmvJf4ZyKmSED0v5DhY
 vZqKlBqG+wg0EXvdiWEHMDh9Y+q/2XBIkS6NN/Bhh61HNu+XzC838ts1X7lR+4o2
 AM5Vapw+v0q6kFBMRP3IcJI/c0UcIU8EQU7axMyzWtvhog8kx8x01hIj1L4UyYYr
 rUdWrkugBVXJbywFuH/QIJxWxS/z4JdSw5VjASJLIrXy+aANmmG9Wonv95eyhpUv
 5iv+EdRSNA==
 =wVi8
 -----END PGP SIGNATURE-----

Merge tag 'for-6.11/block-20240710' of git://git.kernel.dk/linux

Pull block updates from Jens Axboe:

 - NVMe updates via Keith:
     - Device initialization memory leak fixes (Keith)
     - More constants defined (Weiwen)
     - Target debugfs support (Hannes)
     - PCIe subsystem reset enhancements (Keith)
     - Queue-depth multipath policy (Redhat and PureStorage)
     - Implement get_unique_id (Christoph)
     - Authentication error fixes (Gaosheng)

 - MD updates via Song
     - sync_action fix and refactoring (Yu Kuai)
     - Various small fixes (Christoph Hellwig, Li Nan, and Ofir Gal, Yu
       Kuai, Benjamin Marzinski, Christophe JAILLET, Yang Li)

 - Fix loop detach/open race (Gulam)

 - Fix lower control limit for blk-throttle (Yu)

 - Add module descriptions to various drivers (Jeff)

 - Add support for atomic writes for block devices, and statx reporting
   for same. Includes SCSI and NVMe (John, Prasad, Alan)

 - Add IO priority information to block trace points (Dongliang)

 - Various zone improvements and tweaks (Damien)

 - mq-deadline tag reservation improvements (Bart)

 - Ignore direct reclaim swap writes in writeback throttling (Baokun)

 - Block integrity improvements and fixes (Anuj)

 - Add basic support for rust based block drivers. Has a dummy null_blk
   variant for now (Andreas)

 - Series converting driver settings to queue limits, and cleanups and
   fixes related to that (Christoph)

 - Cleanup for poking too deeply into the bvec internals, in preparation
   for DMA mapping API changes (Christoph)

 - Various minor tweaks and fixes (Jiapeng, John, Kanchan, Mikulas,
   Ming, Zhu, Damien, Christophe, Chaitanya)

* tag 'for-6.11/block-20240710' of git://git.kernel.dk/linux: (206 commits)
  floppy: add missing MODULE_DESCRIPTION() macro
  loop: add missing MODULE_DESCRIPTION() macro
  ublk_drv: add missing MODULE_DESCRIPTION() macro
  xen/blkback: add missing MODULE_DESCRIPTION() macro
  block/rnbd: Constify struct kobj_type
  block: take offset into account in blk_bvec_map_sg again
  block: fix get_max_segment_size() warning
  loop: Don't bother validating blocksize
  virtio_blk: Don't bother validating blocksize
  null_blk: Don't bother validating blocksize
  block: Validate logical block size in blk_validate_limits()
  virtio_blk: Fix default logical block size fallback
  nvmet-auth: fix nvmet_auth hash error handling
  nvme: implement ->get_unique_id
  block: pass a phys_addr_t to get_max_segment_size
  block: add a bvec_phys helper
  blk-lib: check for kill signal in ioctl BLKZEROOUT
  block: limit the Write Zeroes to manually writing zeroes fallback
  block: refacto blkdev_issue_zeroout
  block: move read-only and supported checks into (__)blkdev_issue_zeroout
  ...
2024-07-15 14:20:22 -07:00
Bart Van Assche
92fc2c469e nvme-pci: Fix the instructions for disabling power management
pcie_aspm=off tells the kernel not to modify the ASPM configuration. This
setting does not guarantee that ASPM (Active State Power Management) is
disabled. Hence add pcie_port_pm=off. This disables power management for
all PCIe ports.

This patch has been tested on a workstation with a Samsung SSD 970 EVO Plus
NVMe SSD.

Fixes: 4641a8e6e1 ("nvme-pci: add trouble shooting steps for timeouts")
Cc: Keith Busch <kbusch@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-07-15 13:46:00 -07:00
Israel Rukshin
88c918d1ee nvme: remove redundant bdev local variable
Use disk directly instead of getting it from bdev->bd_disk.

Signed-off-by: Israel Rukshin <israelr@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-07-15 13:44:59 -07:00
Markus Elfring
1a7812b25e nvme-fabrics: Use seq_putc() in __nvmf_concat_opt_tokens()
Single characters should be put into a sequence.
Thus use the corresponding function “seq_putc”.

This issue was transformed by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-07-15 13:43:41 -07:00
WangYuli
ab091ec536 nvme/pci: Add APST quirk for Lenovo N60z laptop
There is a hardware power-saving problem with the Lenovo N60z
board. When turn it on and leave it for 10 hours, there is a
20% chance that a nvme disk will not wake up until reboot.

Link: https://lore.kernel.org/all/2B5581C46AC6E335+9c7a81f1-05fb-4fd0-9fbb-108757c21628@uniontech.com
Signed-off-by: hmy <huanglin@uniontech.com>
Signed-off-by: Wentao Guan <guanwentao@uniontech.com>
Signed-off-by: WangYuli <wangyuli@uniontech.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-07-15 13:43:39 -07:00
Jens Axboe
6b43537fae nvme updates for Linux 6.11
- Device initialization memory leak fixes (Keith)
  - More constants defined (Weiwen)
  - Target debugfs support (Hannes)
  - PCIe subsystem reset enhancements (Keith)
  - Queue-depth multipath policy (Redhat and PureStorage)
  - Implement get_unique_id (Christoph)
  - Authentication error fixes (Gaosheng)
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE3Fbyvv+648XNRdHTPe3zGtjzRgkFAmaMS6oACgkQPe3zGtjz
 Rgljeg//f47yJNR61V+zePH2prcTUdTA21z2xnEbd1IT40qkpbBIf2mS5R6loBzT
 fm8cGGgd3fdZ3qUnvaMExL//A4aaQCgoKmbtJHLv616KeCu8Iwy8GC0myG2gl05h
 +a96zy8b30FZjRE6H08EBt0JyRZUqjxVcFtvIq1ZlRu2xLWG7f2hg4S7sSKDPKw/
 SZWTeFcY1GRiUupb0tcgaJeqiQ0U7BPuyUuHGnuklMY60ePO5qgtw3D6dmRwLgj+
 pbcRzrp2OWbKCUH1JrW34Ku1gzDbOYFYLL04akAq4rIp4JzsXEgvs7rlyRn0PtDP
 V6gGAvIvb7ktMhD4GvXFQlqT29AoOCYmWr5w2xS6uXAoLZ+t3gK9VBKa+7oRBOeo
 hiZNJy/EdIIuF54nWaFpZ4FBKunNocp7w9BEYkuu/HBm5GW4m9mwLxL66BLLYgPj
 kuC2t4Nc/waO+SrZFrHErtdb+QZNW8IUWIRG3jXLjo6yipGrv+K6lZpOY3HCEXev
 7F0AAOVNFWW+nWv+mEVQkd5lCFrHDjbVX2rRC3z4saKJvDOh69pBaSCKyikZR0bO
 95wz3B//sF2STBa4b/570KMPHJTJfTkKRtaaZvkHPT/0IQi0kmuWM/yKy57Q7NPE
 Ehkk3hfWjLUHU7jNuI5wxky8un7GMZJKArFg3Q1rkQCQ8OxQUXM=
 =0en2
 -----END PGP SIGNATURE-----

Merge tag 'nvme-6.11-2024-07-08' of git://git.infradead.org/nvme into for-6.11/block

Pull NVMe updates from Keith:

"nvme updates for Linux 6.11

 - Device initialization memory leak fixes (Keith)
 - More constants defined (Weiwen)
 - Target debugfs support (Hannes)
 - PCIe subsystem reset enhancements (Keith)
 - Queue-depth multipath policy (Redhat and PureStorage)
 - Implement get_unique_id (Christoph)
 - Authentication error fixes (Gaosheng)"

* tag 'nvme-6.11-2024-07-08' of git://git.infradead.org/nvme: (21 commits)
  nvmet-auth: fix nvmet_auth hash error handling
  nvme: implement ->get_unique_id
  nvme-multipath: implement "queue-depth" iopolicy
  nvme-multipath: prepare for "queue-depth" iopolicy
  nvme-pci: do not directly handle subsys reset fallout
  lpfc_nvmet: implement 'host_traddr'
  nvme-fcloop: implement 'host_traddr'
  nvmet-fc: implement host_traddr()
  nvmet-rdma: implement host_traddr()
  nvmet-tcp: implement host_traddr()
  nvmet: add 'host_traddr' callback for debugfs
  nvmet: add debugfs support
  mailmap: add entry for Weiwen Hu
  nvme: rename CDR/MORE/DNR to NVME_STATUS_*
  nvme: fix status magic numbers
  nvme: rename nvme_sc_to_pr_err to nvme_status_to_pr_err
  nvme: split device add from initialization
  nvme: fc: split controller bringup handling
  nvme: rdma: split controller bringup handling
  nvme: tcp: split controller bringup handling
  ...
2024-07-08 23:57:02 -06:00
Gaosheng Cui
89f58f96d1 nvmet-auth: fix nvmet_auth hash error handling
If we fail to call nvme_auth_augmented_challenge, or fail to kmalloc
for shash, we should free the memory allocation for challenge, so add
err path out_free_challenge to fix the memory leak.

Fixes: 7a277c37d3 ("nvmet-auth: Diffie-Hellman key exchange support")
Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-07-08 10:28:16 -07:00
Christoph Hellwig
18f03a063d nvme: implement ->get_unique_id
Implement the get_unique_id method to allow pNFS SCSI layout access to
NVMe namespaces.

This is the server side implementation of RFC 9561 "Using the Parallel
NFS (pNFS) SCSI Layout to Access Non-Volatile Memory Express (NVMe)
Storage Devices".

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-07-08 10:25:39 -07:00
Damien Le Moal
f2a7bea237 block: Remove REQ_OP_ZONE_RESET_ALL emulation
Now that device mapper can handle resetting all zones of a mapped zoned
device using REQ_OP_ZONE_RESET_ALL, all zoned block device drivers
support this operation. With this, the request queue feature
BLK_FEAT_ZONE_RESETALL is not necessary and the emulation code in
blk-zone.c can be removed.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240704052816.623865-5-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-05 00:42:04 -06:00
Christoph Hellwig
f8924374fd block: call bio_integrity_unmap_free_user from blk_rq_unmap_user
blk_rq_unmap_user always unmaps user space pass-through request.  If such
a request has integrity data attached it must come from a user mapping
as well.  Call bio_integrity_unmap_free_user from blk_rq_unmap_user
and remove the nvme_unmap_bio wrapper in the nvme driver.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240702151047.1746127-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-03 10:21:16 -06:00
Christoph Hellwig
da042a3655 block: split integrity support out of bio.h
Split struct bio_integrity_payload and the related prototypes out of
bio.h into a separate bio-integrity.h header so that it is only pulled
in by the few places that need it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240702151047.1746127-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-03 10:21:15 -06:00
Jens Axboe
1a50d14670 Linux 6.10-rc6
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmaB0NweHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGkvwH/36UJRk/o6wvXnyH
 E6QjCSWo2226APyWks22NjtC3I/8Iqdvkneuh6wG0qL2sXAB078EMjUq5R81bF8H
 wWFBJwetjYTp8GEyLioMEb2wCH/J3R29dLFC4UYTplafXRGP6//xcpJaKmTxcgdR
 31IzvTPXbApZ7L3k1U6rA2bK9PNKcFCOvZlrNMUCuwMrabymHsDfOUt1DqXyg2xp
 zjqiWYBwlklozmgawSWt/mdEgkWuTcAbg+KyqDVQF59s9aj/OOwZ0j+HACq5V8CM
 quTPIAYL6CC9p7uxa69lGr/sgC0Is/BZLPX7RTZAwCgarGvnX+1HUsjDcaFCtrVg
 O6fPUV8=
 =pgUx
 -----END PGP SIGNATURE-----

Merge tag 'v6.10-rc6' into for-6.11/block-post

Pull in v6.10-rc6 to resolve a conflict for the integrity cleanups.

* tag 'v6.10-rc6': (778 commits)
  Linux 6.10-rc6
  ata: ahci: Clean up sysfs file on error
  ata: libata-core: Fix double free on error
  ata,scsi: libata-core: Do not leak memory for ata_port struct members
  ata: libata-core: Fix null pointer dereference on error
  x86-32: fix cmpxchg8b_emu build error with clang
  x86: stop playing stack games in profile_pc()
  i2c: testunit: discard write requests while old command is running
  i2c: testunit: don't erase registers after STOP
  tty: mxser: Remove __counted_by from mxser_board.ports[]
  randomize_kstack: Remove non-functional per-arch entropy filtering
  string: kunit: add missing MODULE_DESCRIPTION() macros
  ata: libata-core: Add ATA_HORKAGE_NOLPM for all Crucial BX SSD1 models
  MAINTAINERS: Update IOMMU tree location
  tools/power turbostat: Add local build_bug.h header for snapshot target
  tools/power turbostat: Fix unc freq columns not showing with '-q' or '-l'
  tools/power turbostat: option '-n' is ambiguous
  drm/drm_file: Fix pid refcounting race
  kallsyms: rework symbol lookup return codes
  gpiolib: cdev: Ignore reconfiguration without direction
  ...

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-03 10:20:05 -06:00
Thomas Song
f227345f0a nvme-multipath: implement "queue-depth" iopolicy
The round-robin path selector is inefficient in cases where there is a
difference in latency between paths.  In the presence of one or more
high latency paths the round-robin selector continues to use the high
latency path equally. This results in a bias towards the highest latency
path and can cause a significant decrease in overall performance as IOs
pile on the highest latency path. This problem is acute with NVMe-oF
controllers.

The queue-depth path selector sends I/O down the path with the lowest
number of requests in its request queue. Paths with lower latency will
clear requests more quickly and have less requests queued compared to
higher latency paths. The goal of this path selector is to make more use
of lower latency paths which will bring down overall IO latency and
increase throughput and performance.

Signed-off-by: Thomas Song <tsong@purestorage.com>
[emilne: commandeered patch developed by Thomas Song @ Pure Storage]
Co-developed-by: Ewan D. Milne <emilne@redhat.com>
Signed-off-by: Ewan D. Milne <emilne@redhat.com>
Co-developed-by: John Meneghini <jmeneghi@redhat.com>
Signed-off-by: John Meneghini <jmeneghi@redhat.com>
Link: https://lore.kernel.org/linux-nvme/20240509202929.831680-1-jmeneghi@redhat.com/
Tested-by: Marco Patalano <mpatalan@redhat.com>
Tested-by: Jyoti Rani <jrani@purestorage.com>
Tested-by: John Meneghini <jmeneghi@redhat.com>
Reviewed-by: Randy Jennings <randyj@purestorage.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-07-02 06:47:19 -07:00
Christoph Hellwig
f3bf25d513 nvme: don't set io_opt if NOWS is zero
NOWS is one of the annoying "0's based values" in NVMe, where 0 means one
and we thus can't detect if it isn't set.  Thus a NOWS value of 0 means
that the Namespace Optimal Write Size is a single LBA, which is clearly
bogus.  Ignore the value in that case and don't propagate an io_opt
value to the block layer.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20240701051800.1245240-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-01 06:52:42 -06:00
Nathan Chancellor
440e2051c5 nvmet-fc: Remove __counted_by from nvmet_fc_tgt_queue.fod[]
Work for __counted_by on generic pointers in structures (not just
flexible array members) has started landing in Clang 19 (current tip of
tree). During the development of this feature, a restriction was added
to __counted_by to prevent the flexible array member's element type from
including a flexible array member itself such as:

  struct foo {
    int count;
    char buf[];
  };

  struct bar {
    int count;
    struct foo data[] __counted_by(count);
  };

because the size of data cannot be calculated with the standard array
size formula:

  sizeof(struct foo) * count

This restriction was downgraded to a warning but due to CONFIG_WERROR,
it can still break the build. The application of __counted_by on the fod
member of 'struct nvmet_fc_tgt_queue' triggers this restriction,
resulting in:

  drivers/nvme/target/fc.c:151:2: error: 'counted_by' should not be applied to an array with element of unknown size because 'struct nvmet_fc_fcp_iod' is a struct type with a flexible array member. This will be an error in a future compiler version [-Werror,-Wbounds-safety-counted-by-elt-type-unknown-size]
    151 |         struct nvmet_fc_fcp_iod         fod[] __counted_by(sqsize);
        |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  1 error generated.

Remove this use of __counted_by to fix the warning/error. However,
rather than remove it altogether, leave it commented, as it may be
possible to support this in future compiler releases.

Cc: stable@vger.kernel.org
Closes: https://github.com/ClangBuiltLinux/linux/issues/2027
Fixes: ccd3129aca ("nvmet-fc: Annotate struct nvmet_fc_tgt_queue with __counted_by")
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-06-26 10:13:04 -07:00
John Meneghini
3d7c2fd2ea nvme-multipath: prepare for "queue-depth" iopolicy
This patch prepares for the introduction of a new iopolicy by breaking up
the nvme_find_path() code path into sub-routines.

Signed-off-by: John Meneghini <jmeneghi@redhat.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-06-26 09:45:57 -07:00
Mikulas Patocka
cf546dd289 block: change rq_integrity_vec to respect the iterator
If we allocate a bio that is larger than NVMe maximum request size,
attach integrity metadata to it and send it to the NVMe subsystem, the
integrity metadata will be corrupted.

Splitting the bio works correctly. The function bio_split will clone the
bio, trim the iterator of the first bio and advance the iterator of the
second bio.

However, the function rq_integrity_vec has a bug - it returns the first
vector of the bio's metadata and completely disregards the metadata
iterator that was advanced when the bio was split. Thus, the second bio
uses the same metadata as the first bio and this leads to metadata
corruption.

This commit changes rq_integrity_vec, so that it calls mp_bvec_iter_bvec
instead of returning the first vector. mp_bvec_iter_bvec reads the
iterator and uses it to build a bvec for the current position in the
iterator.

The "queue_max_integrity_segments(rq->q) > 1" check was removed, because
the updated rq_integrity_vec function works correctly with multiple
segments.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/49d1afaa-f934-6ed2-a678-e0d428c63a65@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-26 09:14:33 -06:00
Keith Busch
210b1f6576 nvme-pci: do not directly handle subsys reset fallout
Scheduling reset_work after a nvme subsystem reset is expected to fail
on pcie, but this also prevents potential handling the platform's pcie
services may provide that might successfully recovering the link without
re-enumeration. Such examples include AER, DPC, and power's EEH.

Provide a pci specific operation that safely initiates a subsystem
reset, and instead of scheduling reset work, read back the status
register to trigger a pcie read error.

Since this only affects pci, the other fabrics drivers subscribe to a
generic nvmf subsystem reset that is exactly the same as before. The
loop fabric doesn't use it because nvmet doesn't support setting that
property anyway.

And since we're using the magic NSSR value in two places now, provide a
symbolic define for it.

Reported-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-06-26 07:02:22 -07:00
Hannes Reinecke
bbb443e99c nvme-fcloop: implement 'host_traddr'
Implement the 'host_traddr' callback to display the host transport
address for nvmet debugfs.

Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-06-24 12:53:43 -07:00
Hannes Reinecke
99032e9dba nvmet-fc: implement host_traddr()
Implement callback to display the host transport address by
adding a callback 'host_traddr' for nvmet_fc_target_template.

Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-06-24 12:53:43 -07:00
Hannes Reinecke
c7ea20c3af nvmet-rdma: implement host_traddr()
Implement callback to display the host transport address.

Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-06-24 12:53:43 -07:00
Hannes Reinecke
b4bbe00d21 nvmet-tcp: implement host_traddr()
Implement callback to display the host transport address.

Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-06-24 12:53:43 -07:00
Hannes Reinecke
7e5c3de3f2 nvmet: add 'host_traddr' callback for debugfs
We want to display the transport address of the connected host
in debugfs, but this is a property of the transport.
So add a callback 'host_traddr' to allow the transport drivers
to fill in the data.

Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-06-24 12:53:43 -07:00
Hannes Reinecke
649fd41420 nvmet: add debugfs support
Add a debugfs hierarchy to display the configured subsystems
and the controllers attached to the subsystems.

Suggested-by: Redouane BOUFENGHOUR <redouane.boufenghour@shadow.tech>
Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-06-24 12:53:42 -07:00
Weiwen Hu
dd0b0a4a2c nvme: rename CDR/MORE/DNR to NVME_STATUS_*
CDR/MORE/DNR fields are not belonging to SC in the NVMe spec, rename
them to NVME_STATUS_* to avoid confusion.

Signed-off-by: Weiwen Hu <huweiwen@linux.alibaba.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-06-24 12:53:42 -07:00
Weiwen Hu
d89a5c6705 nvme: fix status magic numbers
Replaced some magic numbers about SC and SCT with enum and macro.

Signed-off-by: Weiwen Hu <huweiwen@linux.alibaba.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-06-24 12:53:42 -07:00
Weiwen Hu
22f19a584d nvme: rename nvme_sc_to_pr_err to nvme_status_to_pr_err
This should better match its semantic.  "sc" is used in the NVMe spec to
specifically refer to the last 8 bits in the status field. We should not
reuse "sc" here.

Signed-off-by: Weiwen Hu <huweiwen@linux.alibaba.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-06-24 12:53:42 -07:00
Keith Busch
1a9e218195 nvme: split device add from initialization
Combining both creates an ambiguous cleanup scenario for the caller if
an error is returned: does the device reference need to be dropped or
did the error occur before the device was initialized? If an error
occurs after the device is added, then the existing cleanup routines
will leak memory.

Furthermore, the nvme core is taking it upon itself to free the device's
kobj name under certain conditions rather than go through the core
device API. We shouldn't be peaking into these implementation details.

Split the device initialization from the addition to make it easier to
know the error handling actions, fix the existing memory leaks, and stop
the device layering violations.

Link: https://lore.kernel.org/linux-nvme/c4050a37-ecc9-462c-9772-65e25166f439@grimberg.me/
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-06-24 12:53:42 -07:00
Keith Busch
72cded7573 nvme: fc: split controller bringup handling
Drivers must call nvme_uninit_ctrl after a successful nvme_init_ctrl.
Split the allocation side out to make the error handling boundary easier
to navigate. The nvme fc driver's error handling had different returns
in the error goto label's, which harm readability.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-06-24 12:53:42 -07:00
Keith Busch
ea47c471a2 nvme: rdma: split controller bringup handling
Drivers must call nvme_uninit_ctrl after a successful nvme_init_ctrl.
Split the allocation side out to make the error handling boundary easier
to navigate. The nvme rdma driver's error handling had different returns
in the error goto label's, which harm readability.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-06-24 12:53:42 -07:00
Keith Busch
10fd7fb676 nvme: tcp: split controller bringup handling
Drivers must call nvme_uninit_ctrl after a successful nvme_init_ctrl.
Split the allocation side out to make the error handling boundary easier
to navigate. The nvme tcp driver's error handling had different returns
in the error goto label's, which harm readability.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-06-24 12:53:42 -07:00
Keith Busch
b9ecbfa455 nvme: apple: fix device reference counting
Drivers must call nvme_uninit_ctrl after a successful nvme_init_ctrl.
Split the allocation side out to make the error handling boundary easier
to navigate. The apple driver had been doing this wrong, leaking the
controller device memory on a tagset failure.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-06-24 12:53:41 -07:00
Hannes Reinecke
0f1f580392 nvmet: make 'tsas' attribute idempotent for RDMA
The RDMA transport defines values for TSAS, but it cannot be changed as
we only support the 'connected' mode.
So to avoid errors during reconfiguration we should allow to write the
current value.

Fixes: 3f123494db ("nvmet: make TCP sectype settable via configfs")
Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-06-21 08:49:10 -07:00
Alan Adamson
5f9bbea02f nvme: Atomic write support
Add support to set block layer request_queue atomic write limits. The
limits will be derived from either the namespace or controller atomic
parameters.

NVMe atomic-related parameters are grouped into "normal" and "power-fail"
(or PF) class of parameter. For atomic write support, only PF parameters
are of interest. The "normal" parameters are concerned with racing reads
and writes (which also applies to PF). See NVM Command Set Specification
Revision 1.0d section 2.1.4 for reference.

Whether to use per namespace or controller atomic parameters is decided by
NSFEAT bit 1 - see Figure 97: Identify – Identify Namespace Data
Structure, NVM Command Set.

NVMe namespaces may define an atomic boundary, whereby no atomic guarantees
are provided for a write which straddles this per-lba space boundary. The
block layer merging policy is such that no merges may occur in which the
resultant request would straddle such a boundary.

Unlike SCSI, NVMe specifies no granularity or alignment rules, apart from
atomic boundary rule. In addition, again unlike SCSI, there is no
dedicated atomic write command - a write which adheres to the atomic size
limit and boundary is implicitly atomic.

If NSFEAT bit 1 is set, the following parameters are of interest:
- NAWUPF (Namespace Atomic Write Unit Power Fail)
- NABSPF (Namespace Atomic Boundary Size Power Fail)
- NABO (Namespace Atomic Boundary Offset)

and we set request_queue limits as follows:
- atomic_write_unit_max = rounddown_pow_of_two(NAWUPF)
- atomic_write_max_bytes = NAWUPF
- atomic_write_boundary = NABSPF

If in the unlikely scenario that NABO is non-zero, then atomic writes will
not be supported at all as dealing with this adds extra complexity. This
policy may change in future.

In all cases, atomic_write_unit_min is set to the logical block size.

If NSFEAT bit 1 is unset, the following parameter is of interest:
- AWUPF (Atomic Write Unit Power Fail)

and we set request_queue limits as follows:
- atomic_write_unit_max = rounddown_pow_of_two(AWUPF)
- atomic_write_max_bytes = AWUPF
- atomic_write_boundary = 0

A new function, nvme_valid_atomic_write(), is also called from submission
path to verify that a request has been submitted to the driver will
actually be executed atomically. As mentioned, there is no dedicated NVMe
atomic write command (which may error for a command which exceeds the
controller atomic write limits).

Note on NABSPF:
There seems to be some vagueness in the spec as to whether NABSPF applies
for NSFEAT bit 1 being unset. Figure 97 does not explicitly mention NABSPF
and how it is affected by bit 1. However Figure 4 does tell to check Figure
97 for info about per-namespace parameters, which NABSPF is, so it is
implied. However currently nvme_update_disk_info() does check namespace
parameter NABO regardless of this bit.

Signed-off-by: Alan Adamson <alan.adamson@oracle.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
jpg: total rewrite
Signed-off-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20240620125359.2684798-11-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-20 15:19:17 -06:00
Jeff Johnson
5a5696a11f nvme-apple: add missing MODULE_DESCRIPTION()
make allmodconfig && make W=1 C=1 reports:
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/nvme/host/nvme-apple.o

Add the missing invocation of the MODULE_DESCRIPTION() macro.

Reviewed-by: Eric Curtin <ecurtin@redhat.com>
Reviewed-by: Sven Peter <sven@svenpeter.dev>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-06-20 10:21:09 -07:00
Christoph Hellwig
8c8f5c85b2 block: move the skip_tagset_quiesce flag to queue_limits
Move the skip_tagset_quiesce flag into the queue_limits feature field so
that it can be set atomically with the queue frozen.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20240617060532.127975-26-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19 07:58:28 -06:00
Christoph Hellwig
9c1e42e3c8 block: move the pci_p2pdma flag to queue_limits
Move the pci_p2pdma flag into the queue_limits feature field so that it
can be set atomically with the queue frozen.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20240617060532.127975-25-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19 07:58:28 -06:00
Christoph Hellwig
a52758a397 block: move the zone_resetall flag to queue_limits
Move the zone_resetall flag into the queue_limits feature field so that
it can be set atomically with the queue frozen.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20240617060532.127975-24-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19 07:58:28 -06:00
Christoph Hellwig
b1fc937a55 block: move the zoned flag into the features field
Move the zoned flags into the features field to reclaim a little
bit of space.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20240617060532.127975-23-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19 07:58:28 -06:00
Christoph Hellwig
8023e144f9 block: move the poll flag to queue_limits
Move the poll flag into the queue_limits feature field so that it can
be set atomically with the queue frozen.

Stacking drivers are simplified in that they now can simply set the
flag, and blk_stack_limits will clear it when the features is not
supported by any of the underlying devices.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20240617060532.127975-22-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19 07:58:28 -06:00
Christoph Hellwig
f76af42f8b block: move the nowait flag to queue_limits
Move the nowait flag into the queue_limits feature field so that it can
be set atomically with the queue frozen.

Stacking drivers are simplified in that they now can simply set the
flag, and blk_stack_limits will clear it when the features is not
supported by any of the underlying devices.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20240617060532.127975-20-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19 07:58:28 -06:00