Commit Graph

3912 Commits

Author SHA1 Message Date
Max Gurtovoy
f096ba3286 nvmet-rdma: set max_queue_size for RDMA transport
A new port configuration was added to set max_queue_size. Clamp user
configuration to RDMA transport limits.

Increase the maximal queue size of RDMA controllers from 128 to 256
(the default size stays 128 same as before).

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Israel Rukshin <israelr@nvidia.com>
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-02 15:18:08 -08:00
Max Gurtovoy
ca2b221d89 nvmet: introduce new max queue size configuration entry
Using this port configuration, one will be able to set the maximal queue
size to be used for any controller that will be associated to the
configured port.

The default value stayed 1024 but each transport will be able to set the
its own values before enabling the port.

Introduce lower limit of 16 for minimal queue depth (same as we use in
the host fabrics drivers).

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Israel Rukshin <israelr@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Guixin Liu <kanie@linux.alibaba.com>
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-02 15:18:08 -08:00
Max Gurtovoy
ad178ba9d9 nvme-rdma: clamp queue size according to ctrl cap
If a controller is configured with metadata support, clamp the maximal
queue size to be 128 since there are more resources that are needed
for metadata operations. Otherwise, clamp it to 256.

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Israel Rukshin <israelr@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-02 15:18:08 -08:00
Max Gurtovoy
3614496406 nvme-rdma: introduce NVME_RDMA_MAX_METADATA_QUEUE_SIZE definition
This definition will be used by controllers that are configured with
metadata support. For now, both regular and metadata controllers have
the same maximal queue size but later commit will increase the maximal
queue size for regular RDMA controllers to 256.
We'll keep the maximal queue size for metadata controllers to be 128
since there are more resources that are needed for metadata operations
and 128 is the optimal size found for metadata controllers base on
testing.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Israel Rukshin <israelr@nvidia.com>
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-02 15:18:08 -08:00
Max Gurtovoy
c82c370dca nvmet: set ctrl pi_support cap before initializing cap reg
This is a preparation for setting the maximal queue size of a controller
that supports PI.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Israel Rukshin <israelr@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-02 15:18:08 -08:00
Max Gurtovoy
63e8fd6240 nvmet: set maxcmd to be per controller
This is a preparation for having a dynamic configuration of max queue
size for a controller. Make sure that the maxcmd field stays the same as
the MQES (+1) value as we do today.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Israel Rukshin <israelr@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-02 15:18:08 -08:00
Max Gurtovoy
143667ee9a nvmet: compare mqes and sqsize only for IO SQ
According to the NVMe Spec:
"
MQES: This field indicates the maximum individual queue size that the
controller supports. For NVMe over PCIe implementations, this value
applies to the I/O Submission Queues and I/O Completion Queues that the
host creates. For NVMe over Fabrics implementations, this value applies
to only the I/O Submission Queues that the host creates.
"

Align the target code to compare mqes and sqsize as mentioned in the
NVMe Spec.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-02 15:18:08 -08:00
Christian Brauner
e9a7254b03
nvme: port block device access to file
Link: https://lore.kernel.org/r/20240123-vfs-bdev-file-v2-15-adbd023e19cc@kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-25 12:05:24 +01:00
Christoph Hellwig
74fa8f9c55 block: pass a queue_limits argument to blk_alloc_disk
Pass a queue_limits to blk_alloc_disk and apply it if non-NULL.  This
will allow allocating queues with valid queue limits instead of setting
the values one at a time later.

Also change blk_alloc_disk to return an ERR_PTR instead of just NULL
which can't distinguish errors.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Link: https://lore.kernel.org/r/20240215071055.2201424-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-19 16:58:23 -07:00
Chaitanya Kulkarni
8d30528a17 nvmet: remove superfluous initialization
Remove superfluous initialization of status variable in
nvmet_execute_admin_connect() and nvmet_execute_io_connect(), since it
will get overwritten by nvmet_copy_from_sgl().

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-02-13 15:42:44 -08:00
Keith Busch
29f6975332 nvme: implement support for relaxed effects
NVM Express TP4167 provides a way for controllers to report a relaxed
execution constraint. Specifically, it notifies of exclusivity for IO
vs. admin commands instead of grouping these together. If set, then we
don't need to freeze IO in order to execute that admin command. The
freezing distrupts IO processes, so it's nice to avoid that if the
controller tells us it's not necessary.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-02-13 08:46:09 -08:00
Chaitanya Kulkarni
79bd7eab83 nvme-fabrics: fix I/O connect error handling
In nvmf_connect_io_queue(), if connect I/O command fails, we log the
error and continue for authentication. This overrides error captured
from __nvme_submit_sync_cmd(), causing wrong return value.

Add goto out_free_data after logging connect error to fix the issue.

Fixes: f50fff73d6 ("nvme: implement In-Band authentication")
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-02-13 08:16:10 -08:00
Christoph Hellwig
27e32cd23f block: pass a queue_limits argument to blk_mq_alloc_disk
Pass a queue_limits to blk_mq_alloc_disk and apply it if non-NULL.  This
will allow allocating queues with valid queue limits instead of setting
the values one at a time later.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20240213073425.1621680-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-13 08:56:59 -07:00
Christoph Hellwig
9ac4dd8c47 block: pass a queue_limits argument to blk_mq_init_queue
Pass a queue_limits to blk_mq_init_queue and apply it if non-NULL.  This
will allow allocating queues with valid queue limits instead of setting
the values one at a time later.

Also rename the function to blk_mq_alloc_queue as that is a much better
name for a function that allocates a queue and always pass the queuedata
argument instead of having a separate version for the extra argument.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20240213073425.1621680-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-13 08:56:59 -07:00
Kanchan Joshi
921e81db52 nvme: allow integrity when PI is not in first bytes
NVM command set 1.0 (or later) mandates PI to be in the last bytes of
metadata. But this was not supported in the block-layer, and driver
registered a nop profile.

Since block-integrity can now handle flexible PI offset, change the
driver to support this configuration.

Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240201130126.211402-4-joshi.k@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-12 08:49:31 -07:00
Johannes Thumshirn
71f4ecdbb4 block: remove gfp_flags from blkdev_zone_mgmt
Now that all callers pass in GFP_KERNEL to blkdev_zone_mgmt() and use
memalloc_no{io,fs}_{save,restore}() to define the allocation scope, we can
drop the gfp_mask parameter from blkdev_zone_mgmt() as well as
blkdev_zone_reset_all() and blkdev_zone_reset_all_emulated().

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Link: https://lore.kernel.org/r/20240128-zonefs_nofs-v3-5-ae3b7c8def61@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-12 08:41:16 -07:00
Francis Pravin
4054705215 nvme: use ns->head->pi_size instead of t10_pi_tuple structure size
Currently kernel supports 8 byte and 16 byte protection information.
So, use ns->head->pi_size instead of sizeof(struct t10_pi_tuple).

Signed-off-by: Francis Pravin <francis.p@samsung.com>
Signed-off-by: Sathyavathi M <sathya.m@samsung.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-02-07 15:49:36 -08:00
Chaitanya Kulkarni
e8c263ed6d nvme-core: fix comment to reflect right functions
The functions and the attribute listed in the comment doesn't exists in
the code, (ns->logging_enabled, nvme_passthru_err_log_enabled_store()
and nvme_passthru_err_log_enabled_show())

Update the comment with right function names and a comment
ns->head->passthru_err_log_enabled,
nvme_io_passthru_err_log_enabled_store() and
nvme_io_passthru_err_log_enabled_show().

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Alan Adamson <alan.adamson@oracle.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-02-07 10:37:05 -08:00
Keith Busch
1f4137e882 nvme: move passthrough logging attribute to head
The namespace does not have attributes, but the head does. Move the new
logging attribute to that structure instead of dereferencing the wrong
type.

And while we're here, fix the reverse-tree coding style.

Fixes: 9f079dda14 ("nvme: allow passthru cmd error logging")
Reported-by: Tasmiya Nalatwad <tasmiya@linux.vnet.ibm.com>
Tested-by: Tasmiya Nalatwad <tasmiya@linux.vnet.ibm.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Alan Adamson <alan.adamson@oracle.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-02-07 10:30:47 -08:00
Maurizio Lombardi
f0377ff975 nvme-host: fix the updating of the firmware version
The original code didn't update the firmware version if the
"next slot" of the AFI register isn't zero or if the
"current slot" field is zero; in those cases it assumed
that a reset was needed.

However, the NVMe specification doesn't exclude the possibility that
the "next slot" value is equal to the "current slot" value,
meaning that the same firmware slot will be activated after performing
a controller level reset; in this case a reset is clearly not
necessary and we can safely update the firmware version.

Modify the code so the kernel will report that a Controller Level Reset
is needed only in the following cases:

1) If the "current slot" field is zero. This is invalid and means that
   something is wrong, a reset is needed.

or

2) if the "next slot" field isn't zero AND it's not equal to the
   "current slot" value. This means that at the next reset a different
   firmware slot will be activated.

Fixes: 983a338b96 ("nvme: update firmware version after commit")
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-02-01 09:21:32 -08:00
Alan Adamson
9f079dda14 nvme: allow passthru cmd error logging
Commit d7ac8dca93 ("nvme: quiet user passthrough command errors")
disabled error logging for user passthrough commands.  This commit
adds the ability to opt-in to passthrough admin error logging. IO
commands initiated as passthrough will always be logged.

The logging output for passthrough commands (Admin and IO) has been
changed to include CDWXX fields.

nvme0n1: Read(0x2), LBA Out of Range (sct 0x0 / sc 0x80) DNR cdw10=0x0 cdw11=0x1
        cdw12=0x70000 cdw13=0x0 cdw14=0x0 cdw15=0x0

Add a helper function nvme_log_err_passthru() which allows us to log
error for passthru commands by decoding cdw10-cdw15 values of nvme
command.

Add a new sysfs attr passthru_err_log_enabled that allows user to conditionally
enable passthrough command logging for either passthrough Admin commands sent to
the controller or passthrough IO commands sent to a namespace.

By default, passthrough error logging is disabled.

To enable passthrough admin error logging:
        echo 1 > /sys/class/nvme/nvme0/passthru_err_log_enabled

To disable passthrough admin error logging:
        echo 0 > /sys/class/nvme/nvme0/passthru_err_log_enabled

To enable passthrough io error logging:
        echo 1 > /sys/class/nvme/nvme0/nvme0n1/passthru_err_log_enabled

To disable passthrough io error logging:
        echo 0 > /sys/class/nvme/nvme0/nvme0n1/passthru_err_log_enabled

Signed-off-by: Alan Adamson <alan.adamson@oracle.com>
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-02-01 07:44:53 -08:00
Nitin U. Yewale
c3a846fe4b nvme-fc: show hostnqn when connecting to fc target
Log hostnqn when connecting to nvme target.
As hostnqn could be changed, logging this information
in syslog at appropriate time may help in troubleshooting.

Signed-off-by: Nitin U. Yewale <nyewale@redhat.com>
Reviewed-by: John Meneghini <jmeneghi@redhat.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-02-01 07:44:52 -08:00
Nitin U. Yewale
d2045e6a4e nvme-rdma: show hostnqn when connecting to rdma target
Log hostnqn when connecting to nvme target.
As hostnqn could be changed, logging this information
in syslog at appropriate time may help in troubleshooting.

Signed-off-by: Nitin U. Yewale <nyewale@redhat.com>
Reviewed-by: John Meneghini <jmeneghi@redhat.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-02-01 07:44:52 -08:00
Nitin U. Yewale
524719b4c6 nvme-tcp: show hostnqn when connecting to tcp target
Log hostnqn when connecting to nvme target.
As hostnqn could be changed, logging this information
in syslog at appropriate time may help in troubleshooting.

Signed-off-by: Nitin U. Yewale <nyewale@redhat.com>
Reviewed-by: John Meneghini <jmeneghi@redhat.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-02-01 07:44:52 -08:00
Daniel Wagner
a90ac7b348 nvmet-fc: use RCU list iterator for assoc_list
The assoc_list is a RCU protected list, thus use the RCU flavor of list
functions.

Let's use this opportunity and refactor this code and move the lookup
into a helper and give it a descriptive name.

Signed-off-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-02-01 07:44:52 -08:00
Daniel Wagner
fe506a7458 nvmet-fc: take ref count on tgtport before delete assoc
We have to ensure that the tgtport is not going away
before be have remove all the associations.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-02-01 07:44:52 -08:00
Daniel Wagner
710c69dbac nvmet-fc: avoid deadlock on delete association path
When deleting an association the shutdown path is deadlocking because we
try to flush the nvmet_wq nested. Avoid this by deadlock by deferring
the put work into its own work item.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-02-01 07:44:52 -08:00
Daniel Wagner
3146345c2e nvmet-fc: abort command when there is no binding
When the target port has not active port binding, there is no point in
trying to process the command as it has to fail anyway. Instead adding
checks to all commands abort the command early.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-02-01 07:44:51 -08:00
Daniel Wagner
1c110588dd nvmet-fc: do not tack refs on tgtports from assoc
The association life time is tied to the life time of the target port.
That means we should not take extra a refcount when creating a
association.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-02-01 07:44:51 -08:00
Daniel Wagner
50b474e1fa nvmet-fc: remove null hostport pointer check
An association has always a valid hostport pointer. Remove useless
null pointer check.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-02-01 07:44:51 -08:00
Daniel Wagner
ca121a0f75 nvmet-fc: hold reference on hostport match
The hostport data structure is shared between the association, this why
we keep track of the users via a refcount. So we should not decrement
the refcount on a match and free the hostport several times.

Reported by KASAN.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-02-01 07:44:51 -08:00
Daniel Wagner
c5e27b1a77 nvmet-fc: free queue and assoc directly
Neither struct nvmet_fc_tgt_queue nor struct nvmet_fc_tgt_assoc are data
structure which are used in a RCU context. So there is no reason to
delay the free operation.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-02-01 07:44:51 -08:00
Daniel Wagner
4049dc96b8 nvmet-fc: defer cleanup using RCU properly
When the target executes a disconnect and the host triggers a reconnect
immediately, the reconnect command still finds an existing association.

The reconnect crashes later on because nvmet_fc_delete_target_assoc
blindly removes resources while the reconnect code wants to use it.

To address this, nvmet_fc_find_target_assoc should not be able to
lookup an association which is being removed. The association list
is already under RCU lifetime management, so let's properly use it
and remove the association from the list and wait for a grace period
before cleaning up all. This means we also can drop the RCU management
on the queues, because this is now handled via the association itself.

A second step split the execution context so that the initial disconnect
command can complete without running the reconnect code in the same
context. As usual, this is done by deferring the ->done to a workqueue.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-02-01 07:44:46 -08:00
Daniel Wagner
c691e6d7e1 nvmet-fc: release reference on target port
In case we return early out of __nvmet_fc_finish_ls_req() we still have
to release the reference on the target port.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-02-01 07:44:46 -08:00
Daniel Wagner
dcfad4ab4d nvmet-fcloop: swap the list_add_tail arguments
The first argument of list_add_tail function is the new element which
should be added to the list which is the second argument. Swap the
arguments to allow processing more than one element at a time.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-02-01 07:44:41 -08:00
Daniel Wagner
70fbfc47a3 nvme-fc: do not wait in vain when unloading module
The module exit path has race between deleting all controllers and
freeing 'left over IDs'. To prevent double free a synchronization
between nvme_delete_ctrl and ida_destroy has been added by the initial
commit.

There is some logic around trying to prevent from hanging forever in
wait_for_completion, though it does not handling all cases. E.g.
blktests is able to reproduce the situation where the module unload
hangs forever.

If we completely rely on the cleanup code executed from the
nvme_delete_ctrl path, all IDs will be freed eventually. This makes
calling ida_destroy unnecessary. We only have to ensure that all
nvme_delete_ctrl code has been executed before we leave
nvme_fc_exit_module. This is done by flushing the nvme_delete_wq
workqueue.

While at it, remove the unused nvme_fc_wq workqueue too.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-02-01 07:44:32 -08:00
Caleb Sander
0d150beff2 nvme-fc: log human-readable opcode on timeout
The fc transport logs the opcode and fctype on command timeout.
This is sufficient information to identify the command issued,
but not very human-readable. Use the nvme_fabrics_opcode_str()
helper to also log the name of the command, as rdma and tcp already do.

Signed-off-by: Caleb Sander <csander@purestorage.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-31 17:00:45 -08:00
Caleb Sander
7d23e836b0 nvme: split out fabrics version of nvme_opcode_str()
nvme_opcode_str() currently supports admin, IO, and fabrics commands.
However, fabrics commands aren't allowed for the pci transport.
Currently the pci caller passes 0 as the fctype,
which means any fabrics command would be displayed as "Property Set".

Move fabrics command support into a function nvme_fabrics_opcode_str()
and remove the fctype argument to nvme_opcode_str().
This way, a fabrics command will display as "Unknown" for pci.
Convert the rdma and tcp transports to use nvme_fabrics_opcode_str().

Signed-off-by: Caleb Sander <csander@purestorage.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-31 17:00:45 -08:00
Caleb Sander
6f9a71c611 nvme: remove redundant status mask
In nvme_get_error_status_str(), the status code is already masked
with 0x7ff at the beginning of the function.
Don't bother masking it again when indexing nvme_statuses.

Signed-off-by: Caleb Sander <csander@purestorage.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-31 16:06:12 -08:00
Caleb Sander
4b6821940e nvme: return string as char *, not unsigned char *
The functions in drivers/nvme/host/constants.c returning human-readable
status and opcode strings currently use type "const unsigned char *".
Typically string constants use type "const char *",
so remove "unsigned" from the return types.
This is a purely cosmetic change to clarify that the functions
return text strings instead of an array of bytes, for example.

Signed-off-by: Caleb Sander <csander@purestorage.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-31 16:06:12 -08:00
Chaitanya Kulkarni
0945b43b4e nvme-common: add module description
Add MODULE_DESCRIPTION() in order to remove warnings & get clean build:-

WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/nvme/common/nvme-auth.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/nvme/common/nvme-keyring.o

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-31 16:03:49 -08:00
Hannes Reinecke
48dae46676 nvme: enable retries for authentication commands
Authentication commands might trigger a lengthy computation on the
controller or even a callout to an external entity.
In these cases the controller might return a status without the DNR
bit set, indicating that the command should be retried.
This patch enables retries for authentication commands  by setting
NVME_SUBMIT_RETRY for __nvme_submit_sync_cmd().

Reported-by: Martin George <marting@netapp.com>
Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-31 15:59:48 -08:00
Hannes Reinecke
bd2687f2e5 nvme: change __nvme_submit_sync_cmd() calling conventions
Combine the two arguments 'flags' and 'at_head' from __nvme_submit_sync_cmd()
into a single 'flags' argument and use function-specific values to indicate
what should be set within the function.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-31 15:59:48 -08:00
Hannes Reinecke
f9ddefb6c0 nvme-auth: open-code single-use macros
No point in having macros just for a single function nvme_auth_submit().
Open-code them into the caller.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-31 15:59:48 -08:00
Keith Busch
6d3c7fb17b nvme: use ctrl state accessor
The ctrl->state value is updated in another thread using WRITE_ONCE, so
ensure all the readers use the appropriate accessor.

Reviewed-by: Sagi Grimberg <sagi@grmberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-29 07:02:50 -08:00
Guixin Liu
47c5dd66c1 nvmet-tcp: fix nvme tcp ida memory leak
The nvmet_tcp_queue_ida should be destroy when the nvmet-tcp module
exit.

Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-26 09:16:34 -08:00
Israel Rukshin
15ade5bfa5 nvme-rdma: Fix transfer length when write_generate/read_verify are 0
When the block layer doesn't generate/verify metadata, the SG length is
smaller than the transfer length. This is because the SG length doesn't
include the metadata length that is added by the HW on the wire. The
target failes those commands with "Data SGL Length Invalid" by comparing
the transfer length and the SG length. Fix it by adding the metadata
length to the transfer length when there is no metadata SGL. The bug
reproduces when setting read_verify/write_generate configs to 0 at the
child multipath device or at the primary device when NVMe multipath is
disabled.

Note that setting those configs to 0 on the multipath device (ns_head)
doesn't have any impact on the I/Os.

Fixes: 5ec5d3bddc ("nvme-rdma: add metadata/T10-PI support")
Signed-off-by: Israel Rukshin <israelr@nvidia.com>
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-24 07:29:34 -08:00
Chaitanya Kulkarni
41951f83ef nvmet: add module description to stop warnings
Add MODULE_DESCRIPTION() in order to remove warnings & get clean build:-

WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/nvme/target/nvmet.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/nvme/target/nvme-loop.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/nvme/target/nvmet-rdma.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/nvme/target/nvmet-fc.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/nvme/target/nvme-fcloop.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/nvme/target/nvmet-tcp.o

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-24 07:25:04 -08:00
Chaitanya Kulkarni
92b0b0ff0b nvme: add module description to stop warnings
Add MODULE_DESCRIPTION() in order to remove warnings & get clean build:-

WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/nvme/host/nvme-core.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/nvme/host/nvme.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/nvme/host/nvme-fabrics.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/nvme/host/nvme-rdma.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/nvme/host/nvme-fc.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/nvme/host/nvme-tcp.o

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-24 07:25:01 -08:00
Guixin Liu
41353fbad4 nvmet: unify aer type enum
The host and target use two definition of aer type, unify
them into a single one.

Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-23 08:29:08 -08:00
Linus Torvalds
9d1694dc91 for-6.8/block-2024-01-18
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmWpoCgQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpqIUEADFvJdC2izkPzYzsOrMK5Rt1H7vaHGKhbA+
 zWCuQaa1xQd8bazq+NVnQpbzgclkE/WodTCNfNXcTTjzeQEmcZC888llP3Y9vwyP
 XfEKH7fSaeKvGigJLro1oPe3YV7/t89F5ol3BoZayfzJF8GEU9BXRWzgOkZzijnk
 xdm5wUyn/GknksMuQQraZ+U6bQRFLBOulzoaQeMD6Dosx+uRlM4WvAJawC+uOV6R
 qPT2BVSfYGzmgEKvoaphw0FMkUhFBMDHfXTpQBi5tIzTKOaof8tynYEGz0FHZWeh
 V0JEEp+3jLWFxFXeEcXgBVPJPE8J0DzGm9g17/uwC2Yhmlbw4FKZVRvGG+PpeUso
 D5aqhqm3w0x7HgZ7JKwy/aUctADYvjVcSVzPHTaFK0aCSYCIAXxqv4p7fOoxPqyx
 T32IUHTzGtkCdqzv/xFdtTYhTNM2vyzzbbWj5lXgCBqHsXOVbCh8UM2p+9ec2Umq
 Fo1XF9eoCDe6Sn4s15hJ5G4DEhKGOKkHluvRUdM+0selA5b0sNOeUqlAf2v+0ve3
 Pv3e3X4NPssNIEcsDHf5pc3zGC+LXRS0oFvfIvDESBjwXc3iHIMl+SkjyS57P4Fd
 RKrHEUUiACuCKO/IWqFYLiNBNHnP3RmV5gSxIZr9QJhFSwOzP+/+4++TCdF5vdAV
 amhv+0PdCw==
 =DLW9
 -----END PGP SIGNATURE-----

Merge tag 'for-6.8/block-2024-01-18' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:

 - NVMe pull request via Keith:
      - tcp, fc, and rdma target fixes (Maurizio, Daniel, Hannes,
        Christoph)
      - discard fixes and improvements (Christoph)
      - timeout debug improvements (Keith, Max)
      - various cleanups (Daniel, Max, Giuxen)
      - trace event string fixes (Arnd)
      - shadow doorbell setup on reset fix (William)
      - a write zeroes quirk for SK Hynix (Jim)

 - MD pull request via Song:
      - Sparse warning since v6.0 (Bart)
      - /proc/mdstat regression since v6.7 (Yu Kuai)

 - Use symbolic error value (Christian)

 - IO Priority documentation update (Christian)

 - Fix for accessing queue limits without having entered the queue
   (Christoph, me)

 - Fix for loop dio support (Christoph)

 - Move null_blk off deprecated ida interface (Christophe)

 - Ensure nbd initializes full msghdr (Eric)

 - Fix for a regression with the folio conversion, which is now easier
   to hit because of an unrelated change (Matthew)

 - Remove redundant check in virtio-blk (Li)

 - Fix for a potential hang in sbitmap (Ming)

 - Fix for partial zone appending (Damien)

 - Misc changes and fixes (Bart, me, Kemeng, Dmitry)

* tag 'for-6.8/block-2024-01-18' of git://git.kernel.dk/linux: (45 commits)
  Documentation: block: ioprio: Update schedulers
  loop: fix the the direct I/O support check when used on top of block devices
  blk-mq: Remove the hctx 'run' debugfs attribute
  nbd: always initialize struct msghdr completely
  block: Fix iterating over an empty bio with bio_for_each_folio_all
  block: bio-integrity: fix kcalloc() arguments order
  virtio_blk: remove duplicate check if queue is broken in virtblk_done
  sbitmap: remove stale comment in sbq_calc_wake_batch
  block: Correct a documentation comment in blk-cgroup.c
  null_blk: Remove usage of the deprecated ida_simple_xx() API
  block: ensure we hold a queue reference when using queue limits
  blk-mq: rename blk_mq_can_use_cached_rq
  block: print symbolic error name instead of error code
  blk-mq: fix IO hang from sbitmap wakeup race
  nvmet-rdma: avoid circular locking dependency on install_queue()
  nvmet-tcp: avoid circular locking dependency on install_queue()
  nvme-pci: set doorbell config before unquiescing
  block: fix partial zone append completion handling in req_bio_endio()
  block/iocost: silence warning on 'last_period' potentially being unused
  md/raid1: Use blk_opf_t for read and write operations
  ...
2024-01-18 18:22:40 -08:00
Linus Torvalds
4c72e2b8c4 for-6.8/io_uring-2024-01-08
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmWcOk0QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpq2wEACgP1Gvfb2cR65B/6tkLJ6neKnYOPVSYx9F
 4Mv6KS306vmOsE67LqynhtAe/Hgfv0NKADN+mKLLV8IdwhneLAwFxRoHl8QPO2dW
 DIxohDMLIqzY/SOUBqBHMS8b5lEr79UVh8vXjAuuYCBTcHBndi4hj0S0zkwf5YiK
 Vma6ty+TnzmOv+c6+OCgZZlwFhkxD1pQBM5iOlFRAkdt3Er/2e8FqthK0/od7Fgy
 Ii9xdxYZU9KTsM4R+IWhqZ1rj1qUdtMPSGzFRbzFt3b6NdANRZT9MUlxvSkuHPIE
 P/9anpmgu41gd2JbHuyVnw+4jdZMhhZUPUYtphmQ5n35rcFZjv1gnJalH/Nw/DTE
 78bDxxP0+35bkuq5MwHfvA3imVsGnvnFx5MlZBoqvv+VK4S3Q6E2+ARQZdiG+2Is
 8Uyjzzt0nq2waUO3H1JLkgiM9J9LFqaYQLE68u569m4NCfRSpoZva9KVmNpre2K0
 9WdGy2gfzaYAQkGud2TULzeLiEbWfvZAL0o43jYXkVPRKmnffCEphf2X6C2IDV/3
 5ZR4bandRRC6DVnE+8n1MB06AYtpJPB/w3rplON+l/V5Gnb7wRNwiUrBr/F15OOy
 OPbAnP6k56wY/JRpgzdbNqwGi8EvGWX6t3Kqjp2mczhs2as8M193FKS2xppl1TYl
 BPyTsAfbdQ==
 =5v/U
 -----END PGP SIGNATURE-----

Merge tag 'for-6.8/io_uring-2024-01-08' of git://git.kernel.dk/linux

Pull io_uring updates from Jens Axboe:
 "Mostly just come fixes and cleanups, but one feature as well. In
  detail:

   - Harden the check for handling IOPOLL based on return (Pavel)

   - Various minor optimizations (Pavel)

   - Drop remnants of SCM_RIGHTS fd passing support, now that it's no
     longer supported since 6.7 (me)

   - Fix for a case where bytes_done wasn't initialized properly on a
     failure condition for read/write requests (me)

   - Move the register related code to a separate file (me)

   - Add support for returning the provided ring buffer head (me)

   - Add support for adding a direct descriptor to the normal file table
     (me, Christian Brauner)

   - Fix for ensuring pending task_work for a ring with DEFER_TASKRUN is
     run even if we timeout waiting (me)"

* tag 'for-6.8/io_uring-2024-01-08' of git://git.kernel.dk/linux:
  io_uring: ensure local task_work is run on wait timeout
  io_uring/kbuf: add method for returning provided buffer ring head
  io_uring/rw: ensure io->bytes_done is always initialized
  io_uring: drop any code related to SCM_RIGHTS
  io_uring/unix: drop usage of io_uring socket
  io_uring/register: move io_uring_register(2) related code to register.c
  io_uring/openclose: add support for IORING_OP_FIXED_FD_INSTALL
  io_uring/cmd: inline io_uring_cmd_get_task
  io_uring/cmd: inline io_uring_cmd_do_in_task_lazy
  io_uring: split out cmd api into a separate header
  io_uring: optimise ltimeout for inline execution
  io_uring: don't check iopoll if request completes
2024-01-11 14:19:23 -08:00
Linus Torvalds
01d550f0fc for-6.8/block-2024-01-08
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmWcIOIQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpn6hD/9oO7U75PuxUwYYHZ9Uzxpw6gQ0LEmeyJmE
 NQYCkfYHVq3IsgOdF7elI9v3qtr6v8V8CdB7cByrnn3DgwsMuiTKZZ0dK7vH37PO
 DX+/xn349e8oH7RdRo7f3m95g1YbHfpfnj0Rc4mjTDV72Jr/HlLTVgGTQg8DEnCR
 wBIFmeuBHHgeeLh87gsWLAP7ReReiy9V1uqpDFsko2/4BxRAM/8eedkwcAxD8aEy
 rd+dT/SBQj2cOdQMUeExT3gWjwzHh6ZHx3f1WCLK5fdck6BogH2hBUeri6F/H98L
 HoaXjBZYBTH68hB/mnO5I4g1ZlrVM74Vp7JPa3e1SFFtyEi6lsyrk2J3GoNh0E7r
 pXqH5kAcaJwBsBrbRGuvEyGbn9RLTaN5Gvseud0VE4oMruyodTniQaHXuIGackgz
 sMavMho4486EUWPaF7gIBdLNK1hO13w+IDZ4+3oBxhudMqdgZbk4iYpOCqQ7QY5G
 2vkzAE/sZ+aVNXeaIQOI8dE5clBy8gJ+6+t8dm3DY1r1xdbcnU40iZ8/fri3h69r
 vHs9bpQnVWZF0gEyEflY1pkcAPpIkvMmWCR7Ehy5YCkIfa+qfSL05o3dicpWovLP
 N+gCtpkhTK2AvmUWsUMypMLRvoSOImyCIiobrr3qNBaUdgRP8xKfUa72RuRp8cGl
 Vrj5oAiE3w==
 =YAfp
 -----END PGP SIGNATURE-----

Merge tag 'for-6.8/block-2024-01-08' of git://git.kernel.dk/linux

Pull block updates from Jens Axboe:
 "Pretty quiet round this time around. This contains:

   - NVMe updates via Keith:
        - nvme fabrics spec updates (Guixin, Max)
        - nvme target udpates (Guixin, Evan)
        - nvme attribute refactoring (Daniel)
        - nvme-fc numa fix (Keith)

   - MD updates via Song:
        - Fix/Cleanup RCU usage from conf->disks[i].rdev (Yu Kuai)
        - Fix raid5 hang issue (Junxiao Bi)
        - Add Yu Kuai as Reviewer of the md subsystem
        - Remove deprecated flavors (Song Liu)
        - raid1 read error check support (Li Nan)
        - Better handle events off-by-1 case (Alex Lyakas)

   - Efficiency improvements for passthrough (Kundan)

   - Support for mapping integrity data directly (Keith)

   - Zoned write fix (Damien)

   - rnbd fixes (Kees, Santosh, Supriti)

   - Default to a sane discard size granularity (Christoph)

   - Make the default max transfer size naming less confusing
     (Christoph)

   - Remove support for deprecated host aware zoned model (Christoph)

   - Misc fixes (me, Li, Matthew, Min, Ming, Randy, liyouhong, Daniel,
     Bart, Christoph)"

* tag 'for-6.8/block-2024-01-08' of git://git.kernel.dk/linux: (78 commits)
  block: Treat sequential write preferred zone type as invalid
  block: remove disk_clear_zoned
  sd: remove the !ZBC && blk_queue_is_zoned case in sd_read_block_characteristics
  drivers/block/xen-blkback/common.h: Fix spelling typo in comment
  blk-cgroup: fix rcu lockdep warning in blkg_lookup()
  blk-cgroup: don't use removal safe list iterators
  block: floor the discard granularity to the physical block size
  mtd_blkdevs: use the default discard granularity
  bcache: use the default discard granularity
  zram: use the default discard granularity
  null_blk: use the default discard granularity
  nbd: use the default discard granularity
  ubd: use the default discard granularity
  block: default the discard granularity to sector size
  bcache: discard_granularity should not be smaller than a sector
  block: remove two comments in bio_split_discard
  block: rename and document BLK_DEF_MAX_SECTORS
  loop: don't abuse BLK_DEF_MAX_SECTORS
  aoe: don't abuse BLK_DEF_MAX_SECTORS
  null_blk: don't cap max_hw_sectors to BLK_DEF_MAX_SECTORS
  ...
2024-01-11 13:58:04 -08:00
Hannes Reinecke
31deaeb11b nvmet-rdma: avoid circular locking dependency on install_queue()
nvmet_rdma_install_queue() is driven from the ->io_work workqueue
function, but will call flush_workqueue() which might trigger
->release_work() which in itself calls flush_work on ->io_work.

To avoid that check for pending queue in disconnecting status,
and return 'controller busy' when we reached a certain threshold.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-10 13:27:45 -08:00
Hannes Reinecke
07a29b134c nvmet-tcp: avoid circular locking dependency on install_queue()
nvmet_tcp_install_queue() is driven from the ->io_work workqueue
function, but will call flush_workqueue() which might trigger
->release_work() which in itself calls flush_work on ->io_work.

To avoid that check for pending queue in disconnecting status,
and return 'controller busy' when we reached a certain threshold.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-10 13:27:32 -08:00
Linus Torvalds
120a201bd2 hardening updates for v6.8-rc1
- Introduce the param_unknown_fn type and other clean ups (Andy Shevchenko)
 
 - Various __counted_by annotations (Christophe JAILLET, Gustavo A. R. Silva,
   Kees Cook)
 
 - Add KFENCE test to LKDTM (Stephen Boyd)
 
 - Various strncpy() refactorings (Justin Stitt)
 
 - Fix qnx4 to avoid writing into the smaller of two overlapping buffers
 
 - Various strlcpy() refactorings
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmWcOsQWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJoiDD/9gNhalNG+6MNF5TDwSvO9X7pvL
 bQ6D3clByRxYjnJ4dMQ7p3s+rJ937uQt9PezIWHgRoldjQy3x7AJ5BxkhjeMlD2B
 YLbfdVYPy09X0Ewk1Efvfm/ta6tJpBGYF7Bc7LIneZrdQ6gemBpLW1PNZAFYzcWX
 oDjV+M1NytxaiF0aebxPZvZ1W+NGQ105Sxvj5MheDoezyO/j0CTe+ZYtCzFguFY0
 8SPpR5FG4AFidb8GHd5Ndv0trVWjF1jat0FUFgEFOCE0fJNWLVR0Bbr2MtXiG7wL
 LF7IZ/Mn+mi+O3BmcD6JiaYf9EPlMUXCyqc8NvsnoWGqhWhWmQPCInZVrpplMUNK
 V/UHVMkmjDs4f/lAHBJoJHDK6fmOD+cAFaNMOltfErcjV4s+lEo6vHoiKl8hfPnH
 EzpQaK3funGroVYwTc35e07NrJJHCzqIUhZ0FJO7ByuOE2tIomiVo9Xy9gy54iCT
 qzC7zkrZ0MKqui4qiUY9FWayRRYLX4qNxELm4yie6Pzmk8943hNOaDofcyKWuZFC
 eqvhIkvqb4LasLrzCBk+ehA2KWSRmTrR6E9IygwbBXUTsvn2yj2RRYeAlGQNBTBZ
 adgSXQpRBmtKYqyihWLhP4QcunknEiQdDS3lS2qJmPH33Iv3jGH4yS6BNIBufMGL
 PoC2UxSfGd+YT079fw==
 =1Wxx
 -----END PGP SIGNATURE-----

Merge tag 'hardening-v6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull hardening updates from Kees Cook:

 - Introduce the param_unknown_fn type and other clean ups (Andy
   Shevchenko)

 - Various __counted_by annotations (Christophe JAILLET, Gustavo A. R.
   Silva, Kees Cook)

 - Add KFENCE test to LKDTM (Stephen Boyd)

 - Various strncpy() refactorings (Justin Stitt)

 - Fix qnx4 to avoid writing into the smaller of two overlapping buffers

 - Various strlcpy() refactorings

* tag 'hardening-v6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  qnx4: Use get_directory_fname() in qnx4_match()
  qnx4: Extract dir entry filename processing into helper
  atags_proc: Add __counted_by for struct buffer and use struct_size()
  tracing/uprobe: Replace strlcpy() with strscpy()
  params: Fix multi-line comment style
  params: Sort headers
  params: Use size_add() for kmalloc()
  params: Do not go over the limit when getting the string length
  params: Introduce the param_unknown_fn type
  lkdtm: Add kfence read after free crash type
  nvme-fc: replace deprecated strncpy with strscpy
  nvdimm/btt: replace deprecated strncpy with strscpy
  nvme-fabrics: replace deprecated strncpy with strscpy
  drm/modes: replace deprecated strncpy with strscpy_pad
  afs: Add __counted_by for struct afs_acl and use struct_size()
  VMCI: Annotate struct vmci_handle_arr with __counted_by
  i40e: Annotate struct i40e_qvlist_info with __counted_by
  HID: uhid: replace deprecated strncpy with strscpy
  samples: Replace strlcpy() with strscpy()
  SUNRPC: Replace strlcpy() with strscpy()
2024-01-10 11:03:52 -08:00
William Butler
06c59d4270 nvme-pci: set doorbell config before unquiescing
During resets, if queues are unquiesced first, then the host can submit
IOs to the controller using shadow doorbell logic but the controller
won't be aware. This can lead to necessary MMIO doorbells from being
not issued, causing requests to be delayed and timed-out.

Signed-off-by: William Butler <wab@google.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-10 10:36:12 -08:00
Maurizio Lombardi
9a1abc2485 nvmet-tcp: Fix the H2C expected PDU len calculation
The nvmet_tcp_handle_h2c_data_pdu() function should take into
consideration the possibility that the header digest and/or the data
digests are enabled when calculating the expected PDU length, before
comparing it to the value stored in cmd->pdu_len.

Fixes: efa5630590 ("nvmet-tcp: Fix a kernel panic when host sends an invalid H2C PDU length")
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-08 10:09:53 -08:00
Max Gurtovoy
45c36f04f1 nvme-tcp: enhance timeout kernel log
Print the command_id along side blk-mq's tag to help match commands with
protocol wire traces and logs.

Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-08 10:09:50 -08:00
Max Gurtovoy
a5c1a87ce0 nvme-rdma: enhance timeout kernel log
Print the command_id along side blk-mq's tag to help match commands with
protocol wire traces and logs.

Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-08 10:09:45 -08:00
Keith Busch
172fb49600 nvme-pci: enhance timeout kernel log
Kernel configs don't necessarily have opcode decoding, and some opcodes
are not even decodable. It is still interesting for debugging SSD issues
to know what opcode is timing out, what request type it came from, and
the data size (if applicable).

Also print the command_id along side blk-mq's tag to help match commands
with protocol wire traces and firmware logs,

Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-08 10:09:29 -08:00
Arnd Bergmann
a7de1dea76 nvme: trace: avoid memcpy overflow warning
A previous patch introduced a struct_group() in nvme_common_command to help
stringop fortification figure out the length of the fields, but one function
is not currently using them:

In file included from drivers/nvme/target/core.c:7:
In file included from include/linux/string.h:254:
include/linux/fortify-string.h:592:4: error: call to '__read_overflow2_field' declared with 'warning' attribute: detected read beyond size of field (2nd parameter); maybe use struct_group()? [-Werror,-Wattribute-warning]
                        __read_overflow2_field(q_size_field, size);
                        ^

Change this one to use the correct field name to avoid the warning.

Fixes: 5c629dc960 ("nvme: use struct group for generic command dwords")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-05 13:16:18 -08:00
Arnd Bergmann
4ee7ffeb4c nvmet: re-fix tracing strncpy() warning
An earlier patch had tried to address a warning about a string copy with
missing zero termination:

drivers/nvme/target/trace.h:52:3: warning: ‘strncpy’ specified bound 32 equals destination size [-Wstringop-truncation]

The new version causes a different warning with some compiler versions, notably
gcc-9 and gcc-10, and also misses the zero padding that was apparently done
intentionally in the original code:

drivers/nvme/target/trace.h:56:2: error: 'strncpy' specified bound depends on the length of the source argument [-Werror=stringop-overflow=]

Change it to use strscpy_pad() with the original length, which will give
a properly padded and zero-terminated string as well as avoiding the warning.

Fixes: d86481e924 ("nvmet: use min of device_path and disk len")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-05 13:16:18 -08:00
Guixin Liu
bafd590910 nvme: introduce nvme_disk_is_ns_head helper
We currently rely on gendisk's file operations (fops) to distinguish
between a namespace head (ns_head) and a regular namespace. To enhance
code readability, introduce a helper function.
Additionally, we must ensure that the device is not an ns_head before
calling nvme_get_ns_from_dev(). To enforce this, add a WARN_ON check
within the nvme_get_ns_from_dev().

Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Liu Song <liusong@linux.alibaba.com>
[include fix: https://lore.kernel.org/oe-kbuild-all/202401031943.0N72Tkji-lkp@intel.com/]
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-05 13:15:41 -08:00
Jim.Lin
bd029a02ce nvme-pci: disable write zeroes for SK Hynix BC901
SK Hynix BC901 drive write zero will cause Chromebook takes more than 20 mins to switch to developer mode
"disable write zeroes" can fix this issue and Sk Hynix has been verified.

Signed-off-by: Jim.Lin <jim.lin@siliconmotion.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-05 13:15:41 -08:00
Daniel Wagner
f644d21baa nvmet-fcloop: Remove remote port from list when unlinking
The remote port is removed too late from fcloop_nports list. Remove it
when port is unregistered.

This prevents a busy loop in fcloop_exit, because it is possible the
remote port is found in the list and thus we will never progress.

The kernel log will be spammed with

  nvme_fcloop: fcloop_exit: Failed deleting remote port
  nvme_fcloop: fcloop_exit: Failed deleting target port

Signed-off-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-05 13:15:40 -08:00
Daniel Wagner
0e716cec6f nvmet-trace: avoid dereferencing pointer too early
The first command issued from the host to the target is the fabrics
connect command. At this point, neither the target queue nor the
controller have been allocated. But we already try to trace this command
in nvmet_req_init.

Reported by KASAN.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-03 08:09:41 -08:00
Daniel Wagner
72e8c9379d nvmet-fc: remove unnecessary bracket
There is no need for the bracket around the identifier. Remove it.

Signed-off-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-03 08:09:41 -08:00
Christoph Hellwig
3b946fe1cc nvme: simplify the max_discard_segments calculation
Just stash away the DMRL value in the nvme_ctrl struture, and leave
all interpretation to nvme_config_discard, where we know DSM is
supported by the time we're configuring the number of segments.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-03 08:09:40 -08:00
Christoph Hellwig
f29886c249 nvme: fix max_discard_sectors calculation
ctrl->max_discard_sectors stores a value that is potentially based of
the DMRSL field in Identify Controller, which is in units of LBAs and
thus dependent on the Format of a namespace.

Fix this by moving the calculation of max_discard_sectors entirely
into nvme_config_discard and replacing the ctrl->max_discard_sectors
value with a local variable so that the calculation is always
namespace-specific.

Fixes: 1a86924e4f ("nvme: fix interpretation of DMRSL")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-03 08:09:40 -08:00
Christoph Hellwig
a4be9679aa nvme: also skip discard granularity updates in nvme_config_discard
Don't just skip the discard sectors and segments but also the granularity
if a value was already set before.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-03 08:09:40 -08:00
Christoph Hellwig
d3074e9a73 nvme: update the explanation for not updating the limits in nvme_config_discard
Expeand the comment a bit to explain what is going on.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-03 08:09:40 -08:00
Christoph Hellwig
3a96bff229 nvmet-tcp: fix a missing endianess conversion in nvmet_tcp_try_peek_pdu
No, a __le32 cast doesn't magically byteswap on big-endian systems..

Fixes: 70525e5d82 ("nvmet-tcp: peek icreq before starting TLS")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-03 08:09:40 -08:00
Christoph Hellwig
2abd2c39ad nvme-common: mark nvme_tls_psk_prio static
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-03 08:09:40 -08:00
Guixin Liu
ef184b8844 nvme: tcp: remove unnecessary goto statement
There is no requirement to call nvme_tcp_free_queue() for queue
deallocation if the pskid is null or the queue allocation fails, as
the NVME_TCP_Q_ALLOCATED flag would not be set in such scenarios.

Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-03 08:09:39 -08:00
Maurizio Lombardi
75011bd0f9 nvmet-tcp: remove boilerplate code
Simplify the nvmet_tcp_handle_h2c_data_pdu() function by removing
boilerplate code.

Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-02 12:56:28 -08:00
Maurizio Lombardi
0849a54413 nvmet-tcp: fix a crash in nvmet_req_complete()
in nvmet_tcp_handle_h2c_data_pdu(), if the host sends a data_offset
different from rbytes_done, the driver ends up calling nvmet_req_complete()
passing a status error.
The problem is that at this point cmd->req is not yet initialized,
the kernel will crash after dereferencing a NULL pointer.

Fix the bug by replacing the call to nvmet_req_complete() with
nvmet_tcp_fatal_error().

Fixes: 872d26a391 ("nvmet-tcp: add NVMe over TCP target driver")
Reviewed-by: Keith Busch <kbsuch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-02 12:56:19 -08:00
Maurizio Lombardi
efa5630590 nvmet-tcp: Fix a kernel panic when host sends an invalid H2C PDU length
If the host sends an H2CData command with an invalid DATAL,
the kernel may crash in nvmet_tcp_build_pdu_iovec().

Unable to handle kernel NULL pointer dereference at
virtual address 0000000000000000
lr : nvmet_tcp_io_work+0x6ac/0x718 [nvmet_tcp]
Call trace:
  process_one_work+0x174/0x3c8
  worker_thread+0x2d0/0x3e8
  kthread+0x104/0x110

Fix the bug by raising a fatal error if DATAL isn't coherent
with the packet size.
Also, the PDU length should never exceed the MAXH2CDATA parameter which
has been communicated to the host in nvmet_tcp_handle_icreq().

Fixes: 872d26a391 ("nvmet-tcp: add NVMe over TCP target driver")
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-02 12:56:03 -08:00
Jens Axboe
f70a479228 nvme udpates for Linux 6.8
- nvme fabrics spec updates (Guixin, Max)
  - nvme target udpates (Guixin, Evan)
  - nvme attribute refactoring (Daniel)
  - nvme-fc numa fix (Keith)
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE3Fbyvv+648XNRdHTPe3zGtjzRgkFAmWErTYACgkQPe3zGtjz
 RgkMew/+I3Lx92leFzO4NtLX1zRbuhcVfO+cYyJQorkilbgYVBAnBEDjFiVCHLIL
 xPsVMJ0lEjta3uo7KhgV8mc6ZezlPwyZ2ke6Rc5tUsuzuLte9R0Uv382huDAniTb
 FD3VMYtlPU0Oq+bOWVFkkh2pqI2HXR/M3v3j1Xpki422rkxahpZ9n02p7YLHtkX1
 sVezjw/s3OPHq517HcIzKiLK7HBZWvA9oC1qyVVcUBxvqMYp/xYURpEwb4r8TP1r
 /Wcidlo5gcias4/sQrYnMkXkh/wOMDv0TKhjeb1dRr844wztm/WhsPZCP+Suhvf5
 E0SuS6tov3SVgfvBWLvHuFQzPx+NTD+1HaVq61CicQWH6S6lRtm3qQ7vcozbvZwl
 1wQ3Ic1wKIG095clYNSHZomdlc2b4z1YUHRJDE7r4COZWX2jTwH2piKxtYpdM+t4
 TGdS9E3hbwurn6rxt032MpMrxPsY5/zRYTbxRzrQ4BxVh9gcrq+WESztu8jfS/bM
 D5wXQPsL9K4njN9Uw7sWYFa2OL01wp8GgbdyL5AOeAJ5xxskcPIzvKny3cWo2WN3
 fsspzHGFSBy+A6nkvHnLvgNqD4m7fzzbAdNLL7+3cSkZtoQ8QH4xr+64axweWhJ8
 PZqncj9gxw82W2hsY4ap9Z6KYTe38TUTl0YJI22JuwTqJ0Tmbm8=
 =Zzf6
 -----END PGP SIGNATURE-----

Merge tag 'nvme-6.8-2023-12-21' of git://git.infradead.org/nvme into for-6.8/block

Pull NVMe updates from Keith:

"nvme updates for Linux 6.8

 - nvme fabrics spec updates (Guixin, Max)
 - nvme target udpates (Guixin, Evan)
 - nvme attribute refactoring (Daniel)
 - nvme-fc numa fix (Keith)"

* tag 'nvme-6.8-2023-12-21' of git://git.infradead.org/nvme:
  nvme-fc: set numa_node after nvme_init_ctrl
  nvme-fabrics: don't check discovery ioccsz/iorcsz
  nvmet: configfs: use ctrl->instance to track passthru subsystems
  nvme: repack struct nvme_ns_head
  nvme: add csi, ms and nuse to sysfs
  nvme: rename ns attribute group
  nvme: refactor ns info setup function
  nvme: refactor ns info helpers
  nvme: move ns id info to struct nvme_ns_head
  nvmet: remove cntlid_min and cntlid_max check in nvmet_alloc_ctrl
  nvmet: allow identical cntlid_min and cntlid_max settings
  nvme-fabrics: check ioccsz and iorcsz
  nvme: introduce nvme_check_ctrl_fabric_info helper
2023-12-21 14:44:17 -07:00
Keith Busch
5d51dc8db1 nvme-fc: set numa_node after nvme_init_ctrl
nvme_init_ctrl() resets numa_node to NUMA_NO_NODE, so be sure to set the
desired value after that function call so it won't be overwritten.

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-12-21 09:19:01 -08:00
Max Gurtovoy
7642138e17 nvme-fabrics: don't check discovery ioccsz/iorcsz
IOCCSZ and IORCSZ are reserved for discovery controllers. Avoid checking
their values during identify controller phase.

Fixes: 2fcd3ab398 ("nvme-fabrics: check ioccsz and iorcsz")
Reported-by: Daniel Wagner <dwagner@suse.de>
Tested-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-12-21 09:18:25 -08:00
Christoph Hellwig
d73e93b4df block: simplify disk_set_zoned
Only use disk_set_zoned to actually enable zoned device support.
For clearing it, call disk_clear_zoned, which is renamed from
disk_clear_zone_settings and now directly clears the zoned flag as
well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20231217165359.604246-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-19 20:17:43 -07:00
Christoph Hellwig
7437bb73f0 block: remove support for the host aware zone model
When zones were first added the SCSI and ATA specs, two different
models were supported (in addition to the drive managed one that
is invisible to the host):

 - host managed where non-conventional zones there is strict requirement
   to write at the write pointer, or else an error is returned
 - host aware where a write point is maintained if writes always happen
   at it, otherwise it is left in an under-defined state and the
   sequential write preferred zones behave like conventional zones
   (probably very badly performing ones, though)

Not surprisingly this lukewarm model didn't prove to be very useful and
was finally removed from the ZBC and SBC specs (NVMe never implemented
it).  Due to to the easily disappearing write pointer host software
could never rely on the write pointer to actually be useful for say
recovery.

Fortunately only a few HDD prototypes shipped using this model which
never made it to mass production.  Drop the support before it is too
late.  Note that any such host aware prototype HDD can still be used
with Linux as we'll now treat it as a conventional HDD.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20231217165359.604246-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-19 20:17:43 -07:00
Maurizio Lombardi
f6fe0b2d35 nvme-pci: fix sleeping function called from interrupt context
the nvme_handle_cqe() interrupt handler calls nvme_complete_async_event()
but the latter may call nvme_auth_stop() which is a blocking function.
Sleeping functions can't be called in interrupt context

 BUG: sleeping function called from invalid context
 in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 0, name: swapper/15
  Call Trace:
     <IRQ>
      __cancel_work_timer+0x31e/0x460
      ? nvme_change_ctrl_state+0xcf/0x3c0 [nvme_core]
      ? nvme_change_ctrl_state+0xcf/0x3c0 [nvme_core]
      nvme_complete_async_event+0x365/0x480 [nvme_core]
      nvme_poll_cq+0x262/0xe50 [nvme]

Fix the bug by moving nvme_auth_stop() to fw_act_work
(executed by the nvme_wq workqueue)

Fixes: f50fff73d6 ("nvme: implement In-Band authentication")
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-12-19 12:41:05 -08:00
Evan Burgess
536ecccbaf nvmet: configfs: use ctrl->instance to track passthru subsystems
To prevent enabling more than one passthrough subsystem per NVMe
controller, passthru.c maintains an xarray indexed by cntlid values.
Passthrough for a given nvmet subsystem cannot be enabled by configfs
if the subsystem's passthru_ctrl->cntlid value is already accounted
for in the xarray.

However, according to the NVMe spec (rev 2.0c, p.145), "The Controller
ID (CNTLID) value returned in the Identify Controller data structure
may be used to uniquely identify a controller within an NVM subsystem,"
meaning that cntlid values are not guaranteed to be globally unique
across multiple subsystems. Instead, the cntlid only uniquely
identifies multiple controllers _within_ a subsystem.

As a result, multiple unique & valid NVMe targets can be blocked from
enabling passthrough at the same time if their controllers share cntlid
values, a behavior allowed by the spec. Fix this by indexing the xarray
with passthru_ctrl->instance values, which are allocated per
controller by IDA and thus should be truly unique.

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Evan Burgess <evan.burgess@seagate.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-12-19 09:10:22 -08:00
Daniel Wagner
9639296151 nvme: repack struct nvme_ns_head
ns_id, lba_shift and ms are always accessed for every read/write I/O in
nvme_setup_rw. By grouping these variables into one cacheline we can
safe some cycles.

4k sequential reads:

           baseline   patched
Bandwidth: 1620       1634
IOPs       66345579   66910939

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-12-19 09:10:13 -08:00
Daniel Wagner
a1a825ab6a nvme: add csi, ms and nuse to sysfs
libnvme is using the sysfs for enumarating the nvme resources. Though
there are few missing attritbutes in the sysfs. For these libnvme issues
commands during discovering.

As the kernel already knows all these attributes and we would like to
avoid libnvme to issue commands all the time, expose these missing
attributes.

The nuse value is updated on request because the nuse is a volatile
value. Since any user can read the sysfs attribute, a very simple rate
limit is added (update once every 5 seconds). A more sophisticated
update strategy can be added later if there is actually a need for it.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-12-19 09:10:08 -08:00
Daniel Wagner
83ac678e59 nvme: rename ns attribute group
Drop the 'id' part of the attribute group name because we want to expose
non 'id' related attributes via the ns attribute group.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-12-19 09:10:05 -08:00
Daniel Wagner
d386aedc94 nvme: refactor ns info setup function
Use nvme_ns_head instead of nvme_ns where possible. This reduces the
coupling between the different data structures.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-12-19 09:10:01 -08:00
Daniel Wagner
0372dd4e36 nvme: refactor ns info helpers
Pass in the nvme_ns_head pointer directly. This reduces the necessity on
the caller side have the nvme_ns data structure present. Thus we can
refactor the caller side in the next step as well.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-12-19 09:09:58 -08:00
Daniel Wagner
9419e71b8d nvme: move ns id info to struct nvme_ns_head
Move the namesapce info to struct nvme_ns_head, because it's the same
for all associated namespaces.

Note: with multipathing enabled the PI information is shared between all
paths. If a path is using a different PI configuration it will overwrite
the previous settings. This is obviously not correct and such
configuration will be rejected in future. For the time being we expect
a correctly configured storage.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-12-19 09:09:15 -08:00
Keith Busch
d3e8b18587 Revert "nvme-fc: fix race between error recovery and creating association"
The commit was identified to might sleep in invalid context and is
blocking regression testing.

This reverts commit ee6fdc5055.

Link: https://lore.kernel.org/linux-nvme/hkhl56n665uvc6t5d6h3wtx7utkcorw4xlwi7d2t2bnonavhe6@xaan6pu43ap6/
Link: https://lists.infradead.org/pipermail/linux-nvme/2023-December/043756.html
Reported-by: Daniel Wagner <dwagner@suse.de>
Reported-by: Maurizio Lombardi <mlombard@redhat.com>
Cc: Michael Liang <mliang@purestorage.com>
Tested-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-12-19 08:58:46 -08:00
Guixin Liu
4ba8b3f7d3 nvmet: remove cntlid_min and cntlid_max check in nvmet_alloc_ctrl
The cntlid_min and cntlid_max are checked in configfs, don't check
again in nvmet_alloc_ctrl().

Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-12-13 14:53:33 -08:00
Guixin Liu
906dbc47b1 nvmet: allow identical cntlid_min and cntlid_max settings
When the user wants to restrict to only creating one controller,
they can set cntlid_min and cntlid_max to the same value.

Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-12-13 14:53:33 -08:00
Pavel Begunkov
b66509b849 io_uring: split out cmd api into a separate header
linux/io_uring.h is slowly becoming a rubbish bin where we put
anything exposed to other subsystems. For instance, the task exit
hooks and io_uring cmd infra are completely orthogonal and don't need
each other's definitions. Start cleaning it up by splitting out all
command bits into a new header file.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7ec50bae6e21f371d3850796e716917fc141225a.1701391955.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-12 07:42:52 -07:00
Georg Gottleuber
107b4e063d nvme-pci: Add sleep quirk for Kingston drives
Some Kingston NV1 and A2000 are wasting a lot of power on specific TUXEDO
platforms in s2idle sleep if 'Simple Suspend' is used.

This patch applies a new quirk 'Force No Simple Suspend' to achieve a
low power sleep without 'Simple Suspend'.

Signed-off-by: Werner Sembach <wse@tuxedocomputers.com>
Signed-off-by: Georg Gottleuber <ggo@tuxedocomputers.com>
Cc: <stable@vger.kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-12-07 08:53:20 -08:00
Guixin Liu
2fcd3ab398 nvme-fabrics: check ioccsz and iorcsz
Make sure that ioccsz and iorcsz returned by target are correct before use it.

Per 2.0a base NVMe spec:

  I/O Queue Command Capsule Supported Size (IOCCSZ): This field defines
  the maximum I/O command capsule size in 16 byte units. The minimum value
  that shall be indicated is 4 corresponding to 64 bytes.

Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-12-06 13:57:43 -08:00
Guixin Liu
68999d1dd2 nvme: introduce nvme_check_ctrl_fabric_info helper
Inroduce nvme_check_ctrl_fabric_info helper to check fabric controller info
returned by target.

Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-12-06 13:57:40 -08:00
Bitao Hu
839a40d1e7 nvme: fix deadlock between reset and scan
If controller reset occurs when allocating namespace, both
nvme_reset_work and nvme_scan_work will hang, as shown below.

Test Scripts:

    for ((t=1;t<=128;t++))
    do
    nsid=`nvme create-ns /dev/nvme1 -s 14537724 -c 14537724 -f 0 -m 0 \
    -d 0 | awk -F: '{print($NF);}'`
    nvme attach-ns /dev/nvme1 -n $nsid -c 0
    done
    nvme reset /dev/nvme1

We will find that both nvme_reset_work and nvme_scan_work hung:

    INFO: task kworker/u249:4:17848 blocked for more than 120 seconds.
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
    message.
    task:kworker/u249:4  state:D stack:    0 pid:17848 ppid:     2
    flags:0x00000028
    Workqueue: nvme-reset-wq nvme_reset_work [nvme]
    Call trace:
    __switch_to+0xb4/0xfc
    __schedule+0x22c/0x670
    schedule+0x4c/0xd0
    blk_mq_freeze_queue_wait+0x84/0xc0
    nvme_wait_freeze+0x40/0x64 [nvme_core]
    nvme_reset_work+0x1c0/0x5cc [nvme]
    process_one_work+0x1d8/0x4b0
    worker_thread+0x230/0x440
    kthread+0x114/0x120
    INFO: task kworker/u249:3:22404 blocked for more than 120 seconds.
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
    message.
    task:kworker/u249:3  state:D stack:    0 pid:22404 ppid:     2
    flags:0x00000028
    Workqueue: nvme-wq nvme_scan_work [nvme_core]
    Call trace:
    __switch_to+0xb4/0xfc
    __schedule+0x22c/0x670
    schedule+0x4c/0xd0
    rwsem_down_write_slowpath+0x32c/0x98c
    down_write+0x70/0x80
    nvme_alloc_ns+0x1ac/0x38c [nvme_core]
    nvme_validate_or_alloc_ns+0xbc/0x150 [nvme_core]
    nvme_scan_ns_list+0xe8/0x2e4 [nvme_core]
    nvme_scan_work+0x60/0x500 [nvme_core]
    process_one_work+0x1d8/0x4b0
    worker_thread+0x260/0x440
    kthread+0x114/0x120
    INFO: task nvme:28428 blocked for more than 120 seconds.
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
    message.
    task:nvme            state:D stack:    0 pid:28428 ppid: 27119
    flags:0x00000000
    Call trace:
    __switch_to+0xb4/0xfc
    __schedule+0x22c/0x670
    schedule+0x4c/0xd0
    schedule_timeout+0x160/0x194
    do_wait_for_common+0xac/0x1d0
    __wait_for_common+0x78/0x100
    wait_for_completion+0x24/0x30
    __flush_work.isra.0+0x74/0x90
    flush_work+0x14/0x20
    nvme_reset_ctrl_sync+0x50/0x74 [nvme_core]
    nvme_dev_ioctl+0x1b0/0x250 [nvme_core]
    __arm64_sys_ioctl+0xa8/0xf0
    el0_svc_common+0x88/0x234
    do_el0_svc+0x7c/0x90
    el0_svc+0x1c/0x30
    el0_sync_handler+0xa8/0xb0
    el0_sync+0x148/0x180

The reason for the hang is that nvme_reset_work occurs while nvme_scan_work
is still running. nvme_scan_work may add new ns into ctrl->namespaces
list after nvme_reset_work frozen all ns->q in ctrl->namespaces list.
The newly added ns is not frozen, so nvme_wait_freeze will wait forever.
Unfortunately, ctrl->namespaces_rwsem is held by nvme_reset_work, so
nvme_scan_work will also wait forever. Now we are deadlocked!

PROCESS1                         PROCESS2
==============                   ==============
nvme_scan_work
  ...                            nvme_reset_work
  nvme_validate_or_alloc_ns        nvme_dev_disable
    nvme_alloc_ns                    nvme_start_freeze
     down_write                      ...
     nvme_ns_add_to_ctrl_list        ...
     up_write                      nvme_wait_freeze
    ...                              down_read
    nvme_alloc_ns                    blk_mq_freeze_queue_wait
     down_write

Fix by marking the ctrl with say NVME_CTRL_FROZEN flag set in
nvme_start_freeze and cleared in nvme_unfreeze. Then the scan can check
it before adding the new namespace (under the namespaces_rwsem).

Signed-off-by: Bitao Hu <yaoma@linux.alibaba.com>
Reviewed-by: Guixin Liu <kanie@linux.alibaba.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-12-04 08:39:04 -08:00
Nitesh Shetty
20dc66f2d7 nvme: prevent potential spectre v1 gadget
This patch fixes the smatch warning, "nvmet_ns_ana_grpid_store() warn:
potential spectre issue 'nvmet_ana_group_enabled' [w] (local cap)"
Prevent the contents of kernel memory from being leaked to  user space
via speculative execution by using array_index_nospec.

Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-12-04 08:39:04 -08:00
Shin'ichiro Kawasaki
29ac4b2f92 nvme: improve NVME_HOST_AUTH and NVME_TARGET_AUTH config descriptions
Currently two similar config options NVME_HOST_AUTH and NVME_TARGET_AUTH
have almost same descriptions. It is confusing to choose them in
menuconfig. Improve the descriptions to distinguish them.

Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-12-04 08:39:03 -08:00
Keith Busch
7be866b1cf nvme-ioctl: move capable() admin check to the end
This can be an expensive call on some kernel configs. Move it to the end
after checking the cheaper ways to determine if the command is allowed.

Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-12-04 08:39:03 -08:00
Keith Busch
e6e7f7ac03 nvme: ensure reset state check ordering
A different CPU may be setting the ctrl->state value, so ensure proper
barriers to prevent optimizing to a stale state. Normally it isn't a
problem to observe the wrong state as it is merely advisory to take a
quicker path during initialization and error recovery, but seeing an old
state can report unexpected ENETRESET errors when a reset request was in
fact successful.

Reported-by: Minh Hoang <mh2022@meta.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Hannes Reinecke <hare@suse.de>
2023-12-04 08:39:03 -08:00
Keith Busch
5c687c287c nvme: introduce helper function to get ctrl state
The controller state is typically written by another CPU, so reading it
should ensure no optimizations are taken. This is a repeated pattern in
the driver, so start with adding a convenience function that returns the
controller state with READ_ONCE().

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-12-04 08:39:03 -08:00
Keith Busch
d6aacee925 nvme: use bio_integrity_map_user
Map user metadata buffers directly. Now that the bio tracks the
metadata, nvme doesn't need special metadata handling and tracking with
callbacks and additional fields in the pdu.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20231130215309.2923568-3-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-01 18:29:18 -07:00
Justin Stitt
e5a4975ca4 nvme-fc: replace deprecated strncpy with strscpy
strncpy() is deprecated for use on NUL-terminated destination strings
[1] and as such we should prefer more robust and less ambiguous string
interfaces.

Let's instead use strscpy() [2] as it guarantees NUL-termination on the
destination buffer.

Moreover, there is no need to use:

|       min(FCNVME_ASSOC_HOSTNQN_LEN, NVMF_NQN_SIZE));

I imagine this was originally done to make sure the destination buffer
is NUL-terminated by ensuring we copy a number of bytes less than the
size of our destination, thus leaving some NUL-bytes at the end.

However, with strscpy(), we no longer need to do this and we can instead
opt for the more idiomatic strscpy() usage of:

| strscpy(dest, src, sizeof(dest))

Also, no NUL-padding is required as lsop is zero-allocated:

|       lsop = kzalloc((sizeof(*lsop) +
|                        sizeof(*assoc_rqst) + sizeof(*assoc_acc) +
|                        ctrl->lport->ops->lsrqst_priv_sz), GFP_KERNEL);

... and assoc_rqst points to a field in lsop:

|       assoc_rqst = (struct fcnvme_ls_cr_assoc_rqst *)&lsop[1];

Therefore, any additional NUL-byte assignments (like the ones that
strncpy() makes) are redundant.

Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1]
Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html [2]
Link: https://github.com/KSPP/linux/issues/90
Cc: linux-hardening@vger.kernel.org
Signed-off-by: Justin Stitt <justinstitt@google.com>
Similar-to: https://lore.kernel.org/all/20231018-strncpy-drivers-nvme-host-fabrics-c-v1-1-b6677df40a35@google.com/
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20231019-strncpy-drivers-nvme-host-fc-c-v1-1-5805c15e4b49@google.com
Signed-off-by: Kees Cook <keescook@chromium.org>
2023-12-01 09:51:43 -08:00
Justin Stitt
576b75f93b nvme-fabrics: replace deprecated strncpy with strscpy
strncpy() is deprecated for use on NUL-terminated destination strings
[1] and as such we should prefer more robust and less ambiguous string
interfaces.

We expect both data->subsysnqn and data->hostnqn to be NUL-terminated
based on their usage with format specifier ("%s"):
fabrics.c:
322: dev_err(ctrl->device,
323:   "%s, subsysnqn \"%s\"\n",
324:   inv_data, data->subsysnqn);
...
349: dev_err(ctrl->device,
350: 	 "Connect for subsystem %s is not allowed, hostnqn: %s\n",
351: 	 data->subsysnqn, data->hostnqn);

Moreover, there's no need to NUL-pad since `data` is zero-allocated
already in fabrics.c:
383: data = kzalloc(sizeof(*data), GFP_KERNEL);
... therefore any further NUL-padding is rendered useless.

Considering the above, a suitable replacement is `strscpy` [2] due to
the fact that it guarantees NUL-termination on the destination buffer
without unnecessarily NUL-padding.

I opted not to switch NVMF_NQN_SIZE to sizeof(data->xyz) because the
size is defined as:
|       /* NQN names in commands fields specified one size */
|       #define NVMF_NQN_FIELD_LEN	256

... while NVMF_NQN_SIZE is defined as:
|       /* However the max length of a qualified name is another size */
|       #define NVMF_NQN_SIZE		223

Since 223 seems pretty magic, I'm not going to touch it.

Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1]
Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html [2]
Link: https://github.com/KSPP/linux/issues/90
Cc: linux-hardening@vger.kernel.org
Signed-off-by: Justin Stitt <justinstitt@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20231018-strncpy-drivers-nvme-host-fabrics-c-v1-1-b6677df40a35@google.com
Signed-off-by: Kees Cook <keescook@chromium.org>
2023-12-01 09:51:43 -08:00
Keith Busch
74fbc88e16 nvme-core: check for too small lba shift
The block layer doesn't support logical block sizes smaller than 512
bytes. The nvme spec doesn't support that small either, but the driver
isn't checking to make sure the device responded with usable data.
Failing to catch this will result in a kernel bug, either from a
division by zero when stacking, or a zero length bio.

Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-12-01 07:49:50 -08:00
Ewan D. Milne
d8b90d600a nvme: check for valid nvme_identify_ns() before using it
When scanning namespaces, it is possible to get valid data from the first
call to nvme_identify_ns() in nvme_alloc_ns(), but not from the second
call in nvme_update_ns_info_block().  In particular, if the NSID becomes
inactive between the two commands, a storage device may return a buffer
filled with zero as per 4.1.5.1.  In this case, we can get a kernel crash
due to a divide-by-zero in blk_stack_limits() because ns->lba_shift will
be set to zero.

PID: 326      TASK: ffff95fec3cd8000  CPU: 29   COMMAND: "kworker/u98:10"
 #0 [ffffad8f8702f9e0] machine_kexec at ffffffff91c76ec7
 #1 [ffffad8f8702fa38] __crash_kexec at ffffffff91dea4fa
 #2 [ffffad8f8702faf8] crash_kexec at ffffffff91deb788
 #3 [ffffad8f8702fb00] oops_end at ffffffff91c2e4bb
 #4 [ffffad8f8702fb20] do_trap at ffffffff91c2a4ce
 #5 [ffffad8f8702fb70] do_error_trap at ffffffff91c2a595
 #6 [ffffad8f8702fbb0] exc_divide_error at ffffffff928506e6
 #7 [ffffad8f8702fbd0] asm_exc_divide_error at ffffffff92a00926
    [exception RIP: blk_stack_limits+434]
    RIP: ffffffff92191872  RSP: ffffad8f8702fc80  RFLAGS: 00010246
    RAX: 0000000000000000  RBX: ffff95efa0c91800  RCX: 0000000000000001
    RDX: 0000000000000000  RSI: 0000000000000001  RDI: 0000000000000001
    RBP: 00000000ffffffff   R8: ffff95fec7df35a8   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000001  R12: 0000000000000000
    R13: 0000000000000000  R14: 0000000000000000  R15: ffff95fed33c09a8
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #8 [ffffad8f8702fce0] nvme_update_ns_info_block at ffffffffc06d3533 [nvme_core]
 #9 [ffffad8f8702fd18] nvme_scan_ns at ffffffffc06d6fa7 [nvme_core]

This happened when the check for valid data was moved out of nvme_identify_ns()
into one of the callers.  Fix this by checking in both callers.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=218186
Fixes: 0dd6fff2aa ("nvme: bring back auto-removal of deleted namespaces during sequential scan")
Cc: stable@vger.kernel.org
Signed-off-by: Ewan D. Milne <emilne@redhat.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-11-27 14:00:55 -08:00
Maurizio Lombardi
e3139cef82 nvme-core: fix a memory leak in nvme_ns_info_from_identify()
In case of error, free the nvme_id_ns structure that was allocated
by nvme_identify_ns().

Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-11-27 10:26:43 -08:00
Mark O'Donovan
136cfcb8dc nvme: fine-tune sending of first keep-alive
Keep-alive commands are sent half-way through the kato period.
This normally works well but fails when the keep-alive system is
started when we are more than half way through the kato.
This can happen on larger setups or due to host delays.
With this change we now time the initial keep-alive command from
the controller initialisation time, rather than the keep-alive
mechanism activation time.

Signed-off-by: Mark O'Donovan <shiftee@posteo.net>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-11-27 10:23:13 -08:00
Arnd Bergmann
0e6c4fe782 nvme: tcp: fix compile-time checks for TLS mode
When CONFIG_NVME_KEYRING is enabled as a loadable module, but the TCP
host code is built-in, it fails to link:

arm-linux-gnueabi-ld: drivers/nvme/host/tcp.o: in function `nvme_tcp_setup_ctrl':
tcp.c:(.text+0x1940): undefined reference to `nvme_tls_psk_default'

The problem is that the compile-time conditionals are inconsistent here,
using a mix of #ifdef CONFIG_NVME_TCP_TLS, IS_ENABLED(CONFIG_NVME_TCP_TLS)
and IS_ENABLED(CONFIG_NVME_KEYRING) checks, with CONFIG_NVME_KEYRING
controlling whether the implementation is actually built.

Change it to use IS_ENABLED(CONFIG_NVME_KEYRING) checks consistently,
which should help readability and make it less error-prone. Combining
it with the check for the ctrl->opts->tls flag lets the compiler drop
all the TLS code in configurations without this feature, which also
helps runtime behavior in addition to avoiding the link failure.

To make it possible for the compiler to build the dead code, both
the tls_handshake_timeout variable and the TLS specific members
of nvme_tcp_queue need to be moved out of the #ifdef block as well,
but at least the former of these gets optimized out again.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20231122224719.4042108-4-arnd@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-22 18:41:14 -07:00
Arnd Bergmann
65e2a74c44 nvme: target: fix Kconfig select statements
When the NVME target code is built-in but its TCP frontend is a loadable
module, enabling keyring support causes a link failure:

x86_64-linux-ld: vmlinux.o: in function `nvmet_ports_make':
configfs.c:(.text+0x100a211): undefined reference to `nvme_keyring_id'

The problem is that CONFIG_NVME_TARGET_TCP_TLS is a 'bool' symbol that
depends on the tristate CONFIG_NVME_TARGET_TCP, so any 'select' from
it inherits the state of the tristate symbol rather than the intended
CONFIG_NVME_TARGET one that contains the actual call.

The same thing is true for CONFIG_KEYS, which itself is required for
NVME_KEYRING.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20231122224719.4042108-3-arnd@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-22 18:40:14 -07:00
Arnd Bergmann
d78abcbabe nvme: target: fix nvme_keyring_id() references
In configurations without CONFIG_NVME_TARGET_TCP_TLS, the keyring
code might not be available, or using it will result in a runtime
failure:

x86_64-linux-ld: vmlinux.o: in function `nvmet_ports_make':
configfs.c:(.text+0x100a211): undefined reference to `nvme_keyring_id'

Add a check to ensure we only check the keyring if there is a chance
of it being used, which avoids both the runtime and link-time
problems.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20231122224719.4042108-2-arnd@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-22 18:40:14 -07:00
Hannes Reinecke
3af755a468 nvme: move nvme_stop_keep_alive() back to original position
Stopping keep-alive not only stops the keep-alive workqueue,
but also needs to be synchronized with I/O termination as we
must not send a keep-alive command after all I/O had been
terminated.
So to avoid any regressions move the call to stop_keep_alive()
back to its original position and ensure that keep-alive is
correctly stopped failing to setup the admin queue.

Fixes: 4733b65d82 ("nvme: start keep-alive after admin queue setup")
Suggested-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-11-22 08:07:02 -08:00
Hannes Reinecke
11b9d0b499 nvmet-tcp: always initialize tls_handshake_tmo_work
The TLS handshake timeout work item should always be
initialized to avoid a crash when cancelling the workqueue.

Fixes: 675b453e02 ("nvmet-tcp: enable TLS handshake upcall")
Suggested-by: Maurizio Lombardi <mlombard@redhat.com>
Signed-off-by: Hannes Reinecke <hare@suse.de>
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-11-20 09:25:33 -08:00
Christoph Hellwig
1c22e0295a nvmet: nul-terminate the NQNs passed in the connect command
The host and subsystem NQNs are passed in the connect command payload and
interpreted as nul-terminated strings.  Ensure they actually are
nul-terminated before using them.

Fixes: a07b4970f4 "nvmet: add a generic NVMe target")
Reported-by: Alon Zahavi <zahavi.alon@gmail.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-11-20 09:25:33 -08:00
Hannes Reinecke
c7ca9757bd nvme: blank out authentication fabrics options if not configured
If the config option NVME_HOST_AUTH is not selected we should not
accept the corresponding fabrics options. This allows userspace
to detect if NVMe authentication has been enabled for the kernel.

Cc: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Fixes: f50fff73d6 ("nvme: implement In-Band authentication")
Signed-off-by: Hannes Reinecke <hare@suse.de>
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-11-20 09:25:32 -08:00
Hannes Reinecke
cd9aed6060 nvme: catch errors from nvme_configure_metadata()
nvme_configure_metadata() is issuing I/O, so we might incur an I/O
error which will cause the connection to be reset.
But in that case any further probing will race with reset and
cause UAF errors.
So return a status from nvme_configure_metadata() and abort
probing if there was an I/O error.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-11-20 09:25:32 -08:00
Hannes Reinecke
23441536b6 nvme-tcp: only evaluate 'tls' option if TLS is selected
We only need to evaluate the 'tls' connect option if TLS is
enabled; otherwise we might be getting a link error.

Fixes: 706add1367 ("nvme: keyring: fix conditional compilation")
Reported-by: kernel test robot <yujie.liu@intel.com>
Closes: https://lore.kernel.org/r/202311140426.0eHrTXBr-lkp@intel.com/
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-11-20 09:25:32 -08:00
Mark O'Donovan
38ce1570e2 nvme-auth: set explanation code for failure2 msgs
Some error cases were not setting an auth-failure-reason-code-explanation.
This means an AUTH_Failure2 message will be sent with an explanation value
of 0 which is a reserved value.

Signed-off-by: Mark O'Donovan <shiftee@posteo.net>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-11-20 09:25:23 -08:00
Mark O'Donovan
616add70bf nvme-auth: unlock mutex in one place only
Signed-off-by: Mark O'Donovan <shiftee@posteo.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-11-20 09:25:13 -08:00
Hannes Reinecke
706add1367 nvme: keyring: fix conditional compilation
The keyring and auth functions can be called from both the host and
the target side and are controlled by Kconfig options for each of the
combinations, but the declarations are controlled by #ifdef checks
on the shared Kconfig symbols.

This leads to link failures in combinations where one of the frontends
is built-in and the other one is a module, and the keyring code
ends up in a module that is not reachable from the builtin code:

ld: drivers/nvme/host/core.o: in function `nvme_core_exit':
core.c:(.exit.text+0x4): undefined reference to `nvme_keyring_exit'
ld: drivers/nvme/host/core.o: in function `nvme_core_init':
core.c:(.init.text+0x94): undefined reference to `nvme_keyring_init

ld: drivers/nvme/host/tcp.o: in function `nvme_tcp_setup_ctrl':
tcp.c:(.text+0x4c18): undefined reference to `nvme_tls_psk_default'

Address this by moving nvme_keyring_init()/nvme_keyring_exit() into
module init/exit functions for the keyring module.

Fixes: be8e82caa6 ("nvme-tcp: enable TLS handshake upcall")
Signed-off-by: Hannes Reinecke <hare@suse.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-11-08 08:07:08 -08:00
Arnd Bergmann
6affe08aea nvme: common: make keyring and auth separate modules
When only the keyring module is included but auth is not, modpost
complains about the lack of a module license tag:

ERROR: modpost: missing MODULE_LICENSE() in drivers/nvme/common/nvme-common.o

Address this by making both modules buildable standalone,
removing the now unnecessary CONFIG_NVME_COMMON symbol
in the process.

Also, now that NVME_KEYRING config symbol can be either a module or
built-in, the stubs need to check for '#if IS_ENABLED' rather than a
simple '#ifdef'.

Fixes: 9d77eb5277 ("nvme-keyring: register '.nvme' keyring")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-11-07 10:05:15 -08:00
Hannes Reinecke
4733b65d82 nvme: start keep-alive after admin queue setup
Setting up I/O queues might take quite some time on larger and/or
busy setups, so KATO might expire before all I/O queues could be
set up.
Fix this by start keep alive from the ->init_ctrl_finish() callback,
and stopping it when calling nvme_cancel_admin_tagset().

Signed-off-by: Hannes Reinecke <hare@suse.de>
Tested-by: Mark O'Donovan <shiftee@posteo.net>
[fixed nvme-fc compile error]
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-11-06 09:16:10 -08:00
Hannes Reinecke
55adcdbbdd nvme-loop: always quiesce and cancel commands before destroying admin q
Once ->init_ctrl_finish() is called there may be commands outstanding,
so we should quiesce the admin queue and cancel all commands prior
to call nvme_loop_destroy_admin_queue().

Signed-off-by: Hannes Reinecke <hare@suse.de>
Tested-by: Mark O'Donovan <shiftee@posteo.net>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-11-06 09:00:15 -08:00
Hannes Reinecke
fd1418de10 nvme-tcp: avoid open-coding nvme_tcp_teardown_admin_queue()
nvme_tcp_setup_ctrl() has an open-coded version of
nvme_tcp_teardown_admin_queue().

Signed-off-by: Hannes Reinecke <hare@suse.de>
Tested-by: Mark O'Donovan <shiftee@posteo.net>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-11-06 09:00:13 -08:00
Mark O'Donovan
6f66d046ea nvme-auth: always set valid seq_num in dhchap reply
Currently a seqnum of zero is sent during uni-directional
authentication. The zero value is reserved for the secure channel
feature which is not yet implemented.

Relevant extract from the spec:
The value 0h is used to indicate that bidirectional authentication
is not performed, but a challenge value C2 is carried in order to
generate a pre-shared key (PSK) for subsequent establishment of a
secure channel

Signed-off-by: Mark O'Donovan <shiftee@posteo.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
2023-11-06 08:34:20 -08:00
Mark O'Donovan
fc1e03eaca nvme-auth: add flag for bi-directional auth
Introduces an explicit variable for bi-directional auth.
The currently used variable chap->s2 is incorrectly zeroed for
uni-directional auth. That will be fixed in the next patch so this
needs to change to avoid sending unexpected success2 messages

Signed-off-by: Mark O'Donovan <shiftee@posteo.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
2023-11-06 08:34:17 -08:00
Mark O'Donovan
75276847f4 nvme-auth: auth success1 msg always includes resp
In cases where RVALID is false, the response is still transmitted,
but is cleared to zero.

Relevant extract from the spec:
Response R2, if valid (i.e., if the RVALID field is set to 01h),
cleared to 0h otherwise

Signed-off-by: Mark O'Donovan <shiftee@posteo.net>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-11-06 08:34:15 -08:00
Anuj Gupta
1147dd0503 nvme: fix error-handling for io_uring nvme-passthrough
Driver may return an error before submitting the command to the device.
Ensure that such error is propagated up.

Fixes: 456cba386e ("nvme: wire-up uring-cmd support for io-passthru on char-device.")
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Niklas Cassel <niklas.cassel@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-11-06 08:09:31 -08:00
Daniel Wagner
983a338b96 nvme: update firmware version after commit
The firmware version sysfs entry needs to be updated after a successfully
firmware activation.

nvme-cli stopped issuing an Identify Controller command to list the
current firmware information and relies on sysfs showing the current
firmware version.

Reported-by: Kenji Tomonaga <tkenbo@gmail.com>
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Tested-by: Kenji Tomonaga <tkenbo@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Niklas Cassel <niklas.cassel@wdc.com>
[fixed off-by one afi index]
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-11-06 08:08:11 -08:00
Christophe JAILLET
0e32fdd796 nvme-tcp: Fix a memory leak
All error handling path end to the error handling path, except this one.
Go to the error handling branch as well here, otherwise 'icreq' and
'icresp' will leak.

Fixes: 2837966ab2 ("nvme-tcp: control message handling for recvmsg()")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-11-06 08:05:53 -08:00
Eric Biggers
744eac783f nvme-auth: use crypto_shash_tfm_digest()
Simplify nvme_auth_augmented_challenge() by using
crypto_shash_tfm_digest() instead of an alloc+init+update+final
sequence.  This should also improve performance.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-11-06 08:05:53 -08:00
Linus Torvalds
90d624af2e for-6.7/block-2023-10-30
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmU/vjMQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpqVcEADaNf6X7LVKKrdQ4sA38dBZYGM3kNz0SCYV
 vkjQAs0Fyylbu6EhYOLO/R+UCtpytLlnbr4NmFDbhaEG4OJcwoDLDxpMQ7Gda58v
 4RBXAiIlhZX3g99/ebvtNtVEvQa9gF4h8k2n/gKsG+PoS+cbkKAI0Na2duI1d/pL
 B5nQ31VAHhsyjUv1nIPLrQS6lsL7ZTFvH8L6FLcEVM03poy8PE2H6kN7WoyXwtfo
 LN3KK0Nu7B0Wx2nDx0ffisxcDhbChGs7G2c9ndPTvxg6/4HW+2XSeNUwTxXYpyi2
 ZCD+AHCzMB/w6GNNWFw4xfau5RrZ4c4HdBnmyR6+fPb1u6nGzjgquzFyLyLu5MkA
 n/NvOHP1Cbd3QIXG1TnBi2kDPkQ5FOIAjFSe9IZAGT4dUkZ63wBoDil1jCgMLuCR
 C+AFPLhiIg3cFvu9+fdZ6BkCuZYESd3YboBtRKeMionEexrPTKt4QWqIoVJgd/Y7
 nwvR8jkIBpVgQZT8ocYqhSycLCYV2lGqEBSq4rlRiEb/W1G9Awmg8UTGuUYFSC1G
 vGPCwhGi+SBsbo84aPCfSdUkKDlruNWP0GwIFxo0hsiTOoHP+7UWeenJ2Jw5lNPt
 p0Y72TEDDaSMlE4cJx6IWdWM/B+OWzCyRyl3uVcy7bToEsVhIbBSSth7+sh2n7Cy
 WgH1lrtMzg==
 =sace
 -----END PGP SIGNATURE-----

Merge tag 'for-6.7/block-2023-10-30' of git://git.kernel.dk/linux

Pull block updates from Jens Axboe:

 - Improvements to the queue_rqs() support, and adding null_blk support
   for that as well (Chengming)

 - Series improving badblocks support (Coly)

 - Key store support for sed-opal (Greg)

 - IBM partition string handling improvements (Jan)

 - Make number of ublk devices supported configurable (Mike)

 - Cancelation improvements for ublk (Ming)

 - MD pull requests via Song:
     - Handle timeout in md-cluster, by Denis Plotnikov
     - Cleanup pers->prepare_suspend, by Yu Kuai
     - Rewrite mddev_suspend(), by Yu Kuai
     - Simplify md_seq_ops, by Yu Kuai
     - Reduce unnecessary locking array_state_store(), by Mariusz
       Tkaczyk
     - Make rdev add/remove independent from daemon thread, by Yu Kuai
     - Refactor code around quiesce() and mddev_suspend(), by Yu Kuai

 - NVMe pull request via Keith:
     - nvme-auth updates (Mark)
     - nvme-tcp tls (Hannes)
     - nvme-fc annotaions (Kees)

 - Misc cleanups and improvements (Jiapeng, Joel)

* tag 'for-6.7/block-2023-10-30' of git://git.kernel.dk/linux: (95 commits)
  block: ublk_drv: Remove unused function
  md: cleanup pers->prepare_suspend()
  nvme-auth: allow mixing of secret and hash lengths
  nvme-auth: use transformed key size to create resp
  nvme-auth: alloc nvme_dhchap_key as single buffer
  nvmet-tcp: use 'spin_lock_bh' for state_lock()
  powerpc/pseries: PLPKS SED Opal keystore support
  block: sed-opal: keystore access for SED Opal keys
  block:sed-opal: SED Opal keystore
  ublk: simplify aborting request
  ublk: replace monitor with cancelable uring_cmd
  ublk: quiesce request queue when aborting queue
  ublk: rename mm_lock as lock
  ublk: move ublk_cancel_dev() out of ub->mutex
  ublk: make sure io cmd handled in submitter task context
  ublk: don't get ublk device reference in ublk_abort_queue()
  ublk: Make ublks_max configurable
  ublk: Limit dev_id/ub_number values
  md-cluster: check for timeout while a new disk adding
  nvme: rework NVME_AUTH Kconfig selection
  ...
2023-11-01 12:30:07 -10:00
Jan Kara
2a4936e933
nvmet: Convert to bdev_open_by_path()
Convert nvmet to use bdev_open_by_path() and pass the handle around.

CC: linux-nvme@lists.infradead.org
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230927093442.25915-13-jack@suse.cz
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-28 13:29:18 +02:00
Keith Busch
5c3f406646 nvme-pci: add BOGUS_NID for Intel 0a54 device
These ones claim cmic and nmic capable, so need special consideration to ignore
their duplicate identifiers.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=217981
Reported-by: welsh@cassens.com
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-10-18 14:08:39 -07:00
Maurizio Lombardi
f965b281fd nvmet-auth: complete a request only after freeing the dhchap pointers
It may happen that the work to destroy a queue
(for example nvmet_tcp_release_queue_work()) is started while
an auth-send or auth-receive command is still completing.

nvmet_sq_destroy() will block, waiting for all the references
to the sq to be dropped, the last reference is then
dropped when nvmet_req_complete() is called.

When this happens, both nvmet_sq_destroy() and
nvmet_execute_auth_send()/_receive() will free the dhchap pointers by
calling nvmet_auth_sq_free().
Since there isn't any lock, the two threads may race against each other,
causing double frees and memory corruptions, as reported by KASAN.

Reproduced by stress blktests nvme/041 nvme/042 nvme/043

 nvme nvme2: qid 0: authenticated with hash hmac(sha512) dhgroup ffdhe4096
 ==================================================================
 BUG: KASAN: double-free in kfree+0xec/0x4b0

 Call Trace:
  <TASK>
  kfree+0xec/0x4b0
  nvmet_auth_sq_free+0xe1/0x160 [nvmet]
  nvmet_execute_auth_send+0x482/0x16d0 [nvmet]
  process_one_work+0x8e5/0x1510

 Allocated by task 191846:
  __kasan_kmalloc+0x81/0xa0
  nvmet_auth_ctrl_sesskey+0xf6/0x380 [nvmet]
  nvmet_auth_reply+0x119/0x990 [nvmet]

 Freed by task 143270:
  kfree+0xec/0x4b0
  nvmet_auth_sq_free+0xe1/0x160 [nvmet]
  process_one_work+0x8e5/0x1510

Fix this bug by calling nvmet_req_complete() only after freeing the
pointers, so we will prevent the race by holding the sq reference.

V2: remove redundant code

Fixes: db1312dd95 ("nvmet: implement basic In-Band Authentication")
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-10-18 14:08:39 -07:00
Keith Busch
2b32c76e2b nvme: sanitize metadata bounce buffer for reads
User can request more metadata bytes than the device will write. Ensure
kernel buffer is initialized so we're not leaking unsanitized memory on
the copy-out.

Fixes: 0b7f1f26f9 ("nvme: use the block layer for userspace passthrough metadata")
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-10-18 14:08:39 -07:00
Mark O'Donovan
32445526d8 nvme-auth: allow mixing of secret and hash lengths
We can now use any of the secret transformation hashes with a
secret, regardless of the secret size.
e.g. a 32 byte key with the SHA-512(64 byte) hash.

The example secret from the spec should now be permitted with
any of the following:
DHHC-1:00:ia6zGodOr4SEG0Zzaw398rpY0wqipUWj4jWjUh4HWUz6aQ2n:
DHHC-1:01:ia6zGodOr4SEG0Zzaw398rpY0wqipUWj4jWjUh4HWUz6aQ2n:
DHHC-1:02:ia6zGodOr4SEG0Zzaw398rpY0wqipUWj4jWjUh4HWUz6aQ2n:
DHHC-1:03:ia6zGodOr4SEG0Zzaw398rpY0wqipUWj4jWjUh4HWUz6aQ2n:

Note: Secrets are still restricted to 32,48 or 64 bits.

Co-developed-by: Akash Appaiah <Akash.Appaiah@dell.com>
Signed-off-by: Akash Appaiah <Akash.Appaiah@dell.com>
Signed-off-by: Mark O'Donovan <shiftee@posteo.net>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-10-17 13:57:55 -07:00
Mark O'Donovan
f047daed17 nvme-auth: use transformed key size to create resp
This does not change current behaviour as the driver currently
verifies that the secret size is the same size as the length of
the transformation hash.

Co-developed-by: Akash Appaiah <Akash.Appaiah@dell.com>
Signed-off-by: Akash Appaiah <Akash.Appaiah@dell.com>
Signed-off-by: Mark O'Donovan <shiftee@posteo.net>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-10-17 13:57:54 -07:00
Mark O'Donovan
3ebed3749f nvme-auth: alloc nvme_dhchap_key as single buffer
Co-developed-by: Akash Appaiah <Akash.Appaiah@dell.com>
Signed-off-by: Akash Appaiah <Akash.Appaiah@dell.com>
Signed-off-by: Mark O'Donovan <shiftee@posteo.net>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-10-17 13:57:54 -07:00
Hannes Reinecke
bbacf79201 nvmet-tcp: use 'spin_lock_bh' for state_lock()
nvmet_tcp_schedule_release_queue() is called from socket state
change callbacks, which may be called from an softirq context.
So use 'spin_lock_bh' to avoid a spin lock warning.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-10-17 13:57:54 -07:00
Hannes Reinecke
d680063482 nvme: rework NVME_AUTH Kconfig selection
Having a single Kconfig symbol NVME_AUTH conflates the selection
of the authentication functions from nvme/common and nvme/host,
causing kbuild robot to complain when building the nvme target
only. So introduce a Kconfig symbol NVME_HOST_AUTH for the nvme
host bits and use NVME_AUTH for the common functions only.
And move the CRYPTO selection into nvme/common to make it
easier to read.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202310120733.TlPOVeJm-lkp@intel.com/
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-10-12 08:04:49 -07:00
Hannes Reinecke
70525e5d82 nvmet-tcp: peek icreq before starting TLS
Incoming connection might be either 'normal' NVMe-TCP connections
starting with icreq or TLS handshakes. To ensure that 'normal'
connections can still be handled we need to peek the first packet
and only start TLS handshake if it's not an icreq.
With that we can lift the restriction to always set TREQ to
'required' when TLS1.3 is enabled.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-10-11 10:29:59 -07:00
Hannes Reinecke
a1c5dd8355 nvmet-tcp: control messages for recvmsg()
kTLS requires control messages for recvmsg() to relay any out-of-band
TLS messages (eg TLS alerts) to the caller.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-10-11 10:29:59 -07:00
Hannes Reinecke
675b453e02 nvmet-tcp: enable TLS handshake upcall
TLS handshake is handled in userspace with the netlink tls handshake
protocol.

The patch adds a function to start the TLS handshake upcall for any
incoming network connections if the TCP TSAS sectype is set to 'tls1.3'.

A config option NVME_TARGET_TCP_TLS selects whether the TLS handshake
upcall should be compiled in.  The patch also adds reference counting
to struct nvmet_tcp_queue to ensure the queue is always valid when the
the TLS handshake completes.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-10-11 10:29:59 -07:00
Hannes Reinecke
eb398812bd nvmet: Set 'TREQ' to 'required' when TLS is enabled
The current implementation does not support secure concatenation,
so 'TREQ' is always set to 'required' when TLS is enabled.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-10-11 10:29:59 -07:00
Hannes Reinecke
79a4f186ad nvmet-tcp: allocate socket file
For the TLS upcall we need to allocate a socket file such
that the userspace daemon is able to use the socket.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-10-11 10:29:59 -07:00
Hannes Reinecke
4f8cce2d32 nvmet-tcp: make nvmet_tcp_alloc_queue() a void function
The return value from nvmet_tcp_alloc_queue() are just used to
figure out if sock_release() need to be called. So this patch
moves sock_release() into nvmet_tcp_alloc_queue() and make it
a void function.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-10-11 10:29:59 -07:00
Hannes Reinecke
3f123494db nvmet: make TCP sectype settable via configfs
Add a new configfs attribute 'addr_tsas' to make the TCP sectype
settable via configfs.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-10-11 10:29:58 -07:00
Hannes Reinecke
adf22c520b nvme-fabrics: parse options 'keyring' and 'tls_key'
Parse the fabrics options 'keyring' and 'tls_key' and store the
referenced keys in the options structure.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-10-11 10:29:58 -07:00
Hannes Reinecke
e4f4aabb6f nvme-tcp: improve icreq/icresp logging
When icreq/icresp fails we should be printing out a warning to
inform the user that the connection could not be established;
without it there won't be anything in the kernel message log,
just an error code returned to nvme-cli.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-10-11 10:11:55 -07:00
Hannes Reinecke
2837966ab2 nvme-tcp: control message handling for recvmsg()
kTLS is sending TLS ALERT messages as control messages for recvmsg().
As we can't do anything sensible with it just abort the connection
and let the userspace agent to a re-negotiation.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-10-11 10:11:55 -07:00
Hannes Reinecke
be8e82caa6 nvme-tcp: enable TLS handshake upcall
Add a fabrics option 'tls' and start the TLS handshake upcall
with the default PSK. When TLS is started the PSK key serial
number is displayed in the sysfs attribute 'tls_key'

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-10-11 10:11:55 -07:00
Hannes Reinecke
e40d4eb840 nvme-tcp: allocate socket file
When using the TLS upcall we need to allocate a socket file such
that the userspace daemon is able to use the socket.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-10-11 10:11:55 -07:00
Hannes Reinecke
501cc6f4ac nvme-keyring: implement nvme_tls_psk_default()
Implement a function to select the preferred PSK for TLS.
A 'retained' PSK should be preferred over a 'generated' PSK,
and SHA-384 should be preferred to SHA-256.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-10-11 10:11:54 -07:00
Hannes Reinecke
3bac969b16 nvme-keyring: define a 'psk' keytype
Define a 'psk' keytype to hold the NVMe TLS PSKs.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-10-11 10:11:54 -07:00
Hannes Reinecke
9d77eb5277 nvme-keyring: register '.nvme' keyring
Register a '.nvme' keyring to hold keys for TLS and DH-HMAC-CHAP and
add a new config option NVME_KEYRING.
We need a separate keyring for NVMe as the configuration is done
via individual commands (eg for configfs), and the usual per-session
or per-process keyrings can't be used.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-10-11 10:11:54 -07:00
Martin Wilck
4ae55a7dce nvme-auth: use chap->s2 to indicate bidirectional authentication
Commit 546dea18c9 ("nvme-auth: check chap ctrl_key once constructed")
replaced the condition "if (ctrl->ctrl_key)" (indicating bidirectional
auth) by "if (chap->ctrl_key)", because ctrl->ctrl_key is a resource shared
with sysfs. But chap->ctrl_key is set in
nvme_auth_process_dhchap_challenge() depending on the DHVLEN in the
DH-HMAC-CHAP Challenge message received from the controller, and will thus
be non-NULL for every DH-HMAC-CHAP exchange, even if unidirectional auth
was requested. This will lead to a protocol violation by sending a Success2
message in the unidirectional case (per NVMe base spec 2.0, the
authentication transaction ends after the Success1 message for
unidirectional auth). Use chap->s2 instead, which is non-zero if and only
if the host requested bi-directional authentication from the controller.

Fixes: 546dea18c9 ("nvme-auth: check chap ctrl_key once constructed")
Signed-off-by: Martin Wilck <mwilck@suse.com>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-10-10 08:06:06 -07:00
Sagi Grimberg
d920abd1e7 nvmet-tcp: Fix a possible UAF in queue intialization setup
From Alon:
"Due to a logical bug in the NVMe-oF/TCP subsystem in the Linux kernel,
a malicious user can cause a UAF and a double free, which may lead to
RCE (may also lead to an LPE in case the attacker already has local
privileges)."

Hence, when a queue initialization fails after the ahash requests are
allocated, it is guaranteed that the queue removal async work will be
called, hence leave the deallocation to the queue removal.

Also, be extra careful not to continue processing the socket, so set
queue rcv_state to NVMET_TCP_RECV_ERR upon a socket error.

Cc: stable@vger.kernel.org
Reported-by: Alon Zahavi <zahavi.alon@gmail.com>
Tested-by: Alon Zahavi <zahavi.alon@gmail.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-10-10 08:03:22 -07:00
Maurizio Lombardi
3820c4fdc2 nvme-rdma: do not try to stop unallocated queues
Trying to stop a queue which hasn't been allocated will result
in a warning due to calling mutex_lock() against an uninitialized mutex.

 DEBUG_LOCKS_WARN_ON(lock->magic != lock)
 WARNING: CPU: 4 PID: 104150 at kernel/locking/mutex.c:579

 Call trace:
  RIP: 0010:__mutex_lock+0x1173/0x14a0
  nvme_rdma_stop_queue+0x1b/0xa0 [nvme_rdma]
  nvme_rdma_teardown_io_queues.part.0+0xb0/0x1d0 [nvme_rdma]
  nvme_rdma_delete_ctrl+0x50/0x100 [nvme_rdma]
  nvme_do_delete_ctrl+0x149/0x158 [nvme_core]

Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-10-05 09:37:41 -07:00
Kees Cook
ccd3129aca nvmet-fc: Annotate struct nvmet_fc_tgt_queue with __counted_by
Prepare for the coming implementation by GCC and Clang of the __counted_by
attribute. Flexible array members annotated with __counted_by can have
their accesses bounds-checked at run-time via CONFIG_UBSAN_BOUNDS (for
array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
functions).

As found with Coccinelle[1], add __counted_by for struct
nvmet_fc_tgt_queue. Additionally, since the element count member must
be set before accessing the annotated flexible array member, move its
initialization earlier.

Cc: James Smart <james.smart@broadcom.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: linux-nvme@lists.infradead.org
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci [1]
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-10-05 09:29:04 -07:00
Chengming Zhou
217b613a53 blk-mq: update driver tags request table when start request
Now we update driver tags request table in blk_mq_get_driver_tag(),
so the driver that support queue_rqs() have to update that inflight
table by itself.

Move it to blk_mq_start_request(), which is a better place where
we setup the deadline for request timeout check. And it's just
where the request becomes inflight.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230913151616.3164338-5-chengming.zhou@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-09-22 08:52:13 -06:00
Jens Axboe
c266ae774e nvme fixes for Linux 6.6
- nvme-tcp iov len fix (Varun)
  - nvme-hwmon const qualifier for safety (Krzysztof)
  - nvme-fc null pointer checks (Nigel)
  - nvme-pci no numa node fix (Pratyush)
  - nvme timeout fix for non-compliant controllers (Keith)
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE3Fbyvv+648XNRdHTPe3zGtjzRgkFAmUDakkACgkQPe3zGtjz
 RgnSyA/+MR9dvwaa9ADcmWtzKaFnpxOy24Yr0CQ97oqfryv89ZVHSOT9gzQGdhNO
 m3Z3hJ72nU+mjjy8/Ixh0+AdSNLOD6sY9ohQQ/7jev+yGjOaSZnQWXpH6pionfuW
 DCfQweFq8VggCpT+wVinnzs7C4S/mhuuNHunefDAa+A7R6jeJENhridbzillokfi
 semNz6GSKcIMBsaQ85JDlTyh+nGfva7KwLVdEzA0g00XjxMw159bBjqPFtm/Dwde
 Jt280wVNQHEa+Jo8CVgAbYvFAhgAmwt5cqr8d22PisVVR324s7aMUf++BQcslmzX
 5WyVTvr3nvKH5uAWtvKJiDV8VNhs6WM3CuCfOOqK2fOaQ6PN18In08y9aGPDQ73e
 3xkkgDdRlJFa2jae48msIY2KnysJoqPJgymEtvHuqbQNYifE1jym39Ml5soWCol2
 a0aEVobGTceeEb0D7MWdVFntS6HmdFrZcinmSGcgj34IAYCYaguHK8ZSPNovJnAq
 /27YDkfXnt3+8ZgbxhZjxxToH2xCaa/uC842Ujh3KTROqKbppw8YBv/2VoLNxGNh
 Kqlu2vApIbrNP19HuCp/oDu10Q+6KPJZY1qqT68ndLl47CYAXEG0XbyB13I4hdwL
 YUues45a9f9X+Q0cHoJ9Hr4rTLfF6dm5gNv0X1cUdyLNTcB1FbE=
 =WTY0
 -----END PGP SIGNATURE-----

Merge tag 'nvme-6.6-2023-09-14' of git://git.infradead.org/nvme into block-6.6

Pull NVMe fixes from Keith:

"nvme fixes for Linux 6.6

 - nvme-tcp iov len fix (Varun)
 - nvme-hwmon const qualifier for safety (Krzysztof)
 - nvme-fc null pointer checks (Nigel)
 - nvme-pci no numa node fix (Pratyush)
 - nvme timeout fix for non-compliant controllers (Keith)"

* tag 'nvme-6.6-2023-09-14' of git://git.infradead.org/nvme:
  nvme: avoid bogus CRTO values
  nvme-pci: do not set the NUMA node of device if it has none
  nvme-fc: Prevent null pointer dereference in nvme_fc_io_getuuid()
  nvme: host: hwmon: constify pointers to hwmon_channel_info
  nvmet-tcp: pass iov_len instead of sg->length to bvec_set_page()
2023-09-14 16:20:31 -06:00
Keith Busch
6cc834ba62 nvme: avoid bogus CRTO values
Some devices are reporting controller ready mode support, but return 0
for CRTO. These devices require a much higher time to ready than that,
so they are failing to initialize after the driver starter preferring
that value over CAP.TO.

The spec requires that CAP.TO match the appropritate CRTO value, or be
set to 0xff if CRTO is larger than that. This means that CAP.TO can be
used to validate if CRTO is reliable, and provides an appropriate
fallback for setting the timeout value if not. Use whichever is larger.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=217863
Reported-by: Cláudio Sampaio <patola@gmail.com>
Reported-by: Felix Yan <felixonmars@archlinux.org>
Tested-by: Felix Yan <felixonmars@archlinux.org>
Based-on-a-patch-by: Felix Yan <felixonmars@archlinux.org>
Cc: stable@vger.kernel.org
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-09-14 13:09:52 -07:00
Pratyush Yadav
dad651b2a4 nvme-pci: do not set the NUMA node of device if it has none
If a device has no NUMA node information associated with it, the driver
puts the device in node first_memory_node (say node 0). Not having a
NUMA node and being associated with node 0 are completely different
things and it makes little sense to mix the two.

Signed-off-by: Pratyush Yadav <ptyadav@amazon.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-09-12 09:06:58 -07:00
Linus Torvalds
3d3dfeb3ae for-6.6/block-2023-08-28
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmTs08EQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpqa4EACu/zKE+omGXBV0Q7kEpVsChjp0ElGtSDIJ
 tJfTuvnWqQjrqRv4ksmZvGdx8SkqFuXri4/7oBXlsaqeUVbIQdWJUpLErBye6nxa
 lUb6nXOFWwyG94cMRYs71lN0loosjb7aiVw7oVLAIhntq3p3doFl/cyy3ndMZrUE
 pZbsrWSt4QiOKhcO0TtIjfAwsr31AN51qFiNNITEiZl3UjXfkGRCK81X0yM2N8zZ
 7Y0h1ldPBsZ/olNWeRyaW1uB64nKM0buR7/nDxCV/NI05nndJ34bIgo/JIj4xy0v
 SiBj2+y86+oMJZt17yYENwOQdtX3hbyESGuVm9dCrO0t9/byVQxkUk0OMm65BM/l
 l2d+gmMQZTbHziqfLlgq9i3i9+B4C2hsb7iBpuo7SW/FPbM45POgi3lpiZycaZyu
 krQo1qwL4KSGXzGN9CabEuKDcJcXqLxqMDOyEDA3R5Kz06V9tNuM+Di/mr4vuZHK
 sVHUfHuWBO9ionLlGPdc3fH/CuMqic8SHjumiAm2menBZV6cSzRDxpm6H4CyLt7y
 tWmw7BNU7dfHFGd+Jw0Ld49sAuEybszEXq6qYv5uYBVfJNqDvOvEeVoQp0RN2jJA
 AG30hymcZgxn9n7gkIgkPQDgIGUjnzUR8B2mE2UFU1CYVHXYXAXU55CCI5oeTkbs
 d0Y/zCZf1A==
 =p1bd
 -----END PGP SIGNATURE-----

Merge tag 'for-6.6/block-2023-08-28' of git://git.kernel.dk/linux

Pull block updates from Jens Axboe:
 "Pretty quiet round for this release. This contains:

   - Add support for zoned storage to ublk (Andreas, Ming)

   - Series improving performance for drivers that mark themselves as
     needing a blocking context for issue (Bart)

   - Cleanup the flush logic (Chengming)

   - sed opal keyring support (Greg)

   - Fixes and improvements to the integrity support (Jinyoung)

   - Add some exports for bcachefs that we can hopefully delete again in
     the future (Kent)

   - deadline throttling fix (Zhiguo)

   - Series allowing building the kernel without buffer_head support
     (Christoph)

   - Sanitize the bio page adding flow (Christoph)

   - Write back cache fixes (Christoph)

   - MD updates via Song:
      - Fix perf regression for raid0 large sequential writes (Jan)
      - Fix split bio iostat for raid0 (David)
      - Various raid1 fixes (Heinz, Xueshi)
      - raid6test build fixes (WANG)
      - Deprecate bitmap file support (Christoph)
      - Fix deadlock with md sync thread (Yu)
      - Refactor md io accounting (Yu)
      - Various non-urgent fixes (Li, Yu, Jack)

   - Various fixes and cleanups (Arnd, Azeem, Chengming, Damien, Li,
     Ming, Nitesh, Ruan, Tejun, Thomas, Xu)"

* tag 'for-6.6/block-2023-08-28' of git://git.kernel.dk/linux: (113 commits)
  block: use strscpy() to instead of strncpy()
  block: sed-opal: keyring support for SED keys
  block: sed-opal: Implement IOC_OPAL_REVERT_LSP
  block: sed-opal: Implement IOC_OPAL_DISCOVERY
  blk-mq: prealloc tags when increase tagset nr_hw_queues
  blk-mq: delete redundant tagset map update when fallback
  blk-mq: fix tags leak when shrink nr_hw_queues
  ublk: zoned: support REQ_OP_ZONE_RESET_ALL
  md: raid0: account for split bio in iostat accounting
  md/raid0: Fix performance regression for large sequential writes
  md/raid0: Factor out helper for mapping and submitting a bio
  md raid1: allow writebehind to work on any leg device set WriteMostly
  md/raid1: hold the barrier until handle_read_error() finishes
  md/raid1: free the r1bio before waiting for blocked rdev
  md/raid1: call free_r1bio() before allow_barrier() in raid_end_bio_io()
  blk-cgroup: Fix NULL deref caused by blkg_policy_data being installed before init
  drivers/rnbd: restore sysfs interface to rnbd-client
  md/raid5-cache: fix null-ptr-deref for r5l_flush_stripe_to_raid()
  raid6: test: only check for Altivec if building on powerpc hosts
  raid6: test: make sure all intermediate and artifact files are .gitignored
  ...
2023-08-29 20:21:42 -07:00
Nigel Kirkland
8ae5b3a685 nvme-fc: Prevent null pointer dereference in nvme_fc_io_getuuid()
The nvme_fc_fcp_op structure describing an AEN operation is initialized with a
null request structure pointer. An FC LLDD may make a call to
nvme_fc_io_getuuid passing a pointer to an nvmefc_fcp_req for an AEN operation.

Add validation of the request structure pointer before dereference.

Signed-off-by: Nigel Kirkland <nkirkland2304@gmail.com>
Reviewed-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-08-21 13:29:17 -07:00
Krzysztof Kozlowski
71be868472 nvme: host: hwmon: constify pointers to hwmon_channel_info
Statically allocated array of pointed to hwmon_channel_info can be made
const for safety.

Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-08-21 12:54:02 -07:00
Varun Prakash
1f0bbf2894 nvmet-tcp: pass iov_len instead of sg->length to bvec_set_page()
iov_len is the valid data length, so pass iov_len instead of sg->length to
bvec_set_page().

Fixes: 5bfaba275a ("nvmet-tcp: don't map pages which can't come from HIGHMEM")
Signed-off-by: Rakshana Sridhar <rakshanas@chelsio.com>
Signed-off-by: Varun Prakash <varun@chelsio.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-08-21 12:54:02 -07:00
Linus Torvalds
360e694282 block-6.5-2023-08-11
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmTWfLQQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpg3nEACROhaeX6cpeDCSTqDVDW/ontbyn15eX7ep
 tGPLn/TVtKv2AztIobEinS08MdywqBO/VcB7XkxQV9Ov4JqCHIAKhndWI6/HqD9P
 DH3h6tE5JA8RQlNw1aHRrqWWIl1lpDQI6263um1tB2TuaxRa4xuR560jju0VZzAm
 9541ceKlJT8Qc7yG0aiiCv6Bxz+b6Htv3DqCf1mY2yznl3BpN52RQHKhiA0sfnlF
 WKqNsvSJ9/kz3vJbNpFucO7ch8a7W+MzmBx0vf2ickTBpL/3hbhUOrE7dGeKI9rS
 cWh1HaULWqjnKY1uxF9nnapZxm8QoxkT/5T0DgmprKjwuZivfLASAhYpHBc3mT1S
 eQQ0AK8hqx7sPnPeO/kxWtxM2nzRLkeVd19ClbIwux/zDbRrpHWk2/wgnSUUd3/H
 HBbjbgPWbkgLvTOUKhIA5VPBcgkC1efom1+ePzkH/H4TRRuVJwg6s6utGXdgc1PX
 +B4TA8GtXRH/7L0tsblFyJRmd0Y6G7gYE/yy0DYZTMie3oaWrKx3lmz48AQUtEzh
 DG46VRA4wnthHRlw3mkLP7C6z4PJvK9WWBiK11eZ9VfJMF643FNpXQ3/bviR9pfF
 kXdwYXoi1mlnsQ0VUhu2f+JeV4hHalrjwD/VE2H0E8Ogb4ezmJteLyiZKcw5xwaA
 Hmtmbb7Qxw==
 =+1Vt
 -----END PGP SIGNATURE-----

Merge tag 'block-6.5-2023-08-11' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:

 - NVMe pull request via Keith:
      - Fixes for request_queue state (Ming)
      - Another uuid quirk (August)

 - RCU poll fix for NVMe (Ming)

 - Fix for an IO stall with polled IO (me)

 - Fix for blk-iocost stats enable/disable accounting (Chengming)

 - Regression fix for large pages for zram (Christoph)

* tag 'block-6.5-2023-08-11' of git://git.kernel.dk/linux:
  nvme: core: don't hold rcu read lock in nvme_ns_chr_uring_cmd_iopoll
  blk-iocost: fix queue stats accounting
  block: don't make REQ_POLLED imply REQ_NOWAIT
  block: get rid of unused plug->nowait flag
  zram: take device and not only bvec offset into account
  nvme-pci: add NVME_QUIRK_BOGUS_NID for Samsung PM9B1 256G and 512G
  nvme-rdma: fix potential unbalanced freeze & unfreeze
  nvme-tcp: fix potential unbalanced freeze & unfreeze
  nvme: fix possible hang when removing a controller during error recovery
2023-08-11 12:14:08 -07:00
Ming Lei
a7a7dabb5d nvme: core: don't hold rcu read lock in nvme_ns_chr_uring_cmd_iopoll
Now nvme_ns_chr_uring_cmd_iopoll() has switched to request based io
polling, and the associated NS is guaranteed to be live in case of
io polling, so request is guaranteed to be valid because blk-mq uses
pre-allocated request pool.

Remove the rcu read lock in nvme_ns_chr_uring_cmd_iopoll(), which
isn't needed any more after switching to request based io polling.

Fix "BUG: sleeping function called from invalid context" because
set_page_dirty_lock() from blk_rq_unmap_user() may sleep.

Fixes: 585079b6e4 ("nvme: wire up async polling for io passthrough commands")
Reported-by: Guangwu Zhang <guazhang@redhat.com>
Cc: Kanchan Joshi <joshi.k@samsung.com>
Cc: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Guangwu Zhang <guazhang@redhat.com>
Link: https://lore.kernel.org/r/20230809020440.174682-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-11 08:12:32 -06:00
Jinyoung Choi
80814b8e35 bio-integrity: update the payload size in bio_integrity_add_page()
Previously, the bip's bi_size has been set before an integrity pages
were added. If a problem occurs in the process of adding pages for
bip, the bi_size mismatch problem must be dealt with.

When the page is successfully added to bvec, the bi_size is updated.

The parts affected by the change were also contained in this commit.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Martin K. Petersen <martin.petersen@oracle.com>

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jinyoung Choi <j-young.choi@samsung.com>
Tested-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20230803024956epcms2p38186a17392706650c582d38ef3dbcd32@epcms2p3
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-09 16:05:35 -06:00
August Wikerfors
688b419c57 nvme-pci: add NVME_QUIRK_BOGUS_NID for Samsung PM9B1 256G and 512G
The Samsung PM9B1 512G SSD found in some Lenovo Yoga 7 14ARB7 laptop units
reports eui as 0001000200030004 when resuming from s2idle, causing the
device to be removed with this error in dmesg:

nvme nvme0: identifiers changed for nsid 1

To fix this, add a quirk to ignore namespace identifiers for this device.

Signed-off-by: August Wikerfors <git@augustwikerfors.se>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-08-01 13:28:29 -07:00
Ming Lei
29b434d1e4 nvme-rdma: fix potential unbalanced freeze & unfreeze
Move start_freeze into nvme_rdma_configure_io_queues(), and there is
at least two benefits:

1) fix unbalanced freeze and unfreeze, since re-connection work may
fail or be broken by removal

2) IO during error recovery can be failfast quickly because nvme fabrics
unquiesces queues after teardown.

One side-effect is that !mpath request may timeout during connecting
because of queue topo change, but that looks not one big deal:

1) same problem exists with current code base

2) compared with !mpath, mpath use case is dominant

Fixes: 9f98772ba3 ("nvme-rdma: fix controller reset hang during traffic")
Cc: stable@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-07-21 00:53:33 -07:00
Ming Lei
99dc264014 nvme-tcp: fix potential unbalanced freeze & unfreeze
Move start_freeze into nvme_tcp_configure_io_queues(), and there is
at least two benefits:

1) fix unbalanced freeze and unfreeze, since re-connection work may
fail or be broken by removal

2) IO during error recovery can be failfast quickly because nvme fabrics
unquiesces queues after teardown.

One side-effect is that !mpath request may timeout during connecting
because of queue topo change, but that looks not one big deal:

1) same problem exists with current code base

2) compared with !mpath, mpath use case is dominant

Fixes: 2875b0aeca ("nvme-tcp: fix controller reset hang during traffic")
Cc: stable@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-07-21 00:53:29 -07:00
Ming Lei
1b95e81791 nvme: fix possible hang when removing a controller during error recovery
Error recovery can be interrupted by controller removal, then the
controller is left as quiesced, and IO hang can be caused.

Fix the issue by unquiescing controller unconditionally when removing
namespaces.

This way is reasonable and safe given forward progress can be made
when removing namespaces.

Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reported-by: Chunguang Xu <brookxu.cn@gmail.com>
Closes: https://lore.kernel.org/linux-nvme/cover.1685350577.git.chunguang.xu@shopee.com/
Cc: stable@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-07-21 00:53:26 -07:00
Linus Torvalds
be522ac7cd SCSI fixes on 20230714
This is a bunch of small driver fixes and a larger rework of zone disk
 handling (which reaches into blk and nvme).  The aacraid array-bounds
 fix is now critical since the security people turned on -Werror for
 some build tests, which now fail without it.
 
 Signed-off-by: James E.J. Bottomley <jejb@linux.ibm.com>
 -----BEGIN PGP SIGNATURE-----
 
 iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCZLGSiCYcamFtZXMuYm90
 dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishd/BAPwO2i4t
 5uzhcWihoYaIZ6x07oEhgOP/o1h5n5mM908AyAEA6s2hQKDoIxjJexqvkS7lPjni
 P8VMcfvOmdsLDCD3nJ4=
 =+10g
 -----END PGP SIGNATURE-----

Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull SCSI fixes from James Bottomley:
 "This is a bunch of small driver fixes and a larger rework of zone disk
  handling (which reaches into blk and nvme).

  The aacraid array-bounds fix is now critical since the security people
  turned on -Werror for some build tests, which now fail without it"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  scsi: storvsc: Handle SRB status value 0x30
  scsi: block: Improve checks in blk_revalidate_disk_zones()
  scsi: block: virtio_blk: Set zone limits before revalidating zones
  scsi: block: nullblk: Set zone limits before revalidating zones
  scsi: nvme: zns: Set zone limits before revalidating zones
  scsi: sd_zbc: Set zone limits before revalidating zones
  scsi: ufs: core: Add support for qTimestamp attribute
  scsi: aacraid: Avoid -Warray-bounds warning
  scsi: ufs: ufs-mediatek: Add dependency for RESET_CONTROLLER
  scsi: ufs: core: Update contact email for monitor sysfs nodes
  scsi: scsi_debug: Remove dead code
  scsi: qla2xxx: Use vmalloc_array() and vcalloc()
  scsi: fnic: Use vmalloc_array() and vcalloc()
  scsi: qla2xxx: Fix error code in qla2x00_start_sp()
  scsi: qla2xxx: Silence a static checker warning
  scsi: lpfc: Fix a possible data race in lpfc_unregister_fcf_rescan()
2023-07-14 19:57:29 -07:00
Linus Torvalds
b3bd86a049 block-6.5-2023-07-14
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmSxpCwQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpu1kEACX29tNFgxVYh6hpFlbEfK4eaAEESulgo0n
 ubkD+VLnjI3pEUBH+GA++0/kXjrdi9hsXQ2LcO2P9oB2LN6da7Tf0CvFWRbFwqzd
 Dpt7z/UFEikLNYHnctahQbtB7fsy7PYek6RIhJrCJo/+t7UUnV8RCUVLzeqROxKz
 WA6/B62/ahPW4wD0ZfW/xPDUOpR61jZ+GqD7/F5qXc7+7MqL7nLwUFH71zstwUoX
 zmqiyA9clArlSCjmARBP0ekjK/7wYDz8NHMlD9wja2ZwJkIrw8/fml7MHS+j9cPB
 rYc6zhHSksfQa/T4PZq00uGMEIC38QqtV+zHzeziIvh0i2lvLXRyiXAO0bY5yiB2
 rgyYOdaV1kEV3zuQ8zwPI/ZZJkZBeFzXM0hBeAq8K02QRPT/fHjcKE7+xxIQUy/h
 ZfX/NJcS6cWQhI0NFf18BtZR7N544A1LFhj2U8hM8m3U2+O8/n1Ozb0v0yXBuXW3
 oxgRdrqRBpI9pAywLIIBo5Ro/E3BK+2MqO0fIs0UxpYWPbSw6WMrFvwRkDEzQvhP
 0pZpgr9Y1LhPPW+nysASqwlx1Bf9PJTaE6zH2ze1eu9nauhtivuStGC0Sc5w0+UC
 3UsNLDDAgw98+SRxllQrjFl1Q+0YCseLgQrz54nCLnu7xen6UN3n/HMX/yI0C2ri
 Ufn+F9jOig==
 =ao7r
 -----END PGP SIGNATURE-----

Merge tag 'block-6.5-2023-07-14' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:

 - NVMe pull request via Keith:
      - Don't require quirk to use duplicate namespace identifiers
        (Christoph, Sagi)
      - One more BOGUS_NID quirk (Pankaj)
      - IO timeout and error hanlding fixes for PCI (Keith)
      - Enhanced metadata format mask fix (Ankit)
      - Association race condition fix for fibre channel (Michael)
      - Correct debugfs error checks (Minjie)
      - Use PAGE_SECTORS_SHIFT where needed (Damien)
      - Reduce kernel logs for legacy nguid attribute (Keith)
      - Use correct dma direction when unmapping metadata (Ming)

 - Fix for a flush handling regression in this release (Christoph)

 - Fix for batched request time stamping (Chengming)

 - Fix for a regression in the mq-deadline position calculation (Bart)

 - Lockdep fix for blk-crypto (Eric)

 - Fix for a regression in the Amiga partition handling changes
   (Michael)

* tag 'block-6.5-2023-07-14' of git://git.kernel.dk/linux:
  block: queue data commands from the flush state machine at the head
  blk-mq: fix start_time_ns and alloc_time_ns for pre-allocated rq
  nvme-pci: fix DMA direction of unmapping integrity data
  nvme: don't reject probe due to duplicate IDs for single-ported PCIe devices
  block/mq-deadline: Fix a bug in deadline_from_pos()
  nvme: ensure disabling pairs with unquiesce
  nvme-fc: fix race between error recovery and creating association
  nvme-fc: return non-zero status code when fails to create association
  nvme: fix parameter check in nvme_fault_inject_init()
  nvme: warn only once for legacy uuid attribute
  block: remove dead struc request->completion_data field
  nvme: fix the NVME_ID_NS_NVM_STS_MASK definition
  nvmet: use PAGE_SECTORS_SHIFT
  nvme: add BOGUS_NID quirk for Samsung SM953
  blk-crypto: use dynamic lock class for blk_crypto_profile::lock
  block/partition: fix signedness issue for Amiga partitions
2023-07-14 19:52:18 -07:00
Ming Lei
b8f6446b68 nvme-pci: fix DMA direction of unmapping integrity data
DMA direction should be taken in dma_unmap_page() for unmapping integrity
data.

Fix this DMA direction, and reported in Guangwu's test.

Reported-by: Guangwu Zhang <guazhang@redhat.com>
Fixes: 4aedb70543 ("nvme-pci: split metadata handling from nvme_map_data / nvme_unmap_data")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-07-13 08:16:07 -07:00
Christoph Hellwig
ac522fc6c3 nvme: don't reject probe due to duplicate IDs for single-ported PCIe devices
While duplicate IDs are still very harmful, including the potential to easily
see changing devices in /dev/disk/by-id, it turn out they are extremely
common for cheap end user NVMe devices.

Relax our check for them for so that it doesn't reject the probe on
single-ported PCIe devices, but prints a big warning instead.  In doubt
we'd still like to see quirk entries to disable the potential for
changing supposed stable device identifier links, but this will at least
allow users how have two (or more) of these devices to use them without
having to manually add a new PCI ID entry with the quirk through sysfs or
by patching the kernel.

Fixes: 2079f41ec6 ("nvme: check that EUI/GUID/UUID are globally unique")
Cc: stable@vger.kernel.org # 6.0+
Co-developed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-07-13 08:12:50 -07:00
Keith Busch
71a5bb153b nvme: ensure disabling pairs with unquiesce
If any error handling that disables the controller fails to queue the
reset work, like if the state changed to disconnected inbetween, then
the failed teardown needs to unquiesce the queues since it's no longer
paired with reset_work. Just make sure that the controller can be put
into a resetting state prior to starting the disable so that no other
handling can change the queue states while recovery is happening.

Reported-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-07-12 09:32:58 -07:00
Michael Liang
ee6fdc5055 nvme-fc: fix race between error recovery and creating association
There is a small race window between nvme-fc association creation and error
recovery. Fix this race condition by protecting accessing to controller
state and ASSOC_FAILED flag under nvme-fc controller lock.

Signed-off-by: Michael Liang <mliang@purestorage.com>
Reviewed-by: Caleb Sander <csander@purestorage.com>
Reviewed-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-07-12 09:29:51 -07:00
Michael Liang
60e445bdfc nvme-fc: return non-zero status code when fails to create association
Return non-zero status code(-EIO) when needed, so re-connecting or
deleting controller will be triggered properly.

Signed-off-by: Michael Liang <mliang@purestorage.com>
Reviewed-by: Caleb Sander <csander@purestorage.com>
Reviewed-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-07-12 09:29:48 -07:00
Minjie Du
733ba346d9 nvme: fix parameter check in nvme_fault_inject_init()
Make IS_ERR() judge the debugfs_create_dir() function return.

Signed-off-by: Minjie Du <duminjie@vivo.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-07-12 08:48:33 -07:00
Keith Busch
b718ae835b nvme: warn only once for legacy uuid attribute
Report the legacy fallback behavior for uuid attributes just once
instead of logging repeated warnings for the same condition every time
the attribute is read. The old behavior is too spamy on the kernel logs.

Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reported-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-07-12 08:48:19 -07:00
Martin K. Petersen
e96277a570 Merge branch '6.5/scsi-staging' into 6.5/scsi-fixes
Pull in the currently staged SCSI fixes for 6.5.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2023-07-11 12:15:15 -04:00
Damien Le Moal
9bcf156f58 nvmet: use PAGE_SECTORS_SHIFT
Replace occurences of the pattern "PAGE_SHIFT - 9" in the passthru and
loop targets with PAGE_SECTORS_SHIFT.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-07-10 08:46:30 -07:00
Pankaj Raghav
e5bb0988a5 nvme: add BOGUS_NID quirk for Samsung SM953
Add the quirk as SM953 is reporting bogus namespace ID.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=217593
Reported-by: Clemens Springsguth <cspringsguth@gmail.com>
Tested-by: Clemens Springsguth <cspringsguth@gmail.com>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-07-10 08:44:36 -07:00
Damien Le Moal
d226b0a2b6 scsi: nvme: zns: Set zone limits before revalidating zones
In nvme_revalidate_zones(), execute blk_queue_chunk_sectors() and
blk_queue_max_zone_append_sectors() to respectively set a ZNS namespace
zone size and maximum zone append sector limit before executing
blk_revalidate_disk_zones(). This is to allow the block layer zone
reavlidation to check these device characteristics prior to checking all
zones of the device.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230703024812.76778-3-dlemoal@kernel.org
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2023-07-05 21:58:10 -04:00
Linus Torvalds
6843306689 Including fixes from bluetooth, bpf and wireguard.
Current release - regressions:
 
  - nvme-tcp: fix comma-related oops after sendpage changes
 
 Current release - new code bugs:
 
  - ptp: make max_phase_adjustment sysfs device attribute invisible
    when not supported
 
 Previous releases - regressions:
 
  - sctp: fix potential deadlock on &net->sctp.addr_wq_lock
 
  - mptcp:
    - ensure subflow is unhashed before cleaning the backlog
    - do not rely on implicit state check in mptcp_listen()
 
 Previous releases - always broken:
 
  - net: fix net_dev_start_xmit trace event vs skb_transport_offset()
 
  - Bluetooth:
    - fix use-bdaddr-property quirk
    - L2CAP: fix multiple UaFs
    - ISO: use hci_sync for setting CIG parameters
    - hci_event: fix Set CIG Parameters error status handling
    - hci_event: fix parsing of CIS Established Event
    - MGMT: fix marking SCAN_RSP as not connectable
 
  - wireguard: queuing: use saner cpu selection wrapping
 
  - sched: act_ipt: various bug fixes for iptables <> TC interactions
 
  - sched: act_pedit: add size check for TCA_PEDIT_PARMS_EX
 
  - dsa: fixes for receiving PTP packets with 8021q and sja1105 tagging
 
  - eth: sfc: fix null-deref in devlink port without MAE access
 
  - eth: ibmvnic: do not reset dql stats on NON_FATAL err
 
 Misc:
 
  - xsk: honor SO_BINDTODEVICE on bind
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmSlu+MACgkQMUZtbf5S
 Irslgw//S7jf/GL8V6y8VL3te+/OPOZnLDTzFFOdy64/y97FE6XIacJUpyWRhtmz
 oSzcSNHETPW9U+xSGa2ZQlKhAXt6n9iRNvUegql+VBb13Iz+l7AdTeoxRv/YuwDo
 5lTOIB6cBw+ATd0oxS6wr8SyUlcvktUKBfTAItjbVM55aXfIUpXIa84+F7avJgIA
 XP1u/3PHhwItmwo/hXhHH0+P0QA8ix1q2SvRB7DAlQLBsTuQhaKjXWQkYYTKw/Nt
 dtvh8iQSs/YXaHMjTa5CK28HOD8+ywIizr+uJ9VaNqIzV0W5JE9IE8P4NFpBcY7t
 kGjTYODOph7dkNmZ5RLj3N+B6CyC57OXDzoo/tr8940UytCLVj9EVyduarLGLx57
 edqK9cUz5kWejyGoyZ4Pvlo/SKvCQ2HKMeiAJ0/nNpTJMFuygMoqGsaD6ttzkXMj
 fZLPjRUK3axd+15ZzhLEf8HyL5Qh+qPqqX9p7NljfMKwhxMWJ5fuICJfdGOSdMJR
 ndL+wPfRPFQwszZ4pbTY2Ivn29mo8ScBOSOEgQs2mOny+zFzTzmqNWz/jcFfQnjS
 cylxBEHrgudT2FuCImZ/v66TM5yakHXqIdpTGG+zsvJWQqjM96Z3I7WRvi0g9d75
 n84il+j34mnzl90j2xEutqUiK7BQ9ZpZBsutPVTKBIHKWWiortI=
 =9yzk
 -----END PGP SIGNATURE-----

Merge tag 'net-6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from bluetooth, bpf and wireguard.

  Current release - regressions:

   - nvme-tcp: fix comma-related oops after sendpage changes

  Current release - new code bugs:

   - ptp: make max_phase_adjustment sysfs device attribute invisible
     when not supported

  Previous releases - regressions:

   - sctp: fix potential deadlock on &net->sctp.addr_wq_lock

   - mptcp:
      - ensure subflow is unhashed before cleaning the backlog
      - do not rely on implicit state check in mptcp_listen()

  Previous releases - always broken:

   - net: fix net_dev_start_xmit trace event vs skb_transport_offset()

   - Bluetooth:
      - fix use-bdaddr-property quirk
      - L2CAP: fix multiple UaFs
      - ISO: use hci_sync for setting CIG parameters
      - hci_event: fix Set CIG Parameters error status handling
      - hci_event: fix parsing of CIS Established Event
      - MGMT: fix marking SCAN_RSP as not connectable

   - wireguard: queuing: use saner cpu selection wrapping

   - sched: act_ipt: various bug fixes for iptables <> TC interactions

   - sched: act_pedit: add size check for TCA_PEDIT_PARMS_EX

   - dsa: fixes for receiving PTP packets with 8021q and sja1105 tagging

   - eth: sfc: fix null-deref in devlink port without MAE access

   - eth: ibmvnic: do not reset dql stats on NON_FATAL err

  Misc:

   - xsk: honor SO_BINDTODEVICE on bind"

* tag 'net-6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (70 commits)
  nfp: clean mc addresses in application firmware when closing port
  selftests: mptcp: pm_nl_ctl: fix 32-bit support
  selftests: mptcp: depend on SYN_COOKIES
  selftests: mptcp: userspace_pm: report errors with 'remove' tests
  selftests: mptcp: userspace_pm: use correct server port
  selftests: mptcp: sockopt: return error if wrong mark
  selftests: mptcp: sockopt: use 'iptables-legacy' if available
  selftests: mptcp: connect: fail if nft supposed to work
  mptcp: do not rely on implicit state check in mptcp_listen()
  mptcp: ensure subflow is unhashed before cleaning the backlog
  s390/qeth: Fix vipa deletion
  octeontx-af: fix hardware timestamp configuration
  net: dsa: sja1105: always enable the send_meta options
  net: dsa: tag_sja1105: fix MAC DA patching from meta frames
  net: Replace strlcpy with strscpy
  pptp: Fix fib lookup calls.
  mlxsw: spectrum_router: Fix an IS_ERR() vs NULL check
  net/sched: act_pedit: Add size check for TCA_PEDIT_PARMS_EX
  xsk: Honor SO_BINDTODEVICE on bind
  ptp: Make max_phase_adjustment sysfs device attribute invisible when not supported
  ...
2023-07-05 15:44:45 -07:00
Linus Torvalds
e50df24979 block-6.5-2023-07-03
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmSjJ2IQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpsQMEACQiUBw81tXvetYhz3P/4KrrjvUobgqMU0w
 jtrxqMgPee9FbqCShpj76c+La5wu23DnlrCXoHZxFQuiQLnsX5xFV66NYVi+W1CN
 k5MHP7f2e9V0T7qJ9UoHFRV1k22LF4X6T8njEZimxsm/uXfpav/knkhI7nUDnB1K
 wxlu9akD2Bo/X9O2NTS+X6qjoawZ6rDWN15THMXlC45VzJPLmIcs07Ev+mvw21KE
 XqasoZrxEO0S8dWxmJgJGqnRIOQptTS5U+0OPBZT8H220Qp/1q0pQHPw6iLXNrkc
 w1a2W1Bge012gjJt7gCMkdDnZb76sKiyGuMbFME7DoRbLCQeaOtoSfmg7NoRI2gp
 74TCSr7dPWZUVUy5Tmsy0DCv0552vIbnlQ69W6Xwx8YkplM3FPiMpWrQ5JWEHdvv
 Zl84mLP6Yyo54JVuk9zi8q/2L0HfyfMDj4UM/mNs8hwmcUSbPO2TKdIWDaq8xPuS
 Ed+D+kg6XFux8tLnCSDLNbaD5JE+ak9gTVhNdRa/zFE04o/OeidscKEqRSYTkdXL
 2p34qtw5kEQocO4Pa3eUGO6KJCDTR36Rms5p6ZFybL4O2oZYrAbRi1TGDxaG2Hag
 GCr2vaFbmz1zbGuMpFhLha5B7HeDLs+PHOn+B1iUNjEr9RC0EOHV7moJKqjxlnCh
 4mBkK/Nlyg==
 =kSeX
 -----END PGP SIGNATURE-----

Merge tag 'block-6.5-2023-07-03' of git://git.kernel.dk/linux

Pull more block updates from Jens Axboe:
 "Mostly items that came in a bit late for the initial pull request,
  wanted to make sure they had the appropriate amount of linux-next soak
  before going upstream.

  Outside of stragglers, just generic fixes for either merge window
  items, or longer standing bugs"

* tag 'block-6.5-2023-07-03' of git://git.kernel.dk/linux: (25 commits)
  md/raid0: add discard support for the 'original' layout
  nvme: disable controller on reset state failure
  nvme: sync timeout work on failed reset
  nvme: ensure unquiesce on teardown
  cdrom/gdrom: Fix build error
  nvme: improved uring polling
  block: add request polling helper
  nvme-mpath: fix I/O failure with EAGAIN when failing over I/O
  nvme: host: fix command name spelling
  blk-sysfs: add a new attr_group for blk_mq
  blk-iocost: move wbt_enable/disable_default() out of spinlock
  blk-wbt: cleanup rwb_enabled() and wbt_disabled()
  blk-wbt: remove dead code to handle wbt enable/disable with io inflight
  blk-wbt: don't create wbt sysfs entry if CONFIG_BLK_WBT is disabled
  blk-mq: fix two misuses on RQF_USE_SCHED
  blk-throttle: Fix io statistics for cgroup v1
  bcache: Fix bcache device claiming
  bcache: Alloc holder object before async registration
  raid10: avoid spin_lock from fastpath from raid10_unplug()
  md: fix 'delete_mutex' deadlock
  ...
2023-07-03 18:48:38 -07:00
David Howells
c97d3fb9e0 nvme-tcp: Fix comma-related oops
Fix a comma that should be a semicolon.  The comma is at the end of an
if-body and thus makes the statement after (a bvec_set_page()) conditional
too, resulting in an oops because we didn't fill out the bio_vec[]:

    BUG: kernel NULL pointer dereference, address: 0000000000000008
    #PF: supervisor read access in kernel mode
    #PF: error_code(0x0000) - not-present page
    ...
    Workqueue: nvme_tcp_wq nvme_tcp_io_work [nvme_tcp]
    RIP: 0010:skb_splice_from_iter+0xf1/0x370
    ...
    Call Trace:
     tcp_sendmsg_locked+0x3a6/0xdd0
     tcp_sendmsg+0x31/0x50
     inet_sendmsg+0x47/0x80
     sock_sendmsg+0x99/0xb0
     nvme_tcp_try_send_data+0x149/0x490 [nvme_tcp]
     nvme_tcp_try_send+0x1b7/0x300 [nvme_tcp]
     nvme_tcp_io_work+0x40/0xc0 [nvme_tcp]
     process_one_work+0x21c/0x430
     worker_thread+0x54/0x3e0
     kthread+0xf8/0x130

Fixes: 7769887817 ("nvme-tcp: Use sendmsg(MSG_SPLICE_PAGES) rather then sendpage")
Reported-by: Aurelien Aptel <aaptel@nvidia.com>
Link: https://lore.kernel.org/r/253mt0il43o.fsf@mtr-vdi-124.i-did-not-set--mail-host-address--so-tickle-me/
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Sagi Grimberg <sagi@grimberg.me>
cc: Willem de Bruijn <willemb@google.com>
cc: Keith Busch <kbusch@kernel.org>
cc: Jens Axboe <axboe@fb.com>
cc: Christoph Hellwig <hch@lst.de>
cc: Chaitanya Kulkarni <kch@nvidia.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-nvme@lists.infradead.org
cc: netdev@vger.kernel.org
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-02 15:44:54 +01:00
Jens Axboe
6e34e784e7 nvme fixes for Linux 6.5
- Reduce spamming kernel logs on repeated controller updates (Breno)
  - Improved struct packing (Christophe JAILLET)
  - Misspelled command name in error logging (Damien)
  - Failover fix for temporary frozen queue (Sagi)
  - Reset error handling fixes (Keith)
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE3Fbyvv+648XNRdHTPe3zGtjzRgkFAmSfHiQACgkQPe3zGtjz
 RglOEg/+IBogV9vdGGaExlt6jKeVCLeq/dvBjTtq9adHFYnyF6s9B39GDW9QuhPv
 wGvE8FsWOVcLBVIvwS9ZiPMnjwyOD5DGQ7kqBVRiMT65JOA7sMHz22H432CjcOTM
 54m5ezdeAwdxMJrbElf6cpCQXxdpu6rlQrcpKPw5F6Gl5PWV1sG4U8c5YGsUQ+Jj
 KTjhQIIcvwWRNWfru/6CcZ3b6YxixtuGbCXnls2WhVPN1e8DIgVvQfy7eMLKYT4w
 3OrFe64nTEPSX1aT4FlV+AZN2icB8GJlhHUjIH4kjdquR49ZpItvVO38DdilJ8SU
 2cFfN1rLE9o4Kwozlxi0VgUe/TUwGL3QLlrQRvQMEdG80Y/pe4vwD0n8N/6btZLo
 EJcK4spSdsAmdAx+RTzizl+IhgV2cBWmwLfPv9qZmBODPIEHwVLXky5iVkFqYifk
 jm6PIUYur5RLZsYlW6SiZY1NBc62cigRbtC5Cu7XuxFHXZFC2MW7KWpyAHYSQxJJ
 k8MQr6vvlBLypzAZzZCm9zC3FlMvaFNM7Q4cq52s9vaL4lxQ5zgQ5SjGZnXp2Ow/
 K4PU7mo0BHSoKwfy42uYnWm42hinFSVKROEZo/E7+RoUIjl8S6uAeUyBnWYYGqsv
 g0VtZCUbSAlM8aP3ffi8UZCm84Po3zdB0s62eKNC5FG/CFtveI4=
 =ZzGx
 -----END PGP SIGNATURE-----

Merge tag 'nvme-6.5-2023-06-30' of git://git.infradead.org/nvme into block-6.5

Pull NVMe fixes from Keith:

"nvme fixes for Linux 6.5

 - Reduce spamming kernel logs on repeated controller updates (Breno)
 - Improved struct packing (Christophe JAILLET)
 - Misspelled command name in error logging (Damien)
 - Failover fix for temporary frozen queue (Sagi)
 - Reset error handling fixes (Keith)"

* tag 'nvme-6.5-2023-06-30' of git://git.infradead.org/nvme:
  nvme: disable controller on reset state failure
  nvme: sync timeout work on failed reset
  nvme: ensure unquiesce on teardown
  nvme-mpath: fix I/O failure with EAGAIN when failing over I/O
  nvme: host: fix command name spelling
  nvmet: Reorder fields in 'struct nvmet_ns'
  nvme: Print capabilities changes just once
2023-06-30 14:04:08 -06:00
Linus Torvalds
ca7ce08d6a SCSI misc on 20230629
Updates to the usual drivers (ufs, pm80xx, libata-scsi, smartpqi,
 lpfc, qla2xxx).  We have a couple of major core changes impacting
 other systems: Command Duration Limits, which spills into block and
 ATA and block level Persistent Reservation Operations, which touches
 block, nvme, target and dm (both of which are added with merge commits
 containing a cover letter explaining what's going on).
 
 Signed-off-by: James E.J. Bottomley <jejb@linux.ibm.com>
 -----BEGIN PGP SIGNATURE-----
 
 iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCZJ19cSYcamFtZXMuYm90
 dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishfZpAQCQBuWR
 ELcOhsaG5KzO6xLWcH8mjsOoxffKvazZjTKXlAD5ATEv7++E250oKS3t+yfjae5I
 Lc195MlDju85ItUQgfk=
 =U9ik
 -----END PGP SIGNATURE-----

Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull SCSI updates from James Bottomley:
 "Updates to the usual drivers (ufs, pm80xx, libata-scsi, smartpqi,
  lpfc, qla2xxx).

  We have a couple of major core changes impacting other systems:

   - Command Duration Limits, which spills into block and ATA

   - block level Persistent Reservation Operations, which touches block,
     nvme, target and dm

  Both of these are added with merge commits containing a cover letter
  explaining what's going on"

* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (187 commits)
  scsi: core: Improve warning message in scsi_device_block()
  scsi: core: Replace scsi_target_block() with scsi_block_targets()
  scsi: core: Don't wait for quiesce in scsi_device_block()
  scsi: core: Don't wait for quiesce in scsi_stop_queue()
  scsi: core: Merge scsi_internal_device_block() and device_block()
  scsi: sg: Increase number of devices
  scsi: bsg: Increase number of devices
  scsi: qla2xxx: Remove unused nvme_ls_waitq wait queue
  scsi: ufs: ufs-pci: Add support for Intel Arrow Lake
  scsi: sd: sd_zbc: Use PAGE_SECTORS_SHIFT
  scsi: ufs: wb: Add explicit flush_threshold sysfs attribute
  scsi: ufs: ufs-qcom: Switch to the new ICE API
  scsi: ufs: dt-bindings: qcom: Add ICE phandle
  scsi: ufs: ufs-mediatek: Set UFSHCD_QUIRK_MCQ_BROKEN_RTC quirk
  scsi: ufs: ufs-mediatek: Set UFSHCD_QUIRK_MCQ_BROKEN_INTR quirk
  scsi: ufs: core: Add host quirk UFSHCD_QUIRK_MCQ_BROKEN_RTC
  scsi: ufs: core: Add host quirk UFSHCD_QUIRK_MCQ_BROKEN_INTR
  scsi: ufs: core: Remove dedicated hwq for dev command
  scsi: ufs: core: mcq: Fix the incorrect OCS value for the device command
  scsi: ufs: dt-bindings: samsung,exynos: Drop unneeded quotes
  ...
2023-06-30 11:57:07 -07:00
Keith Busch
4e69d4dabd nvme: disable controller on reset state failure
If the controller is not in a RESETTING state at the point of reset
work, we have to conclude the controller is being deleted. Go to the
cleanup on this condition to ensure proper pairing of request_queue
quiesce state.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-06-30 10:48:20 -07:00
Keith Busch
a2b5d5443f nvme: sync timeout work on failed reset
Timeouts during reset will set the controller for failure, preventing
the state change to LIVE. Ensure all timeout work is synced after the
controller disabling completes to ensure we don't have any other tasks
messing with any namespace request_queue's.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-06-30 10:48:15 -07:00
Keith Busch
2ab4e5f44a nvme: ensure unquiesce on teardown
The reset work is called on quiesced IO queues, so ensure these are
unquiesced after a failed reset to flush out any pending requests.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-06-30 10:47:41 -07:00
Linus Torvalds
3a8a670eee Networking changes for 6.5.
Core
 ----
 
  - Rework the sendpage & splice implementations. Instead of feeding
    data into sockets page by page extend sendmsg handlers to support
    taking a reference on the data, controlled by a new flag called
    MSG_SPLICE_PAGES. Rework the handling of unexpected-end-of-file
    to invoke an additional callback instead of trying to predict what
    the right combination of MORE/NOTLAST flags is.
    Remove the MSG_SENDPAGE_NOTLAST flag completely.
 
  - Implement SCM_PIDFD, a new type of CMSG type analogous to
    SCM_CREDENTIALS, but it contains pidfd instead of plain pid.
 
  - Enable socket busy polling with CONFIG_RT.
 
  - Improve reliability and efficiency of reporting for ref_tracker.
 
  - Auto-generate a user space C library for various Netlink families.
 
 Protocols
 ---------
 
  - Allow TCP to shrink the advertised window when necessary, prevent
    sk_rcvbuf auto-tuning from growing the window all the way up to
    tcp_rmem[2].
 
  - Use per-VMA locking for "page-flipping" TCP receive zerocopy.
 
  - Prepare TCP for device-to-device data transfers, by making sure
    that payloads are always attached to skbs as page frags.
 
  - Make the backoff time for the first N TCP SYN retransmissions
    linear. Exponential backoff is unnecessarily conservative.
 
  - Create a new MPTCP getsockopt to retrieve all info (MPTCP_FULL_INFO).
 
  - Avoid waking up applications using TLS sockets until we have
    a full record.
 
  - Allow using kernel memory for protocol ioctl callbacks, paving
    the way to issuing ioctls over io_uring.
 
  - Add nolocalbypass option to VxLAN, forcing packets to be fully
    encapsulated even if they are destined for a local IP address.
 
  - Make TCPv4 use consistent hash in TIME_WAIT and SYN_RECV. Ensure
    in-kernel ECMP implementation (e.g. Open vSwitch) select the same
    link for all packets. Support L4 symmetric hashing in Open vSwitch.
 
  - PPPoE: make number of hash bits configurable.
 
  - Allow DNS to be overwritten by DHCPACK in the in-kernel DHCP client
    (ipconfig).
 
  - Add layer 2 miss indication and filtering, allowing higher layers
    (e.g. ACL filters) to make forwarding decisions based on whether
    packet matched forwarding state in lower devices (bridge).
 
  - Support matching on Connectivity Fault Management (CFM) packets.
 
  - Hide the "link becomes ready" IPv6 messages by demoting their
    printk level to debug.
 
  - HSR: don't enable promiscuous mode if device offloads the proto.
 
  - Support active scanning in IEEE 802.15.4.
 
  - Continue work on Multi-Link Operation for WiFi 7.
 
 BPF
 ---
 
  - Add precision propagation for subprogs and callbacks. This allows
    maintaining verification efficiency when subprograms are used,
    or in fact passing the verifier at all for complex programs,
    especially those using open-coded iterators.
 
  - Improve BPF's {g,s}setsockopt() length handling. Previously BPF
    assumed the length is always equal to the amount of written data.
    But some protos allow passing a NULL buffer to discover what
    the output buffer *should* be, without writing anything.
 
  - Accept dynptr memory as memory arguments passed to helpers.
 
  - Add routing table ID to bpf_fib_lookup BPF helper.
 
  - Support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands.
 
  - Drop bpf_capable() check in BPF_MAP_FREEZE command (used to mark
    maps as read-only).
 
  - Show target_{obj,btf}_id in tracing link fdinfo.
 
  - Addition of several new kfuncs (most of the names are self-explanatory):
    - Add a set of new dynptr kfuncs: bpf_dynptr_adjust(),
      bpf_dynptr_is_null(), bpf_dynptr_is_rdonly(), bpf_dynptr_size()
      and bpf_dynptr_clone().
    - bpf_task_under_cgroup()
    - bpf_sock_destroy() - force closing sockets
    - bpf_cpumask_first_and(), rework bpf_cpumask_any*() kfuncs
 
 Netfilter
 ---------
 
  - Relax set/map validation checks in nf_tables. Allow checking
    presence of an entry in a map without using the value.
 
  - Increase ip_vs_conn_tab_bits range for 64BIT builds.
 
  - Allow updating size of a set.
 
  - Improve NAT tuple selection when connection is closing.
 
 Driver API
 ----------
 
  - Integrate netdev with LED subsystem, to allow configuring HW
    "offloaded" blinking of LEDs based on link state and activity
    (i.e. packets coming in and out).
 
  - Support configuring rate selection pins of SFP modules.
 
  - Factor Clause 73 auto-negotiation code out of the drivers, provide
    common helper routines.
 
  - Add more fool-proof helpers for managing lifetime of MDIO devices
    associated with the PCS layer.
 
  - Allow drivers to report advanced statistics related to Time Aware
    scheduler offload (taprio).
 
  - Allow opting out of VF statistics in link dump, to allow more VFs
    to fit into the message.
 
  - Split devlink instance and devlink port operations.
 
 New hardware / drivers
 ----------------------
 
  - Ethernet:
    - Synopsys EMAC4 IP support (stmmac)
    - Marvell 88E6361 8 port (5x1GE + 3x2.5GE) switches
    - Marvell 88E6250 7 port switches
    - Microchip LAN8650/1 Rev.B0 PHYs
    - MediaTek MT7981/MT7988 built-in 1GE PHY driver
 
  - WiFi:
    - Realtek RTL8192FU, 2.4 GHz, b/g/n mode, 2T2R, 300 Mbps
    - Realtek RTL8723DS (SDIO variant)
    - Realtek RTL8851BE
 
  - CAN:
    - Fintek F81604
 
 Drivers
 -------
 
  - Ethernet NICs:
    - Intel (100G, ice):
      - support dynamic interrupt allocation
      - use meta data match instead of VF MAC addr on slow-path
    - nVidia/Mellanox:
      - extend link aggregation to handle 4, rather than just 2 ports
      - spawn sub-functions without any features by default
    - OcteonTX2:
      - support HTB (Tx scheduling/QoS) offload
      - make RSS hash generation configurable
      - support selecting Rx queue using TC filters
    - Wangxun (ngbe/txgbe):
      - add basic Tx/Rx packet offloads
      - add phylink support (SFP/PCS control)
    - Freescale/NXP (enetc):
      - report TAPRIO packet statistics
    - Solarflare/AMD:
      - support matching on IP ToS and UDP source port of outer header
      - VxLAN and GENEVE tunnel encapsulation over IPv4 or IPv6
      - add devlink dev info support for EF10
 
  - Virtual NICs:
    - Microsoft vNIC:
      - size the Rx indirection table based on requested configuration
      - support VLAN tagging
    - Amazon vNIC:
      - try to reuse Rx buffers if not fully consumed, useful for ARM
        servers running with 16kB pages
    - Google vNIC:
      - support TCP segmentation of >64kB frames
 
  - Ethernet embedded switches:
    - Marvell (mv88e6xxx):
      - enable USXGMII (88E6191X)
    - Microchip:
     - lan966x: add support for Egress Stage 0 ACL engine
     - lan966x: support mapping packet priority to internal switch
       priority (based on PCP or DSCP)
 
  - Ethernet PHYs:
    - Broadcom PHYs:
      - support for Wake-on-LAN for BCM54210E/B50212E
      - report LPI counter
    - Microsemi PHYs: support RGMII delay configuration (VSC85xx)
    - Micrel PHYs: receive timestamp in the frame (LAN8841)
    - Realtek PHYs: support optional external PHY clock
    - Altera TSE PCS: merge the driver into Lynx PCS which it is
      a variant of
 
  - CAN: Kvaser PCIEcan:
    - support packet timestamping
 
  - WiFi:
    - Intel (iwlwifi):
      - major update for new firmware and Multi-Link Operation (MLO)
      - configuration rework to drop test devices and split
        the different families
      - support for segmented PNVM images and power tables
      - new vendor entries for PPAG (platform antenna gain) feature
    - Qualcomm 802.11ax (ath11k):
      - Multiple Basic Service Set Identifier (MBSSID) and
        Enhanced MBSSID Advertisement (EMA) support in AP mode
      - support factory test mode
    - RealTek (rtw89):
      - add RSSI based antenna diversity
      - support U-NII-4 channels on 5 GHz band
    - RealTek (rtl8xxxu):
      - AP mode support for 8188f
      - support USB RX aggregation for the newer chips
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmSbJM4ACgkQMUZtbf5S
 IrtoDhAAhEim1+LBIKf4lhPcVdZ2p/TkpnwTz5jsTwSeRBAxTwuNJ2fQhFXg13E3
 MnRq6QaEp8G4/tA/gynLvQop+FEZEnv+horP0zf/XLcC8euU7UrKdrpt/4xxdP07
 IL/fFWsoUGNO+L9LNaHwBo8g7nHvOkPscHEBHc2Xrvzab56TJk6vPySfLqcpKlNZ
 CHWDwTpgRqNZzSKiSpoMVd9OVMKUXcPYHpDmfEJ5l+e8vTXmZzOLHrSELHU5nP5f
 mHV7gxkDCTshoGcaed7UTiOvgu1p6E5EchDJxiLaSUbgsd8SZ3u4oXwRxgj33RK/
 fB2+UaLrRt/DdlHvT/Ph8e8Ygu77yIXMjT49jsfur/zVA0HEA2dFb7V6QlsYRmQp
 J25pnrdXmE15llgqsC0/UOW5J1laTjII+T2T70UOAqQl4LWYAQDG4WwsAqTzU0KY
 dueydDouTp9XC2WYrRUEQxJUzxaOaazskDUHc5c8oHp/zVBT+djdgtvVR9+gi6+7
 yy4elI77FlEEqL0ItdU/lSWINayAlPLsIHkMyhSGKX0XDpKjeycPqkNx4UterXB/
 JKIR5RBWllRft+igIngIkKX0tJGMU0whngiw7d1WLw25wgu4sB53hiWWoSba14hv
 tXMxwZs5iGaPcT38oRVMZz8I1kJM4Dz3SyI7twVvi4RUut64EG4=
 =9i4I
 -----END PGP SIGNATURE-----

Merge tag 'net-next-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking changes from Jakub Kicinski:
 "WiFi 7 and sendpage changes are the biggest pieces of work for this
  release. The latter will definitely require fixes but I think that we
  got it to a reasonable point.

  Core:

   - Rework the sendpage & splice implementations

     Instead of feeding data into sockets page by page extend sendmsg
     handlers to support taking a reference on the data, controlled by a
     new flag called MSG_SPLICE_PAGES

     Rework the handling of unexpected-end-of-file to invoke an
     additional callback instead of trying to predict what the right
     combination of MORE/NOTLAST flags is

     Remove the MSG_SENDPAGE_NOTLAST flag completely

   - Implement SCM_PIDFD, a new type of CMSG type analogous to
     SCM_CREDENTIALS, but it contains pidfd instead of plain pid

   - Enable socket busy polling with CONFIG_RT

   - Improve reliability and efficiency of reporting for ref_tracker

   - Auto-generate a user space C library for various Netlink families

  Protocols:

   - Allow TCP to shrink the advertised window when necessary, prevent
     sk_rcvbuf auto-tuning from growing the window all the way up to
     tcp_rmem[2]

   - Use per-VMA locking for "page-flipping" TCP receive zerocopy

   - Prepare TCP for device-to-device data transfers, by making sure
     that payloads are always attached to skbs as page frags

   - Make the backoff time for the first N TCP SYN retransmissions
     linear. Exponential backoff is unnecessarily conservative

   - Create a new MPTCP getsockopt to retrieve all info
     (MPTCP_FULL_INFO)

   - Avoid waking up applications using TLS sockets until we have a full
     record

   - Allow using kernel memory for protocol ioctl callbacks, paving the
     way to issuing ioctls over io_uring

   - Add nolocalbypass option to VxLAN, forcing packets to be fully
     encapsulated even if they are destined for a local IP address

   - Make TCPv4 use consistent hash in TIME_WAIT and SYN_RECV. Ensure
     in-kernel ECMP implementation (e.g. Open vSwitch) select the same
     link for all packets. Support L4 symmetric hashing in Open vSwitch

   - PPPoE: make number of hash bits configurable

   - Allow DNS to be overwritten by DHCPACK in the in-kernel DHCP client
     (ipconfig)

   - Add layer 2 miss indication and filtering, allowing higher layers
     (e.g. ACL filters) to make forwarding decisions based on whether
     packet matched forwarding state in lower devices (bridge)

   - Support matching on Connectivity Fault Management (CFM) packets

   - Hide the "link becomes ready" IPv6 messages by demoting their
     printk level to debug

   - HSR: don't enable promiscuous mode if device offloads the proto

   - Support active scanning in IEEE 802.15.4

   - Continue work on Multi-Link Operation for WiFi 7

  BPF:

   - Add precision propagation for subprogs and callbacks. This allows
     maintaining verification efficiency when subprograms are used, or
     in fact passing the verifier at all for complex programs,
     especially those using open-coded iterators

   - Improve BPF's {g,s}setsockopt() length handling. Previously BPF
     assumed the length is always equal to the amount of written data.
     But some protos allow passing a NULL buffer to discover what the
     output buffer *should* be, without writing anything

   - Accept dynptr memory as memory arguments passed to helpers

   - Add routing table ID to bpf_fib_lookup BPF helper

   - Support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands

   - Drop bpf_capable() check in BPF_MAP_FREEZE command (used to mark
     maps as read-only)

   - Show target_{obj,btf}_id in tracing link fdinfo

   - Addition of several new kfuncs (most of the names are
     self-explanatory):
      - Add a set of new dynptr kfuncs: bpf_dynptr_adjust(),
        bpf_dynptr_is_null(), bpf_dynptr_is_rdonly(), bpf_dynptr_size()
        and bpf_dynptr_clone().
      - bpf_task_under_cgroup()
      - bpf_sock_destroy() - force closing sockets
      - bpf_cpumask_first_and(), rework bpf_cpumask_any*() kfuncs

  Netfilter:

   - Relax set/map validation checks in nf_tables. Allow checking
     presence of an entry in a map without using the value

   - Increase ip_vs_conn_tab_bits range for 64BIT builds

   - Allow updating size of a set

   - Improve NAT tuple selection when connection is closing

  Driver API:

   - Integrate netdev with LED subsystem, to allow configuring HW
     "offloaded" blinking of LEDs based on link state and activity
     (i.e. packets coming in and out)

   - Support configuring rate selection pins of SFP modules

   - Factor Clause 73 auto-negotiation code out of the drivers, provide
     common helper routines

   - Add more fool-proof helpers for managing lifetime of MDIO devices
     associated with the PCS layer

   - Allow drivers to report advanced statistics related to Time Aware
     scheduler offload (taprio)

   - Allow opting out of VF statistics in link dump, to allow more VFs
     to fit into the message

   - Split devlink instance and devlink port operations

  New hardware / drivers:

   - Ethernet:
      - Synopsys EMAC4 IP support (stmmac)
      - Marvell 88E6361 8 port (5x1GE + 3x2.5GE) switches
      - Marvell 88E6250 7 port switches
      - Microchip LAN8650/1 Rev.B0 PHYs
      - MediaTek MT7981/MT7988 built-in 1GE PHY driver

   - WiFi:
      - Realtek RTL8192FU, 2.4 GHz, b/g/n mode, 2T2R, 300 Mbps
      - Realtek RTL8723DS (SDIO variant)
      - Realtek RTL8851BE

   - CAN:
      - Fintek F81604

  Drivers:

   - Ethernet NICs:
      - Intel (100G, ice):
         - support dynamic interrupt allocation
         - use meta data match instead of VF MAC addr on slow-path
      - nVidia/Mellanox:
         - extend link aggregation to handle 4, rather than just 2 ports
         - spawn sub-functions without any features by default
      - OcteonTX2:
         - support HTB (Tx scheduling/QoS) offload
         - make RSS hash generation configurable
         - support selecting Rx queue using TC filters
      - Wangxun (ngbe/txgbe):
         - add basic Tx/Rx packet offloads
         - add phylink support (SFP/PCS control)
      - Freescale/NXP (enetc):
         - report TAPRIO packet statistics
      - Solarflare/AMD:
         - support matching on IP ToS and UDP source port of outer
           header
         - VxLAN and GENEVE tunnel encapsulation over IPv4 or IPv6
         - add devlink dev info support for EF10

   - Virtual NICs:
      - Microsoft vNIC:
         - size the Rx indirection table based on requested
           configuration
         - support VLAN tagging
      - Amazon vNIC:
         - try to reuse Rx buffers if not fully consumed, useful for ARM
           servers running with 16kB pages
      - Google vNIC:
         - support TCP segmentation of >64kB frames

   - Ethernet embedded switches:
      - Marvell (mv88e6xxx):
         - enable USXGMII (88E6191X)
      - Microchip:
         - lan966x: add support for Egress Stage 0 ACL engine
         - lan966x: support mapping packet priority to internal switch
           priority (based on PCP or DSCP)

   - Ethernet PHYs:
      - Broadcom PHYs:
         - support for Wake-on-LAN for BCM54210E/B50212E
         - report LPI counter
      - Microsemi PHYs: support RGMII delay configuration (VSC85xx)
      - Micrel PHYs: receive timestamp in the frame (LAN8841)
      - Realtek PHYs: support optional external PHY clock
      - Altera TSE PCS: merge the driver into Lynx PCS which it is a
        variant of

   - CAN: Kvaser PCIEcan:
      - support packet timestamping

   - WiFi:
      - Intel (iwlwifi):
         - major update for new firmware and Multi-Link Operation (MLO)
         - configuration rework to drop test devices and split the
           different families
         - support for segmented PNVM images and power tables
         - new vendor entries for PPAG (platform antenna gain) feature
      - Qualcomm 802.11ax (ath11k):
         - Multiple Basic Service Set Identifier (MBSSID) and Enhanced
           MBSSID Advertisement (EMA) support in AP mode
         - support factory test mode
      - RealTek (rtw89):
         - add RSSI based antenna diversity
         - support U-NII-4 channels on 5 GHz band
      - RealTek (rtl8xxxu):
         - AP mode support for 8188f
         - support USB RX aggregation for the newer chips"

* tag 'net-next-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1602 commits)
  net: scm: introduce and use scm_recv_unix helper
  af_unix: Skip SCM_PIDFD if scm->pid is NULL.
  net: lan743x: Simplify comparison
  netlink: Add __sock_i_ino() for __netlink_diag_dump().
  net: dsa: avoid suspicious RCU usage for synced VLAN-aware MAC addresses
  Revert "af_unix: Call scm_recv() only after scm_set_cred()."
  phylink: ReST-ify the phylink_pcs_neg_mode() kdoc
  libceph: Partially revert changes to support MSG_SPLICE_PAGES
  net: phy: mscc: fix packet loss due to RGMII delays
  net: mana: use vmalloc_array and vcalloc
  net: enetc: use vmalloc_array and vcalloc
  ionic: use vmalloc_array and vcalloc
  pds_core: use vmalloc_array and vcalloc
  gve: use vmalloc_array and vcalloc
  octeon_ep: use vmalloc_array and vcalloc
  net: usb: qmi_wwan: add u-blox 0x1312 composition
  perf trace: fix MSG_SPLICE_PAGES build error
  ipvlan: Fix return value of ipvlan_queue_xmit()
  netfilter: nf_tables: fix underflow in chain reference counter
  netfilter: nf_tables: unbind non-anonymous set if rule construction fails
  ...
2023-06-28 16:43:10 -07:00
Keith Busch
9408d8a37e nvme: improved uring polling
Drivers can poll requests directly, so use that. We just need to ensure
the driver's request was allocated from a polled hctx, so a special
driver flag is added to struct io_uring_cmd.

The allows unshared and multipath namespaces to use the same polling
callback, and multipath is guaranteed to get the same queue as the
command was submitted on. Previously multipath polling might check a
different path and poll the wrong info.

The other bonus is we don't need a bio payload in order to poll,
allowing commands like 'flush' and 'write zeroes' to be submitted on the
same high priority queue as read and write commands.

Finally, using the request based polling skips the unnecessary bio
overhead.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230612190343.2087040-3-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-28 16:09:41 -06:00
Linus Torvalds
582c161cf3 hardening updates for v6.5-rc1
- Fix KMSAN vs FORTIFY in strlcpy/strlcat (Alexander Potapenko)
 
 - Convert strreplace() to return string start (Andy Shevchenko)
 
 - Flexible array conversions (Arnd Bergmann, Wyes Karny, Kees Cook)
 
 - Add missing function prototypes seen with W=1 (Arnd Bergmann)
 
 - Fix strscpy() kerndoc typo (Arne Welzel)
 
 - Replace strlcpy() with strscpy() across many subsystems which were
   either Acked by respective maintainers or were trivial changes that
   went ignored for multiple weeks (Azeem Shaikh)
 
 - Remove unneeded cc-option test for UBSAN_TRAP (Nick Desaulniers)
 
 - Add KUnit tests for strcat()-family
 
 - Enable KUnit tests of FORTIFY wrappers under UML
 
 - Add more complete FORTIFY protections for strlcat()
 
 - Add missed disabling of FORTIFY for all arch purgatories.
 
 - Enable -fstrict-flex-arrays=3 globally
 
 - Tightening UBSAN_BOUNDS when using GCC
 
 - Improve checkpatch to check for strcpy, strncpy, and fake flex arrays
 
 - Improve use of const variables in FORTIFY
 
 - Add requested struct_size_t() helper for types not pointers
 
 - Add __counted_by macro for annotating flexible array size members
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmSbftQWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJj0MD/9X9jzJzCmsAU+yNldeoAzC84Sk
 GVU3RBxGcTNysL1gZXynkIgigw7DWc4htMGeSABHHwQRVP65JCH1Kw/VqIkyumbx
 9LdX6IklMJb4pRT4PVU3azebV4eNmSjlur2UxMeW54Czm91/6I8RHbJOyAPnOUmo
 2oomGdP/hpEHtKR7hgy8Axc6w5ySwQixh2V5sVZG3VbvCS5WKTmTXbs6puuRT5hz
 iHt7v+7VtEg/Qf1W7J2oxfoghvVBsaRrSLrExWT/oZYh1ZxM7DsCAAoG/IsDgHGA
 9LBXiRECgAFThbHVxLvvKZQMXdVk0i8iXLX43XMKC0wTA+NTyH7wlcQQ4RWNMuo8
 sfA9Qm9gMArXaf64aymr3Uwn20Zan0391HdlbhOJZAE6v3PPJbleUnM58AzD2d3r
 5Lz6AIFBxDImy+3f9iDWgacCT5/PkeiXTHzk9QnKhJyKKtRA58XJxj4q2+rPnGJP
 n4haXqoxD5FJbxdXiGKk31RS0U5HBug7wkOcUrTqDHUbc/QNU2b7dxTKUx+zYtCU
 uV5emPzpF4H4z+91WpO47n9gkMAfwV0lt9S2dwS8pxsgqctbmIan+Jgip7rsqZ2G
 OgLXBsb43eEs+6WgO8tVt/ZHYj9ivGMdrcNcsIfikzNs/xweUJ53k2xSEn2xEa5J
 cwANDmkL6QQK7yfeeg==
 =s0j1
 -----END PGP SIGNATURE-----

Merge tag 'hardening-v6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull hardening updates from Kees Cook:
 "There are three areas of note:

  A bunch of strlcpy()->strscpy() conversions ended up living in my tree
  since they were either Acked by maintainers for me to carry, or got
  ignored for multiple weeks (and were trivial changes).

  The compiler option '-fstrict-flex-arrays=3' has been enabled
  globally, and has been in -next for the entire devel cycle. This
  changes compiler diagnostics (though mainly just -Warray-bounds which
  is disabled) and potential UBSAN_BOUNDS and FORTIFY _warning_
  coverage. In other words, there are no new restrictions, just
  potentially new warnings. Any new FORTIFY warnings we've seen have
  been fixed (usually in their respective subsystem trees). For more
  details, see commit df8fc4e934.

  The under-development compiler attribute __counted_by has been added
  so that we can start annotating flexible array members with their
  associated structure member that tracks the count of flexible array
  elements at run-time. It is possible (likely?) that the exact syntax
  of the attribute will change before it is finalized, but GCC and Clang
  are working together to sort it out. Any changes can be made to the
  macro while we continue to add annotations.

  As an example of that last case, I have a treewide commit waiting with
  such annotations found via Coccinelle:

    https://git.kernel.org/linus/adc5b3cb48a049563dc673f348eab7b6beba8a9b

  Also see commit dd06e72e68 for more details.

  Summary:

   - Fix KMSAN vs FORTIFY in strlcpy/strlcat (Alexander Potapenko)

   - Convert strreplace() to return string start (Andy Shevchenko)

   - Flexible array conversions (Arnd Bergmann, Wyes Karny, Kees Cook)

   - Add missing function prototypes seen with W=1 (Arnd Bergmann)

   - Fix strscpy() kerndoc typo (Arne Welzel)

   - Replace strlcpy() with strscpy() across many subsystems which were
     either Acked by respective maintainers or were trivial changes that
     went ignored for multiple weeks (Azeem Shaikh)

   - Remove unneeded cc-option test for UBSAN_TRAP (Nick Desaulniers)

   - Add KUnit tests for strcat()-family

   - Enable KUnit tests of FORTIFY wrappers under UML

   - Add more complete FORTIFY protections for strlcat()

   - Add missed disabling of FORTIFY for all arch purgatories.

   - Enable -fstrict-flex-arrays=3 globally

   - Tightening UBSAN_BOUNDS when using GCC

   - Improve checkpatch to check for strcpy, strncpy, and fake flex
     arrays

   - Improve use of const variables in FORTIFY

   - Add requested struct_size_t() helper for types not pointers

   - Add __counted_by macro for annotating flexible array size members"

* tag 'hardening-v6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (54 commits)
  netfilter: ipset: Replace strlcpy with strscpy
  uml: Replace strlcpy with strscpy
  um: Use HOST_DIR for mrproper
  kallsyms: Replace all non-returning strlcpy with strscpy
  sh: Replace all non-returning strlcpy with strscpy
  of/flattree: Replace all non-returning strlcpy with strscpy
  sparc64: Replace all non-returning strlcpy with strscpy
  Hexagon: Replace all non-returning strlcpy with strscpy
  kobject: Use return value of strreplace()
  lib/string_helpers: Change returned value of the strreplace()
  jbd2: Avoid printing outside the boundary of the buffer
  checkpatch: Check for 0-length and 1-element arrays
  riscv/purgatory: Do not use fortified string functions
  s390/purgatory: Do not use fortified string functions
  x86/purgatory: Do not use fortified string functions
  acpi: Replace struct acpi_table_slit 1-element array with flex-array
  clocksource: Replace all non-returning strlcpy with strscpy
  string: use __builtin_memcpy() in strlcpy/strlcat
  staging: most: Replace all non-returning strlcpy with strscpy
  drm/i2c: tda998x: Replace all non-returning strlcpy with strscpy
  ...
2023-06-27 21:24:18 -07:00
Sagi Grimberg
99160af413 nvme-mpath: fix I/O failure with EAGAIN when failing over I/O
It is possible that the next available path we failover to, happens to
be frozen (for example if it is during connection establishment). If
the original I/O was set with NOWAIT, this cause the I/O to unnecessarily
fail because the request queue cannot be entered, hence the I/O fails with
EAGAIN.

The NOWAIT restriction that was originally set for the I/O is no longer
relevant or needed because this is the nvme requeue context. Hence we
clear the REQ_NOWAIT flag when failing over I/O.

This fix a simple test case of nvme controller reset during I/O when the
multipath device that has only a single path and I/O fails with "Resource
temporarily unavailable" errno. Note that this reproduces with io_uring
which by default sets IOCB_NOWAIT by default.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-06-27 08:16:26 -07:00
Damien Le Moal
86da1bae4c nvme: host: fix command name spelling
Correctly spell "Zeroes" in nvme_cmd_write_zeroes command name.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-06-27 08:14:11 -07:00
Linus Torvalds
a0433f8cae for-6.5/block-2023-06-23
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmSV8dwQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpilGD/9Yys1oxIXJpRf00fzrylAlBthRxMjFQVWw
 zAut106hAQiBHvU8IkmGA3MvEFVHxtzwYhHI7IR8K3aZBIqscweCqmVI9JyogJw9
 U9Twnzel47VmuKdM94FeoN+hbj1fP8EWTjzmy67/zEEfFCdmHvNlMi3lSrGYIpFy
 39LxTB99Y4UarM5PtWbes37GYYljzMSWKuo4AfBkvq1eQa+sZ0Vq2xAABKq3UM7f
 apqhgHtkJooRePDP0eQp+kAyyVMgW2jIK+oIdJDxNF3CKTu2w40RzaYz6fp+jVSU
 H4R/xS59GW4/xql+VBJDh/qJg9K62DPPYjlW8BmSR8+IjvfFpsyH3/MacE50CD3P
 20fs/Mnj49H79fDrQEHJI53cOOb2EmUitbwLbvOcColNTPpt8loBtdQxjF2RMU8R
 Nyort9DJPFclYCxky1LYg1CNEC2Ln4Zy/jD47wPvqRmOQphOoVlV/hPnOEqvjaZC
 49Vn70W2DeE9cXvYI7ha+XIg6/oj+Gs3iusEbV08Ci7EAtXgI+ZUUsQ97K8UNiUh
 h2lqSJtuI7lBpYP9sf+BeCch5UCC+xGYyTdoM5f58lehWBBPtbs0g7S9RyRyOYxe
 n+yxEUo3dAGzJ/xsKAjinbZfeWIpr0b1TkAh4w3Cq/BKzRr9Bp8lBAxYuancbQ+Y
 1ADPteUOTA==
 =zP4Y
 -----END PGP SIGNATURE-----

Merge tag 'for-6.5/block-2023-06-23' of git://git.kernel.dk/linux

Pull block updates from Jens Axboe:

 - NVMe pull request via Keith:
      - Various cleanups all around (Irvin, Chaitanya, Christophe)
      - Better struct packing (Christophe JAILLET)
      - Reduce controller error logs for optional commands (Keith)
      - Support for >=64KiB block sizes (Daniel Gomez)
      - Fabrics fixes and code organization (Max, Chaitanya, Daniel
        Wagner)

 - bcache updates via Coly:
      - Fix a race at init time (Mingzhe Zou)
      - Misc fixes and cleanups (Andrea, Thomas, Zheng, Ye)

 - use page pinning in the block layer for dio (David)

 - convert old block dio code to page pinning (David, Christoph)

 - cleanups for pktcdvd (Andy)

 - cleanups for rnbd (Guoqing)

 - use the unchecked __bio_add_page() for the initial single page
   additions (Johannes)

 - fix overflows in the Amiga partition handling code (Michael)

 - improve mq-deadline zoned device support (Bart)

 - keep passthrough requests out of the IO schedulers (Christoph, Ming)

 - improve support for flush requests, making them less special to deal
   with (Christoph)

 - add bdev holder ops and shutdown methods (Christoph)

 - fix the name_to_dev_t() situation and use cases (Christoph)

 - decouple the block open flags from fmode_t (Christoph)

 - ublk updates and cleanups, including adding user copy support (Ming)

 - BFQ sanity checking (Bart)

 - convert brd from radix to xarray (Pankaj)

 - constify various structures (Thomas, Ivan)

 - more fine grained persistent reservation ioctl capability checks
   (Jingbo)

 - misc fixes and cleanups (Arnd, Azeem, Demi, Ed, Hengqi, Hou, Jan,
   Jordy, Li, Min, Yu, Zhong, Waiman)

* tag 'for-6.5/block-2023-06-23' of git://git.kernel.dk/linux: (266 commits)
  scsi/sg: don't grab scsi host module reference
  ext4: Fix warning in blkdev_put()
  block: don't return -EINVAL for not found names in devt_from_devname
  cdrom: Fix spectre-v1 gadget
  block: Improve kernel-doc headers
  blk-mq: don't insert passthrough request into sw queue
  bsg: make bsg_class a static const structure
  ublk: make ublk_chr_class a static const structure
  aoe: make aoe_class a static const structure
  block/rnbd: make all 'class' structures const
  block: fix the exclusive open mask in disk_scan_partitions
  block: add overflow checks for Amiga partition support
  block: change all __u32 annotations to __be32 in affs_hardblocks.h
  block: fix signed int overflow in Amiga partition support
  block: add capacity validation in bdev_add_partition()
  block: fine-granular CAP_SYS_ADMIN for Persistent Reservation
  block: disallow Persistent Reservation on partitions
  reiserfs: fix blkdev_put() warning from release_journal_dev()
  block: fix wrong mode for blkdev_get_by_dev() from disk_scan_partitions()
  block: document the holder argument to blkdev_get_by_path
  ...
2023-06-26 12:47:20 -07:00
Linus Torvalds
0aa69d53ac for-6.5/io_uring-2023-06-23
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmSV8cEQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpnvZD/0QWstFCe1CSLWaycdC9fhWftFt3hyEIST5
 CYEL56UZrDWNkv9xTLe855xvMavjd4sdHlUa8NUPghRQeJyKYgRxHBLXRWmy0uNN
 l47Zjiwsolmbr3Nt6qViLdCDYmG39ZGNwWo8b6p3ybWYLtzxeOblocOBTPzoCtkS
 hjo7Z0eMONvsvLX+l0o9IDdWtZIQ2fGo4VYkMIVb6CyxRPpuUuPKbE25qaTx+uBg
 Fy6Qa3SlaTwzqcg3dttggjIP792L/eETUCWGndg5pJNbrkj/fI4Vm4bEljID76eS
 HODl+pWHmyM6avVkypr7N3Tp5HKF0OTUa4vJTLIZo1QiRu6zlXphtuvGn+McEmgV
 hbYmQMYWzqJ22k2iEpCR58pdhmZJC9uB8r4Rwgr/t9GKqt4E+15EzmqkG9cUVMGV
 rfbBwVLwBUd5+0WHwQ8RzdtaUPt17vSIW/8WhU5zoMGVotqVBHO/H+5BtmKPWWpq
 fx1etQ8XJVPIxziJvgsEitb1s6KZzJspcONDlLEitmZkflv3gGdVm99KNbXwJpcp
 m6+FcYQ5d5FivfLPGgpx8go+4M2QuoW2yRGwZHu54buCnpxgNjIk898OjrUrdXCg
 3/0m99GXmOWQQl0VrrTr+Fv99nVsQ2hMQzOFJGMYRtHEEc5xiTcJiZmoxmF7T7/C
 TipyW3czsw==
 =5Me8
 -----END PGP SIGNATURE-----

Merge tag 'for-6.5/io_uring-2023-06-23' of git://git.kernel.dk/linux

Pull io_uring updates from Jens Axboe:
 "Nothing major in this release, just a bunch of cleanups and some
  optimizations around networking mostly.

   - clean up file request flags handling (Christoph)

   - clean up request freeing and CQ locking (Pavel)

   - support for using pre-registering the io_uring fd at setup time
     (Josh)

   - Add support for user allocated ring memory, rather than having the
     kernel allocate it. Mostly for packing rings into a huge page (me)

   - avoid an unnecessary double retry on receive (me)

   - maintain ordering for task_work, which also improves performance
     (me)

   - misc cleanups/fixes (Pavel, me)"

* tag 'for-6.5/io_uring-2023-06-23' of git://git.kernel.dk/linux: (39 commits)
  io_uring: merge conditional unlock flush helpers
  io_uring: make io_cq_unlock_post static
  io_uring: inline __io_cq_unlock
  io_uring: fix acquire/release annotations
  io_uring: kill io_cq_unlock()
  io_uring: remove IOU_F_TWQ_FORCE_NORMAL
  io_uring: don't batch task put on reqs free
  io_uring: move io_clean_op()
  io_uring: inline io_dismantle_req()
  io_uring: remove io_free_req_tw
  io_uring: open code io_put_req_find_next
  io_uring: add helpers to decode the fixed file file_ptr
  io_uring: use io_file_from_index in io_msg_grab_file
  io_uring: use io_file_from_index in __io_sync_cancel
  io_uring: return REQ_F_ flags from io_file_get_flags
  io_uring: remove io_req_ffs_set
  io_uring: remove a confusing comment above io_file_get_flags
  io_uring: remove the mode variable in io_file_get_flags
  io_uring: remove __io_file_supports_nowait
  io_uring: wait interruptibly for request completions on exit
  ...
2023-06-26 12:30:26 -07:00
David Howells
c336a79983 nvmet-tcp: Use sendmsg(MSG_SPLICE_PAGES) rather then sendpage
When transmitting data, call down into TCP using a single sendmsg with
MSG_SPLICE_PAGES to indicate that content should be spliced rather than
copied instead of calling sendpage.

Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Sagi Grimberg <sagi@grimberg.me>
Acked-by: Willem de Bruijn <willemb@google.com>
cc: Keith Busch <kbusch@kernel.org>
cc: Jens Axboe <axboe@fb.com>
cc: Christoph Hellwig <hch@lst.de>
cc: Chaitanya Kulkarni <kch@nvidia.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-nvme@lists.infradead.org
Link: https://lore.kernel.org/r/20230623225513.2732256-9-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-24 15:50:12 -07:00
David Howells
7769887817 nvme-tcp: Use sendmsg(MSG_SPLICE_PAGES) rather then sendpage
When transmitting data, call down into TCP using a sendmsg with
MSG_SPLICE_PAGES instead of sendpage.

Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Sagi Grimberg <sagi@grimberg.me>
Acked-by: Willem de Bruijn <willemb@google.com>
cc: Keith Busch <kbusch@kernel.org>
cc: Jens Axboe <axboe@fb.com>
cc: Christoph Hellwig <hch@lst.de>
cc: Chaitanya Kulkarni <kch@nvidia.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-nvme@lists.infradead.org
Link: https://lore.kernel.org/r/20230623225513.2732256-8-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-24 15:50:12 -07:00
Christophe JAILLET
9d16d26477 nvmet: Reorder fields in 'struct nvmet_ns'
Group some variables based on their sizes to reduce holes.
On x86_64, this shrinks the size of 'struct nvmet_ns' from 520 to 512
bytes.

When such a structure is allocated in nvmet_ns_alloc(), because of the way
memory allocation works, when 520 bytes were requested, 1024 bytes were
allocated.

So, on x86_64, this change saves 512 bytes per allocation.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-06-21 09:41:40 -07:00
Breno Leitao
d0dd594bed nvme: Print capabilities changes just once
This current dev_info() could be very verbose and being printed very
frequently depending on some userspace application sending some specific
commands.

Just print this message once and skip it until the controller resets.
Use a controller flag (NVME_CTRL_DIRTY_CAPABILITY) to track if the
capability needs a reset.

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-06-21 09:40:05 -07:00
Keith Busch
1c606f7f05 nvme: forward port sysfs delete fix
We had a late fix that modified nvme_sysfs_delete() after the staging
branch for the next merge window relocated the function to a new file.
Port commit 2eb94dd56a ("nvme: do not let the user delete a ctrl
before a complete") to the latest to avoid a potentially confusing merge
conflict.

Cc: Maurizio Lombardi <mlombard@redhat.com>
Cc: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-06-16 08:15:57 -07:00
Keith Busch
c917dd96fe nvme: skip optional id ctrl csi if it failed
A frequently recieved report is the driver requests the optional Command
Set Specific Identify Controller structure. Some controllers report this
in their error log, which tiggers other warnings to user space
monitoring the devices.

These error entries are harmless and of questionable value to save in
the log, but let's reduce their occurance by not resending the command
if it previously failed. This will not prevent the errors on the initial
module load, but will greatly reduce their occurance on any rescans and
resumes from suspend.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=217445
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-06-12 18:24:15 -07:00
Irvin Cote
35e797b024 nvme-core: use nvme_ns_head_multipath instead of ns->head->disk
Change the way we check for a multipath nshead so as
to consistently use the same check to assert the same condition.

Signed-off-by: Irvin Cote <irvincoteg@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-06-12 10:44:57 -07:00
Daniel Wagner
d97b4111b3 nvmet-fcloop: Do not wait on completion when unregister fails
The nvme_fc_unregister_localport() returns an error code in case that
the locaport pointer is NULL or has already been unegisterd. localport is
is either in the ONLINE state (all resources allocated) or has already
been put into DELETED state.

In this case we will never receive an wakeup call and thus any caller
will hang, e.g. module unload.

Signed-off-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-06-12 10:37:00 -07:00
Chaitanya Kulkarni
959ffef13b nvme-fabrics: open code __nvmf_host_find()
There is no point in maintaining a separate funciton __nvmf_host_find()
that has only one caller nvmf_host_add() especially when caller and
callee both are small enough to merge.

Due to this we are actually repeating the error handling code in both
callee and caller for no reason that can be avoided, but instead we have
to read both function to establish the correctness along with additional
lockdep warning check due to involved locking.

Just open code __nvmf_host_find() in nvme_host_alloc() with appropriate
comment that removes repeated error checks in the callee/caller and
lockdep check that is needed for the nvmf_hosts_mutex involvement,
diffstats :-

 drivers/nvme/host/fabrics.c | 75 +++++++++++++------------------------
 1 file changed, 27 insertions(+), 48 deletions(-)

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-06-12 10:37:00 -07:00
Chaitanya Kulkarni
900095bfbb nvme-fabrics: error out to unlock the mutex
Currently, in the nvmf_host_add() function, if the nvmf_host_alloc()
call failed to allocate memory for the host, the code would directly
return -ENOMEM without unlocking the nvmf_hosts_mutex. This could
lead to potential issues with mutex synchronization.

Fix that error handling mechanism by jumping to the out_unlock label
when nvmf_host_alloc() fails. This ensures that the mutex is unlocked
before returning the error code. The updated code enhances avoids
possible deadlocks.

Fixes: f0cebf82004d ("nvme-fabrics: prevent overriding of existing host")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Julia Lawall <julia.lawall@inria.fr>
Closes: https://lore.kernel.org/r/202306020909.MTUEBeIa-lkp@intel.com/
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Julia Lawall <julia.lawall@inria.fr>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-06-12 10:36:59 -07:00
Daniel Gomez
bdbfcd5f6c nvme: Increase block size variable size to 32-bit
Increase block size variable size to 32-bit unsigned to be able to
support block devices larger than 32k (starting from 64 KiB).

Physical and logical block size already support unsigned 32-bit.

Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-06-12 10:36:59 -07:00
Chaitanya Kulkarni
e9227d486e nvme-fcloop: no need to return from void function
Remove return at the end of void function.

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-06-12 10:36:59 -07:00
Chaitanya Kulkarni
94c78ea124 nvmet-auth: remove unnecessary break after goto
Remove dead break after goto.

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-06-12 10:36:59 -07:00
Christophe JAILLET
2ad0713c73 nvmet-auth: remove some dead code
'status' is known to be 0 at the point.
And nvmet_auth_challenge() return a -E<ERROR_CODE> or 0.
So these lines of code should just be removed.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-06-12 10:36:59 -07:00
Irvin Cote
2110a6bcd7 nvme-core: remove redundant check from nvme_init_ns_head
nvme_find_ns_head already checks that the list of namescpaces
in an already existing namespace head is not empty

Signed-off-by: Irvin Cote <irvincoteg@gmail.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-06-12 10:36:59 -07:00
Max Gurtovoy
942e21c042 nvme: move sysfs code to a dedicated sysfs.c file
The core.c file became long and hard to maintain. Create a dedicated
file to centralize the sysfs functionality. This is a common practice to
separate sysfs/configfs related logic from the main driver logic .c file.
For example, in the nvmet module the configfs interface has its own
dedicated file.

This patch does not include any functional changes.

Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
[merged dhchap memleak fixes, include nvme-auth.h]
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-06-12 10:36:59 -07:00
Max Gurtovoy
ae8bd606e0 nvme-fabrics: prevent overriding of existing host
When first connecting a target using the "default" host parameters,
setting the hostid from the command line during a subsequent connection
establishment would override the "default" hostid parameter. This would
cause an existing connection that is already using the host definitions
to lose its hostid.

To address this issue, the code has been modified to allow only 1:1
mapping between hostnqn and hostid. This will maintain unambiguous host
identification. Any non 1:1 mapping will be rejected during connection
establishment.

Tested-by: Noam Gottlieb <ngottlieb@nvidia.com>
Reviewed-by: Israel Rukshin <israelr@nvidia.com>
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-06-12 10:36:58 -07:00
Max Gurtovoy
5e4b55fa52 nvme-fabrics: check hostid using uuid_equal
Use a dedicated function to match uuids instead of duplicating it.

Tested-by: Noam Gottlieb <ngottlieb@nvidia.com>
Reviewed-by: Israel Rukshin <israelr@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-06-12 10:36:58 -07:00
Max Gurtovoy
b86d6595f7 nvme-fabrics: unify common code in admin and io queue connect
To simplify code maintenance, it is recommended to avoid duplicating
code.

Tested-by: Noam Gottlieb <ngottlieb@nvidia.com>
Reviewed-by: Israel Rukshin <israelr@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-06-12 10:36:58 -07:00
Christophe JAILLET
0f5335e158 nvmet: reorder fields in 'struct nvme_dhchap_queue_context'
Group some variables based on their sizes to reduce holes.
On x86_64, this shrinks the size of 'struct nvme_dhchap_queue_context' from
416 to 400 bytes.

This structure is kvcalloc()'ed in nvme_auth_init_ctrl(), so it is likely
that the allocation can be relatively big. Saving 16 bytes per structure
may might a slight difference.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-06-12 10:36:51 -07:00
Christophe JAILLET
e64b0c807c nvmet: reorder fields in 'struct nvmf_ctrl_options'
Group some variables based on their sizes to reduce holes.
On x86_64, this shrinks the size of 'struct nvmf_ctrl_options' from 136 to
128 bytes.

When such a structure is allocated in nvmf_create_ctrl(), because of the
way memory allocation works, when 136 bytes were requested, 192 bytes were
allocated.

So this saves 64 bytes per allocation, 1 cache line to hold the whole
structure and a few cycles when zeroing the memory in nvmf_create_ctrl().

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-06-12 10:36:47 -07:00
Christophe JAILLET
9d217fb0e7 nvme: reorder fields in 'struct nvme_ctrl'
Group some variables based on their sizes to reduce holes.
On x86_64, this shrinks the size of 'struct nvme_ctrl' from 5368 to 5344
bytes when all CONFIG_* are defined.

This structure is embedded into some other structures, so it helps reducing
their size as well.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-06-12 10:36:42 -07:00
Christophe JAILLET
c60651e32f nvmet: reorder fields in 'struct nvmet_sq'
Group some variables based on their sizes to reduce holes.
On x86_64, this shrinks the size of 'struct nvmet_sq' from 472 to 464
bytes when CONFIG_NVME_TARGET_AUTH is defined.

This structure is embedded into some other structures, so it helps reducing
their sizes as well.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-06-12 10:36:37 -07:00
Keith Busch
a249d3066d nvme-fabrics: add queue setup helpers
tcp and rdma transports have lots of duplicate code setting up the
different queue mappings. Add common helpers.

Cc: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-06-12 10:33:03 -07:00
Irvin Cote
4a4d9bc0c8 nvme-pci: cleaning up nvme_pci_init_request
Erase the superfluous line that retrieves the nvme_dev.

Signed-off-by: Irvin Cote <irvincoteg@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-06-12 10:33:03 -07:00
Max Gurtovoy
f3f2837315 nvme-rdma: fix typo in comment
There is no ib_stop_cq API and the need for the +1 is for ib_drain_qp.

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Israel Rukshin <israelr@nvidia.com>
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-06-12 10:33:02 -07:00
Chaitanya Kulkarni
7ed5cf8e6d nvme-core: fix dev_pm_qos memleak
Call dev_pm_qos_hide_latency_tolerance() in the error unwind patch to
avoid following kmemleak:-

blktests (master) # kmemleak-clear; ./check nvme/044;
blktests (master) # kmemleak-scan ; kmemleak-show
nvme/044 (Test bi-directional authentication)                [passed]
    runtime  2.111s  ...  2.124s
unreferenced object 0xffff888110c46240 (size 96):
  comm "nvme", pid 33461, jiffies 4345365353 (age 75.586s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<0000000069ac2cec>] kmalloc_trace+0x25/0x90
    [<000000006acc66d5>] dev_pm_qos_update_user_latency_tolerance+0x6f/0x100
    [<00000000cc376ea7>] nvme_init_ctrl+0x38e/0x410 [nvme_core]
    [<000000007df61b4b>] 0xffffffffc05e88b3
    [<00000000d152b985>] 0xffffffffc05744cb
    [<00000000f04a4041>] vfs_write+0xc5/0x3c0
    [<00000000f9491baf>] ksys_write+0x5f/0xe0
    [<000000001c46513d>] do_syscall_64+0x3b/0x90
    [<00000000ecf348fe>] entry_SYSCALL_64_after_hwframe+0x72/0xdc

Link: https://lore.kernel.org/linux-nvme/CAHj4cs-nDaKzMx2txO4dbE+Mz9ePwLtU0e3egz+StmzOUgWUrA@mail.gmail.com/
Fixes: f50fff73d6 ("nvme: implement In-Band authentication")
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-06-12 10:33:02 -07:00
Chaitanya Kulkarni
3a12a0b868 nvme-core: add missing fault-injection cleanup
Add missing fault-injection cleanup in nvme_init_ctrl() in the error
unwind path that also fixes following message for blktests:-

linux-block (for-next) # grep debugfs debugfs-err.log
[  147.853464] debugfs: Directory 'nvme1' with parent '/' already present!
[  147.853973] nvme1: failed to create debugfs attr
[  148.802490] debugfs: Directory 'nvme1' with parent '/' already present!
[  148.803244] nvme1: failed to create debugfs attr
[  148.877304] debugfs: Directory 'nvme1' with parent '/' already present!
[  148.877775] nvme1: failed to create debugfs attr
[  149.816652] debugfs: Directory 'nvme1' with parent '/' already present!
[  149.818011] nvme1: failed to create debugfs attr

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-06-12 10:33:02 -07:00
Chaitanya Kulkarni
99c2dcc8ff nvme-core: fix memory leak in dhchap_ctrl_secret
Free dhchap_secret in nvme_ctrl_dhchap_ctrl_secret_store() before we
return when nvme_auth_generate_key() returns error.

Fixes: f50fff73d6 ("nvme: implement In-Band authentication")
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-06-12 10:33:02 -07:00
Chaitanya Kulkarni
a836ca33c5 nvme-core: fix memory leak in dhchap_secret_store
Free dhchap_secret in nvme_ctrl_dhchap_secret_store() before we return
fix following kmemleack:-

unreferenced object 0xffff8886376ea800 (size 64):
  comm "check", pid 22048, jiffies 4344316705 (age 92.199s)
  hex dump (first 32 bytes):
    44 48 48 43 2d 31 3a 30 30 3a 6e 78 72 35 4b 67  DHHC-1:00:nxr5Kg
    75 58 34 75 6f 41 78 73 4a 61 34 63 2f 68 75 4c  uX4uoAxsJa4c/huL
  backtrace:
    [<0000000030ce5d4b>] __kmalloc+0x4b/0x130
    [<000000009be1cdc1>] nvme_ctrl_dhchap_secret_store+0x8f/0x160 [nvme_core]
    [<00000000ac06c96a>] kernfs_fop_write_iter+0x12b/0x1c0
    [<00000000437e7ced>] vfs_write+0x2ba/0x3c0
    [<00000000f9491baf>] ksys_write+0x5f/0xe0
    [<000000001c46513d>] do_syscall_64+0x3b/0x90
    [<00000000ecf348fe>] entry_SYSCALL_64_after_hwframe+0x72/0xdc
unreferenced object 0xffff8886376eaf00 (size 64):
  comm "check", pid 22048, jiffies 4344316736 (age 92.168s)
  hex dump (first 32 bytes):
    44 48 48 43 2d 31 3a 30 30 3a 6e 78 72 35 4b 67  DHHC-1:00:nxr5Kg
    75 58 34 75 6f 41 78 73 4a 61 34 63 2f 68 75 4c  uX4uoAxsJa4c/huL
  backtrace:
    [<0000000030ce5d4b>] __kmalloc+0x4b/0x130
    [<000000009be1cdc1>] nvme_ctrl_dhchap_secret_store+0x8f/0x160 [nvme_core]
    [<00000000ac06c96a>] kernfs_fop_write_iter+0x12b/0x1c0
    [<00000000437e7ced>] vfs_write+0x2ba/0x3c0
    [<00000000f9491baf>] ksys_write+0x5f/0xe0
    [<000000001c46513d>] do_syscall_64+0x3b/0x90
    [<00000000ecf348fe>] entry_SYSCALL_64_after_hwframe+0x72/0xdc

Fixes: f50fff73d6 ("nvme: implement In-Band authentication")
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-06-12 10:33:02 -07:00
Christoph Hellwig
05bdb99653 block: replace fmode_t with a block-specific type for block open flags
The only overlap between the block open flags mapped into the fmode_t and
other uses of fmode_t are FMODE_READ and FMODE_WRITE.  Define a new
blk_mode_t instead for use in blkdev_get_by_{dev,path}, ->open and
->ioctl and stop abusing fmode_t.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jack Wang <jinpu.wang@ionos.com>		[rnbd]
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/r/20230608110258.189493-28-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-12 08:04:05 -06:00
Christoph Hellwig
7d9d7d59d4 nvme: replace the fmode_t argument to the nvme ioctl handlers with a simple bool
Instead of passing a fmode_t and only checking it fo0r FMODE_WRITE, pass
a bool open_for_write to prepare for callers that won't have the fmode_t.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20230608110258.189493-22-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-12 08:04:04 -06:00
Christoph Hellwig
2736e8eeb0 block: use the holder as indication for exclusive opens
The current interface for exclusive opens is rather confusing as it
requires both the FMODE_EXCL flag and a holder.  Remove the need to pass
FMODE_EXCL and just key off the exclusive open off a non-NULL holder.

For blkdev_put this requires adding the holder argument, which provides
better debug checking that only the holder actually releases the hold,
but at the same time allows removing the now superfluous mode argument.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: David Sterba <dsterba@suse.com>		[btrfs]
Acked-by: Jack Wang <jinpu.wang@ionos.com>		[rnbd]
Link: https://lore.kernel.org/r/20230608110258.189493-16-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-12 08:04:04 -06:00
Christoph Hellwig
ae220766d8 block: remove the unused mode argument to ->release
The mode argument to the ->release block_device_operation is never used,
so remove it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: Jack Wang <jinpu.wang@ionos.com>			[rnbd]
Link: https://lore.kernel.org/r/20230608110258.189493-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-12 08:04:04 -06:00
Christoph Hellwig
d32e2bf837 block: pass a gendisk to ->open
->open is only called on the whole device.  Make that explicit by
passing a gendisk instead of the block_device.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: Jack Wang <jinpu.wang@ionos.com>		[rnbd]
Link: https://lore.kernel.org/r/20230608110258.189493-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-12 08:04:04 -06:00
Christoph Hellwig
0718afd47f block: introduce holder ops
Add a new blk_holder_ops structure, which is passed to blkdev_get_by_* and
installed in the block_device for exclusive claims.  It will be used to
allow the block layer to call back into the user of the block device for
thing like notification of a removed device or a device resize.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Link: https://lore.kernel.org/r/20230601094459.1350643-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-05 10:53:04 -06:00
Christoph Hellwig
8563037977 nvme: fix the name of Zone Append for verbose logging
No Management involved in Zone Appened.

Fixes: bd83fe6f2c ("nvme: add verbose error logging")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alan Adamson <alan.adamson@oracle.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-05-31 09:21:26 -07:00
Uday Shankar
c7275ce6a5 nvme: improve handling of long keep alives
Upon keep alive completion, nvme_keep_alive_work is scheduled with the
same delay every time. If keep alive commands are completing slowly,
this may cause a keep alive timeout. The following trace illustrates the
issue, taking KATO = 8 and TBKAS off for simplicity:

1. t = 0: run nvme_keep_alive_work, send keep alive
2. t = ε: keep alive reaches controller, controller restarts its keep
          alive timer
3. t = 4: host receives keep alive completion, schedules
          nvme_keep_alive_work with delay 4
4. t = 8: run nvme_keep_alive_work, send keep alive

Here, a keep alive having RTT of 4 causes a delay of at least 8 - ε
between the controller receiving successive keep alives. With ε small,
the controller is likely to detect a keep alive timeout.

Fix this by calculating the RTT of the keep alive command, and adjusting
the scheduling delay of the next keep alive work accordingly.

Reported-by: Costa Sapuntzakis <costa@purestorage.com>
Reported-by: Randy Jennings <randyj@purestorage.com>
Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-05-30 20:57:42 -07:00
Uday Shankar
774a963651 nvme: check IO start time when deciding to defer KA
When a command completes, we set a flag which will skip sending a
keep alive at the next run of nvme_keep_alive_work when TBKAS is on.
However, if the command was submitted long ago, it's possible that
the controller may have also restarted its keep alive timer (as a
result of receiving the command) long ago. The following trace
demonstrates the issue, assuming TBKAS is on and KATO = 8 for
simplicity:

1. t = 0: submit I/O commands A, B, C, D, E
2. t = 0.5: commands A, B, C, D, E reach controller, restart its keep
            alive timer
3. t = 1: A completes
4. t = 2: run nvme_keep_alive_work, see recent completion, do nothing
5. t = 3: B completes
6. t = 4: run nvme_keep_alive_work, see recent completion, do nothing
7. t = 5: C completes
8. t = 6: run nvme_keep_alive_work, see recent completion, do nothing
9. t = 7: D completes
10. t = 8: run nvme_keep_alive_work, see recent completion, do nothing
11. t = 9: E completes

At this point, 8.5 seconds have passed without restarting the
controller's keep alive timer, so the controller will detect a keep
alive timeout.

Fix this by checking the IO start time when deciding to defer sending a
keep alive command. Only set comp_seen if the command started after the
most recent run of nvme_keep_alive_work. With this change, the
completions of B, C, and D will not set comp_seen and the run of
nvme_keep_alive_work at t = 4 will send a keep alive.

Reported-by: Costa Sapuntzakis <costa@purestorage.com>
Reported-by: Randy Jennings <randyj@purestorage.com>
Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-05-30 09:20:47 -07:00
Uday Shankar
ea4d453b9e nvme: double KA polling frequency to avoid KATO with TBKAS on
With TBKAS on, the completion of one command can defer sending a
keep alive for up to twice the delay between successive runs of
nvme_keep_alive_work. The current delay of KATO / 2 thus makes it
possible for one command to defer sending a keep alive for up to
KATO, which can result in the controller detecting a KATO. The following
trace demonstrates the issue, taking KATO = 8 for simplicity:

1. t = 0: run nvme_keep_alive_work, no keep-alive sent
2. t = ε: I/O completion seen, set comp_seen = true
3. t = 4: run nvme_keep_alive_work, see comp_seen == true,
          skip sending keep-alive, set comp_seen = false
4. t = 8: run nvme_keep_alive_work, see comp_seen == false,
          send a keep-alive command.

Here, there is a delay of 8 - ε between receiving a command completion
and sending the next command. With ε small, the controller is likely to
detect a keep alive timeout.

Fix this by running nvme_keep_alive_work with a delay of KATO / 4
whenever TBKAS is on. Going through the above trace now gives us a
worst-case delay of 4 - ε, which is in line with the recommendation of
sending a command every KATO / 2 in the NVMe specification.

Reported-by: Costa Sapuntzakis <costa@purestorage.com>
Reported-by: Randy Jennings <randyj@purestorage.com>
Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-05-30 09:20:44 -07:00
min15.li
31a5978243 nvme: fix miss command type check
In the function nvme_passthru_end(), only the value of the command
opcode is checked, without checking the command type (IO command or
Admin command). When we send a Dataset Management command (The opcode
of the Dataset Management command is the same as the Set Feature
command), kernel thinks it is a set feature command, then sets the
controller's keep alive interval, and calls nvme_keep_alive_work().

Signed-off-by: min15.li <min15.li@samsung.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-05-30 08:50:24 -07:00
Kees Cook
d67790ddf0 overflow: Add struct_size_t() helper
While struct_size() is normally used in situations where the structure
type already has a pointer instance, there are places where no variable
is available. In the past, this has been worked around by using a typed
NULL first argument, but this is a bit ugly. Add a helper to do this,
and replace the handful of instances of the code pattern with it.

Instances were found with this Coccinelle script:

@struct_size_t@
identifier STRUCT, MEMBER;
expression COUNT;
@@

-       struct_size((struct STRUCT *)\(0\|NULL\),
+       struct_size_t(struct STRUCT,
                MEMBER, COUNT)

Suggested-by: Christoph Hellwig <hch@infradead.org>
Cc: Jesse Brandeburg <jesse.brandeburg@intel.com>
Cc: Tony Nguyen <anthony.l.nguyen@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: HighPoint Linux Team <linux@highpoint-tech.com>
Cc: "James E.J. Bottomley" <jejb@linux.ibm.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Sumit Saxena <sumit.saxena@broadcom.com>
Cc: Shivasharan S <shivasharan.srikanteshwara@broadcom.com>
Cc: Don Brace <don.brace@microchip.com>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Guo Xuenan <guoxuenan@huawei.com>
Cc: Gwan-gyeong Mun <gwan-gyeong.mun@intel.com>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Daniel Latypov <dlatypov@google.com>
Cc: kernel test robot <lkp@intel.com>
Cc: intel-wired-lan@lists.osuosl.org
Cc: netdev@vger.kernel.org
Cc: linux-nvme@lists.infradead.org
Cc: linux-scsi@vger.kernel.org
Cc: megaraidlinux.pdl@broadcom.com
Cc: storagedev@microchip.com
Cc: linux-xfs@vger.kernel.org
Cc: linux-hardening@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://lore.kernel.org/r/20230522211810.never.421-kees@kernel.org
2023-05-26 13:52:19 -07:00
Tatsuki Sugiura
a3a9d63dcd NVMe: Add MAXIO 1602 to bogus nid list.
HIKSEMI FUTURE M.2 SSD uses the same dummy nguid and eui64.
I confirmed it with my two devices.

This patch marks the controller as NVME_QUIRK_BOGUS_NID.

---------------------------------------------------------
sugi@tempest:~% sudo nvme id-ctrl /dev/nvme0
NVME Identify Controller:
vid       : 0x1e4b
ssvid     : 0x1e4b
sn        : 30096022612
mn        : HS-SSD-FUTURE 2048G
fr        : SN10542
rab       : 0
ieee      : 000000
cmic      : 0
mdts      : 7
cntlid    : 0
ver       : 0x10400
rtd3r     : 0x7a120
rtd3e     : 0x1e8480
oaes      : 0x200
ctratt    : 0x2
rrls      : 0
cntrltype : 1
fguid     : 00000000-0000-0000-0000-000000000000
<snip...>
---------------------------------------------------------

---------------------------------------------------------
sugi@tempest:~% sudo nvme id-ns /dev/nvme0n1
NVME Identify Namespace 1:
<snip...>
nguid   : 00000000000000000000000000000000
eui64   : 0000000000000002
lbaf  0 : ms:0   lbads:9  rp:0 (in use)
---------------------------------------------------------

Signed-off-by: Tatsuki Sugiura <sugi@nemui.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-05-26 08:21:50 -07:00
Pavel Begunkov
f026be0e1e nvme: optimise io_uring passthrough completion
Use IOU_F_TWQ_LAZY_WAKE via iou_cmd_exec_in_task_lazy() for passthrough
commands completion. It further delays the execution of task_work for
DEFER_TASKRUN until there are enough of task_work items queued to meet
the waiting criteria, which reduces the number of wake ups we issue.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ecdfacd0967a22d88b7779e2efd09e040825d0f8.1684154817.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-25 08:54:06 -06:00
Martin K. Petersen
7907ad748b Merge patch series "Use block pr_ops in LIO"
Mike Christie <michael.christie@oracle.com> says:

The patches in this thread allow us to use the block pr_ops with LIO's
target_core_iblock module to support cluster applications in VMs. They
were built over Linus's tree. They also apply over linux-next and
Martin's tree and Jens's trees.

Currently, to use windows clustering or linux clustering (pacemaker +
cluster labs scsi fence agents) in VMs with LIO and vhost-scsi, you
have to use tcmu or pscsi or use a cluster aware FS/framework for the
LIO pr file. Setting up a cluster FS/framework is pain and waste when
your real backend device is already a distributed device, and pscsi
and tcmu are nice for specific use cases, but iblock gives you the
best performance and allows you to use stacked devices like
dm-multipath. So these patches allow iblock to work like pscsi/tcmu
where they can pass a PR command to the backend module. And then
iblock will use the pr_ops to pass the PR command to the real devices
similar to what we do for unmap today.

The patches are separated in the following groups:

Patch 1 - 2:

 - Add block layer callouts for reading reservations and rename reservation
   error code.

Patch 3 - 5:

 - SCSI support for new callouts.

Patch 6:

 - DM support for new callouts.

Patch 7 - 13:

 - NVMe support for new callouts.

Patch 14 - 18:

 - LIO support for new callouts.

This patchset has been tested with the libiscsi PGR ops and with
window's failover cluster verification test. Note that for scsi
backend devices we need this patchset:

https://lore.kernel.org/linux-scsi/20230123221046.125483-1-michael.christie@oracle.com/T/#m4834a643ffb5bac2529d65d40906d3cfbdd9b1b7

to handle UAs. To reduce the size of this patchset that's being done
separately to make reviewing easier. And to make merging easier this
patchset and the one above do not have any conflicts so can be merged
in different trees.

Link: https://lore.kernel.org/r/20230407200551.12660-1-michael.christie@oracle.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2023-05-22 16:35:02 -04:00
Jens Axboe
1878b736e8 Merge tag 'nvme-6.4-2023-05-18' of git://git.infradead.org/nvme into block-6.4
Pull NVMe fixes from Keith:

"nvme fixes for Linux 6.4

 - More device quirks (Sagi, Hristo, Adrian, Daniel)
 - Controller delete race (Maurizo)
 - Multipath cleanup fix (Christoph)"

* tag 'nvme-6.4-2023-05-18' of git://git.infradead.org/nvme:
  nvme-pci: Add quirk for Teamgroup MP33 SSD
  nvme: do not let the user delete a ctrl before a complete initialization
  nvme-multipath: don't call blk_mark_disk_dead in nvme_mpath_remove_disk
  nvme-pci: clamp max_hw_sectors based on DMA optimized limitation
  nvme-pci: add quirk for missing secondary temperature thresholds
  nvme-pci: add NVME_QUIRK_BOGUS_NID for HS-SSD-FUTURE 2048G
2023-05-18 19:46:42 -06:00
Daniel Smith
0649728123 nvme-pci: Add quirk for Teamgroup MP33 SSD
Add a quirk for Teamgroup MP33 that reports duplicate ids for disk.

Signed-off-by: Daniel Smith <dansmith@ds.gy>
[kch: patch formatting]
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Tested-by: Daniel Smith <dansmith@ds.gy>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-05-18 17:53:55 -07:00
Maurizio Lombardi
2eb94dd56a nvme: do not let the user delete a ctrl before a complete initialization
If a userspace application performes a "delete_controller" command
early during the ctrl initialization, the delete operation
may race against the init code and the kernel will crash.

nvme nvme5: Connect command failed: host path error
nvme nvme5: failed to connect queue: 0 ret=880
PF: supervisor write access in kernel mode
PF: error_code(0x0002) - not-present page
 blk_mq_quiesce_queue+0x18/0x90
 nvme_tcp_delete_ctrl+0x24/0x40 [nvme_tcp]
 nvme_do_delete_ctrl+0x7f/0x8b [nvme_core]
 nvme_sysfs_delete.cold+0x8/0xd [nvme_core]
 kernfs_fop_write_iter+0x124/0x1b0
 new_sync_write+0xff/0x190
 vfs_write+0x1ef/0x280

Fix the crash by checking the NVME_CTRL_STARTED_ONCE bit;
if it's not set it means that the nvme controller is still
in the process of getting initialized and the kernel
will return an -EBUSY error to userspace.
Set the NVME_CTRL_STARTED_ONCE later in the nvme_start_ctrl()
function, after the controller start operation is completed.

Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-05-17 07:36:17 -07:00
Christoph Hellwig
1743e5f600 nvme-multipath: don't call blk_mark_disk_dead in nvme_mpath_remove_disk
nvme_mpath_remove_disk is called after del_gendisk, at which point a
blk_mark_disk_dead call doesn't make any sense.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-05-17 07:32:42 -07:00
Linus Torvalds
03e5cb7b50 for-6.4/io_uring-2023-05-07
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmRXkp8QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpsH3EADZ4ex4vvc2/giYjRYTdyda00GIdYGiSZaZ
 OBW59OBYJfDjsS99Z2U4638shyqM3rA6BwzvWD6hPUTShD7PUddIKDaanpCYBs+x
 GhcDhzeJxARjsp/5qZPQ4r3Hdp+kWRnOFu1BSniCEdWt43J6AOm0l5duuNxl1db5
 Jz+BkSgF1pICAVB5YfSPDShyFiDfc0qcv3rHZo/rVRjwmX3x4x+OVr97zbZInnH4
 fm2vcv2e3Dc8uQEB0XFEyBWNeC8hdhTNe60t1y3MlILaPjHrQlzl5lkQFZ9dtLG6
 /FDK+bD6Ex8t6SQnnQGrLa9q4vQYA6zlT5u/wdvzAkakmrSxt6phf23U1TbyZSQ8
 clSg6MSuyJAW7YhiT3OqAsNMvBZ5g/nKHjLxfkAe86N9YIrMebZW9GDF+fWf0q0K
 em7O4XPehjkCXKclCInMv2bkv+/pncrd84D7Jt3OmCPljBL6oXfWF3vJJQ6FaY+t
 G3tNxl1hMHqwX8uaaSqyeuMYAEYizZa+ebMmHJwm/151d0EKylAOh3i6O1e4bpk3
 Kn5irF6pfTV1g1TcmY3fOwNgMffcHc9Y4E8bXxMJAN9WiHDuopgn2C9+MrJ+KRrx
 pLN7VD9UK/5yFzHSMBhyM8lm0/oTIhfTVgb+WyuHmJLcRhAnyNy9RrIBhSD0AoH+
 UMyOexZR8w==
 =l7Mo
 -----END PGP SIGNATURE-----

Merge tag 'for-6.4/io_uring-2023-05-07' of git://git.kernel.dk/linux

Pull more io_uring updates from Jens Axboe:
 "Nothing major in here, just two different parts:

   - A small series from Breno that enables passing the full SQE down
     for ->uring_cmd().

     This is a prerequisite for enabling full network socket operations.
     Queued up a bit late because of some stylistic concerns that got
     resolved, would be nice to have this in 6.4-rc1 so the dependent
     work will be easier to handle for 6.5.

   - Fix for the huge page coalescing, which was a regression introduced
     in the 6.3 kernel release (Tobias)"

* tag 'for-6.4/io_uring-2023-05-07' of git://git.kernel.dk/linux:
  io_uring: Remove unnecessary BUILD_BUG_ON
  io_uring: Pass whole sqe to commands
  io_uring: Create a helper to return the SQE size
  io_uring/rsrc: check for nonconsecutive pages
2023-05-07 10:00:09 -07:00
Breno Leitao
fd9b8547bc io_uring: Pass whole sqe to commands
Currently uring CMD operation relies on having large SQEs, but future
operations might want to use normal SQE.

The io_uring_cmd currently only saves the payload (cmd) part of the SQE,
but, for commands that use normal SQE size, it might be necessary to
access the initial SQE fields outside of the payload/cmd block.  So,
saves the whole SQE other than just the pdu.

This changes slightly how the io_uring_cmd works, since the cmd
structures and callbacks are not opaque to io_uring anymore. I.e, the
callbacks can look at the SQE entries, not only, in the cmd structure.

The main advantage is that we don't need to create custom structures for
simple commands.

Creates io_uring_sqe_cmd() that returns the cmd private data as a null
pointer and avoids casting in the callee side.
Also, make most of ublk_drv's sqe->cmd priv structure into const, and use
io_uring_sqe_cmd() to get the private structure, removing the unwanted
cast. (There is one case where the cast is still needed since the
header->{len,addr} is updated in the private structure)

Suggested-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20230504121856.904491-3-leitao@debian.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-04 08:19:05 -06:00
Adrian Huang
3710e2b056 nvme-pci: clamp max_hw_sectors based on DMA optimized limitation
When running the fio test on a 448-core AMD server + a NVME disk,
a soft lockup or a hard lockup call trace is shown:

[soft lockup]
watchdog: BUG: soft lockup - CPU#126 stuck for 23s! [swapper/126:0]
RIP: 0010:_raw_spin_unlock_irqrestore+0x21/0x50
...
Call Trace:
 <IRQ>
 fq_flush_timeout+0x7d/0xd0
 ? __pfx_fq_flush_timeout+0x10/0x10
 call_timer_fn+0x2e/0x150
 run_timer_softirq+0x48a/0x560
 ? __pfx_fq_flush_timeout+0x10/0x10
 ? clockevents_program_event+0xaf/0x130
 __do_softirq+0xf1/0x335
 irq_exit_rcu+0x9f/0xd0
 sysvec_apic_timer_interrupt+0xb4/0xd0
 </IRQ>
 <TASK>
 asm_sysvec_apic_timer_interrupt+0x1f/0x30
...

Obvisouly, fq_flush_timeout spends over 20 seconds. Here is ftrace log:

               |  fq_flush_timeout() {
               |    fq_ring_free() {
               |      put_pages_list() {
   0.170 us    |        free_unref_page_list();
   0.810 us    |      }
               |      free_iova_fast() {
               |        free_iova() {
 * 85622.66 us |          _raw_spin_lock_irqsave();
   2.860 us    |          remove_iova();
   0.600 us    |          _raw_spin_unlock_irqrestore();
   0.470 us    |          lock_info_report();
   2.420 us    |          free_iova_mem.part.0();
 * 85638.27 us |        }
 * 85638.84 us |      }
               |      put_pages_list() {
   0.230 us    |        free_unref_page_list();
   0.470 us    |      }
   ...            ...
 $ 31017069 us |  }

Most of cores are under lock contention for acquiring iova_rbtree_lock due
to the iova flush queue mechanism.

[hard lockup]
NMI watchdog: Watchdog detected hard LOCKUP on cpu 351
RIP: 0010:native_queued_spin_lock_slowpath+0x2d8/0x330

Call Trace:
 <IRQ>
 _raw_spin_lock_irqsave+0x4f/0x60
 free_iova+0x27/0xd0
 free_iova_fast+0x4d/0x1d0
 fq_ring_free+0x9b/0x150
 iommu_dma_free_iova+0xb4/0x2e0
 __iommu_dma_unmap+0x10b/0x140
 iommu_dma_unmap_sg+0x90/0x110
 dma_unmap_sg_attrs+0x4a/0x50
 nvme_unmap_data+0x5d/0x120 [nvme]
 nvme_pci_complete_batch+0x77/0xc0 [nvme]
 nvme_irq+0x2ee/0x350 [nvme]
 ? __pfx_nvme_pci_complete_batch+0x10/0x10 [nvme]
 __handle_irq_event_percpu+0x53/0x1a0
 handle_irq_event_percpu+0x19/0x60
 handle_irq_event+0x3d/0x60
 handle_edge_irq+0xb3/0x210
 __common_interrupt+0x7f/0x150
 common_interrupt+0xc5/0xf0
 </IRQ>
 <TASK>
 asm_common_interrupt+0x2b/0x40
...

ftrace shows fq_ring_free spends over 10 seconds [1]. Again, most of
cores are under lock contention for acquiring iova_rbtree_lock due
to the iova flush queue mechanism.

[Root Cause]
The root cause is that the max_hw_sectors_kb of nvme disk (mdts=10)
is 4096kb, which streaming DMA mappings cannot benefit from the
scalable IOVA mechanism introduced by the commit 9257b4a206
("iommu/iova: introduce per-cpu caching to iova allocation") if
the length is greater than 128kb.

To fix the lock contention issue, clamp max_hw_sectors based on
DMA optimized limitation in order to leverage scalable IOVA mechanism.

Note: The issue does not happen with another NVME disk (mdts = 5
and max_hw_sectors_kb = 128)

[1] https://gist.github.com/AdrianHuang/bf8ec7338204837631fbdaed25d19cc4

Suggested-by: Keith Busch <kbusch@kernel.org>
Reported-and-tested-by: Jiwei Sun <sunjw10@lenovo.com>
Signed-off-by: Adrian Huang <ahuang12@lenovo.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-05-03 18:15:44 +02:00
Hristo Venev
bd375feeaf nvme-pci: add quirk for missing secondary temperature thresholds
On Kingston KC3000 and Kingston FURY Renegade (both have the same PCI
IDs) accessing temp3_{min,max} fails with an invalid field error (note
that there is no problem setting the thresholds for temp1).

This contradicts the NVM Express Base Specification 2.0b, page 292:

  The over temperature threshold and under temperature threshold
  features shall be implemented for all implemented temperature sensors
  (i.e., all Temperature Sensor fields that report a non-zero value).

Define NVME_QUIRK_NO_SECONDARY_TEMP_THRESH that disables the thresholds
for all but the composite temperature and set it for this device.

Signed-off-by: Hristo Venev <hristo@venev.name>
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-05-03 18:11:43 +02:00
Sagi Grimberg
1616d6c371 nvme-pci: add NVME_QUIRK_BOGUS_NID for HS-SSD-FUTURE 2048G
Add a quirk to fix HS-SSD-FUTURE 2048G SSD drives reporting duplicate
nsids.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=217384
Reported-by: Andrey God <andreygod83@protonmail.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-05-03 18:10:01 +02:00
Linus Torvalds
556eb8b791 Driver core changes for 6.4-rc1
Here is the large set of driver core changes for 6.4-rc1.
 
 Once again, a busy development cycle, with lots of changes happening in
 the driver core in the quest to be able to move "struct bus" and "struct
 class" into read-only memory, a task now complete with these changes.
 
 This will make the future rust interactions with the driver core more
 "provably correct" as well as providing more obvious lifetime rules for
 all busses and classes in the kernel.
 
 The changes required for this did touch many individual classes and
 busses as many callbacks were changed to take const * parameters
 instead.  All of these changes have been submitted to the various
 subsystem maintainers, giving them plenty of time to review, and most of
 them actually did so.
 
 Other than those changes, included in here are a small set of other
 things:
   - kobject logging improvements
   - cacheinfo improvements and updates
   - obligatory fw_devlink updates and fixes
   - documentation updates
   - device property cleanups and const * changes
   - firwmare loader dependency fixes.
 
 All of these have been in linux-next for a while with no reported
 problems.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCZEp7Sw8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ykitQCfamUHpxGcKOAGuLXMotXNakTEsxgAoIquENm5
 LEGadNS38k5fs+73UaxV
 =7K4B
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core updates from Greg KH:
 "Here is the large set of driver core changes for 6.4-rc1.

  Once again, a busy development cycle, with lots of changes happening
  in the driver core in the quest to be able to move "struct bus" and
  "struct class" into read-only memory, a task now complete with these
  changes.

  This will make the future rust interactions with the driver core more
  "provably correct" as well as providing more obvious lifetime rules
  for all busses and classes in the kernel.

  The changes required for this did touch many individual classes and
  busses as many callbacks were changed to take const * parameters
  instead. All of these changes have been submitted to the various
  subsystem maintainers, giving them plenty of time to review, and most
  of them actually did so.

  Other than those changes, included in here are a small set of other
  things:

   - kobject logging improvements

   - cacheinfo improvements and updates

   - obligatory fw_devlink updates and fixes

   - documentation updates

   - device property cleanups and const * changes

   - firwmare loader dependency fixes.

  All of these have been in linux-next for a while with no reported
  problems"

* tag 'driver-core-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (120 commits)
  device property: make device_property functions take const device *
  driver core: update comments in device_rename()
  driver core: Don't require dynamic_debug for initcall_debug probe timing
  firmware_loader: rework crypto dependencies
  firmware_loader: Strip off \n from customized path
  zram: fix up permission for the hot_add sysfs file
  cacheinfo: Add use_arch[|_cache]_info field/function
  arch_topology: Remove early cacheinfo error message if -ENOENT
  cacheinfo: Check cache properties are present in DT
  cacheinfo: Check sib_leaf in cache_leaves_are_shared()
  cacheinfo: Allow early level detection when DT/ACPI info is missing/broken
  cacheinfo: Add arm64 early level initializer implementation
  cacheinfo: Add arch specific early level initializer
  tty: make tty_class a static const structure
  driver core: class: remove struct class_interface * from callbacks
  driver core: class: mark the struct class in struct class_interface constant
  driver core: class: make class_register() take a const *
  driver core: class: mark class_release() as taking a const *
  driver core: remove incorrect comment for device_create*
  MIPS: vpe-cmp: remove module owner pointer from struct class usage.
  ...
2023-04-27 11:53:57 -07:00
Linus Torvalds
9dd6956b38 for-6.4/block-2023-04-21
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmRCvcIQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpk+JEACj01t7Xen2+Razagu3aTx9tmRGFnTNR3MY
 raFG6B1TADk1TgCWWa2C4Dj67SOispPLm8hbIcOxqB1UscDWCCwjmnr/debADFzW
 Ap6shv/IRwVGmDp+F7ocYas0ynwooOJg4WJTwkSKz2o4m4p3vzlwAKi4fLiSjbXp
 gJTrA7WEvDOVjzajlTFUtjr8rc6PdunbGm25cPIufAxUEhvttYex2VbVqjDmfNsE
 8tyyk9RWbe4AY/ZYaGXVn4yQ/CgL/sXFkVc5noRXNfAQ/K3CVLQrFLJ3JlwUHpiA
 xXBor21TUWCZEo33Y2G5NConAYqE7etoPTkaTDO3/aZ+dAMFyhC/WAYLz1KZGMh1
 +g1fDX1QKEd40H2lfDXvqF1ob7Ut8EzUx+gvBXcc3/AiRpJ5rjfOcj6LPUMUqQJk
 nucLLFTiMKecnDMBERbvixqbaTyrjvkFEj2wYJvgj1LKXAd+x/bj8SGajs9r88Nb
 9YT9ai/+Yl7Ppfb67rCgXJU7oNZQSAQ2H+X/l2jbiqImOgq1u/45AmINnbanS7HH
 Y1I8pbH45AcnCgkJRoQwrNX3BnTOTBJ+D/4Fl4b8jsihq0D3UtwCwPCObHP4LW9S
 MUNPhP3tUuYsAgXqX80+Sao6SYvXDwnbWOM+LOaaZXgjb1ndwDUZXpto8Ra8WB1u
 8kM6s6ZR7g==
 =W1Zb
 -----END PGP SIGNATURE-----

Merge tag 'for-6.4/block-2023-04-21' of git://git.kernel.dk/linux

Pull block updates from Jens Axboe:

 - drbd patches, bringing us closer to unifying the out-of-tree version
   and the in tree one (Andreas, Christoph)

 - support for auto-quiesce for the s390 dasd driver (Stefan)

 - MD pull request via Song:
      - md/bitmap: Optimal last page size (Jon Derrick)
      - Various raid10 fixes (Yu Kuai, Li Nan)
      - md: add error_handlers for raid0 and linear (Mariusz Tkaczyk)

 - NVMe pull request via Christoph:
      - Drop redundant pci_enable_pcie_error_reporting (Bjorn Helgaas)
      - Validate nvmet module parameters (Chaitanya Kulkarni)
      - Fence TCP socket on receive error (Chris Leech)
      - Fix async event trace event (Keith Busch)
      - Minor cleanups (Chaitanya Kulkarni, zhenwei pi)
      - Fix and cleanup nvmet Identify handling (Damien Le Moal,
        Christoph Hellwig)
      - Fix double blk_mq_complete_request race in the timeout handler
        (Lei Yin)
      - Fix irq locking in nvme-fcloop (Ming Lei)
      - Remove queue mapping helper for rdma devices (Sagi Grimberg)

 - use structured request attribute checks for nbd (Jakub)

 - fix blk-crypto race conditions between keyslot management (Eric)

 - add sed-opal support for reading read locking range attributes
   (Ondrej)

 - make fault injection configurable for null_blk (Akinobu)

 - clean up the request insertion API (Christoph)

 - clean up the queue running API (Christoph)

 - blkg config helper cleanups (Tejun)

 - lazy init support for blk-iolatency (Tejun)

 - various fixes and tweaks to ublk (Ming)

 - remove hybrid polling. It hasn't really been useful since we got
   async polled IO support, and these days we don't support sync polled
   IO at all (Keith)

 - misc fixes, cleanups, improvements (Zhong, Ondrej, Colin, Chengming,
   Chaitanya, me)

* tag 'for-6.4/block-2023-04-21' of git://git.kernel.dk/linux: (118 commits)
  nbd: fix incomplete validation of ioctl arg
  ublk: don't return 0 in case of any failure
  sed-opal: geometry feature reporting command
  null_blk: Always check queue mode setting from configfs
  block: ublk: switch to ioctl command encoding
  blk-mq: fix the blk_mq_add_to_requeue_list call in blk_kick_flush
  block, bfq: Fix division by zero error on zero wsum
  fault-inject: fix build error when FAULT_INJECTION_CONFIGFS=y and CONFIGFS_FS=m
  block: store bdev->bd_disk->fops->submit_bio state in bdev
  block: re-arrange the struct block_device fields for better layout
  md/raid5: remove unused working_disks variable
  md/raid10: don't call bio_start_io_acct twice for bio which experienced read error
  md/raid10: fix memleak of md thread
  md/raid10: fix memleak for 'conf->bio_split'
  md/raid10: fix leak of 'r10bio->remaining' for recovery
  md/raid10: don't BUG_ON() in raise_barrier()
  md: fix soft lockup in status_resync
  md: add error_handlers for raid0 and linear
  md: Use optimal I/O size for last bitmap page
  md: Fix types in sb writer
  ...
2023-04-26 12:52:58 -07:00
Duy Truong
74391b3e69 nvme-pci: add NVME_QUIRK_BOGUS_NID for T-FORCE Z330 SSD
Added a quirk to fix the TeamGroup T-Force Cardea Zero Z330 SSDs reporting
duplicate NGUIDs.

Signed-off-by: Duy Truong <dory@dory.moe>
Cc: stable@vger.kernel.org
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-04-14 07:13:48 +02:00
Ming Lei
4f86a6ff6f nvme-fcloop: fix "inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage"
fcloop_fcp_op() could be called from flush request's ->end_io(flush_end_io) in
which the spinlock of fq->mq_flush_lock is grabbed with irq saved/disabled.

So fcloop_fcp_op() can't call spin_unlock_irq(&tfcp_req->reqlock) simply
which enables irq unconditionally.

Fixes the warning by switching to spin_lock_irqsave()/spin_unlock_irqrestore()

Fixes: c38dbbfab1 ("nvme-fcloop: fix inconsistent lock state warnings")
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Ewan D. Milne <emilne@redhat.com>
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-04-13 09:02:55 +02:00
Sagi Grimberg
edde9e70bb blk-mq-rdma: remove queue mapping helper for rdma devices
No rdma device exposes its irq vectors affinity today. So the only
mapping that we have left, is the default blk_mq_map_queues, which
we fallback to anyways. Also fixup the only consumer of this helper
(nvme-rdma).

Remove this now dead code.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Acked-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-04-13 08:59:05 +02:00
zhenwei pi
015ad2b1e4 nvme-rdma: minor cleanup in nvme_rdma_create_cq()
Before cleanup:
enum ib_poll_context poll_ctx;

if (nvme_rdma_poll_queue(queue)) {
        poll_ctx = IB_POLL_DIRECT;
        queue->ib_cq = ib_alloc_cq(ibdev, queue, queue->cq_size,
                                   comp_vector, poll_ctx);
} else {
        poll_ctx = IB_POLL_SOFTIRQ;
        queue->ib_cq = ib_cq_pool_get(ibdev, queue->cq_size,
                                      comp_vector, poll_ctx);
}

After cleanup:
if (nvme_rdma_poll_queue(queue))
        queue->ib_cq = ib_alloc_cq(ibdev, queue, queue->cq_size,
                                   comp_vector, IB_POLL_DIRECT);
else
        queue->ib_cq = ib_cq_pool_get(ibdev, queue->cq_size,
                                      comp_vector, IB_POLL_SOFTIRQ);

IB_POLL_SOFTIRQ/IB_POLL_SOFTIRQ gets used directly in function, this
seems more accessible.

Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-04-13 08:59:05 +02:00
Lei Yin
d4f1d5f7a4 nvme: fix double blk_mq_complete_request for timeout request with low probability
When nvme_cancel_tagset traverses all tagsets and executes
nvme_cancel_request, this request may be executing blk_mq_free_request
that is called by nvme_rdma_complete_timed_out/nvme_tcp_complete_timed_out.
When blk_mq_free_request executes to WRITE_ONCE(rq->state, MQ_RQ_IDLE) and
__blk_mq_free_request(rq), it will cause double blk_mq_complete_request for
this request, and it will cause a null pointer error in the second
execution of this function because rq->mq_hctx has set to NULL in first
execution.

Signed-off-by: Lei Yin <yinlei2@lenovo.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-04-13 08:59:04 +02:00
Keith Busch
6622b76fe9 nvme: fix async event trace event
Mixing AER Event Type and Event Info has masking clashes. Just print the
event type, but also include the event info of the AER result in the
trace.

Fixes: 09bd1ff4b1 ("nvme-core: add async event trace helper")
Reported-by: Nate Thornton <nate.thornton@samsung.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Minwoo Im <minwoo.im@samsung.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-04-13 08:59:04 +02:00
Chaitanya Kulkarni
cf806e3ab1 nvme-apple: return directly instead of else
There is no need for the else when direct return is used at the end of
the function.

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Eric Curtin <ecurtin@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-04-13 08:55:06 +02:00
Chaitanya Kulkarni
2ce525d40a nvme-apple: return directly instead of else
There is no need for the else when direct return is used at the end of
the function.

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Eric Curtin <ecurtin@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-04-13 08:55:06 +02:00
Chaitanya Kulkarni
6fe240bc0d nvmet-tcp: validate idle poll modparam value
The module parameter idle_poll_period_usecs is passed to the function
usecs_to_jiffies() which has following prototype and expect
idle_poll_period_usecs arg type to be unsigned int:-

unsigned long usecs_to_jiffies(const unsigned int u);

Use similar module parameter validation callback as previous patch.

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-04-13 08:55:05 +02:00
Chaitanya Kulkarni
44aef3b850 nvmet-tcp: validate so_priority modparam value
The module parameter so_priority is passed to the function
sock_set_priority() which has following prototype and expect
priotity arg type to be u32:-

void sock_set_priority(struct sock *sk, u32 priority);

Add a module parameter validation callback to reject any negative
values for the so_priority as it is defigned as int. Use this
oppurtunity to update the module parameter description and print the
default value.

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-04-13 08:55:05 +02:00
Chris Leech
aeacfcefa2 nvme-tcp: fence TCP socket on receive error
Ensure that no further socket reads occur after a receive processing
error, either from io_work being re-scheduled or nvme_tcp_poll.

Failing to do so can result in unrecognised PDU payloads or TCP stream
garbage being processed as a C2H data PDU, and potentially start copying
the payload to an invalid destination after looking up a request using a
bogus command id.

Signed-off-by: Chris Leech <cleech@redhat.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: John Meneghini <jmeneghi@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-04-13 08:55:05 +02:00
Christoph Hellwig
c5a9abfad9 nvmet: remove nvmet_req_cns_error_complete
Just fold it into the only caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2023-04-13 08:55:05 +02:00
Christoph Hellwig
9326353566 nvmet: rename nvmet_execute_identify_cns_cs_ns
nvmet_execute_identify_ns_zns is a more descriptive name for the
function handling the "I/O Command Set Specific Identify Namespace
Data Structure for the Zoned Namespace Command Set".

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
2023-04-13 08:55:04 +02:00
Christoph Hellwig
2f17f42c7f nvmet: fix Identify Identification Descriptor List handling
The Identification Descriptor List CNS value does not check the CSI
value, so remove the code trying to handle it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
2023-04-13 08:55:04 +02:00
Damien Le Moal
145f0dbb8a nvmet: cleanup nvmet_execute_identify()
Change the order of the cases in nvmet_execute_identify() main
switch-case to match the NVMe 2.0 specification order as defined in
table 273. This is also the increasing order of CNS values.

While at it, for clarity, make it explicit that identify with cns set
to NVME_ID_CNS_CS_NS does not support NVM command set specific data.

No functional changes are introduced by this cleanup.

Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Tested-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-04-13 08:55:04 +02:00
Damien Le Moal
a5a6ab0950 nvmet: fix I/O Command Set specific Identify Controller
For an identify command with cns set to NVME_ID_CNS_CS_CTRL, the NVMe
2.0 specification states that:

If the I/O Command Set specified by the CSI field does not have an
Identify Controller data structure, then the controller shall return
a zero filled data structure. If the host requests a data structure for
an I/O Command Set that the controller does not support, the controller
shall abort the command with a status code of Invalid Field in Command.

However, the current implementation of this identify command in
nvmet_execute_identify() only handles the ZNS command set, returning an
error for the NVM command set, which is not compliant with the
specifications as we do support this command set.

Fix this by:
1) Renaming nvmet_execute_identify_cns_cs_ctrl() to
   nvmet_execute_identify_ctrl_zns() to continue handling the
   ZNS command set as is.
2) Introduce a nvmet_execute_identify_ctrl_ns() helper to handle the
   NVM command set, returning a zero filled nvme_id_ctrl_nvm data
   structure.
3) Modify nvmet_execute_identify() to call these helpers based on
   the csi specified, returning an error for unsupported command sets.

Fixes: aaf2e048af ("nvmet: add ZBD over ZNS backend support")
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Tested-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-04-13 08:55:04 +02:00
Damien Le Moal
97416f67d5 nvmet: fix Identify Active Namespace ID list handling
The identify command with cns set to NVME_ID_CNS_NS_ACTIVE_LIST does
not depend on the command set. The execution of this command should
thus not look at the csi field specified in the command. Simplify
nvmet_execute_identify() to directly call
nvmet_execute_identify_nslist() without the csi switch-case.

Fixes: ab5d0b38c0 ("nvmet: add Command Set Identifier support")
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Tested-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-04-13 08:55:04 +02:00
Damien Le Moal
62904b3b33 nvmet: fix Identify Controller handling
The identify command with cns set to NVME_ID_CNS_CTRL does not depend on
the command set. The execution of this command should thus not look at
the csi specified in the command. Simplify nvmet_execute_identify() to
directly call nvmet_execute_identify_ctrl() without the csi switch-case.

Fixes: ab5d0b38c0 ("nvmet: add Command Set Identifier support")
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Tested-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-04-13 08:55:03 +02:00
Damien Le Moal
8c098aa001 nvmet: fix Identify Namespace handling
The identify command with cns set to NVME_ID_CNS_NS does not directly
depend on the command set. The NVMe specifications is rather confusing
here as it appears that this command only applies to the NVM command
set. However, footnote 8 of Figure 273 in the NVMe 2.0 base
specifications clearly state that this command applies to NVM command
sets that support logical blocks, that is, NVM and ZNS. Both the NVM and
ZNS command set specifications also list this identify as mandatory.

The command handling should thus not look at the csi field since it is
defined as unused for this command. Given that we do not support the
KV command set, simply remove the csi switch-case for that command
handling and call directly nvmet_execute_identify_ns() in
nvmet_execute_identify().

Fixes: ab5d0b38c0 ("nvmet: add Command Set Identifier support")
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Tested-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-04-13 08:55:03 +02:00
Damien Le Moal
ab76e7206b nvmet: fix error handling in nvmet_execute_identify_cns_cs_ns()
Nvme specifications state that:

If the I/O Command Set associated with the namespace identified by the
NSID field does not support the Identify Namespace data structure
specified by the CSI field, the controller shall abort the command with
a status code of Invalid Field in Command.

In other words, if nvmet_execute_identify_cns_cs_ns() is called for a
target with a block device that is not zoned, we should not return any
data and set the status to NVME_SC_INVALID_FIELD.

While at it, it is also better to revalidate the ns block devie *before*
checking if the block device is zoned, to ensure that
nvmet_execute_identify_cns_cs_ns() operates against updated device
characteristics.

Fixes: aaf2e048af ("nvmet: add ZBD over ZNS backend support")
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-04-13 08:55:03 +02:00
Bjorn Helgaas
1ad11eafc6 nvme-pci: drop redundant pci_enable_pcie_error_reporting()
pci_enable_pcie_error_reporting() enables the device to send ERR_*
Messages.  Since f26e58bf6f ("PCI/AER: Enable error reporting when AER is
native"), the PCI core does this for all devices during enumeration, so the
driver doesn't need to do it itself.

Remove the redundant pci_enable_pcie_error_reporting() call from the
driver.  Also remove the corresponding pci_disable_pcie_error_reporting()
from the driver .remove() path.

Note that this only controls ERR_* Messages from the device.  An ERR_*
Message may cause the Root Port to generate an interrupt, depending on the
AER Root Error Command register managed by the AER service driver.

Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-04-13 08:55:03 +02:00
Mike Christie
28c97ba38f nvme: Add pr_ops read_reservation support
This patch adds support for the pr_ops read_reservation callout by
calling the NVMe Reservation Report helper. It then parses that info to
detect if there is a reservation and if there is then convert the
returned info to a pr_ops pr_held_reservation struct.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Link: https://lore.kernel.org/r/20230407200551.12660-14-michael.christie@oracle.com
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2023-04-11 21:55:36 -04:00
Mike Christie
be1a7cd2d0 nvme: Add a nvme_pr_type enum
The next patch adds support to report the reservation type, so we need to
be able to convert from the NVMe PR value we get from the device to the
linux block layer PR value that will be returned to callers. To prepare
for that, this patch adds a nvme_pr_type enum and renames the nvme_pr_type
function.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Link: https://lore.kernel.org/r/20230407200551.12660-13-michael.christie@oracle.com
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2023-04-11 21:55:36 -04:00
Mike Christie
5fd96a4e15 nvme: Add pr_ops read_keys support
This patch adds support for the pr_ops read_keys callout by calling the
NVMe Reservation Report helper, then parsing that info to get the
controller's registered keys. Because the callout is only used in the
kernel where the callers, like LIO, do not know about controller/host IDs,
the callout just returns the registered keys which is required by the SCSI
PR in READ KEYS command.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Link: https://lore.kernel.org/r/20230407200551.12660-12-michael.christie@oracle.com
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2023-04-11 21:55:35 -04:00
Mike Christie
f0614790b7 nvme: Add helper to send pr command
Move the code that checks for multipath support and sends the pr command
to a new helper so it can be used by the reservation report support added
in the next patches.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Link: https://lore.kernel.org/r/20230407200551.12660-11-michael.christie@oracle.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2023-04-11 21:55:35 -04:00
Mike Christie
b668f2f546 nvme: Move pr code to it's own file
This patch moves the pr code to it's own file because I'm going to be
adding more functions and core.c is getting bigger.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Link: https://lore.kernel.org/r/20230407200551.12660-10-michael.christie@oracle.com
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2023-04-11 21:55:35 -04:00
Mike Christie
d45b446bd8 nvme: Don't hardcode the data len for pr commands
Reservation Report support needs to pass in a variable sized buffer, so
this patch has the pr command helpers take a data length argument.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Link: https://lore.kernel.org/r/20230407200551.12660-9-michael.christie@oracle.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2023-04-11 21:55:35 -04:00
Mike Christie
7ba150834b block: Rename BLK_STS_NEXUS to BLK_STS_RESV_CONFLICT
BLK_STS_NEXUS is used for NVMe/SCSI reservation conflicts and DASD's
locking feature which works similar to NVMe/SCSI reservations where a
host can get a lock on a device and when the lock is taken it will get
failures.

This patch renames BLK_STS_NEXUS so it better reflects this type of
use.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Link: https://lore.kernel.org/r/20230407200551.12660-3-michael.christie@oracle.com
Acked-by: Stefan Haberland <sth@linux.ibm.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2023-04-11 21:55:35 -04:00
Keith Busch
d3205ab75e nvme: fix discard support without oncs
The device can report discard support without setting the ONCS DSM bit.
When not set, the driver clears max_discard_size expecting it to be set
later. We don't know the size until we have the namespace format,
though, so setting it is deferred until configuring one, but the driver
was abandoning the discard settings due to that initial clearing.

Move the max_discard_size calculation above the check for a '0' discard
size.

Fixes: 1a86924e4f ("nvme: fix interpretation of DMRSL")
Reported-by: Laurence Oberman <loberman@redhat.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Niklas Cassel <niklas.cassel@wdc.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Tested-by: Laurence Oberman <loberman@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-04-05 17:13:17 +02:00
Greg Kroah-Hartman
cd8fe5b6db Merge 6.3-rc5 into driver-core-next
We need the fixes in here for testing, as well as the driver core
changes for documentation updates to build on.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-04-03 09:33:30 +02:00
Sagi Grimberg
88eaba8032 nvme-tcp: fix a possible UAF when failing to allocate an io queue
When we allocate a nvme-tcp queue, we set the data_ready callback before
we actually need to use it. This creates the potential that if a stray
controller sends us data on the socket before we connect, we can trigger
the io_work and start consuming the socket.

In this case reported: we failed to allocate one of the io queues, and
as we start releasing the queues that we already allocated, we get
a UAF [1] from the io_work which is running before it should really.

Fix this by setting the socket ops callbacks only before we start the
queue, so that we can't accidentally schedule the io_work in the
initialization phase before the queue started. While we are at it,
rename nvme_tcp_restore_sock_calls to pair with nvme_tcp_setup_sock_ops.

[1]:
[16802.107284] nvme nvme4: starting error recovery
[16802.109166] nvme nvme4: Reconnecting in 10 seconds...
[16812.173535] nvme nvme4: failed to connect socket: -111
[16812.173745] nvme nvme4: Failed reconnect attempt 1
[16812.173747] nvme nvme4: Reconnecting in 10 seconds...
[16822.413555] nvme nvme4: failed to connect socket: -111
[16822.413762] nvme nvme4: Failed reconnect attempt 2
[16822.413765] nvme nvme4: Reconnecting in 10 seconds...
[16832.661274] nvme nvme4: creating 32 I/O queues.
[16833.919887] BUG: kernel NULL pointer dereference, address: 0000000000000088
[16833.920068] nvme nvme4: Failed reconnect attempt 3
[16833.920094] #PF: supervisor write access in kernel mode
[16833.920261] nvme nvme4: Reconnecting in 10 seconds...
[16833.920368] #PF: error_code(0x0002) - not-present page
[16833.921086] Workqueue: nvme_tcp_wq nvme_tcp_io_work [nvme_tcp]
[16833.921191] RIP: 0010:_raw_spin_lock_bh+0x17/0x30
...
[16833.923138] Call Trace:
[16833.923271]  <TASK>
[16833.923402]  lock_sock_nested+0x1e/0x50
[16833.923545]  nvme_tcp_try_recv+0x40/0xa0 [nvme_tcp]
[16833.923685]  nvme_tcp_io_work+0x68/0xa0 [nvme_tcp]
[16833.923824]  process_one_work+0x1e8/0x390
[16833.923969]  worker_thread+0x53/0x3d0
[16833.924104]  ? process_one_work+0x390/0x390
[16833.924240]  kthread+0x124/0x150
[16833.924376]  ? set_kthread_struct+0x50/0x50
[16833.924518]  ret_from_fork+0x1f/0x30
[16833.924655]  </TASK>

Reported-by: Yanjun Zhang <zhangyanjun@cestc.cn>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Tested-by: Yanjun Zhang <zhangyanjun@cestc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-03-30 11:24:33 +09:00
Juraj Pecigos
1231363aec nvme-pci: mark Lexar NM760 as IGNORE_DEV_SUBNQN
A system with more than one of these SSDs will only have one usable.
The kernel fails to detect more than one nvme device due to duplicate
cntlids.

before:
[    9.395229] nvme 0000:01:00.0: platform quirk: setting simple suspend
[    9.395262] nvme nvme0: pci function 0000:01:00.0
[    9.395282] nvme 0000:03:00.0: platform quirk: setting simple suspend
[    9.395305] nvme nvme1: pci function 0000:03:00.0
[    9.409873] nvme nvme0: Duplicate cntlid 1 with nvme1, subsys nqn.2022-07.com.siliconmotion:nvm-subsystem-sn-                    , rejecting
[    9.409982] nvme nvme0: Removing after probe failure status: -22
[    9.427487] nvme nvme1: allocated 64 MiB host memory buffer.
[    9.445088] nvme nvme1: 16/0/0 default/read/poll queues
[    9.449898] nvme nvme1: Ignoring bogus Namespace Identifiers

after:
[    1.161890] nvme 0000:01:00.0: platform quirk: setting simple suspend
[    1.162660] nvme nvme0: pci function 0000:01:00.0
[    1.162684] nvme 0000:03:00.0: platform quirk: setting simple suspend
[    1.162707] nvme nvme1: pci function 0000:03:00.0
[    1.191354] nvme nvme0: allocated 64 MiB host memory buffer.
[    1.193378] nvme nvme1: allocated 64 MiB host memory buffer.
[    1.211044] nvme nvme1: 16/0/0 default/read/poll queues
[    1.211080] nvme nvme0: 16/0/0 default/read/poll queues
[    1.216145] nvme nvme0: Ignoring bogus Namespace Identifiers
[    1.216261] nvme nvme1: Ignoring bogus Namespace Identifiers

Adding the NVME_QUIRK_IGNORE_DEV_SUBNQN quirk to resolves the issue.

Signed-off-by: Juraj Pecigos <kernel@juraj.dev>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-03-28 10:09:11 +09:00
Martin George
def84ab600 nvme: send Identify with CNS 06h only to I/O controllers
Identify CNS 06h (I/O Command Set Specific Identify Controller data
structure) is supported only on i/o controllers.

But nvme_init_non_mdts_limits() currently invokes this on all
controllers.  Correct this by ensuring this is sent to I/O
controllers only.

Signed-off-by: Martin George <marting@netapp.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-03-22 09:17:52 +01:00
Jens Axboe
9d2789ac9d block/io_uring: pass in issue_flags for uring_cmd task_work handling
io_uring_cmd_done() currently assumes that the uring_lock is held
when invoked, and while it generally is, this is not guaranteed.
Pass in the issue_flags associated with it, so that we have
IO_URING_F_UNLOCKED available to be able to lock the CQ ring
appropriately when completing events.

Cc: stable@vger.kernel.org
Fixes: ee692a21e9 ("fs,io_uring: add infrastructure for uring-cmd")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-03-20 20:01:25 -06:00
Greg Kroah-Hartman
1aaba11da9 driver core: class: remove module * from class_create()
The module pointer in class_create() never actually did anything, and it
shouldn't have been requred to be set as a parameter even if it did
something.  So just remove it and fix up all callers of the function in
the kernel tree at the same time.

Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Acked-by: Benjamin Tissoires <benjamin.tissoires@redhat.com>
Link: https://lore.kernel.org/r/20230313181843.1207845-4-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-03-17 15:16:33 +01:00
Greg Kroah-Hartman
10a03c36b7 drivers: remove struct module * setting from struct class
There is no need to manually set the owner of a struct class, as the
registering function does it automatically, so remove all of the
explicit settings from various drivers that did so as it is unneeded.

This allows us to remove this pointer entirely from this structure going
forward.

Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Link: https://lore.kernel.org/r/20230313181843.1207845-2-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-03-17 15:16:27 +01:00
Jens Axboe
890a2fb06e nvme fixes for Linux 6.3
- avoid potential UAF in nvmet_req_complete (Damien Le Moal)
  - more quirks (Elmer Miroslav Mosher Golovin, Philipp Geulen)
  - fix a memory leak in the nvme-pci probe teardown path (Irvin Cote)
  - repair the MAINTAINERS entry (Lukas Bulwahn)
  - fix handling single range discard request (Ming Lei)
  - show more opcode names in trace events (Minwoo Im)
  - fix nvme-tcp timeout reporting (Sagi Grimberg)
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAmQSy9ILHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYMCqQ/+PcaiHojkyA6B9HBNtVcLemupc2GSc8lozTIqALVT
 EuTJ2AK0ox8F8MYU0ELlpkCGs9Wil135xqLURLDwLi7sKFj7S1IkaSvG+1okDDHK
 hJy4U+bpu1gkyBJkB2v/PrX49lmqplFGD/2eLRFedWbZ4rF82hQRrPHYcDpRL2VF
 0wmVd7lVaUQ9mOngLy1zrBFjt+rOv8d96zwfqF9m67G9rqKJLQhM45M/TiKaeUuy
 3WbsVmZ715rbGv1y7YBObHfdKvs8LfNjdOdJeLNGaNOxuc2y1yOzSbaww2wWW+Wf
 9mB6unjn76xW1UUzPtQs2OALZaa9r0w55IQ5A4V+ugJoDEIdxsyHSUMGDnAc1dTW
 MQGNGhQyGBDD9O9XSvexOXRI+LRxyM3dXfA1dpPUW1BWZm4ACumY26cMasJEJnCr
 YJtIM3SW/Sp3cRnzcN/fxKCfsDUeYTj0mFu5KjamLd1ux5w4pxJyoS9sWxh63qJz
 HBnr6VFczYhf63cUoVgX0qZOKv4jwYaVolVcmuNVWWdFhzr8OC7cb+AztfVdo8M2
 UbIkWxFH7f/obVd2L3N5c8+tKLaoi5R/MzHTlrov2oTP2e46sFuF6cyhtPM4zO//
 lwjkIJ1JSAjvFHxEg44wBHjYIB5y0JCjkIaBLuOiVILQYkxYrV8XYzRaqjjNdk0D
 YSo=
 =GT+2
 -----END PGP SIGNATURE-----

Merge tag 'nvme-6.3-2022-03-16' of git://git.infradead.org/nvme into block-6.3

Pull NVMe fixes from Christoph:

"nvme fixes for Linux 6.3

 - avoid potential UAF in nvmet_req_complete (Damien Le Moal)
 - more quirks (Elmer Miroslav Mosher Golovin, Philipp Geulen)
 - fix a memory leak in the nvme-pci probe teardown path (Irvin Cote)
 - repair the MAINTAINERS entry (Lukas Bulwahn)
 - fix handling single range discard request (Ming Lei)
 - show more opcode names in trace events (Minwoo Im)
 - fix nvme-tcp timeout reporting (Sagi Grimberg)"

* tag 'nvme-6.3-2022-03-16' of git://git.infradead.org/nvme:
  nvmet: avoid potential UAF in nvmet_req_complete()
  nvme-trace: show more opcode names
  nvme-tcp: add nvme-tcp pdu size build protection
  nvme-tcp: fix opcode reporting in the timeout handler
  nvme-pci: add NVME_QUIRK_BOGUS_NID for Lexar NM620
  nvme-pci: add NVME_QUIRK_BOGUS_NID for Netac NV3000
  nvme-pci: fixing memory leak in probe teardown path
  nvme: fix handling single range discard request
  MAINTAINERS: repair malformed T: entries in NVM EXPRESS DRIVERS
2023-03-16 07:01:48 -06:00
Yu Kuai
5f27571382 block: count 'ios' and 'sectors' when io is done for bio-based device
While using iostat for raid, I observed very strange 'await'
occasionally, and turns out it's due to that 'ios' and 'sectors' is
counted in bdev_start_io_acct(), while 'nsecs' is counted in
bdev_end_io_acct(). I'm not sure why they are ccounted like that
but I think this behaviour is obviously wrong because user will get
wrong disk stats.

Fix the problem by counting 'ios' and 'sectors' when io is done, like
what rq-based device does.

Fixes: 394ffa503b ("blk: introduce generic io stat accounting help function")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230223091226.1135678-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-03-15 09:25:04 -06:00
Damien Le Moal
6173a77b7e nvmet: avoid potential UAF in nvmet_req_complete()
An nvme target ->queue_response() operation implementation may free the
request passed as argument. Such implementation potentially could result
in a use after free of the request pointer when percpu_ref_put() is
called in nvmet_req_complete().

Avoid such problem by using a local variable to save the sq pointer
before calling __nvmet_req_complete(), thus avoiding dereferencing the
req pointer after that function call.

Fixes: a07b4970f4 ("nvmet: add a generic NVMe target")
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-03-15 14:58:53 +01:00
Sagi Grimberg
7e87965d38 nvme-tcp: add nvme-tcp pdu size build protection
Make sure that we don't somehow mess up the wire structures in the spec.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kkch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-03-15 14:58:52 +01:00
Sagi Grimberg
a3406352c5 nvme-tcp: fix opcode reporting in the timeout handler
For non in-capsule writes we reuse the request pdu space for a h2cdata
pdu in order to avoid over allocating space (either preallocate or
dynamically upon receving an r2t pdu). However if the request times out
the core expects to find the opcode in the start of the request, which
we override.

In order to prevent that, without sacrificing additional 24 bytes per
request, we just use the tail of the command pdu space instead (last
24 bytes from the 72 bytes command pdu). That should make the command
opcode always available, and we get away from allocating more space.

If in the future we would need the last 24 bytes of the nvme command
available we would need to allocate a dedicated space for it in the
request, but until then we can avoid doing so.

Reported-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kkch@nvidia.com>
Tested-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-03-15 14:58:52 +01:00
Philipp Geulen
b65d44fa0f nvme-pci: add NVME_QUIRK_BOGUS_NID for Lexar NM620
Added a quirk to fix Lexar NM620 1TB SSD reporting duplicate NGUIDs.

Signed-off-by: Philipp Geulen <p.geulen@js-elektronik.de>
Reviewed-by: Chaitanya Kulkarni <kkch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-03-15 14:58:52 +01:00
Elmer Miroslav Mosher Golovin
9630d80655 nvme-pci: add NVME_QUIRK_BOGUS_NID for Netac NV3000
Added a quirk to fix the Netac NV3000 SSD reporting duplicate NGUIDs.

Cc: <stable@vger.kernel.org>
Signed-off-by: Elmer Miroslav Mosher Golovin <miroslav@mishamosher.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-03-15 14:58:51 +01:00
Irvin Cote
a61d265533 nvme-pci: fixing memory leak in probe teardown path
In case the nvme_probe teardown path is triggered the ctrl ref count does
not reach 0 thus creating a memory leak upon failure of nvme_probe.

Signed-off-by: Irvin Cote <irvincoteg@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-03-15 14:58:51 +01:00
Ming Lei
37f0dc2ec7 nvme: fix handling single range discard request
When investigating one customer report on warning in nvme_setup_discard,
we observed the controller(nvme/tcp) actually exposes
queue_max_discard_segments(req->q) == 1.

Obviously the current code can't handle this situation, since contiguity
merge like normal RW request is taken.

Fix the issue by building range from request sector/nr_sectors directly.

Fixes: b35ba01ea6 ("nvme: support ranged discard requests")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-03-15 14:58:51 +01:00
Linus Torvalds
9d0281b56b block-6.3-2023-03-03
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmQB57MQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgputpEADVrc1OFzHOivJq+LJ3HS3ufhLBthtgu1Lp
 sEHvDNp9tBGXMLkomuCYpAju5TBAEKC+AJTZyj9iS1j++ItoezdoP55YRIH7t2Or
 UTy8ex3rLPGkQk6k3o8roWCyajTW/ZS+4fmk+NkVYMLsQBp9I+kFbxgJa5bbREdU
 Z8b/9hcBGz58R8Kq+TEMp/bO7oCV4c8xWumrKER+MktDDx0kc5d+afWXoy7bEKFg
 jLB3gleTM9HUpa9a2GPc4fxqdb0KanQdMtiyn/oplg0JcZLMiHfRbiRnsgQkjN0O
 RVtUcdxXmOkQeFra4GXPiHmQBcIfE85wP4wxb8p/F2StYRhb1epzzeCXOhuNZvv4
 dd6OSARgtzWt3OlHka4aC63H4kzs9SxJp0F2uwuPLV0fM91TP1oOTWV+53FrQr9Z
 OQYyB8d9Il4K72NFLwU4ukJ1fPoCRHjpgAXIIkasEjaBftpJlMNnfblncTZTBumy
 XumFVdKfvqc3OFt8LLKWqLDV0j3TknVeCMPKhsbRwQ0NG4vlNOSWaLkGJCDLJ7ga
 ebf8AD5eaLCT9qyYquBuW5VBKZH5Z4rf5yHta9Dx+Omu0JTQYtTkiiM3UTdpDbtq
 SObZ31UvLoYK2dOZcVgjhE2RgM/AV5jJcx7aHhT3UptavAehHbePgiNhuEEntlKv
 L87kXJkSSQ==
 =ezrg
 -----END PGP SIGNATURE-----

Merge tag 'block-6.3-2023-03-03' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:

 - NVMe pull request via Christoph:
      - Don't access released socket during error recovery (Akinobu
        Mita)
      - Bring back auto-removal of deleted namespaces during sequential
        scan (Christoph Hellwig)
      - Fix an error code in nvme_auth_process_dhchap_challenge (Dan
        Carpenter)
      - Show well known discovery name (Daniel Wagner)
      - Add a missing endianess conversion in effects masking (Keith
        Busch)

 - Fix for a regression introduced in blk-rq-qos during init in this
   merge window (Breno)

 - Reorder a few fields in struct blk_mq_tag_set, eliminating a few
   holes and shrinking it (Christophe)

 - Remove redundant bdev_get_queue() NULL checks (Juhyung)

 - Add sed-opal single user mode support flag (Luca)

 - Remove SQE128 check in ublk as it isn't needed, saving some memory
   (Ming)

 - Op specific segment checking for cloned requests (Uday)

 - Exclusive open partition scan fixes (Yu)

 - Loop offset/size checking before assigning them in the device (Zhong)

 - Bio polling fixes (me)

* tag 'block-6.3-2023-03-03' of git://git.kernel.dk/linux:
  blk-mq: enforce op-specific segment limits in blk_insert_cloned_request
  nvme-fabrics: show well known discovery name
  nvme-tcp: don't access released socket during error recovery
  nvme-auth: fix an error code in nvme_auth_process_dhchap_challenge()
  nvme: bring back auto-removal of deleted namespaces during sequential scan
  blk-iocost: Pass gendisk to ioc_refresh_params
  nvme: fix sparse warning on effects masking
  block: be a bit more careful in checking for NULL bdev while polling
  block: clear bio->bi_bdev when putting a bio back in the cache
  loop: loop_set_status_from_info() check before assignment
  ublk: remove check IO_URING_F_SQE128 in ublk_ch_uring_cmd
  block: remove more NULL checks after bdev_get_queue()
  blk-mq: Reorder fields in 'struct blk_mq_tag_set'
  block: fix scan partition for exclusively open device again
  block: Revert "block: Do not reread partition table on exclusively open device"
  sed-opal: add support flag for SUM in status ioctl
2023-03-03 10:21:39 -08:00
Daniel Wagner
26a57cb355 nvme-fabrics: show well known discovery name
The kernel always logs the unique subsystem name for a discovery
controller, even in the case user space asked for the well known.

This has lead to confusion as the logs of nvme-cli and the kernel
logs didn't match.

First, nvme-cli connects to the well known discovery controller to
figure out if it supports TP8013. If so then nvme-cli disconnects and
connects to the unique discovery controller. Currently, the kernel show
that user space connected twice to the unique one.

To avoid further confusion, show the well known discovery controller if
user space asked for it:

  $ nvme connect-all -v -t tcp -a 192.168.0.1
  nvme0: nqn.2014-08.org.nvmexpress.discovery connected
  nvme0: nqn.2014-08.org.nvmexpress.discovery disconnected
  nvme0: nqn.discovery connected

  kernel log:
  nvme nvme0: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", addr 192.168.0.1:8009
  nvme nvme0: Removing ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"
  nvme nvme0: new ctrl: NQN "nqn.discovery", addr 192.168.0.1:8009

Fixes: e5ea42faa7 ("nvme: display correct subsystem NQN")
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-02-28 06:14:44 -07:00
Akinobu Mita
76d54bf20c nvme-tcp: don't access released socket during error recovery
While the error recovery work is temporarily failing reconnect attempts,
running the 'nvme list' command causes a kernel NULL pointer dereference
by calling getsockname() with a released socket.

During error recovery work, the nvme tcp socket is released and a new one
created, so it is not safe to access the socket without proper check.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Fixes: 02c57a82c0 ("nvme-tcp: print actual source IP address through sysfs "address" attr")
Reviewed-by: Martin Belanger <martin.belanger@dell.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-02-28 06:14:44 -07:00
Dan Carpenter
51d24f701f nvme-auth: fix an error code in nvme_auth_process_dhchap_challenge()
This function was transitioned from returning NVMe status codes to
returning traditional kernel error codes.  However, this particular
return now accidentally returns positive error codes like ENOMEM instead
of negative -ENOMEM.

Fixes: b0ef1b11d3 ("nvme-auth: don't use NVMe status codes")
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-02-28 06:14:44 -07:00
Christoph Hellwig
0dd6fff2aa nvme: bring back auto-removal of deleted namespaces during sequential scan
Bring back the check of the Identify Namespace return value for the
legacy NVMe 1.0-style sequential scanning.  While NVMe 1.0 does not
support namespace management, there are "modern" cloud solutions like
Google Cloud Platform that claim the obsolete 1.0 compliance for no
good reason while supporting proprietary sideband namespace management.

Fixes: 1a893c2bfe ("nvme: refactor namespace probing")
Reported-by: Nils Hanke <nh@edgeless.systems>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Tested-by: Nils Hanke <nh@edgeless.systems>
2023-02-28 06:14:29 -07:00
Keith Busch
c0c33b94cf nvme: fix sparse warning on effects masking
The log entries are stored in le32, so use appropriate byte swapping
macros.

Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/oe-kbuild-all/202302242222.PevBhzvC-lkp@intel.com/
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-02-27 06:56:04 -07:00
Linus Torvalds
5b7c4cabbb Networking changes for 6.3.
Core
 ----
 
  - Add dedicated kmem_cache for typical/small skb->head, avoid having
    to access struct page at kfree time, and improve memory use.
 
  - Introduce sysctl to set default RPS configuration for new netdevs.
 
  - Define Netlink protocol specification format which can be used
    to describe messages used by each family and auto-generate parsers.
    Add tools for generating kernel data structures and uAPI headers.
 
  - Expose all net/core sysctls inside netns.
 
  - Remove 4s sleep in netpoll if carrier is instantly detected on boot.
 
  - Add configurable limit of MDB entries per port, and port-vlan.
 
  - Continue populating drop reasons throughout the stack.
 
  - Retire a handful of legacy Qdiscs and classifiers.
 
 Protocols
 ---------
 
  - Support IPv4 big TCP (TSO frames larger than 64kB).
 
  - Add IP_LOCAL_PORT_RANGE socket option, to control local port range
    on socket by socket basis.
 
  - Track and report in procfs number of MPTCP sockets used.
 
  - Support mixing IPv4 and IPv6 flows in the in-kernel MPTCP
    path manager.
 
  - IPv6: don't check net.ipv6.route.max_size and rely on garbage
    collection to free memory (similarly to IPv4).
 
  - Support Penultimate Segment Pop (PSP) flavor in SRv6 (RFC8986).
 
  - ICMP: add per-rate limit counters.
 
  - Add support for user scanning requests in ieee802154.
 
  - Remove static WEP support.
 
  - Support minimal Wi-Fi 7 Extremely High Throughput (EHT) rate
    reporting.
 
  - WiFi 7 EHT channel puncturing support (client & AP).
 
 BPF
 ---
 
  - Add a rbtree data structure following the "next-gen data structure"
    precedent set by recently added linked list, that is, by using
    kfunc + kptr instead of adding a new BPF map type.
 
  - Expose XDP hints via kfuncs with initial support for RX hash and
    timestamp metadata.
 
  - Add BPF_F_NO_TUNNEL_KEY extension to bpf_skb_set_tunnel_key
    to better support decap on GRE tunnel devices not operating
    in collect metadata.
 
  - Improve x86 JIT's codegen for PROBE_MEM runtime error checks.
 
  - Remove the need for trace_printk_lock for bpf_trace_printk
    and bpf_trace_vprintk helpers.
 
  - Extend libbpf's bpf_tracing.h support for tracing arguments of
    kprobes/uprobes and syscall as a special case.
 
  - Significantly reduce the search time for module symbols
    by livepatch and BPF.
 
  - Enable cpumasks to be used as kptrs, which is useful for tracing
    programs tracking which tasks end up running on which CPUs in
    different time intervals.
 
  - Add support for BPF trampoline on s390x and riscv64.
 
  - Add capability to export the XDP features supported by the NIC.
 
  - Add __bpf_kfunc tag for marking kernel functions as kfuncs.
 
  - Add cgroup.memory=nobpf kernel parameter option to disable BPF
    memory accounting for container environments.
 
 Netfilter
 ---------
 
  - Remove the CLUSTERIP target. It has been marked as obsolete
    for years, and we still have WARN splats wrt. races of
    the out-of-band /proc interface installed by this target.
 
  - Add 'destroy' commands to nf_tables. They are identical to
    the existing 'delete' commands, but do not return an error if
    the referenced object (set, chain, rule...) did not exist.
 
 Driver API
 ----------
 
  - Improve cpumask_local_spread() locality to help NICs set the right
    IRQ affinity on AMD platforms.
 
  - Separate C22 and C45 MDIO bus transactions more clearly.
 
  - Introduce new DCB table to control DSCP rewrite on egress.
 
  - Support configuration of Physical Layer Collision Avoidance (PLCA)
    Reconciliation Sublayer (RS) (802.3cg-2019). Modern version of
    shared medium Ethernet.
 
  - Support for MAC Merge layer (IEEE 802.3-2018 clause 99). Allowing
    preemption of low priority frames by high priority frames.
 
  - Add support for controlling MACSec offload using netlink SET.
 
  - Rework devlink instance refcounts to allow registration and
    de-registration under the instance lock. Split the code into multiple
    files, drop some of the unnecessarily granular locks and factor out
    common parts of netlink operation handling.
 
  - Add TX frame aggregation parameters (for USB drivers).
 
  - Add a new attr TCA_EXT_WARN_MSG to report TC (offload) warning
    messages with notifications for debug.
 
  - Allow offloading of UDP NEW connections via act_ct.
 
  - Add support for per action HW stats in TC.
 
  - Support hardware miss to TC action (continue processing in SW from
    a specific point in the action chain).
 
  - Warn if old Wireless Extension user space interface is used with
    modern cfg80211/mac80211 drivers. Do not support Wireless Extensions
    for Wi-Fi 7 devices at all. Everyone should switch to using nl80211
    interface instead.
 
  - Improve the CAN bit timing configuration. Use extack to return error
    messages directly to user space, update the SJW handling, including
    the definition of a new default value that will benefit CAN-FD
    controllers, by increasing their oscillator tolerance.
 
 New hardware / drivers
 ----------------------
 
  - Ethernet:
    - nVidia BlueField-3 support (control traffic driver)
    - Ethernet support for imx93 SoCs
    - Motorcomm yt8531 gigabit Ethernet PHY
    - onsemi NCN26000 10BASE-T1S PHY (with support for PLCA)
    - Microchip LAN8841 PHY (incl. cable diagnostics and PTP)
    - Amlogic gxl MDIO mux
 
  - WiFi:
    - RealTek RTL8188EU (rtl8xxxu)
    - Qualcomm Wi-Fi 7 devices (ath12k)
 
  - CAN:
    - Renesas R-Car V4H
 
 Drivers
 -------
 
  - Bluetooth:
    - Set Per Platform Antenna Gain (PPAG) for Intel controllers.
 
  - Ethernet NICs:
    - Intel (1G, igc):
      - support TSN / Qbv / packet scheduling features of i226 model
    - Intel (100G, ice):
      - use GNSS subsystem instead of TTY
      - multi-buffer XDP support
      - extend support for GPIO pins to E823 devices
    - nVidia/Mellanox:
      - update the shared buffer configuration on PFC commands
      - implement PTP adjphase function for HW offset control
      - TC support for Geneve and GRE with VF tunnel offload
      - more efficient crypto key management method
      - multi-port eswitch support
    - Netronome/Corigine:
      - add DCB IEEE support
      - support IPsec offloading for NFP3800
    - Freescale/NXP (enetc):
      - enetc: support XDP_REDIRECT for XDP non-linear buffers
      - enetc: improve reconfig, avoid link flap and waiting for idle
      - enetc: support MAC Merge layer
    - Other NICs:
      - sfc/ef100: add basic devlink support for ef100
      - ionic: rx_push mode operation (writing descriptors via MMIO)
      - bnxt: use the auxiliary bus abstraction for RDMA
      - r8169: disable ASPM and reset bus in case of tx timeout
      - cpsw: support QSGMII mode for J721e CPSW9G
      - cpts: support pulse-per-second output
      - ngbe: add an mdio bus driver
      - usbnet: optimize usbnet_bh() by avoiding unnecessary queuing
      - r8152: handle devices with FW with NCM support
      - amd-xgbe: support 10Mbps, 2.5GbE speeds and rx-adaptation
      - virtio-net: support multi buffer XDP
      - virtio/vsock: replace virtio_vsock_pkt with sk_buff
      - tsnep: XDP support
 
  - Ethernet high-speed switches:
    - nVidia/Mellanox (mlxsw):
      - add support for latency TLV (in FW control messages)
    - Microchip (sparx5):
      - separate explicit and implicit traffic forwarding rules, make
        the implicit rules always active
      - add support for egress DSCP rewrite
      - IS0 VCAP support (Ingress Classification)
      - IS2 VCAP filters (protos, L3 addrs, L4 ports, flags, ToS etc.)
      - ES2 VCAP support (Egress Access Control)
      - support for Per-Stream Filtering and Policing (802.1Q, 8.6.5.1)
 
  - Ethernet embedded switches:
    - Marvell (mv88e6xxx):
      - add MAB (port auth) offload support
      - enable PTP receive for mv88e6390
    - NXP (ocelot):
      - support MAC Merge layer
      - support for the the vsc7512 internal copper phys
    - Microchip:
      - lan9303: convert to PHYLINK
      - lan966x: support TC flower filter statistics
      - lan937x: PTP support for KSZ9563/KSZ8563 and LAN937x
      - lan937x: support Credit Based Shaper configuration
      - ksz9477: support Energy Efficient Ethernet
    - other:
      - qca8k: convert to regmap read/write API, use bulk operations
      - rswitch: Improve TX timestamp accuracy
 
  - Intel WiFi (iwlwifi):
    - EHT (Wi-Fi 7) rate reporting
    - STEP equalizer support: transfer some STEP (connection to radio
      on platforms with integrated wifi) related parameters from the
      BIOS to the firmware.
 
  - Qualcomm 802.11ax WiFi (ath11k):
    - IPQ5018 support
    - Fine Timing Measurement (FTM) responder role support
    - channel 177 support
 
  - MediaTek WiFi (mt76):
    - per-PHY LED support
    - mt7996: EHT (Wi-Fi 7) support
    - Wireless Ethernet Dispatch (WED) reset support
    - switch to using page pool allocator
 
  - RealTek WiFi (rtw89):
    - support new version of Bluetooth co-existance
 
  - Mobile:
    - rmnet: support TX aggregation.
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmP1VIYACgkQMUZtbf5S
 IrvsChAApz0rNL/sPKxXTEfxZ1tN7D3sYxYKQPomxvl5BV+MvicrLddJy3KmzEFK
 nnJNO3nuRNuH422JQ/ylZ4mGX1opa6+5QJb0UINImXUI7Fm8HHBIuPGkv7d5CheZ
 7JexFqjPJXUy9nPyh1Rra+IA9AcRd2U7jeGEZR38wb99bHJQj5Bzdk20WArEB0el
 n44aqg49LXH71bSeXRz77x5SjkwVtYiccQxLcnmTbjLU2xVraLvI2J+wAhHnVXWW
 9lrU1+V4Ex2Xcd1xR0L0cHeK+meP1TrPRAeF+JDpVI3a/zJiE7cZjfHdG/jH5xWl
 leZJqghVozrZQNtewWWO7XhUFhMDgFu3W/1vNLjSHPZEqaz1JpM67J1+ql6s63l4
 LMWoXbcYZz+SL9ZRCoPkbGue/5fKSHv8/Jl9Sh58+eTS+c/zgN8uFGRNFXLX1+EP
 n8uvt985PxMd6x1+dHumhOUzxnY4Sfi1vjitSunTsNFQ3Cmp4SO0IfBVJWfLUCuC
 xz5hbJGJJbSpvUsO+HWyCg83E5OWghRE/Onpt2jsQSZCrO9HDg4FRTEf3WAMgaqc
 edb5KfbRZPTJQM08gWdluXzSk1nw3FNP2tXW4XlgUrEbjb+fOk0V9dQg2gyYTxQ1
 Nhvn8ZQPi6/GMMELHAIPGmmW1allyOGiAzGlQsv8EmL+OFM6WDI=
 =xXhC
 -----END PGP SIGNATURE-----

Merge tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Jakub Kicinski:
 "Core:

   - Add dedicated kmem_cache for typical/small skb->head, avoid having
     to access struct page at kfree time, and improve memory use.

   - Introduce sysctl to set default RPS configuration for new netdevs.

   - Define Netlink protocol specification format which can be used to
     describe messages used by each family and auto-generate parsers.
     Add tools for generating kernel data structures and uAPI headers.

   - Expose all net/core sysctls inside netns.

   - Remove 4s sleep in netpoll if carrier is instantly detected on
     boot.

   - Add configurable limit of MDB entries per port, and port-vlan.

   - Continue populating drop reasons throughout the stack.

   - Retire a handful of legacy Qdiscs and classifiers.

  Protocols:

   - Support IPv4 big TCP (TSO frames larger than 64kB).

   - Add IP_LOCAL_PORT_RANGE socket option, to control local port range
     on socket by socket basis.

   - Track and report in procfs number of MPTCP sockets used.

   - Support mixing IPv4 and IPv6 flows in the in-kernel MPTCP path
     manager.

   - IPv6: don't check net.ipv6.route.max_size and rely on garbage
     collection to free memory (similarly to IPv4).

   - Support Penultimate Segment Pop (PSP) flavor in SRv6 (RFC8986).

   - ICMP: add per-rate limit counters.

   - Add support for user scanning requests in ieee802154.

   - Remove static WEP support.

   - Support minimal Wi-Fi 7 Extremely High Throughput (EHT) rate
     reporting.

   - WiFi 7 EHT channel puncturing support (client & AP).

  BPF:

   - Add a rbtree data structure following the "next-gen data structure"
     precedent set by recently added linked list, that is, by using
     kfunc + kptr instead of adding a new BPF map type.

   - Expose XDP hints via kfuncs with initial support for RX hash and
     timestamp metadata.

   - Add BPF_F_NO_TUNNEL_KEY extension to bpf_skb_set_tunnel_key to
     better support decap on GRE tunnel devices not operating in collect
     metadata.

   - Improve x86 JIT's codegen for PROBE_MEM runtime error checks.

   - Remove the need for trace_printk_lock for bpf_trace_printk and
     bpf_trace_vprintk helpers.

   - Extend libbpf's bpf_tracing.h support for tracing arguments of
     kprobes/uprobes and syscall as a special case.

   - Significantly reduce the search time for module symbols by
     livepatch and BPF.

   - Enable cpumasks to be used as kptrs, which is useful for tracing
     programs tracking which tasks end up running on which CPUs in
     different time intervals.

   - Add support for BPF trampoline on s390x and riscv64.

   - Add capability to export the XDP features supported by the NIC.

   - Add __bpf_kfunc tag for marking kernel functions as kfuncs.

   - Add cgroup.memory=nobpf kernel parameter option to disable BPF
     memory accounting for container environments.

  Netfilter:

   - Remove the CLUSTERIP target. It has been marked as obsolete for
     years, and we still have WARN splats wrt races of the out-of-band
     /proc interface installed by this target.

   - Add 'destroy' commands to nf_tables. They are identical to the
     existing 'delete' commands, but do not return an error if the
     referenced object (set, chain, rule...) did not exist.

  Driver API:

   - Improve cpumask_local_spread() locality to help NICs set the right
     IRQ affinity on AMD platforms.

   - Separate C22 and C45 MDIO bus transactions more clearly.

   - Introduce new DCB table to control DSCP rewrite on egress.

   - Support configuration of Physical Layer Collision Avoidance (PLCA)
     Reconciliation Sublayer (RS) (802.3cg-2019). Modern version of
     shared medium Ethernet.

   - Support for MAC Merge layer (IEEE 802.3-2018 clause 99). Allowing
     preemption of low priority frames by high priority frames.

   - Add support for controlling MACSec offload using netlink SET.

   - Rework devlink instance refcounts to allow registration and
     de-registration under the instance lock. Split the code into
     multiple files, drop some of the unnecessarily granular locks and
     factor out common parts of netlink operation handling.

   - Add TX frame aggregation parameters (for USB drivers).

   - Add a new attr TCA_EXT_WARN_MSG to report TC (offload) warning
     messages with notifications for debug.

   - Allow offloading of UDP NEW connections via act_ct.

   - Add support for per action HW stats in TC.

   - Support hardware miss to TC action (continue processing in SW from
     a specific point in the action chain).

   - Warn if old Wireless Extension user space interface is used with
     modern cfg80211/mac80211 drivers. Do not support Wireless
     Extensions for Wi-Fi 7 devices at all. Everyone should switch to
     using nl80211 interface instead.

   - Improve the CAN bit timing configuration. Use extack to return
     error messages directly to user space, update the SJW handling,
     including the definition of a new default value that will benefit
     CAN-FD controllers, by increasing their oscillator tolerance.

  New hardware / drivers:

   - Ethernet:
      - nVidia BlueField-3 support (control traffic driver)
      - Ethernet support for imx93 SoCs
      - Motorcomm yt8531 gigabit Ethernet PHY
      - onsemi NCN26000 10BASE-T1S PHY (with support for PLCA)
      - Microchip LAN8841 PHY (incl. cable diagnostics and PTP)
      - Amlogic gxl MDIO mux

   - WiFi:
      - RealTek RTL8188EU (rtl8xxxu)
      - Qualcomm Wi-Fi 7 devices (ath12k)

   - CAN:
      - Renesas R-Car V4H

  Drivers:

   - Bluetooth:
      - Set Per Platform Antenna Gain (PPAG) for Intel controllers.

   - Ethernet NICs:
      - Intel (1G, igc):
         - support TSN / Qbv / packet scheduling features of i226 model
      - Intel (100G, ice):
         - use GNSS subsystem instead of TTY
         - multi-buffer XDP support
         - extend support for GPIO pins to E823 devices
      - nVidia/Mellanox:
         - update the shared buffer configuration on PFC commands
         - implement PTP adjphase function for HW offset control
         - TC support for Geneve and GRE with VF tunnel offload
         - more efficient crypto key management method
         - multi-port eswitch support
      - Netronome/Corigine:
         - add DCB IEEE support
         - support IPsec offloading for NFP3800
      - Freescale/NXP (enetc):
         - support XDP_REDIRECT for XDP non-linear buffers
         - improve reconfig, avoid link flap and waiting for idle
         - support MAC Merge layer
      - Other NICs:
         - sfc/ef100: add basic devlink support for ef100
         - ionic: rx_push mode operation (writing descriptors via MMIO)
         - bnxt: use the auxiliary bus abstraction for RDMA
         - r8169: disable ASPM and reset bus in case of tx timeout
         - cpsw: support QSGMII mode for J721e CPSW9G
         - cpts: support pulse-per-second output
         - ngbe: add an mdio bus driver
         - usbnet: optimize usbnet_bh() by avoiding unnecessary queuing
         - r8152: handle devices with FW with NCM support
         - amd-xgbe: support 10Mbps, 2.5GbE speeds and rx-adaptation
         - virtio-net: support multi buffer XDP
         - virtio/vsock: replace virtio_vsock_pkt with sk_buff
         - tsnep: XDP support

   - Ethernet high-speed switches:
      - nVidia/Mellanox (mlxsw):
         - add support for latency TLV (in FW control messages)
      - Microchip (sparx5):
         - separate explicit and implicit traffic forwarding rules, make
           the implicit rules always active
         - add support for egress DSCP rewrite
         - IS0 VCAP support (Ingress Classification)
         - IS2 VCAP filters (protos, L3 addrs, L4 ports, flags, ToS
           etc.)
         - ES2 VCAP support (Egress Access Control)
         - support for Per-Stream Filtering and Policing (802.1Q,
           8.6.5.1)

   - Ethernet embedded switches:
      - Marvell (mv88e6xxx):
         - add MAB (port auth) offload support
         - enable PTP receive for mv88e6390
      - NXP (ocelot):
         - support MAC Merge layer
         - support for the the vsc7512 internal copper phys
      - Microchip:
         - lan9303: convert to PHYLINK
         - lan966x: support TC flower filter statistics
         - lan937x: PTP support for KSZ9563/KSZ8563 and LAN937x
         - lan937x: support Credit Based Shaper configuration
         - ksz9477: support Energy Efficient Ethernet
      - other:
         - qca8k: convert to regmap read/write API, use bulk operations
         - rswitch: Improve TX timestamp accuracy

   - Intel WiFi (iwlwifi):
      - EHT (Wi-Fi 7) rate reporting
      - STEP equalizer support: transfer some STEP (connection to radio
        on platforms with integrated wifi) related parameters from the
        BIOS to the firmware.

   - Qualcomm 802.11ax WiFi (ath11k):
      - IPQ5018 support
      - Fine Timing Measurement (FTM) responder role support
      - channel 177 support

   - MediaTek WiFi (mt76):
      - per-PHY LED support
      - mt7996: EHT (Wi-Fi 7) support
      - Wireless Ethernet Dispatch (WED) reset support
      - switch to using page pool allocator

   - RealTek WiFi (rtw89):
      - support new version of Bluetooth co-existance

   - Mobile:
      - rmnet: support TX aggregation"

* tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1872 commits)
  page_pool: add a comment explaining the fragment counter usage
  net: ethtool: fix __ethtool_dev_mm_supported() implementation
  ethtool: pse-pd: Fix double word in comments
  xsk: add linux/vmalloc.h to xsk.c
  sefltests: netdevsim: wait for devlink instance after netns removal
  selftest: fib_tests: Always cleanup before exit
  net/mlx5e: Align IPsec ASO result memory to be as required by hardware
  net/mlx5e: TC, Set CT miss to the specific ct action instance
  net/mlx5e: Rename CHAIN_TO_REG to MAPPED_OBJ_TO_REG
  net/mlx5: Refactor tc miss handling to a single function
  net/mlx5: Kconfig: Make tc offload depend on tc skb extension
  net/sched: flower: Support hardware miss to tc action
  net/sched: flower: Move filter handle initialization earlier
  net/sched: cls_api: Support hardware miss to tc action
  net/sched: Rename user cookie and act cookie
  sfc: fix builds without CONFIG_RTC_LIB
  sfc: clean up some inconsistent indentings
  net/mlx4_en: Introduce flexible array to silence overflow warning
  net: lan966x: Fix possible deadlock inside PTP
  net/ulp: Remove redundant ->clone() test in inet_clone_ulp().
  ...
2023-02-21 18:24:12 -08:00
Linus Torvalds
5b0ed59649 for-6.3/block-2023-02-16
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmPvfncQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpob2EADXJxcr2jjYHm/7cjKkyuVX8fr80dNdMeuY
 JFdsjG1k6Uj73BVhQQWYTcs/PsrWBHWRsv6uz4WgOELj55eXmf5Q0kJszyUeJW33
 /DjqLvtoppVcYf80xE13wKvCfn73BjwQo6xkGM0qAYn15eaXiD/Ax3xC6eJlsBeK
 PEw7EJyhacbSxZa/1D2B6+mqII1jUQWProTCc3udZ4JHi3WvdWa3Rda0qCqHl4a1
 +K2aP2YTFIRPxBzfMNa/CafWVIFubTdht+4Ds6R60RImzB9e0VUBfcsiUyW5Zg7L
 Fwv7ptXuWrALwVNdW56Oz1QikBxn2pdRR2HMLwKJW1MD8kP9r8LMm2jV5Rhiwe0B
 OQsGRYkOzBvw+bxeP5fvk0iPGVMz6ActH4gkraA5QdLqayDaFYOadlhqz0uRo5SH
 Fb42Vl658K/MHDSIk8U58TNkmrsIJsBGohXI9DOGINPPvv3XOPi4Q1HmXkGRmii0
 y+lNU/QEGh7xXXew29SPP76uQpQaYfC7NxXCMw/OpOMwehzjsjshmM2lpxi8zsgt
 PJUmfHv5qxCplNmTJXmUpmX7sS7550HUdu9FJb13DM+gzKg8bk9jWVuLrzqrVlG5
 1hKWEl1+heg1heRfaIuJVLbPI0au6Sb4uqhih/PHyrP9TWIoAruDbDJM65GKTxyE
 2uEgcHzHQw==
 =poRc
 -----END PGP SIGNATURE-----

Merge tag 'for-6.3/block-2023-02-16' of git://git.kernel.dk/linux

Pull block updates from Jens Axboe:

 - NVMe updates via Christoph:
      - Small improvements to the logging functionality (Amit Engel)
      - Authentication cleanups (Hannes Reinecke)
      - Cleanup and optimize the DMA mapping cod in the PCIe driver
        (Keith Busch)
      - Work around the command effects for Format NVM (Keith Busch)
      - Misc cleanups (Keith Busch, Christoph Hellwig)
      - Fix and cleanup freeing single sgl (Keith Busch)

 - MD updates via Song:
      - Fix a rare crash during the takeover process
      - Don't update recovery_cp when curr_resync is ACTIVE
      - Free writes_pending in md_stop
      - Change active_io to percpu

 - Updates to drbd, inching us closer to unifying the out-of-tree driver
   with the in-tree one (Andreas, Christoph, Lars, Robert)

 - BFQ update adding support for multi-actuator drives (Paolo, Federico,
   Davide)

 - Make brd compliant with REQ_NOWAIT (me)

 - Fix for IOPOLL and queue entering, fixing stalled IO waiting on
   timeouts (me)

 - Fix for REQ_NOWAIT with multiple bios (me)

 - Fix memory leak in blktrace cleanup (Greg)

 - Clean up sbitmap and fix a potential hang (Kemeng)

 - Clean up some bits in BFQ, and fix a bug in the request injection
   (Kemeng)

 - Clean up the request allocation and issue code, and fix some bugs
   related to that (Kemeng)

 - ublk updates and fixes:
      - Add support for unprivileged ublk (Ming)
      - Improve device deletion handling (Ming)
      - Misc (Liu, Ziyang)

 - s390 dasd fixes (Alexander, Qiheng)

 - Improve utility of request caching and fixes (Anuj, Xiao)

 - zoned cleanups (Pankaj)

 - More constification for kobjs (Thomas)

 - blk-iocost cleanups (Yu)

 - Remove bio splitting from drivers that don't need it (Christoph)

 - Switch blk-cgroups to use struct gendisk. Some of this is now
   incomplete as select late reverts were done. (Christoph)

 - Add bvec initialization helpers, and convert callers to use that
   rather than open-coding it (Christoph)

 - Misc fixes and cleanups (Jinke, Keith, Arnd, Bart, Li, Martin,
   Matthew, Ulf, Zhong)

* tag 'for-6.3/block-2023-02-16' of git://git.kernel.dk/linux: (169 commits)
  brd: use radix_tree_maybe_preload instead of radix_tree_preload
  block: use proper return value from bio_failfast()
  block: bio-integrity: Copy flags when bio_integrity_payload is cloned
  block: Fix io statistics for cgroup in throttle path
  brd: mark as nowait compatible
  brd: check for REQ_NOWAIT and set correct page allocation mask
  brd: return 0/-error from brd_insert_page()
  block: sync mixed merged request's failfast with 1st bio's
  Revert "blk-cgroup: pin the gendisk in struct blkcg_gq"
  Revert "blk-cgroup: pass a gendisk to blkg_lookup"
  Revert "blk-cgroup: delay blk-cgroup initialization until add_disk"
  Revert "blk-cgroup: delay calling blkcg_exit_disk until disk_release"
  Revert "blk-cgroup: move the cgroup information to struct gendisk"
  nvme-pci: remove iod use_sgls
  nvme-pci: fix freeing single sgl
  block: ublk: check IO buffer based on flag need_get_data
  s390/dasd: Fix potential memleak in dasd_eckd_init()
  s390/dasd: sort out physical vs virtual pointers usage
  block: Remove the ALLOC_CACHE_SLACK constant
  block: make kobj_type structures constant
  ...
2023-02-20 14:27:21 -08:00
Linus Torvalds
0e9fd589e6 block-6.2-2023-02-17
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmPv7+wQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpgkvD/95SElddH7D5GhwNlAjl+HiNjBg1Oaa+BiC
 6QMXz7x0/r+JG5xhzEC/w8Cxaghi841YyPDKHe+oE3Sy8yeA2bfFlJpHl7kwIcJZ
 WhHa5IutOftXdeY+DMiJGwKZOGuBDla4ca9fpK4SOF7NdTrLFQL46mvawNq/WGTE
 qQlXH3e9d7htfUq0AR5ZnnZ/nl2mBE1jRCajhTk29PT3Nje2Bo9lL6Kga2CysX4g
 Zflc5aXX3KsyfxAx0kEkrjkZJ8ukqKyR7yh+bp+fu2pgpbwI9STFVvopv7r4MH1f
 Y+2W1bCiwK/KneKNObbpzLGVRMfG5qv8z+pR6fpW4FRu0E5ZYDBE8JCG1dqg00K3
 dfGydR27FFtt9YWMmiHaIWD6rZnCwHDoZnV0LLiAJ/i+kvJkGvT8fL7435+kiEVX
 7u4lfce7GPuG2JErB7BnE+sG7J2dr+CuG/o7y7hTh+eKjWk+yTP6vn7QHka1Uj7X
 LesTUEh2Y2tic4hXh//jKM4KCJk5xMJ1TDJadg1RNmD3sGvNnu5IItzyQ3rV1ZyJ
 jqdppWxVUb+kRSRQEu/icN/rq4/f4jvPSV4DWb9NrzwCex3uzrMKW1j0oP9cMvQ0
 KYhq0i6iWjzbCSSkJckHFiPt6hgicEuQlIl0zxYxoYpVWJmjmqG8QrPDbLQtz8P8
 GViQGoiJ8w==
 =/7xW
 -----END PGP SIGNATURE-----

Merge tag 'block-6.2-2023-02-17' of git://git.kernel.dk/linux

Pull block fix from Jens Axboe:
 "I guess this is what can happen when you prep things early for going
  away, something else comes in last minute. This one fixes another
  regression in 6.2 for NVMe, from this release, and hence we should
  probably get it submitted for 6.2.

  Still waiting for the original reporter (see bugzilla linked in the
  commit) to test this, but Keith managed to setup and recreate the
  issue and tested the patch that way"

* tag 'block-6.2-2023-02-17' of git://git.kernel.dk/linux:
  nvme-pci: refresh visible attrs for cmb attributes
2023-02-18 09:56:58 -08:00
David S. Miller
675f176b4d Merge ra.kernel.org:/pub/scm/linux/kernel/git/netdev/net
Some of the devlink bits were tricky, but I think I got it right.

Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-17 11:06:39 +00:00
Keith Busch
e917a849c3 nvme-pci: refresh visible attrs for cmb attributes
The sysfs group containing the cmb attributes is registered before the
driver knows if they need to be visible or not. Update the group when
cmb attributes are known to exist so the visibility setting is correct.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=217037
Fixes: 86adbf0cdb ("nvme: simplify transport specific device attribute handling")
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-02-17 08:31:05 +01:00
Linus Torvalds
d3d6f0eb08 block-6.2-2023-02-16
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmPugusQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpm4sEACrMqKGTddM7rwxW0mUzZO3ywO2oS2rBfd1
 +f1xD395wB3w02kAg8Iz6DKP3cO1l7C/YVek0SwzG5U2nvAURaoGI+Mxle1FT7Ho
 OKf+QO2tBsiVXapOnCkAzCh7UQ4PcPHrbf/Kl41nQDkN3UDS2w3rFLNK7RtLD9F9
 hW/4mxNsyUkOQpCDRuzwFi2KhCvyDYQQOTb+9SMgm9rZtrCkEOZxAEE7ByHfb3SG
 aH0g39OcP7C1sD5Zs66tdYNUA/LldwF8BEJqmrwAkrsi2Vs+pForcinpyqS7ubX/
 oHLkRnZ2TSyvmTKJwtxdXrMVn9zMa6TJ2XbIC6JUlXV3B2geonHKL8JItaYF7bbq
 BUWL/oIY3gP7pO25oS6ZpZSmRUSBPx3AHy6v62dhpHu7T6Ayf6kt+X/YS/ab5EsP
 0ot49lgP9c1A6OQp0/JuSU/dKtbLPpz5MVShTo8oc1qtJGjgjD4kM6wu98DSUWVw
 QoreZSHyJxQWXkd/hrH6WD+zLRJCtUDssjANaQvpk2uAqE/0JDG+f3wQZXNcah9D
 DjFgXuwjKjdZuUWfMjWfdpvQ0b+8FgE1r6J1KGC7A7ZZCQWcTDjFM91LaKRRwPOI
 I+WdYe1PuXMUylIVyE4EoHVXqCcai1c7x+0Cui6M/OGytxdu4B4tgI34RQ1ZN0kp
 KV2Y8kNV9Q==
 =QFJG
 -----END PGP SIGNATURE-----

Merge tag 'block-6.2-2023-02-16' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:
 "Just a few NVMe fixes that should go into the 6.2 release, adding a
  quirk and fixing two issues introduced in this release:

   - NVMe fixes via Christoph:
       - Always return an ERR_PTR from nvme_pci_alloc_dev (Irvin Cote)
       - Add bogus ID quirk for ADATA SX6000PNP (Daniel Wagner)
       - Set the DMA mask earlier (Christoph Hellwig)"

* tag 'block-6.2-2023-02-16' of git://git.kernel.dk/linux:
  nvme-pci: always return an ERR_PTR from nvme_pci_alloc_dev
  nvme-pci: set the DMA mask earlier
  nvme-pci: add bogus ID quirk for ADATA SX6000PNP
2023-02-16 12:05:33 -08:00
Keith Busch
b6c0c237be nvme-pci: remove iod use_sgls
It's not used anywhere anymore, so remove it.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-02-14 06:40:57 +01:00
Keith Busch
8f0edf45bb nvme-pci: fix freeing single sgl
There may only be a single DMA mapped entry from multiple physical
segments, which means we don't allocate a separte SGL list. Check the
number of allocations prior to know if we need to free something.

Freeing a single list allocation is the same for both PRP and SGL
usages, so we don't need to check the use_sgl flag anymore.

Fixes: 01df742d8c ("nvme-pci: remove SGL segment descriptors")
Reported-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Niklas Schnelle <schnelle@linux.ibm.com>
2023-02-14 06:40:52 +01:00
Irvin Cote
dc785d69d7 nvme-pci: always return an ERR_PTR from nvme_pci_alloc_dev
Don't mix NULL and ERR_PTR returns.

Fixes: 2e87570be9 ("nvme-pci: factor out a nvme_pci_alloc_dev helper")
Signed-off-by: Irvin Cote <irvin.cote@insa-lyon.fr>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-02-14 06:39:02 +01:00
Christoph Hellwig
924bd96ea2 nvme-pci: set the DMA mask earlier
Set the DMA mask before calling dma_addressing_limited, which depends on it.

Note that this stop checking the return value of dma_set_mask_and_coherent
as this function can only fail for masks < 32-bit.

Fixes: 3f30a79c2e ("nvme-pci: set constant paramters in nvme_pci_alloc_ctrl")
Reported-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Tested-by: Michael Kelley <mikelley@microsoft.com>
2023-02-14 06:36:40 +01:00
Daniel Wagner
5f69f009b7 nvme-pci: add bogus ID quirk for ADATA SX6000PNP
Yet another device which needs a quirk:

 nvme nvme1: globally duplicate IDs for nsid 1
 nvme nvme1: VID:DID 10ec:5763 model:ADATA SX6000PNP firmware:V9002s94

Link: http://bugzilla.opensuse.org/show_bug.cgi?id=1207827
Reported-by: Gustavo Freitas <freitasmgustavo@gmail.com>
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-02-13 07:04:53 +01:00
Linus Torvalds
29716680ad block-6.2-2023-02-10
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmPmbGcQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpms0D/9O87gfqnGasICPYI4ra8b1mTNWN2yonOVC
 W8TNjgqfm085CSSyfDgZHhJQzB5/t4W+3upLKHjGzs9gi4xPlc/f+I1cnGj/NkKZ
 Z1ocomKZ6Z6xRf+GjXGikzHneJ7e3SiGUGFHpQCHU1mqDAv1sAoie+sHCZviHso+
 UbwrNZ8Yxln2sXeGcernbNisjrh++uFFoIZ/Uf3MeAzbE6vfA4p/FdPn/+jNBoc2
 fFzfcta0KHejiZx9xbm3PkJ+tkPX8f4cVBSWfde0vsp6k4Lsf/pw4PQucLIY6OzW
 gOWnsmq+kjpmuE5fwOukQ7mhdnvogVoIPugwS6HVs11AztBYAwZTSy2XerB+npEk
 qiGCGHgsO3xRr3p+fzGbS+sHsWYg8/65k6dACCiopfasMiblD5FltuHhMdqZcP98
 evDsQ8saAsANnssA/plLEs2TRMXEMajXMjNTIxqnq0rwZzzdWYjAxu9EvdBWMs3H
 KlM9YiZ5WaDEkARHoC0o/SSA+brXEQwR/+mp9gppmWhzMAZvB8K7hV9hEQ1vA7Ja
 153C1mB2Z+vKSkd62UFf0bupSH26j5TbvO5/t2RnRBAD+riYc23EGreuunq1vLz0
 evxFGHbBJuNQwWzEL/CIf0fmR5Bxnlq7yK2Do5Yt5AmSVYMRwTuZQ1p401Vmw4YM
 AWMizMxv1g==
 =FUPy
 -----END PGP SIGNATURE-----

Merge tag 'block-6.2-2023-02-10' of git://git.kernel.dk/linux

Pull block fix from Jens Axboe:
 "A single fix for a smatch regression introduced in this merge window"

* tag 'block-6.2-2023-02-10' of git://git.kernel.dk/linux:
  nvme-auth: mark nvme_auth_wq static
2023-02-10 08:55:09 -08:00
Jakub Kicinski
8697a258ae Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
net/devlink/leftover.c / net/core/devlink.c:
  565b4824c3 ("devlink: change port event netdev notifier from per-net to global")
  f05bd8ebeb ("devlink: move code to a dedicated directory")
  687125b579 ("devlink: split out core code")
https://lore.kernel.org/all/20230208094657.379f2b1a@canb.auug.org.au/

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-09 12:25:40 -08:00
Tom Rix
70daa5c8f0 nvme-auth: mark nvme_auth_wq static
Fix a smatch report for the newly added nvme_auth_wq.

Signed-off-by: Tom Rix <trix@redhat.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-02-08 07:28:16 +01:00
Jens Axboe
455944e4e4 nvme updates for Linux 6.3
- small improvements to the logging functionality (Amit Engel)
  - authentication cleanups (Hannes Reinecke)
  - cleanup and optimize the DMA mapping cod in the PCIe driver
    (Keith Busch)
  - work around the command effects for Format NVM (Keith Busch)
  - misc cleanups (Keith Busch, Christoph Hellwig)
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAmPh7B8LHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYOIgxAAtLJaXzRFrKF6mK6j2PcXulJDVcayQYIiEBBTn4Li
 DkUr7QBqzRf+Y6TlkycNlAKroUa/tUEAFZWLpRf2t4Vvx2wqCMJGfJGz5/cJa+cs
 wM3T/9qijq+FG1t/ZA8L774vDUnwDHmU7WJgdva2xnXrzFGnh0GqgLyOS37G5Mqh
 kU3iQgMo+089IHmvD10IZjFjaWifbxQbC4gKxno53JYxvn3Ktu57PxBI/jPKIsbT
 yQj21LvCZwDIehpQqtCieLjjNrj1g130hYxplcLWB2D4BkC0rcqTCh+Jb7RtprzP
 wHrZig1zAA7IzGXyouQymi56HJiCAdTOgynVrg4j/VmQIqE3+H/KKQ0n7xMGEqIM
 YGO3CXuyS3DcX3SefuHroZapl95EbnF/EBHE1gGG49hPY2SoEGHlZwou3Hu/mWbm
 lOSuLFCCwvahKC6SzGr7+Vk5LgrwzAhl2Nwf9i1B+q0mnBHxiOXCeGmrtMde63W0
 qlpBhbwquoVbdia3oEvLfos7QZmw1eoyL4MWF6IG+rqxut+3A9eoyIAR61XLepny
 0yr7Kn0hZGtiZPJXhTG43p9hoVUP+FoihlwyGpZ8O2/HQFXZOmWRBPBF+L3mL/u2
 TPeiDEycWx3MtiGENts9UzaxEz9mo1pcky3vcsj8Ch4qYoYfFOmUd5F8Ay+nBETX
 VbQ=
 =yiTz
 -----END PGP SIGNATURE-----

Merge tag 'nvme-6.3-2023-02-07' of git://git.infradead.org/nvme into for-6.3/block

Pull NVMe updates from Christoph:

"nvme updates for Linux 6.3

 - small improvements to the logging functionality (Amit Engel)
 - authentication cleanups (Hannes Reinecke)
 - cleanup and optimize the DMA mapping cod in the PCIe driver
   (Keith Busch)
 - work around the command effects for Format NVM (Keith Busch)
 - misc cleanups (Keith Busch, Christoph Hellwig)"

* tag 'nvme-6.3-2023-02-07' of git://git.infradead.org/nvme:
  nvme: mask CSE effects for security receive
  nvme: always initialize known command effects
  nvmet: for nvme admin set_features cmd, call nvmet_check_data_len_lte()
  nvme-tcp: add additional info for nvme_tcp_timeout log
  nvme: add nvme_opcode_str function for all nvme cmd types
  nvme: remove nvme_execute_passthru_rq
  nvme-pci: place descriptor addresses in iod
  nvme-pci: use mapped entries for sgl decision
  nvme-pci: remove SGL segment descriptors
  nvme-auth: don't use NVMe status codes
  nvme-fabrics: clarify AUTHREQ result handling
2023-02-07 07:22:58 -07:00
Linus Torvalds
0136d86b78 block-6.2-2023-02-03
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmPdRq8QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpjcqEADcWlRjkcLzRpEMD9g3IyDShasT1JVeSvV6
 xqDuA0kRF6DyObu82jE2wiZ49FRpeCUw6S6ZdVhvwGHgPpfLBuPWonFnTqxYAnSz
 XCYnt4QdZHGiydIHVxkyP8Raz6d24kZawlUmbE7dcfksNziyGR5UjbCsk1HNJhmf
 EvnLZ2EozZwsZLW/RRYZrh9Q8ccB8kJeX+JuUVw7sboNyJ+bW+x+7prlm3CKgopX
 IiP69E6qIPe6RHkyLRdKgYgxRdcgeq6uJk/nuZ/6uPCcyrz+0QEtge3CkTe7zLkF
 CPmbWlqngmNfNsS93nPTK2kHWTz8P2spo+UTkXIegSYBA8CIr9lDxazSFKT0B6zH
 yIWzmQoE7YXRI5B21rlPvNGE/gPSy48mSn1ym/MCf+UyWGneRypeU/K//2Ww3UJK
 F1Xl2c1v/EEr28qPuC8VQbAsQ56GOcZ6zW4Q0grxTYm0KzzJ2O5B3FEHdCWlS/x9
 KY5v3a8a3nXg9rNio0ruXiyD5l7PE5nFESNrBFDS4kEfxk4cx50ZfgDH68d515/W
 //EnNjx9nN20yF+LcKD70KJHxPdWaUXGT2c1+E/tdbrgUKReCpER+5hQc8+YxQML
 DCbzr7LJjX5mmDQ5YI6Y09/L6luzFMjrnxpmXkL7nyWQlSYkMqus3vPtDcJ5Xk2J
 shHBlzIcuw==
 =/+rE
 -----END PGP SIGNATURE-----

Merge tag 'block-6.2-2023-02-03' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:
 "A bit bigger than I'd like at this point, but mostly a bunch of little
  fixes. In detail:

   - NVMe pull request via Christoph:
       - Fix a missing queue put in nvmet_fc_ls_create_association
         (Amit Engel)
       - Clear queue pointers on tag_set initialization failure
         (Maurizio Lombardi)
       - Use workqueue dedicated to authentication (Shin'ichiro
         Kawasaki)

   - Fix for an overflow in ublk (Liu)

   - Fix for leaking a queue reference in block cgroups (Ming)

   - Fix for a use-after-free in BFQ (Yu)"

* tag 'block-6.2-2023-02-03' of git://git.kernel.dk/linux:
  blk-cgroup: don't update io stat for root cgroup
  nvme-auth: use workqueue dedicated to authentication
  nvme: clear the request_queue pointers on failure in nvme_alloc_io_tag_set
  nvme: clear the request_queue pointers on failure in nvme_alloc_admin_tag_set
  nvme-fc: fix a missing queue put in nvmet_fc_ls_create_association
  block: Fix the blk_mq_destroy_queue() documentation
  block: ublk: extending queue_size to fix overflow
  block, bfq: fix uaf for bfqq in bic_set_bfqq()
2023-02-03 11:35:42 -08:00
Christoph Hellwig
4bee16daf1 nvme: use bvec_set_virt to initialize special_vec
Use the bvec_set_virt helper to initialize the special_vec.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20230203150634.3199647-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-02-03 08:20:55 -07:00
Christoph Hellwig
fc41c97a3a nvmet: use bvec_set_page to initialize bvecs
Use the bvec_set_page helper to initialize bvecs.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20230203150634.3199647-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-02-03 08:20:55 -07:00
Jakub Kicinski
82b4a9412b Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
net/core/gro.c
  7d2c89b325 ("skb: Do mix page pool and page referenced frags in GRO")
  b1a78b9b98 ("net: add support for ipv4 big tcp")
https://lore.kernel.org/all/20230203094454.5766f160@canb.auug.org.au/

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-02 14:49:55 -08:00
Shin'ichiro Kawasaki
bd97a59da6 nvme-auth: use workqueue dedicated to authentication
NVMe In-Band authentication uses two kinds of works: chap->auth_work and
ctrl->dhchap_auth_work. The latter work flushes or cancels the former
work. However, the both works are queued to the same workqueue nvme-wq.
It results in the lockdep WARNING as follows:

 WARNING: possible recursive locking detected
 6.2.0-rc4+ #1 Not tainted
 --------------------------------------------
 kworker/u16:7/69 is trying to acquire lock:
 ffff902d52e65548 ((wq_completion)nvme-wq){+.+.}-{0:0}, at: start_flush_work+0x2c5/0x380

 but task is already holding lock:
 ffff902d52e65548 ((wq_completion)nvme-wq){+.+.}-{0:0}, at: process_one_work+0x210/0x410

To avoid the WARNING, introduce a new workqueue nvme-auth-wq dedicated
to chap->auth_work.

Reported-by: Daniel Wagner <dwagner@suse.de>
Link: https://lore.kernel.org/linux-nvme/20230130110802.paafkiipmitwtnwr@carbon.lan/
Fixes: f50fff73d6 ("nvme: implement In-Band authentication")
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-02-01 16:11:20 +01:00
Keith Busch
baff649144 nvme: mask CSE effects for security receive
The nvme driver will freeze the IO queues in response to an admin
command with CSE bits set. These bits notify the host that the command
that's about to be executed needs to be done exclusively, hence the
freeze.

The Security Receive command is often reported by multiple vendors with
CSE bits set. The reason for this is that the result depends on the
previous Security Send. This has nothing to do with IO queues, though,
so the driver is taking an overly cautious response to seeing this
passthrough command, while unable to fufill the intended admin queue
action.

Rather than freeze IO during this harmless command, mask off the
effects. This freezing is observed to cause IO latency spikes when host
software periodically validates the security state of the drives.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-02-01 16:10:10 +01:00
Keith Busch
cc115cbe12 nvme: always initialize known command effects
Instead of appending command effects flags per IO, set the known effects
flags the driver needs to react to just once during initial setup.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-02-01 16:10:09 +01:00
Amit Engel
ddf9171769 nvmet: for nvme admin set_features cmd, call nvmet_check_data_len_lte()
This is due to the fact that the host is allowed to pass the controller
an sgl describing a buffer that is larger than the payload itself

Signed-off-by: Amit Engel <Amit.Engel@dell.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-02-01 14:22:00 +01:00
Amit Engel
99607843e7 nvme-tcp: add additional info for nvme_tcp_timeout log
This provides additional details about the rq/cmd that is timed out

example log if CONFIG_NVME_VERBOSE_ERRORS is configured:
"nvme nvme0: queue 2 timeout cid 0xd058 type 4 opc Write (0x1)"

example log if CONFIG_NVME_VERBOSE_ERRORS is not configured:
"nvme nvme0: queue 2 timeout cid 0xd058 type 4 opc I/O Cmd (0x1)"

Signed-off-by: Amit Engel <Amit.Engel@dell.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-02-01 14:22:00 +01:00
Amit Engel
567da14d46 nvme: add nvme_opcode_str function for all nvme cmd types
nvme_opcode_str will handle io/admin/fabrics ops

This improves NVMe errors logging

Signed-off-by: Amit Engel <Amit.Engel@dell.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-02-01 14:22:00 +01:00
Christoph Hellwig
62281b9ed6 nvme: remove nvme_execute_passthru_rq
After moving the nvme_passthru_end call to the callers of
nvme_execute_passthru_rq, this function has become quite pointless,
so remove it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2023-02-01 14:21:59 +01:00
Keith Busch
7846c1b5a5 nvme-pci: place descriptor addresses in iod
The 'struct nvme_iod' space is appended at the end of the preallocated
'struct request', and padded to the cache line size. This leaves some
free memory (in most kernel configs) up for grabs.

Instead of appending the nvme data descriptor addresses after the
scatterlist, inline these for free within struct nvme_iod. There is now
enough space in the mempool for 128 possibe segments.

And without increasing the size of the preallocated requests, we can
hold up to 5 PRP descriptor elements, allowing the driver to increase
its max transfer size to 8MB.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-02-01 14:21:59 +01:00
Keith Busch
ae58293503 nvme-pci: use mapped entries for sgl decision
The driver uses the dma entries for setting up its command's SGL/PRP
lists. The dma mapping might have fewer entries than the physical
segments, so check the dma mapped count to determine which nvme data
layout method is more optimal.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-02-01 14:21:59 +01:00
Keith Busch
01df742d8c nvme-pci: remove SGL segment descriptors
The max segments this driver can see is 127, well below the 256
threshold needed to add an nvme sgl segment descriptor. Remove all the
useless checks and dead code.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-02-01 14:21:59 +01:00
Hannes Reinecke
b0ef1b11d3 nvme-auth: don't use NVMe status codes
NVMe status codes are part of the wire protocol, and shouldn't be
fabricated in the stack. So with this patch the authentication code
is switched over to use error codes; as a side effect authentication
failures due to internal error won't be retried anymore.
But that shouldn't have happened anyway.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-02-01 14:21:58 +01:00
Hannes Reinecke
0686fb3cc5 nvme-fabrics: clarify AUTHREQ result handling
The NVMe 2.0 spec defines the ATR and ASCR bits in the AUTHREQ
connect response field to be mutually exclusive. So to clarify the
handling here switch the AUTHREQ handling to use the bit definitions
and check for both bits.
And while we're at it, add a message to the user that secure
concatenation is not supported (yet).

Suggested-by: Mark Lehrer <mark.lehrer@wdc.com>
Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-02-01 14:21:58 +01:00
Maurizio Lombardi
6fbf13c0e2 nvme: clear the request_queue pointers on failure in nvme_alloc_io_tag_set
In nvme_alloc_io_tag_set(), the connect_q pointer should be set to NULL
in case of error to avoid potential invalid pointer dereferences.

Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-02-01 14:18:46 +01:00
Maurizio Lombardi
fd62678ab5 nvme: clear the request_queue pointers on failure in nvme_alloc_admin_tag_set
If nvme_alloc_admin_tag_set() fails, the admin_q and fabrics_q pointers
are left with an invalid, non-NULL value. Other functions may then check
the pointers and dereference them, e.g. in

  nvme_probe() -> out_disable: -> nvme_dev_remove_admin().

Fix the bug by setting admin_q and fabrics_q to NULL in case of error.

Also use the set variable to free the tag_set as ctrl->admin_tagset isn't
initialized yet.

Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-02-01 14:18:46 +01:00
Amit Engel
0cab440487 nvme-fc: fix a missing queue put in nvmet_fc_ls_create_association
As part of nvmet_fc_ls_create_association there is a case where
nvmet_fc_alloc_target_queue fails right after a new association with an
admin queue is created. In this case, no one releases the get taken in
nvmet_fc_alloc_target_assoc.  This fix is adding the missing put.

Signed-off-by: Amit Engel <Amit.Engel@dell.com>
Reviewed-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-02-01 14:18:46 +01:00
Pankaj Raghav
d67ea690ce block: introduce bdev_zone_no helper
Add a generic bdev_zone_no() helper to calculate zone number for a
given sector in a block device. This helper internally uses disk_zone_no()
to find the zone number.

Use the helper bdev_zone_no() to calculate nr of zones. This lets us
make modifications to the math if needed in one place.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Link: https://lore.kernel.org/r/20230110143635.77300-4-p.raghav@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:18:34 -07:00
Anuj Gupta
888545cb43 nvme: set REQ_ALLOC_CACHE for uring-passthru request
This patch sets REQ_ALLOC_CACHE flag for uring-passthru requests.
This is a prep-patch so that normal / IRQ-driven uring-passthru
I/Os can also leverage bio-cache.

Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20230117120638.72254-2-anuj20.g@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:18:34 -07:00
Jakub Kicinski
b568d3072a Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Conflicts:

drivers/net/ethernet/intel/ice/ice_main.c
  418e53401e ("ice: move devlink port creation/deletion")
  643ef23bd9 ("ice: Introduce local var for readability")
https://lore.kernel.org/all/20230127124025.0dacef40@canb.auug.org.au/
https://lore.kernel.org/all/20230124005714.3996270-1-anthony.l.nguyen@intel.com/

drivers/net/ethernet/engleder/tsnep_main.c
  3d53aaef43 ("tsnep: Fix TX queue stop/wake for multiple queues")
  25faa6a4c5 ("tsnep: Replace TX spin_lock with __netif_tx_lock")
https://lore.kernel.org/all/20230127123604.36bb3e99@canb.auug.org.au/

net/netfilter/nf_conntrack_proto_sctp.c
  13bd9b31a9 ("Revert "netfilter: conntrack: add sctp DATA_SENT state"")
  a44b765148 ("netfilter: conntrack: unify established states for SCTP paths")
  f71cb8f45d ("netfilter: conntrack: sctp: use nf log infrastructure for invalid packets")
https://lore.kernel.org/all/20230127125052.674281f9@canb.auug.org.au/
https://lore.kernel.org/all/d36076f3-6add-a442-6d4b-ead9f7ffff86@tessares.net/

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-27 22:56:18 -08:00
Linus Torvalds
90aaef4e35 block-6.2-2023-01-27
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmPULHAQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpmnjD/94pMYVqqg80e7oi++2ANeinKMF10u8giqq
 zCUhYVTpVuPTx+1DIjCE+wL156kRzyyK6TrL67UNejl2vBqdd1HmvEL51fJ1Za+8
 IltVVvpgkGBQBcZPUR3FOPJQn+canalSvS0xooKX4Asb6UkaN0jVkPihB0ImT8pf
 qZP0C4HsYjHwH4N7WHlgmEQ54wUlDQQTUuNu1npFNd4WW4cx/t5ZQMwZQ2YjXrb8
 pw7jbrNpnmeGvPEP3H0ea+QUGbL4X47f4rU0nCvMGIuhcNaksWchhiB1mWOjYNVi
 3h/bpW8Yjp39zuWdUh5LzTmLkiNCpZKU7M81Wnx73R+qRBZ2WXhSHJEcWweS7u5m
 iBgxuMAQ0zlDoZ/iLJCE8Ec2aBCAOo2dUVC7BSeyMsE6upc4/hC1p55an7vUC+cG
 KzPr6UlPvaXSDMZAO+bTMfPu00B4zmnMQUVrf4040l1tZFibCDuLEba/l9yFBszd
 qfNYFznUhXwvKYQEX7Tev4Hg+IQcGJaFpOyxPtQ9SHZTRxgIm1d/VQM5I7rnQ3vB
 CMySW2wj+lnseimrtC+OPgP3AAEI9WCcIlK/n77GV6f5rY3/t9S4mtBjHJdzIqzC
 YA72NDQ/wHUXC3kvNOMDxajZFcXlU4cTPDKncoClKfqNbn+Y4bLY49ntrkNmexUt
 1Sow6bZT/g==
 =CLAD
 -----END PGP SIGNATURE-----

Merge tag 'block-6.2-2023-01-27' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:
 "Minor tweaks for this release:

   - NVMe pull request via Christoph:
        - Flush initial scan_work for async probe (Keith Busch)
        - Fix passthrough csi check (Keith Busch)
        - Fix nvme-fc initialization order (Ross Lagerwall)

   - Fix for tearing down non-started device in ublk (Ming)"

* tag 'block-6.2-2023-01-27' of git://git.kernel.dk/linux:
  block: ublk: move ublk_chr_class destroying after devices are removed
  nvme: fix passthrough csi check
  nvme-pci: flush initial scan_work for async probe
  nvme-fc: fix initialization order
2023-01-27 16:16:57 -08:00
Keith Busch
85eee6341a nvme: fix passthrough csi check
The namespace head saves the Command Set Indicator enum, so use that
instead of the Command Set Selected. The two values are not the same.

Fixes: 831ed60c2a ("nvme: also return I/O command effects from nvme_command_effects")
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-01-25 07:09:38 +01:00
Keith Busch
5a5754a499 nvme-pci: flush initial scan_work for async probe
The nvme device may have a namespace with the root partition, so make
sure we've completed scanning before returning from the async probe.

Fixes: eac3ef2629 ("nvme-pci: split the initial probe from the rest path")
Reported-by: Klaus Jensen <its@irrelevant.dk>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Tested-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Tested-by: Klaus Jensen <k.jensen@samsung.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-01-24 20:16:09 +01:00
Ross Lagerwall
98e3528012 nvme-fc: fix initialization order
ctrl->ops is used by nvme_alloc_admin_tag_set() but set by
nvme_init_ctrl() so reorder the calls to avoid a NULL pointer
dereference.

Fixes: 6dfba1c09c ("nvme-fc: use the tagset alloc/free helpers")
Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-01-23 17:42:11 +01:00
Peilin Ye
40e0b09081 net/sock: Introduce trace_sk_data_ready()
As suggested by Cong, introduce a tracepoint for all ->sk_data_ready()
callback implementations.  For example:

<...>
  iperf-609  [002] .....  70.660425: sk_data_ready: family=2 protocol=6 func=sock_def_readable
  iperf-609  [002] .....  70.660436: sk_data_ready: family=2 protocol=6 func=sock_def_readable
<...>

Suggested-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-01-23 11:26:50 +00:00
Linus Torvalds
edc00350d2 block-6.2-2023-01-20
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmPK8NUQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgptS/EADT+m0n7jjonp7NoENoZT2y4o5ayESuEmBV
 X8QUg/Ji1P3VG3QzI+yCqevGa2Rkkd8EenlokpjLliuqPdb/aZ56G7rsebotzWu3
 zOV3XNvKvD0thiMIjmXABvmUKdb3lcrM5tpC9Uqq6L52SqbtkSsPUVO+rWE/tTZk
 u97dUmyQcaD2brGfn4AcR0wgQoxrcLbmUpa/TKhFIDPDl+4PFi2ePoSQSsdDJT8R
 PTvQhud1dl/wJ3733vj8S8s4Sxkbm5xXt50oDaTSmdOWSNOuMNuyW3WqkZ/SPdyK
 LDmtOXEfuiokJK/l+DZ9SKt6jONW6ShdEaUo37/8yjYCnZFvWkcfn+6mWaDygjqS
 eI3Mwb91w8K9krTZU1tGq3qOtxEJwbtLHCM96nh8SHLjNrYYrkZQZHOcea9CgX8h
 iMzI5ylP2t6RofwHwwFoZYGOxrRz/R5LS+pCFIv720QnBjb9ZpO9zoDQaDl5tOS6
 UpuL3XPzs9rZZizY00NG6+vQeSdSLRyyjs4XIWYxrZy2wuC2EjM0HstMfefldQcJ
 uEfgrVgd/pcUTNzCG8uH8cZbmeflivm18J6OX86l2X9d3m62HD5gULHFOFxbDwsC
 zoQOsyaGVRLpO0+/0MKs7aLaZlk40VDb4XdRsM6qbd4+x+J7yicvGrkUxS6cZMwT
 VlQm3YUc0g==
 =L12Q
 -----END PGP SIGNATURE-----

Merge tag 'block-6.2-2023-01-20' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:
 "Various little tweaks all over the place:

   - NVMe pull request via Christoph:
       - fix controller shutdown regression in nvme-apple (Janne Grunau)
       - fix a polling on timeout regression in nvme-pci (Keith Busch)

   - Fix a bug in the read request side request allocation caching
     (Pavel)

   - pktcdvd was brought back after we configured a NULL return on bio
     splits, make it consistent with the others (me)

   - BFQ refcount fix (Yu)

   - Block cgroup policy activation fix (Yu)

   - Fix for an md regression introduced in the 6.2 cycle (Adrian)"

* tag 'block-6.2-2023-01-20' of git://git.kernel.dk/linux:
  nvme-pci: fix timeout request state check
  nvme-apple: only reset the controller when RTKit is running
  nvme-apple: reset controller during shutdown
  block: fix hctx checks for batch allocation
  block/rnbd-clt: fix wrong max ID in ida_alloc_max
  blk-cgroup: fix missing pd_online_fn() while activating policy
  pktcdvd: check for NULL returna fter calling bio_split_to_limits()
  block, bfq: switch 'bfqg->ref' to use atomic refcount apis
  md: fix incorrect declaration about claim_rdev in md_import_device
2023-01-20 12:44:41 -08:00
Keith Busch
1c58420858 nvme-pci: fix timeout request state check
Polling the completion can progress the request state to IDLE, either
inline with the completion, or through softirq. Either way, the state
may not be COMPLETED, so don't check for that. We only care if the state
isn't IN_FLIGHT.

This is fixing an issue where the driver aborts an IO that we just
completed. Seeing the "aborting" message instead of "polled" is very
misleading as to where the timeout problem resides.

Fixes: bf392a5dc0 ("nvme-pci: Remove tag from process cq")
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-01-19 09:08:01 +01:00
Janne Grunau
c0a4a1eafb nvme-apple: only reset the controller when RTKit is running
NVMe controller register access hangs indefinitely when the co-processor
is not running. A missed reset is preferable over a hanging thread since
it could be recoverable.

Signed-off-by: Janne Grunau <j@jannau.net>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-01-19 09:08:01 +01:00
Janne Grunau
c06ba7b892 nvme-apple: reset controller during shutdown
This is a functional revert of c76b8308e4 ("nvme-apple: fix controller
shutdown in apple_nvme_disable").

The commit broke suspend/resume since apple_nvme_reset_work() tries to
disable the controller on resume. This does not work for the apple NVMe
controller since register access only works while the co-processor
firmware is running.

Disabling the NVMe controller in the shutdown path is also required
for shutting the co-processor down. The original code was appropriate
for this hardware. Add a comment to prevent a similar breaking changes
in the future.

Fixes: c76b8308e4 ("nvme-apple: fix controller shutdown in apple_nvme_disable")
Reported-by: Janne Grunau <j@jannau.net>
Link: https://lore.kernel.org/all/20230110174745.GA3576@jannau.net/
Signed-off-by: Janne Grunau <j@jannau.net>
[hch: updated with a more descriptive comment from Hector Martin]
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-01-19 09:07:35 +01:00
Linus Torvalds
97ec4d559d block-6.2-2023-01-13
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmPBsFAQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpqgnEAC0OqxnMsOPNbkLO7k6FsSrG7ZoENkOIMCt
 Grk3D1cPkM13I0xc+WiaOBezMriPzfdXvt5AGDn9fd53Ih47qpSY4eU6pCqoCk5y
 HWdn8KXZvhJGZsSy0Nz+cfPDW/8diJON8YBpJwWM/DfDdP8XibtjlIMTVTtJab6h
 aGWjmy3leNfghOJ0cZ1wjL6maWFoowQASs52PZfajSc0mQ5X0i8BgQb1WOHNu89C
 vEir9PYlTmdMnYlAKLsyEL3KoGUPm++zSLtJeyWYavlCMGK5WTyNkzmeXqsQhAGf
 b1LjovQASe//1t2wvCzQviRf4cae0pE9JhiaYt2oxoDdHrfQj/WPndVS4yE9c+0O
 BnLVTCFHNv86TRXNCbEUzI+Ftj6m9qt4MrHz8YpstX7FxGxYC+T5RqTwYClWZQ0j
 llBuJUHj+kkAv6kBMJCHTyat6pxIDgcb52QMJr5mFWuEaTloraBIJC70hMtxBQV/
 j5mrBYqCngCHVs+hAl9UQ4zqQVSvkeT11QFvwFolxIfs7qtfLqeGzYxvaeomqO3V
 sA+H5NY50OEuPfFFmCpcNUJXeUKg7wP39iNHdz6P5cCDBCfUwbNbgKKKNmBovaC+
 KhPd8Xo1MmzDuF+cylvTcjOBDte4425GN7PBj4vP1xbuHYcjg6AEFLawgqE9Y4XX
 xyNlgJXPOg==
 =ujiw
 -----END PGP SIGNATURE-----

Merge tag 'block-6.2-2023-01-13' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:
 "Nothing major in here, just a collection of NVMe fixes and dropping a
  wrong might_sleep() that static checkers tripped over but which isn't
  valid"

* tag 'block-6.2-2023-01-13' of git://git.kernel.dk/linux:
  MAINTAINERS: stop nvme matching for nvmem files
  nvme: don't allow unprivileged passthrough on partitions
  nvme: replace the "bool vec" arguments with flags in the ioctl path
  nvme: remove __nvme_ioctl
  nvme-pci: fix error handling in nvme_pci_enable()
  nvme-pci: add NVME_QUIRK_IDENTIFY_CNS quirk to Apple T2 controllers
  nvme-apple: add NVME_QUIRK_IDENTIFY_CNS quirk to fix regression
  block: Drop spurious might_sleep() from blk_put_queue()
2023-01-13 17:41:19 -06:00
Christoph Hellwig
313c08c72e nvme: don't allow unprivileged passthrough on partitions
Passthrough commands can always access the entire device, and thus
submitting them on partitions is an privelege escalation.

In hindsight we should have never allowed any passthrough commands on
partitions, but it's probably too late to change that decision now.

Fixes: e4fbcf32c8 ("nvme: identify-namespace without CAP_SYS_ADMIN")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2023-01-10 08:15:57 +01:00
Christoph Hellwig
7b7fdb8e2d nvme: replace the "bool vec" arguments with flags in the ioctl path
To prepare for passing down more information, replace the boolean
vec argument with a more extensible flags one.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2023-01-10 08:15:51 +01:00
Christoph Hellwig
2fa1dc8637 nvme: remove __nvme_ioctl
Open code __nvme_ioctl in the two callers to make future changes that
pass down additional paramters in the ioctl path easier.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2023-01-10 08:15:27 +01:00
Tong Zhang
09113abfb6 nvme-pci: fix error handling in nvme_pci_enable()
There are two issues in nvme_pci_enable():

 1) If pci_alloc_irq_vectors() fails, device is left enabled. Fix this by
    adding a goto disable statement.
 2) nvme_pci_configure_admin_queue could return -ENODEV, in this case,
    we will need to free IRQ properly.  Otherwise the following warning
    could be triggered:

[    5.286752] WARNING: CPU: 0 PID: 33 at kernel/irq/irqdomain.c:253 irq_domain_remove+0x12d/0x140
[    5.290547] Call Trace:
[    5.290626]  <TASK>
[    5.290695]  msi_remove_device_irq_domain+0xc9/0xf0
[    5.290843]  msi_device_data_release+0x15/0x80
[    5.290978]  release_nodes+0x58/0x90
[    5.293788] WARNING: CPU: 0 PID: 33 at kernel/irq/msi.c:276 msi_device_data_release+0x76/0x80
[    5.297573] Call Trace:
[    5.297651]  <TASK>
[    5.297719]  release_nodes+0x58/0x90
[    5.297831]  devres_release_all+0xef/0x140
[    5.298339]  device_unbind_cleanup+0x11/0xc0
[    5.298479]  really_probe+0x296/0x320

Fixes: a6ee7f19eb ("nvme-pci: call nvme_pci_configure_admin_queue from nvme_pci_enable")
Co-developed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Tong Zhang <ztong0001@gmail.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-01-10 08:15:03 +01:00
Hector Martin
453116a441 nvme-pci: add NVME_QUIRK_IDENTIFY_CNS quirk to Apple T2 controllers
This mirrors the quirk added to Apple Silicon controllers in apple.c.
These controllers do not support the Active NS ID List command and
behave identically to the SoC version judging by existing user
reports/syslogs, so will need the same fix. This quirk reverts
back to NVMe 1.0 behavior and disables the broken commands.

Fixes: 811f4de034 ("nvme: avoid fallback to sequential scan due to transient issues")
Signed-off-by: Hector Martin <marcan@marcan.st>
Tested-by: Orlando Chamberlain <orlandoch.dev@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-01-10 08:15:03 +01:00
Hector Martin
aa96d6aa75 nvme-apple: add NVME_QUIRK_IDENTIFY_CNS quirk to fix regression
From the get-go, this driver and the ANS syslog have been complaining
about namespace identification. In 6.2-rc1, commit 811f4de034 ("nvme:
avoid fallback to sequential scan due to transient issues") regressed
the driver by no longer allowing fallback to sequential namespace scans,
leaving us with no namespaces.

It turns out that the real problem is that this controller claiming
NVMe 1.1 compat is treating the CNS field as a binary field, as in NVMe
1.0. This already has a quirk, NVME_QUIRK_IDENTIFY_CNS, so set it for
the controller to fix all this nonsense (including other errors
triggered by other CNS commands).

Fixes: 811f4de034 ("nvme: avoid fallback to sequential scan due to transient issues")
Fixes: 5bd2927ace ("nvme-apple: Add initial Apple SoC NVMe driver")
Signed-off-by: Hector Martin <marcan@marcan.st>
Reviewed-by: Sven Peter <sven@svenpeter.dev>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-01-10 08:15:03 +01:00
Linus Torvalds
a689b938df block-2023-01-06
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmO4SiAQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpgc9D/0XJufUgHsLeFCF5G+q6iL5Bz+d7ymw+VFv
 xrNjOz8wUKYKXcJqxLrPdkmL1tcd1+fESNGgyBidn4P53BWoHB9dtbs8+Lova08t
 I4lQmZHgxgbAMhSOwGvHlOTkdlBIw/fBgQ6XdI+1qmpxzma5+gjImjyp7oH+pODP
 zqsg3DKRQmDApKWtvB6D5iItsWc1Jx5TEuOfU5/JjLuVZWl6O2qynNVUccF5T89O
 jkt624yO+r70CVfX3NAdFTm/mOEUiGH97l4l/8OkekJ40pf73xzvNRF/S8z8nHb/
 QUGY1tKvr08xfPusl3epmQ5aO938F0aFpKi2x6P+z3G6Uq+dqMMrjJl8XMDG+J+d
 +yBow5yRH7o6oBb0YPPz/6S5zBjslsHtuKFd/rs4mCDfjp9GHiIIiIpdLxZEWawJ
 WaYlc5WlzSdopT/IxfaRZ9HMHzscdKadjiFngSKdpEdCUw7wxdIey+/9xbKR+xh0
 Es13MzyCCurj4OnyDl5cnetGJUNNiL1JvQmIaFVndyxnMfvOaZBBmKW7h9RYBIU/
 nqi4vZwYoafnGUIfLFL6uq9F627lF/EhodDuLheqz0G2pWhmFJITOJUAakGNFf83
 22CiKY2GyTrOy5tKqkNzv7BG/KyJZGP+CxyyQ/7xm0k2C9wEjYSpZHKcjaNZygU5
 eswPKbZMkw==
 =LJ5Q
 -----END PGP SIGNATURE-----

Merge tag 'block-2023-01-06' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:
 "The big change here is obviously the revert of the pktcdvd driver
  removal. Outside of that, just minor tweaks. In detail:

   - Re-instate the pktcdvd driver, which necessitates adding back
     bio_copy_data_iter() and the fops->devnode() hook for now (me)

   - Fix for splitting of a bio marked as NOWAIT, causing either nowait
     reads or writes to error with EAGAIN even if parts of the IO
     completed (me)

   - Fix for ublk, punting management commands to io-wq as they can all
     easily block for extended periods of time (Ming)

   - Removal of SRCU dependency for the block layer (Paul)"

* tag 'block-2023-01-06' of git://git.kernel.dk/linux:
  block: Remove "select SRCU"
  Revert "pktcdvd: remove driver."
  Revert "block: remove devnode callback from struct block_device_operations"
  Revert "block: bio_copy_data_iter"
  ublk: honor IO_URING_F_NONBLOCK for handling control command
  block: don't allow splitting of a REQ_NOWAIT bio
  block: handle bio_split_to_limits() NULL return
2023-01-06 13:12:42 -08:00
Jens Axboe
613b14884b block: handle bio_split_to_limits() NULL return
This can't happen right now, but in preparation for allowing
bio_split_to_limits() returning NULL if it ended the bio, check for it
in all the callers.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-04 09:05:23 -07:00
Linus Torvalds
bff687b3da block-6.2-2022-12-29
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmOt35IQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpq4QD/9nGWlCdLRLyyiysUWhLwTmsZt7PSebG3KD
 CmCyEt+o8n2PdxBe7Xq8glppvuQTwJwOYynMXcAWd0IBxYUnAkCDF4PTbmdiIiVY
 fzci1UydIXw/HOVft/2IbIC0+Apo+UJ9WVqqhwm7ya0lAkLQYuT7iWmn1pxFdbcI
 hi9ZbaghxtZXSQP4ZtKG+a8tQ99HTsf76xqCM6DdMCVOUH6/V1f5g67iSkYLCL3Q
 V9bAq7U2VEXFdRC9m5yPG7KGUBRllE4etBvVAIIcAQBAgEktyvgvas5luwu5j+W0
 R2z8KXp2X4BWGW+R45hpt2cdyfcJy24+6QnAGNQAs/3Muq1IfEMwmJ5tyR/y8HiS
 0RvIv/BOwDMDOaM9YuW0beyHQMu+bwhtf+C453r1gsKmnL912+ElMzuqUpditkjr
 d4nL5aUTk5iM38jzJpQylZSY+20wyUnOmxCxETpeSMaRrYY75PLOVCJLNncJuZtQ
 GFtqUzMPVURLMGnxyJZLiG+qbGVXh9f7B7OStKDPhBJvqoZ2cQpwTzywmYxQOv+0
 OO1DdmMDtUWNpuBN2U4HOzLElmB034OM3Fcia529IhLoXK/x57n9mXW0D0HeOd84
 /EYSsmsT+spv7psKBNjhXkZwgVpVgsYOu8eUjRKYUmrYLEbTk+fGUtia3rBd4wjl
 uNMuRhRtUA==
 =cqhz
 -----END PGP SIGNATURE-----

Merge tag 'block-6.2-2022-12-29' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:
 "Mostly just NVMe, but also a single fixup for BFQ for a regression
  that happened during the merge window. In detail:

   - NVMe pull requests via Christoph:
      - Fix doorbell buffer value endianness (Klaus Jensen)
      - Fix Linux vs NVMe page size mismatch (Keith Busch)
      - Fix a potential use memory access beyong the allocation limit
        (Keith Busch)
      - Fix a multipath vs blktrace NULL pointer dereference (Yanjun
        Zhang)
      - Fix various problems in handling the Command Supported and
        Effects log (Christoph Hellwig)
      - Don't allow unprivileged passthrough of commands that don't
        transfer data but modify logical block content (Christoph
        Hellwig)
      - Add a features and quirks policy document (Christoph Hellwig)
      - Fix some really nasty code that was correct but made smatch
        complain (Sagi Grimberg)

   - Use-after-free regression in BFQ from this merge window (Yu)"

* tag 'block-6.2-2022-12-29' of git://git.kernel.dk/linux:
  nvme-auth: fix smatch warning complaints
  nvme: consult the CSE log page for unprivileged passthrough
  nvme: also return I/O command effects from nvme_command_effects
  nvmet: don't defer passthrough commands with trivial effects to the workqueue
  nvmet: set the LBCC bit for commands that modify data
  nvmet: use NVME_CMD_EFFECTS_CSUPP instead of open coding it
  nvme: fix the NVME_CMD_EFFECTS_CSE_MASK definition
  docs, nvme: add a feature and quirk policy document
  nvme-pci: update sqsize when adjusting the queue depth
  nvme: fix setting the queue depth in nvme_alloc_io_tag_set
  block, bfq: fix uaf for bfqq in bfq_exit_icq_bfqq
  nvme: fix multipath crash caused by flush request when blktrace is enabled
  nvme-pci: fix page size checks
  nvme-pci: fix mempool alloc size
  nvme-pci: fix doorbell buffer value endianness
2022-12-29 16:57:29 -08:00
Sagi Grimberg
76807fcd73 nvme-auth: fix smatch warning complaints
When initializing auth context, there may be no secrets passed
by the user. Make return code explicit when returning successfully.

smatch warnings:
drivers/nvme/host/auth.c:950 nvme_auth_init_ctrl() warn: missing error code? 'ret'

Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-12-28 06:26:35 -10:00
Christoph Hellwig
6f99ac04c4 nvme: consult the CSE log page for unprivileged passthrough
Commands like Write Zeros can change the contents of a namespaces without
actually transferring data.  To protect against this, check the Commands
Supported and Effects log is supported by the controller for any
unprivileg command passthrough and refuse unprivileged passthrough if the
command has any effects that can change data or metadata.

Note: While the Commands Support and Effects log page has only been
mandatory since NVMe 2.0, it is widely supported because Windows requires
it for any command passthrough from userspace.

Fixes: e4fbcf32c8 ("nvme: identify-namespace without CAP_SYS_ADMIN")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
2022-12-28 06:26:31 -10:00
Christoph Hellwig
831ed60c2a nvme: also return I/O command effects from nvme_command_effects
To be able to use the Commands Supported and Effects Log for allowing
unprivileged passtrough, it needs to be corretly reported for I/O
commands as well.  Return the I/O command effects from
nvme_command_effects, and also add a default list of effects for the
NVM command set.  For other command sets, the Commands Supported and
Effects log is required to be present already.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
2022-12-28 06:26:25 -10:00
Christoph Hellwig
2a459f6933 nvmet: don't defer passthrough commands with trivial effects to the workqueue
Mask out the "Command Supported" and "Logical Block Content Change" bits
and only defer execution of commands that have non-trivial effects to
the workqueue for synchronous execution.  This allows to execute admin
commands asynchronously on controllers that provide a Command Supported
and Effects log page, and will keep allowing to execute Write commands
asynchronously once command effects on I/O commands are taken into
account.

Fixes: c1fef73f79 ("nvmet: add passthru code to process commands")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
2022-12-28 06:26:18 -10:00
Christoph Hellwig
f2d1421391 nvmet: set the LBCC bit for commands that modify data
Write, Write Zeroes, Zone append and a Zone Reset through
Zone Management Send modify the logical block content of a namespace,
so make sure the LBCC bit is reported for them.

Fixes: b5d0b38c0475 ("nvmet: add Command Set Identifier support")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2022-12-28 06:26:13 -10:00
Christoph Hellwig
61f37154c5 nvmet: use NVME_CMD_EFFECTS_CSUPP instead of open coding it
Use NVME_CMD_EFFECTS_CSUPP instead of open coding it and assign a
single value to multiple array entries instead of repeated assignments.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2022-12-28 06:26:08 -10:00
Christoph Hellwig
88d356ca41 nvme-pci: update sqsize when adjusting the queue depth
Update the core sqsize field in addition to the PCIe-specific
q_depth field as the core tagset allocation helpers rely on it.

Fixes: 0da7feaa59 ("nvme-pci: use the tagset alloc/free helpers")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Hugh Dickins <hughd@google.com>
Link: https://lore.kernel.org/r/20221225103234.226794-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-26 12:10:51 -07:00
Christoph Hellwig
33b93727ce nvme: fix setting the queue depth in nvme_alloc_io_tag_set
While the CAP.MQES field in NVMe is a 0s based filed with a natural one
off, we also need to account for the queue wrap condition and fix undo
the one off again in nvme_alloc_io_tag_set.  This was never properly
done by the fabrics drivers, but they don't seem to care because there
is no actual physical queue that can wrap around, but it became a
problem when converting over the PCIe driver.  Also add back the
BLK_MQ_MAX_DEPTH check that was lost in the same commit.

Fixes: 0da7feaa59 ("nvme-pci: use the tagset alloc/free helpers")
Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Hugh Dickins <hughd@google.com>
Link: https://lore.kernel.org/r/20221225103234.226794-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-26 12:10:51 -07:00
Linus Torvalds
8395ae05cb SCSI misc on 20221222
Mostly small bug fixes and small updates.  The only things of note is
 a qla2xxx fix for crash on hotplug and timeout and the addition of a
 user exposed abstraction layer for persistent reservation error return
 handling (which necessitates the conversion of nvme.c as well as
 SCSI).
 
 Signed-off-by: James E.J. Bottomley <jejb@linux.ibm.com>
 -----BEGIN PGP SIGNATURE-----
 
 iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCY6SZISYcamFtZXMuYm90
 dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishQkdAP9Juri0
 ihkyA9tVx1ZslVOp8V8mWK3P2VROA4ArvcMRVwD/Qxf2REP8Fx2GIgC0sNaRedg3
 +ncveg3EpZ1n/NXXeDw=
 =q+XO
 -----END PGP SIGNATURE-----

Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull more SCSI updates from James Bottomley:
 "Mostly small bug fixes and small updates.

  The only things of note is a qla2xxx fix for crash on hotplug and
  timeout and the addition of a user exposed abstraction layer for
  persistent reservation error return handling (which necessitates the
  conversion of nvme.c as well as SCSI)"

* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  scsi: qla2xxx: Fix crash when I/O abort times out
  nvme: Convert NVMe errors to PR errors
  scsi: sd: Convert SCSI errors to PR errors
  scsi: core: Rename status_byte to sg_status_byte
  block: Add error codes for common PR failures
  scsi: sd: sd_zbc: Trace zone append emulation
  scsi: libfc: Include the correct header
2022-12-22 11:22:31 -08:00
Yanjun Zhang
3659fb5ac2 nvme: fix multipath crash caused by flush request when blktrace is enabled
The flush request initialized by blk_kick_flush has NULL bio,
and it may be dealt with nvme_end_req during io completion.
When blktrace is enabled, nvme_trace_bio_complete with multipath
activated trying to access NULL pointer bio from flush request
results in the following crash:

[ 2517.831677] BUG: kernel NULL pointer dereference, address: 000000000000001a
[ 2517.835213] #PF: supervisor read access in kernel mode
[ 2517.838724] #PF: error_code(0x0000) - not-present page
[ 2517.842222] PGD 7b2d51067 P4D 0
[ 2517.845684] Oops: 0000 [#1] SMP NOPTI
[ 2517.849125] CPU: 2 PID: 732 Comm: kworker/2:1H Kdump: loaded Tainted: G S                5.15.67-0.cl9.x86_64 #1
[ 2517.852723] Hardware name: XFUSION 2288H V6/BC13MBSBC, BIOS 1.13 07/27/2022
[ 2517.856358] Workqueue: nvme_tcp_wq nvme_tcp_io_work [nvme_tcp]
[ 2517.859993] RIP: 0010:blk_add_trace_bio_complete+0x6/0x30
[ 2517.863628] Code: 1f 44 00 00 48 8b 46 08 31 c9 ba 04 00 10 00 48 8b 80 50 03 00 00 48 8b 78 50 e9 e5 fe ff ff 0f 1f 44 00 00 41 54 49 89 f4 55 <0f> b6 7a 1a 48 89 d5 e8 3e 1c 2b 00 48 89 ee 4c 89 e7 5d 89 c1 ba
[ 2517.871269] RSP: 0018:ff7f6a008d9dbcd0 EFLAGS: 00010286
[ 2517.875081] RAX: ff3d5b4be00b1d50 RBX: 0000000002040002 RCX: ff3d5b0a270f2000
[ 2517.878966] RDX: 0000000000000000 RSI: ff3d5b0b021fb9f8 RDI: 0000000000000000
[ 2517.882849] RBP: ff3d5b0b96a6fa00 R08: 0000000000000001 R09: 0000000000000000
[ 2517.886718] R10: 000000000000000c R11: 000000000000000c R12: ff3d5b0b021fb9f8
[ 2517.890575] R13: 0000000002000000 R14: ff3d5b0b021fb1b0 R15: 0000000000000018
[ 2517.894434] FS:  0000000000000000(0000) GS:ff3d5b42bfc80000(0000) knlGS:0000000000000000
[ 2517.898299] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2517.902157] CR2: 000000000000001a CR3: 00000004f023e005 CR4: 0000000000771ee0
[ 2517.906053] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2517.909930] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 2517.913761] PKRU: 55555554
[ 2517.917558] Call Trace:
[ 2517.921294]  <TASK>
[ 2517.924982]  nvme_complete_rq+0x1c3/0x1e0 [nvme_core]
[ 2517.928715]  nvme_tcp_recv_pdu+0x4d7/0x540 [nvme_tcp]
[ 2517.932442]  nvme_tcp_recv_skb+0x4f/0x240 [nvme_tcp]
[ 2517.936137]  ? nvme_tcp_recv_pdu+0x540/0x540 [nvme_tcp]
[ 2517.939830]  tcp_read_sock+0x9c/0x260
[ 2517.943486]  nvme_tcp_try_recv+0x65/0xa0 [nvme_tcp]
[ 2517.947173]  nvme_tcp_io_work+0x64/0x90 [nvme_tcp]
[ 2517.950834]  process_one_work+0x1e8/0x390
[ 2517.954473]  worker_thread+0x53/0x3c0
[ 2517.958069]  ? process_one_work+0x390/0x390
[ 2517.961655]  kthread+0x10c/0x130
[ 2517.965211]  ? set_kthread_struct+0x40/0x40
[ 2517.968760]  ret_from_fork+0x1f/0x30
[ 2517.972285]  </TASK>

To avoid this situation, add a NULL check for req->bio before
calling trace_block_bio_complete.

Signed-off-by: Yanjun Zhang <zhangyanjun@cestc.cn>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-12-22 09:40:27 +01:00
Linus Torvalds
609d3bc623 Including fixes from bpf, netfilter and can.
Current release - regressions:
 
  - bpf: synchronize dispatcher update with bpf_dispatcher_xdp_func
 
  - rxrpc:
   - fix security setting propagation
   - fix null-deref in rxrpc_unuse_local()
   - fix switched parameters in peer tracing
 
 Current release - new code bugs:
 
  - rxrpc:
    - fix I/O thread startup getting skipped
    - fix locking issues in rxrpc_put_peer_locked()
    - fix I/O thread stop
    - fix uninitialised variable in rxperf server
    - fix the return value of rxrpc_new_incoming_call()
 
  - microchip: vcap: fix initialization of value and mask
 
  - nfp: fix unaligned io read of capabilities word
 
 Previous releases - regressions:
 
  - stop in-kernel socket users from corrupting socket's task_frag
 
  - stream: purge sk_error_queue in sk_stream_kill_queues()
 
  - openvswitch: fix flow lookup to use unmasked key
 
  - dsa: mv88e6xxx: avoid reg_lock deadlock in mv88e6xxx_setup_port()
 
  - devlink:
    - hold region lock when flushing snapshots
    - protect devlink dump by the instance lock
 
 Previous releases - always broken:
 
  - bpf:
    - prevent leak of lsm program after failed attach
    - resolve fext program type when checking map compatibility
 
  - skbuff: account for tail adjustment during pull operations
 
  - macsec: fix net device access prior to holding a lock
 
  - bonding: switch back when high prio link up
 
  - netfilter: flowtable: really fix NAT IPv6 offload
 
  - enetc: avoid buffer leaks on xdp_do_redirect() failure
 
  - unix: fix race in SOCK_SEQPACKET's unix_dgram_sendmsg()
 
  - dsa: microchip: remove IRQF_TRIGGER_FALLING in request_threaded_irq
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmOiGa4ACgkQMUZtbf5S
 IrvetBAAg/AjgG51gboLsuGjgRSwAi5T6ijgVR+pW+kMuoOdaamOF+h/zC1ox/H9
 QrWvTBipy+EqSD8bM4Xz0FNgidch8X4iWYhKGZuBht/4NP5FOzPUG2mNlUy5ANGq
 QZcCw6CUsir8HTb+IJpFEIq0JMwzKCm3WyAkYjEj4iuft0Y93cAgjkMVwoX0RERO
 o/pslC5dsozCLJxEglpw1aJq7aoroNuRSGSXl95nv8fU3UxmUXajnA3HNscXImdV
 6uqSIuyPIaGocpCBPRKUQd0sctkTY4cm8wmxxMCDVsBRVusoaq5eg1VRvxJm9Rxj
 gvDvHvfhnEuSigFF5A+paBp4c+i3C8g/UTBJTtptdAC+Y2tt4UT3Q5aaazYUOAqd
 W4TSJ3bk5zhkhpRF9clb0fNQaM1HOT4rkDEEGTfVN62dtHfPKpNwYufQKaYHdVj1
 RJ3ooH6c7TMVaRs6ZgEWNYToKZj94SIfPhfEhuqWXdNMDBkUMp2BXFFOp9fZDWju
 PsMQrRD7n6+XXpNvScYtnJDORqfIL9yHGZE9kxZA5QSDl9cnPA3SUbNruQPlXHrl
 w0yQlYuG3gcciua4dXaLfz1iN4rPdenuYhVBHhztEwDKl+b61CVQYlOHGkXPVURp
 oft74qCCFbva+Hf/7jENQotjT1tLfxAGdUARuFeDBueJgDRAPsw=
 =goV5
 -----END PGP SIGNATURE-----

Merge tag 'net-6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from bpf, netfilter and can.

  Current release - regressions:

   - bpf: synchronize dispatcher update with bpf_dispatcher_xdp_func

   - rxrpc:
      - fix security setting propagation
      - fix null-deref in rxrpc_unuse_local()
      - fix switched parameters in peer tracing

  Current release - new code bugs:

   - rxrpc:
      - fix I/O thread startup getting skipped
      - fix locking issues in rxrpc_put_peer_locked()
      - fix I/O thread stop
      - fix uninitialised variable in rxperf server
      - fix the return value of rxrpc_new_incoming_call()

   - microchip: vcap: fix initialization of value and mask

   - nfp: fix unaligned io read of capabilities word

  Previous releases - regressions:

   - stop in-kernel socket users from corrupting socket's task_frag

   - stream: purge sk_error_queue in sk_stream_kill_queues()

   - openvswitch: fix flow lookup to use unmasked key

   - dsa: mv88e6xxx: avoid reg_lock deadlock in mv88e6xxx_setup_port()

   - devlink:
      - hold region lock when flushing snapshots
      - protect devlink dump by the instance lock

  Previous releases - always broken:

   - bpf:
      - prevent leak of lsm program after failed attach
      - resolve fext program type when checking map compatibility

   - skbuff: account for tail adjustment during pull operations

   - macsec: fix net device access prior to holding a lock

   - bonding: switch back when high prio link up

   - netfilter: flowtable: really fix NAT IPv6 offload

   - enetc: avoid buffer leaks on xdp_do_redirect() failure

   - unix: fix race in SOCK_SEQPACKET's unix_dgram_sendmsg()

   - dsa: microchip: remove IRQF_TRIGGER_FALLING in
     request_threaded_irq"

* tag 'net-6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (64 commits)
  net: fec: check the return value of build_skb()
  net: simplify sk_page_frag
  Treewide: Stop corrupting socket's task_frag
  net: Introduce sk_use_task_frag in struct sock.
  mctp: Remove device type check at unregister
  net: dsa: microchip: remove IRQF_TRIGGER_FALLING in request_threaded_irq
  can: kvaser_usb: hydra: help gcc-13 to figure out cmd_len
  can: flexcan: avoid unbalanced pm_runtime_enable warning
  Documentation: devlink: add missing toc entry for etas_es58x devlink doc
  mctp: serial: Fix starting value for frame check sequence
  nfp: fix unaligned io read of capabilities word
  net: stream: purge sk_error_queue in sk_stream_kill_queues()
  myri10ge: Fix an error handling path in myri10ge_probe()
  net: microchip: vcap: Fix initialization of value and mask
  rxrpc: Fix the return value of rxrpc_new_incoming_call()
  rxrpc: rxperf: Fix uninitialised variable
  rxrpc: Fix I/O thread stop
  rxrpc: Fix switched parameters in peer tracing
  rxrpc: Fix locking issues in rxrpc_put_peer_locked()
  rxrpc: Fix I/O thread startup getting skipped
  ...
2022-12-21 08:41:32 -08:00
Keith Busch
841734234a nvme-pci: fix page size checks
The size allocated out of the dma pool is at most NVME_CTRL_PAGE_SIZE,
which may be smaller than the PAGE_SIZE.

Fixes: c61b82c7b7 ("nvme-pci: fix PRP pool size")
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-12-21 09:17:30 +01:00
Keith Busch
c89a529e82 nvme-pci: fix mempool alloc size
Convert the max size to bytes to match the units of the divisor that
calculates the worst-case number of PRP entries.

The result is used to determine how many PRP Lists are required. The
code was previously rounding this to 1 list, but we can require 2 in the
worst case. In that scenario, the driver would corrupt memory beyond the
size provided by the mempool.

While unlikely to occur (you'd need a 4MB in exactly 127 phys segments
on a queue that doesn't support SGLs), this memory corruption has been
observed by kfence.

Cc: Jens Axboe <axboe@kernel.dk>
Fixes: 943e942e62 ("nvme-pci: limit max IO size and segments to avoid high order allocations")
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-12-21 09:17:25 +01:00
Klaus Jensen
b5f96cb719 nvme-pci: fix doorbell buffer value endianness
When using shadow doorbells, the event index and the doorbell values are
written to host memory. Prior to this patch, the values written would
erroneously be written in host endianness. This causes trouble on
big-endian platforms. Fix this by adding missing endian conversions.

This issue was noticed by Guenter while testing various big-endian
platforms under QEMU[1]. A similar fix required for hw/nvme in QEMU is
up for review as well[2].

  [1]: https://lore.kernel.org/qemu-devel/20221209110022.GA3396194@roeck-us.net/
  [2]: https://lore.kernel.org/qemu-devel/20221212114409.34972-4-its@irrelevant.dk/

Fixes: f9f38e3338 ("nvme: improve performance for virtual NVMe devices")
Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-12-21 09:05:21 +01:00
Benjamin Coddington
98123866fc Treewide: Stop corrupting socket's task_frag
Since moving to memalloc_nofs_save/restore, SUNRPC has stopped setting the
GFP_NOIO flag on sk_allocation which the networking system uses to decide
when it is safe to use current->task_frag.  The results of this are
unexpected corruption in task_frag when SUNRPC is involved in memory
reclaim.

The corruption can be seen in crashes, but the root cause is often
difficult to ascertain as a crashing machine's stack trace will have no
evidence of being near NFS or SUNRPC code.  I believe this problem to
be much more pervasive than reports to the community may indicate.

Fix this by having kernel users of sockets that may corrupt task_frag due
to reclaim set sk_use_task_frag = false.  Preemptively correcting this
situation for users that still set sk_allocation allows them to convert to
memalloc_nofs_save/restore without the same unexpected corruptions that are
sure to follow, unlikely to show up in testing, and difficult to bisect.

CC: Philipp Reisner <philipp.reisner@linbit.com>
CC: Lars Ellenberg <lars.ellenberg@linbit.com>
CC: "Christoph Böhmwalder" <christoph.boehmwalder@linbit.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: Josef Bacik <josef@toxicpanda.com>
CC: Keith Busch <kbusch@kernel.org>
CC: Christoph Hellwig <hch@lst.de>
CC: Sagi Grimberg <sagi@grimberg.me>
CC: Lee Duncan <lduncan@suse.com>
CC: Chris Leech <cleech@redhat.com>
CC: Mike Christie <michael.christie@oracle.com>
CC: "James E.J. Bottomley" <jejb@linux.ibm.com>
CC: "Martin K. Petersen" <martin.petersen@oracle.com>
CC: Valentina Manea <valentina.manea.m@gmail.com>
CC: Shuah Khan <shuah@kernel.org>
CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
CC: David Howells <dhowells@redhat.com>
CC: Marc Dionne <marc.dionne@auristor.com>
CC: Steve French <sfrench@samba.org>
CC: Christine Caulfield <ccaulfie@redhat.com>
CC: David Teigland <teigland@redhat.com>
CC: Mark Fasheh <mark@fasheh.com>
CC: Joel Becker <jlbec@evilplan.org>
CC: Joseph Qi <joseph.qi@linux.alibaba.com>
CC: Eric Van Hensbergen <ericvh@gmail.com>
CC: Latchesar Ionkov <lucho@ionkov.net>
CC: Dominique Martinet <asmadeus@codewreck.org>
CC: Ilya Dryomov <idryomov@gmail.com>
CC: Xiubo Li <xiubli@redhat.com>
CC: Chuck Lever <chuck.lever@oracle.com>
CC: Jeff Layton <jlayton@kernel.org>
CC: Trond Myklebust <trond.myklebust@hammerspace.com>
CC: Anna Schumaker <anna@kernel.org>
CC: Steffen Klassert <steffen.klassert@secunet.com>
CC: Herbert Xu <herbert@gondor.apana.org.au>

Suggested-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-19 17:28:49 -08:00
Linus Torvalds
71a7507afb Driver Core changes for 6.2-rc1
Here is the set of driver core and kernfs changes for 6.2-rc1.
 
 The "big" change in here is the addition of a new macro,
 container_of_const() that will preserve the "const-ness" of a pointer
 passed into it.
 
 The "problem" of the current container_of() macro is that if you pass in
 a "const *", out of it can comes a non-const pointer unless you
 specifically ask for it.  For many usages, we want to preserve the
 "const" attribute by using the same call.  For a specific example, this
 series changes the kobj_to_dev() macro to use it, allowing it to be used
 no matter what the const value is.  This prevents every subsystem from
 having to declare 2 different individual macros (i.e.
 kobj_const_to_dev() and kobj_to_dev()) and having the compiler enforce
 the const value at build time, which having 2 macros would not do
 either.
 
 The driver for all of this have been discussions with the Rust kernel
 developers as to how to properly mark driver core, and kobject, objects
 as being "non-mutable".  The changes to the kobject and driver core in
 this pull request are the result of that, as there are lots of paths
 where kobjects and device pointers are not modified at all, so marking
 them as "const" allows the compiler to enforce this.
 
 So, a nice side affect of the Rust development effort has been already
 to clean up the driver core code to be more obvious about object rules.
 
 All of this has been bike-shedded in quite a lot of detail on lkml with
 different names and implementations resulting in the tiny version we
 have in here, much better than my original proposal.  Lots of subsystem
 maintainers have acked the changes as well.
 
 Other than this change, included in here are smaller stuff like:
   - kernfs fixes and updates to handle lock contention better
   - vmlinux.lds.h fixes and updates
   - sysfs and debugfs documentation updates
   - device property updates
 
 All of these have been in the linux-next tree for quite a while with no
 problems, OTHER than some merge issues with other trees that should be
 obvious when you hit them (block tree deletes a driver that this tree
 modifies, iommufd tree modifies code that this tree also touches).  If
 there are merge problems with these trees, please let me know.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCY5wz3A8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+yks0ACeKYUlVgCsER8eYW+x18szFa2QTXgAn2h/VhZe
 1Fp53boFaQkGBjl8mGF8
 =v+FB
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core updates from Greg KH:
 "Here is the set of driver core and kernfs changes for 6.2-rc1.

  The "big" change in here is the addition of a new macro,
  container_of_const() that will preserve the "const-ness" of a pointer
  passed into it.

  The "problem" of the current container_of() macro is that if you pass
  in a "const *", out of it can comes a non-const pointer unless you
  specifically ask for it. For many usages, we want to preserve the
  "const" attribute by using the same call. For a specific example, this
  series changes the kobj_to_dev() macro to use it, allowing it to be
  used no matter what the const value is. This prevents every subsystem
  from having to declare 2 different individual macros (i.e.
  kobj_const_to_dev() and kobj_to_dev()) and having the compiler enforce
  the const value at build time, which having 2 macros would not do
  either.

  The driver for all of this have been discussions with the Rust kernel
  developers as to how to properly mark driver core, and kobject,
  objects as being "non-mutable". The changes to the kobject and driver
  core in this pull request are the result of that, as there are lots of
  paths where kobjects and device pointers are not modified at all, so
  marking them as "const" allows the compiler to enforce this.

  So, a nice side affect of the Rust development effort has been already
  to clean up the driver core code to be more obvious about object
  rules.

  All of this has been bike-shedded in quite a lot of detail on lkml
  with different names and implementations resulting in the tiny version
  we have in here, much better than my original proposal. Lots of
  subsystem maintainers have acked the changes as well.

  Other than this change, included in here are smaller stuff like:

   - kernfs fixes and updates to handle lock contention better

   - vmlinux.lds.h fixes and updates

   - sysfs and debugfs documentation updates

   - device property updates

  All of these have been in the linux-next tree for quite a while with
  no problems"

* tag 'driver-core-6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (58 commits)
  device property: Fix documentation for fwnode_get_next_parent()
  firmware_loader: fix up to_fw_sysfs() to preserve const
  usb.h: take advantage of container_of_const()
  device.h: move kobj_to_dev() to use container_of_const()
  container_of: add container_of_const() that preserves const-ness of the pointer
  driver core: fix up missed drivers/s390/char/hmcdrv_dev.c class.devnode() conversion.
  driver core: fix up missed scsi/cxlflash class.devnode() conversion.
  driver core: fix up some missing class.devnode() conversions.
  driver core: make struct class.devnode() take a const *
  driver core: make struct class.dev_uevent() take a const *
  cacheinfo: Remove of_node_put() for fw_token
  device property: Add a blank line in Kconfig of tests
  device property: Rename goto label to be more precise
  device property: Move PROPERTY_ENTRY_BOOL() a bit down
  device property: Get rid of __PROPERTY_ENTRY_ARRAY_EL*SIZE*()
  kernfs: fix all kernel-doc warnings and multiple typos
  driver core: pass a const * into of_device_uevent()
  kobject: kset_uevent_ops: make name() callback take a const *
  kobject: kset_uevent_ops: make filter() callback take a const *
  kobject: make kobject_namespace take a const *
  ...
2022-12-16 03:54:54 -08:00
Linus Torvalds
ce8a79d560 for-6.2/block-2022-12-08
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmOScsgQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpi5ID/9pLXFYOq1+uDjU0KO/MdjMjK8Ukr34lCnk
 WkajRLheE8JBKOFDE54XJk56sQSZHX9bTWqziar0h1fioh7FlQR/tVvzsERCm2M9
 2y9THJNJygC68wgybStyiKlshFjl7TD7Kv5N9Y3xP3mkQygT+D6o8fXZk5xQbYyH
 YdFSoq4rJVHxRL03yzQiReGGIYdOUEQQh8l1FiLwLlKa3lXAey1KuxWIzksVN0KK
 aZB4QhiBpOiPgDHUVisq2XtyQjpZ2byoCImPzgrcqk9Jo4esvm/e6esrg4xlsvII
 LKFFkTmbVqjUZtFjqakFHmfuzVor4nU5f+xb90ZHExuuODYckkxWp5rWhf9QwqqI
 0ik6WYgI1/5vnHnX8f2DYzOFQf9qa/rLgg0CshyUODlD6RfHa9vntqYvlIFkmOBd
 Q7KblIoK8YTzUS1M+v7X8JQ7gDR2KwygH37Da2KJS+vgvfIb8kJGr1ZORuhJuJJ7
 Bl69gaNkHTHrqufp7UI64YXfueeuNu2J9z3zwzGoxeaFaofF/phDn0/2gCQE1fQI
 XBhsMw+ETqI6B2SPHMnzYDu2DM1S8ZTOYQlaD4G3uqgWnAM1tG707395uAy5yu4n
 D5azU1fVG4UocoNIyPujpaoSRs2zWZycEFEeUQkhyDDww/j4hlHi6H33eOnk0zsr
 wxzFGfvHfw==
 =k/vv
 -----END PGP SIGNATURE-----

Merge tag 'for-6.2/block-2022-12-08' of git://git.kernel.dk/linux

Pull block updates from Jens Axboe:

 - NVMe pull requests via Christoph:
      - Support some passthrough commands without CAP_SYS_ADMIN (Kanchan
        Joshi)
      - Refactor PCIe probing and reset (Christoph Hellwig)
      - Various fabrics authentication fixes and improvements (Sagi
        Grimberg)
      - Avoid fallback to sequential scan due to transient issues (Uday
        Shankar)
      - Implement support for the DEAC bit in Write Zeroes (Christoph
        Hellwig)
      - Allow overriding the IEEE OUI and firmware revision in configfs
        for nvmet (Aleksandr Miloserdov)
      - Force reconnect when number of queue changes in nvmet (Daniel
        Wagner)
      - Minor fixes and improvements (Uros Bizjak, Joel Granados, Sagi
        Grimberg, Christoph Hellwig, Christophe JAILLET)
      - Fix and cleanup nvme-fc req allocation (Chaitanya Kulkarni)
      - Use the common tagset helpers in nvme-pci driver (Christoph
        Hellwig)
      - Cleanup the nvme-pci removal path (Christoph Hellwig)
      - Use kstrtobool() instead of strtobool (Christophe JAILLET)
      - Allow unprivileged passthrough of Identify Controller (Joel
        Granados)
      - Support io stats on the mpath device (Sagi Grimberg)
      - Minor nvmet cleanup (Sagi Grimberg)

 - MD pull requests via Song:
      - Code cleanups (Christoph)
      - Various fixes

 - Floppy pull request from Denis:
      - Fix a memory leak in the init error path (Yuan)

 - Series fixing some batch wakeup issues with sbitmap (Gabriel)

 - Removal of the pktcdvd driver that was deprecated more than 5 years
   ago, and subsequent removal of the devnode callback in struct
   block_device_operations as no users are now left (Greg)

 - Fix for partition read on an exclusively opened bdev (Jan)

 - Series of elevator API cleanups (Jinlong, Christoph)

 - Series of fixes and cleanups for blk-iocost (Kemeng)

 - Series of fixes and cleanups for blk-throttle (Kemeng)

 - Series adding concurrent support for sync queues in BFQ (Yu)

 - Series bringing drbd a bit closer to the out-of-tree maintained
   version (Christian, Joel, Lars, Philipp)

 - Misc drbd fixes (Wang)

 - blk-wbt fixes and tweaks for enable/disable (Yu)

 - Fixes for mq-deadline for zoned devices (Damien)

 - Add support for read-only and offline zones for null_blk
   (Shin'ichiro)

 - Series fixing the delayed holder tracking, as used by DM (Yu,
   Christoph)

 - Series enabling bio alloc caching for IRQ based IO (Pavel)

 - Series enabling userspace peer-to-peer DMA (Logan)

 - BFQ waker fixes (Khazhismel)

 - Series fixing elevator refcount issues (Christoph, Jinlong)

 - Series cleaning up references around queue destruction (Christoph)

 - Series doing quiesce by tagset, enabling cleanups in drivers
   (Christoph, Chao)

 - Series untangling the queue kobject and queue references (Christoph)

 - Misc fixes and cleanups (Bart, David, Dawei, Jinlong, Kemeng, Ye,
   Yang, Waiman, Shin'ichiro, Randy, Pankaj, Christoph)

* tag 'for-6.2/block-2022-12-08' of git://git.kernel.dk/linux: (247 commits)
  blktrace: Fix output non-blktrace event when blk_classic option enabled
  block: sed-opal: Don't include <linux/kernel.h>
  sed-opal: allow using IOC_OPAL_SAVE for locking too
  blk-cgroup: Fix typo in comment
  block: remove bio_set_op_attrs
  nvmet: don't open-code NVME_NS_ATTR_RO enumeration
  nvme-pci: use the tagset alloc/free helpers
  nvme: add the Apple shared tag workaround to nvme_alloc_io_tag_set
  nvme: only set reserved_tags in nvme_alloc_io_tag_set for fabrics controllers
  nvme: consolidate setting the tagset flags
  nvme: pass nr_maps explicitly to nvme_alloc_io_tag_set
  block: bio_copy_data_iter
  nvme-pci: split out a nvme_pci_ctrl_is_dead helper
  nvme-pci: return early on ctrl state mismatch in nvme_reset_work
  nvme-pci: rename nvme_disable_io_queues
  nvme-pci: cleanup nvme_suspend_queue
  nvme-pci: remove nvme_pci_disable
  nvme-pci: remove nvme_disable_admin_queue
  nvme: merge nvme_shutdown_ctrl into nvme_disable_ctrl
  nvme: use nvme_wait_ready in nvme_shutdown_ctrl
  ...
2022-12-13 10:43:59 -08:00
Linus Torvalds
75f4d9af8b iov_iter work; most of that is about getting rid of
direction misannotations and (hopefully) preventing
 more of the same for the future.
 
 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
 -----BEGIN PGP SIGNATURE-----
 
 iHQEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCY5ZzQAAKCRBZ7Krx/gZQ
 65RZAP4nTkvOn0NZLVFkuGOx8pgJelXAvrteyAuecVL8V6CR4AD40qCVY51PJp8N
 MzwiRTeqnGDxTTF7mgd//IB6hoatAA==
 =bcvF
 -----END PGP SIGNATURE-----

Merge tag 'pull-iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull iov_iter updates from Al Viro:
 "iov_iter work; most of that is about getting rid of direction
  misannotations and (hopefully) preventing more of the same for the
  future"

* tag 'pull-iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  use less confusing names for iov_iter direction initializers
  iov_iter: saner checks for attempt to copy to/from iterator
  [xen] fix "direction" argument of iov_iter_kvec()
  [vhost] fix 'direction' argument of iov_iter_{init,bvec}()
  [target] fix iov_iter_bvec() "direction" argument
  [s390] memcpy_real(): WRITE is "data source", not destination...
  [s390] zcore: WRITE is "data source", not destination...
  [infiniband] READ is "data destination", not source...
  [fsi] WRITE is "data source", not destination...
  [s390] copy_oldmem_kernel() - WRITE is "data source", not destination
  csum_and_copy_to_iter(): handle ITER_DISCARD
  get rid of unlikely() on page_copy_sane() calls
2022-12-12 18:29:54 -08:00
Linus Torvalds
859c73d439 block-6.1-2022-12-08
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmOSTGAQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpkW4D/49G+WuEFbBE4kM2Jk56tDgdNH611jsetvk
 k5MmaK62FkAGBfoNl6pRpiqpV/MyJyS//SytyJpsv1Fj7InNkpEzbI7cxvbflm4t
 D4/7Pg9VZgtNwrtq2M2t5NeM28scFFjQq3buzYGM6iKrwfcsLagkKiVU7cx0kTEl
 7hzlG2t/FDwBLWCmDSRHVKMB3JJa5hIxpnZklHBmNBpmNh9rl4F2hCwpmi5x+0t+
 qyti+1PRSknEQKspCMNcZvZwVmz0G3QZh2xYWNPkL0fxdQ7hpM65SV5DUs3SLjAr
 FUt9UsvgTdeZ8uhfS1Ft6KgjM9x1hiZx0UYwASQxRdz7fhoG7ygRK9KY5r5v1cbr
 lcUdwl5NJkPllDm5CZNCXMQYJlYMuA7J1VAMG+IZ/Iu5XiEFaEmOEzNrmmW0NZ57
 5Z+2isfo24GGhRk78ryjuqXuwMhM3+DaYeS9+9/h84JcldUtrglOlG6CzX0sHhch
 8xVCN3JVYc9/uWmIwb6QSIEZKNlsqkbiv5Gru1uu2pzX8MtuyC21rIIh8AUOSFl+
 740prC6//wUxDcOHrA0aphubQADImi9RF5J5+40lE1NxSnAz1nMisZ1G7ywIwb+j
 WjFbzW5p7ddO3DZFV+FENZ4QKFTDsR+3/tbbNdQpSmGEKk/KoT1jZyOVnoHsBSkd
 Q7B23nEe8w==
 =JkGh
 -----END PGP SIGNATURE-----

Merge tag 'block-6.1-2022-12-08' of git://git.kernel.dk/linux

Pull block fix from Jens Axboe:
 "A small fix for initializing the NVMe quirks before initializing the
  subsystem"

* tag 'block-6.1-2022-12-08' of git://git.kernel.dk/linux:
  nvme initialize core quirks before calling nvme_init_subsystem
2022-12-08 15:53:39 -08:00
Sagi Grimberg
19b00e0069 nvmet: don't open-code NVME_NS_ATTR_RO enumeration
It is already there, just go ahead and use it.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-12-07 15:03:09 +01:00
Christoph Hellwig
0da7feaa59 nvme-pci: use the tagset alloc/free helpers
Use the common helpers to allocate and free the tagsets.  To make this
work the generic nvme_ctrl now needs to be stored in the hctx private
data instead of the nvme_dev.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2022-12-07 15:02:31 +01:00
Christoph Hellwig
93b24f579c nvme: add the Apple shared tag workaround to nvme_alloc_io_tag_set
Add the apple shared tag workaround to nvme_alloc_io_tag_set to prepare
for using that helper in the PCIe driver.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2022-12-07 15:02:27 +01:00
Christoph Hellwig
b794d1c2ad nvme: only set reserved_tags in nvme_alloc_io_tag_set for fabrics controllers
The reserved_tags are only needed for fabrics controllers.  Right now only
fabrics drivers call this helper, so this is harmless, but we'll use it
in the PCIe driver soon.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2022-12-07 15:02:24 +01:00
Christoph Hellwig
db45e1a5dd nvme: consolidate setting the tagset flags
All nvme transports should be using the same flags for their tagsets,
with the exception for the blocking flag that should only be set for
transports that can block in ->queue_rq.

Add a NVME_F_BLOCKING flag to nvme_ctrl_ops to control the blocking
behavior and lift setting the flags into nvme_alloc_{admin,io}_tag_set.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2022-12-07 15:02:20 +01:00
Christoph Hellwig
dcef77274a nvme: pass nr_maps explicitly to nvme_alloc_io_tag_set
Don't look at ctrl->ops as only RDMA and TCP actually support multiple
maps.

Fixes: 6dfba1c09c ("nvme-fc: use the tagset alloc/free helpers")
Fixes: ceee1953f9 ("nvme-loop: use the tagset alloc/free helpers")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2022-12-07 15:02:15 +01:00
Christoph Hellwig
68e81eba67 nvme-pci: split out a nvme_pci_ctrl_is_dead helper
Clean up nvme_dev_disable by splitting the logic to detect if a
controller is dead into a separate helper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2022-12-06 14:38:19 +01:00
Christoph Hellwig
8cb9f10b71 nvme-pci: return early on ctrl state mismatch in nvme_reset_work
The only way nvme_reset_work could be called when not in resetting state
is if a reset and remove happen near the same time.  This should not
happen, but if it did we don't want the reset work to disable the
controller because the remove is already doing that.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2022-12-06 14:36:54 +01:00
Christoph Hellwig
7d879c90ae nvme-pci: rename nvme_disable_io_queues
This function really deletes the I/O queues, so rename it to match
the functionality.  Also move the main wrapper right next to the
actual underlying implementation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2022-12-06 14:36:54 +01:00
Christoph Hellwig
10981f23a1 nvme-pci: cleanup nvme_suspend_queue
Remove the unused returne value, pass a dev + qid instead of the queue
as that is better for the callers as well as the function itself, and
remove the entirely pointless kerneldoc comment.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2022-12-06 14:36:54 +01:00
Christoph Hellwig
c80767f770 nvme-pci: remove nvme_pci_disable
nvme_pci_disable has a single caller, fold it into that.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Eric Curtin <ecurtin@redhat.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2022-12-06 14:36:54 +01:00
Christoph Hellwig
47d42d229a nvme-pci: remove nvme_disable_admin_queue
nvme_disable_admin_queue has only a single caller, and just calls two
other funtions, so remove it to clean up the remove path a little more.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Eric Curtin <ecurtin@redhat.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2022-12-06 14:36:54 +01:00
Christoph Hellwig
285b6e9b57 nvme: merge nvme_shutdown_ctrl into nvme_disable_ctrl
Many of the callers decide which one to use based on a bool argument and
there is at least some code to be shared, so merge these two.  Also
move a comment specific to a single callsite to that callsite.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hector Martin <marcan@marcan.st>
2022-12-06 14:36:54 +01:00
Christoph Hellwig
e6d275de2e nvme: use nvme_wait_ready in nvme_shutdown_ctrl
Refactor the code to wait for CSTS state changes so that it can be reused
by nvme_shutdown_ctrl.  This reduces the delay between each iteration
that checks CSTS from 100ms in the shutdown code to the 1 to 2ms range
done during enable, matching the changes from commit 3e98c2443f that
were only applied to the enable/disable path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Pankaj Raghav <p.raghav@samsung.com>
2022-12-06 14:36:53 +01:00
Christoph Hellwig
c76b8308e4 nvme-apple: fix controller shutdown in apple_nvme_disable
nvme_shutdown_ctrl already shuts the controller down, there is no
need to also call nvme_disable_ctrl for the shutdown case.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Eric Curtin <ecurtin@redhat.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hector Martin <marcan@marcan.st>
2022-12-06 14:36:51 +01:00
Chaitanya Kulkarni
b296958557 nvme-fc: move common code into helper
Add a helper to move the duplicate code for error message
from nvme_fc_rcv_ls_req() to nvme_fc_rcv_ls_req_err_msg().

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-12-06 09:27:24 +01:00
Chaitanya Kulkarni
6c90294d72 nvme-fc: avoid null pointer dereference
Before using dynamically allcoated variable lsop in the
nvme_fc_rcv_ls_req(), add a check for NULL and error out early.

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-12-06 09:27:24 +01:00
Joel Granados
ea43fceea4 nvme: allow unprivileged passthrough of Identify Controller
Add unprivileged passthrough of the I/O Command Set Independent and I/O
Command Set Specific Identify Controller sub-command.

This will allow access to attributes (e.g. MDTS and WZSL) that are needed
to effectively form passthrough I/O to the /dev/ng* character devices.

Signed-off-by: Joel Granados <j.granados@samsung.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-12-06 09:27:03 +01:00
Sagi Grimberg
d4d957b53d nvme-multipath: support io stats on the mpath device
Our mpath stack device is just a shim that selects a bottom namespace
and submits the bio to it without any fancy splitting. This also means
that we don't clone the bio or have any context to the bio beyond
submission. However it really sucks that we don't see the mpath device
io stats.

Given that the mpath device can't do that without adding some context
to it, we let the bottom device do it on its behalf (somewhat similar
to the approach taken in nvme_trace_bio_complete).

When the IO starts, we account the request for multipath IO stats using
REQ_NVME_MPATH_IO_STATS nvme_request flag to avoid queue io stats disable
in the middle of the request.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
2022-12-06 09:17:01 +01:00
Sagi Grimberg
6887fc6495 nvme: introduce nvme_start_request
In preparation for nvme-multipath IO stats accounting, we want the
accounting to happen in a centralized place. The request completion
is already centralized, but we need a common helper to request I/O
start.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
2022-12-06 09:16:57 +01:00
Christophe JAILLET
99722c8aa8 nvme: use kstrtobool() instead of strtobool()
strtobool() is the same as kstrtobool().
However, the latter is more used within the kernel.

In order to remove strtobool() and slightly simplify kstrtox.h, switch to
the other function name.

While at it, include the corresponding header file (<linux/kstrtox.h>)

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-12-06 09:16:56 +01:00
Christoph Hellwig
ba0718a6d6 nvme: don't call blk_mq_{,un}quiesce_tagset when ctrl->tagset is NULL
The NVMe drivers support a mode where no tagset is allocated for the I/O
queues and only the admin queue is usable.  In that case ctrl->tagset is
NULL and we must not call the block per-tagset quiesce helpers that
dereference it.

Fixes: 98d81f0df7 ("nvme: use blk_mq_[un]quiesce_tagset")
Reported-by: Gerd Bayer <gbayer@linux.ibm.com>
Reported-by: Chao Leng <lengchao@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chao Leng <lengchao@huawei.com>
2022-12-06 09:16:56 +01:00
Pankaj Raghav
6f2d71524b nvme initialize core quirks before calling nvme_init_subsystem
A device might have a core quirk for NVME_QUIRK_IGNORE_DEV_SUBNQN
(such as Samsung X5) but it would still give a:

    "missing or invalid SUBNQN field"

warning as core quirks are filled after calling nvme_init_subnqn.  Fill
ctrl->quirks from struct core_quirks before calling nvme_init_subsystem
to fix this.

Tested on a Samsung X5.

Fixes: ab9e00cc72 ("nvme: track subsystems")
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-12-06 09:05:59 +01:00
Linus Torvalds
97ee9d1c16 block-6.1-2022-12-02
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmOKM1MQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgprErD/4vyIhYg4ZM9HOWNjpuT8oZCG6yRZ4gLhz0
 GT7VRcb8GKEkKUMmeazaxocWbC3fc+yvj49Oan1Uj7/teHTmJDM0pF/fMpJdkJrF
 z+PAy2++MGF++QNBq+wrDEIDsJ4QvRxDDJe9N+KDTtX6UsoBFYxJhem4JzZpM4BI
 4GY8jYiKlx42WM58stZ0DXOucG1DsKaOQKYRQGjtKYvA0dTn7dj9btY+n6rGerEX
 4265huzW5iY+MZWc5KLXGSr0wIJqAiKMoecN03JSBHONFVB4cjMQpZuQfSChqkUS
 3fhVmFOZnYMzMIZgiwhFxuIP/QzLjctdibwU9JusqChYP9Mx7HQ2+gs7H7i5PSdS
 9m64g2u+GuRjbgIeeGPVMPnBR3UG2GE8BDRfFBBCtbdmHXIKoolXdKvG9enRjXit
 e4wjGQDHk6x9iV6LITH1Jn82kzk6TTuBkdSBJN6u8KASeOCoPwWuhgyRXo6+jh5D
 1wd2mYxtM1UB2mZilPpflDSpzZCrp/CMjbLVPIV0aTxmmeEJN+Ao2PnduNjEBxoh
 kYwlScoz9DPvMf59UU45MLc9/vYchL14VoPOl59osLlQrWf9vPMATlU1CaRgQSVa
 apBNAMzWFTMGxXCtIsUoClNX7uuHrqrMEjBbhWuWp4DSOVQoJORrU5ymX9M92MYP
 f0incJSEZQ==
 =Gdkx
 -----END PGP SIGNATURE-----

Merge tag 'block-6.1-2022-12-02' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:
 "Just a small NVMe merge for this week, fixing protection of the name
  space list, and a missing clear of a reserved field when unused"

* tag 'block-6.1-2022-12-02' of git://git.kernel.dk/linux:
  nvme: fix SRCU protection of nvme_ns_head list
  nvme-pci: clear the prp2 field when not used
2022-12-02 16:27:15 -08:00
Mike Christie
7fb42780d0 nvme: Convert NVMe errors to PR errors
This converts the NVMe errors we commonly see during PR handling to PR_STS
errors or -Exyz errors. pr_ops callers can then handle SCSI and NVMe errors
without knowing the device types.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Link: https://lore.kernel.org/r/20221122032603.32766-5-michael.christie@oracle.com
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2022-12-01 03:22:23 +00:00
Caleb Sander
899d2a05dc nvme: fix SRCU protection of nvme_ns_head list
Walking the nvme_ns_head siblings list is protected by the head's srcu
in nvme_ns_head_submit_bio() but not nvme_mpath_revalidate_paths().
Removing namespaces from the list also fails to synchronize the srcu.
Concurrent scan work can therefore cause use-after-frees.

Hold the head's srcu lock in nvme_mpath_revalidate_paths() and
synchronize with the srcu, not the global RCU, in nvme_ns_remove().

Observed the following panic when making NVMe/RDMA connections
with native multipath on the Rocky Linux 8.6 kernel
(it seems the upstream kernel has the same race condition).
Disassembly shows the faulting instruction is cmp 0x50(%rdx),%rcx;
computing capacity != get_capacity(ns->disk).
Address 0x50 is dereferenced because ns->disk is NULL.
The NULL disk appears to be the result of concurrent scan work
freeing the namespace (note the log line in the middle of the panic).

[37314.206036] BUG: unable to handle kernel NULL pointer dereference at 0000000000000050
[37314.206036] nvme0n3: detected capacity change from 0 to 11811160064
[37314.299753] PGD 0 P4D 0
[37314.299756] Oops: 0000 [#1] SMP PTI
[37314.299759] CPU: 29 PID: 322046 Comm: kworker/u98:3 Kdump: loaded Tainted: G        W      X --------- -  - 4.18.0-372.32.1.el8test86.x86_64 #1
[37314.299762] Hardware name: Dell Inc. PowerEdge R720/0JP31P, BIOS 2.7.0 05/23/2018
[37314.299763] Workqueue: nvme-wq nvme_scan_work [nvme_core]
[37314.299783] RIP: 0010:nvme_mpath_revalidate_paths+0x26/0xb0 [nvme_core]
[37314.299790] Code: 1f 44 00 00 66 66 66 66 90 55 53 48 8b 5f 50 48 8b 83 c8 c9 00 00 48 8b 13 48 8b 48 50 48 39 d3 74 20 48 8d 42 d0 48 8b 50 20 <48> 3b 4a 50 74 05 f0 80 60 70 ef 48 8b 50 30 48 8d 42 d0 48 39 d3
[37315.058803] RSP: 0018:ffffabe28f913d10 EFLAGS: 00010202
[37315.121316] RAX: ffff927a077da800 RBX: ffff92991dd70000 RCX: 0000000001600000
[37315.206704] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff92991b719800
[37315.292106] RBP: ffff929a6b70c000 R08: 000000010234cd4a R09: c0000000ffff7fff
[37315.377501] R10: 0000000000000001 R11: ffffabe28f913a30 R12: 0000000000000000
[37315.462889] R13: ffff92992716600c R14: ffff929964e6e030 R15: ffff92991dd70000
[37315.548286] FS:  0000000000000000(0000) GS:ffff92b87fb80000(0000) knlGS:0000000000000000
[37315.645111] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[37315.713871] CR2: 0000000000000050 CR3: 0000002208810006 CR4: 00000000000606e0
[37315.799267] Call Trace:
[37315.828515]  nvme_update_ns_info+0x1ac/0x250 [nvme_core]
[37315.892075]  nvme_validate_or_alloc_ns+0x2ff/0xa00 [nvme_core]
[37315.961871]  ? __blk_mq_free_request+0x6b/0x90
[37316.015021]  nvme_scan_work+0x151/0x240 [nvme_core]
[37316.073371]  process_one_work+0x1a7/0x360
[37316.121318]  ? create_worker+0x1a0/0x1a0
[37316.168227]  worker_thread+0x30/0x390
[37316.212024]  ? create_worker+0x1a0/0x1a0
[37316.258939]  kthread+0x10a/0x120
[37316.297557]  ? set_kthread_struct+0x50/0x50
[37316.347590]  ret_from_fork+0x35/0x40
[37316.390360] Modules linked in: nvme_rdma nvme_tcp(X) nvme_fabrics nvme_core netconsole iscsi_tcp libiscsi_tcp dm_queue_length dm_service_time nf_conntrack_netlink br_netfilter bridge stp llc overlay nft_chain_nat ipt_MASQUERADE nf_nat xt_addrtype xt_CT nft_counter xt_state xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_comment xt_multiport nft_compat nf_tables libcrc32c nfnetlink dm_multipath tg3 rpcrdma sunrpc rdma_ucm ib_srpt ib_isert iscsi_target_mod target_core_mod ib_iser libiscsi scsi_transport_iscsi ib_umad rdma_cm ib_ipoib iw_cm ib_cm intel_rapl_msr iTCO_wdt iTCO_vendor_support dcdbas intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel ipmi_ssif kvm irqbypass crct10dif_pclmul crc32_pclmul mlx5_ib ghash_clmulni_intel ib_uverbs rapl intel_cstate intel_uncore ib_core ipmi_si joydev mei_me pcspkr ipmi_devintf mei lpc_ich wmi ipmi_msghandler acpi_power_meter ext4 mbcache jbd2 sd_mod t10_pi sg mgag200 mlx5_core drm_kms_helper syscopyarea
[37316.390419]  sysfillrect ahci sysimgblt fb_sys_fops libahci drm crc32c_intel libata mlxfw pci_hyperv_intf tls i2c_algo_bit psample dm_mirror dm_region_hash dm_log dm_mod fuse [last unloaded: nvme_core]
[37317.645908] CR2: 0000000000000050

Fixes: e7d65803e2 ("nvme-multipath: revalidate paths during rescan")
Signed-off-by: Caleb Sander <csander@purestorage.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-11-30 14:37:46 +01:00
Lei Rao
a56ea6147f nvme-pci: clear the prp2 field when not used
If the prp2 field is not filled in nvme_setup_prp_simple(), the prp2
field is garbage data. According to nvme spec, the prp2 is reserved if
the data transfer does not cross a memory page boundary, so clear it to
zero if it is not used.

Signed-off-by: Lei Rao <lei.rao@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-11-30 14:34:17 +01:00
Al Viro
de4eda9de2 use less confusing names for iov_iter direction initializers
READ/WRITE proved to be actively confusing - the meanings are
"data destination, as used with read(2)" and "data source, as
used with write(2)", but people keep interpreting those as
"we read data from it" and "we write data to it", i.e. exactly
the wrong way.

Call them ITER_DEST and ITER_SOURCE - at least that is harder
to misinterpret...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-11-25 13:01:55 -05:00
Greg Kroah-Hartman
23680f0b7d driver core: make struct class.dev_uevent() take a const *
The dev_uevent() in struct class should not be modifying the device that
is passed into it, so mark it as a const * and propagate the function
signature changes out into all relevant subsystems that use this
callback.

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Russ Weight <russell.h.weight@intel.com>
Cc: Jean Delvare <jdelvare@suse.com>
Cc: Johan Hovold <johan@kernel.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Karsten Keil <isdn@linux-pingi.de>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Wolfram Sang <wsa+renesas@sang-engineering.com>
Cc: Raed Salem <raeds@nvidia.com>
Cc: Chen Zhongjin <chenzhongjin@huawei.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Avihai Horon <avihaih@nvidia.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Colin Ian King <colin.i.king@gmail.com>
Cc: Geert Uytterhoeven <geert+renesas@glider.be>
Cc: Jakob Koschel <jakobkoschel@gmail.com>
Cc: Antoine Tenart <atenart@kernel.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Wang Yufen <wangyufen@huawei.com>
Cc: linux-block@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-media@vger.kernel.org
Cc: linux-nvme@lists.infradead.org
Cc: linux-pm@vger.kernel.org
Cc: linux-rdma@vger.kernel.org
Cc: linux-usb@vger.kernel.org
Cc: linux-wireless@vger.kernel.org
Cc: netdev@vger.kernel.org
Acked-by: Sebastian Reichel <sre@kernel.org>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Link: https://lore.kernel.org/r/20221123122523.1332370-1-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-11-24 17:12:15 +01:00
Aleksandr Miloserdov
68c5444c31 nvmet: expose firmware revision to configfs
Allow user to set currently active firmware revision

Reviewed-by: Konstantin Shelekhin <k.shelekhin@yadro.com>
Reviewed-by: Dmitriy Bogdanov <d.bogdanov@yadro.com>
Signed-off-by: Aleksandr Miloserdov <a.miloserdov@yadro.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-11-21 08:35:58 +01:00
Aleksandr Miloserdov
23855abdc4 nvmet: expose IEEE OUI to configfs
Allow user to set OUI for the controller vendor.

Reviewed-by: Konstantin Shelekhin <k.shelekhin@yadro.com>
Reviewed-by: Dmitriy Bogdanov <d.bogdanov@yadro.com>
Signed-off-by: Aleksandr Miloserdov <a.miloserdov@yadro.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-11-21 08:35:58 +01:00
Linus Torvalds
f4408c3dfc block-6.1-2022-11-18
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmN38ZUQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpgXxD/9tUSFUKIVGIn4pmNILfY3XV45HOi1w44yR
 zCxCELupcBeT+YixmaJcT8sunrrg2fLPOXMrDJk1cG/izXHzkjAQsHZvERfqC7hC
 f5onH+2MyGm3qBwxV0iGqITJgTwQGInVJijT4f9UZd/8ultymyZR2nOdIdIydHCF
 qzlOjq6hgIuGKHhFgOqRUg/OAkx510ZEEilUDcZ6XVV+zL7ccN6J9+eNTI3c58wT
 7jvxZC4u6QGKteGvVniE3WXgk3QdFiQRORvV09g+PkbG/vPjAIZ5tJFb9PdIOebD
 3guDiNUasgz2vnDetMK+yk4LcedcRfWnqgn+Vm8C26j5Fxs13eDx5kMDteVy7CYh
 3bokOATHohoZZ9qTApgQUswTfGJfBdoy0nUTPuffxPdKDyUPteIxFCADcnyDHnDG
 d/+PjU3FKF31o2HcUfvYp7OMO0VZP0hJSWps8znoVXKxb+LH9qKkYzHVlfni5kkS
 k9XqqD1Ki98Erb346YqgvQjCkz+CUd5DxtGyh9Oh2+oS2qHP6WjdKo1QPFmWD5dp
 EyXGSqGoZrIPtnKohLUN9EiVXanRQWJr3L0gw2CYXpmwfSKfMC3CQraEC1jOc01l
 TfsLJGbl3L5XpLzxoBwDu44cqp+VvbalergdcmsDTLDFHhONY2g5LJh6C9/EDdnQ
 Cde1uHikGw==
 =sOGG
 -----END PGP SIGNATURE-----

Merge tag 'block-6.1-2022-11-18' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:

 - NVMe pull request via Christoph:
      - Two more bogus nid quirks (Bean Huo, Tiago Dias Ferreira)
      - Memory leak fix in nvmet (Sagi Grimberg)

 - Regression fix for block cgroups pinning the wrong blkcg, causing
   leaks of cgroups and blkcgs (Chris)

 - UAF fix for drbd setup error handling (Dan)

 - Fix DMA alignment propagation in DM (Keith)

* tag 'block-6.1-2022-11-18' of git://git.kernel.dk/linux:
  dm-log-writes: set dma_alignment limit in io_hints
  dm-integrity: set dma_alignment limit in io_hints
  block: make blk_set_default_limits() private
  dm-crypt: provide dma_alignment limit in io_hints
  block: make dma_alignment a stacking queue_limit
  nvmet: fix a memory leak in nvmet_auth_set_key
  nvme-pci: add NVME_QUIRK_BOGUS_NID for Netac NV7000
  drbd: use after free in drbd_create_device()
  nvme-pci: add NVME_QUIRK_BOGUS_NID for Micron Nitro
  blk-cgroup: properly pin the parent in blkcg_css_online
2022-11-18 13:59:45 -08:00
Christoph Hellwig
9f27bd701d nvme: rename the queue quiescing helpers
Naming the nvme helpers that wrap the block quiesce functionality
_start/_stop is rather confusing.  Switch to using the quiesce naming
used by the block layer instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2022-11-18 08:24:23 +01:00
Sagi Grimberg
c58e28afb1 nvmet: fix a memory leak in nvmet_auth_set_key
When changing dhchap secrets we need to release the old
secrets as well.

kmemleak complaint:
--
unreferenced object 0xffff8c7f44ed8180 (size 64):
  comm "check", pid 7304, jiffies 4295686133 (age 72034.246s)
  hex dump (first 32 bytes):
    44 48 48 43 2d 31 3a 30 30 3a 4c 64 4c 4f 64 71  DHHC-1:00:LdLOdq
    79 56 69 67 77 48 55 32 6d 5a 59 4c 7a 35 59 38  yVigwHU2mZYLz5Y8
  backtrace:
    [<00000000b6fc5071>] kstrdup+0x2e/0x60
    [<00000000f0f4633f>] 0xffffffffc0e07ee6
    [<0000000053006c05>] 0xffffffffc0dff783
    [<00000000419ae922>] configfs_write_iter+0xb1/0x120
    [<000000008183c424>] vfs_write+0x2be/0x3c0
    [<000000009005a2a5>] ksys_write+0x5f/0xe0
    [<00000000cd495c89>] do_syscall_64+0x38/0x90
    [<00000000f2a84ac5>] entry_SYSCALL_64_after_hwframe+0x63/0xcd

Fixes: db1312dd95 ("nvmet: implement basic In-Band Authentication")
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-11-16 08:36:37 +01:00
Joel Granados
bcaf434b8f nvme: return err on nvme_init_non_mdts_limits fail
In nvme_init_non_mdts_limits function we were returning 0 when kzalloc
failed; it now returns -ENOMEM.

Fixes: 5befc7c26e ("nvme: implement non-mdts command limits")
Signed-off-by: Joel Granados <j.granados@samsung.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-11-16 08:36:37 +01:00
Uday Shankar
811f4de034 nvme: avoid fallback to sequential scan due to transient issues
Currently, if nvme_scan_ns_list fails, nvme_scan_work will fall back to
a sequential scan. nvme_scan_ns_list can fail for a variety of reasons,
e.g. a transient transport issue, and the resulting sequential scan can
be extremely expensive on controllers reporting an NN value close to the
maximum allowed (> 4 billion). Avoid sequential scans wherever possible
by only falling back to them in two cases:

- When the NVMe version supported (VS) value reported by the device is
  older than NVME_VS(1, 1, 0), before which support of Identify NS List
  not required.
- When the Identify NS List command fails with the DNR bit set in the
  status. This is to accommodate (non-compliant) devices which report a
  VS value which implies support for Identify NS List, but nevertheless
  do not support the command. Such devices will most likely fail the
  command with the DNR bit set.

The third case is when the device claims support for Identify NS List
but the command fails with DNR not set. In such cases, fallback to
sequential scan is potentially expensive and likely unnecessary, as a
retry of the list scan should succeed. So this change skips the fallback
in this third case.

Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-11-16 08:36:37 +01:00
Sagi Grimberg
91c11d5f32 nvme-rdma: stop auth work after tearing down queues in error recovery
when starting error recovery there might be a authentication work
running, and it involves I/O commands. Given the controller is tearing
down there is no chance for the I/O to complete other than timing out
which may unnecessarily take a full io timeout.

So first tear down the queues, fail/cancel all inflight I/O (including
potentially authentication) and only then stop authentication. This
ensures that failover is not stalled due to blocked authentication I/O.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-11-16 08:36:37 +01:00
Sagi Grimberg
1f1a4f8956 nvme-tcp: stop auth work after tearing down queues in error recovery
when starting error recovery there might be a authentication work
running, and it involves I/O commands. Given the controller is tearing
down there is no chance for the I/O to complete other than timing out
which may unnecessarily take a full io timeout.

So first tear down the queues, fail/cancel all inflight I/O (including
potentially authentication) and only then stop authentication. This
ensures that failover is not stalled due to blocked authentication I/O.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-11-16 08:36:37 +01:00
Sagi Grimberg
d061a1bd1f nvme-auth: have dhchap_auth_work wait for queues auth to complete
It triggered the queue authentication work elements in parallel, but
the ctrl authentication work itself completes when all of them
completes. Hence wait for queues auth completions.

This also makes nvme_auth_stop simply a sync cancel of ctrl
dhchap_auth_work.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-11-16 08:36:36 +01:00
Sagi Grimberg
a2a00d2a66 nvme-auth: remove redundant auth_work flush
only ctrl deletion calls nvme_auth_free, which was stopped prior in the
teardown stage, so there is no possibility that it should ever run when
nvme_auth_free is called. As a result, we can remove a local chap pointer
variable.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-11-16 08:36:36 +01:00
Sagi Grimberg
aa36d711e9 nvme-auth: convert dhchap_auth_list to an array
We know exactly how many dhchap contexts we will need, there is no need
to hold a list that we need to protect with a mutex. Convert to
a dynamically allocated array. And dhchap_context access state is
maintained by the chap itself.

Make dhchap_auth_mutex protect only the ctrl host_key and ctrl_key
in a fine-grained lock such that there is no long lasting acquisition
of the lock and no need to take/release this lock when flushing
authentication works.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-11-16 08:36:36 +01:00
Sagi Grimberg
546dea18c9 nvme-auth: check chap ctrl_key once constructed
ctrl ctrl_key member may be overwritten from a sysfs context driven
by the user. Once a queue local copy was created, use that instead
to minimize checks on a shared resource.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-11-16 08:36:36 +01:00
Sagi Grimberg
e8a420efb6 nvme-auth: no need to reset chap contexts on re-authentication
Now that the chap context is reset upon completion, this is no longer
needed. Also remove nvme_auth_reset as no callers are left.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-11-16 08:36:36 +01:00
Sagi Grimberg
96df318393 nvme-auth: remove redundant deallocations
These are now redundant as the dhchap context is
removed after authentication completes.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-11-16 08:36:36 +01:00
Sagi Grimberg
8d1c1904e9 nvme-auth: clear sensitive info right after authentication completes
We don't want to keep authentication sensitive info in memory for unlimited
amount of time.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-11-16 08:36:35 +01:00
Sagi Grimberg
e481fc0a37 nvme-auth: guarantee dhchap buffers under memory pressure
We want to guarantee that we have chap buffers when a controller
reconnects under memory pressure. Add a mempool specifically
for that.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-11-16 08:36:35 +01:00
Sagi Grimberg
b7d604cae8 nvme-auth: don't keep long lived 4k dhchap buffer
dhchap structure is per-queue, it is wasteful to keep it for the entire
lifetime of the queue. Allocate it dynamically and get rid of it after
authentication. We don't need kzalloc because all accessors are clearing
it before writing to it.

Also, remove redundant chap buf_size which is always 4096, use a define
instead.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-11-16 08:36:35 +01:00
Sagi Grimberg
bfc4068e1e nvme-auth: remove redundant if statement
No one passes NVME_QID_ANY to nvme_auth_negotiate.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-11-16 08:36:35 +01:00
Sagi Grimberg
01604350e1 nvme-auth: don't override ctrl keys before validation
Replace ctrl ctrl_key/host_key only after nvme_auth_generate_key is successful.
Also, this fixes a bug where the keys are leaked.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-11-16 08:36:35 +01:00
Sagi Grimberg
193a8c7e5f nvme-auth: don't ignore key generation failures when initializing ctrl keys
nvme_auth_generate_key can fail, don't ignore it upon initialization.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-11-16 08:36:35 +01:00
Sagi Grimberg
f6b182fbd5 nvme-auth: remove redundant buffer deallocations
host_response, host_key, ctrl_key and sess_key are
freed in nvme_auth_reset_dhchap which is called from
nvme_auth_free_dhchap.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-11-16 08:36:34 +01:00
Sagi Grimberg
c7390f132a nvme-auth: don't re-authenticate if the controller is not LIVE
The connect sequence will re-authenticate.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-11-16 08:36:34 +01:00
Sagi Grimberg
100b555bc2 nvme-auth: remove symbol export from nvme_auth_reset
Only the nvme module calls it.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-11-16 08:36:34 +01:00
Sagi Grimberg
0c999e69c4 nvme-auth: rename authentication work elements
Use nvme_ctrl_auth_work and nvme_queue_auth_work for better
readability.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-11-16 08:36:34 +01:00
Sagi Grimberg
0a7ce375f8 nvme-auth: rename __nvme_auth_[reset|free] to nvme_auth[reset|free]_dhchap
nvme_auth_[reset|free] operate on the controller while
__nvme_auth_[reset|free] operate on a chap struct (which maps to a queue
context). Rename it for clarity.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-11-16 08:36:34 +01:00
Christoph Hellwig
c7c16c5b19 nvme-pci: don't unbind the driver on reset failure
Unbind a device driver when a reset fails is very unusual behavior.
Just shut the controller down and leave it in dead state if we fail
to reset it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2022-11-16 08:36:33 +01:00
Christoph Hellwig
eac3ef2629 nvme-pci: split the initial probe from the rest path
nvme_reset_work is a little fragile as it needs to handle both resetting
a live controller and initializing one during probe.  Split out the initial
probe and open code it in nvme_probe and leave nvme_reset_work to just do
the live controller reset.

This fixes a recently introduced bug where nvme_dev_disable causes a NULL
pointer dereferences in blk_mq_quiesce_tagset because the tagset pointer
is not set when the reset state is entered directly from the new state.
The separate probe code can skip the reset state and probe directly and
fixes this.

To make sure the system isn't single threaded on enabling nvme
controllers, set the PROBE_PREFER_ASYNCHRONOUS flag in the device_driver
structure so that the driver core probes in parallel.

Fixes: 98d81f0df7 ("nvme: use blk_mq_[un]quiesce_tagset")
Reported-by: Gerd Bayer <gbayer@linux.ibm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Tested-by Gerd Bayer <gbayer@linxu.ibm.com>
2022-11-16 08:36:33 +01:00
Christoph Hellwig
acb71e53bb nvme-pci: move the HMPRE check into nvme_setup_host_mem
Check that a HMB is wanted into the allocation helper instead of the
caller.  This makes life simpler for an upcoming second caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2022-11-16 08:36:31 +01:00
Sagi Grimberg
0a52566279 nvmet: fix a memory leak in nvmet_auth_set_key
When changing dhchap secrets we need to release the old
secrets as well.

kmemleak complaint:
--
unreferenced object 0xffff8c7f44ed8180 (size 64):
  comm "check", pid 7304, jiffies 4295686133 (age 72034.246s)
  hex dump (first 32 bytes):
    44 48 48 43 2d 31 3a 30 30 3a 4c 64 4c 4f 64 71  DHHC-1:00:LdLOdq
    79 56 69 67 77 48 55 32 6d 5a 59 4c 7a 35 59 38  yVigwHU2mZYLz5Y8
  backtrace:
    [<00000000b6fc5071>] kstrdup+0x2e/0x60
    [<00000000f0f4633f>] 0xffffffffc0e07ee6
    [<0000000053006c05>] 0xffffffffc0dff783
    [<00000000419ae922>] configfs_write_iter+0xb1/0x120
    [<000000008183c424>] vfs_write+0x2be/0x3c0
    [<000000009005a2a5>] ksys_write+0x5f/0xe0
    [<00000000cd495c89>] do_syscall_64+0x38/0x90
    [<00000000f2a84ac5>] entry_SYSCALL_64_after_hwframe+0x63/0xcd

Fixes: db1312dd95 ("nvmet: implement basic In-Band Authentication")
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-11-16 07:20:56 +01:00
Tiago Dias Ferreira
8d6e38f636 nvme-pci: add NVME_QUIRK_BOGUS_NID for Netac NV7000
Added a quirk to fix the Netac NV7000 SSD reporting duplicate NGUIDs.

Cc: <stable@vger.kernel.org>
Signed-off-by: Tiago Dias Ferreira <tiagodfer@gmail.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-11-16 07:20:56 +01:00
Christoph Hellwig
65a5464642 nvme-pci: simplify nvme_dbbuf_dma_alloc
Move the OACS check and the error checking into nvme_dbbuf_dma_alloc so
that an upcoming second caller doesn't have to duplicate this boilerplate
code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2022-11-15 10:56:20 +01:00
Christoph Hellwig
a6ee7f19eb nvme-pci: call nvme_pci_configure_admin_queue from nvme_pci_enable
nvme_pci_configure_admin_queue is called right after nvme_pci_enable, and
it's work is undone by nvme_dev_disable.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Tested-by Gerd Bayer <gbayer@linxu.ibm.com>
2022-11-15 10:56:17 +01:00
Christoph Hellwig
3f30a79c2e nvme-pci: set constant paramters in nvme_pci_alloc_ctrl
Move setting of low-level constant parameters from nvme_reset_work to
nvme_pci_alloc_ctrl.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Tested-by Gerd Bayer <gbayer@linxu.ibm.com>
2022-11-15 10:56:13 +01:00
Christoph Hellwig
2e87570be9 nvme-pci: factor out a nvme_pci_alloc_dev helper
Add a helper that allocates the nvme_dev structure up to the point where
we can call nvme_init_ctrl.  This pairs with the free_ctrl method and can
thus be used to cleanup the teardown path and make it more symmetric.

Note that this now calls nvme_init_ctrl a lot earlier during probing,
which also means the per-controller character device shows up earlier.
Due to the controller state no commnds can be send on it, but it might
make sense to delay the cdev registration until nvme_init_ctrl_finish.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Tested-by Gerd Bayer <gbayer@linxu.ibm.com>
2022-11-15 10:56:09 +01:00
Christoph Hellwig
081a7d958c nvme-pci: factor the iod mempool creation into a helper
Add a helper to create the iod mempool.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Tested-by Gerd Bayer <gbayer@linxu.ibm.com>
2022-11-15 10:56:06 +01:00
Christoph Hellwig
c11b7716d6 nvme-pci: move more teardown work to nvme_remove
nvme_dbbuf_dma_free frees dma coherent memory, so it must not be called
after ->remove has returned.  Fortunately there is no way to use it
after shutdown as no more I/O is possible so it can be moved.  Similarly
the iod_mempool can't be used for a device kept alive after shutdown, so
move it next to freeing the PRP pools.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Tested-by Gerd Bayer <gbayer@linxu.ibm.com>
2022-11-15 10:56:03 +01:00
Christoph Hellwig
96ef1be536 nvme-pci: put the admin queue in nvme_dev_remove_admin
Once the controller is shutdown no one can access the admin queue.  Tear
it down in nvme_dev_remove_admin, which matches the flow in the other
drivers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Tested-by Gerd Bayer <gbayer@linxu.ibm.com>
2022-11-15 10:55:59 +01:00
Christoph Hellwig
86adbf0cdb nvme: simplify transport specific device attribute handling
Allow the transport driver to override the attribute groups for the
control device, so that the PCIe driver doesn't manually have to add a
group after device creation and keep track of it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Tested-by Gerd Bayer <gbayer@linxu.ibm.com>
2022-11-15 10:55:56 +01:00
Christoph Hellwig
94cc781f69 nvme: move OPAL setup from PCIe to core
Nothing about the TCG Opal support is PCIe transport specific, so move it
to the core code.  For this nvme_init_ctrl_finish grows a new
was_suspended argument that allows the transport driver to tell the OPAL
code if the controller came out of a suspend cycle.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: James Smart <jsmart2021@gmail.com>
Tested-by Gerd Bayer <gbayer@linxu.ibm.com>
2022-11-15 10:55:53 +01:00
Christoph Hellwig
1e37a307f1 nvme: don't call nvme_init_ctrl_finish from nvme_passthru_end
nvme_passthrough_end can race with a reset, which can lead to
racing stores to the cels xarray as well as further shengians
with upcoming more complicated initialization.

So drop the call and just log that the controller capabilities
might have changed and a reset could be required to use the new
controller capabilities.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Tested-by Gerd Bayer <gbayer@linxu.ibm.com>
2022-11-15 10:55:49 +01:00
Christoph Hellwig
1b96f862ec nvme: implement the DEAC bit for the Write Zeroes command
While the specification allows devices to either deallocate data
or to actually write zeroes on any Write Zeroes command, many SSDs
only do the sensible thing and deallocate data when the DEAC bit
is specific.  Set it when it is supported and the caller doesn't
explicitly opt out of deallocation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
2022-11-15 10:50:31 +01:00
Kanchan Joshi
e4fbcf32c8 nvme: identify-namespace without CAP_SYS_ADMIN
Allow all identify-namespace variants (CNS 00h, 05h and 08h) without
requiring CAP_SYS_ADMIN. The information (retrieved using id-ns) is
needed to form IO commands for passthrough interface.

Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-11-15 10:50:30 +01:00
Kanchan Joshi
855b7717f4 nvme: fine-granular CAP_SYS_ADMIN for nvme io commands
Currently both io and admin commands are kept under a
coarse-granular CAP_SYS_ADMIN check, disregarding file mode completely.

$ ls -l /dev/ng*
crw-rw-rw- 1 root root 242, 0 Sep  9 19:20 /dev/ng0n1
crw------- 1 root root 242, 1 Sep  9 19:20 /dev/ng0n2

In the example above, ng0n1 appears as if it may allow unprivileged
read/write operation but it does not and behaves same as ng0n2.

This patch implements a shift from CAP_SYS_ADMIN to more fine-granular
control for io-commands.
If CAP_SYS_ADMIN is present, nothing else is checked as before.
Otherwise, following rules are in place
- any admin-cmd is not allowed
- vendor-specific and fabric commmand are not allowed
- io-commands that can write are allowed if matching FMODE_WRITE
permission is present
- io-commands that read are allowed

Add a helper nvme_cmd_allowed that implements above policy.
Change all the callers of CAP_SYS_ADMIN to go through nvme_cmd_allowed
for any decision making.
Since file open mode is counted for any approval/denial, change at
various places to keep file-mode information handy.

Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-11-15 10:50:30 +01:00
Christophe JAILLET
cf3d008401 nvme-fc: improve memory usage in nvme_fc_rcv_ls_req()
sizeof( struct nvmefc_ls_rcv_op ) = 64
sizeof( union nvmefc_ls_requests ) = 1024
sizeof( union nvmefc_ls_responses ) = 128

So, in nvme_fc_rcv_ls_req(), 1216 bytes of memory are requested when
kzalloc() is called.

Because of the way memory allocations are performed, 2048 bytes are
allocated. So about 800 bytes are wasted for each request.

Switch to 3 distinct memory allocations, in order to:
   - save these 800 bytes
   - avoid zeroing this extra memory
   - make sure that memory is properly aligned in case of DMA access
    ("fc_dma_map_single(lsop->rspbuf)" just a few lines below)

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-11-15 10:50:30 +01:00
Christoph Hellwig
fa8f9ac423 nvmet: only allocate a single slab for bvecs
There is no need to have a separate slab cache for each namespace,
and having separate ones creates duplicate debugs file names as well.

Fixes: d5eff33ee6 ("nvmet: add simple file backed ns support")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
2022-11-15 10:50:22 +01:00
Daniel Wagner
2be2cd5287 nvmet: force reconnect when number of queue changes
In order to test queue number changes we need to make sure that the
host reconnects. Because only when the host disconnects from the
target the number of queues are allowed to change according the spec.

The initial idea was to disable and re-enable the ports and have the
host wait until the KATO timer expires, triggering error
recovery. Though the host would see a DNR reply when trying to
reconnect. Because of the DNR bit the connection is dropped
completely. There is no point in trying to reconnect with the same
parameters according the spec.

We can force to reconnect the host is by deleting all controllers. The
host will observe any newly posted request to fail and thus starts the
error recovery but this time without the DNR bit set.

Signed-off-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Chaitanya Kulkarni  <kch@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-11-15 10:49:48 +01:00
Uros Bizjak
bbf5410bc6 nvmet: use try_cmpxchg in nvmet_update_sq_head
Use try_cmpxchg instead of cmpxchg (*ptr, old, new) == old in
nvmet_update_sq_head.  x86 CMPXCHG instruction returns success in ZF flag, so
this change saves a compare after cmpxchg (and related move instruction in
front of cmpxchg).

Also, try_cmpxchg implicitly assigns old *ptr value to "old" when cmpxchg
fails. There is no need to re-read the value in the loop.

Note that the value from *ptr should be read using READ_ONCE to prevent
the compiler from merging, refetching or reordering the read.

No functional change intended.

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-11-15 10:49:48 +01:00
Bean Huo
d5ceb4d1c5 nvme-pci: add NVME_QUIRK_BOGUS_NID for Micron Nitro
Added a quirk to fix Micron Nitro NVMe reporting duplicate NGUIDs.

Cc: <stable@vger.kernel.org>
Signed-off-by: Bean Huo <beanhuo@micron.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-11-15 10:48:59 +01:00
Linus Torvalds
b0b6e2c9d3 block-6.1-2022-11-11
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmNuaacQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpiHOD/wMxAiJcZfhTUakXuJnOOdNqgOzIkTOw1u9
 BHS23p8FwwaESevpTOEiGHh9DVRGBzDJknwsAf/YoHV5CA3BvhlW8I2zHp8ybzWD
 Mq9LLK/waifYo0/5eWdEG2b4cH1kXeK9n377RWi+LstL+C7X/+0w6Q0wBTV5SxNF
 mWHfhnomtTz1A0qcxgSkyIuJOoUQ5iH9LZvoOze+kIiJf0S7C2/oKfBKuXO8iPxI
 wt76qMlb1+uNTuTLVHpZDbF11df7wYSrTZIfYBH5hYZ5KefM3cHUSgedoBbOb3Gy
 2TdctzWyjxBhUKeeZxkWgV3kJ3ha0hQ5lRxvy8R9uYs8NMxfhe2lfoyJmU1NtEvm
 xNIs1sRRYQ8BpnVOdwPRPVqmpGCauGj9I7W8KEOEzvGdUFN1TIpEucIfRL3mg88w
 8/4JCDi10PNRpyc1G1bb/vqXF11iX2YI8Fr9M+R9oW8V28qdMFBob5MK+TTCBGDL
 2lQHx0wCZMK3dUiLLv0mqFPcrK9v1mxpBBwpPGkzGf/FvmB00aV1n02Bo8prCD/d
 tY/aghHviDPkpaR0MJ4+MHllloZR+gbcxYfGbpdDUrN8ZVYRMIzi8NrwwPb98zqB
 d6CX8BPevi3/azjORf/I/v7egTSTRhH/JHBw7derANhPd7OSWLQfjhIHDhZoYs/q
 wsuIlnJOyA==
 =lEq0
 -----END PGP SIGNATURE-----

Merge tag 'block-6.1-2022-11-11' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:

 - NVMe pull request via Christoph:
        - Quiet user passthrough command errors (Keith Busch)
        - Fix memory leak in nvmet_subsys_attr_model_store_locked
        - Fix a memory leak in nvmet-auth (Sagi Grimberg)

 - Fix a potential NULL point deref in bfq (Yu)

 - Allocate command/response buffers separately for DMA for sed-opal,
   rather than rely on embedded alignment (Serge)

* tag 'block-6.1-2022-11-11' of git://git.kernel.dk/linux:
  nvmet: fix a memory leak
  nvmet: fix memory leak in nvmet_subsys_attr_model_store_locked
  nvme: quiet user passthrough command errors
  block: sed-opal: kmalloc the cmd/resp buffers
  block, bfq: fix null pointer dereference in bfq_bio_bfqg()
2022-11-11 14:08:30 -08:00
Sagi Grimberg
e65fdf530f nvmet: fix a memory leak
We need to also free the dhchap_ctrl_secret when releasing nvmet_host.
kmemleak complaint:
--
unreferenced object 0xffff99b1cbca5140 (size 64):
  comm "check", pid 4864, jiffies 4305092436 (age 2913.583s)
  hex dump (first 32 bytes):
    44 48 48 43 2d 31 3a 30 30 3a 65 36 2b 41 63 44  DHHC-1:00:e6+AcD
    39 76 47 4d 52 57 59 78 67 54 47 44 51 59 47 78  9vGMRWYxgTGDQYGx
  backtrace:
    [<00000000c07d369d>] kstrdup+0x2e/0x60
    [<000000001372171c>] 0xffffffffc0cceec6
    [<0000000010dbf50b>] 0xffffffffc0cc6783
    [<000000007465e93c>] configfs_write_iter+0xb1/0x120
    [<0000000039c23f62>] vfs_write+0x2be/0x3c0
    [<000000002da4351c>] ksys_write+0x5f/0xe0
    [<00000000d5011e32>] do_syscall_64+0x38/0x90
    [<00000000503870cf>] entry_SYSCALL_64_after_hwframe+0x63/0xcd

Fixes: db1312dd95 ("nvmet: implement basic In-Band Authentication")
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-11-09 14:29:13 +01:00
Aleksandr Miloserdov
becc4cac30 nvmet: fix memory leak in nvmet_subsys_attr_model_store_locked
Since model_number is allocated before it needs to be freed before
kmemdump_nul.

Reviewed-by: Konstantin Shelekhin <k.shelekhin@yadro.com>
Reviewed-by: Dmitriy Bogdanov <d.bogdanov@yadro.com>
Signed-off-by: Aleksandr Miloserdov <a.miloserdov@yadro.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-11-09 14:28:27 +01:00
Keith Busch
d7ac8dca93 nvme: quiet user passthrough command errors
The driver is spamming the kernel logs for entirely harmless errors from
user space submitting unsupported commands. Just silence the errors.
The application has direct access to command status, so there's no need
to log these.

And since every passthrough command now uses the quiet flag, move the
setting to the common initializer.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Alan Adamson <alan.adamson@oracle.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Tested-by: Alan Adamson <alan.adamson@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-11-09 14:28:27 +01:00
Chao Leng
98d81f0df7 nvme: use blk_mq_[un]quiesce_tagset
All controller namespaces share the same tagset, so we can use this
interface which does the optimal operation for parallel quiesce based on
the tagset type(e.g. blocking tagsets and non-blocking tagsets).

nvme connect_q should not be quiesced when quiesce tagset, so set the
QUEUE_FLAG_SKIP_TAGSET_QUIESCE to skip it when init connect_q.

Currently we use NVME_NS_STOPPED to ensure pairing quiescing and
unquiescing. If use blk_mq_[un]quiesce_tagset, NVME_NS_STOPPED will be
invalided, so introduce NVME_CTRL_STOPPED to replace NVME_NS_STOPPED.
In addition, we never really quiesce a single namespace. It is a better
choice to move the flag from ns to ctrl.

Signed-off-by: Chao Leng <lengchao@huawei.com>
[hch: rebased on top of prep patches]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chao Leng <lengchao@huawei.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20221101150050.3510-15-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-02 08:35:34 -06:00
Christoph Hellwig
483239c75b blk-mq: pass a tagset to blk_mq_wait_quiesce_done
Nothing in blk_mq_wait_quiesce_done needs the request_queue now, so just
pass the tagset, and move the non-mq check into the only caller that
needs it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chao Leng <lengchao@huawei.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20221101150050.3510-13-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-02 08:35:34 -06:00
Christoph Hellwig
2b4c2355c5 nvme-apple: don't unquiesce the I/O queues in apple_nvme_reset_work
apple_nvme_reset_work schedules apple_nvme_remove, to be called, which
will call apple_nvme_disable and unquiesce the I/O queues.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20221101150050.3510-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-02 08:35:34 -06:00
Christoph Hellwig
bad3e021ae nvme-pci: don't unquiesce the I/O queues in nvme_remove_dead_ctrl
nvme_remove_dead_ctrl schedules nvme_remove to be called, which will
call nvme_dev_disable and unquiesce the I/O queues.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20221101150050.3510-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-02 08:35:34 -06:00
Christoph Hellwig
cd50f9b247 nvme: split nvme_kill_queues
nvme_kill_queues does two things:

 1) mark the gendisk of all namespaces dead
 2) unquiesce all I/O queues

These used to be be intertwined due to block layer issues, but aren't
any more.  So move the unquiscing of the I/O queues into the callers,
and rename the rest of the function to the now more descriptive
nvme_mark_namespaces_dead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20221101150050.3510-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-02 08:35:34 -06:00
Christoph Hellwig
6bcd5089ee nvme: don't unquiesce the admin queue in nvme_kill_queues
None of the callers of nvme_kill_queues needs it to unquiesce the
admin queues, as all of them already do it themselves:

 1) nvme_reset_work explicit call nvme_start_admin_queue toward the
    beginning of the function.  The extra call to nvme_start_admin_queue
    in nvme_reset_work this won't do anything as
    NVME_CTRL_ADMIN_Q_STOPPED will already be cleared.
 2) nvme_remove calls nvme_dev_disable with shutdown flag set to true at
    the very beginning of the function if the PCIe device was not present,
    which is the precondition for the call to nvme_kill_queues.
    nvme_dev_disable already calls nvme_start_admin_queue toward the
    end of the function when the shutdown flag is set to true, so the
    admin queue is already enabled at this point.
 3) nvme_remove_dead_ctrl schedules a workqueue to unbind the driver,
    which will end up in nvme_remove, which calls nvme_dev_disable with
    the shutdown flag.  This case will call nvme_start_admin_queue a bit
    later than before.
 4) apple_nvme_remove uses the same sequence as nvme_remove_dead_ctrl
    above.
 5) nvme_remove_namespaces only calls nvme_kill_queues when the
    controller is in the DEAD state.  That can only happen in the PCIe
    driver, and only from nvme_remove. See item 2) above for the
    conditions there.

So it is safe to just remove the call to nvme_start_admin_queue in
nvme_kill_queues without replacement.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20221101150050.3510-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-02 08:35:34 -06:00
Christoph Hellwig
fde776afdd nvme: remove the NVME_NS_DEAD check in nvme_validate_ns
At the point where namespaces are marked dead, the controller is in a
non-live state and we won't get pass the identify commands.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20221101150050.3510-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-02 08:35:34 -06:00
Christoph Hellwig
4f17344e9d nvme: remove the NVME_NS_DEAD check in nvme_remove_invalid_namespaces
The NVME_NS_DEAD check only made sense when we revalidated namespaces
in nvme_passthrough_end for commands that affected the namespace inventory.
These days NVME_NS_DEAD is only set during reset or when tearing down
namespaces, and we always remove all namespaces right after that.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20221101150050.3510-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-02 08:35:34 -06:00
Christoph Hellwig
23a908647e nvme: don't remove namespaces in nvme_passthru_end
The call to nvme_remove_invalid_namespaces made sense when
nvme_passthru_end revalidated all namespaces and had to remove those that
didn't exist any more.  Since we don't revalidate from nvme_passthru_end
now, this call is entirely spurious.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20221101150050.3510-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-02 08:35:34 -06:00
Christoph Hellwig
0ffc7e98bf nvme-pci: refactor the tagset handling in nvme_reset_work
The code to create, update or delete a tagset and namespaces in
nvme_reset_work is a bit convoluted.  Refactor it with a two high-level
conditionals for first probe vs reset and I/O queues vs no I/O queues
to make the code flow more clear.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20221101150050.3510-3-hch@lst.de
[axboe: fix whitespace issue]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-02 08:35:19 -06:00
Christoph Hellwig
71b26083d5 block: set the disk capacity to 0 in blk_mark_disk_dead
nvme and xen-blkfront are already doing this to stop buffered writes from
creating dirty pages that can't be written out later.  Move it to the
common code.

This also removes the comment about the ordering from nvme, as bd_mutex
not only is gone entirely, but also hasn't been used for locking updates
to the disk size long before that, and thus the ordering requirement
documented there doesn't apply any more.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Chao Leng <lengchao@huawei.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20221101150050.3510-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-02 08:32:31 -06:00
Linus Torvalds
c6e0e874a8 block-6.1-2022-10-28
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmNcRTYQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpkc2D/4yxjQ+lAmXLrqZOGAZc+r2GpiCFsgFcpKP
 i3ezeeo3zmdaoUH778DcbPo0oeWY/iIvV2RDo3/0PBHIlGL43W9e7zsnauRxUwtw
 /Aj140Tsm5/lKnBy8n0nT+DO4LE22JnBHi5XjlFELBwM+deBxS+izinFtcOQC8sj
 58XWSmKag/Lv5JLvcYMj+PprtGOKzfNAacXvTjouy0IlEyb9E/yPMELS8lWFv8i2
 QELtvuEDxODpQtA+Ph0O/o00A8Fg/lC4EH5uvExFMr8k74CGFm32Bar1UuaJ9QAs
 5b8wateTra51yOGW3NEl1ph+4qVe9e4mutrLOFrChYylk5LePOVAki3wYb3lREiU
 rTOEKzUj3P/LHLpl4els0yIQ0gHXs60/M/Vn3TC50+2DnV00qEfvaocZ8vtXOux4
 YR+2cKUxk2CRNyj2BB3WRlrIkCIVk+ehl17E2cdrg0m8SMqk0GAYbpXD753L9uiy
 I7IQEqYB+op501pmTcVskFUfW9ozT96YD53fwSOTR/pEK+esHN0GfqxI6lcA6Q0O
 M2AWEiu8t1PbSONVH/p895gfgGHdRHl6zgvR+ADJMDEmc7dpEoAxsoTj4HIirXbe
 sGHi7ycrQR6aLdHahjCukjUVkZkuhXJkAQmq2XURJgmEcz7iJme23WqtWWUUoQvi
 pk6e1RSqSA==
 =Zfnu
 -----END PGP SIGNATURE-----

Merge tag 'block-6.1-2022-10-28' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:

 - NVMe pull request via Christoph:
      - make the multipath dma alignment match the non-multipath one
        (Keith Busch)
      - fix a bogus use of sg_init_marker() (Nam Cao)
      - fix circulr locking in nvme-tcp (Sagi Grimberg)

 - Initialization fix for requests allocated via the special hw queue
   allocator (John)

 - Fix for a regression added in this release with the batched
   completions of end_io backed requests (Ming)

 - Error handling leak fix for rbd (Yang)

 - Error handling leak fix for add_disk() failure (Yu)

* tag 'block-6.1-2022-10-28' of git://git.kernel.dk/linux:
  blk-mq: Properly init requests from blk_mq_alloc_request_hctx()
  blk-mq: don't add non-pt request with ->end_io to batch
  rbd: fix possible memory leak in rbd_sysfs_init()
  nvme-multipath: set queue dma alignment to 3
  nvme-tcp: fix possible circular locking when deleting a controller under memory pressure
  nvme-tcp: replace sg_init_marker() with sg_init_table()
  block: fix memory leak for elevator on add_disk failure
2022-10-29 18:06:52 -07:00
Keith Busch
fe8714b04f nvme-multipath: set queue dma alignment to 3
NVMe spec requires all transports support dword aligned addresses, which
is already set in the namespace request_queue. Set the same limit in the
multipath device's request_queue as well.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-10-25 08:07:53 -07:00
Sagi Grimberg
83e1226b0e nvme-tcp: fix possible circular locking when deleting a controller under memory pressure
When destroying a queue, when calling sock_release, the network stack
might need to allocate an skb to send a FIN/RST. When that happens
during memory pressure, there is a need to reclaim memory, which
in turn may ask the nvme-tcp device to write out dirty pages, however
this is not possible due to a ctrl teardown that is going on.

Set PF_MEMALLOC to the task that releases the socket to grant access
to PF_MEMALLOC reserves. In addition, do the same for the nvme-tcp
thread as this may also originate from the swap itself and should
be more resilient to memory pressure situations.

This fixes the following lockdep complaint:
--
======================================================
 WARNING: possible circular locking dependency detected
 6.0.0-rc2+ #25 Tainted: G        W
 ------------------------------------------------------
 kswapd0/92 is trying to acquire lock:
 ffff888114003240 (sk_lock-AF_INET-NVME){+.+.}-{0:0}, at: tcp_sendpage+0x23/0xa0

 but task is already holding lock:
 ffffffff97e95ca0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0x987/0x10d0

 which lock already depends on the new lock.

 the existing dependency chain (in reverse order) is:

 -> #1 (fs_reclaim){+.+.}-{0:0}:
        fs_reclaim_acquire+0x11e/0x160
        kmem_cache_alloc_node+0x44/0x530
        __alloc_skb+0x158/0x230
        tcp_send_active_reset+0x7e/0x730
        tcp_disconnect+0x1272/0x1ae0
        __tcp_close+0x707/0xd90
        tcp_close+0x26/0x80
        inet_release+0xfa/0x220
        sock_release+0x85/0x1a0
        nvme_tcp_free_queue+0x1fd/0x470 [nvme_tcp]
        nvme_do_delete_ctrl+0x130/0x13d [nvme_core]
        nvme_sysfs_delete.cold+0x8/0xd [nvme_core]
        kernfs_fop_write_iter+0x356/0x530
        vfs_write+0x4e8/0xce0
        ksys_write+0xfd/0x1d0
        do_syscall_64+0x58/0x80
        entry_SYSCALL_64_after_hwframe+0x63/0xcd

 -> #0 (sk_lock-AF_INET-NVME){+.+.}-{0:0}:
        __lock_acquire+0x2a0c/0x5690
        lock_acquire+0x18e/0x4f0
        lock_sock_nested+0x37/0xc0
        tcp_sendpage+0x23/0xa0
        inet_sendpage+0xad/0x120
        kernel_sendpage+0x156/0x440
        nvme_tcp_try_send+0x48a/0x2630 [nvme_tcp]
        nvme_tcp_queue_rq+0xefb/0x17e0 [nvme_tcp]
        __blk_mq_try_issue_directly+0x452/0x660
        blk_mq_plug_issue_direct.constprop.0+0x207/0x700
        blk_mq_flush_plug_list+0x6f5/0xc70
        __blk_flush_plug+0x264/0x410
        blk_finish_plug+0x4b/0xa0
        shrink_lruvec+0x1263/0x1ea0
        shrink_node+0x736/0x1a80
        balance_pgdat+0x740/0x10d0
        kswapd+0x5f2/0xaf0
        kthread+0x256/0x2f0
        ret_from_fork+0x1f/0x30

other info that might help us debug this:

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(fs_reclaim);
                               lock(sk_lock-AF_INET-NVME);
                               lock(fs_reclaim);
  lock(sk_lock-AF_INET-NVME);

 *** DEADLOCK ***

3 locks held by kswapd0/92:
 #0: ffffffff97e95ca0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0x987/0x10d0
 #1: ffff88811f21b0b0 (q->srcu){....}-{0:0}, at: blk_mq_flush_plug_list+0x6b3/0xc70
 #2: ffff888170b11470 (&queue->send_mutex){+.+.}-{3:3}, at: nvme_tcp_queue_rq+0xeb9/0x17e0 [nvme_tcp]

Fixes: 3f2304f8c6 ("nvme-tcp: add NVMe over TCP host driver")
Reported-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Tested-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-10-25 08:07:50 -07:00
Nam Cao
5fa9add66b nvme-tcp: replace sg_init_marker() with sg_init_table()
In nvme_tcp_ddgst_update(), sg_init_marker() is called with an
uninitialized scatterlist. This is probably fine, but gcc complains:

  CC [M]  drivers/nvme/host/tcp.o
In file included from ./include/linux/dma-mapping.h:10,
                 from ./include/linux/skbuff.h:31,
                 from ./include/net/net_namespace.h:43,
                 from ./include/linux/netdevice.h:38,
                 from ./include/net/sock.h:46,
                 from drivers/nvme/host/tcp.c:12:
In function ‘sg_mark_end’,
    inlined from ‘sg_init_marker’ at ./include/linux/scatterlist.h:356:2,
    inlined from ‘nvme_tcp_ddgst_update’ at drivers/nvme/host/tcp.c:390:2:
./include/linux/scatterlist.h:234:11: error: ‘sg.page_link’ is used uninitialized [-Werror=uninitialized]
  234 |         sg->page_link |= SG_END;
      |         ~~^~~~~~~~~~~
drivers/nvme/host/tcp.c: In function ‘nvme_tcp_ddgst_update’:
drivers/nvme/host/tcp.c:388:28: note: ‘sg’ declared here
  388 |         struct scatterlist sg;
      |                            ^~
cc1: all warnings being treated as errors

Use sg_init_table() instead, which basically memset the scatterlist to
zero first before calling sg_init_marker().

Signed-off-by: Nam Cao <namcaov@gmail.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-10-25 08:07:50 -07:00
Christoph Hellwig
941f7298c7 nvme-apple: remove an extra queue reference
Now that blk_mq_destroy_queue does not release the queue reference, there
is no need for a second admin queue reference to be held by the
apple_nvme structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Sven Peter <sven@svenpeter.dev>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20221018135720.670094-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-25 08:25:35 -06:00
Christoph Hellwig
7dcebef90d nvme-pci: remove an extra queue reference
Now that blk_mq_destroy_queue does not release the queue reference, there
is no need for a second admin queue reference to be held by the nvme_dev.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20221018135720.670094-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-25 08:25:35 -06:00
Christoph Hellwig
2b3f056f72 blk-mq: move the call to blk_put_queue out of blk_mq_destroy_queue
The fact that blk_mq_destroy_queue also drops a queue reference leads
to various places having to grab an extra reference.  Move the call to
blk_put_queue into the callers to allow removing the extra references.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20221018135720.670094-2-hch@lst.de
[axboe: fix fabrics_q vs admin_q conflict in nvme core.c]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-25 08:25:10 -06:00
Linus Torvalds
d4b7332eef block-6.1-2022-10-20
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmNSA3oQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpoTfEACwFAyQsAGMVVsTmU/XggFsNRczTKH+J7dO
 4tBhOhfrVEWyDbB++3kUepN0GenSiu3dvcAzWQbX0lPiLjkqFWvpj2GIrqHwXJCt
 Li8SPoPsQARBvJYUERmdlQDUeEKCK/IwOREQSH4LZaNVvU2l0MSAqjihUjmcXHBE
 OeCWwYLJ27Bo0j5py9xSo7Z1C84YtXHDKAbx1nVhX6ByyCvpSGLLunrwT7oeEjT7
 FAbVvL4p5xwZe5e3XiiC5J7g85BkIZGwhqViYrVyHPWrUVWycSvXQdqXGGHiSqlj
 g3jcT6nLNBscAPuWnnP4jobaFVjm1Lr3eXtVX1k66kRwrxVcB+tZGszvjh4cuneO
 N23OjrRjtGIXenSyOslCxMAIxgMjsxeuuoQb66Js9g8bA9zJeKPcpKWbXplkaXoH
 VsT23W24Xp72MkBAz9nP9sHXbQEAkfNGK+6qZLBpkb8bJsymc+4Vn6lpV6ybl8WF
 LJWouLss2/I+G/D/WLvVcygvefQxVSLDFjWcU8BcJjDkdXgQWUjZfcZEsRE2v0cr
 tLZOh1WhHgVwTBCZxEus+QoAE1tXn2C+xhMu0uqRQ7zlngy5AycM6eSJtQ5Lwm2H
 91MSYoUvGkOx5Hqp265AnZEBayOg+xnsp7qMZghoQqLUc5yU4zyUbvxPi0vNuduJ
 mbys/sxVsg==
 =YfkM
 -----END PGP SIGNATURE-----

Merge tag 'block-6.1-2022-10-20' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:

 - NVMe pull request via Christoph:
      - fix nvme-hwmon for DMA non-cohehrent architectures (Serge Semin)
      - add a nvme-hwmong maintainer (Christoph Hellwig)
      - fix error pointer dereference in error handling (Dan Carpenter)
      - fix invalid memory reference in nvmet_subsys_attr_qid_max_show
        (Daniel Wagner)
      - don't limit the DMA segment size in nvme-apple (Russell King)
      - fix workqueue MEM_RECLAIM flushing dependency (Sagi Grimberg)
      - disable write zeroes on various Kingston SSDs (Xander Li)

 - fix a memory leak with block device tracing (Ye)

 - flexible-array fix for ublk (Yushan)

 - document the ublk recovery feature from this merge window
   (ZiyangZhang)

 - remove dead bfq variable in struct (Yuwei)

 - error handling rq clearing fix (Yu)

 - add an IRQ safety check for the cached bio freeing (Pavel)

 - drbd bio cloning fix (Christoph)

* tag 'block-6.1-2022-10-20' of git://git.kernel.dk/linux:
  blktrace: remove unnessary stop block trace in 'blk_trace_shutdown'
  blktrace: fix possible memleak in '__blk_trace_remove'
  blktrace: introduce 'blk_trace_{start,stop}' helper
  bio: safeguard REQ_ALLOC_CACHE bio put
  block, bfq: remove unused variable for bfq_queue
  drbd: only clone bio if we have a backing device
  ublk_drv: use flexible-array member instead of zero-length array
  nvmet: fix invalid memory reference in nvmet_subsys_attr_qid_max_show
  nvmet: fix workqueue MEM_RECLAIM flushing dependency
  nvme-hwmon: kmalloc the NVME SMART log buffer
  nvme-hwmon: consistently ignore errors from nvme_hwmon_init
  nvme: add Guenther as nvme-hwmon maintainer
  nvme-apple: don't limit DMA segement size
  nvme-pci: disable write zeroes on various Kingston SSD
  nvme: fix error pointer dereference in error handling
  Documentation: document ublk user recovery feature
  blk-mq: fix null pointer dereference in blk_mq_clear_rq_mapping()
2022-10-21 15:14:14 -07:00
Daniel Wagner
94f5a06884 nvmet: fix invalid memory reference in nvmet_subsys_attr_qid_max_show
The item passed into nvmet_subsys_attr_qid_max_show is not a member of
struct nvmet_port, it is part of nvmet_subsys.  Hence, don't try to
dereference it as struct nvme_ctrl pointer.

Fixes: 3e980f5995 ("nvmet: Expose max queues to configfs")
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Link: https://lore.kernel.org/r/20220913064203.133536-1-dwagner@suse.de
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-10-19 12:43:13 +02:00
Sagi Grimberg
ddd2b8de9f nvmet: fix workqueue MEM_RECLAIM flushing dependency
The keep alive timer needs to stay on nvmet_wq, and not
modified to reschedule on the system_wq.

This fixes a warning:
------------[ cut here ]------------
workqueue: WQ_MEM_RECLAIM
nvmet-wq:nvmet_rdma_release_queue_work [nvmet_rdma] is flushing
!WQ_MEM_RECLAIM events:nvmet_keep_alive_timer [nvmet]
WARNING: CPU: 3 PID: 1086 at kernel/workqueue.c:2628
check_flush_dependency+0x16c/0x1e0

Reported-by: Yi Zhang <yi.zhang@redhat.com>
Fixes: 8832cf9221 ("nvmet: use a private workqueue instead of the system workqueue")
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-10-19 12:43:13 +02:00
Serge Semin
c94b7f9bab nvme-hwmon: kmalloc the NVME SMART log buffer
Recent commit 52fde2c07d ("nvme: set dma alignment to dword") has
caused a regression on our platform.

It turned out that the nvme_get_log() method invocation caused the
nvme_hwmon_data structure instance corruption.  In particular the
nvme_hwmon_data.ctrl pointer was overwritten either with zeros or with
garbage.  After some research we discovered that the problem happened
even before the actual NVME DMA execution, but during the buffer mapping.
Since our platform is DMA-noncoherent, the mapping implied the cache-line
invalidations or write-backs depending on the DMA-direction parameter.
In case of the NVME SMART log getting the DMA was performed
from-device-to-memory, thus the cache-invalidation was activated during
the buffer mapping.  Since the log-buffer isn't cache-line aligned, the
cache-invalidation caused the neighbour data to be discarded.  The
neighbouring data turned to be the data surrounding the buffer in the
framework of the nvme_hwmon_data structure.

In order to fix that we need to make sure that the whole log-buffer is
defined within the cache-line-aligned memory region so the
cache-invalidation procedure wouldn't involve the adjacent data. One of
the option to guarantee that is to kmalloc the DMA-buffer [1]. Seeing the
rest of the NVME core driver prefer that method it has been chosen to fix
this problem too.

Note after a deeper researches we found out that the denoted commit wasn't
a root cause of the problem. It just revealed the invalidity by activating
the DMA-based NVME SMART log getting performed in the framework of the
NVME hwmon driver. The problem was here since the initial commit of the
driver.

[1] Documentation/core-api/dma-api-howto.rst

Fixes: 400b6a7b13 ("nvme: Add hardware monitoring support")
Signed-off-by: Serge Semin <Sergey.Semin@baikalelectronics.ru>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-10-19 12:43:13 +02:00
Christoph Hellwig
6b8cf94005 nvme-hwmon: consistently ignore errors from nvme_hwmon_init
An NVMe controller works perfectly fine even when the hwmon
initialization fails.  Stop returning errors that do not come from a
controller reset from nvme_hwmon_init to handle this case consistently.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Serge Semin <fancer.lancer@gmail.com>
2022-10-19 12:42:58 +02:00
Russell King (Oracle)
d622f8477a nvme-apple: don't limit DMA segement size
NVMe uses PRPs for data transfers and has no specific limit for a single
DMA segement.  Limiting the size will cause problems because the block
layer assumes PRP-ish devices using a virt boundary mask don't have a
segment limit.  And while this is true, we also really need to tell the
DMA mapping layer about it, otherwise dma-debug will trip over it.

Fixes: 5bd2927ace ("nvme-apple: Add initial Apple SoC NVMe driver")
Suggested-by: Sven Peter <sven@svenpeter.dev>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
[hch: rewrote the commit message based on the PCIe commit]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Curtin <ecurtin@redhat.com>
Reviewed-by: Sven Peter <sven@svenpeter.dev>
2022-10-19 10:36:39 +02:00
Xander Li
ac9b57d4e1 nvme-pci: disable write zeroes on various Kingston SSD
Kingston SSDs do support NVMe Write_Zeroes cmd but take long time to
process.  The firmware version is locked by these SSDs, we can not expect
firmware improvement, so disable Write_Zeroes cmd.

Signed-off-by: Xander Li <xander_li@kingston.com.tw>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-10-19 10:36:39 +02:00
Dan Carpenter
4739824e2d nvme: fix error pointer dereference in error handling
There is typo here so it releases the wrong variable.  "ctrl->admin_q"
was intended instead of "ctrl->fabrics_q".

Fixes: fe60e8c534 ("nvme: add common helpers to allocate and free tagsets")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-10-19 10:36:39 +02:00
Linus Torvalds
f1947d7c8a Random number generator fixes for Linux 6.1-rc1.
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEq5lC5tSkz8NBJiCnSfxwEqXeA64FAmNHYD0ACgkQSfxwEqXe
 A655AA//dJK0PdRghqrKQsl18GOCffV5TUw5i1VbJQbI9d8anfxNjVUQiNGZi4et
 qUwZ8OqVXxYx1Z1UDgUE39PjEDSG9/cCvOpMUWqN20/+6955WlNZjwA7Fk6zjvlM
 R30fz5CIJns9RFvGT4SwKqbVLXIMvfg/wDENUN+8sxt36+VD2gGol7J2JJdngEhM
 lW+zqzi0ABqYy5so4TU2kixpKmpC08rqFvQbD1GPid+50+JsOiIqftDErt9Eg1Mg
 MqYivoFCvbAlxxxRh3+UHBd7ZpJLtp1UFEOl2Rf00OXO+ZclLCAQAsTczucIWK9M
 8LCZjb7d4lPJv9RpXFAl3R1xvfc+Uy2ga5KeXvufZtc5G3aMUKPuIU7k28ZyblVS
 XXsXEYhjTSd0tgi3d0JlValrIreSuj0z2QGT5pVcC9utuAqAqRIlosiPmgPlzXjr
 Us4jXaUhOIPKI+Musv/fqrxsTQziT0jgVA3Njlt4cuAGm/EeUbLUkMWwKXjZLTsv
 vDsBhEQFmyZqxWu4pYo534VX2mQWTaKRV1SUVVhQEHm57b00EAiZohoOvweB09SR
 4KiJapikoopmW4oAUFotUXUL1PM6yi+MXguTuc1SEYuLz/tCFtK8DJVwNpfnWZpE
 lZKvXyJnHq2Sgod/hEZq58PMvT6aNzTzSg7YzZy+VabxQGOO5mc=
 =M+mV
 -----END PGP SIGNATURE-----

Merge tag 'random-6.1-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random

Pull more random number generator updates from Jason Donenfeld:
 "This time with some large scale treewide cleanups.

  The intent of this pull is to clean up the way callers fetch random
  integers. The current rules for doing this right are:

   - If you want a secure or an insecure random u64, use get_random_u64()

   - If you want a secure or an insecure random u32, use get_random_u32()

     The old function prandom_u32() has been deprecated for a while
     now and is just a wrapper around get_random_u32(). Same for
     get_random_int().

   - If you want a secure or an insecure random u16, use get_random_u16()

   - If you want a secure or an insecure random u8, use get_random_u8()

   - If you want secure or insecure random bytes, use get_random_bytes().

     The old function prandom_bytes() has been deprecated for a while
     now and has long been a wrapper around get_random_bytes()

   - If you want a non-uniform random u32, u16, or u8 bounded by a
     certain open interval maximum, use prandom_u32_max()

     I say "non-uniform", because it doesn't do any rejection sampling
     or divisions. Hence, it stays within the prandom_*() namespace, not
     the get_random_*() namespace.

     I'm currently investigating a "uniform" function for 6.2. We'll see
     what comes of that.

  By applying these rules uniformly, we get several benefits:

   - By using prandom_u32_max() with an upper-bound that the compiler
     can prove at compile-time is ≤65536 or ≤256, internally
     get_random_u16() or get_random_u8() is used, which wastes fewer
     batched random bytes, and hence has higher throughput.

   - By using prandom_u32_max() instead of %, when the upper-bound is
     not a constant, division is still avoided, because
     prandom_u32_max() uses a faster multiplication-based trick instead.

   - By using get_random_u16() or get_random_u8() in cases where the
     return value is intended to indeed be a u16 or a u8, we waste fewer
     batched random bytes, and hence have higher throughput.

  This series was originally done by hand while I was on an airplane
  without Internet. Later, Kees and I worked on retroactively figuring
  out what could be done with Coccinelle and what had to be done
  manually, and then we split things up based on that.

  So while this touches a lot of files, the actual amount of code that's
  hand fiddled is comfortably small"

* tag 'random-6.1-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random:
  prandom: remove unused functions
  treewide: use get_random_bytes() when possible
  treewide: use get_random_u32() when possible
  treewide: use get_random_{u8,u16}() when possible, part 2
  treewide: use get_random_{u8,u16}() when possible, part 1
  treewide: use prandom_u32_max() when possible, part 2
  treewide: use prandom_u32_max() when possible, part 1
2022-10-16 15:27:07 -07:00
Sagi Grimberg
72e3b8883a nvme-multipath: fix possible hang in live ns resize with ANA access
When we revalidate paths as part of ns size change (as of commit
e7d65803e2), it is possible that during the path revalidation, the
only paths that is IO capable (i.e. optimized/non-optimized) are the
ones that ns resize was not yet informed to the host, which will cause
inflight requests to be requeued (as we have available paths but none
are IO capable). These requests on the requeue list are waiting for
someone to resubmit them at some point.

The IO capable paths will eventually notify the ns resize change to the
host, but there is nothing that will kick the requeue list to resubmit
the queued requests.

Fix this by always kicking the requeue list, and if no IO capable path
exists, these requests will be queued again.

A typical log that indicates that IOs are requeued:
--
nvme nvme1: creating 4 I/O queues.
nvme nvme1: new ctrl: "testnqn1"
nvme nvme2: creating 4 I/O queues.
nvme nvme2: mapped 4/0/0 default/read/poll queues.
nvme nvme2: new ctrl: NQN "testnqn1", addr 127.0.0.1:8009
nvme nvme1: rescanning namespaces.
nvme1n1: detected capacity change from 2097152 to 4194304
block nvme1n1: no usable path - requeuing I/O
block nvme1n1: no usable path - requeuing I/O
block nvme1n1: no usable path - requeuing I/O
block nvme1n1: no usable path - requeuing I/O
block nvme1n1: no usable path - requeuing I/O
block nvme1n1: no usable path - requeuing I/O
block nvme1n1: no usable path - requeuing I/O
block nvme1n1: no usable path - requeuing I/O
block nvme1n1: no usable path - requeuing I/O
block nvme1n1: no usable path - requeuing I/O
nvme nvme2: rescanning namespaces.
--

Reported-by: Yogev Cohen <yogev@lightbitslabs.com>
Fixes: e7d65803e2 ("nvme-multipath: revalidate paths during rescan")
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Cc: <stable@vger.kernel.org> # v5.15+
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-10-12 11:42:58 +02:00
Xi Ruoyao
d5d3c100ac nvme-pci: avoid the deepest sleep state on ZHITAI TiPro5000 SSDs
ZHITAI TiPro5000 SSDs has the same APST sleep problem as its cousin,
TiPro7000.  The quirk for TiPro7000 has been added in
commit 6b961bce50 ("nvme-pci: avoid the deepest sleep state on
ZHITAI TiPro7000 SSDs"), use the same quirk for TiPro5000.

The ASPT data from "nvme id-ctrl /dev/nvme1":

vid       : 0x1e49
ssvid     : 0x1e49
sn        : ZTA21T0KA2227304LM
mn        : ZHITAI TiPlus5000 1TB
fr        : ZTA09139
[...]
ps    0 : mp:6.50W operational enlat:0 exlat:0 rrt:0 rrl:0
         rwt:0 rwl:0 idle_power:- active_power:-
ps    1 : mp:5.80W operational enlat:0 exlat:0 rrt:1 rrl:1
         rwt:1 rwl:1 idle_power:- active_power:-
ps    2 : mp:3.60W operational enlat:0 exlat:0 rrt:2 rrl:2
         rwt:2 rwl:2 idle_power:- active_power:-
ps    3 : mp:0.0500W non-operational enlat:5000 exlat:10000 rrt:3 rrl:3
         rwt:3 rwl:3 idle_power:- active_power:-
ps    4 : mp:0.0025W non-operational enlat:8000 exlat:45000 rrt:4 rrl:4
         rwt:4 rwl:4 idle_power:- active_power:-

Reported-and-tested-by: Chang Feng <flukehn@gmail.com>
Signed-off-by: Xi Ruoyao <xry111@xry111.site>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-10-12 11:42:58 +02:00
Abhijit
80b2624094 nvme-pci: add NVME_QUIRK_BOGUS_NID for Lexar NM760
Add a quirk to fix Lexar NM760 SSD drives reporting duplicate nsids.

Signed-off-by: Abhijit <abhijit@abhijittomar.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-10-12 11:42:45 +02:00
Sagi Grimberg
c4abd87571 nvme-tcp: fix possible hang caused during ctrl deletion
When we delete a controller, we execute the following:
1. nvme_stop_ctrl() - stop some work elements that may be
	inflight or scheduled (specifically also .stop_ctrl
	which cancels ctrl error recovery work)
2. nvme_remove_namespaces() - which first flushes scan_work
	to avoid competing ns addition/removal
3. continue to teardown the controller

However, if err_work was scheduled to run in (1), it is designed to
cancel any inflight I/O, particularly I/O that is originating from ns
scan_work in (2), but because it is cancelled in .stop_ctrl(), we can
prevent forward progress of (2) as ns scanning is blocking on I/O
(that will never be cancelled).

The race is:
1. transport layer error observed -> err_work is scheduled
2. scan_work executes, discovers ns, generate I/O to it
3. nvme_ctop_ctrl() -> .stop_ctrl() -> cancel_work_sync(err_work)
   - err_work never executed
4. nvme_remove_namespaces() -> flush_work(scan_work)
--> deadlock, because scan_work is blocked on I/O that was supposed
to be cancelled by err_work, but was cancelled before executing (see
stack trace [1]).

Fix this by flushing err_work instead of cancelling it, to force it
to execute and cancel all inflight I/O.

[1]:
--
Call Trace:
 <TASK>
 __schedule+0x390/0x910
 ? scan_shadow_nodes+0x40/0x40
 schedule+0x55/0xe0
 io_schedule+0x16/0x40
 do_read_cache_page+0x55d/0x850
 ? __page_cache_alloc+0x90/0x90
 read_cache_page+0x12/0x20
 read_part_sector+0x3f/0x110
 amiga_partition+0x3d/0x3e0
 ? osf_partition+0x33/0x220
 ? put_partition+0x90/0x90
 bdev_disk_changed+0x1fe/0x4d0
 blkdev_get_whole+0x7b/0x90
 blkdev_get_by_dev+0xda/0x2d0
 device_add_disk+0x356/0x3b0
 nvme_mpath_set_live+0x13c/0x1a0 [nvme_core]
 ? nvme_parse_ana_log+0xae/0x1a0 [nvme_core]
 nvme_update_ns_ana_state+0x3a/0x40 [nvme_core]
 nvme_mpath_add_disk+0x120/0x160 [nvme_core]
 nvme_alloc_ns+0x594/0xa00 [nvme_core]
 nvme_validate_or_alloc_ns+0xb9/0x1a0 [nvme_core]
 ? __nvme_submit_sync_cmd+0x1d2/0x210 [nvme_core]
 nvme_scan_work+0x281/0x410 [nvme_core]
 process_one_work+0x1be/0x380
 worker_thread+0x37/0x3b0
 ? process_one_work+0x380/0x380
 kthread+0x12d/0x150
 ? set_kthread_struct+0x50/0x50
 ret_from_fork+0x1f/0x30
 </TASK>
INFO: task nvme:6725 blocked for more than 491 seconds.
      Not tainted 5.15.65-f0.el7.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:nvme            state:D
 stack:    0 pid: 6725 ppid:  1761 flags:0x00004000
Call Trace:
 <TASK>
 __schedule+0x390/0x910
 ? sched_clock+0x9/0x10
 schedule+0x55/0xe0
 schedule_timeout+0x24b/0x2e0
 ? try_to_wake_up+0x358/0x510
 ? finish_task_switch+0x88/0x2c0
 wait_for_completion+0xa5/0x110
 __flush_work+0x144/0x210
 ? worker_attach_to_pool+0xc0/0xc0
 flush_work+0x10/0x20
 nvme_remove_namespaces+0x41/0xf0 [nvme_core]
 nvme_do_delete_ctrl+0x47/0x66 [nvme_core]
 nvme_sysfs_delete.cold.96+0x8/0xd [nvme_core]
 dev_attr_store+0x14/0x30
 sysfs_kf_write+0x38/0x50
 kernfs_fop_write_iter+0x146/0x1d0
 new_sync_write+0x114/0x1b0
 ? intel_pmu_handle_irq+0xe0/0x420
 vfs_write+0x18d/0x270
 ksys_write+0x61/0xe0
 __x64_sys_write+0x1a/0x20
 do_syscall_64+0x37/0x90
 entry_SYSCALL_64_after_hwframe+0x61/0xcb
--

Fixes: 3f2304f8c6 ("nvme-tcp: add NVMe over TCP host driver")
Reported-by: Jonathan Nicklin <jnicklin@blockbridge.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Tested-by: Jonathan Nicklin <jnicklin@blockbridge.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-10-12 11:35:46 +02:00
Sagi Grimberg
a1ae8d4d9b nvme-rdma: fix possible hang caused during ctrl deletion
When we delete a controller, we execute the following:
1. nvme_stop_ctrl() - stop some work elements that may be
        inflight or scheduled (specifically also .stop_ctrl
        which cancels ctrl error recovery work)
2. nvme_remove_namespaces() - which first flushes scan_work
        to avoid competing ns addition/removal
3. continue to teardown the controller

However, if err_work was scheduled to run in (1), it is designed to
cancel any inflight I/O, particularly I/O that is originating from ns
scan_work in (2), but because it is cancelled in .stop_ctrl(), we can
prevent forward progress of (2) as ns scanning is blocking on I/O
(that will never be cancelled).

The race is:
1. transport layer error observed -> err_work is scheduled
2. scan_work executes, discovers ns, generate I/O to it
3. nvme_ctop_ctrl() -> .stop_ctrl() -> cancel_work_sync(err_work)
   - err_work never executed
4. nvme_remove_namespaces() -> flush_work(scan_work)
--> deadlock, because scan_work is blocked on I/O that was supposed
to be cancelled by err_work, but was cancelled before executing.

Fix this by flushing err_work instead of cancelling it, to force it
to execute and cancel all inflight I/O.

Fixes: b435ecea2a ("nvme: Add .stop_ctrl to nvme ctrl ops")
Fixes: f6c8e432cb ("nvme: flush namespace scanning work just before removing namespaces")
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-10-12 11:35:46 +02:00
Jason A. Donenfeld
a251c17aa5 treewide: use get_random_u32() when possible
The prandom_u32() function has been a deprecated inline wrapper around
get_random_u32() for several releases now, and compiles down to the
exact same code. Replace the deprecated wrapper with a direct call to
the real function. The same also applies to get_random_int(), which is
just a wrapper around get_random_u32(). This was done as a basic find
and replace.

Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Yury Norov <yury.norov@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz> # for ext4
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk> # for sch_cake
Acked-by: Chuck Lever <chuck.lever@oracle.com> # for nfsd
Acked-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com> # for thunderbolt
Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs
Acked-by: Helge Deller <deller@gmx.de> # for parisc
Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-10-11 17:42:58 -06:00
Linus Torvalds
7c989b1da3 for-6.1/passthrough-2022-10-04
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmM8rp4QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpjTHD/9eeWwaG7oSSu5J1YzkKn+hptaDzZwreL98
 Mh8euiQScUVpvHGkNowBhjBZ5cIAAcYaH17rjW7dWu6A7tv/iqygWd/YvIbs1JOe
 STSD9yf0RV4dI0MG6Wu2w6YxObaLvE5BTRxqb/WuFWNTgsYf2HEp4PM9sTio71+H
 WwWdRvsIxsRxVYemds3vBxd+BcM8vm26EoUTSaCwRhfopaJwBNceCYIIrM7VHUNM
 5G6+DJkm3mB1a8nsdguYZQC/y8F/9P5Ch9CdxA12yOZEryr3wzsyRNGdm7oRmFGM
 bAkjFcddhwk5+SuTzGX6t4/Z3ODIjeCXbMBg4p7AShHws4Yx1trJePiqoNQ8xd5A
 PkMfxhQpBPlDFKLmwtObPLInyzMpp5P8KYMIZfyymKD/+XjmqAlR6TXbFUTihzBU
 lHSFhwG8ysT2cAVrFBMDJu4UPIThIHqfkkF/nTkHePTSArJ/k5rGV7v5sQpZ+jtY
 R0gvoNHTq2IvgKGEEbTgDjpwVcCn5ERVorZuGjVN2nMdLj35kXpo7YNgyYMaD5LJ
 9SOR5a8iQjjudAfdGyZCGzNaOecizVFjABozUYc1XJi/boNuFTsq4XCE/tCLTixc
 V4sElRpgrlXxNXkiVdbuWIPuYo4sDw5gqZQynpVNH5PkmX/NqmpWYVEWJ20o+pwg
 3ag39nZQVQ==
 =nwLk
 -----END PGP SIGNATURE-----

Merge tag 'for-6.1/passthrough-2022-10-04' of git://git.kernel.dk/linux

Pull passthrough updates from Jens Axboe:
 "With these changes, passthrough NVMe support over io_uring now
  performs at the same level as block device O_DIRECT, and in many cases
  6-8% better.

  This contains:

   - Add support for fixed buffers for passthrough (Anuj, Kanchan)

   - Enable batched allocations and freeing on passthrough, similarly to
     what we support on the normal storage path (me)

   - Fix from Geert fixing an issue with !CONFIG_IO_URING"

* tag 'for-6.1/passthrough-2022-10-04' of git://git.kernel.dk/linux:
  io_uring: Add missing inline to io_uring_cmd_import_fixed() dummy
  nvme: wire up fixed buffer support for nvme passthrough
  nvme: pass ubuffer as an integer
  block: extend functionality to map bvec iterator
  block: factor out blk_rq_map_bio_alloc helper
  block: rename bio_map_put to blk_mq_map_bio_put
  nvme: refactor nvme_alloc_request
  nvme: refactor nvme_add_user_metadata
  nvme: Use blk_rq_map_user_io helper
  scsi: Use blk_rq_map_user_io helper
  block: add blk_rq_map_user_io
  io_uring: introduce fixed buffer support for io_uring_cmd
  io_uring: add io_uring_cmd_import_fixed
  nvme: enable batched completions of passthrough IO
  nvme: split out metadata vs non metadata end_io uring_cmd completions
  block: allow end_io based requests in the completion batch handling
  block: change request end_io handler to pass back a return value
  block: enable batched allocation for blk_mq_alloc_request()
  block: kill deprecated BUG_ON() in the flush handling
2022-10-07 09:35:50 -07:00
Linus Torvalds
513389809e for-6.1/block-2022-10-03
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmM67XkQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpiHoD/9eN+6YnNRPu5+2zeGnnm1Nlwic6YMZeORr
 KFIeC0COMWoFhNBIPFkgAKT+0qIH+uGt5UsHSM3Y5La7wMR8yLxD4PAnvTZ/Ijtt
 yxVIOmonJoQ0OrQ2kTbvDXL/9OCUrzwXXyUIEPJnH0Ca1mxeNOgDHbE7VGF6DMul
 0D3pI8qs2WLnHlDi1V/8kH5qZ6WoAJSDcb8sTzOUVnyveZPNaZhGQJuHA2XAYMtg
 fqKMDJqgmNk6jdTMUgdF5B+rV64PQoCy28I7fXqGkEe+RE5TBy57vAa0XY84V8XR
 /a8CEuwMts2ypk1hIcJG8Vv8K6u5war9yPM5MTngKsoMpzNIlhrhaJQVyjKdcs+E
 Ixwzexu6xTYcrcq+mUARgeTh79FzTBM/uXEdbCG2G3S6HPd6UZWUJZGfxw/l0Aem
 V4xB7lj6SQaJDU1iJCYUaHcekNXhQAPvyVG+R2ED1SO3McTpTPIM1aeigxw6vj7u
 bH3Kfdr94Z8HNuoLuiS6YYfjNt2Shf4LEB6GxKJ9TYHtyhdOyO0H64jGHpygrWqN
 cSnkWPUqUUNpF7srKM0ZgbliCshvmyJc4aMOFd0gBY/kXf5J/j7IXvh8TFCi9rHH
 0KyZH3/3Zsu9geUn3ynznlr4FXU+BcqE6boaa/iWb9sN1m+Rvaahv8cSch/dh44a
 vQNj/iOBQA==
 =R05e
 -----END PGP SIGNATURE-----

Merge tag 'for-6.1/block-2022-10-03' of git://git.kernel.dk/linux

Pull block updates from Jens Axboe:

 - NVMe pull requests via Christoph:
      - handle number of queue changes in the TCP and RDMA drivers
        (Daniel Wagner)
      - allow changing the number of queues in nvmet (Daniel Wagner)
      - also consider host_iface when checking ip options (Daniel
        Wagner)
      - don't map pages which can't come from HIGHMEM (Fabio M. De
        Francesco)
      - avoid unnecessary flush bios in nvmet (Guixin Liu)
      - shrink and better pack the nvme_iod structure (Keith Busch)
      - add comment for unaligned "fake" nqn (Linjun Bao)
      - print actual source IP address through sysfs "address" attr
        (Martin Belanger)
      - various cleanups (Jackie Liu, Wolfram Sang, Genjian Zhang)
      - handle effects after freeing the request (Keith Busch)
      - copy firmware_rev on each init (Keith Busch)
      - restrict management ioctls to admin (Keith Busch)
      - ensure subsystem reset is single threaded (Keith Busch)
      - report the actual number of tagset maps in nvme-pci (Keith
        Busch)
      - small fabrics authentication fixups (Christoph Hellwig)
      - add common code for tagset allocation and freeing (Christoph
        Hellwig)
      - stop using the request_queue in nvmet (Christoph Hellwig)
      - set min_align_mask before calculating max_hw_sectors (Rishabh
        Bhatnagar)
      - send a rediscover uevent when a persistent discovery controller
        reconnects (Sagi Grimberg)
      - misc nvmet-tcp fixes (Varun Prakash, zhenwei pi)

 - MD pull request via Song:
      - Various raid5 fix and clean up, by Logan Gunthorpe and David
        Sloan.
      - Raid10 performance optimization, by Yu Kuai.

 - sbitmap wakeup hang fixes (Hugh, Keith, Jan, Yu)

 - IO scheduler switching quisce fix (Keith)

 - s390/dasd block driver updates (Stefan)

 - support for recovery for the ublk driver (ZiyangZhang)

 - rnbd drivers fixes and updates (Guoqing, Santosh, ye, Christoph)

 - blk-mq and null_blk map fixes (Bart)

 - various bcache fixes (Coly, Jilin, Jules)

 - nbd signal hang fix (Shigeru)

 - block writeback throttling fix (Yu)

 - optimize the passthrough mapping handling (me)

 - prepare block cgroups to being gendisk based (Christoph)

 - get rid of an old PSI hack in the block layer, moving it to the
   callers instead where it belongs (Christoph)

 - blk-throttle fixes and cleanups (Yu)

 - misc fixes and cleanups (Liu Shixin, Liu Song, Miaohe, Pankaj,
   Ping-Xiang, Wolfram, Saurabh, Li Jinlin, Li Lei, Lin, Li zeming,
   Miaohe, Bart, Coly, Gaosheng

* tag 'for-6.1/block-2022-10-03' of git://git.kernel.dk/linux: (162 commits)
  sbitmap: fix lockup while swapping
  block: add rationale for not using blk_mq_plug() when applicable
  block: adapt blk_mq_plug() to not plug for writes that require a zone lock
  s390/dasd: use blk_mq_alloc_disk
  blk-cgroup: don't update the blkg lookup hint in blkg_conf_prep
  nvmet: don't look at the request_queue in nvmet_bdev_set_limits
  nvmet: don't look at the request_queue in nvmet_bdev_zone_mgmt_emulate_all
  blk-mq: use quiesced elevator switch when reinitializing queues
  block: replace blk_queue_nowait with bdev_nowait
  nvme: remove nvme_ctrl_init_connect_q
  nvme-loop: use the tagset alloc/free helpers
  nvme-loop: store the generic nvme_ctrl in set->driver_data
  nvme-loop: initialize sqsize later
  nvme-fc: use the tagset alloc/free helpers
  nvme-fc: store the generic nvme_ctrl in set->driver_data
  nvme-fc: keep ctrl->sqsize in sync with opts->queue_size
  nvme-rdma: use the tagset alloc/free helpers
  nvme-rdma: store the generic nvme_ctrl in set->driver_data
  nvme-tcp: use the tagset alloc/free helpers
  nvme-tcp: store the generic nvme_ctrl in set->driver_data
  ...
2022-10-07 09:19:14 -07:00
Linus Torvalds
0a78a376ef for-6.1/io_uring-2022-10-03
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmM67S0QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgppnPEACkBzilBLKwT9MWdUAITwyrMXsAa1R9gsR9
 Tb3Xs+mNO2meuycLAUh4LIbb28NNr7/S5rwWet5NRZ71hgv4Q/WA/0EemAGGXYqd
 +3MEBAWU3FBFkC/cJXCnT8F5yCXYRkT5n/hzCSNEpNKjQ5JnAhHDlWAjgzZRuD/A
 A+YJjoBVJJuI1wY4I5XCpeQXEmg/Wc1MgXfyHgWVtGKnYrrxibiCnBZnqbAMZNvD
 hGn1Vl02ooamGTFm/nW/OAt71DtqsjWUCVOHKmlZ+zBUjbUj6FMXmPVV7vCV9o2w
 PT4Dx3CTc2iXwa8KfEFNPvXBzy0Qfu8edweP/MvZHWHVZREpEAh4cG6GhwW8whD+
 5mPisqmRjZKe0BBS4k/wKN1RXEypSQoTU4EdljfbQPU/usn35lmjMmEXXgs3IhqM
 fcTdO5ZUOp+CGyzI0Bc7UtS8vilJbX9ynN8G80MUUAZzuQg39MH7lNQYSJSSvJfU
 OlvzmL3lhRLYM1s/KKiZzdDBoMvC7R4oHmzCveOjQTMIHf6WNyqKFlrWScq2wzpN
 oRxqt0xiVQ3PFMmFj6N08f145qtbASuF3sKv7dbU3QXTsCAos3wdTdX+PejYApEZ
 W3dr0TDjNBicLNVPiSj132p0ZRtdZvLGuGVkBD4GPQeH2NwswxMHQAfz8e2lqmA4
 9bWG6BM7Yw==
 =m9kX
 -----END PGP SIGNATURE-----

Merge tag 'for-6.1/io_uring-2022-10-03' of git://git.kernel.dk/linux

Pull io_uring updates from Jens Axboe:

 - Add supported for more directly managed task_work running.

   This is beneficial for real world applications that end up issuing
   lots of system calls as part of handling work. Normal task_work will
   always execute as we transition in and out of the kernel, even for
   "unrelated" system calls. It's more efficient to defer the handling
   of io_uring's deferred work until the application wants it to be run,
   generally in batches.

   As part of ongoing work to write an io_uring network backend for
   Thrift, this has been shown to greatly improve performance. (Dylan)

 - Add IOPOLL support for passthrough (Kanchan)

 - Improvements and fixes to the send zero-copy support (Pavel)

 - Partial IO handling fixes (Pavel)

 - CQE ordering fixes around CQ ring overflow (Pavel)

 - Support sendto() for non-zc as well (Pavel)

 - Support sendmsg for zerocopy (Pavel)

 - Networking iov_iter fix (Stefan)

 - Misc fixes and cleanups (Pavel, me)

* tag 'for-6.1/io_uring-2022-10-03' of git://git.kernel.dk/linux: (56 commits)
  io_uring/net: fix notif cqe reordering
  io_uring/net: don't update msg_name if not provided
  io_uring: don't gate task_work run on TIF_NOTIFY_SIGNAL
  io_uring/rw: defer fsnotify calls to task context
  io_uring/net: fix fast_iov assignment in io_setup_async_msg()
  io_uring/net: fix non-zc send with address
  io_uring/net: don't skip notifs for failed requests
  io_uring/rw: don't lose short results on io_setup_async_rw()
  io_uring/rw: fix unexpected link breakage
  io_uring/net: fix cleanup double free free_iov init
  io_uring: fix CQE reordering
  io_uring/net: fix UAF in io_sendrecv_fail()
  selftest/net: adjust io_uring sendzc notif handling
  io_uring: ensure local task_work marks task as running
  io_uring/net: zerocopy sendmsg
  io_uring/net: combine fail handlers
  io_uring/net: rename io_sendzc()
  io_uring/net: support non-zerocopy sendto
  io_uring/net: refactor io_setup_async_addr
  io_uring/net: don't lose partial send_zc on fail
  ...
2022-10-07 08:52:43 -07:00
Linus Torvalds
7bc6e90d7a block-6.0-2022-09-29
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmM2W7wQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgplm+EACplcKQA0fDsHOAyhqi6Ms7VK38kYAItlK2
 ztSeIQONBUtEuzSD6HA6W5+3gPPWRwXhNrIRMQp6rbKY8WS7jtTP/XqNISxjtvPx
 SMCDnDj8ANE8ykgLxiNECh87Q8D04wu3mY7Rs45aIowvXYxv8ALxGZ3QjFL/vDPT
 BHY8+EBwHbLRcpmI+J/eg1Qa9EfAixCTo+rimAucGe4qHVHmPawEjGY2hgmIcrdo
 k/WTelymDVkbrlRRcERWTFxjRkmS/f6lMQlEj3sPPsFkl2ka/l6BT135FOCJI/v4
 AEOpGjSoLTr0E3b4BCzu0MlU9WsTflinSgWKqJt2J4jyhuWjyCALv4ZrEIetKOZI
 NO+iEiTs2oO/dlGB3x0YFEieDACFRC6/leBAYgJIsG01jRBnQfU9/evxr25pK7kd
 iax+y0f/ilL0/aWR61YhJ67OH+aPLcgwJ1TUHD9N9BCcvq85rk46bcvmRf+ar9JZ
 4srCnybHN9yhkPkxeZKuTiNqJ3xZSesglNxxDMOfU9iWY1CjQX/aOoBuydJkJz4m
 gH8tyfEcvJxdhuLI3mvo7Mv2WUAyS2ohG4k1EAFZbPUa0pAHkyM9b1ZvSnbHWa6Z
 k6yKWmv8WLFKD2VY28HrsMHYBBVDCcJNsDzLqydaKFTU2g8YEQdk3h70Ga3tbhwS
 k23Pfs5bcw==
 =SZZ1
 -----END PGP SIGNATURE-----

Merge tag 'block-6.0-2022-09-29' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:
 "A single NVMe pull request via Christoph with a few fixes that should
  go into the 6.0 release:

   - Fix IOC_PR_CLEAR and IOC_PR_RELEASE ioctls for nvme devices
     (Michael Kelley)

   - Disable Write Zeroes on Phison E3C/E4C (Tina Hsu)"

* tag 'block-6.0-2022-09-29' of git://git.kernel.dk/linux:
  nvme-pci: disable Write Zeroes on Phison E3C/E4C
  nvme: Fix IOC_PR_CLEAR and IOC_PR_RELEASE ioctls for nvme devices
2022-09-30 09:33:33 -07:00
Kanchan Joshi
23fd22e55b nvme: wire up fixed buffer support for nvme passthrough
if io_uring sends passthrough command with IORING_URING_CMD_FIXED flag,
use the pre-registered buffer for IO (non-vectored variant). Pass the
buffer/length to io_uring and get the bvec iterator for the range. Next,
pass this bvec to block-layer and obtain a bio/request for subsequent
processing.

Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20220930062749.152261-13-anuj20.g@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-30 07:51:13 -06:00
Kanchan Joshi
4d17448682 nvme: pass ubuffer as an integer
This is a prep patch. Modify nvme_submit_user_cmd and
nvme_map_user_request to take ubuffer as plain integer
argument, and do away with nvme_to_user_ptr conversion in callers.

Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220930062749.152261-12-anuj20.g@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-30 07:51:13 -06:00
Kanchan Joshi
470e900c80 nvme: refactor nvme_alloc_request
nvme_alloc_request expects a large number of parameters.
Split this out into two functions to reduce number of parameters.
First one retains the name nvme_alloc_request, while second one is
named nvme_map_user_request.

Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220930062749.152261-8-anuj20.g@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-30 07:51:13 -06:00
Kanchan Joshi
38c0ddab7b nvme: refactor nvme_add_user_metadata
Pass struct request rather than bio. It helps to kill a parameter, and
some processing clean-up too.

Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220930062749.152261-7-anuj20.g@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-30 07:51:13 -06:00
Anuj Gupta
7f05635764 nvme: Use blk_rq_map_user_io helper
User blk_rq_map_user_io instead of duplicating the same code at
different places

Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Link: https://lore.kernel.org/r/20220930062749.152261-6-anuj20.g@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-30 07:51:13 -06:00
Jens Axboe
851eb780de nvme: enable batched completions of passthrough IO
Now that the normal passthrough end_io path doesn't need the request
anymore, we can kill the explicit blk_mq_free_request() and just pass
back RQ_END_IO_FREE instead. This enables the batched completion from
freeing batches of requests at the time.

This brings passthrough IO performance at least on par with bdev based
O_DIRECT with io_uring. With this and batche allocations, peak performance
goes from 110M IOPS to 122M IOPS. For IRQ based, passthrough is now also
about 10% faster than previously, going from ~61M to ~67M IOPS.

Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Co-developed-by: Stefan Roesch <shr@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-30 07:49:11 -06:00
Jens Axboe
c0a7ba77e8 nvme: split out metadata vs non metadata end_io uring_cmd completions
By splitting up the metadata and non-metadata end_io handling, we can
remove any request dependencies on the normal non-metadata IO path. This
is in preparation for enabling the normal IO passthrough path to pass
the ownership of the request back to the block layer.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Co-developed-by: Stefan Roesch <shr@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-30 07:49:11 -06:00
Jens Axboe
de671d6116 block: change request end_io handler to pass back a return value
Everything is just converted to returning RQ_END_IO_NONE, and there
should be no functional changes with this patch.

In preparation for allowing the end_io handler to pass ownership back
to the block layer, rather than retain ownership of the request.

Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-30 07:49:09 -06:00
Jens Axboe
5853a7b551 Merge branch 'for-6.1/io_uring' into for-6.1/passthrough
* for-6.1/io_uring: (56 commits)
  io_uring/net: fix notif cqe reordering
  io_uring/net: don't update msg_name if not provided
  io_uring: don't gate task_work run on TIF_NOTIFY_SIGNAL
  io_uring/rw: defer fsnotify calls to task context
  io_uring/net: fix fast_iov assignment in io_setup_async_msg()
  io_uring/net: fix non-zc send with address
  io_uring/net: don't skip notifs for failed requests
  io_uring/rw: don't lose short results on io_setup_async_rw()
  io_uring/rw: fix unexpected link breakage
  io_uring/net: fix cleanup double free free_iov init
  io_uring: fix CQE reordering
  io_uring/net: fix UAF in io_sendrecv_fail()
  selftest/net: adjust io_uring sendzc notif handling
  io_uring: ensure local task_work marks task as running
  io_uring/net: zerocopy sendmsg
  io_uring/net: combine fail handlers
  io_uring/net: rename io_sendzc()
  io_uring/net: support non-zerocopy sendto
  io_uring/net: refactor io_setup_async_addr
  io_uring/net: don't lose partial send_zc on fail
  ...
2022-09-30 07:47:42 -06:00
Jens Axboe
736feaa3a0 Merge branch 'for-6.1/block' into for-6.1/passthrough
* for-6.1/block: (162 commits)
  sbitmap: fix lockup while swapping
  block: add rationale for not using blk_mq_plug() when applicable
  block: adapt blk_mq_plug() to not plug for writes that require a zone lock
  s390/dasd: use blk_mq_alloc_disk
  blk-cgroup: don't update the blkg lookup hint in blkg_conf_prep
  nvmet: don't look at the request_queue in nvmet_bdev_set_limits
  nvmet: don't look at the request_queue in nvmet_bdev_zone_mgmt_emulate_all
  blk-mq: use quiesced elevator switch when reinitializing queues
  block: replace blk_queue_nowait with bdev_nowait
  nvme: remove nvme_ctrl_init_connect_q
  nvme-loop: use the tagset alloc/free helpers
  nvme-loop: store the generic nvme_ctrl in set->driver_data
  nvme-loop: initialize sqsize later
  nvme-fc: use the tagset alloc/free helpers
  nvme-fc: store the generic nvme_ctrl in set->driver_data
  nvme-fc: keep ctrl->sqsize in sync with opts->queue_size
  nvme-rdma: use the tagset alloc/free helpers
  nvme-rdma: store the generic nvme_ctrl in set->driver_data
  nvme-tcp: use the tagset alloc/free helpers
  nvme-tcp: store the generic nvme_ctrl in set->driver_data
  ...

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-30 07:47:38 -06:00
Christoph Hellwig
84fe64f898 nvmet: don't look at the request_queue in nvmet_bdev_set_limits
nvmet is a consumer of the block layer and should not directly look at
the request_queue.  Use the bdev_ helpers to retrieve the device limits
instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
2022-09-27 18:51:50 +02:00
Christoph Hellwig
8df20252c0 nvmet: don't look at the request_queue in nvmet_bdev_zone_mgmt_emulate_all
nvmet is a consumer of the block layer and should not directly look at
the request_queue.  Just use the NUMA node ID from the gendisk instead of
the request_queue.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
2022-09-27 18:51:39 +02:00
Christoph Hellwig
fe6f04c079 nvme: remove nvme_ctrl_init_connect_q
Unused now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2022-09-27 14:44:17 +02:00
Christoph Hellwig
ceee1953f9 nvme-loop: use the tagset alloc/free helpers
Use the common helpers to allocate and free the tagsets.  To make this
work the generic nvme_ctrl now needs to be stored in the hctx private
data instead of the nvme_loop_ctrl.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2022-09-27 14:44:17 +02:00
Christoph Hellwig
2ade82213b nvme-loop: store the generic nvme_ctrl in set->driver_data
Point the private data to the generic controller structure in preparation
of using the common tagset init/exit code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2022-09-27 14:44:17 +02:00
Christoph Hellwig
379e0df5ab nvme-loop: initialize sqsize later
Defer initializing the sqsize field from the options until it has been
capped by MAXCMD.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2022-09-27 14:44:17 +02:00
Christoph Hellwig
6dfba1c09c nvme-fc: use the tagset alloc/free helpers
Use the common helpers to allocate and free the tagsets.  To make this
work the generic nvme_ctrl now needs to be stored in the hctx private
data instead of the nvme_fc_ctrl.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: James Smart <jsmart2021@gmail.com>
2022-09-27 14:44:17 +02:00
Christoph Hellwig
1864ea4615 nvme-fc: store the generic nvme_ctrl in set->driver_data
Point the private data to the generic controller structure in preparation
of using the common tagset init/exit code and use the chance the cleanup
the init_hctx methods a bit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: James Smart <jsmart2021@gmail.com>
2022-09-27 14:44:17 +02:00
Christoph Hellwig
18ecd97506 nvme-fc: keep ctrl->sqsize in sync with opts->queue_size
Also update the sqsize field when capping the queue size, and remove the
check a queue size that is larger than sqsize given that sqsize is only
initialized from opts->queue_size.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: James Smart <jsmart2021@gmail.com>
2022-09-27 14:44:16 +02:00
Christoph Hellwig
cefa1032f1 nvme-rdma: use the tagset alloc/free helpers
Use the common helpers to allocate and free the tagsets.  To make this
work the generic nvme_ctrl now needs to be stored in the hctx private
data instead of the nvme_rdma_ctrl.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2022-09-27 14:44:16 +02:00
Christoph Hellwig
2d60738c8f nvme-rdma: store the generic nvme_ctrl in set->driver_data
Point the private data to the generic controller structure in preparation
of using the common tagset init/exit code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2022-09-27 14:44:16 +02:00
Christoph Hellwig
de777825e4 nvme-tcp: use the tagset alloc/free helpers
Use the common helpers to allocate and free the tagsets.  To make this
work the generic nvme_ctrl now needs to be stored in the hctx private
data instead of the nvme_tcp_ctrl.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2022-09-27 14:44:16 +02:00
Christoph Hellwig
06427ca09b nvme-tcp: store the generic nvme_ctrl in set->driver_data
Point the private data to the generic controller structure in preparation
of using the common tagset init/exit code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2022-09-27 14:44:16 +02:00
Christoph Hellwig
fb8745d040 nvme-tcp: remove the unused queue_size member in nvme_tcp_queue
->nvme_tcp_queue is not used anywhere, so remove it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2022-09-27 14:44:16 +02:00
Christoph Hellwig
fe60e8c534 nvme: add common helpers to allocate and free tagsets
Add common helpers to allocate and tear down the admin and I/O tag sets,
including the special queues allocated with them.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2022-09-27 14:44:15 +02:00
Christoph Hellwig
ab46d8d40f nvmet: add helpers to set the result field for connect commands
The code to set the result field for the admin and I/O connect commands
is not only verbose and duplicated, but also violates the aliasing
rules as it accesses both the u16 and u32 members in the union.

Add a little helper to sort all that out.

Fixes: db1312dd95 ("nvmet: implement basic In-Band Authentication")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
2022-09-27 09:22:09 +02:00
Christoph Hellwig
1c32a8012b nvme: improve the NVME_CONNECT_AUTHREQ* definitions
Mark them as unsigned so that we don't need extra casts, and define
them relative to cdword0 instead of requiring extra shifts.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
2022-09-27 09:22:09 +02:00
Christoph Hellwig
1befd944e0 nvmet-auth: don't try to cancel a non-initialized work_struct
Currently blktests nvme/002 trips up debugobjects if CONFIG_NVME_AUTH is
enabled, but authentication is not on a queue.  This is because
nvmet_auth_sq_free cancels sq->auth_expired_work unconditionaly, while
auth_expired_work is only ever initialized if authentication is enabled
for a given controller.

Fix this by calling most of what is nvmet_init_auth unconditionally
when initializing the SQ, and just do the setting of the result
field in the connect command handler.

Fixes: db1312dd95 ("nvmet: implement basic In-Band Authentication")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
2022-09-27 09:22:09 +02:00
zhenwei pi
0700542a82 nvmet-tcp: remove nvmet_tcp_finish_cmd
There is only a single call-site of nvmet_tcp_finish_cmd(), this
becomes redundant. Remove nvmet_tcp_finish_cmd() and use the original
function body instead.

Suggested-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-27 09:22:08 +02:00
Varun Prakash
b6a545ffa2 nvmet-tcp: add bounds check on Transfer Tag
ttag is used as an index to get cmd in nvmet_tcp_handle_h2c_data_pdu(),
add a bounds check to avoid out-of-bounds access.

Signed-off-by: Varun Prakash <varun@chelsio.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-27 09:22:08 +02:00
Varun Prakash
f614b937d8 nvmet-tcp: handle ICReq PDU received in NVMET_TCP_Q_LIVE state
As per NVMe/TCP transport specification ICReq PDU is the first PDU received
by the controller and controller should receive only one ICReq PDU.

If controller receives more than one ICReq PDU then this can be considered
as fatal error.

nvmet-tcp driver does not check for ICReq PDU opcode if queue state is
NVMET_TCP_Q_LIVE. In LIVE state ICReq PDU is treated as CapsuleCmd PDU,
this can result in abnormal behavior.

Add a check for ICReq PDU in nvmet_tcp_done_recv_pdu() to fix this issue.

Signed-off-by: Varun Prakash <varun@chelsio.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-27 09:22:08 +02:00
zhenwei pi
db94f24028 nvmet-tcp: fix NULL pointer dereference during release
nvmet-tcp frees CMD buffers in nvmet_tcp_uninit_data_in_cmds(),
and waits the inflight IO requests in nvmet_sq_destroy(). During wait
the inflight IO requests, the callback nvmet_tcp_queue_response()
is called from backend after IO complete, this leads a typical
Use-After-Free issue like this:

 BUG: kernel NULL pointer dereference, address: 0000000000000008
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x0000) - not-present page
 PGD 107f80067 P4D 107f80067 PUD 10789e067 PMD 0
 Oops: 0000 [#1] PREEMPT SMP NOPTI
 CPU: 1 PID: 123 Comm: kworker/1:1H Kdump: loaded Tainted: G            E      6.0.0-rc2.bm.1-amd64 #15
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
 Workqueue: nvmet_tcp_wq nvmet_tcp_io_work [nvmet_tcp]
 RIP: 0010:shash_ahash_digest+0x2b/0x110
 Code: 1f 44 00 00 41 57 41 56 41 55 41 54 55 48 89 fd 53 48 89 f3 48 83 ec 08 44 8b 67 30 45 85 e4 74 1c 48 8b 57 38 b8 00 10 00 00 <44> 8b 7a 08 44 29 f8 39 42 0c 0f 46 42 0c 41 39 c4 76 43 48 8b 03
 RSP: 0018:ffffc9000051bdd8 EFLAGS: 00010206
 RAX: 0000000000001000 RBX: ffff888100ab5470 RCX: 0000000000000000
 RDX: 0000000000000000 RSI: ffff888100ab5470 RDI: ffff888100ab5420
 RBP: ffff888100ab5420 R08: ffff8881024d08c8 R09: ffff888103e1b4b8
 R10: 8080808080808080 R11: 0000000000000000 R12: 0000000000001000
 R13: 0000000000000000 R14: ffff88813412bd4c R15: ffff8881024d0800
 FS:  0000000000000000(0000) GS:ffff88883fa40000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000008 CR3: 0000000104b48000 CR4: 0000000000350ee0
 Call Trace:
  <TASK>
  nvmet_tcp_io_work+0xa52/0xb52 [nvmet_tcp]
  ? __switch_to+0x106/0x420
  process_one_work+0x1ae/0x380
  ? process_one_work+0x380/0x380
  worker_thread+0x30/0x360
  ? process_one_work+0x380/0x380
  kthread+0xe6/0x110
  ? kthread_complete_and_exit+0x20/0x20
  ret_from_fork+0x1f/0x30

Separate nvmet_tcp_uninit_data_in_cmds() into two steps:

  uninit data in cmds                  <- new step 1
  nvmet_sq_destroy();
  cancel_work_sync(&queue->io_work);
  free CMD buffers                     <- new step 2

Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-27 09:22:08 +02:00
Keith Busch
6ee742fa8e nvme-pci: report the actual number of tagset maps
We've been reporting 2 maps regardless of whether the module parameter
asked for anything beyond the default queues. A consequence of this
means that blk-mq will reinitialize the all the hardware contexts and io
schedulers on every controller reset when the mapping is exactly the
same as before. This unnecessary overhead is adding several milliseconds
on a reset for environments that don't need it. Report the actual number
of mappings in use.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-27 09:22:07 +02:00
Rishabh Bhatnagar
61ce339f19 nvme-pci: set min_align_mask before calculating max_hw_sectors
If swiotlb is force enabled dma_max_mapping_size ends up calling
swiotlb_max_mapping_size which takes into account the min align mask for
the device.  Set the min align mask for nvme driver before calling
dma_max_mapping_size while calculating max hw sectors.

Signed-off-by: Rishabh Bhatnagar <risbhat@amazon.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-27 09:22:07 +02:00
Sagi Grimberg
f46ef9e87c nvme: send a rediscover uevent when a persistent discovery controller reconnects
When a discovery controller is disconnected, no AENs will arrive to
notify the host about discovery log change events.

In order to solve this, send a uevent notification when a
persistent discovery controller reconnects. We add a new ctrl
flag NVME_CTRL_STARTED_ONCE that will be set on the first
start, and consecutive calls will find it set, and send the
event to userspace if the controller is a discovery controller.

Upon the event reception, userspace will re-read the discovery
log page and will act upon changes as it sees fit.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-27 09:22:07 +02:00
Sagi Grimberg
bf093d9716 nvme: enumerate controller flags
We expect to grow a few of these flags for various purposes
so make them a proper enumeration.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-27 09:22:07 +02:00
Tina Hsu
d14c273132 nvme-pci: disable Write Zeroes on Phison E3C/E4C
E3C/E4C SSDs do support the Write Zeroes command in theory, but have very
bad performance when using it.  As the firmware has been frozen for these
products we can not expect firmware improvements for it, so disable
Write Zeroes.

Signed-off-by: Tina Hsu <tina_hsu@phison.corp-partner.google.com>
[hch: update the commit message]
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-27 09:20:30 +02:00
Michael Kelley
c292a337d0 nvme: Fix IOC_PR_CLEAR and IOC_PR_RELEASE ioctls for nvme devices
The IOC_PR_CLEAR and IOC_PR_RELEASE ioctls are
non-functional on NVMe devices because the nvme_pr_clear()
and nvme_pr_release() functions set the IEKEY field incorrectly.
The IEKEY field should be set only when the key is zero (i.e,
not specified).  The current code does it backwards.

Furthermore, the NVMe spec describes the persistent
reservation "clear" function as an option on the reservation
release command. The current implementation of nvme_pr_clear()
erroneously uses the reservation register command.

Fix these errors. Note that NVMe version 1.3 and later specify
that setting the IEKEY field will return an error of Invalid
Field in Command.  The fix will set IEKEY when the key is zero,
which is appropriate as these ioctls consider a zero key to
be "unspecified", and the intention of the spec change is
to require a valid key.

Tested on a version 1.4 PCI NVMe device in an Azure VM.

Fixes: 1673f1f08c ("nvme: move block_device_operations and ns/ctrl freeing to common code")
Fixes: 1d277a637a ("NVMe: Add persistent reservation ops")
Signed-off-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-27 09:20:29 +02:00
Keith Busch
1e866afd4b nvme: ensure subsystem reset is single threaded
The subsystem reset writes to a register, so we have to ensure the
device state is capable of handling that otherwise the driver may access
unmapped registers. Use the state machine to ensure the subsystem reset
doesn't try to write registers on a device already undergoing this type
of reset.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=214771
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-27 09:15:56 +02:00
Keith Busch
23e085b2de nvme: restrict management ioctls to admin
The passthrough commands already have this restriction, but the other
operations do not. Require the same capabilities for all users as all of
these operations, which include resets and rescans, can be disruptive.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-27 09:15:56 +02:00
Keith Busch
a8eb6c1ba4 nvme: copy firmware_rev on each init
The firmware revision can change on after a reset so copy the most
recent info each time instead of just the first time, otherwise the
sysfs firmware_rev entry may contain stale data.

Reported-by: Jeff Lien <jeff.lien@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Chao Leng <lengchao@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-27 09:15:56 +02:00
Keith Busch
bc8fb906b0 nvme: handle effects after freeing the request
If a reset occurs after the scan work attempts to issue a command, the
reset may quisce the admin queue, which blocks the scan work's command
from dispatching. The scan work will not be able to complete while the
queue is quiesced.

Meanwhile, the reset work will cancel all outstanding admin tags and
wait until all requests have transitioned to idle, which includes the
passthrough request. But the passthrough request won't be set to idle
until after the scan_work flushes, so we're deadlocked.

Fix this by handling the end effects after the request has been freed.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=216354
Reported-by: Jonathan Derrick <Jonathan.Derrick@solidigm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chao Leng <lengchao@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-27 09:15:56 +02:00
Jens Axboe
de97fcb303 fs: add batch and poll flags to the uring_cmd_iopoll() handler
We need the poll_flags to know how to poll for the IO, and we should
have the batch structure in preparation for supporting batched
completions with iopoll.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 10:30:43 -06:00
Kanchan Joshi
585079b6e4 nvme: wire up async polling for io passthrough commands
Store a cookie during submission, and use that to implement
completion-polling inside the ->uring_cmd_iopoll handler.
This handler makes use of existing bio poll facility.

Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Link: https://lore.kernel.org/r/20220823161443.49436-5-joshi.k@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 10:30:42 -06:00
Martin Belanger
02c57a82c0 nvme-tcp: print actual source IP address through sysfs "address" attr
TCP transport relies on the routing table to determine which source
address and interface to use when making a connection. Currently, there
is no way to tell from userspace where a connection was made. This
patch exposes the actual source address using a new field named
"src_addr=" in the "address" attribute.

This is needed to diagnose and identify connectivity issues. With the
source address we can infer the interface associated with each
connection.

This was tested with nvme-cli 2.0 to verify it does not have any
adverse effect. The new "src_addr=" field will simply be displayed in
the output of the "list-subsys" or "list -v" commands as shown here.

$ nvme list-subsys
nvme-subsys0 - NQN=nqn.2014-08.org.nvmexpress.discovery
\
 +- nvme0 tcp traddr=192.168.56.1,trsvcid=8009,src_addr=192.168.56.101 live

Signed-off-by: Martin Belanger <martin.belanger@dell.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-19 17:55:28 +02:00
Fabio M. De Francesco
5bfaba275a nvmet-tcp: don't map pages which can't come from HIGHMEM
kmap() is being deprecated in favor of kmap_local_page().[1]

There are two main problems with kmap(): (1) It comes with an overhead as
mapping space is restricted and protected by a global lock for
synchronization and (2) it also requires global TLB invalidation when the
kmap’s pool wraps and it might block when the mapping space is fully
utilized until a slot becomes available.

The pages which will be mapped are allocated in nvmet_tcp_map_data(),
using the GFP_KERNEL flag. This assures that they cannot come from
HIGHMEM. This imply that a straight page_address() can replace the kmap()
of sg_page(sg) in nvmet_tcp_map_pdu_iovec(). As a side effect, we might
also delete the field "nr_mapped" from struct "nvmet_tcp_cmd" because,
after removing the kmap() calls, there would be no longer any need of it.

In addition, there is no reason to use a kvec for the command receive
data buffers iovec, use a bio_vec instead and let iov_iter handle the
buffer mapping and data copy.

Test with blktests on a QEMU/KVM x86_32 VM, 6GB RAM, booting a kernel with
HIGHMEM64GB enabled.

[1] "[PATCH] checkpatch: Add kmap and kmap_atomic to the deprecated
list" https://lore.kernel.org/all/20220813220034.806698-1-ira.weiny@intel.com/

Cc: Chaitanya Kulkarni <chaitanyak@nvidia.com>
Cc: Keith Busch <kbusch@kernel.org>
Suggested-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Suggested-by: Christoph Hellwig <hch@lst.de>
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
[sagi: added bio_vec plus minor naming changes]
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-19 17:55:25 +02:00
Keith Busch
c4c22c5208 nvme-pci: move iod dma_len fill gaps
The 32-bit field, dma_len, packs better in the iod struct above the
dma_addr_t on 64-bit systems.

Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-19 17:55:25 +02:00
Keith Busch
c372cdd1ef nvme-pci: iod npages fits in s8
The largest allowed transfer is 4MB, which can use at most 1025 PRPs.
Each PRP is 8 bytes, so the maximum number of 4k nvme pages needed for
the iod_list is 3, which fits in an 's8' type.

While modifying this field, change the name to "nr_allocations" to
better represent that this is referring to the number of units allocated
from a dma_pool.

Also introduce a BUILD_BUG_ON to ensure we never accidently increase the
largest transfer limit beyond 127 chained prp lists.

Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-19 17:55:25 +02:00
Keith Busch
52da4f3f5c nvme-pci: iod's 'aborted' is a bool
It's only true or false, so make this a bool to reflect that and save
some space in nvme_iod.

Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-19 17:55:25 +02:00
Keith Busch
a53232cb3a nvme-pci: remove nvme_queue from nvme_iod
We can get the nvme_queue from the req just as easily, so remove the
duplicate path to the same structure to save some space.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-19 17:55:25 +02:00
Daniel Wagner
4cde03d82e nvme: consider also host_iface when checking ip options
It's perfectly fine to use the same traddr and trsvcid more than once
as long we use different host interface. This is used in setups where
the host has more than one interface but the target exposes only one
traddr/trsvcid combination.

Use the same acceptance rules for host_iface as we have for
host_traddr.

Signed-off-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Chao Leng <lengchao@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-19 17:55:24 +02:00
Daniel Wagner
1c467e2595 nvme-rdma: handle number of queue changes
On reconnect, the number of queues might have changed.

In the case where we have more queues available than previously we try
to access queues which are not initialized yet.

The other case where we have less queues than previously, the
connection attempt will fail because the target doesn't support the
old number of queues and we end up in a reconnect loop.

Thus, only start queues which are currently present in the tagset
limited by the number of available queues. Then we update the tagset
and we can start any new queue.

Signed-off-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-19 17:55:21 +02:00
Daniel Wagner
09035f8649 nvme-tcp: handle number of queue changes
On reconnect, the number of queues might have changed.

In the case where we have more queues available than previously we try
to access queues which are not initialized yet.

The other case where we have less queues than previously, the
connection attempt will fail because the target doesn't support the
old number of queues and we end up in a reconnect loop.

Thus, only start queues which are currently present in the tagset
limited by the number of available queues. Then we update the tagset
and we can start any new queue.

Signed-off-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-19 17:55:19 +02:00
Daniel Wagner
3e980f5995 nvmet: expose max queues to configfs
Allow to set the max queues the target supports. This is useful for
testing the reconnect attempt of the host with changing numbers of
supported queues.

Signed-off-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-19 17:55:16 +02:00
Guixin Liu
d416800776 nvmet: avoid unnecessary flush bio
For no volatile write cache block device backend, sending flush bio is
unnecessary, avoid to do that.

Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-19 11:33:54 +02:00
Genjian Zhang
c46724cb89 nvmet-auth: remove redundant parameters req
The parameter is not used in this function, so remove it.

Signed-off-by: Genjian Zhang <zhanggenjian@kylinos.cn>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-19 11:33:54 +02:00
Jackie Liu
4214798156 nvmet-auth: clean up with done_kfree
Jump directly to done_kfree to release d, which is consistent with the
code style behind.

Reported-by: Genjian Zhang <zhanggenjian@kylinos.cn>
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-19 11:33:54 +02:00
Jackie Liu
6e6fee569d nvme-auth: remove the redundant req->cqe->result.u16 assignment operation
req->cqe->result.u16 has already been assigned in the previous line, no
need to do it again.

Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-19 11:33:54 +02:00
Wolfram Sang
a8817cc09d nvme: move from strlcpy with unused retval to strscpy
Follow the advice of the below link and prefer 'strscpy' in this
subsystem. Conversion is 1:1 because the return value is not used.
Generated by a coccinelle script.

Link: https://lore.kernel.org/r/CAHk-=wgfRnXz0W3D37d01q3JFkr_i_uTL=V6A6G1oUZcprmknw@mail.gmail.com/
Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-19 11:33:53 +02:00
Linjun Bao
1abc696174 nvme: add comment for unaligned "fake" nqn
Current "fake" nqn field is "nqn.2014.08.org.nvmexpress:", it is
not aligned with the canonical version for history reasons.

Signed-off-by: Linjun Bao <meljbao@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-19 11:33:53 +02:00
Linus Torvalds
9ebc0ecb21 block-6.0-2022-09-09
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmMbdksQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpqm8EACPqE5eFwFtZZnYV8eFjtTQCA5aZ+u3HyF8
 Xby5FEZxwpRaaGPwUZ6T7A45ZVH5ka+JPvSQYeSTKxdjBVeqmAUVnlhqLjxeTvJt
 WhfSCrPq553eYIKoeUVGxLfQ5GhK4f/2noiEMUIEOMSsDyh4txm46MRNdfi5doj1
 2JlbJyoTIK0ZRoraAuZp07bZPFnbxtjSB9YJlUqyKpiSMbM+xocPRaB3Tpr42So8
 vUdgt2/tDZ2xNOf4gd1KqUlosIlaG0RWeZeXTU9knrQjPYtB77iM6MVDUGhgiYT/
 w25XYCbZiGt1YwRTOYFX9bE6jGuPDc54isRqCmPSjYwTAbeEd7sWytn5D1W0IkaF
 dH93mDaW+8/GrjdGVCJ4jLPRCfQvZZ8i5XlwXaXQfnTQ8okUelx5BJ6kEfKRUkhV
 nZD0r5MskH4dQv4ESluaeblrDWphMaznZLNjp+UXfiycsbE9l2TX1FOjyK9E7f7o
 4zGYiertjc7wj0OnnoDTf0wdpXI13xDqQGFhe1Lnz4tOrpTfzZqwyv4IkFo3VghR
 OocJFk4GSuKLwDPASDY/B+nynhQw5fft/fp458Q5pFY/2zItvmg2rxloFetUAz9i
 0kcrO4NuO1l6E20KoBED2HMrQOow68dqcCJTrRcXhDfUIRIk0XSg4Kd/nGJBhZZS
 5mv/hhkjjQ==
 =3RZd
 -----END PGP SIGNATURE-----

Merge tag 'block-6.0-2022-09-09' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - NVMe pull via Christoph:
      - fix a use after free in nvmet (Bart Van Assche)
      - fix a use after free when detecting digest errors
        (Sagi Grimberg)
      - fix regression that causes sporadic TCP requests to time out
        (Sagi Grimberg)
      - fix two off by ones errors in the nvmet ZNS support
        (Dennis Maisenbacher)
      - requeue aen after firmware activation (Keith Busch)

 - Fix missing request flags in debugfs code (me)

 - Partition scan fix (Ming)

* tag 'block-6.0-2022-09-09' of git://git.kernel.dk/linux-block:
  block: add missing request flags to debugfs code
  nvme: requeue aen after firmware activation
  nvmet: fix mar and mor off-by-one errors
  nvme-tcp: fix regression that causes sporadic requests to time out
  nvme-tcp: fix UAF when detecting digest errors
  nvmet: fix a use-after-free
  block: don't add partitions if GD_SUPPRESS_PART_SCAN is set
2022-09-09 15:03:08 -04:00
Keith Busch
371a982cd2 nvme: requeue aen after firmware activation
The driver prevents async event work while handling a processing paused
event, but someone needs to restart it after the controller returns to a
live state.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=216400
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-07 08:38:25 +02:00
Dennis Maisenbacher
b7e97872a6 nvmet: fix mar and mor off-by-one errors
Maximum Active Resources (MAR) and Maximum Open Resources (MOR) are 0's
based vales where a value of 0xffffffff indicates that there is no limit.

Decrement the values that are returned by bdev_max_open_zones and
bdev_max_active_zones as the block layer helpers are not 0's based.
A 0 returned by the block layer helpers indicates no limit, thus convert
it to 0xffffffff (U32_MAX).

Fixes: aaf2e048af ("nvmet: add ZBD over ZNS backend support")
Suggested-by: Niklas Cassel <niklas.cassel@wdc.com>
Signed-off-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-07 08:33:17 +02:00
Sagi Grimberg
3770a42bb8 nvme-tcp: fix regression that causes sporadic requests to time out
When we queue requests, we strive to batch as much as possible and also
signal the network stack that more data is about to be sent over a socket
with MSG_SENDPAGE_NOTLAST. This flag looks at the pending requests queued
as well as queue->more_requests that is derived from the block layer
last-in-batch indication.

We set more_request=true when we flush the request directly from
.queue_rq submission context (in nvme_tcp_send_all), however this is
wrongly assuming that no other requests may be queued during the
execution of nvme_tcp_send_all.

Due to this, a race condition may happen where:

 1. request X is queued as !last-in-batch
 2. request X submission context calls nvme_tcp_send_all directly
 3. nvme_tcp_send_all is preempted and schedules to a different cpu
 4. request Y is queued as last-in-batch
 5. nvme_tcp_send_all context sends request X+Y, however signals for
    both MSG_SENDPAGE_NOTLAST because queue->more_requests=true.

==> none of the requests is pushed down to the wire as the network
stack is waiting for more data, both requests timeout.

To fix this, we eliminate queue->more_requests and only rely on
the queue req_list and send_list to be not-empty.

Fixes: 122e5b9f3d ("nvme-tcp: optimize network stack with setting msg flags according to batch size")
Reported-by: Jonathan Nicklin <jnicklin@blockbridge.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Tested-by: Jonathan Nicklin <jnicklin@blockbridge.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-06 06:40:44 +02:00
Sagi Grimberg
160f3549a9 nvme-tcp: fix UAF when detecting digest errors
We should also bail from the io_work loop when we set rd_enabled to true,
so we don't attempt to read data from the socket when the TCP stream is
already out-of-sync or corrupted.

Fixes: 3f2304f8c6 ("nvme-tcp: add NVMe over TCP host driver")
Reported-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-06 06:40:40 +02:00
Bart Van Assche
6a02a61e81 nvmet: fix a use-after-free
Fix the following use-after-free complaint triggered by blktests nvme/004:

BUG: KASAN: user-memory-access in blk_mq_complete_request_remote+0xac/0x350
Read of size 4 at addr 0000607bd1835943 by task kworker/13:1/460
Workqueue: nvmet-wq nvme_loop_execute_work [nvme_loop]
Call Trace:
 show_stack+0x52/0x58
 dump_stack_lvl+0x49/0x5e
 print_report.cold+0x36/0x1e2
 kasan_report+0xb9/0xf0
 __asan_load4+0x6b/0x80
 blk_mq_complete_request_remote+0xac/0x350
 nvme_loop_queue_response+0x1df/0x275 [nvme_loop]
 __nvmet_req_complete+0x132/0x4f0 [nvmet]
 nvmet_req_complete+0x15/0x40 [nvmet]
 nvmet_execute_io_connect+0x18a/0x1f0 [nvmet]
 nvme_loop_execute_work+0x20/0x30 [nvme_loop]
 process_one_work+0x56e/0xa70
 worker_thread+0x2d1/0x640
 kthread+0x183/0x1c0
 ret_from_fork+0x1f/0x30

Cc: stable@vger.kernel.org
Fixes: a07b4970f4 ("nvmet: add a generic NVMe target")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-05 15:03:51 +02:00
Linus Torvalds
d895ec7938 block-6.0-2022-09-02
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmMSIeAQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpoR0EADSxpD2LFb6yyqirZscKr+i0NA3HWNzw8+m
 aZRq8Leb3iozOCtnUPdrIQRvUI5FhW7Q8DUa/vB314xoIbregawJxQi5Q2o8akr7
 LhlyRp9ElKSM4QnCiKBWob6YIUdjVlx0LWhZPdncgc2fAGSxGJNWVHHM9X0UHkpV
 Sr3eWFKyl0LJxV0wu42dT2VcgRMFrr8+9VuKkgBXA5u8p4CpEbJtIlo2b4XgKmwp
 8oeJkQFaSnOSI6Zt4p2oi75PIa27pOGm6MfNH366tgbikH261F28LMCTE3lInzgm
 Rh1U0pRhh36Q+Qf9X9m4AVd8qk31m/YaDgIN4rQXlnc+N2i2dweW2DBZfasQKHJX
 ofxheTOfnqogT9ZJ4T10rHCQmWAC2mVeOgMafQDV6lUuyTqvcaBGkCxBkh9ASIXi
 lpcnWRVzVsiN5YTYxXiXheKiuklcIUnfgTLSu7BMjz32smb/ZqrkPs3ZNAKr8d+q
 HmUd+9THZ560eGMcQgSod2B6DD0NbWvGzIWFthADP9xVA6y68rN+e43MG2TrzkcP
 h4tgnA63lZ8eSaBCfsr0yOGfNkY55OYh4BPH8ZxP7tyBgYwJO2+/09dQgn6EjJnB
 qHorVqd6OoLKWkb88yz9uGtObj00ll4Ks4HcXwW0eF8idjyCEph57BJoocrIIqmP
 IitYAkqtAg==
 =9Eyh
 -----END PGP SIGNATURE-----

Merge tag 'block-6.0-2022-09-02' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - NVMe pull request via Christoph:
     - error handling fix for the new auth code (Hannes Reinecke)
     - fix unhandled tcp states in nvmet_tcp_state_change (Maurizio
       Lombardi)
     - add NVME_QUIRK_BOGUS_NID for Lexar NM610 (Shyamin Ayesh)

 - Add documentation for the ublk driver merged in this merge window
   (Ming)

* tag 'block-6.0-2022-09-02' of git://git.kernel.dk/linux-block:
  Documentation: document ublk
  nvmet-tcp: fix unhandled tcp states in nvmet_tcp_state_change()
  nvmet-auth: add missing goto in nvmet_setup_auth()
  nvme-pci: add NVME_QUIRK_BOGUS_NID for Lexar NM610
2022-09-02 16:44:30 -07:00
Maurizio Lombardi
478814a558 nvmet-tcp: fix unhandled tcp states in nvmet_tcp_state_change()
TCP_FIN_WAIT2 and TCP_LAST_ACK were not handled, the connection is closing
so we can ignore them and avoid printing the "unhandled state"
warning message.

[ 1298.852386] nvmet_tcp: queue 2 unhandled state 5
[ 1298.879112] nvmet_tcp: queue 7 unhandled state 5
[ 1298.884253] nvmet_tcp: queue 8 unhandled state 5
[ 1298.889475] nvmet_tcp: queue 9 unhandled state 5

v2: Do not call nvmet_tcp_schedule_release_queue(), just ignore
the fin_wait2 and last_ack states.

Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-08-31 07:58:10 +03:00
Hannes Reinecke
da0342a3aa nvmet-auth: add missing goto in nvmet_setup_auth()
There's a goto missing in nvmet_setup_auth(), causing a kernel oops
when nvme_auth_extract_key() fails.

Reported-by: Tal Lossos <tallossos@gmail.com>
Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-08-31 07:57:59 +03:00
Shyamin Ayesh
200dccd07d nvme-pci: add NVME_QUIRK_BOGUS_NID for Lexar NM610
Lexar NM610 reports bogus eui64 values that appear to be the same across
all drives. Quirk them out so they are not marked as "non globally unique"
duplicates.

Signed-off-by: Shyamin Ayesh <me@shyamin.com>
[patch formatting]
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-08-31 07:57:28 +03:00
Bart Van Assche
a4e1d0b76e block: Change the return type of blk_mq_map_queues() into void
Since blk_mq_map_queues() and the .map_queues() callbacks always return 0,
change their return type into void. Most callers ignore the returned value
anyway.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Doug Gilbert <dgilbert@interlog.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: John Garry <john.garry@huawei.com>
Acked-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Link: https://lore.kernel.org/r/20220815170043.19489-3-bvanassche@acm.org
[axboe: fold in fix from Bart]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-22 10:07:53 -06:00
Linus Torvalds
abe7a481aa block-6.0-2022-08-12
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmL2SxQQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpl24D/4nxPDuhvDxLHpMUR/C2lCtAnqVhHtGLhIX
 QSw+GVA5LuJvuB0L5Zr7ODDILPY5ZyWI/F1FPVJUOJE1NJ3tFiH4WzFIkqtFtVCE
 2jFTXH63A/o/fyo9nscsZ1g6eEswSAbvenHEa9HNpjgFxz0lnXjrniP5VFPo5HNl
 F8/MO1CBkKmhsGazZn7o1J3Ws6RvApq59YzxHmVz1hFHPgJFN2KwIAQjY2+GGoOD
 ifpRBbZBCTzj2dEEFZHeK1aCYhTNP4VqbNnBDQPZHwEB3qkml5R9GhTlUe7Ej17t
 7o4A05efcm/24TXcODMHP5YaGA14otPUr8wQiJjuOIFLw8sMC61OyS9qDdu1IvyW
 JCnTtDkzwZZEkhXlraU1HmiLSaBjMEvd/2puxbcS9kISdO7baCLd4Oj7+8ThVhIf
 JHIt2x3vzKaCzWI93IMrw5iJFK0+NS+SLAD6eyuEgC71Rj5ooemxrBYxKBQ7jb3o
 GCC3SaU8lFmB1Z/zKo63gGS1b7eaCNGauNm9/gSe1jM8Sor4hlT0yeNRpHf7egAu
 1dUQdSwgon6EH6JOGX3CFSC9lnIEAew733QZLaYBqar2WHcq5Wpq6LcWUHhhujgB
 dSTeLY1Tnhs3GWvMFe4JH/YZilFpbMKzxuFPCV7sQScxzlaGusliX8kVkha/VPLK
 9rtT8uyJXg==
 =1/w+
 -----END PGP SIGNATURE-----

Merge tag 'block-6.0-2022-08-12' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - NVMe pull request
     - print nvme connect Linux error codes properly (Amit Engel)
     - fix the fc_appid_store return value (Christoph Hellwig)
     - fix a typo in an error message (Christophe JAILLET)
     - add another non-unique identifier quirk (Dennis P. Kliem)
     - check if the queue is allocated before stopping it in nvme-tcp
       (Maurizio Lombardi)
     - restart admin queue if the caller needs to restart queue in
       nvme-fc (Ming Lei)
     - use kmemdup instead of kmalloc + memcpy in nvme-auth (Zhang
       Xiaoxu)

 - __alloc_disk_node() error handling fix (Rafael)

* tag 'block-6.0-2022-08-12' of git://git.kernel.dk/linux-block:
  block: Do not call blk_put_queue() if gendisk allocation fails
  nvme-pci: add NVME_QUIRK_BOGUS_NID for ADATA XPG GAMMIX S70
  nvme-tcp: check if the queue is allocated before stopping it
  nvme-fabrics: Fix a typo in an error message
  nvme-fabrics: parse nvme connect Linux error codes
  nvmet-auth: use kmemdup instead of kmalloc + memcpy
  nvme-fc: fix the fc_appid_store return value
  nvme-fc: restart admin queue if the caller needs to restart queue
2022-08-13 13:37:36 -07:00
Dennis P. Kliem
f37527a09d nvme-pci: add NVME_QUIRK_BOGUS_NID for ADATA XPG GAMMIX S70
ADATA XPG GAMMIX S70 reports bogus eui64 values that appear to be the same
across all drives. Quirk them out so they are not marked as "non globally
unique" duplicates.

Signed-off-by: Dennis P. Kliem <dpkliem@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-08-11 14:10:16 +02:00
Maurizio Lombardi
2bff487f9a nvme-tcp: check if the queue is allocated before stopping it
When an error is detected and the host reconnects, the
nvme_tcp_error_recovery_work() function is called and starts
tearing down the io queues and de-allocating them;
If at the same time the "nvme" process deletes the controller via sysfs,
the nvme_tcp_delete_ctrl() gets called and waits until the
nvme_tcp_error_recovery_work() finishes its job; then starts
tearing down the io queues, but at this point they have already
been freed and the mutexes are destroyed.

Calling mutex_lock() against a destroyed mutex triggers a warning:

[ 1299.025575] nvme nvme1: Reconnecting in 10 seconds...
[ 1299.636449] nvme nvme1: Removing ctrl: NQN "blktests-subsystem-1"
[ 1299.645262] ------------[ cut here ]------------
[ 1299.649949] DEBUG_LOCKS_WARN_ON(lock->magic != lock)
[ 1299.649971] WARNING: CPU: 4 PID: 104150 at kernel/locking/mutex.c:579 __mutex_lock+0x2d0/0x7dc

[ 1299.717934] CPU: 4 PID: 104150 Comm: nvme
[ 1299.828075] Call trace:
[ 1299.830526]  __mutex_lock+0x2d0/0x7dc
[ 1299.834203]  mutex_lock_nested+0x64/0xd4
[ 1299.838139]  nvme_tcp_stop_queue+0x54/0xe0 [nvme_tcp]
[ 1299.843211]  nvme_tcp_teardown_io_queues.part.0+0x90/0x280 [nvme_tcp]
[ 1299.849672]  nvme_tcp_delete_ctrl+0x6c/0xf0 [nvme_tcp]
[ 1299.854831]  nvme_do_delete_ctrl+0x108/0x120 [nvme_core]
[ 1299.860181]  nvme_sysfs_delete+0xec/0xf0 [nvme_core]
[ 1299.865179]  dev_attr_store+0x40/0x70

Fix the warning by checking if the queues are allocated
in the nvme_tcp_stop_queue(). If they are not, it makes no
sense to try to stop them.

Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-08-10 16:21:31 +02:00
Christophe JAILLET
c50cd03dbe nvme-fabrics: Fix a typo in an error message
A 'c' is missing.
s/fabris/fabrics/

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-08-10 16:21:31 +02:00
Amit Engel
ec9e96b523 nvme-fabrics: parse nvme connect Linux error codes
This fixes the assumption that errval is an unsigned nvme error

Signed-off-by: Amit Engel <amit.engel@dell.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-08-10 16:05:10 +02:00
Zhang Xiaoxu
14446f9abd nvmet-auth: use kmemdup instead of kmalloc + memcpy
For code neat purpose, we can use kmemdup to replace
kmalloc + memcpy.

Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-08-10 16:05:09 +02:00
Christoph Hellwig
9317d00144 nvme-fc: fix the fc_appid_store return value
"nvme-fc: fold t fc_update_appid into fc_appid_store" accidentally
changed the userspace interface for the appid attribute, because the code
that decrements "count" to remove a trailing '\n' in the parsing results
in the decremented value being incorrectly be returned from the sysfs
write.  Fix this by keeping an orig_count variable for the full length
of the write.

Fixes: c814153c83 ("nvme-fc: fold t fc_update_appid into fc_appid_store")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Ewan D. Milne <emilne@redhat.com>
Reviewed-by: James Smart <jsmart2021@gmail.com>
Tested-by:  Muneendra Kumar M <muneendra.kumar@broadcom.com>
2022-08-10 16:05:08 +02:00
Ming Lei
6fb271f1bc nvme-fc: restart admin queue if the caller needs to restart queue
Without restarting admin queue in __nvme_fc_abort_outstanding_ios(),
it leaves controller not capable of handling admin pt request, and
causes io hang.

Fixes it by restarting admin queue if the caller of __nvme_fc_abort_outstanding_ios
requires to restart queue.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: James Smart <jsmart2021@gmail.com>
Tested-by: Ewan D. Milne <emilne@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-08-10 16:05:07 +02:00
Linus Torvalds
c993e07be0 dma-mapping updates
- convert arm32 to the common dma-direct code (Arnd Bergmann, Robin Murphy,
    Christoph Hellwig)
  - restructure the PCIe peer to peer mapping support (Logan Gunthorpe)
  - allow the IOMMU code to communicate an optional DMA mapping length
    and use that in scsi and libata (John Garry)
  - split the global swiotlb lock (Tianyu Lan)
  - various fixes and cleanup (Chao Gao, Dan Carpenter, Dongli Zhang,
    Lukas Bulwahn, Robin Murphy)
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAmLuIYULHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYPS5A//Ty1ZNyXExmwZ6J6g7/oIvQlpAHilDr22mCd8tR8Y
 Ne7TgLa/X+usFvJTxJfkvg/LNMDjD7qx0J/mhDGm4reOFcEL4/PBy0rDSOgnmntV
 k/fPhgwnpuztiAQ+s+WkJ3pkrmG1HaEId7GGj2JaoYdas6RX2mGX7vL8uvUFepjw
 lYPAqWMtJHkOfsDK0PqqyQsr7dcC6lyFLqnn/wqvHtTJeKCfGs6W/SIrlWme2SZY
 3dNx84ZR1uPjaazAmtf2IWfjh/TBmd0ETRYycgUUKRP9iwsCkBQDBwsBGSIYXiWj
 BUKQ5oMvjAlUGRF0jYz9e77KuedE6GxWiXNQstitBmid142M37DHA5tvZRf65MPS
 THHcjTDmmoaO4YfFhhXOcFOrjG4/V8bF7fgHB6XkHDjhVVTcnIx8zuOAXIVBZvIV
 VAALmamBqEfIZZrCqgr7hzFssK2bip+TIMkdoD46Wcr+D7bAlujhuzWxubn9+ulT
 23v/pAvC80ut6LvKj6EA+GpRm/pejfOtEbjXPoO2hguNxvuUKvPQqNh9hy0q+v1e
 8n2Y/4lhy5bv02S7wKooNkfCoV753jBY1TIru45UmEYc3EkTQPii6okYe0DvW4QX
 VCnKgo156wSBfE+9eWdxCROv2SZqJFMV/wL3vw54dpJQMbDy7VkNsh4mGREdUkU1
 uek=
 =Bv19
 -----END PGP SIGNATURE-----

Merge tag 'dma-mapping-5.20-2022-08-06' of git://git.infradead.org/users/hch/dma-mapping

Pull dma-mapping updates from Christoph Hellwig:

 - convert arm32 to the common dma-direct code (Arnd Bergmann, Robin
   Murphy, Christoph Hellwig)

 - restructure the PCIe peer to peer mapping support (Logan Gunthorpe)

 - allow the IOMMU code to communicate an optional DMA mapping length
   and use that in scsi and libata (John Garry)

 - split the global swiotlb lock (Tianyu Lan)

 - various fixes and cleanup (Chao Gao, Dan Carpenter, Dongli Zhang,
   Lukas Bulwahn, Robin Murphy)

* tag 'dma-mapping-5.20-2022-08-06' of git://git.infradead.org/users/hch/dma-mapping: (45 commits)
  swiotlb: fix passing local variable to debugfs_create_ulong()
  dma-mapping: reformat comment to suppress htmldoc warning
  PCI/P2PDMA: Remove pci_p2pdma_[un]map_sg()
  RDMA/rw: drop pci_p2pdma_[un]map_sg()
  RDMA/core: introduce ib_dma_pci_p2p_dma_supported()
  nvme-pci: convert to using dma_map_sgtable()
  nvme-pci: check DMA ops when indicating support for PCI P2PDMA
  iommu/dma: support PCI P2PDMA pages in dma-iommu map_sg
  iommu: Explicitly skip bus address marked segments in __iommu_map_sg()
  dma-mapping: add flags to dma_map_ops to indicate PCI P2PDMA support
  dma-direct: support PCI P2PDMA pages in dma-direct map_sg
  dma-mapping: allow EREMOTEIO return code for P2PDMA transfers
  PCI/P2PDMA: Introduce helpers for dma_map_sg implementations
  PCI/P2PDMA: Attempt to set map_type if it has not been set
  lib/scatterlist: add flag for indicating P2PDMA segments in an SGL
  swiotlb: clean up some coding style and minor issues
  dma-mapping: update comment after dmabounce removal
  scsi: sd: Add a comment about limiting max_sectors to shost optimal limit
  ata: libata-scsi: cap ata_device->max_sectors according to shost->max_sectors
  scsi: scsi_transport_sas: cap shost opt_sectors according to DMA optimal limit
  ...
2022-08-06 10:56:45 -07:00
Linus Torvalds
fa9db655d0 for-5.20/block-2022-08-04
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmLsRfkQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpj43EADBydQhe7nQHH65gecqvttnio2GqEmcbozt
 lKFQlPPd3SHGMAJjSdR1dIwqtPsJ8q6xZXH+TjHhLXb2kgVu+TQ31krNHIqBwE14
 s7SsgGRgvopA46lSf/ls18/8sh6Yz1NgI39YcMVPjvkbLaVFK7zRkL9OSp4RQCwH
 u/IIHJmV415EeF6QNTgABBel/gEIPBLsvwOxTBIkzDOyUohtExZPYj83MDm7jdr3
 jsTUd2MiumNMh7ziMJIp1iN32nQOtIKtwWZaMHDCzfU/IUnBSmh2nj9oXr3+vcwo
 IsBMDUfUj9Eig5QQ/XcVIrFezi0GnunpBhScXPqL+dxPN812lzxNjkx6PsC+rPn8
 mWmXoaeK1ayoyotdHJlmINNmWUSCkOMwVnA2r1c4Hp4cQS5vRUtkKcpNLTpMhk4I
 OwQ3bjt9mA//WlH+apbhJqXqxjcoBwCwMoveJ4mHVtku9lo+JJAKVGdUs17QjZkC
 NxACP1MtBcXy1hurNQf14oH5C0Hyg4TBJShPauKmrqGtOFnbOAdX2qIhldvyNfH1
 l9cOvGNSgbQ6FLD6MVto6dC/KYOEM3LelVxgNB/80GbSmGwj88Kd/nzQLYFP89JJ
 0Wkt14mSkm82gabOvNqXGG8P8hLb/+v6sp4qZv0mf+op0xmb4FB5eaZvoceptVzM
 3Z+hmT7MfA==
 =pgNf
 -----END PGP SIGNATURE-----

Merge tag 'for-5.20/block-2022-08-04' of git://git.kernel.dk/linux-block

Pull block driver updates from Jens Axboe:

 - NVMe pull requests via Christoph:
      - add support for In-Band authentication (Hannes Reinecke)
      - handle the persistent internal error AER (Michael Kelley)
      - use in-capsule data for TCP I/O queue connect (Caleb Sander)
      - remove timeout for getting RDMA-CM established event (Israel
        Rukshin)
      - misc cleanups (Joel Granados, Sagi Grimberg, Chaitanya Kulkarni,
        Guixin Liu, Xiang wangx)
      - use command_id instead of req->tag in trace_nvme_complete_rq()
        (Bean Huo)
      - various fixes for the new authentication code (Lukas Bulwahn,
        Dan Carpenter, Colin Ian King, Chaitanya Kulkarni, Hannes
        Reinecke)
      - small cleanups (Liu Song, Christoph Hellwig)
      - restore compat_ioctl support (Nick Bowler)
      - make a nvmet-tcp workqueue lockdep-safe (Sagi Grimberg)
      - enable generic interface (/dev/ngXnY) for unknown command sets
        (Joel Granados, Christoph Hellwig)
      - don't always build constants.o (Christoph Hellwig)
      - print the command name of aborted commands (Christoph Hellwig)

 - MD pull requests via Song:
      - Improve raid5 lock contention, by Logan Gunthorpe.
      - Misc fixes to raid5, by Logan Gunthorpe.
      - Fix race condition with md_reap_sync_thread(), by Guoqing Jiang.
      - Fix potential deadlock with raid5_quiesce and
        raid5_get_active_stripe, by Logan Gunthorpe.
      - Refactoring md_alloc(), by Christoph"
      - Fix md disk_name lifetime problems, by Christoph Hellwig
      - Convert prepare_to_wait() to wait_woken() api, by Logan
        Gunthorpe;
      - Fix sectors_to_do bitmap issue, by Logan Gunthorpe.

 - Work on unifying the null_blk module parameters and configfs API
   (Vincent)

 - drbd bitmap IO error fix (Lars)

 - Set of rnbd fixes (Guoqing, Md Haris)

 - Remove experimental marker on bcache async device registration (Coly)

 - Series from cleaning up the bio splitting (Christoph)

 - Removal of the sx8 block driver. This hardware never really
   widespread, and it didn't receive a lot of attention after the
   initial merge of it back in 2005 (Christoph)

 - A few fixes for s390 dasd (Eric, Jiang)

 - Followup set of fixes for ublk (Ming)

 - Support for UBLK_IO_NEED_GET_DATA for ublk (ZiyangZhang)

 - Fixes for the dio dma alignment (Keith)

 - Misc fixes and cleanups (Ming, Yu, Dan, Christophe

* tag 'for-5.20/block-2022-08-04' of git://git.kernel.dk/linux-block: (136 commits)
  s390/dasd: Establish DMA alignment
  s390/dasd: drop unexpected word 'for' in comments
  ublk_drv: add support for UBLK_IO_NEED_GET_DATA
  ublk_cmd.h: add one new ublk command: UBLK_IO_NEED_GET_DATA
  ublk_drv: cleanup ublksrv_ctrl_dev_info
  ublk_drv: add SET_PARAMS/GET_PARAMS control command
  ublk_drv: fix ublk device leak in case that add_disk fails
  ublk_drv: cancel device even though disk isn't up
  block: fix leaking page ref on truncated direct io
  block: ensure bio_iov_add_page can't fail
  block: ensure iov_iter advances for added pages
  drivers:md:fix a potential use-after-free bug
  md/raid5: Ensure batch_last is released before sleeping for quiesce
  md/raid5: Move stripe_request_ctx up
  md/raid5: Drop unnecessary call to r5c_check_stripe_cache_usage()
  md/raid5: Make is_inactive_blocked() helper
  md/raid5: Refactor raid5_get_active_stripe()
  block: pass struct queue_limits to the bio splitting helpers
  block: move bio_allowed_max_sectors to blk-merge.c
  block: move the call to get_max_io_size out of blk_bio_segment_split
  ...
2022-08-04 20:00:14 -07:00
Linus Torvalds
5264406cdb iov_iter work, part 1 - isolated cleanups and optimizations.
One of the goals is to reduce the overhead of using ->read_iter()
 and ->write_iter() instead of ->read()/->write(); new_sync_{read,write}()
 has a surprising amount of overhead, in particular inside iocb_flags().
 That's why the beginning of the series is in this pile; it's not directly
 iov_iter-related, but it's a part of the same work...
 
 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCYurGOQAKCRBZ7Krx/gZQ
 6ysyAP91lvBfMRepcxpd9kvtuzWkU8A3rfSziZZteEHANB9Q7QEAiPn2a2OjWkcZ
 uAyUWfCkHCNx+dSMkEvUgR5okQ0exAM=
 =9UCV
 -----END PGP SIGNATURE-----

Merge tag 'pull-work.iov_iter-base' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull vfs iov_iter updates from Al Viro:
 "Part 1 - isolated cleanups and optimizations.

  One of the goals is to reduce the overhead of using ->read_iter() and
  ->write_iter() instead of ->read()/->write().

  new_sync_{read,write}() has a surprising amount of overhead, in
  particular inside iocb_flags(). That's the explanation for the
  beginning of the series is in this pile; it's not directly
  iov_iter-related, but it's a part of the same work..."

* tag 'pull-work.iov_iter-base' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  first_iovec_segment(): just return address
  iov_iter: massage calling conventions for first_{iovec,bvec}_segment()
  iov_iter: first_{iovec,bvec}_segment() - simplify a bit
  iov_iter: lift dealing with maxpages out of first_{iovec,bvec}_segment()
  iov_iter_get_pages{,_alloc}(): cap the maxsize with MAX_RW_COUNT
  iov_iter_bvec_advance(): don't bother with bvec_iter
  copy_page_{to,from}_iter(): switch iovec variants to generic
  keep iocb_flags() result cached in struct file
  iocb: delay evaluation of IS_SYNC(...) until we want to check IOCB_DSYNC
  struct file: use anonymous union member for rcuhead and llist
  btrfs: use IOMAP_DIO_NOSYNC
  teach iomap_dio_rw() to suppress dsync
  No need of likely/unlikely on calls of check_copy_size()
2022-08-03 13:50:22 -07:00
Christoph Hellwig
5a97806f7d block: change the blk_queue_split calling convention
The double indirect bio leads to somewhat suboptimal code generation.
Instead return the (original or split) bio, and make sure the
request_queue arguments to the lower level helpers is passed after the
bio to avoid constant reshuffling of the argument passing registers.

Also give it and the helpers used to implement it more descriptive names.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220727162300.3089193-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02 17:22:53 -06:00
Sagi Grimberg
533d2e8b4d nvmet-tcp: fix lockdep complaint on nvmet_tcp_wq flush during queue teardown
We probably need nvmet_tcp_wq to have MEM_RECLAIM as we are
sending/receiving for the socket from works on this workqueue.
Also this eliminates lockdep complaints:
--
[ 6174.010200] workqueue: WQ_MEM_RECLAIM
nvmet-wq:nvmet_tcp_release_queue_work [nvmet_tcp] is flushing
!WQ_MEM_RECLAIM nvmet_tcp_wq:nvmet_tcp_io_work [nvmet_tcp]
[ 6174.010216] WARNING: CPU: 20 PID: 14456 at kernel/workqueue.c:2628
check_flush_dependency+0x110/0x14c

Reported-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02 17:22:53 -06:00
Joel Granados
eb867ee995 nvme: enable generic interface (/dev/ngXnY) for unknown command sets
Extend nvme_alloc_ns() and nvme_validate_ns() for unknown command-set as
well. Both are made to use a new helper (nvme_update_ns_info_cs_indep)
which is similar to nvme_update_ns_info but performs fewer operations
to get the generic interface up.

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
[hch: rebased on other refactoring patches]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Javier González <javier.gonz@samsung.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02 17:22:53 -06:00
Christoph Hellwig
1e4ea66af1 nvme: factor out a nvme_ns_is_readonly helper
Add a little helper to check if a namespace should be marked read-only
that uses a new is_readonly flag in the nvme_ns_info structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Javier González <javier.gonz@samsung.com>
Reviewed-by: Joel Granados <j.granados@samsung.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02 17:22:53 -06:00
Christoph Hellwig
1a893c2bfe nvme: refactor namespace probing
Change nvme_ns_scan to gather all information needed for generic
namespace setup into a nvme_ns_info structure.  This structure is filled
from the Command Set Idependent Identify Namespace data structure if
it is available or else the legacy Identify namespace structure.

With that everything related to the NVM command set (and the ZNS command
set derived from it) can be encapsulated in the nvme_update_ns_info_block
function while keeping the rest of the namespace probing generic.

The downside is that we now always issue two Identify Namespace calls for
each probed namespace instead of usually just a single one previously.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Javier González <javier.gonz@samsung.com>
Reviewed-by: Joel Granados <j.granados@samsung.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02 17:22:53 -06:00
Christoph Hellwig
71882e7d23 nvme: generalize the nvme_multi_css check in nvme_scan_ns
Check for multiple command set support early on an error out if is
not supported when a !NVM command set namespace is found.  This
prepares for adding command set independent passthrough support.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Javier González <javier.gonz@samsung.com>
Reviewed-by: Joel Granados <j.granados@samsung.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02 17:22:51 -06:00
Christoph Hellwig
04c170f669 nvme: rename nvme_validate_or_alloc_ns to nvme_scan_ns
This shorter name much better fits what this function does in
the scanning process.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Javier González <javier.gonz@samsung.com>
Reviewed-by: Joel Granados <j.granados@samsung.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02 17:22:51 -06:00
Christoph Hellwig
e06b425bc8 nvme: catch -ENODEV from nvme_revalidate_zones again
nvme_revalidate_zones can also return -ENODEV if e.g. zone sizes aren't
constant or not a power of two.  In that case we should jump to marking
the gendisk hidden and only support pass through.

Fixes: 602e57c979 ("nvme: also mark passthrough-only namespaces ready in nvme_update_ns_info")
Reported-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Joel Granados <j.granados@samsung.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02 17:22:51 -06:00
Lukas Bulwahn
4cf42ec366 nvmet-auth: select the intended CRYPTO_DH_RFC7919_GROUPS
Commit 71ebe3842ebe ("nvmet-auth: Diffie-Hellman key exchange support")
intends to select 'Support for RFC 7919 FFDHE group parameters' for using
FFDHE groups for NVMe In-Band Authentication.

It however selects CRYPTO_DH_GROUPS_RFC7919, instead of the intended
CRYPTO_DH_RFC7919_GROUPS; notice the swapping of words here.

Correct the select to the intended config option.

Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02 17:22:51 -06:00
Chaitanya Kulkarni
be2ada6d0e nvmet-auth: fix return value check in auth receive
nvmet_auth_challenge() return type is int and currently it uses status
variable that is of type u16 in nvmet_execute_auth_receive().

Catch the return value of nvmet_auth_challenge() into int and set the
NVME_SC_INTERNAL as status variable before we jump to error.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02 17:22:51 -06:00
Chaitanya Kulkarni
1040415c29 nvmet-auth: fix return value check in auth send
nvmet_setup_auth() return type is int and currently it uses status
variable that is of type u16 in nvmet_execute_auth_send().

Catch the return value of nvmet_setup_auth() into int and set the
NVME_SC_INTERNAL as status variable before we jump to error.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02 17:22:51 -06:00