Commit Graph

3912 Commits

Author SHA1 Message Date
Jens Axboe
170e086ad3 nvme: fix bogus kzalloc() return check in nvme_init_effects_log()
nvme_init_effects_log() returns failure when kzalloc() is successful,
which is obviously wrong and causes failures to boot. Correct the
check.
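
The inverted check described above can be illustrated with a small user-space sketch. This is not the kernel code: struct effects_log and both function names are hypothetical, and calloc() stands in for kzalloc().

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for the effects log; not the kernel structure. */
struct effects_log {
    unsigned int acs[256];
};

/* Buggy shape: reports failure when the allocation succeeded. */
int init_effects_log_buggy(struct effects_log **log)
{
    struct effects_log *effects = calloc(1, sizeof(*effects));

    if (effects) {          /* inverted test */
        free(effects);      /* avoid leaking memory in this sketch */
        return -ENOMEM;
    }
    *log = effects;
    return 0;
}

/* Corrected shape: fail only when the allocation returned NULL. */
int init_effects_log_fixed(struct effects_log **log)
{
    struct effects_log *effects = calloc(1, sizeof(*effects));

    if (!effects)
        return -ENOMEM;
    *log = effects;
    return 0;
}
```

With the buggy variant every successful allocation is turned into an -ENOMEM return, which matches the boot failures described in the message.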

Fixes: d4a95adeab ("nvme: Add error path for xa_store in nvme_init_effects")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-13 10:27:54 -07:00
Jens Axboe
9752b55035 nvme updates for Linux 6.14
Merge tag 'nvme-6.14-2025-01-12' of git://git.infradead.org/nvme into for-6.14/block

Pull NVMe updates from Keith:

"nvme updates for Linux 6.14

 - Target support for PCI-Endpoint transport (Damien)
 - TCP IO queue spreading fixes (Sagi, Chaitanya)
 - Target handling for "limited retry" flags (Guixin)
 - Poll type fix (Yongsoo)
 - Xarray storage error handling (Keisuke)
 - Host memory buffer free size fix on error (Francis)"

* tag 'nvme-6.14-2025-01-12' of git://git.infradead.org/nvme: (25 commits)
  nvme-pci: use correct size to free the hmb buffer
  nvme: Add error path for xa_store in nvme_init_effects
  nvme-pci: fix comment typo
  Documentation: Document the NVMe PCI endpoint target driver
  nvmet: New NVMe PCI endpoint function target driver
  nvmet: Implement arbitration feature support
  nvmet: Implement interrupt config feature support
  nvmet: Implement interrupt coalescing feature support
  nvmet: Implement host identifier set feature support
  nvmet: Introduce get/set_feature controller operations
  nvmet: Do not require SGL for PCI target controller commands
  nvmet: Add support for I/O queue management admin commands
  nvmet: Introduce nvmet_sq_create() and nvmet_cq_create()
  nvmet: Introduce nvmet_req_transfer_len()
  nvmet: Improve nvmet_alloc_ctrl() interface and implementation
  nvme: Add PCI transport type
  nvmet: Add drvdata field to struct nvmet_ctrl
  nvmet: Introduce nvmet_get_cmd_effects_admin()
  nvmet: Export nvmet_update_cc() and nvmet_cc_xxx() helpers
  nvmet: Add vendor_id and subsys_vendor_id subsystem attributes
  ...
2025-01-13 07:12:15 -07:00
Francis Pravin
4a324970fa nvme-pci: use correct size to free the hmb buffer
dev->host_mem_size value is updated only after the successful buffer
allocation of hmb descriptor. Otherwise, it may have some undefined value.
So, use the correct size to free the hmb buffer when the hmb descriptor
buffer allocation failed.

Signed-off-by: Francis Pravin <francis.p@samsung.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-12 14:11:29 -08:00
Keisuke Nishimura
d4a95adeab nvme: Add error path for xa_store in nvme_init_effects
The xa_store() may fail due to memory allocation failure because there
is no guarantee that the index NVME_CSI_NVM is already used. This fix
introduces a new function to handle the error path.

Fixes: cc115cbe12 ("nvme: always initialize known command effects")
Signed-off-by: Keisuke Nishimura <keisuke.nishimura@inria.fr>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-12 14:11:29 -08:00
Baruch Siach
e4a0a3058d nvme-pci: fix comment typo
envent -> event.

Signed-off-by: Baruch Siach <baruch@tkos.co.il>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-12 14:11:29 -08:00
Damien Le Moal
0faa0fe6f9 nvmet: New NVMe PCI endpoint function target driver
Implement a PCI target driver using the PCI endpoint framework. This
requires hardware with a PCI controller capable of executing in endpoint
mode.

The PCI endpoint framework is used to set up a PCI endpoint function
and its BAR compatible with a NVMe PCI controller. The framework is also
used to map local memory to the PCI address space to execute MMIO
accesses for retrieving NVMe commands from submission queues and posting
completion entries to completion queues. If supported, DMA is used for
command retrieval and command data transfers, based on the PCI address
segments indicated by the command using either PRPs or SGLs.

The NVMe target driver relies on the NVMe target core code to execute
all commands issued by the host. The PCI target driver is mainly
responsible for the following:
 - Initialization and teardown of the endpoint device and its backend
   PCI target controller. The PCI target controller is created using a
   subsystem and a port defined through configfs. The port used must be
   initialized with the "pci" transport type. The target controller is
   allocated and initialized when the PCI endpoint is started by binding
   it to the endpoint PCI device (nvmet_pci_epf_epc_init() function).

 - Manage the endpoint controller state according to the PCI link state
   and the actions of the host (e.g. checking the CC.EN register) and
   propagate these actions to the PCI target controller. Polling of the
   controller enable/disable is done using a delayed work scheduled
   every 5ms (nvmet_pci_epf_poll_cc() function). This work is started
   whenever the PCI link comes up (nvmet_pci_epf_link_up() notifier
   function) and stopped when the PCI link comes down
   (nvmet_pci_epf_link_down() notifier function).
   nvmet_pci_epf_poll_cc() enables and disables the PCI controller using
   the functions nvmet_pci_epf_enable_ctrl() and
   nvmet_pci_epf_disable_ctrl(). The controller admin queue is created
   using nvmet_pci_epf_create_cq(), which calls nvmet_cq_create(), and
   nvmet_pci_epf_create_sq() which uses nvmet_sq_create().
   nvmet_pci_epf_disable_ctrl() always resets the PCI controller to its
   initial state so that nvmet_pci_epf_enable_ctrl() can be called
   again. This ensures correct operation if, for instance, the host
   reboots causing the PCI link to be temporarily down.

 - Manage the controller admin and I/O submission queues using local
   memory. Commands are obtained from submission queues using a work
   item that constantly polls the doorbells of all submission queues
   (nvmet_pci_epf_poll_sqs() function). This work is started whenever
   the controller is enabled (nvmet_pci_epf_enable_ctrl() function) and
   stopped when the controller is disabled (nvmet_pci_epf_disable_ctrl()
   function). When new commands are submitted by the host, DMA transfers
   are used to retrieve the commands.

 - Initiate the execution of all admin and I/O commands using the target
   core code, by calling a request's execute() function. All commands are
   individually handled using a per-command work item
   (nvmet_pci_epf_iod_work() function). The overall execution of a command
   includes: initializing a struct nvmet_req request for the command,
   using nvmet_req_transfer_len() to get the command data transfer length,
   parsing the command PRPs or SGLs to get the PCI address segments of
   the command data buffer, retrieving data from the host (if the command
   is a write command), calling req->execute() to execute the command and
   transferring data to the host (for read commands).

 - Handle the completions of commands as notified by the
   ->queue_response() operation of the PCI target controller
   (nvmet_pci_epf_queue_response() function). Completed commands are
   added to a list of completed commands for their CQ. Each CQ list of
   completed commands is processed using a work item
   (nvmet_pci_epf_cq_work() function) which posts entries for the
   completed commands in the CQ memory and raises an IRQ to the host to
   signal the completion. IRQ coalescing is supported as mandated by the
   NVMe base specification for PCI controllers. Of note is that
   completion entries are transmitted to the host using MMIO, after
   mapping the completion queue memory to the host PCI address space.
   Unlike for retrieving commands from SQs, DMA is not used as it
   degrades performance due to the transfer serialization needed (which
   delays completion entries transmission).
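
The submission queue polling described in the list above comes down to comparing the tail doorbell written by the host with the head position tracked by the device. A minimal user-space sketch of that bookkeeping follows; struct sq_state and both helper names are hypothetical and are not taken from the driver.

```c
#include <stdint.h>

/* Hypothetical per-queue state; the driver's real structures differ. */
struct sq_state {
    uint16_t depth; /* number of entries in the queue (non-zero) */
    uint16_t head;  /* next entry the device will fetch */
    uint16_t tail;  /* last tail doorbell value written by the host */
};

/* Number of commands submitted by the host and not yet fetched. */
uint16_t sq_pending(const struct sq_state *sq)
{
    if (sq->tail >= sq->head)
        return (uint16_t)(sq->tail - sq->head);
    /* The tail wrapped around the end of the queue. */
    return (uint16_t)(sq->depth - sq->head + sq->tail);
}

/* Consume up to max entries, advancing head with wrap-around. */
uint16_t sq_fetch(struct sq_state *sq, uint16_t max)
{
    uint16_t n = sq_pending(sq);

    if (n > max)
        n = max;
    sq->head = (uint16_t)((sq->head + n) % sq->depth);
    return n;
}
```

In the driver the fetch step is a DMA or MMIO transfer of the command entries; the sketch only models the index arithmetic.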

The configuration of a NVMe PCI endpoint controller is done using
configfs. First the NVMe PCI target controller configuration must be
done to set up a subsystem and a port with the "pci" addr_trtype
attribute. The subsystem can be set up using a file or block device
backed namespace or using a passthrough NVMe device. After this, the
PCI endpoint can be configured and bound to the PCI endpoint controller
to start the NVMe endpoint controller.

In order to not overcomplicate this initial implementation of an
endpoint PCI target controller driver, protection information is not
supported for now. If the PCI controller port and namespace are
configured with protection information support, an error will be
returned when the controller is created and initialized when the
endpoint function is started. Protection information support will be
added in a follow-up patch series.

Using a Rock5B board (Rockchip RK3588 SoC, PCI Gen3x4 endpoint
controller) with a target PCI controller set up with 4 I/O queues and a
null_blk block device as a namespace, the maximum performance using fio
was measured at 131 KIOPS for random 4K reads and up to 2.8 GB/s
throughput. Some data points are:

Rnd read,   4KB,  QD=1, 1 job : IOPS=16.9k, BW=66.2MiB/s (69.4MB/s)
Rnd read,   4KB, QD=32, 1 job : IOPS=78.5k, BW=307MiB/s (322MB/s)
Rnd read,   4KB, QD=32, 4 jobs: IOPS=131k, BW=511MiB/s (536MB/s)
Seq read, 512KB, QD=32, 1 job : IOPS=5381, BW=2691MiB/s (2821MB/s)
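
The bandwidth figures above follow from the IOPS and the block size. A small sketch of the arithmetic, with hypothetical helper names; because fio rounds the IOPS it prints, the results agree with the listed values only approximately.

```c
/* Bandwidth in MiB/s for a given IOPS figure and block size in KiB. */
double bw_mib_per_s(double iops, double block_kib)
{
    return iops * block_kib / 1024.0;
}

/* Convert MiB/s to decimal MB/s, the unit shown in parentheses above. */
double mib_to_mb(double mib)
{
    return mib * 1048576.0 / 1000000.0;
}
```

For example, 5381 IOPS at 512 KiB is about 2690.5 MiB/s, and 131k IOPS at 4 KiB is about 511.7 MiB/s, in line with the 2691MiB/s and 511MiB/s data points.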

The NVMe PCI endpoint target driver is not intended for production use.
It is a tool for learning NVMe, exploring existing features and testing
implementations of new NVMe features.

Co-developed-by: Rick Wertenbroek <rick.wertenbroek@gmail.com>
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Tested-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Reviewed-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-10 19:30:49 -08:00
Damien Le Moal
a0ed77d4c9 nvmet: Implement arbitration feature support
NVMe base specification v2.1 mandates support for the arbitration
feature (NVME_FEAT_ARBITRATION). Introduce the data structure
struct nvmet_feat_arbitration to define the high, medium and low
priority weight fields and the arbitration burst field of this feature
and implement the functions nvmet_get_feat_arbitration() and
nvmet_set_feat_arbitration() to get and set these fields.

Since there is no generic way to implement support for the arbitration
feature, these functions respectively use the controller get_feature()
and set_feature() operations to process the feature with the help of
the controller driver. If the controller driver does not implement these
operations and a get feature command or a set feature command for this
feature is received, the command is failed with an invalid field error.
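
As an illustration of the fields involved, the sketch below packs and unpacks the arbitration values in command dword 11. The layout used here (burst in bits 2:0, low, medium and high priority weights in bits 15:8, 23:16 and 31:24) is my reading of the NVMe base specification; struct feat_arbitration and the function names are hypothetical and do not reproduce struct nvmet_feat_arbitration.

```c
#include <stdint.h>

/* Hypothetical mirror of the fields named in the commit message. */
struct feat_arbitration {
    uint8_t hpw; /* high priority weight */
    uint8_t mpw; /* medium priority weight */
    uint8_t lpw; /* low priority weight */
    uint8_t ab;  /* arbitration burst (3 bits) */
};

/* Decode the arbitration fields from command dword 11. */
struct feat_arbitration arbitration_from_cdw11(uint32_t cdw11)
{
    struct feat_arbitration arb = {
        .hpw = (uint8_t)((cdw11 >> 24) & 0xff),
        .mpw = (uint8_t)((cdw11 >> 16) & 0xff),
        .lpw = (uint8_t)((cdw11 >> 8) & 0xff),
        .ab  = (uint8_t)(cdw11 & 0x7),
    };

    return arb;
}

/* Encode the arbitration fields back into a dword 11 value. */
uint32_t arbitration_to_cdw11(struct feat_arbitration arb)
{
    return ((uint32_t)arb.hpw << 24) | ((uint32_t)arb.mpw << 16) |
           ((uint32_t)arb.lpw << 8) | (uint32_t)(arb.ab & 0x7);
}
```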

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Rick Wertenbroek <rick.wertenbroek@gmail.com>
Tested-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-10 19:30:49 -08:00
Damien Le Moal
f1ecd491b6 nvmet: Implement interrupt config feature support
The NVMe base specifications v2.1 mandate supporting the interrupt
config feature (NVME_FEAT_IRQ_CONFIG) for PCI controllers. Introduce the
data structure struct nvmet_feat_irq_config to define the coalescing
disabled (cd) and interrupt vector (iv) fields of this feature and
implement the functions nvmet_get_feat_irq_config() and
nvmet_set_feat_irq_config() to get and set these fields. These
functions respectively use the controller get_feature() and
set_feature() operations to fill and handle the fields of struct
nvmet_feat_irq_config.

Support for this feature is prohibited for fabrics controllers. If a get
feature command or a set feature command for this feature is received
for a fabrics controller, the command is failed with an invalid field
error.
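
A minimal sketch of decoding the two fields from command dword 11, assuming the layout I understand the NVMe base specification to define (interrupt vector in bits 15:0, coalescing disable in bit 16). The struct and function names are hypothetical, not those of the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirror of the iv and cd fields. */
struct feat_irq_config {
    uint16_t iv; /* interrupt vector */
    bool cd;     /* coalescing disabled for this vector */
};

/* Decode the interrupt vector configuration from command dword 11. */
struct feat_irq_config irq_config_from_cdw11(uint32_t cdw11)
{
    struct feat_irq_config cfg = {
        .iv = (uint16_t)(cdw11 & 0xffff),
        .cd = ((cdw11 >> 16) & 0x1) != 0,
    };

    return cfg;
}
```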

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Rick Wertenbroek <rick.wertenbroek@gmail.com>
Tested-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-10 19:30:49 -08:00
Damien Le Moal
89b94a6cbe nvmet: Implement interrupt coalescing feature support
The NVMe base specifications v2.1 mandate supporting the interrupt
coalescing feature (NVME_FEAT_IRQ_COALESCE) for PCI controllers.
Introduce the data structure struct nvmet_feat_irq_coalesce to define
the time and threshold (thr) fields of this feature and implement the
functions nvmet_get_feat_irq_coalesce() and
nvmet_set_feat_irq_coalesce() to get and set this feature. These
functions respectively use the controller get_feature() and
set_feature() operations to fill and handle the fields of struct
nvmet_feat_irq_coalesce.

While the Linux kernel nvme driver does not use this feature and thus
will not complain if it is not implemented, other major OSes fail
initializing the NVMe device if this feature support is missing.

Support for this feature is prohibited for fabrics controllers. If a get
feature or set feature command for this feature is received for a
fabrics controller, the command is failed with an invalid field error.
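
A sketch of the time and threshold fields, assuming the dword 11 layout I understand the NVMe base specification to define: aggregation threshold in bits 7:0 and aggregation time in bits 15:8, with the time expressed in 100 microsecond increments. The names below are hypothetical.

```c
#include <stdint.h>

/* Hypothetical mirror of the time and thr fields. */
struct feat_irq_coalesce {
    uint8_t thr;  /* aggregation threshold */
    uint8_t time; /* aggregation time, in 100 us units */
};

/* Decode the coalescing settings from command dword 11. */
struct feat_irq_coalesce irq_coalesce_from_cdw11(uint32_t cdw11)
{
    struct feat_irq_coalesce irqc = {
        .thr  = (uint8_t)(cdw11 & 0xff),
        .time = (uint8_t)((cdw11 >> 8) & 0xff),
    };

    return irqc;
}

/* Aggregation time in microseconds. */
uint32_t irq_coalesce_time_us(struct feat_irq_coalesce irqc)
{
    return (uint32_t)irqc.time * 100u;
}
```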

Suggested-by: Rick Wertenbroek <rick.wertenbroek@gmail.com>
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Rick Wertenbroek <rick.wertenbroek@gmail.com>
Tested-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-10 19:30:48 -08:00
Damien Le Moal
2f2b20fad9 nvmet: Implement host identifier set feature support
The NVMe specifications mandate support for the host identifier
set_features for controllers that also support reservations. Satisfy
this requirement by implementing handling of the NVME_FEAT_HOST_ID
feature for the nvme_set_features command. This implementation is for
now effective only for PCI target controllers. For other controller
types, the set features command is failed with a NVME_SC_CMD_SEQ_ERROR
status as before.

As noted in the code, 128-bit host identifiers are supported since the
NVMe base specifications version 2.1 indicate in section 5.1.25.1.28.1
that "The controller may support a 64-bit Host Identifier...".

The RHII (Reservations and Host Identifier Interaction) bit of the
controller attribute (ctratt) field of the identify controller data is
also set to indicate that a host ID of "0" is supported but that the
host ID must be a non-zero value to use reservations.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Rick Wertenbroek <rick.wertenbroek@gmail.com>
Tested-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-10 19:30:48 -08:00
Damien Le Moal
08461535a9 nvmet: Introduce get/set_feature controller operations
The implementation of some features cannot always be done generically by
the target core code. Arbitration and IRQ coalescing features are
examples of such features: their implementation must be provided (at
least partially) by the target controller driver.

Introduce the set_feature() and get_feature() controller fabrics
operations (in struct nvmet_fabrics_ops) to allow supporting such
features.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Rick Wertenbroek <rick.wertenbroek@gmail.com>
Tested-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-10 19:30:48 -08:00
Damien Le Moal
1ad8630ffa nvmet: Do not require SGL for PCI target controller commands
Support for SGL is optional for the PCI transport. Modify
nvmet_req_init() to not require the NVME_CMD_SGL_METABUF command flag to
be set if the target controller transport type is NVMF_TRTYPE_PCI.
In addition to this, the NVMe base specification v2.1 mandates that all
admin commands use PRP, that is, have CDW0.PSDT cleared to 0. Modify
nvmet_parse_admin_cmd() to check this.

Finally, modify nvmet_check_transfer_len() and
nvmet_check_data_len_lte() to return the appropriate error status
depending on whether the command uses SGL or PRP. Since nvmet_req_init()
always checks that a command uses SGL for fabrics, this change affects
only PCI target controllers.
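
The checks described above can be sketched as follows. PSDT occupies CDW0 bits 15:14, which correspond to bits 7:6 of the command flags byte. The macro and function names are hypothetical, and the logic is a simplified illustration, not the code of nvmet_req_init() or nvmet_parse_admin_cmd().

```c
#include <stdbool.h>
#include <stdint.h>

#define FLAGS_PSDT_MASK   0xc0u /* hypothetical name: flags bits 7:6 */
#define FLAGS_SGL_METABUF 0x40u /* flags bit 6, as in NVME_CMD_SGL_METABUF */

/* Admin commands on a PCI controller must use PRPs (PSDT cleared to 0). */
bool pci_admin_cmd_psdt_ok(uint8_t flags)
{
    return (flags & FLAGS_PSDT_MASK) == 0;
}

/* Request init check: SGL is required except for PCI target controllers. */
bool req_data_pointer_ok(uint8_t flags, bool is_pci_ctrl)
{
    if (is_pci_ctrl)
        return true; /* SGL support is optional for the PCI transport */
    return (flags & FLAGS_SGL_METABUF) != 0;
}
```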

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Rick Wertenbroek <rick.wertenbroek@gmail.com>
Tested-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-10 19:30:48 -08:00
Damien Le Moal
60d3cd8561 nvmet: Add support for I/O queue management admin commands
The I/O submission queue management admin commands
(nvme_admin_delete_sq, nvme_admin_create_sq, nvme_admin_delete_cq,
and nvme_admin_create_cq) are mandatory admin commands for I/O
controllers using the PCI transport, that is, support for these commands
is mandatory for a PCI target I/O controller.

Implement support for these commands by adding the functions
nvmet_execute_delete_sq(), nvmet_execute_create_sq(),
nvmet_execute_delete_cq() and nvmet_execute_create_cq() to set as the
execute method of requests for these commands. These functions will
return an invalid opcode error for any controller that is not a PCI
target controller. Support for the I/O queue management commands is also
reported in the command effect log of PCI target controllers (using
nvmet_get_cmd_effects_admin()).

Each management command is backed by a controller fabric operation
that can be defined by a PCI target controller driver to set up I/O
queues using nvmet_sq_create() and nvmet_cq_create() or delete I/O
queues using nvmet_sq_destroy().

As noted in a comment in nvmet_execute_create_sq(), we do not yet
support sharing a single CQ between multiple SQs.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Rick Wertenbroek <rick.wertenbroek@gmail.com>
Tested-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-10 19:30:48 -08:00
Damien Le Moal
1eb380caf5 nvmet: Introduce nvmet_sq_create() and nvmet_cq_create()
Introduce the new functions nvmet_sq_create() and nvmet_cq_create() to
allow a target driver to initialize and setup admin and IO queues
directly, without needing to execute connect fabrics commands.
The helper functions nvmet_check_cqid() and nvmet_check_sqid() are
implemented to check the correctness of SQ and CQ IDs when
nvmet_sq_create() and nvmet_cq_create() are called.

nvmet_sq_create() and nvmet_cq_create() are primarily intended for use
with PCI target controller drivers and thus are not well integrated
with the current queue creation of fabrics controllers using the connect
command. These fabrics drivers are not modified to use these functions.
This simple implementation of SQ and CQ management for PCI target
controller drivers does not allow multiple SQs to share the same CQ,
similarly to other fabrics transports. This is a specification
violation. A more involved set of changes will follow to add support for
this required completion queue sharing feature.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Rick Wertenbroek <rick.wertenbroek@gmail.com>
Tested-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-10 19:30:48 -08:00
Damien Le Moal
43043c9b97 nvmet: Introduce nvmet_req_transfer_len()
Add the new function nvmet_req_transfer_len() to parse a request command
to extract the transfer length of the command. This function
implementation relies on multiple helper functions for parsing I/O
commands (nvmet_io_cmd_transfer_len()), admin commands
(nvmet_admin_cmd_data_len()) and fabrics connect commands
(nvmet_connect_cmd_data_len()).

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Rick Wertenbroek <rick.wertenbroek@gmail.com>
Tested-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-10 19:30:47 -08:00
Damien Le Moal
6202783184 nvmet: Improve nvmet_alloc_ctrl() interface and implementation
Introduce struct nvmet_alloc_ctrl_args to define the arguments for
the function nvmet_alloc_ctrl() to avoid the need for passing a pointer
to a struct nvmet_req as an argument. This new data structure aggregates
together the arguments that were passed to nvmet_alloc_ctrl()
(subsysnqn, hostnqn and kato), together with the struct nvmet_req fields
used by nvmet_alloc_ctrl(), that is, the fields port, p2p_client, and
ops as input and the result and error_loc fields as output, as well as a
status field. nvmet_alloc_ctrl() is also changed to return a pointer
to the allocated and initialized controller structure instead of a
status code, as the status is now returned through the status field of
struct nvmet_alloc_ctrl_args.

The function nvmet_setup_p2p_ns_map() is changed to not take a pointer
to a struct nvmet_req as argument, instead, directly specify the
p2p_client device pointer needed as argument.

The code in nvmet_execute_admin_connect() that initializes a new target
controller after allocating it is moved into nvmet_alloc_ctrl().
The code that sets up an admin queue for the controller (and the call
to nvmet_install_queue()) remains in nvmet_execute_admin_connect().

Finally, nvmet_alloc_ctrl() is also exported to allow target drivers to
use this function directly to allocate and initialize a new controller
structure without the need to rely on a fabrics connect command request.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Rick Wertenbroek <rick.wertenbroek@gmail.com>
Tested-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-10 19:30:47 -08:00
Damien Le Moal
200adac758 nvme: Add PCI transport type
Define the transport type NVMF_TRTYPE_PCI for PCI endpoint targets.
This transport type is defined using the value 0 which is reserved in
the NVMe base specifications v2.1 (Figure 294). Given that struct
nvmet_port are zeroed out on creation, to avoid having this transport
type becoming the new default, nvmet_referral_make() and
nvmet_ports_make() are modified to initialize a port discovery address
transport type field (disc_addr.trtype) to NVMF_TRTYPE_MAX.

Any port using this transport type is also skipped and not reported in
the discovery log page (nvmet_execute_disc_get_log_page()).

The helper function nvmet_is_pci_ctrl() is also introduced to check if
a target controller uses the PCI transport.
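
The reason for the explicit initialization can be seen in a small sketch: once the value 0 means "PCI", a zeroed port structure would silently select it. All names below are hypothetical; in particular TRTYPE_UNSET only stands in for NVMF_TRTYPE_MAX and its numeric value is an assumption of this sketch.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical transport types; only "PCI is 0" comes from the text. */
enum trtype {
    TRTYPE_PCI = 0,
    TRTYPE_RDMA = 1,
    TRTYPE_TCP = 3,
    TRTYPE_UNSET = 255, /* stand-in for NVMF_TRTYPE_MAX */
};

struct port {
    enum trtype trtype;
};

/* Zero the port, then mark the transport type as not yet configured. */
void port_init(struct port *port)
{
    memset(port, 0, sizeof(*port));
    port->trtype = TRTYPE_UNSET; /* without this, the port reads as PCI */
}

bool port_is_pci(const struct port *port)
{
    return port->trtype == TRTYPE_PCI;
}

/* PCI ports are skipped when building the discovery log page. */
bool port_in_discovery_log(const struct port *port)
{
    return !port_is_pci(port);
}
```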

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Rick Wertenbroek <rick.wertenbroek@gmail.com>
Tested-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-10 19:30:47 -08:00
Damien Le Moal
35c593e530 nvmet: Add drvdata field to struct nvmet_ctrl
Allow a target driver to attach private data to a target controller by
adding the new field drvdata to struct nvmet_ctrl.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Rick Wertenbroek <rick.wertenbroek@gmail.com>
Tested-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-10 19:30:47 -08:00
Damien Le Moal
1ee4531054 nvmet: Introduce nvmet_get_cmd_effects_admin()
In order to have a logically better organized implementation of the
effects log page, split out reporting the supported admin commands from
nvmet_get_cmd_effects_nvm() into the new function
nvmet_get_cmd_effects_admin().

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Rick Wertenbroek <rick.wertenbroek@gmail.com>
Tested-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-10 19:30:47 -08:00
Damien Le Moal
15e9d26445 nvmet: Export nvmet_update_cc() and nvmet_cc_xxx() helpers
Make the function nvmet_update_cc() available to target drivers by
exporting it. To also facilitate the manipulation of the cc register
bits, move the inline helper functions nvmet_cc_en(), nvmet_cc_css(),
nvmet_cc_mps(), nvmet_cc_ams(), nvmet_cc_shn(), nvmet_cc_iosqes(), and
nvmet_cc_iocqes() from core.c to nvmet.h so that these functions can be
reused in target controller drivers.
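
For reference, a user-space sketch of what such helpers compute, based on the controller configuration (CC) register layout as I understand it from the NVMe base specification (EN bit 0, CSS bits 6:4, MPS bits 10:7, AMS bits 13:11, SHN bits 15:14, IOSQES bits 19:16, IOCQES bits 23:20). The function names are hypothetical and the bodies are not copied from nvmet.h.

```c
#include <stdbool.h>
#include <stdint.h>

/* Each helper extracts one field of the 32-bit CC register value. */
bool cc_en(uint32_t cc)        { return (cc & 0x1) != 0; }
uint8_t cc_css(uint32_t cc)    { return (uint8_t)((cc >> 4) & 0x7); }
uint8_t cc_mps(uint32_t cc)    { return (uint8_t)((cc >> 7) & 0xf); }
uint8_t cc_ams(uint32_t cc)    { return (uint8_t)((cc >> 11) & 0x7); }
uint8_t cc_shn(uint32_t cc)    { return (uint8_t)((cc >> 14) & 0x3); }
uint8_t cc_iosqes(uint32_t cc) { return (uint8_t)((cc >> 16) & 0xf); }
uint8_t cc_iocqes(uint32_t cc) { return (uint8_t)((cc >> 20) & 0xf); }
```

A value of 0x00460001 decodes as enabled, with 64-byte submission queue entries (IOSQES = 6) and 16-byte completion queue entries (IOCQES = 4), since both sizes are expressed as powers of two.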

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Rick Wertenbroek <rick.wertenbroek@gmail.com>
Tested-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-10 19:30:47 -08:00
Damien Le Moal
5d4f4ea8fa nvmet: Add vendor_id and subsys_vendor_id subsystem attributes
Define the new vendor_id and subsys_vendor_id configfs attributes for
target subsystems. These attributes are respectively reported as the
vid field and as the ssvid field of the identify controller data of
target controllers using the subsystem for which these attributes
are set.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Rick Wertenbroek <rick.wertenbroek@gmail.com>
Tested-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-10 19:30:46 -08:00
Damien Le Moal
30e77e0fbe nvme: Move opcode string helper functions declarations
Move the declaration of all helper functions converting NVMe command
opcodes and status codes into strings from drivers/nvme/host/nvme.h
into include/linux/nvme.h, together with the commands definitions.
This allows NVMe target drivers to call these functions without having
to include a host header file.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Rick Wertenbroek <rick.wertenbroek@gmail.com>
Tested-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-10 19:30:46 -08:00
Yongsoo Joo
002bb02729 nvme: change return type of nvme_poll_cq() to bool
The nvme_poll_cq() function currently returns the number of CQEs
found. However, only one caller, nvme_poll(), requires a boolean
value to check whether any CQE was completed. The other callers do
not use the return value at all.

To better reflect its usage, update the return type of nvme_poll_cq()
from int to bool.

Signed-off-by: Yongsoo Joo <ysjoo@kookmin.ac.kr>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-10 19:30:46 -08:00
Keisuke Nishimura
ac32057acc nvme: Add error check for xa_store in nvme_get_effects_log
The xa_store() may fail due to memory allocation failure because there
is no guarantee that the index csi is already used. This fix adds an
error check of the return value of xa_store() in nvme_get_effects_log().

Fixes: 1cf7a12e09 ("nvme: use an xarray to lookup the Commands Supported and Effects log")
Signed-off-by: Keisuke Nishimura <keisuke.nishimura@inria.fr>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-10 19:30:46 -08:00
Sagi Grimberg
3219378987 nvme-tcp: Fix I/O queue cpu spreading for multiple controllers
Since day one we have been assigning the queue io_cpu very naively. We
always take the queue id (controller scope) and assign it its matching
cpu from the online mask. This works fine when the number of queues
matches the number of cpu cores.

The problem starts when we have fewer queues than cpu cores. First, we
should take into account the mq_map and select a cpu within the cpus
that are assigned to this queue by the mq_map in order to minimize cross
numa cpu bouncing.

Second, even worse is that we don't take into account that multiple
controllers may have assigned queues to a given cpu. As a result we may
simply compound more and more queues on the same set of cpus, which is
suboptimal.

We fix this by introducing global per-cpu counters that track the
number of queues assigned to each cpu, and we select the least used cpu
based on the mq_map and the per-cpu counters, and assign it as the queue
io_cpu.

The behavior for a single controller is slightly optimized by selecting
better cpu candidates by consulting with the mq_map, and multiple
controllers are spreading queues among cpu cores much better, resulting
in lower average cpu load, and less likelihood to hit hotspots.

Note that the accounting is not 100% perfect, but we don't need to be,
we're simply putting our best effort to select the best candidate cpu
core that we find at any given point.

Another byproduct is that every controller reset/reconnect may change
the queues io_cpu mapping, based on the current LRU accounting scheme.

Here is the baseline queue io_cpu assignment for 4 controllers, 2 queues
per controller, and 4 cpus on the host:
nvme1: queue 0: using cpu 0
nvme1: queue 1: using cpu 1
nvme2: queue 0: using cpu 0
nvme2: queue 1: using cpu 1
nvme3: queue 0: using cpu 0
nvme3: queue 1: using cpu 1
nvme4: queue 0: using cpu 0
nvme4: queue 1: using cpu 1

And this is the fixed io_cpu assignment:
nvme1: queue 0: using cpu 0
nvme1: queue 1: using cpu 2
nvme2: queue 0: using cpu 1
nvme2: queue 1: using cpu 3
nvme3: queue 0: using cpu 0
nvme3: queue 1: using cpu 2
nvme4: queue 0: using cpu 1
nvme4: queue 1: using cpu 3
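
The selection scheme can be sketched in user space as below. This is an illustration under stated assumptions, not the nvme-tcp code: the counter array, the bitmask representation of the mq_map and the function name are all hypothetical, and ties are broken in favor of the lowest-numbered cpu.

```c
#include <limits.h>

#define NR_CPUS 4 /* matches the 4-cpu example; an assumption of this sketch */

/* Number of queues currently assigned to each cpu, across controllers. */
static unsigned int cpu_queue_count[NR_CPUS];

/*
 * Pick the least-used cpu among those allowed by cpu_mask (bit n set means
 * cpu n is mapped to this queue), then account for the new assignment.
 * Returns -1 if the mask selects no cpu.
 */
int select_io_cpu(unsigned int cpu_mask)
{
    unsigned int min = UINT_MAX;
    int cpu, best = -1;

    for (cpu = 0; cpu < NR_CPUS; cpu++) {
        if (!(cpu_mask & (1u << cpu)))
            continue;
        if (cpu_queue_count[cpu] < min) {
            min = cpu_queue_count[cpu];
            best = cpu;
        }
    }
    if (best >= 0)
        cpu_queue_count[best]++;
    return best;
}
```

Assuming queue 0 maps to cpus 0-1 (mask 0x3) and queue 1 maps to cpus 2-3 (mask 0xc), calling select_io_cpu() for the two queues of four controllers in turn yields cpus 0, 2, 1, 3, 0, 2, 1, 3, which is the fixed assignment listed above.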

Fixes: 3f2304f8c6 ("nvme-tcp: add NVMe over TCP host driver")
Suggested-by: Hannes Reinecke <hare@kernel.org>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
[fixed kbuild reported errors]
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-10 19:30:04 -08:00
Christoph Hellwig
473106dd3a nvme: fix queue freeze vs limits lock order
Match the locking order used by the core block code by only freezing
the queue after taking the limits lock.

Unlike most queue updates this does not use the
queue_limits_commit_update_frozen helper as the nvme driver wants the
queue frozen for more than just the limits update.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250110054726.1499538-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-10 07:29:24 -07:00
Guixin Liu
3ec5c62cfc nvmet: handle rw's limited retry flag
In some scenarios, multipath software sets the REQ_FAILFAST_DEV flag
on I/O to prevent retries and switch immediately to other paths for
issuing I/O commands. This is reflected in the NVMe read and write
commands as the limited retry flag.

However, the current NVMe target side does not handle the limited
retry flag, and the target's underlying driver still retries the
I/O. This will result in the I/O not being quickly switched to
other paths, ultimately leading to increased I/O latency.

When the nvme target receives an rw command with the limited retry flag,
handle it in the block backend by setting the REQ_FAILFAST_DEV flag on
the bio.
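
The mapping described above amounts to a single flag translation. The
following is a hedged Python model rather than the target code:
bio_flags_for_rw is a hypothetical helper, and both constants are
illustrative stand-ins (the limited retry bit position is assumed, and
the REQ_FAILFAST_DEV value is not taken from the kernel headers).

```python
# Assumed bit position of the Limited Retry flag in the rw command's
# control field; illustrative only.
NVME_RW_LR = 1 << 15
# Illustrative stand-in for the block layer flag; not the real value.
REQ_FAILFAST_DEV = 1 << 8


def bio_flags_for_rw(control, opf):
    """Return the bio flags to use for an rw command.

    If the host set the limited retry flag, ask the backend device not
    to retry by adding REQ_FAILFAST_DEV; otherwise leave `opf` untouched.
    """
    if control & NVME_RW_LR:
        opf |= REQ_FAILFAST_DEV
    return opf
```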

Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-07 08:05:19 -08:00
Christoph Hellwig
e7602bb4f3 block: remove BLK_MQ_F_NO_SCHED
The only queues that really can't support a scheduler are those that
do not have a gendisk associated with them, and thus can't be used for
non-passthrough commands.  In addition to those null_blk can optionally
set the flag, which is a bit odd.  Replace the null_blk usage with
BLK_MQ_F_NO_SCHED_BY_DEFAULT to keep the expected semantics and then
remove BLK_MQ_F_NO_SCHED as the non-disk queues never call into
elevator_init_mq or blk_register_queue which adds the sysfs attributes.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250106083531.799976-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-06 07:37:41 -07:00
Christoph Hellwig
6aeb4f8364 block: remove bio_add_pc_page
Lift bio_split_rw_at into blk_rq_append_bio so that it validates the
hardware limits.  With this, all passthrough callers can simply use
bio_add_page to build the bio and delay checking for exceeding of limits
to this point instead of doing it for each page.

While this looks like adding a new expensive loop over all bio_vecs,
blk_rq_append_bio is already doing that just to count the number of
segments.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Link: https://lore.kernel.org/r/20250103073417.459715-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-04 15:27:35 -07:00
Jens Axboe
cc0331e29f nvme fixes for Linux 6.13

Merge tag 'nvme-6.13-2024-12-31' of git://git.infradead.org/nvme into block-6.13

Pull NVMe fixes from Keith:

"nvme fixes for Linux 6.13

 - Fix device specific quirk for PRP list alignment (Robert)
 - Fix target name overflow (Leo)
 - Fix target write granularity (Luis)
 - Fix target sleeping in atomic context (Nilay)
 - Remove unnecessary tcp queue teardown (Chunguang)"

* tag 'nvme-6.13-2024-12-31' of git://git.infradead.org/nvme:
  nvme-tcp: remove nvme_tcp_destroy_io_queues()
  nvmet-loop: avoid using mutex in IO hotpath
  nvmet: propagate npwg topology
  nvmet: Don't overflow subsysnqn
  nvme-pci: 512 byte aligned dma pool segment quirk
2024-12-31 10:41:58 -07:00
Chunguang.xu
36e3b1f9ab nvme-tcp: remove nvme_tcp_destroy_io_queues()
Currently, destroying the IO queues calls nvme_tcp_stop_io_queues()
twice; the call in nvme_tcp_destroy_io_queues() is unnecessary. Remove
nvme_tcp_destroy_io_queues() and merge it into
nvme_tcp_teardown_io_queues() to simplify the code, align with
nvme-rdma, and make it easier to maintain.

Signed-off-by: Chunguang.xu <chunguang.xu@shopee.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-12-27 13:33:48 -08:00
Nilay Shroff
74d16965d7 nvmet-loop: avoid using mutex in IO hotpath
Using a mutex in the IO hot path triggers the kernel BUG "sleeping while
atomic". Shinichiro [1] first encountered this issue while running
blktests nvme/052, as shown below:

BUG: sleeping function called from invalid context at kernel/locking/mutex.c:585
in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 996, name: (udev-worker)
preempt_count: 0, expected: 0
RCU nest depth: 1, expected: 0
2 locks held by (udev-worker)/996:
 #0: ffff8881004570c8 (mapping.invalidate_lock){.+.+}-{3:3}, at: page_cache_ra_unbounded+0x155/0x5c0
 #1: ffffffff8607eaa0 (rcu_read_lock){....}-{1:2}, at: blk_mq_flush_plug_list+0xa75/0x1950
CPU: 2 UID: 0 PID: 996 Comm: (udev-worker) Not tainted 6.12.0-rc3+ #339
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0x6a/0x90
 __might_resched.cold+0x1f7/0x23d
 ? __pfx___might_resched+0x10/0x10
 ? vsnprintf+0xdeb/0x18f0
 __mutex_lock+0xf4/0x1220
 ? nvmet_subsys_nsid_exists+0xb9/0x150 [nvmet]
 ? __pfx_vsnprintf+0x10/0x10
 ? __pfx___mutex_lock+0x10/0x10
 ? snprintf+0xa5/0xe0
 ? xas_load+0x1ce/0x3f0
 ? nvmet_subsys_nsid_exists+0xb9/0x150 [nvmet]
 nvmet_subsys_nsid_exists+0xb9/0x150 [nvmet]
 ? __pfx_nvmet_subsys_nsid_exists+0x10/0x10 [nvmet]
 nvmet_req_find_ns+0x24e/0x300 [nvmet]
 nvmet_req_init+0x694/0xd40 [nvmet]
 ? blk_mq_start_request+0x11c/0x750
 ? nvme_setup_cmd+0x369/0x990 [nvme_core]
 nvme_loop_queue_rq+0x2a7/0x7a0 [nvme_loop]
 ? __pfx___lock_acquire+0x10/0x10
 ? __pfx_nvme_loop_queue_rq+0x10/0x10 [nvme_loop]
 __blk_mq_issue_directly+0xe2/0x1d0
 ? __pfx___blk_mq_issue_directly+0x10/0x10
 ? blk_mq_request_issue_directly+0xc2/0x140
 blk_mq_plug_issue_direct+0x13f/0x630
 ? lock_acquire+0x2d/0xc0
 ? blk_mq_flush_plug_list+0xa75/0x1950
 blk_mq_flush_plug_list+0xa9d/0x1950
 ? __pfx_blk_mq_flush_plug_list+0x10/0x10
 ? __pfx_mpage_readahead+0x10/0x10
 __blk_flush_plug+0x278/0x4d0
 ? __pfx___blk_flush_plug+0x10/0x10
 ? lock_release+0x460/0x7a0
 blk_finish_plug+0x4e/0x90
 read_pages+0x51b/0xbc0
 ? __pfx_read_pages+0x10/0x10
 ? lock_release+0x460/0x7a0
 page_cache_ra_unbounded+0x326/0x5c0
 force_page_cache_ra+0x1ea/0x2f0
 filemap_get_pages+0x59e/0x17b0
 ? __pfx_filemap_get_pages+0x10/0x10
 ? lock_is_held_type+0xd5/0x130
 ? __pfx___might_resched+0x10/0x10
 ? find_held_lock+0x2d/0x110
 filemap_read+0x317/0xb70
 ? up_write+0x1ba/0x510
 ? __pfx_filemap_read+0x10/0x10
 ? inode_security+0x54/0xf0
 ? selinux_file_permission+0x36d/0x420
 blkdev_read_iter+0x143/0x3b0
 vfs_read+0x6ac/0xa20
 ? __pfx_vfs_read+0x10/0x10
 ? __pfx_vm_mmap_pgoff+0x10/0x10
 ? __pfx___seccomp_filter+0x10/0x10
 ksys_read+0xf7/0x1d0
 ? __pfx_ksys_read+0x10/0x10
 do_syscall_64+0x93/0x180
 ? lockdep_hardirqs_on_prepare+0x16d/0x400
 ? do_syscall_64+0x9f/0x180
 ? lockdep_hardirqs_on+0x78/0x100
 ? do_syscall_64+0x9f/0x180
 ? lockdep_hardirqs_on_prepare+0x16d/0x400
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f565bd1ce11
Code: 00 48 8b 15 09 90 0d 00 f7 d8 64 89 02 b8 ff ff ff ff eb bd e8 d0 ad 01 00 f3 0f 1e fa 80 3d 35 12 0e 00 00 74 13 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 4f c3 66 0f 1f 44 00 00 55 48 89 e5 48 83 ec
RSP: 002b:00007ffd6e7a20c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007f565bd1ce11
RDX: 0000000000001000 RSI: 00007f565babb000 RDI: 0000000000000014
RBP: 00007ffd6e7a2130 R08: 00000000ffffffff R09: 0000000000000000
R10: 0000556000bfa610 R11: 0000000000000246 R12: 000000003ffff000
R13: 0000556000bfa5b0 R14: 0000000000000e00 R15: 0000556000c07328
 </TASK>

The above issue is caused by taking a mutex while we're in the IO hot
path. It's a regression introduced by commit 505363957f
("nvmet: fix nvme status code when namespace is disabled"). The mutex
->su_mutex is used to find whether a disabled nsid exists in the config
group or not. This is to differentiate between a nsid that is disabled
vs non-existent.

To address the above issue, we've worked on a fix[2] where we now
insert the nsid in the subsys Xarray as soon as it's created under the
config group, and later, when that nsid is enabled, we add an Xarray mark
on it and set ns->enabled to true. The Xarray mark is useful when we need
to loop through all enabled namespaces under a subsystem using the
xa_for_each_marked() API. If the nsid is later disabled, we clear the
Xarray mark from it and also set ns->enabled to false. Only when the nsid
is deleted from the config group do we delete it from the Xarray.

So with this change, we can easily differentiate a nsid that is disabled
(i.e. Xarray entry for ns exists but ns->enabled is set to false) from
one that is non-existent (i.e. Xarray entry for ns doesn't exist).
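
The resulting three-state lookup can be sketched as follows. This is a
userspace Python model under stated assumptions, not the nvmet code: a
plain dict stands in for the subsystem Xarray, the "enabled" field
stands in for both ns->enabled and the Xarray mark, and Subsys, NsState
and the method names are hypothetical.

```python
from enum import Enum


class NsState(Enum):
    NONEXISTENT = "nonexistent"
    DISABLED = "disabled"
    ENABLED = "enabled"


class Subsys:
    def __init__(self):
        # nsid -> {"enabled": bool}; stand-in for the subsystem Xarray.
        self.namespaces = {}

    def create(self, nsid):
        # Entry is inserted as soon as the nsid is created, still disabled.
        self.namespaces[nsid] = {"enabled": False}

    def enable(self, nsid):
        self.namespaces[nsid]["enabled"] = True

    def disable(self, nsid):
        # The entry stays in place; only the enabled state is cleared.
        self.namespaces[nsid]["enabled"] = False

    def delete(self, nsid):
        # Only deletion removes the entry itself.
        del self.namespaces[nsid]

    def lookup(self, nsid):
        """Classify an nsid without taking any sleeping lock."""
        ns = self.namespaces.get(nsid)
        if ns is None:
            return NsState.NONEXISTENT
        return NsState.ENABLED if ns["enabled"] else NsState.DISABLED

    def enabled_nsids(self):
        """Model of iterating only the marked (enabled) entries."""
        return sorted(n for n, ns in self.namespaces.items() if ns["enabled"])
```

The point of the design is that the hot-path lookup only consults the
one data structure, so it no longer needs the config group mutex to
tell "disabled" apart from "never created".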

Link: https://lore.kernel.org/linux-nvme/20241022070252.GA11389@lst.de/ [2]
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Closes: https://lore.kernel.org/linux-nvme/tqcy3sveity7p56v7ywp7ssyviwcb3w4623cnxj3knoobfcanq@yxgt2mjkbkam/ [1]
Fixes: 505363957f ("nvmet: fix nvme status code when namespace is disabled")
Fix-suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-12-27 13:24:00 -08:00
Luis Chamberlain
b579d6fdc3 nvmet: propagate npwg topology
Ensure we propagate npwg to the target as well instead
of assuming it's the same as the logical blocks per physical block.

This ensures that information for devices with large IUs is properly
propagated on the target.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-12-27 13:18:01 -08:00
Leo Stone
4db3d750ac nvmet: Don't overflow subsysnqn
nvmet_root_discovery_nqn_store treats the subsysnqn string like a fixed
size buffer, even though it is dynamically allocated to the size of the
string.

Create a new string with kstrndup instead of using the old buffer.

Reported-by: syzbot+ff4aab278fa7e27e0f9e@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=ff4aab278fa7e27e0f9e
Fixes: 95409e277d ("nvmet: implement unique discovery NQN")
Signed-off-by: Leo Stone <leocstone@gmail.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-12-27 13:14:31 -08:00
Christoph Hellwig
cc76ace465 block: remove BLK_MQ_F_SHOULD_MERGE
BLK_MQ_F_SHOULD_MERGE is set for all tag_sets except those that purely
process passthrough commands (bsg-lib, ufs tmf, various nvme admin
queues) and thus don't even check the flag.  Remove it to simplify the
driver interface.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241219060214.1928848-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-23 08:17:23 -07:00
Daniel Wagner
4425f6492a nvme: replace blk_mq_pci_map_queues with blk_mq_map_hw_queues
Replace all users of blk_mq_pci_map_queues with the more generic
blk_mq_map_hw_queues. This is in preparation for retiring
blk_mq_pci_map_queues.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Daniel Wagner <wagi@kernel.org>
Link: https://lore.kernel.org/r/20241202-refactor-blk-affinity-helpers-v6-6-27211e9c2cd5@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-23 08:17:23 -07:00
Kanchan Joshi
472292cd8c nvme: add support for passing on the application tag
With user integrity buffer, there is a way to specify the app_tag.
Set the corresponding protocol specific flags and send the app_tag down.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20241128112240.8867-9-anuj20.g@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-23 08:17:16 -07:00
Anuj Gupta
2c0487d8b1 block: introduce BIP_CHECK_GUARD/REFTAG/APPTAG bip_flags
This patch introduces BIP_CHECK_GUARD/REFTAG/APPTAG bip_flags which
indicate how the hardware should check the integrity payload.
BIP_CHECK_GUARD/REFTAG are conversions of existing semantics, while
BIP_CHECK_APPTAG is a new flag. The driver can now just rely on block
layer flags, and doesn't need to know the integrity source. Submitter
of PI decides which tags to check. This would also give us a unified
interface for user and kernel generated integrity.

Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20241128112240.8867-8-anuj20.g@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-23 08:17:16 -07:00
Luis Chamberlain
51588b1b77 nvme: use blk_validate_block_size() for max LBA check
The block layer already supports validating block sizes with
blk_validate_block_size(), so we can leverage that as well.

No functional changes.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241218020212.3657139-3-mcgrof@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-18 07:22:30 -07:00
Robert Beckett
ebefac5647 nvme-pci: 512 byte aligned dma pool segment quirk
We initially introduced a quick fix limiting the queue depth to 1 as
experimentation showed that it fixed data corruption on 64GB steamdecks.

Further experimentation revealed corruption only happens when the last
PRP data element aligns to the end of the page boundary. The device
appears to treat this as a PRP chain to a new list instead of the data
element that it actually is. This implementation is in violation of the
spec. Encountering this erratum with the Linux driver requires the host
to request a 128k transfer and coincidentally be handed the last small
pool dma buffer within a page.

The QD1 quirk effectively works around this because the last data PRP
always was at a 248 byte offset from the page start, so it never
appeared at the end of the page, but comes at the expense of throttling
IO and wasting the remainder of the PRP page beyond 256 bytes. Also to
note, the MDTS on these devices is small enough that the "large" prp
pool can hold enough PRP elements to never reach the end, so that pool
is not a problem either.

Introduce a new quirk to ensure the small pool is always aligned such
that the last PRP element can't appear at the end of the page. This comes
at the expense of wasting 256 bytes per small pool page allocated.
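
The alignment argument can be checked with simple arithmetic. The sketch
below is a Python model, not driver code: the 256 byte small pool size
follows from the text above (last PRP entry at offset 248, 8 byte
entries), while the 4096 byte page size and the function names are
assumptions made for illustration.

```python
PAGE_SIZE = 4096        # assumed page size
SMALL_POOL_SIZE = 256   # bytes per small-pool PRP list, per the text above
PRP_ENTRY_SIZE = 8      # the last entry sits at offset 248 within a list


def last_prp_end_offsets(alignment):
    """Page offsets at which the last PRP entry of each list ends.

    Lists are placed at every multiple of `alignment` within one page;
    the last entry of a list ends exactly where the list ends.
    """
    return [
        start + SMALL_POOL_SIZE
        for start in range(0, PAGE_SIZE, alignment)
        if start + SMALL_POOL_SIZE <= PAGE_SIZE
    ]


def can_hit_page_end(alignment):
    """True if some list's last PRP entry ends exactly at the page end."""
    return PAGE_SIZE in last_prp_end_offsets(alignment)
```

With 256 byte alignment the list starting at offset 3840 ends at 4096,
so its last entry touches the page boundary; with 512 byte alignment
the highest list starts at 3584 and ends at 3840, so no list can.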

Link: https://lore.kernel.org/linux-nvme/20241113043151.GA20077@lst.de/T/#u
Fixes: 83bdfcbdbe ("nvme-pci: qdepth 1 quirk")
Cc: Paweł Anikiel <panikiel@google.com>
Signed-off-by: Robert Beckett <bob.beckett@collabora.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-12-11 14:46:23 -08:00
Jens Axboe
d64fd5f777 nvme fixes for Linux 6.13

Merge tag 'nvme-6.13-2024-12-05' of git://git.infradead.org/nvme into block-6.13

Pull NVMe fixes from Keith:

"nvme fixes for Linux 6.13

 - Target fix using incorrect zero buffer (Nilay)
 - Device specific deallocate quirk fixes (Christoph, Keith)
 - Fabrics fix for handling max command target bugs (Maurizio)
 - Cocci fix usage for kzalloc (Yu-Chen)
 - DMA size fix for host memory buffer feature (Christoph)
 - Fabrics queue cleanup fixes (Chunguang)"

* tag 'nvme-6.13-2024-12-05' of git://git.infradead.org/nvme:
  nvme-tcp: simplify nvme_tcp_teardown_io_queues()
  nvme-tcp: no need to quiesce admin_q in nvme_tcp_teardown_io_queues()
  nvme-rdma: unquiesce admin_q before destroy it
  nvme-tcp: fix the memleak while create new ctrl failed
  nvme-pci: don't use dma_alloc_noncontiguous with 0 merge boundary
  nvmet: replace kmalloc + memset with kzalloc for data allocation
  nvme-fabrics: handle zero MAXCMD without closing the connection
  nvme-pci: remove two deallocate zeroes quirks
  nvme: don't apply NVME_QUIRK_DEALLOCATE_ZEROES when DSM is not supported
  nvmet: use kzalloc instead of ZERO_PAGE in nvme_execute_identify_ns_nvm()
2024-12-05 10:14:36 -07:00
Chunguang.xu
b4e12f5728 nvme-tcp: simplify nvme_tcp_teardown_io_queues()
Since nvme_tcp_teardown_io_queues() is the only caller of
nvme_tcp_destroy_admin_queue(), we can merge it into
nvme_tcp_teardown_io_queues() to simplify the code.

Signed-off-by: Chunguang.xu <chunguang.xu@shopee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-12-04 10:15:46 -08:00
Chunguang.xu
fdc5664c69 nvme-tcp: no need to quiesce admin_q in nvme_tcp_teardown_io_queues()
Since we quiesce admin_q in nvme_tcp_teardown_admin_queue(), there is no
need to quiesce it in nvme_tcp_teardown_io_queues(). Keep things simple.

Signed-off-by: Chunguang.xu <chunguang.xu@shopee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-12-04 10:15:46 -08:00
Chunguang.xu
5858b68755 nvme-rdma: unquiesce admin_q before destroy it
The kernel hangs when destroying admin_q after controller creation has
failed, as in the following call trace:

PID: 23644    TASK: ff2d52b40f439fc0  CPU: 2    COMMAND: "nvme"
 #0 [ff61d23de260fb78] __schedule at ffffffff8323bc15
 #1 [ff61d23de260fc08] schedule at ffffffff8323c014
 #2 [ff61d23de260fc28] blk_mq_freeze_queue_wait at ffffffff82a3dba1
 #3 [ff61d23de260fc78] blk_freeze_queue at ffffffff82a4113a
 #4 [ff61d23de260fc90] blk_cleanup_queue at ffffffff82a33006
 #5 [ff61d23de260fcb0] nvme_rdma_destroy_admin_queue at ffffffffc12686ce
 #6 [ff61d23de260fcc8] nvme_rdma_setup_ctrl at ffffffffc1268ced
 #7 [ff61d23de260fd28] nvme_rdma_create_ctrl at ffffffffc126919b
 #8 [ff61d23de260fd68] nvmf_dev_write at ffffffffc024f362
 #9 [ff61d23de260fe38] vfs_write at ffffffff827d5f25
    RIP: 00007fda7891d574  RSP: 00007ffe2ef06958  RFLAGS: 00000202
    RAX: ffffffffffffffda  RBX: 000055e8122a4d90  RCX: 00007fda7891d574
    RDX: 000000000000012b  RSI: 000055e8122a4d90  RDI: 0000000000000004
    RBP: 00007ffe2ef079c0   R8: 000000000000012b   R9: 000055e8122a4d90
    R10: 0000000000000000  R11: 0000000000000202  R12: 0000000000000004
    R13: 000055e8122923c0  R14: 000000000000012b  R15: 00007fda78a54500
    ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b

This is because we quiesced admin_q before cancelling requests, but
forgot to unquiesce it before destroying it. As a result we fail to
drain the pending requests, and hang in blk_mq_freeze_queue_wait()
forever. Reuse nvme_rdma_teardown_admin_queue() to fix this issue and
simplify the code.

Fixes: 958dc1d32c ("nvme-rdma: add clean action for failed reconnection")
Reported-by: Yingfu.zhou <yingfu.zhou@shopee.com>
Signed-off-by: Chunguang.xu <chunguang.xu@shopee.com>
Signed-off-by: Yue.zhao <yue.zhao@shopee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-12-04 10:15:46 -08:00
Chunguang.xu
fec55c29e5 nvme-tcp: fix the memleak while create new ctrl failed
When creating a new ctrl fails, the tagset occupied by admin_q is not
freed. Fix it.

Fixes: fd1418de10 ("nvme-tcp: avoid open-coding nvme_tcp_teardown_admin_queue()")
Signed-off-by: Chunguang.xu <chunguang.xu@shopee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-12-04 10:15:46 -08:00
Christoph Hellwig
ad0cf42e1f nvme-pci: don't use dma_alloc_noncontiguous with 0 merge boundary
Only call into nvme_alloc_host_mem_single, which uses
dma_alloc_noncontiguous, when there is a non-zero dma merge boundary.
Without this we'll call into dma_alloc_noncontiguous for devices using
dma-direct, which can work fine as long as the preferred size is below the
MAX_ORDER of the page allocator, but blows up with a warning if it is
too large.

Fixes: 63a5c7a4b4 ("nvme-pci: use dma_alloc_noncontigous if possible")
Reported-by: Leon Romanovsky <leon@kernel.org>
Reported-by: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Leon Romanovsky <leon@kernel.org>
Tested-by: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-12-04 09:23:11 -08:00
Yu-Chun Lin
41d826c8a9 nvmet: replace kmalloc + memset with kzalloc for data allocation
cocci warnings: (new ones prefixed by >>)
>> drivers/nvme/target/pr.c:831:8-15: WARNING: kzalloc should be used for data, instead of kmalloc/memset

The pattern of using 'kmalloc' followed by 'memset' is replaced with
'kzalloc', which is functionally equivalent to 'kmalloc' + 'memset',
but more efficient. 'kzalloc' automatically zeroes the allocated
memory, making it a faster and more streamlined solution.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202411301434.LEckbcWx-lkp@intel.com/
Reviewed-by: Kuan-Wei Chiu <visitorckw@gmail.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Yu-Chun Lin <eleanor15x@gmail.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-12-04 09:20:00 -08:00
Maurizio Lombardi
88c23a32b8 nvme-fabrics: handle zero MAXCMD without closing the connection
The NVMe specification states that MAXCMD is mandatory
for NVMe-over-Fabrics implementations. However, some NVMe/TCP
and NVMe/FC arrays from major vendors have buggy firmware
that reports MAXCMD as zero in the Identify Controller data structure.

Currently, the implementation closes the connection in such cases,
completely preventing the host from connecting to the target.

Fix the issue by printing a clear error message about the firmware bug
and allowing the connection to proceed. It assumes that the
target supports a MAXCMD value of SQSIZE + 1. If any issues arise,
the user can manually adjust SQSIZE to mitigate them.
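
The fallback described above can be sketched as a small decision
function. This is a hedged Python model, not the fabrics code:
effective_maxcmd is a hypothetical name, and the boolean it returns
merely stands in for "print the firmware bug message".

```python
def effective_maxcmd(maxcmd, sqsize):
    """Return (maxcmd to use, whether a firmware-bug warning is due).

    A reported MAXCMD of zero is treated as a firmware bug: instead of
    refusing the connection, assume the target supports SQSIZE + 1
    outstanding commands and flag that a warning should be logged.
    """
    if maxcmd == 0:
        return sqsize + 1, True
    return maxcmd, False
```

For example, a target reporting MAXCMD 0 with SQSIZE 127 would be
treated as if it had reported 128.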

Fixes: 4999568184 ("nvme-fabrics: check max outstanding commands")
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Reviewed-by: Laurence Oberman <loberman@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-12-04 09:19:54 -08:00
Keith Busch
b0de5456e2 nvme-pci: remove two deallocate zeroes quirks
The quirk was initially used as a signal to set the discard_zeroes_data
queue limit because there were some use cases that relied on that
behavior. The queue limit no longer exists as every user of it has been
converted to use the write zeroes operation instead.

The quirk now means to use a discard command as an alias to a write
zeroes request. Two of the devices previously using the quirk support
the write zeroes command directly, so these don't need or want to use
discard when the desired operation is to write zeroes.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-12-03 10:56:27 -08:00
Peter Zijlstra
cdd30ebb1b module: Convert symbol namespace to string literal
Clean up the existing export namespace code along the same lines of
commit 33def8498f ("treewide: Convert macro and uses of __section(foo)
to __section("foo")") and for the same reason, it is not desired for the
namespace argument to be a macro expansion itself.

Scripted using

  git grep -l -e MODULE_IMPORT_NS -e EXPORT_SYMBOL_NS | while read file;
  do
    awk -i inplace '
      /^#define EXPORT_SYMBOL_NS/ {
        gsub(/__stringify\(ns\)/, "ns");
        print;
        next;
      }
      /^#define MODULE_IMPORT_NS/ {
        gsub(/__stringify\(ns\)/, "ns");
        print;
        next;
      }
      /MODULE_IMPORT_NS/ {
        $0 = gensub(/MODULE_IMPORT_NS\(([^)]*)\)/, "MODULE_IMPORT_NS(\"\\1\")", "g");
      }
      /EXPORT_SYMBOL_NS/ {
        if ($0 ~ /(EXPORT_SYMBOL_NS[^(]*)\(([^,]+),/) {
  	if ($0 !~ /(EXPORT_SYMBOL_NS[^(]*)\(([^,]+), ([^)]+)\)/ &&
  	    $0 !~ /(EXPORT_SYMBOL_NS[^(]*)\(\)/ &&
  	    $0 !~ /^my/) {
  	  getline line;
  	  gsub(/[[:space:]]*\\$/, "");
  	  gsub(/[[:space:]]/, "", line);
  	  $0 = $0 " " line;
  	}

  	$0 = gensub(/(EXPORT_SYMBOL_NS[^(]*)\(([^,]+), ([^)]+)\)/,
  		    "\\1(\\2, \"\\3\")", "g");
        }
      }
      { print }' $file;
  done
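
To illustrate the effect of the script on a single source line, here is
a simplified Python model of its two main substitutions. It is only a
sketch: convert_line is a hypothetical helper, the symbol and namespace
names in the usage note are made up, and the #define handling and the
joining of continuation lines done by the awk script are deliberately
left out.

```python
import re


def convert_line(line):
    """Quote the namespace argument of the NS import/export macros."""
    # MODULE_IMPORT_NS(FOO) -> MODULE_IMPORT_NS("FOO")
    line = re.sub(r'MODULE_IMPORT_NS\(([^)]*)\)',
                  r'MODULE_IMPORT_NS("\1")', line)
    # EXPORT_SYMBOL_NS*(sym, FOO) -> EXPORT_SYMBOL_NS*(sym, "FOO")
    line = re.sub(r'(EXPORT_SYMBOL_NS[^(]*)\(([^,]+), ([^)]+)\)',
                  r'\1(\2, "\3")', line)
    return line
```

Under this model, `EXPORT_SYMBOL_NS_GPL(my_symbol, MY_NS);` becomes
`EXPORT_SYMBOL_NS_GPL(my_symbol, "MY_NS");`, and lines that use neither
macro pass through unchanged.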

Requested-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://mail.google.com/mail/u/2/#inbox/FMfcgzQXKWgMmjdFwwdsfgxzKpVHWPlc
Acked-by: Greg KH <gregkh@linuxfoundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-12-02 11:34:44 -08:00
Christoph Hellwig
58a0c875ce nvme: don't apply NVME_QUIRK_DEALLOCATE_ZEROES when DSM is not supported
Commit 63dfa10043 ("nvme: move NVME_QUIRK_DEALLOCATE_ZEROES out of
nvme_config_discard") started applying the NVME_QUIRK_DEALLOCATE_ZEROES
quirk even when Dataset Management is not supported.  It turns out
that there are versions of these old Intel SSDs that have DSM support
disabled in the firmware, which will now lead to errors every time
a Write Zeroes command is issued.  Fix this by checking for DSM support
before applying the quirk.
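
The check can be expressed as a simple predicate. This is a minimal
Python sketch, not the driver code: deallocate_zeroes_applies is a
hypothetical name, and the bit position used for Dataset Management
support in the controller's ONCS field is an assumption of the sketch.

```python
# Assumed ONCS bit indicating Dataset Management support; illustrative.
NVME_CTRL_ONCS_DSM = 1 << 2


def deallocate_zeroes_applies(has_quirk, oncs):
    """True only if the quirk is set AND the controller supports DSM.

    Without DSM support a deallocate cannot be issued, so the quirk
    must not turn Write Zeroes requests into deallocate commands.
    """
    return has_quirk and bool(oncs & NVME_CTRL_ONCS_DSM)
```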

Reported-by: Saeed Mirzamohammadi <saeed.mirzamohammadi@oracle.com>
Fixes: 63dfa10043 ("nvme: move NVME_QUIRK_DEALLOCATE_ZEROES out of nvme_config_discard")
Tested-by: Saeed Mirzamohammadi <saeed.mirzamohammadi@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-12-02 10:03:19 -08:00
Nilay Shroff
84909f7dec nvmet: use kzalloc instead of ZERO_PAGE in nvme_execute_identify_ns_nvm()
The nvme_execute_identify_ns_nvm function uses ZERO_PAGE as the source
for copying all zeros to the SG list. As ZERO_PAGE does not necessarily
return the virtual address of the zero page, we would first need to
convert the page address to a kernel virtual address and then use that
as the source address for the copy. Using the return value of
ZERO_PAGE(0) directly as the source address fills the target buffer
with random/garbage values and causes undesired side effects.

As other identify implementations use kzalloc for allocating a zero
filled buffer, we decided to use kzalloc for allocating a zero filled
buffer in the nvme_execute_identify_ns_nvm function and then use this
buffer for copying all zeros to the SG list buffers. So essentially, we
now avoid using ZERO_PAGE.

Reported-by: Yi Zhang <yi.zhang@redhat.com>
Fixes: 64a51080ea ("nvmet: implement id ns for nvm command set")
Link: https://lore.kernel.org/all/CAHj4cs8OVyxmn4XTvA=y4uQ3qWpdw-x3M3FSUYr-KpE-nhaFEA@mail.gmail.com/
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-12-02 10:02:47 -08:00
Linus Torvalds
e70140ba0d Get rid of 'remove_new' relic from platform driver struct
The continual trickle of small conversion patches is grating on me, and
is really not helping.  Just get rid of the 'remove_new' member
function, which is just an alias for the plain 'remove', and had a
comment to that effect:

  /*
   * .remove_new() is a relic from a prototype conversion of .remove().
   * New drivers are supposed to implement .remove(). Once all drivers are
   * converted to not use .remove_new any more, it will be dropped.
   */

This was just a tree-wide 'sed' script that replaced '.remove_new' with
'.remove', with some care taken to turn a subsequent tab into two tabs
to make things line up.

I did do some minimal manual whitespace adjustment for places that used
spaces to line things up.

Then I just removed the old (sic) .remove_new member function, and this
is the end result.  No more unnecessary conversion noise.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-12-01 15:12:43 -08:00
Linus Torvalds
cfd47302ac block-6.13-20242901

Merge tag 'block-6.13-20242901' of git://git.kernel.dk/linux

Pull more block updates from Jens Axboe:

 - NVMe pull request via Keith:
      - Use correct srcu list traversal (Breno)
      - Scatter-gather support for metadata (Keith)
      - Fabrics shutdown race condition fix (Nilay)
      - Persistent reservations updates (Guixin)

 - Add the required bits for MD atomic write support for raid0/1/10

 - Correct return value for unknown opcode in ublk

 - Fix deadlock with zone revalidation

 - Fix for the io priority request vs bio cleanups

 - Use the correct unsigned int type for various limit helpers

 - Fix for a race in loop

 - Cleanup blk_rq_prep_clone() to prevent uninit-value warning and make
   it easier for actual humans to read

 - Fix potential UAF when iterating tags

 - A few fixes for bfq-iosched UAF issues

 - Fix for brd discard not decrementing the allocated page count

 - Various little fixes and cleanups

* tag 'block-6.13-20242901' of git://git.kernel.dk/linux: (36 commits)
  brd: decrease the number of allocated pages which discarded
  block, bfq: fix bfqq uaf in bfq_limit_depth()
  block: Don't allow an atomic write be truncated in blkdev_write_iter()
  mq-deadline: don't call req_get_ioprio from the I/O completion handler
  block: Prevent potential deadlock in blk_revalidate_disk_zones()
  block: Remove extra part pointer NULLify in blk_rq_init()
  nvme: tuning pr code by using defined structs and macros
  nvme: introduce change ptpl and iekey definition
  block: return bool from get_disk_ro and bdev_read_only
  block: remove a duplicate definition for bdev_read_only
  block: return bool from blk_rq_aligned
  block: return unsigned int from blk_lim_dma_alignment_and_pad
  block: return unsigned int from queue_dma_alignment
  block: return unsigned int from bdev_io_opt
  block: req->bio is always set in the merge code
  block: don't bother checking the data direction for merges
  block: blk-mq: fix uninit-value in blk_rq_prep_clone and refactor
  Revert "block, bfq: merge bfq_release_process_ref() into bfq_put_cooperator()"
  md/raid10: Atomic write support
  md/raid1: Atomic write support
  ...
2024-11-30 15:47:29 -08:00
Guixin Liu
029cc98dec nvme: tuning pr code by using defined structs and macros
All the modifications are simply to make the code more readable,
and this patch does not include any functional changes.

Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-21 08:57:42 -08:00
Nilay Shroff
e9869c85c8 nvme-fabrics: fix kernel crash while shutting down controller
The nvme keep-alive operation, which executes at a periodic interval,
could potentially sneak in while shutting down a fabric controller.
This may lead to a race between the fabric controller admin queue
destroy code path (invoked while shutting down controller) and hw/hctx
queue dispatcher called from the nvme keep-alive async request queuing
operation. This race could lead to the kernel crash shown below:

Call Trace:
    autoremove_wake_function+0x0/0xbc (unreliable)
    __blk_mq_sched_dispatch_requests+0x114/0x24c
    blk_mq_sched_dispatch_requests+0x44/0x84
    blk_mq_run_hw_queue+0x140/0x220
    nvme_keep_alive_work+0xc8/0x19c [nvme_core]
    process_one_work+0x200/0x4e0
    worker_thread+0x340/0x504
    kthread+0x138/0x140
    start_kernel_thread+0x14/0x18

If an nvme keep-alive request sneaks in while a fabric controller is
shutting down, it is flushed off. nvme_keep_alive_end_io() is then invoked
to handle the end of the keep-alive operation and decrements
admin->q_usage_counter; if this was the last/only request in the admin
queue, admin->q_usage_counter drops to zero. At that point
blk_mq_destroy_queue(), which may be running simultaneously on another
cpu (as this is the controller shutdown code path), makes forward progress
and deletes the admin queue. From this point onward the admin queue
resources must not be accessed. However, the nvme keep-alive thread
running the hw/hctx queue dispatch operation has not yet finished its
work, so it can still access admin queue resources after the admin queue
has been deleted, and that causes the above crash.

The above kernel crash is a regression caused by the changes implemented
in commit a54a93d0e3 ("nvme: move stopping keep-alive into
nvme_uninit_ctrl()"). Ideally we should stop keep-alive before destroying
the admin queue and freeing the admin tagset so that it wouldn't sneak
in during the shutdown operation. However we removed the keep alive stop
operation from the beginning of the controller shutdown code path in commit
a54a93d0e3 ("nvme: move stopping keep-alive into nvme_uninit_ctrl()")
and added it under nvme_uninit_ctrl() which executes very late in the
shutdown code path after the admin queue is destroyed and its tagset is
removed. So this change created the possibility of keep-alive sneaking in
and interfering with the shutdown operation and causing the observed kernel
crash.

To fix the observed crash, we decided to move nvme_stop_keep_alive() from
nvme_uninit_ctrl() to nvme_remove_admin_tag_set(). This change ensures
that we don't make forward progress and delete the admin queue until the
keep-alive operation is finished (if it's in-flight) or cancelled, which
contains the race condition explained above and hence avoids the crash.

Moving nvme_stop_keep_alive() to nvme_remove_admin_tag_set() instead of
adding nvme_stop_keep_alive() to the beginning of the controller shutdown
code path in nvme_stop_ctrl(), as was the case earlier before commit
a54a93d0e3 ("nvme: move stopping keep-alive into nvme_uninit_ctrl()"),
would help save one callsite of nvme_stop_keep_alive().

Fixes: a54a93d0e3 ("nvme: move stopping keep-alive into nvme_uninit_ctrl()")
Link: https://lore.kernel.org/all/1a21f37b-0f2a-4745-8c56-4dc8628d3983@linux.ibm.com/
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-19 07:49:48 -08:00
Nilay Shroff
8448828216 Revert "nvme: make keep-alive synchronous operation"
This reverts commit d06923670b.

It was realized that the fix implemented to contain the race condition
between the keep-alive task and the fabric shutdown code path in commit
d06923670b ("nvme: make keep-alive synchronous operation") is not
optimal. The reason is that keep-alive runs under the workqueue, and making
it synchronous would waste a workqueue context.
Furthermore, we later found that the above race condition is a regression
caused by the changes implemented in commit a54a93d0e3 ("nvme: move
stopping keep-alive into nvme_uninit_ctrl()"). So we decided to revert
commit d06923670b ("nvme: make keep-alive synchronous operation") and
then fix the regression.

Link: https://lore.kernel.org/all/196f4013-3bbf-43ff-98b4-9cb2a96c20c2@grimberg.me/
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-19 07:49:48 -08:00
Linus Torvalds
77a0cfafa9 for-6.13/block-20241118
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmc7S40QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpjHVD/43rDZ8ehs+IAAr6S0RemNX1SRG0mK2UOEb
 kMoNogS7StO/c4JYW3JuzCyLRn5ZsgeWV/muqxwDEWQrmTGrvi+V45KikrZPwm3k
 p0ump33qV9EU2jiR1MKZjtwK2P0CI7/DD3W8ww6IOvKbTT7RcqQcdHznvXArFBtc
 xCuQPpayFG7ZasC+N9VaBwtiUEVgU3Ek9AFT7UVZRWajjHPNalQwaooJWayO0rEG
 KdoW5yG0ryLrgCY2ACSvRLS+2s14EJtb8hgT08WKHTNgd5LxhSKxfsTapamua+7U
 FdVS6Ij0tEkgu2jpvgj7QKO0Uw10Cnep2gj7RHts/LVewvkliS6XcheOzqRS1jWU
 I2EI+UaGOZ11OUiw52VIveEVS5zV/NWhgy5BSP9LYEvXw0BUAHRDYGMem8o5G1V1
 SWqjIM1UWvcQDlAnMF9FDVzojvjVUmYWvcAlFFztO8J0B7SavHR3NcfHwEf57reH
 rNoUbi/9c4/wjJJF33gejiR5pU+ewy/Mk75GrtX3xpEqlztfRbf9/FbPCMEAO1KR
 DF/b3lkUV9i2/BRW6a0SpZ5RDSmSYMnateel6TrPyVSRnpiSSFO8FrbynwUOa17b
 6i49YDFWzzXOrR1YWDg6IEtTrcmBEmvi7F6aoDs020qUnL0hwLn1ZuoIxuiFEpor
 Z0iFF1B/nw==
 =PWTH
 -----END PGP SIGNATURE-----

Merge tag 'for-6.13/block-20241118' of git://git.kernel.dk/linux

Pull block updates from Jens Axboe:

 - NVMe updates via Keith:
      - Use uring_cmd helper (Pavel)
      - Host Memory Buffer allocation enhancements (Christoph)
      - Target persistent reservation support (Guixin)
      - Persistent reservation tracing (Guixin)
      - NVMe 2.1 specification support (Keith)
      - Rotational Meta Support (Matias, Wang, Keith)
      - Volatile cache detection enhancement (Guixin)

 - MD updates via Song:
      - Maintainers update
      - raid5 sync IO fix
      - Enhance handling of faulty and blocked devices
      - raid5-ppl atomic improvement
      - md-bitmap fix

 - Support for manually defining embedded partition tables

 - Zone append fixes and cleanups

 - Stop sending the queued requests in the plug list to the driver
   ->queue_rqs() handle in reverse order.

 - Zoned write plug cleanups

 - Clean up disk stats tracking and add support for disk stats for
   passthrough IO

 - Add preparatory support for file system atomic writes

 - Add lockdep support for queue freezing. Already found a bunch of
   issues, and some fixes for that are in here. More will be coming.

 - Fix race between queue stopping/quiescing and IO queueing

 - ublk recovery improvements

 - Fix ublk mmap for 64k pages

 - Various fixes and cleanups

* tag 'for-6.13/block-20241118' of git://git.kernel.dk/linux: (118 commits)
  MAINTAINERS: Update git tree for mdraid subsystem
  block: make struct rq_list available for !CONFIG_BLOCK
  block/genhd: use seq_put_decimal_ull for diskstats decimal values
  block: don't reorder requests in blk_mq_add_to_batch
  block: don't reorder requests in blk_add_rq_to_plug
  block: add a rq_list type
  block: remove rq_list_move
  virtio_blk: reverse request order in virtio_queue_rqs
  nvme-pci: reverse request order in nvme_queue_rqs
  btrfs: validate queue limits
  block: export blk_validate_limits
  nvmet: add tracing of reservation commands
  nvme: parse reservation commands's action and rtype to string
  nvmet: report ns's vwc not present
  md/raid5: Increase r5conf.cache_name size
  block: remove the ioprio field from struct request
  block: remove the write_hint field from struct request
  nvme: check ns's volatile write cache not present
  nvme: add rotational support
  nvme: use command set independent id ns if available
  ...
2024-11-18 16:50:08 -08:00
Keith Busch
6fad84a4d6 nvme-pci: use sgls for all user requests if possible
If the device supports SGLs, use these for all user requests. This
format encodes the expected transfer length so it can catch short buffer
errors in a user command, whether it occurred accidentally or maliciously.

For controllers that support SGL data mode, this is a viable mitigation
to CVE-2023-6238. For controllers that don't support SGLs, log a warning
in the passthrough path since not having the capability can corrupt
data if the interface is not used correctly.
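
The length check described above can be modelled in user space. The
following is a minimal sketch only, not the driver code: struct sgl_desc,
sgl_total_len() and sgl_check_transfer() are hypothetical names, and the
descriptor is reduced to an address and a byte length. It assumes a
device-side check that compares the bytes covered by the SGL against the
transfer length the command asks for.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative data descriptor: an address plus an explicit byte length. */
struct sgl_desc {
	uint64_t addr;
	uint32_t length;
};

/* Total number of bytes the descriptor list actually covers. */
uint64_t sgl_total_len(const struct sgl_desc *sgl, size_t nr)
{
	uint64_t total = 0;

	for (size_t i = 0; i < nr; i++)
		total += sgl[i].length;
	return total;
}

/*
 * Hypothetical device-side check: the command is rejected when the buffer
 * described by the SGL is shorter than the transfer the command requests.
 * Returns 0 when the lengths are consistent, -1 on a short buffer.
 */
int sgl_check_transfer(const struct sgl_desc *sgl, size_t nr,
		       uint64_t xfer_len)
{
	return sgl_total_len(sgl, nr) >= xfer_len ? 0 : -1;
}
```

A page-list format carries no such length, so in this model a short
buffer would go unnoticed; that is the gap the explicit length closes.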

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-18 09:27:47 -08:00
Keith Busch
6399a0db8c nvme: define the remaining used sgls constants
This provides a little more context when reading the code than hardcoded
magic numbers.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-18 09:17:26 -08:00
Keith Busch
979c6342f9 nvme-pci: add support for sgl metadata
Supporting this mode allows creating and merging multi-segment metadata
requests that wouldn't be possible otherwise. It also allows directly
using user space requests that straddle physically discontiguous pages.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-18 09:17:25 -08:00
Breno Leitao
5dd18f09ce nvme/multipath: Fix RCU list traversal to use SRCU primitive
The code currently uses list_for_each_entry_rcu() while holding an SRCU
lock, triggering false positive warnings with CONFIG_PROVE_RCU=y
enabled:

	drivers/nvme/host/multipath.c:168 RCU-list traversed in non-reader section!!
	drivers/nvme/host/multipath.c:227 RCU-list traversed in non-reader section!!
	drivers/nvme/host/multipath.c:260 RCU-list traversed in non-reader section!!

While the list is properly protected by SRCU lock, the code uses the
wrong list traversal primitive. Replace list_for_each_entry_rcu() with
list_for_each_entry_srcu() to correctly indicate SRCU-based protection
and eliminate the false warning.

Signed-off-by: Breno Leitao <leitao@debian.org>
Fixes: be647e2c76 ("nvme: use srcu for iterating namespace list")
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-18 09:11:24 -08:00
Christoph Hellwig
e70c301fae block: don't reorder requests in blk_add_rq_to_plug
Add requests to the tail of the list instead of the front so that they
are queued up in submission order.

Remove the re-reordering in blk_mq_dispatch_plug_list, virtio_queue_rqs
and nvme_queue_rqs now that the list is ordered as expected.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241113152050.157179-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-13 12:04:58 -07:00
Christoph Hellwig
a3396b9999 block: add a rq_list type
Replace the semi-open coded request list helpers with a proper rq_list
type that mirrors the bio_list and has head and tail pointers.  Besides
better type safety this actually allows inserting at the tail of the
list, which will be useful soon.
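
As a rough user-space illustration of why a tail pointer matters, here is
a minimal sketch of a singly linked list with head and tail pointers. It
is not the kernel's rq_list; struct req, struct req_list and the helper
names are hypothetical stand-ins chosen for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a request; only the link field matters here. */
struct req {
	int id;
	struct req *next;
};

/* List with both head and tail pointers, as the commit describes. */
struct req_list {
	struct req *head;
	struct req *tail;
};

void req_list_init(struct req_list *l)
{
	l->head = NULL;
	l->tail = NULL;
}

/* O(1) append; without a tail pointer this would need a full list walk. */
void req_list_add_tail(struct req_list *l, struct req *rq)
{
	rq->next = NULL;
	if (l->tail)
		l->tail->next = rq;
	else
		l->head = rq;
	l->tail = rq;
}

/* Remove and return the first entry, or NULL if the list is empty. */
struct req *req_list_pop(struct req_list *l)
{
	struct req *rq = l->head;

	if (rq) {
		l->head = rq->next;
		if (!l->head)
			l->tail = NULL;
		rq->next = NULL;
	}
	return rq;
}
```

Appending at the tail and popping from the head yields entries in the
order they were added, which is the submission-order property the
follow-up changes rely on.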

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241113152050.157179-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-13 12:04:58 -07:00
Christoph Hellwig
beadf00885 nvme-pci: reverse request order in nvme_queue_rqs
blk_mq_flush_plug_list submits requests in the reverse order that they
were submitted, which leads to a rather suboptimal I/O pattern especially
in rotational devices.  Fix this by rewriting nvme_queue_rqs so that it
always pops the requests from the passed in request list, and then adds
them to the head of a local submit list.  This actually simplifies the
code a bit as it removes the complicated list splicing, at the cost of
extra updates of the rq_next pointer.  As that should be cache hot
anyway it should be an easy price to pay.
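
The pop-and-push-to-head approach can be sketched in user space as
follows. This is an illustration under assumptions, not the
nvme_queue_rqs() code: struct req and the helper names are hypothetical,
and the list is modelled as a bare head pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a request on a head-pointer-only list. */
struct req {
	int id;
	struct req *next;
};

/* Remove and return the first entry of *list, or NULL if it is empty. */
struct req *req_pop(struct req **list)
{
	struct req *rq = *list;

	if (rq) {
		*list = rq->next;
		rq->next = NULL;
	}
	return rq;
}

/* Insert rq at the front of *list. */
void req_push_head(struct req **list, struct req *rq)
{
	rq->next = *list;
	*list = rq;
}

/*
 * Drain 'in' by popping each entry and pushing it onto the head of a
 * local list. The local list ends up in the reverse of the input order,
 * so an input that arrived reversed comes out in submission order.
 */
struct req *req_reverse_by_pop(struct req **in)
{
	struct req *out = NULL;
	struct req *rq;

	while ((rq = req_pop(in)) != NULL)
		req_push_head(&out, rq);
	return out;
}
```

Each entry costs one extra next-pointer update, which matches the
trade-off the message mentions.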

Fixes: d62cbcf62f ("nvme: add support for mq_ops->queue_rqs()")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241113152050.157179-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-13 11:40:33 -07:00
Jens Axboe
15da3dd3f5 nvme updates for Linux 6.13
- Use uring_cmd helper (Pavel)
  - Host Memory Buffer allocation enhancements (Christoph)
  - Target persistent reservation support (Guixin)
  - Persistent reservation tracing (Guixin)
  - NVMe 2.1 specification support (Keith)
  - Rotational Meta Support (Matias, Wang, Keith)
  - Volatile cache detection enhancement (Guixin)
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE3Fbyvv+648XNRdHTPe3zGtjzRgkFAmc0264ACgkQPe3zGtjz
 RgmhvhAAzEVbniR/OBlJbqZ+rdwEHOj181XJIWUD72yZUVl2akikYq88JpiMCfcS
 pwdVAdFDEfvMjyIGpWXqE/G2NIYzb2qdGC0D3q5e/CgH/mxJ+5zJKKjj+6pqtWBt
 BJnoJ0YZcTnLXQWOrY6NxUOVn2LxxtvrKArCbh467GnDxWF7WJbwv+wkbPZZ78YR
 6IYRQU0La/uAvdpZ+ijHEOdieHtN3uJtu1AxxCFOK9gMpbHq92tm4Ya6bF09VDbG
 F+Ywhuu/gZkglTL5jEUvtt1Jd4VlhtGzBC2BhCFeSI54IwjhV3UFCajQeBh0zT/V
 Ca2VkFMAO1/Z3gRuK1QtEYkAf6Bwv591zpsoUEYvvlolXDL2aRKT5Jggwe/SMYYI
 ZA/3dSW/gRAV+bny2htVMK2n+hcn+VXhFaJlpZ7kSySK0b89wMlQ96BupTnmfyMD
 PdgVVaWVQ4onQcEu7/ItD9uFVe9tvTCH12MXRqlgJx4iM0w4ucpBh8QdOdHxMorD
 0bVCE4oLSbw6XJrfKmlytHJs4ZMdmNEoXzaJuBMsPDAlCvZiaihzTusIY7dWq4xi
 xNt6mQOOriONNpYRlaBrBGsinmQx6Ysz8q60RT9mLmGAwwI/nY9r1oxAd4ZknhKv
 c9clP0F20uO3se8vKUMbXOeGe8ZETD+S94hcGtHp9uxF8w6DfnQ=
 =Vi8L
 -----END PGP SIGNATURE-----

Merge tag 'nvme-6.13-2024-11-13' of git://git.infradead.org/nvme into for-6.13/block

Pull NVMe updates from Keith:

"nvme updates for Linux 6.13

 - Use uring_cmd helper (Pavel)
 - Host Memory Buffer allocation enhancements (Christoph)
 - Target persistent reservation support (Guixin)
 - Persistent reservation tracing (Guixin)
 - NVMe 2.1 specification support (Keith)
 - Rotational Meta Support (Matias, Wang, Keith)
 - Volatile cache detection enhancement (Guixin)"

* tag 'nvme-6.13-2024-11-13' of git://git.infradead.org/nvme: (22 commits)
  nvmet: add tracing of reservation commands
  nvme: parse reservation commands's action and rtype to string
  nvmet: report ns's vwc not present
  nvme: check ns's volatile write cache not present
  nvme: add rotational support
  nvme: use command set independent id ns if available
  nvmet: support for csi identify ns
  nvmet: implement rotational media information log
  nvmet: implement endurance groups
  nvmet: declare 2.1 version compliance
  nvmet: implement crto property
  nvmet: implement supported features log
  nvmet: implement supported log pages
  nvmet: implement active command set ns list
  nvmet: implement id ns for nvm command set
  nvmet: support reservation feature
  nvme: add reservation command's defines
  nvme-core: remove repeated wq flags
  nvmet: make nvmet_wq visible in sysfs
  nvme-pci: use dma_alloc_noncontigous if possible
  ...
2024-11-13 10:43:11 -07:00
Guixin Liu
50bee3857d nvmet: add tracing of reservation commands
Add tracing of reservation commands, including register, acquire,
release and report, and also parse the action and rtype to string
to make the trace log more human-readable.

Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-13 08:51:24 -08:00
Guixin Liu
8a502b5c16 nvme: parse reservation commands's action and rtype to string
Parse the reservation commands' action (including rrega, racqa and rrela)
and rtype to string to make the trace log more human-readable.

Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-13 08:51:24 -08:00
Guixin Liu
609e60a3a9 nvmet: report ns's vwc not present
Currently, we report that the controller has vwc even though the ns may
not have vwc. Report the ns's vwc as not present when buffered_io is not
used or the backing device doesn't have vwc.

Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-13 08:51:17 -08:00
Guixin Liu
8a825d22a7 nvme: check ns's volatile write cache not present
When the VWC of a namespace does not exist, the BLK_FEAT_WRITE_CACHE
flag should not be set when registering the block device, regardless
of whether the controller supports VWC.
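
The rule above amounts to a two-input decision, sketched here in user
space. The bit positions and the names CTRL_VWC_PRESENT,
NS_VWC_NOT_PRESENT and ns_has_write_cache() are illustrative assumptions
for this example rather than the driver's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit positions; the real ones live in the NVMe headers. */
#define CTRL_VWC_PRESENT	(1u << 0)
#define NS_VWC_NOT_PRESENT	(1u << 5)

/*
 * The write-cache feature is advertised only when the controller has a
 * volatile write cache AND the namespace does not say it lacks one.
 */
bool ns_has_write_cache(uint8_t ctrl_vwc, uint8_t ns_feat)
{
	if (!(ctrl_vwc & CTRL_VWC_PRESENT))
		return false;
	return !(ns_feat & NS_VWC_NOT_PRESENT);
}
```

The namespace-level bit can only veto the feature; it never enables a
write cache that the controller does not report.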

Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-11 09:49:50 -08:00
Wang Yugui
1d81143885 nvme: add rotational support
Rotational devices, such as hard-drives, can be detected using
the rotational bit in the namespace independent identify namespace
data structure. Make the bit visible to the block layer through the
rotational queue setting.

Signed-off-by: Wang Yugui <wangyugui@e16-tech.com>
Reviewed-by: Matias Bjørling <matias.bjorling@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-11 09:49:50 -08:00
Matias Bjørling
ee9f36db1f nvme: use command set independent id ns if available
The NVMe 2.0 specification adds an independent identify namespace
data structure that contains generic attributes that apply to all
namespace types. Some attributes carry over from the NVM command set
identify namespace data structure, and others are new.

Currently, the data structure is only considered when CRIMS is enabled or
when the namespace type is key-value.

However, the independent namespace data structure is mandatory for
devices that implement features from the 2.0+ specification. Therefore,
we can check this data structure first. If unavailable, retrieve the
generic attributes from the NVM command set identify namespace data
structure.

Signed-off-by: Matias Bjørling <matias.bjorling@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-11 09:49:50 -08:00
Keith Busch
e2758c76a0 nvmet: support for csi identify ns
Implements reporting the I/O Command Set Independent Identify Namespace
command.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-11 09:49:49 -08:00
Keith Busch
5fd075cdaf nvmet: implement rotational media information log
Most of the information is stubbed. Supporting these commands is a
requirement for supporting rotational media.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-11 09:49:49 -08:00
Keith Busch
266b652c65 nvmet: implement endurance groups
Most of the returned information is just stubbed data. The target must
support these in order to report rotational media. Since this driver
doesn't know any better, each namespace is its own endurance group with
the engid value matching the nsid.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-11 09:49:49 -08:00
Keith Busch
81ee2f2811 nvmet: declare 2.1 version compliance
The target driver implements all the mandatory logs, identifications,
features, and properties up to nvme specification 2.1.

Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Matias Bjørling <matias.bjorling@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-11 09:49:49 -08:00
Keith Busch
1e058089d2 nvmet: implement crto property
This property is required for nvme 2.1. The target only supports ready
with media, so this is just the same value as CAP.TO.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Matias Bjørling <matias.bjorling@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-11 09:49:49 -08:00
Keith Busch
e973c91727 nvmet: implement supported features log
This log is required for nvme 2.1.

Reviewed-by: Matias Bjørling <matias.bjorling@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-11 09:49:49 -08:00
Keith Busch
83acb24e6d nvmet: implement supported log pages
This log is required for nvme 2.1.

Reviewed-by: Matias Bjørling <matias.bjorling@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-11 09:49:48 -08:00
Keith Busch
61c9967cd6 nvmet: implement active command set ns list
This is required for nvme 2.1 for targets that support multiple command
sets. We support NVM and ZNS, so are required to support this
identification.

Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Matias Bjørling <matias.bjorling@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-11 09:49:48 -08:00
Keith Busch
64a51080ea nvmet: implement id ns for nvm command set
We don't report anything here, but it's a mandatory identification for
nvme 2.1.

Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Matias Bjørling <matias.bjorling@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-11 09:49:48 -08:00
Guixin Liu
5a47c2080a nvmet: support reservation feature
This patch implements the reservation feature, including:
  1. reservation register(register, unregister and replace).
  2. reservation acquire(acquire, preempt, preempt and abort).
  3. reservation release(release and clear).
  4. reservation report.
  5. set feature and get feature of reservation notify mask.
  6. get log page of reservation event.

Not supported:
  1. persistent reservation through power loss.

Test cases:
  Use nvme-cli and fio to test all implemented sub features:
  1. use nvme resv-register to register the host as a registrant, or
     to unregister or replace a key.
  2. use nvme resv-acquire to set host to the holder, and use fio
     to send read and write io in all reservation type. And also
     test preempt and "preempt and abort".
  3. use nvme resv-report to show all registrants and reservation
     status.
  4. use nvme resv-release to release all registrants.
  5. use nvme get-log to get events generated by the preceding
     operations.

In addition, make reservation configurable: one can set a ns to
support reservation before enabling the ns. The default of resv_enable
is false.

Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Reviewed-by: Dmitry Bogdanov <d.bogdanov@yadro.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-11 09:49:48 -08:00
Christoph Hellwig
0b4ace9da5 nvme-multipath: don't bother clearing max_hw_zone_append_sectors
The limits stacking now properly zeroes it if at least one of the
underlying limits clears it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20241108154657.845768-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-11 09:20:36 -07:00
Christoph Hellwig
559218d43e block: pre-calculate max_zone_append_sectors
max_zone_append_sectors differs from all other queue limits in that the
final value used is not stored in the queue_limits but needs to be
obtained using queue_limits_max_zone_append_sectors helper.  This not
only adds (tiny) extra overhead to the I/O path, but also can be easily
forgotten in file system code.

Add a new max_hw_zone_append_sectors value to queue_limits which is
set by the driver, and calculate max_zone_append_sectors from that and
the other inputs in blk_validate_zoned_limits, similar to how
max_sectors is calculated to fix this.
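
A minimal user-space sketch of the idea, computing the final value once
at validation time instead of in a helper on every use. The exact inputs
are an assumption here: this example takes the driver-provided limit,
bounded by the smaller of chunk_sectors and max_hw_sectors, and the
struct and function names are hypothetical.

```c
#include <assert.h>

/* Hypothetical subset of the queue limits relevant to zone append. */
struct zone_limits {
	unsigned int max_hw_sectors;
	unsigned int chunk_sectors;
	unsigned int max_hw_zone_append_sectors;	/* set by the driver */
	unsigned int max_zone_append_sectors;		/* derived value */
};

unsigned int min_uint(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/* Minimum of two values where zero means "no limit given". */
unsigned int min_not_zero_uint(unsigned int a, unsigned int b)
{
	if (a == 0)
		return b;
	if (b == 0)
		return a;
	return min_uint(a, b);
}

/* Derive the final limit once, when the limits are validated. */
void precalc_zone_append(struct zone_limits *lim)
{
	lim->max_zone_append_sectors =
		min_not_zero_uint(lim->max_hw_zone_append_sectors,
				  min_uint(lim->chunk_sectors,
					   lim->max_hw_sectors));
}
```

After validation, readers use max_zone_append_sectors directly, so
there is no helper call to forget.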

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241104073955.112324-3-hch@lst.de
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20241108154657.845768-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-11 09:20:36 -07:00
Linus Torvalds
a58f4dd952 block-6.12-20241108
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmcuxukQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpqzrEACLE/7MG7YmKL5gz8fPRXN2OFFwSJNCwuY+
 q3Y3aD3YaBn37EgwG6NsHMGfNLkyrbhEyWdfJjHbgA/mtYIRCMRQ2Yp7dINLhjq7
 giW/NQEeWzr/QJaD18GDFy5NARGL7a36/pYvU4RdPfGR0o0N8++OZ7JJ/YY3f0CO
 4Oo3uglzDdtHJnVOBM/ciknXNPAwY7oRqkBa5heuexF2YR7m1u5YOT0voEdLR6cv
 2DCgHDyDn5I4sz708ALvHEpXFHPrfK8+OcvSK34+OEFlphT1jB1udMG/jzZlj5ZR
 v5YHPITHpTUddyDI4kP00LIKavqiq5lRfFQyqh6Q8f+Xp66+15apXYitDZcfPUic
 Zq6wabOhEhJX6v18AafM4Wq9W+sTAVj3oxg1GN/edBpOFMi4fKkQHsaqfL8fiBve
 eR1kk5gLDs7DS2PYvXX5pPNUdlxyraWlXsExr19tLgeO7YdZzwZ+takWVqSnzXep
 apxI7xnenewsnJ9t1f5ttamrHxXexj63Mnbc0eJCFPJaWPktqxQbe9vKhi/20evp
 ZI36PLDOfCcchKorzuayj23RBxB8HRSh9f/JRhgGtoLVguzvzI6KTXQ/h01cYzNl
 mTot/mgNLyVO0Ij1Wr54SCFW/hP3okz8G5NBOqGv/h2n0UlW8GUD+XTbVT3hiQ9X
 nJyHwGC26Q==
 =v8nG
 -----END PGP SIGNATURE-----

Merge tag 'block-6.12-20241108' of git://git.kernel.dk/linux

Pull block fix from Jens Axboe:
 "Single fix for an issue triggered with PROVE_RCU=y, with nvme using
  the wrong iterators for an SRCU protected list"

* tag 'block-6.12-20241108' of git://git.kernel.dk/linux:
  nvme/host: Fix RCU list traversal to use SRCU primitive
2024-11-09 12:55:32 -08:00
Jens Axboe
ab9bc81c1c Revert "block: pre-calculate max_zone_append_sectors"
This causes an issue on, at least, nvme-mpath where my boot fails with:

WARNING: CPU: 354 PID: 2729 at block/blk-settings.c:75 blk_validate_limits+0x356/0x380
Modules linked in: tg3(+) nvme usbcore scsi_mod ptp i2c_piix4 libphy nvme_core crc32c_intel scsi_common usb_common pps_core i2c_smbus
CPU: 354 UID: 0 PID: 2729 Comm: kworker/u2061:1 Not tainted 6.12.0-rc6+ #181
Hardware name: Dell Inc. PowerEdge R7625/06444F, BIOS 1.8.3 04/02/2024
Workqueue: async async_run_entry_fn
RIP: 0010:blk_validate_limits+0x356/0x380
Code: f6 47 01 04 75 28 83 bf 94 00 00 00 00 75 39 83 bf 98 00 00 00 00 75 34 83 7f 68 00 75 32 31 c0 83 7f 5c 00 0f 84 9b fd ff ff <0f> 0b eb 13 0f 0b eb 0f 48 c7 c0 74 12 58 92 48 89 c7 e8 13 76 46
RSP: 0018:ffffa8a1dfb93b30 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff9232829c8388 RCX: 0000000000000088
RDX: 0000000000000080 RSI: 0000000000000200 RDI: ffffa8a1dfb93c38
RBP: 000000000000000c R08: 00000000ffffffff R09: 000000000000ffff
R10: 0000000000000000 R11: 0000000000000000 R12: ffff9232829b9000
R13: ffff9232829b9010 R14: ffffa8a1dfb93c38 R15: ffffa8a1dfb93c38
FS:  0000000000000000(0000) GS:ffff923867c80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055c1b92480a8 CR3: 0000002484ff0002 CR4: 0000000000370ef0
Call Trace:
 <TASK>
 ? __warn+0xca/0x1a0
 ? blk_validate_limits+0x356/0x380
 ? report_bug+0x11a/0x1a0
 ? handle_bug+0x5e/0x90
 ? exc_invalid_op+0x16/0x40
 ? asm_exc_invalid_op+0x16/0x20
 ? blk_validate_limits+0x356/0x380
 blk_alloc_queue+0x7a/0x250
 __blk_alloc_disk+0x39/0x80
 nvme_mpath_alloc_disk+0x13d/0x1b0 [nvme_core]
 nvme_scan_ns+0xcc7/0x1010 [nvme_core]
 async_run_entry_fn+0x27/0x120
 process_scheduled_works+0x1a0/0x360
 worker_thread+0x2bc/0x350
 ? pr_cont_work+0x1b0/0x1b0
 kthread+0x111/0x120
 ? kthread_unuse_mm+0x90/0x90
 ret_from_fork+0x30/0x40
 ? kthread_unuse_mm+0x90/0x90
 ret_from_fork_asm+0x11/0x20
 </TASK>
---[ end trace 0000000000000000 ]---

presumably due to max_zone_append_sectors not being cleared to zero,
resulting in blk_validate_zoned_limits() complaining and failing.

This reverts commit 2a8f6153e1.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-07 05:45:34 -07:00
Chaitanya Kulkarni
43d5d3b417 nvme-core: remove repeated wq flags
In nvme_core_init(), nvme_wq, nvme_reset_wq and nvme_delete_wq share the
same flags: WQ_UNBOUND | WQ_MEM_RECLAIM | WQ_SYSFS.

Instead of repeating these flags in each call, use a common variable.

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-05 08:43:24 -08:00
Guixin Liu
c74649b6e4 nvmet: make nvmet_wq visible in sysfs
In some complex scenarios, we deploy multiple tasks on a single machine
(hybrid deployment), such as Docker containers for function computation
(background processing), real-time tasks, monitoring, event handling,
and management, along with an NVMe target server.

Each of these components is restricted to its own CPU cores to prevent
mutual interference and ensure strict isolation. To achieve this level
of isolation for nvmet_wq we need to use sysfs tunables such as
cpumask that are currently not accessible.

Add the WQ_SYSFS flag to alloc_workqueue() when creating nvmet_wq so
that workqueue tunables are exported to userspace via sysfs.

With this patch:

  nvme (nvme-6.13) # ls /sys/devices/virtual/workqueue/nvmet-wq/
  affinity_scope  affinity_strict  cpumask  max_active  nice per_cpu
  power  subsystem  uevent

Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-05 08:36:18 -08:00
Christoph Hellwig
63a5c7a4b4 nvme-pci: use dma_alloc_noncontigous if possible
Use dma_alloc_noncontiguous to allocate a single IOVA-contiguous segment
when backed by an IOMMU.  This allows easily using bigger segments and
avoids running into segment limits if we can avoid it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-05 07:54:34 -08:00
Christoph Hellwig
3c2fb1ca80 nvme-pci: fix freeing of the HMB descriptor table
The HMB descriptor table is sized to the maximum number of descriptors
that could be used for a given device, but __nvme_alloc_host_mem could
break out of the loop earlier on memory allocation failure and end up
using fewer descriptors than planned for, which leads to an incorrect
size passed to dma_free_coherent.

In practice this was not showing up because the number of descriptors
tends to be low and the dma coherent allocator always allocates and
frees at least a page.
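
The mismatch described above can be sketched in userspace C. This is only
an illustration under assumed names (demo_hmb, demo_desc, demo_hmb_alloc and
demo_hmb_free are hypothetical, and calloc/free stand in for the DMA
coherent allocator); it is not the driver code. The point is that the
size used when freeing is the size recorded at allocation time, not a size
recomputed from the number of descriptors that ended up in use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical descriptor; the real layout is defined by the NVMe spec. */
struct demo_desc {
	unsigned long long addr;
	unsigned int size;
};

struct demo_hmb {
	struct demo_desc *descs;
	size_t descs_size;	/* bytes actually allocated for the table */
	unsigned int nr_descs;	/* descriptors actually filled in */
};

/*
 * Size the table for max_entries, but fill in only the first `usable`
 * entries, mimicking a loop that stops early when a chunk allocation fails.
 */
static int demo_hmb_alloc(struct demo_hmb *hmb, unsigned int max_entries,
			  unsigned int usable)
{
	assert(usable <= max_entries);

	hmb->descs_size = max_entries * sizeof(*hmb->descs);
	hmb->descs = calloc(1, hmb->descs_size);
	if (!hmb->descs)
		return -1;
	hmb->nr_descs = usable;
	return 0;
}

/* Free the table and report the size used: the recorded allocation size. */
static size_t demo_hmb_free(struct demo_hmb *hmb)
{
	size_t freed = hmb->descs_size;	/* not nr_descs * sizeof(desc) */

	free(hmb->descs);
	hmb->descs = NULL;
	hmb->descs_size = 0;
	hmb->nr_descs = 0;
	return freed;
}
```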

Fixes: 87ad72a59a ("nvme-pci: implement host memory buffer support")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-05 07:54:34 -08:00
Breno Leitao
6d1c69945c nvme/host: Fix RCU list traversal to use SRCU primitive
The code currently uses list_for_each_entry_rcu() while holding an SRCU
lock, triggering false positive warnings with CONFIG_PROVE_RCU=y
enabled:

  drivers/nvme/host/core.c:3770 RCU-list traversed in non-reader section!!

While the list is properly protected by SRCU lock, the code uses the wrong
list traversal primitive. Replace list_for_each_entry_rcu() with
list_for_each_entry_srcu() to correctly indicate SRCU-based protection
and eliminate the false warning.

Fixes: be647e2c76 ("nvme: use srcu for iterating namespace list")
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-04 13:25:41 -08:00
Christoph Hellwig
2a8f6153e1 block: pre-calculate max_zone_append_sectors
max_zone_append_sectors differs from all other queue limits in that the
final value used is not stored in the queue_limits but needs to be
obtained using queue_limits_max_zone_append_sectors helper.  This not
only adds (tiny) extra overhead to the I/O path, but also can be easily
forgotten in file system code.

Add a new max_hw_zone_append_sectors value to queue_limits which is
set by the driver, and calculate max_zone_append_sectors from that and
the other inputs in blk_validate_zoned_limits, similar to how
max_sectors is calculated to fix this.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241104073955.112324-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-04 10:34:07 -07:00
Linus Torvalds
f4a1e8e369 block-6.12-20241101
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmclGQUQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpkKsEADWz2dU1Edfbg5lez9Kf2SZwLKUbh9VVBQY
 kuIeGI/lQ1eeIdIgEJtCmx6ePlbk9D5WYu7lJvXfODAh5gUBLMpI4t6p4L1TkU2o
 yZ9UDoy5SK8o0UAOZhrHw4fE70EdI31Olm2mz3ToIevmpKOabNr/joVEfpEYaoPV
 t9FIGH1DvUZQSWUgw6G2lxWA7gTwfzvAfOegXF0VlFyhCEXjLXwJSUq+lWaShL33
 oDxETRYGaoBVfNVLIaZ6LFVUrvmgzFQ1Q6q1iY+DAUgMh9NXV2mzQ+/17HZjWaeU
 jsEeeOMnvtxhvQQpRyEvGA93Fh+gDkx/wljB4EmK2oqETs+oDVzts+L+YJk2ef7K
 n71dhuj04BRfduu6DFNg6ufAcMJOTgd0odgN9h9hPa0z3y0sx9hOSW6XZfepABl1
 +XDDyI4p/lVTMH/vYlS2ay2RUo7KrxfX26qSYbz6UItzo6dMl/3YjtKbzYpP0mZM
 4+Yu5YxDXirbnxpk9uhzOA4CdwiXXWxLIVIX1KeUWkJZ5wzmH2rFqmpzH8OL91uH
 J4SIMLgMCm4e45nDOLH5ZSLIc1MO2df0sJ1gZZD6MoT0xbxULyIQuj19sBszeHwH
 YgV4sQlPead4NT2zoSDeXSpeIWmv5dd15To7wVpeInyj/DHyXSHzycHYSehgdmv1
 FBQ+LMffyA==
 =e3uK
 -----END PGP SIGNATURE-----

Merge tag 'block-6.12-20241101' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:

 - Fixup for a recent blk_rq_map_user_bvec() patch

 - NVMe pull request via Keith:
     - Spec compliant identification fix (Keith)
     - Module parameter to enable backward compatibility on unusual
       namespace formats (Keith)
     - Target double free fix when using keys (Vitaliy)
     - Passthrough command error handling fix (Keith)

* tag 'block-6.12-20241101' of git://git.kernel.dk/linux:
  nvme: re-fix error-handling for io_uring nvme-passthrough
  nvmet-auth: assign dh_key to NULL after kfree_sensitive
  nvme: module parameter to disable pi with offsets
  block: fix queue limits checks in blk_rq_map_user_bvec for real
  nvme: enhance cns version checking
2024-11-01 13:41:55 -10:00
Christoph Hellwig
f187b9bf1a block: remove bio_add_zone_append_page
This is only used by the nvmet zns passthrough code, which can trivially
just use bio_add_pc_page and do the sanity check for the max zone append
limit itself.

All future zoned file systems should follow the btrfs lead and let the
upper layers fill up bios unlimited by hardware constraints and split
them to the limits in the I/O submission handler.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20241030051859.280923-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-31 10:54:25 -06:00
Keith Busch
5eed4fb274 nvme: re-fix error-handling for io_uring nvme-passthrough
This was previously fixed with commit 1147dd0503
("nvme: fix error-handling for io_uring nvme-passthrough"), but the
change was mistakenly undone in a later commit.

Fixes: d6aacee925 ("nvme: use bio_integrity_map_user")
Cc: stable@vger.kernel.org
Reported-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-10-30 07:19:18 -07:00
Vitaliy Shevtsov
d2f551b1f7 nvmet-auth: assign dh_key to NULL after kfree_sensitive
ctrl->dh_key might be used across multiple calls to nvmet_setup_dhgroup()
for the same controller. So it's better to nullify it after release on
error path in order to avoid double free later in nvmet_destroy_auth().
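
A minimal userspace sketch of the pattern, assuming hypothetical names
(demo_ctrl, demo_setup_dhgroup, demo_destroy_auth) and plain malloc/free in
place of the kernel helpers; it is not the nvmet code itself. Clearing the
pointer on the error path makes the later teardown free a no-op instead of
a second free of the same buffer.

```c
#include <assert.h>
#include <stdlib.h>

#define DEMO_KEY_LEN 32

/* Hypothetical stand-in for the controller state holding the DH key. */
struct demo_ctrl {
	unsigned char *dh_key;
};

/* May be called several times for the same controller. */
static int demo_setup_dhgroup(struct demo_ctrl *ctrl, int simulate_failure)
{
	/* Drop any key left over from an earlier call. */
	free(ctrl->dh_key);
	ctrl->dh_key = NULL;

	ctrl->dh_key = malloc(DEMO_KEY_LEN);
	if (!ctrl->dh_key)
		return -1;

	if (simulate_failure) {
		free(ctrl->dh_key);
		ctrl->dh_key = NULL;	/* without this, teardown frees it again */
		return -1;
	}
	return 0;
}

/* Teardown: free(NULL) is a no-op, so a cleared pointer is always safe. */
static void demo_destroy_auth(struct demo_ctrl *ctrl)
{
	free(ctrl->dh_key);
	ctrl->dh_key = NULL;
}
```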

Found by Linux Verification Center (linuxtesting.org) with Svace.

Fixes: 7a277c37d3 ("nvmet-auth: Diffie-Hellman key exchange support")
Cc: stable@vger.kernel.org
Signed-off-by: Vitaliy Shevtsov <v.shevtsov@maxima.ru>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-10-30 07:19:18 -07:00
Keith Busch
42ab37eaad nvme: module parameter to disable pi with offsets
A recent commit enables integrity checks for formats the previous kernel
versions registered with the "nop" integrity profile. This means
namespaces using that format become unreadable when upgrading the kernel
past that commit.

Introduce a module parameter to restore the "nop" integrity profile so
that storage can be readable once again. This could be a boot device, so
the setting needs to happen at module load time.

Fixes: 921e81db52 ("nvme: allow integrity when PI is not in first bytes")
Reported-by: David Wei <dw@davidwei.uk>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-10-30 07:19:18 -07:00
Keith Busch
133008e84b blk-integrity: remove seed for user mapped buffers
The seed is only used for kernel generation and verification. That
doesn't happen for user buffers, so passing the seed around doesn't
accomplish anything.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20241016201309.1090320-1-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-30 07:49:32 -06:00
Ming Lei
6b6f6c41c8 nvme: core: switch to non_owner variant of start_freeze/unfreeze queue
nvme_start_freeze() and nvme_unfreeze() may be called from same context,
so switch them to call non_owner variant of start_freeze/unfreeze queue.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241025003722.3630252-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-26 07:14:53 -06:00
Keith Busch
f54f0d0e2b nvme: enhance cns version checking
The number of CNS bits in the command is specific to the nvme spec
version compliance. The existing check is not sufficient for possible
CNS values the driver uses that may create confusion between host and
device, so enhance the check to consider the version and desired CNS
value.

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-10-22 09:11:03 -07:00
Pavel Begunkov
5e52f71f85 nvme: use helpers to access io_uring cmd space
Command implementations shouldn't be directly looking into io_uring_cmd
to carve free space. Use an io_uring helper, which will also do build
time size sanitisation.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-10-22 07:28:36 -07:00
Linus Torvalds
f8eacd8ad7 block-6.12-20241018
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmcSk4AQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpuuXD/0UERdP+djJNoXBW5Mv7U5a4rJ7ZgfPL7ku
 z3ZfdnNYGitZhYkVjNQ60TLzXRQyUaIIxMVBWzkb59I6ixmuQbzm/lC55B6s/FIR
 bfT3afe1WRgLCaFbStu91qRs/44Mq4yK6wXcIU7LwutRT/5cqZwelqRZLK7DMFln
 zlGX4zNrCMRUDTr6PLa6CvyY4dmQSL17Ib1ypcKXjGs5YjDntzSrIsKVT1Wayans
 WroGGPG6W7r2c2kn8pe4uPIjZVfMUF2vrdIs0KEYaAQOC7ppEucCgDZMEWRs7kdH
 63hheudJjVSwLF/qYnXNHe/Bz12QCZohPp6UsqRpC8o96Ralgo6Q+FxkXsVelMXW
 JKhtDqYGBDHOQrjrEWN1rnYw/DauEQAgvOtdVfEx2IBzPsG07cB8yv8MNA90H9QH
 KStI7h9qnBEMMNcXX8prOymCHNWAeuF4mbitVrRfSfEVm/0BbQ19qoyGrvwNFgEf
 6T+4Xj/P+FsiLVe8vsgBZDaxEEU5Ifd/rki/QFVk/2z72BBZxmdf2nm51SOM28V7
 HGMHwJI3H8rdmPXvt5Q/ve6GWNOYLO5PSAJgSSe96UStvtsAHGB4eM+LykdnE7cI
 SoytU5KfAM8DD6wnyHIgYuvJyZWrmLoVDrRjym8emc2KrJOe7qg+Ah4ERcNTCnhl
 nw50f27G4w==
 =waNY
 -----END PGP SIGNATURE-----

Merge tag 'block-6.12-20241018' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:

 - NVMe pull request via Keith:
     - Fix target passthrough identifier (Nilay)
     - Fix tcp locking (Hannes)
     - Replace list with sbitmap for tracking RDMA rsp tags (Guixin)
     - Remove unnecessary fallthrough statements (Tokunori)
     - Remove ready-without-media support (Greg)
     - Fix multipath partition scan deadlock (Keith)
     - Fix concurrent PCI reset and remove queue mapping (Maurizio)
     - Fabrics shutdown fixes (Nilay)

 - Fix for a kerneldoc warning (Keith)

 - Fix a race with blk-rq-qos and wakeups (Omar)

 - Cleanup of checking for always-set tag_set (SurajSonawane2415)

 - Fix for a crash with CPU hotplug notifiers (Ming)

 - Don't allow zero-copy ublk on unprivileged device (Ming)

 - Use array_index_nospec() for CDROM (Josh)

 - Remove dead code in drbd (David)

 - Tweaks to elevator loading (Breno)

* tag 'block-6.12-20241018' of git://git.kernel.dk/linux:
  cdrom: Avoid barrier_nospec() in cdrom_ioctl_media_changed()
  nvme: use helper nvme_ctrl_state in nvme_keep_alive_finish function
  nvme: make keep-alive synchronous operation
  nvme-loop: flush off pending I/O while shutting down loop controller
  nvme-pci: fix race condition between reset and nvme_dev_disable()
  ublk: don't allow user copy for unprivileged device
  blk-rq-qos: fix crash on rq_qos_wait vs. rq_qos_wake_function race
  nvme-multipath: defer partition scanning
  blk-mq: setup queue ->tag_set before initializing hctx
  elevator: Remove argument from elevator_find_get
  elevator: do not request_module if elevator exists
  drbd: Remove unused conn_lowest_minor
  nvme: disable CC.CRIME (NVME_CC_CRIME)
  nvme: delete unnecessary fallthru comment
  nvmet-rdma: use sbitmap to replace rsp free list
  block: Fix elevator_get_default() checking for NULL q->tag_set
  nvme: tcp: avoid race between queue_lock lock and destroy
  nvmet-passthru: clear EUID/NGUID/UUID while using loop target
  block: fix blk_rq_map_integrity_sg kernel-doc
2024-10-18 15:53:00 -07:00
Nilay Shroff
599d9f3a10 nvme: use helper nvme_ctrl_state in nvme_keep_alive_finish function
We no longer need to acquire ctrl->lock before accessing the
NVMe controller state; we can use the helper nvme_ctrl_state instead.
So replace the use of ctrl->lock in the nvme_keep_alive_finish function
with a call to nvme_ctrl_state.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-10-17 11:07:37 -07:00
Nilay Shroff
d06923670b nvme: make keep-alive synchronous operation
The nvme keep-alive operation, which executes at a periodic interval,
could potentially sneak in while shutting down a fabric controller.
This may lead to a race between the fabric controller admin queue
destroy code path (invoked while shutting down controller) and hw/hctx
queue dispatcher called from the nvme keep-alive async request queuing
operation. This race could lead to the kernel crash shown below:

Call Trace:
    autoremove_wake_function+0x0/0xbc (unreliable)
    __blk_mq_sched_dispatch_requests+0x114/0x24c
    blk_mq_sched_dispatch_requests+0x44/0x84
    blk_mq_run_hw_queue+0x140/0x220
    nvme_keep_alive_work+0xc8/0x19c [nvme_core]
    process_one_work+0x200/0x4e0
    worker_thread+0x340/0x504
    kthread+0x138/0x140
    start_kernel_thread+0x14/0x18

While shutting down a fabric controller, if an nvme keep-alive request
sneaks in, it is flushed off. The nvme_keep_alive_end_io function is
then invoked to handle the end of the keep-alive operation, which
decrements admin->q_usage_counter; assuming this is the last/only
request in the admin queue, admin->q_usage_counter becomes zero.
If that happens, the blk-mq destroy queue operation
(blk_mq_destroy_queue()), which could be running simultaneously on
another cpu (as this is the controller shutdown code path), makes
forward progress and deletes the admin queue. From this point onward
we are not supposed to access the admin queue resources. However, the
nvme keep-alive thread running the hw/hctx queue dispatch operation
hasn't yet finished its work, so it could still access the admin queue
resources after the admin queue has been deleted, and that causes the
above crash.

This fix avoids the observed crash by implementing keep-alive as a
synchronous operation, so that admin->q_usage_counter is decremented only
after the keep-alive command has finished executing and returned its
status to the caller (blk_execute_rq()). This ensures that the fabric
shutdown code path doesn't destroy the fabric admin queue until the
keep-alive request has finished executing and the keep-alive thread is
no longer running the hw/hctx queue dispatch operation.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-10-17 11:07:37 -07:00
Nilay Shroff
c199fac88f nvme-loop: flush off pending I/O while shutting down loop controller
While shutting down a loop controller, we first quiesce the admin/IO queue,
delete the admin/IO tag-set and then finally destroy the admin/IO queue.
However, it's quite possible that during the window between quiescing and
destroying the admin/IO queue, some admin/IO request might sneak in,
and if that happens we could encounter a hung task because the shutdown
operation can't make forward progress until any pending I/O is flushed
off.

This commit ensures that before destroying the admin/IO queue, we
unquiesce it so that any outstanding requests, which were added after
the admin/IO queue was quiesced, are flushed to completion.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-10-17 11:07:37 -07:00
Maurizio Lombardi
26bc0a81f6 nvme-pci: fix race condition between reset and nvme_dev_disable()
nvme_dev_disable() modifies the dev->online_queues field, therefore
nvme_pci_update_nr_queues() should avoid racing against it, otherwise
we could end up passing invalid values to blk_mq_update_nr_hw_queues().

 WARNING: CPU: 39 PID: 61303 at drivers/pci/msi/api.c:347
          pci_irq_get_affinity+0x187/0x210
 Workqueue: nvme-reset-wq nvme_reset_work [nvme]
 RIP: 0010:pci_irq_get_affinity+0x187/0x210
 Call Trace:
  <TASK>
  ? blk_mq_pci_map_queues+0x87/0x3c0
  ? pci_irq_get_affinity+0x187/0x210
  blk_mq_pci_map_queues+0x87/0x3c0
  nvme_pci_map_queues+0x189/0x460 [nvme]
  blk_mq_update_nr_hw_queues+0x2a/0x40
  nvme_reset_work+0x1be/0x2a0 [nvme]

Fix the bug by locking the shutdown_lock mutex before using
dev->online_queues. Give up if nvme_dev_disable() is running or if
it has been executed already.

Fixes: 949928c1c7 ("NVMe: Fix possible queue use after freed")
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-10-17 09:55:22 -07:00
Keith Busch
1f021341ee nvme-multipath: defer partition scanning
We need to suppress the partition scan from occurring within the
controller's scan_work context. If a path error occurs here, the IO will
wait until a path becomes available or all paths are torn down, but that
action also occurs within scan_work, so it would deadlock. Defer the
partition scan to a different context that does not block scan_work.

Reported-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-10-15 08:32:07 -07:00
Greg Joyce
0ce96a6708 nvme: disable CC.CRIME (NVME_CC_CRIME)
Disable NVME_CC_CRIME so that CSTS.RDY indicates that the media
is ready and able to handle commands without returning
NVME_SC_ADMIN_COMMAND_MEDIA_NOT_READY.

Signed-off-by: Greg Joyce <gjoyce@linux.ibm.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Tested-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-10-09 14:45:19 -07:00
Tokunori Ikegami
9c7072df53 nvme: delete unnecessary fallthru comment
Signed-off-by: Tokunori Ikegami <ikegami.t@gmail.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-10-08 13:50:44 -07:00
Guixin Liu
40f0e5dc2f nvmet-rdma: use sbitmap to replace rsp free list
We can use sbitmap to manage all the nvmet_rdma_rsp instead of using
free lists and spinlock, and we can use an additional tag to
determine whether the nvmet_rdma_rsp is extra allocated.

In addition, performance has improved:
1. The testing environment is a local rxe rdma device and a mem-based
backstore device.
2. fio command, averaged over 5 runs:
fio -filename=/dev/nvme0n1 --ioengine=libaio -direct=1
-size=1G -name=1 -thread -runtime=60 -time_based -rw=read -numjobs=16
-iodepth=128 -bs=4k -group_reporting
3. Before: 241k IOPS, After: 256k IOPS, an increase of about 5%.

Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
2024-10-08 13:45:36 -07:00
Hannes Reinecke
782373ba27 nvme: tcp: avoid race between queue_lock lock and destroy
Commit 76d54bf20c ("nvme-tcp: don't access released socket during
error recovery") added a mutex_lock() call for the queue->queue_lock
in nvme_tcp_get_address(). However, the mutex_lock() races with
mutex_destroy() in nvme_tcp_free_queue(), and causes the WARN below.

DEBUG_LOCKS_WARN_ON(lock->magic != lock)
WARNING: CPU: 3 PID: 34077 at kernel/locking/mutex.c:587 __mutex_lock+0xcf0/0x1220
Modules linked in: nvmet_tcp nvmet nvme_tcp nvme_fabrics iw_cm ib_cm ib_core pktcdvd nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables qrtr sunrpc ppdev 9pnet_virtio 9pnet pcspkr netfs parport_pc parport e1000 i2c_piix4 i2c_smbus loop fuse nfnetlink zram bochs drm_vram_helper drm_ttm_helper ttm drm_kms_helper xfs drm sym53c8xx floppy nvme scsi_transport_spi nvme_core nvme_auth serio_raw ata_generic pata_acpi dm_multipath qemu_fw_cfg [last unloaded: ib_uverbs]
CPU: 3 UID: 0 PID: 34077 Comm: udisksd Not tainted 6.11.0-rc7 #319
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014
RIP: 0010:__mutex_lock+0xcf0/0x1220
Code: 08 84 d2 0f 85 c8 04 00 00 8b 15 ef b6 c8 01 85 d2 0f 85 78 f4 ff ff 48 c7 c6 20 93 ee af 48 c7 c7 60 91 ee af e8 f0 a7 6d fd <0f> 0b e9 5e f4 ff ff 48 b8 00 00 00 00 00 fc ff df 4c 89 f2 48 c1
RSP: 0018:ffff88811305f760 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff88812c652058 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000001
RBP: ffff88811305f8b0 R08: 0000000000000001 R09: ffffed1075c36341
R10: ffff8883ae1b1a0b R11: 0000000000010498 R12: 0000000000000000
R13: 0000000000000000 R14: dffffc0000000000 R15: ffff88812c652058
FS:  00007f9713ae4980(0000) GS:ffff8883ae180000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fcd78483c7c CR3: 0000000122c38000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 ? __warn.cold+0x5b/0x1af
 ? __mutex_lock+0xcf0/0x1220
 ? report_bug+0x1ec/0x390
 ? handle_bug+0x3c/0x80
 ? exc_invalid_op+0x13/0x40
 ? asm_exc_invalid_op+0x16/0x20
 ? __mutex_lock+0xcf0/0x1220
 ? nvme_tcp_get_address+0xc2/0x1e0 [nvme_tcp]
 ? __pfx___mutex_lock+0x10/0x10
 ? __lock_acquire+0xd6a/0x59e0
 ? nvme_tcp_get_address+0xc2/0x1e0 [nvme_tcp]
 nvme_tcp_get_address+0xc2/0x1e0 [nvme_tcp]
 ? __pfx_nvme_tcp_get_address+0x10/0x10 [nvme_tcp]
 nvme_sysfs_show_address+0x81/0xc0 [nvme_core]
 dev_attr_show+0x42/0x80
 ? __asan_memset+0x1f/0x40
 sysfs_kf_seq_show+0x1f0/0x370
 seq_read_iter+0x2cb/0x1130
 ? rw_verify_area+0x3b1/0x590
 ? __mutex_lock+0x433/0x1220
 vfs_read+0x6a6/0xa20
 ? lockdep_hardirqs_on+0x78/0x100
 ? __pfx_vfs_read+0x10/0x10
 ksys_read+0xf7/0x1d0
 ? __pfx_ksys_read+0x10/0x10
 ? __x64_sys_openat+0x105/0x1d0
 do_syscall_64+0x93/0x180
 ? lockdep_hardirqs_on_prepare+0x16d/0x400
 ? do_syscall_64+0x9f/0x180
 ? lockdep_hardirqs_on+0x78/0x100
 ? do_syscall_64+0x9f/0x180
 ? __pfx_ksys_read+0x10/0x10
 ? lockdep_hardirqs_on_prepare+0x16d/0x400
 ? do_syscall_64+0x9f/0x180
 ? lockdep_hardirqs_on+0x78/0x100
 ? do_syscall_64+0x9f/0x180
 ? lockdep_hardirqs_on_prepare+0x16d/0x400
 ? do_syscall_64+0x9f/0x180
 ? lockdep_hardirqs_on+0x78/0x100
 ? do_syscall_64+0x9f/0x180
 ? lockdep_hardirqs_on_prepare+0x16d/0x400
 ? do_syscall_64+0x9f/0x180
 ? lockdep_hardirqs_on+0x78/0x100
 ? do_syscall_64+0x9f/0x180
 ? lockdep_hardirqs_on_prepare+0x16d/0x400
 ? do_syscall_64+0x9f/0x180
 ? lockdep_hardirqs_on+0x78/0x100
 ? do_syscall_64+0x9f/0x180
 ? do_syscall_64+0x9f/0x180
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f9713f55cfa
Code: 55 48 89 e5 48 83 ec 20 48 89 55 e8 48 89 75 f0 89 7d f8 e8 e8 74 f8 ff 48 8b 55 e8 48 8b 75 f0 41 89 c0 8b 7d f8 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 2e 44 89 c7 48 89 45 f8 e8 42 75 f8 ff 48 8b
RSP: 002b:00007ffd7f512e70 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
RAX: ffffffffffffffda RBX: 000055c38f316859 RCX: 00007f9713f55cfa
RDX: 0000000000000fff RSI: 00007ffd7f512eb0 RDI: 0000000000000011
RBP: 00007ffd7f512e90 R08: 0000000000000000 R09: 00000000ffffffff
R10: 0000000000000000 R11: 0000000000000246 R12: 000055c38f317148
R13: 0000000000000000 R14: 00007f96f4004f30 R15: 000055c3b6b623c0
 </TASK>

The WARN is observed when the blktests test case nvme/014 is repeated
with the tcp transport. It is rare, and 200 repetitions are required to
reproduce it in some test environments.

To avoid the WARN, check the NVME_TCP_Q_LIVE flag before locking
queue->queue_lock. The flag is cleared long before the lock gets
destroyed.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-10-03 12:14:35 -07:00
Al Viro
5f60d5f6bb move asm/unaligned.h to linux/unaligned.h
asm/unaligned.h is always an include of asm-generic/unaligned.h;
might as well move that thing to linux/unaligned.h and include
that - there's nothing arch-specific in that header.

auto-generated by the following:

for i in `git grep -l -w asm/unaligned.h`; do
	sed -i -e "s/asm\/unaligned.h/linux\/unaligned.h/" $i
done
for i in `git grep -l -w asm-generic/unaligned.h`; do
	sed -i -e "s/asm-generic\/unaligned.h/linux\/unaligned.h/" $i
done
git mv include/asm-generic/unaligned.h include/linux/unaligned.h
git mv tools/include/asm-generic/unaligned.h tools/include/linux/unaligned.h
sed -i -e "/unaligned.h/d" include/asm-generic/Kbuild
sed -i -e "s/__ASM_GENERIC/__LINUX/" include/linux/unaligned.h tools/include/linux/unaligned.h
2024-10-02 17:23:23 -04:00
Nilay Shroff
e38dad438f nvmet-passthru: clear EUID/NGUID/UUID while using loop target
When nvme passthru is configured using loop target, the clear_ids
attribute is, by default, set to true. This attribute would ensure that
EUID/NGUID/UUID is cleared for the loop passthru target.

Newer NVMe disks supporting NVMe spec 1.3 or higher typically
implement support for the "Namespace Identification Descriptor list"
command. This command, when issued from the host, returns the
EUID/NGUID/UUID assigned to the inquired namespace. Not clearing these
values while using nvme passthru with the loop target would result in
the NVMe host driver rejecting the namespace. This check was implemented
in commit 2079f41ec6 ("nvme: check that EUI/GUID/UUID are globally unique").

The fix implemented in this commit ensures that when the host issues the
ns-id descriptor list command, the EUID/NGUID/UUID are cleared by the
passthru target. In fact, the function nvmet_passthru_override_id_descs(),
which clears those unique ids, already exists, so we just need to ensure
that the ns-id descriptor list command falls through the correct code
path. And while we're at it, we also combine the three passthru admin
command cases that share the same code.

Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-10-01 11:08:40 -07:00
Linus Torvalds
11a299a793 for-6.12/block-20240925
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmb0T5AQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpnfHEADCXqmqZC+xr3sHZH9T1lz9KaFp1FjuBhCw
 bGpUgXQ9aLcqQUWJxmYVer8N2x2+Ds+xq4fm/rP1BfvNgRupqheHBwuLxSrz14EX
 lYmKZ+krMIPTDaLFewmEWflDwmZX0WFgV6nKTMLiO5BMeI4zXCkFGtwYFys2+Cdd
 9zYCFPgGDZUR77Ws5PpyqPVz2MoiNtsjrGmHpEmNZ+rIDzlpVOYgYk27X9ZbvNxC
 /l0KTc9+ayAeG0Kx5jO+m6Hrj3I6ehvM9JZMgpS/tF/jtccD2oVkJFJDlU+Jciv6
 BwVzgyDPGV7sXFT1fnSqDBYYwr/73nzNH0Gk8wn4Jg2LhjmVANVo9eQSOXDTYZI+
 O4HfIHGTIrk75TQd4bhq3dqaylS78pKBI/eQJUli2UNoyLWMrMyE88yh2YJam2Fs
 vJ/MHGxvFRurYbAlqLr33nb3ajvpg+D7XuAYfqHPMc2ZUe28Kza50Dj+luNjfVCu
 3qfR6qBlsdWuABtUS3vneB9jZp5jDnOpVfuBgtcAqIboUjehTXsI7If09Ex/mxLq
 O0KqNwBMfunPOKd5kGXlAgY8LRMfOhNaAAFBlXYUZB2eAadQnqVselTFvHMZkXo7
 wH/l6trd+/Tf+7Rav0YduNIlpVr7IctC+A7ph4zPdIjQxFEySCrC7cvAjel29LyV
 zgWW0Mw/sA==
 =yiWu
 -----END PGP SIGNATURE-----

Merge tag 'for-6.12/block-20240925' of git://git.kernel.dk/linux

Pull more block updates from Jens Axboe:

 - Improve blk-integrity segment counting and merging (Keith)

 - NVMe pull request via Keith:
      - Multipath fixes (Hannes)
      - Sysfs attribute list NULL terminate fix (Shin'ichiro)
      - Remove problematic read-back (Keith)

 - Fix for a regression with the IO scheduler switching freezing from
   6.11 (Damien)

 - Use a raw spinlock for sbitmap, as it may get called from preempt
   disabled context (Ming)

 - Cleanup for bd_claiming waiting, using var_waitqueue() rather than
   the bit waitqueues, as that more accurately describes what it does
   (Neil)

 - Various cleanups (Kanchan, Qiu-ji, David)

* tag 'for-6.12/block-20240925' of git://git.kernel.dk/linux:
  nvme: remove CC register read-back during enabling
  nvme: null terminate nvme_tls_attrs
  nvme-multipath: avoid hang on inaccessible namespaces
  nvme-multipath: system fails to create generic nvme device
  lib/sbitmap: define swap_lock as raw_spinlock_t
  block: Remove unused blk_limits_io_{min,opt}
  drbd: Fix atomicity violation in drbd_uuid_set_bm()
  block: Fix elv_iosched_local_module handling of "none" scheduler
  block: remove bogus union
  block: change wait on bd_claiming to use a var_waitqueue
  blk-integrity: improved sg segment mapping
  block: unexport blk_rq_count_integrity_sg
  nvme-rdma: use request to get integrity segments
  scsi: use request to get integrity segments
  block: provide a request helper for user integrity segments
  blk-integrity: consider entire bio list for merging
  blk-integrity: properly account for segments
  blk-mq: set the nr_integrity_segments from bio
  blk-mq: unconditional nr_integrity_segments
2024-09-25 14:56:40 -07:00
Keith Busch
9064610348 nvme: remove CC register read-back during enabling
Any non-posted read should flush the previous write, so we don't
necessarily need to read back the value we just wrote. I've found at
least some controllers that respond with 0 for short moments after
writing the CC register with EN (enable) cleared, so the read-back is
overwriting our valid ctrl_config value and ends up breaking on the
subsequent enabling.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-09-24 23:35:10 -07:00
Shin'ichiro Kawasaki
83340d9c61 nvme: null terminate nvme_tls_attrs
Commit 1e48b34c9b ("nvme: split off TLS sysfs attributes into a
separate group") introduced the struct attribute array nvme_tls_attrs.
However, the array was not null terminated and caused BUG KASAN global-
out-of-bounds. To avoid the BUG, null terminate the array.
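
As an illustration of why the terminator matters, here is a small
userspace sketch. The names (demo_attr, demo_tls_attrs, demo_count_attrs)
and the two entries are hypothetical and are not the driver's definitions.
A walker over a sentinel-terminated pointer array has no length to rely on,
so without the trailing NULL it keeps reading past the end of the array.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct attribute. */
struct demo_attr {
	const char *name;
};

static struct demo_attr demo_key = { "tls_key" };
static struct demo_attr demo_keyring = { "tls_keyring" };

/* The trailing NULL is the only thing that tells a walker where to stop. */
static struct demo_attr *demo_tls_attrs[] = {
	&demo_key,
	&demo_keyring,
	NULL,
};

/* Count entries by walking until the NULL sentinel is reached. */
static size_t demo_count_attrs(struct demo_attr **attrs)
{
	size_t n = 0;

	assert(attrs != NULL);
	while (attrs[n] != NULL)
		n++;
	return n;
}
```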

Reported-by: Yi Zhang <yi.zhang@redhat.com>
Closes: https://lore.kernel.org/linux-nvme/jhllwfxcedrcxcnbajwl4x2l2ujcqowqcd4ps574zrafrqhjna@f4icvecutekm/
Fixes: 1e48b34c9b ("nvme: split off TLS sysfs attributes into a separate group")
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-09-24 23:34:13 -07:00
Hannes Reinecke
3b97f5a05c nvme-multipath: avoid hang on inaccessible namespaces
During repetitive namespace remapping operations on the target the
namespace might have changed between the time the initial scan
was performed, and partition scan was invoked by device_add_disk()
in nvme_mpath_set_live(). We then end up with a stuck scanning process:

[<0>] folio_wait_bit_common+0x12a/0x310
[<0>] filemap_read_folio+0x97/0xd0
[<0>] do_read_cache_folio+0x108/0x390
[<0>] read_part_sector+0x31/0xa0
[<0>] read_lba+0xc5/0x160
[<0>] efi_partition+0xd9/0x8f0
[<0>] bdev_disk_changed+0x23d/0x6d0
[<0>] blkdev_get_whole+0x78/0xc0
[<0>] bdev_open+0x2c6/0x3b0
[<0>] bdev_file_open_by_dev+0xcb/0x120
[<0>] disk_scan_partitions+0x5d/0x100
[<0>] device_add_disk+0x402/0x420
[<0>] nvme_mpath_set_live+0x4f/0x1f0 [nvme_core]
[<0>] nvme_mpath_add_disk+0x107/0x120 [nvme_core]
[<0>] nvme_alloc_ns+0xac6/0xe60 [nvme_core]
[<0>] nvme_scan_ns+0x2dd/0x3e0 [nvme_core]
[<0>] nvme_scan_work+0x1a3/0x490 [nvme_core]

This happens when we have several paths, some of which are inaccessible,
and the active paths are removed first. Then nvme_find_path() will requeue
I/O in the ns_head (as paths are present), but the requeue list is never
triggered as all remaining paths are inactive.

This patch checks for NVME_NSHEAD_DISK_LIVE in nvme_available_path(),
and requeues I/O after NVME_NSHEAD_DISK_LIVE has been cleared once
the last path has been removed to properly terminate pending I/O.

Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-09-24 23:30:28 -07:00
Hannes Reinecke
63bcf9014e nvme-multipath: system fails to create generic nvme device
NVME_NSHEAD_DISK_LIVE is a flag for struct nvme_ns_head, not nvme_ns.
The current code has a typo causing NVME_NSHEAD_DISK_LIVE never to
be cleared once device_add_disk() fails, causing the system never to
create the 'generic' character device. Even several rescan attempts
will not change the situation and the system has to be rebooted to fix
the issue.

Fixes: 11384580e3 ("nvme-multipath: add error handling support for add_disk()")
Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-09-24 23:30:28 -07:00
Linus Torvalds
7856a56541 Many singleton patches - please see the various changelogs for details.
Quite a lot of nilfs2 work this time around.
 
 Notable patch series in this pull request are:
 
 "mul_u64_u64_div_u64: new implementation" by Nicolas Pitre, with
 assistance from Uwe Kleine-König.  Reimplement mul_u64_u64_div_u64() to
 provide (much) more accurate results.  The current implementation was
 causing Uwe some issues in the PWM drivers.
 
 "xz: Updates to license, filters, and compression options" from Lasse
 Collin.  Miscellaneous maintenance and minor feature work to the xz
 decompressor.
 
 "Fix some GDB command error and add some GDB commands" from Kuan-Ying Lee.
 Fixes and enhancements to the gdb scripts.
 
 "treewide: add missing MODULE_DESCRIPTION() macros" from Jeff Johnson.
 Adds lots of MODULE_DESCRIPTIONs, thus fixing lots of warnings about this.
 
 "nilfs2: add support for some common ioctls" from Ryusuke Konishi.  Adds
 various commonly-available ioctls to nilfs2.
 
 "This series fixes a number of formatting issues in kernel doc comments"
 from Ryusuke Konishi does that.
 
 "nilfs2: prevent unexpected ENOENT propagation" from Ryusuke Konishi.  Fix
 issues where -ENOENT was being unintentionally and inappropriately
 returned to userspace.
 
 "nilfs2: assorted cleanups" from Huang Xiaojia.
 
 "nilfs2: fix potential issues with empty b-tree nodes" from Ryusuke
 Konishi fixes some issues which can occur on corrupted nilfs2 filesystems.
 
 "scripts/decode_stacktrace.sh: improve error reporting and usability" from
 Luca Ceresoli does those things.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZu7dpAAKCRDdBJ7gKXxA
 jsPqAPwMDEZyKlfSw7QioEHNHDkmkbP7VYCYR0CbUnppbztwpAD8D37aVbWQ+UzM
 3nnOq3W2Pc2o/20zqi8Upf1mnvUrygQ=
 =/NWE
 -----END PGP SIGNATURE-----

Merge tag 'mm-nonmm-stable-2024-09-21-07-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull non-MM updates from Andrew Morton:
 "Many singleton patches - please see the various changelogs for
  details.

  Quite a lot of nilfs2 work this time around.

  Notable patch series in this pull request are:

   - "mul_u64_u64_div_u64: new implementation" by Nicolas Pitre, with
     assistance from Uwe Kleine-König. Reimplement mul_u64_u64_div_u64()
     to provide (much) more accurate results. The current implementation
     was causing Uwe some issues in the PWM drivers.

   - "xz: Updates to license, filters, and compression options" from
     Lasse Collin. Miscellaneous maintenance and minor feature work to
     the xz decompressor.

   - "Fix some GDB command error and add some GDB commands" from
     Kuan-Ying Lee. Fixes and enhancements to the gdb scripts.

   - "treewide: add missing MODULE_DESCRIPTION() macros" from Jeff
     Johnson. Adds lots of MODULE_DESCRIPTIONs, thus fixing lots of
     warnings about this.

   - "nilfs2: add support for some common ioctls" from Ryusuke Konishi.
     Adds various commonly-available ioctls to nilfs2.

   - "This series fixes a number of formatting issues in kernel doc
     comments" from Ryusuke Konishi does that.

   - "nilfs2: prevent unexpected ENOENT propagation" from Ryusuke
     Konishi. Fix issues where -ENOENT was being unintentionally and
     inappropriately returned to userspace.

   - "nilfs2: assorted cleanups" from Huang Xiaojia.

   - "nilfs2: fix potential issues with empty b-tree nodes" from Ryusuke
     Konishi fixes some issues which can occur on corrupted nilfs2
     filesystems.

   - "scripts/decode_stacktrace.sh: improve error reporting and
     usability" from Luca Ceresoli does those things"

* tag 'mm-nonmm-stable-2024-09-21-07-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (103 commits)
  list: test: increase coverage of list_test_list_replace*()
  list: test: fix tests for list_cut_position()
  proc: use __auto_type more
  treewide: correct the typo 'retun'
  ocfs2: cleanup return value and mlog in ocfs2_global_read_info()
  nilfs2: remove duplicate 'unlikely()' usage
  nilfs2: fix potential oob read in nilfs_btree_check_delete()
  nilfs2: determine empty node blocks as corrupted
  nilfs2: fix potential null-ptr-deref in nilfs_btree_insert()
  user_namespace: use kmemdup_array() instead of kmemdup() for multiple allocation
  tools/mm: rm thp_swap_allocator_test when make clean
  squashfs: fix percpu address space issues in decompressor_multi_percpu.c
  lib: glob.c: added null check for character class
  nilfs2: refactor nilfs_segctor_thread()
  nilfs2: use kthread_create and kthread_stop for the log writer thread
  nilfs2: remove sc_timer_task
  nilfs2: do not repair reserved inode bitmap in nilfs_new_inode()
  nilfs2: eliminate the shared counter and spinlock for i_generation
  nilfs2: separate inode type information from i_state field
  nilfs2: use the BITS_PER_LONG macro
  ...
2024-09-21 08:20:50 -07:00
Jens Axboe
42b16d3ac3 Linux 6.11
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmbm9fQeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGXcwH/A8+IXnrGv+VzYgD
 +mE4hgGGHt4dClcUZ31gQetkkT6xktEVp6pB6JkFO7oEgBiTkJBbYGl6VZtsAIOd
 Fi3jic8ik0uhZLFcxDJcHTceh6Pw8bkhWoh0tkF3bkDRwbppJdG7Khyk8DxTl24w
 ldqh9om2cC7w9IPVx93xTgKgMMZ63qiJyUdTvxEZI3BG8F70smlgZSPskLp2Iktd
 FIJZPcyKM0bhJYwZOpXK0vx5C2cA4oIW4xriHUw4aklv646OBxNKevB2JJAft2uA
 6LyvuLgnYn/OpdFGZ8slvdmhm6hLWft5B1/bWKorUkz7p5YGiySFzpkMVAkNJ6mS
 cRwHJNc=
 =flw3
 -----END PGP SIGNATURE-----

Merge tag 'v6.11' into for-6.12/block

Merge in 6.11 final to get the fix for preventing deadlocks on an
elevator switch, as there's a fixup for that patch.

* tag 'v6.11': (1788 commits)
  Linux 6.11
  Revert "KVM: VMX: Always honor guest PAT on CPUs that support self-snoop"
  pinctrl: pinctrl-cy8c95x0: Fix regcache
  cifs: Fix signature miscalculation
  mm: avoid leaving partial pfn mappings around in error case
  drm/xe/client: add missing bo locking in show_meminfo()
  drm/xe/client: fix deadlock in show_meminfo()
  drm/xe/oa: Enable Xe2+ PES disaggregation
  drm/xe/display: fix compat IS_DISPLAY_STEP() range end
  drm/xe: Fix access_ok check in user_fence_create
  drm/xe: Fix possible UAF in guc_exec_queue_process_msg
  drm/xe: Remove fence check from send_tlb_invalidation
  drm/xe/gt: Remove double include
  net: netfilter: move nf flowtable bpf initialization in nf_flow_table_module_init()
  PCI: Fix potential deadlock in pcim_intx()
  workqueue: Clear worker->pool in the worker thread context
  net: tighten bad gso csum offset check in virtio_net_hdr
  netlink: specs: mptcp: fix port endianness
  net: dpaa: Pad packets to ETH_ZLEN
  mptcp: pm: Fix uaf in __timer_delete_sync
  ...
2024-09-17 08:32:53 -06:00
Linus Torvalds
26bb0d3f38 for-6.12/block-20240913
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmbkZhQQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpjOKD/0fzd4yOcqxSI9W3OLGd04VrOTJIQa4CRbV
 GmoTq39pOeIDVGug5ekkTpqqHHnuGk+nQhCzD9vsN/eTmC7yZOIr847O2aWzvYEn
 PzFRgmJpoo2E9sr/IsTR5LnJjbaIZhQVkqLH6ZOj9tpKlVwN2SK0nIRVNrAi5zgT
 MaDrto/2OUld+vmA99Rgb23jxM6UBdCPIjuiVa+11Vg9Z3D1tWbBmrsG7OMysyIf
 FbASBeKHqFSO61/ipFCZv6VV1X8zoWEVyT8n4A1yUbbN5rLzPgoQJVbfSqQRXIdr
 cdrKeCbKxl+joSgKS6LKpvnfwRgGF+hgAfpZg4c0vrbZGTQcRhhLFECyh/aVI08F
 p5TOMArhVaX59664gHgSPq4KnGTXOO29dot9N3Jya/ZQnxinjY9r+GVOfLuduPPy
 1B04vab8oAsk4zK7fZbkDxgYUyifwzK/vQ6OqYq2mYdpdIS/AE7T2ou61Bz5mI7I
 /BuucNV0Z96OKlyLEXwXXZjZgNu1TFcq6ARIBJ8L08PY64Fesj5BXabRyXkeNH26
 0exyz9heeJs6OwRGfngXmS24tDSS0k74CeZX3KoePNj69u6KCn346KiU1qgntwwD
 E5F7AEHqCl5FjUEIWB4M1EPlfA8U0MzOL+tkx2xKJAjsU60wAy7jRSyOIcqodpMs
 6UlPcJzgYg==
 =uuLl
 -----END PGP SIGNATURE-----

Merge tag 'for-6.12/block-20240913' of git://git.kernel.dk/linux

Pull block updates from Jens Axboe:

 - MD changes via Song:
      - md-bitmap refactoring (Yu Kuai)
      - raid5 performance optimization (Artur Paszkiewicz)
      - Other small fixes (Yu Kuai, Chen Ni)
      - Add a sysfs entry 'new_level' (Xiao Ni)
      - Improve information reported in /proc/mdstat (Mateusz Kusiak)

 - NVMe changes via Keith:
      - Asynchronous namespace scanning (Stuart)
      - TCP TLS updates (Hannes)
      - RDMA queue controller validation (Niklas)
      - Align field names to the spec (Anuj)
      - Metadata support validation (Puranjay)
      - A syntax cleanup (Shen)
      - Fix a Kconfig linking error (Arnd)
      - New queue-depth quirk (Keith)

 - Add missing unplug trace event (Keith)

 - blk-iocost fixes (Colin, Konstantin)

 - t10-pi modular removal and fixes (Alexey)

 - Fix for potential BLKSECDISCARD overflow (Alexey)

 - bio splitting cleanups and fixes (Christoph)

 - Deal with folios rather than pages, speeding up how the
   block layer handles bigger IOs (Kundan)

 - Use spinlocks rather than bit spinlocks in zram (Sebastian, Mike)

 - Reduce zoned device overhead in ublk (Ming)

 - Add and use sendpages_ok() for drbd and nvme-tcp (Ofir)

 - Fix regression in partition error pointer checking (Riyan)

 - Add support for write zeroes and rotational status in nbd (Wouter)

 - Add Yu Kuai as new BFQ maintainer. The scheduler has been
   unmaintained for quite a while.

 - Various sets of fixes for BFQ (Yu Kuai)

 - Misc fixes and cleanups (Alvaro, Christophe, Li, Md Haris, Mikhail,
   Yang)

* tag 'for-6.12/block-20240913' of git://git.kernel.dk/linux: (120 commits)
  nvme-pci: qdepth 1 quirk
  block: fix potential invalid pointer dereference in blk_add_partition
  blk_iocost: make read-only static array vrate_adj_pct const
  block: unpin user pages belonging to a folio at once
  mm: release number of pages of a folio
  block: introduce folio awareness and add a bigger size from folio
  block: Added folio-ized version of bio_add_hw_page()
  block, bfq: factor out a helper to split bfqq in bfq_init_rq()
  block, bfq: remove local variable 'bfqq_already_existing' in bfq_init_rq()
  block, bfq: remove local variable 'split' in bfq_init_rq()
  block, bfq: remove bfq_log_bfqg()
  block, bfq: merge bfq_release_process_ref() into bfq_put_cooperator()
  block, bfq: fix procress reference leakage for bfqq in merge chain
  block, bfq: fix uaf for accessing waker_bfqq after splitting
  blk-throttle: support prioritized processing of metadata
  blk-throttle: remove last_low_overflow_time
  drbd: Add NULL check for net_conf to prevent dereference in state validation
  nvme-tcp: fix link failure for TCP auth
  blk-mq: add missing unplug trace event
  mtip32xx: Remove redundant null pointer checks in mtip_hw_debugfs_init()
  ...
2024-09-16 13:33:06 +02:00
Keith Busch
76c313f658 blk-integrity: improved sg segment mapping
Make the integrity mapping more like data mapping, blk_rq_map_sg. Use
the request to validate the segment count, and update the callers so
they don't have to.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20240913191746.2628196-1-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-13 13:22:09 -06:00
Keith Busch
f4330766bc nvme-rdma: use request to get integrity segments
The request tracks the integrity segments already, so no need to recount
the segments again.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20240913182854.2445457-8-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-13 12:31:45 -06:00
Keith Busch
d2c5b1facc block: provide a request helper for user integrity segments
Provide a helper to keep the request flags and nr_integrity_segments in
sync with the bio's integrity payload. This is an integrity equivalent
to the normal data helper function, 'blk_rq_map_user()'.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20240913182854.2445457-6-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-13 12:31:45 -06:00
Keith Busch
83bdfcbdbe nvme-pci: qdepth 1 quirk
Another device has been reported to be unreliable if we have more than
one outstanding command. In this new case, data corruption may occur.
Since we have two devices now needing this quirky behavior, make a
generic quirk flag.

The same Apple quirk is clearly not "temporary", so update the comment
while moving it.

Link: https://lore.kernel.org/linux-nvme/191d810a4e3.fcc6066c765804.973611676137075390@collabora.com/
Reported-by: Robert Beckett <bob.beckett@collabora.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-09-13 07:29:30 -07:00
Arnd Bergmann
2d5a333e09 nvme-tcp: fix link failure for TCP auth
The nvme fabric driver calls the nvme_tls_key_lookup() function from
nvmf_parse_key() when the keyring is enabled, but this is broken in a
configuration with CONFIG_NVME_FABRICS=y and CONFIG_NVME_TCP=m because
this leads to the function definition being in a loadable module:

x86_64-linux-ld: vmlinux.o: in function `nvmf_parse_key':
fabrics.c:(.text+0xb1bdec): undefined reference to `nvme_tls_key_lookup'

Move the 'select' up to CONFIG_NVME_FABRICS itself to force this
part to be built-in as well if needed.

Fixes: 5bc46b49c8 ("nvme-tcp: check for invalidated or revoked key")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-09-10 07:41:07 -07:00
Shen Lichuan
389e72c5d1 nvme: Convert comma to semicolon
To ensure code clarity and prevent potential errors, it's advisable
to employ the ';' as a statement separator, except when ',' are
intentionally used for specific purposes.
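
As a standalone illustration of why the distinction matters (a sketch only, not the code touched by this patch; the struct and function names are hypothetical), a comma joins two assignments into a single statement, so an `if` guards both of them, whereas with semicolons only the first assignment is guarded:

```c
/* Hypothetical illustration; not taken from the nvme driver. */
struct limits {
	int max_segments;
	int max_sectors;
};

/* Comma operator: both assignments form ONE statement, so both are guarded. */
static void set_limits_comma(struct limits *l, int enable)
{
	if (enable)
		l->max_segments = 8,
		l->max_sectors = 256;
}

/* Semicolons: only the first assignment is guarded by the if. */
static void set_limits_semicolon(struct limits *l, int enable)
{
	if (enable)
		l->max_segments = 8;
	l->max_sectors = 256;
}
```

With `enable == 0`, the comma variant leaves both fields untouched while the semicolon variant still writes `max_sectors`, which is the kind of silent behavioural difference an accidental ',' can hide.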

Signed-off-by: Shen Lichuan <shenlichuan@vivo.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-09-06 14:20:38 -07:00
Maurizio Lombardi
899d2e5a4e nvmet: Identify-Active Namespace ID List command should reject invalid nsid
nsid values of 0xFFFFFFFE and 0xFFFFFFFF should be rejected with
a status code of "Invalid Namespace or Format".
See NVMe Base Specification, Active Namespace ID list (CNS 02h).
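
The check described above can be sketched in plain C as follows. This is a minimal userspace sketch, not the nvmet implementation; the macro and function names are assumptions made for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; the two values come from the commit message. */
#define SKETCH_NSID_RESERVED  0xFFFFFFFEu
#define SKETCH_NSID_BROADCAST 0xFFFFFFFFu

/*
 * Returns true if nsid may be used as the starting point of an
 * Active Namespace ID list request, false if it must be rejected.
 */
static bool active_ns_list_nsid_valid(uint32_t nsid)
{
	return nsid != SKETCH_NSID_RESERVED && nsid != SKETCH_NSID_BROADCAST;
}
```

A caller would complete the command with the "Invalid Namespace or Format" status whenever this helper returns false.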

Fixes: a07b4970f4 ("nvmet: add a generic NVMe target")
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-09-03 10:05:40 -07:00
Christoph Hellwig
28982ad73d nvme: set BLK_FEAT_ZONED for ZNS multipath disks
The new stricter limits validation doesn't like a max_append_sectors value
to be set without BLK_FEAT_ZONED.  Set it before allocating the disk to
fix this instead of just inheriting it later.

Fixes: d690cb8ae1 ("block: add an API to atomically update queue limits")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-09-03 10:00:10 -07:00
Jani Nikula
6ce2082fd3 fault-inject: improve build for CONFIG_FAULT_INJECTION=n
The fault-inject.h users across the kernel need to add a lot of #ifdef
CONFIG_FAULT_INJECTION to cater for shortcomings in the header.  Make
fault-inject.h self-contained for CONFIG_FAULT_INJECTION=n, and add stubs
for DECLARE_FAULT_ATTR(), setup_fault_attr(), should_fail_ex(), and
should_fail() to allow removal of conditional compilation.

[akpm@linux-foundation.org: repair fallout from no longer including debugfs.h into fault-inject.h]
[akpm@linux-foundation.org: fix drivers/misc/xilinx_tmr_inject.c]
[akpm@linux-foundation.org: Add debugfs.h inclusion to more files, per Stephen]
Link: https://lkml.kernel.org/r/20240813121237.2382534-1-jani.nikula@intel.com
Fixes: 6ff1cb355e ("[PATCH] fault-injection capabilities infrastructure")
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Abhinav Kumar <quic_abhinavk@quicinc.com>
Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Rob Clark <robdclark@gmail.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01 20:43:33 -07:00
Puranjay Mohan
7c2fd76048 nvme: fix metadata handling in nvme-passthrough
On an NVMe namespace that does not support metadata, it is possible to
send an IO command with metadata through io-passthru. This allows issues
like [1] to trigger in the completion code path.
nvme_map_user_request() doesn't check if the namespace supports metadata
before sending it forward. It also allows admin commands with metadata to
be processed as it ignores metadata when bdev == NULL and may report
success.

Reject an IO command with metadata when the NVMe namespace doesn't
support it and reject an admin command if it has metadata.
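
The rule in the paragraph above can be modelled roughly as below. This is a simplified userspace sketch, not nvme_map_user_request() itself: `struct model_ns`, `check_user_metadata()` and the choice of -EINVAL as the error are assumptions, and a NULL namespace stands in for an admin command:

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the namespace state that matters here. */
struct model_ns {
	bool supports_metadata;
};

/*
 * ns == NULL models an admin command (no namespace / block device).
 * Returns 0 if the command may proceed, a negative errno otherwise.
 */
static int check_user_metadata(const struct model_ns *ns, unsigned int meta_len)
{
	if (meta_len == 0)
		return 0;		/* no metadata: nothing to validate */
	if (ns == NULL)
		return -EINVAL;		/* admin command carrying metadata */
	if (!ns->supports_metadata)
		return -EINVAL;		/* namespace cannot take metadata */
	return 0;
}
```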

[1] https://lore.kernel.org/all/mb61pcylvnym8.fsf@amazon.com/

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Puranjay Mohan <pjy@amazon.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-08-30 07:50:29 -07:00
Georg Gottleuber
61aa894e7a nvme-pci: Add sleep quirk for Samsung 990 Evo
On some TUXEDO platforms, a Samsung 990 Evo NVMe leads to a high
power consumption in s2idle sleep (2-3 watts).

This patch applies 'Force No Simple Suspend' quirk to achieve a
sleep with a lower power consumption, typically around 0.5 watts.

Signed-off-by: Georg Gottleuber <ggo@tuxedocomputers.com>
Signed-off-by: Werner Sembach <wse@tuxedocomputers.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-08-27 12:06:43 -07:00
Keith Busch
6f01bdbfef nvme-pci: allocate tagset on reset if necessary
If a drive is unable to create IO queues on the initial probe, a
subsequent reset will need to allocate the tagset if IO queue creation
is successful. Without this, blk_mq_update_nr_hw_queues will crash on a
bad pointer due to the invalid tagset.

Fixes: eac3ef2629 ("nvme-pci: split the initial probe from the rest path")
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-08-27 07:55:40 -07:00
Maurizio Lombardi
5572a55a6f nvmet-tcp: fix kernel crash if commands allocation fails
If the commands allocation fails in nvmet_tcp_alloc_cmds()
the kernel crashes in nvmet_tcp_release_queue_work() because of
a NULL pointer dereference.

  nvmet: failed to install queue 0 cntlid 1 ret 6
  Unable to handle kernel NULL pointer dereference at
         virtual address 0000000000000008

Fix the bug by setting queue->nr_cmds to zero in case
nvmet_tcp_alloc_cmd() fails.
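
The failure mode and the fix can be sketched in userspace C. The types and helpers below are hypothetical stand-ins, not the nvmet-tcp structures; the point is only that the element count must be reset when the array allocation fails, so the release path never walks a NULL array:

```c
#include <stdlib.h>

/* Hypothetical stand-ins for the queue and its command array. */
struct model_cmd {
	void *buf;
};

struct model_queue {
	struct model_cmd *cmds;
	int nr_cmds;
};

/* 'fail' simulates an allocation failure. Returns 0 or -1. */
static int model_alloc_cmds(struct model_queue *q, int nr, int fail)
{
	q->nr_cmds = nr;
	q->cmds = fail ? NULL : calloc((size_t)nr, sizeof(*q->cmds));
	if (!q->cmds) {
		q->nr_cmds = 0;	/* keep the count consistent with the NULL array */
		return -1;
	}
	return 0;
}

/* Frees every command; returns how many entries were visited. */
static int model_release_queue(struct model_queue *q)
{
	int i, visited = 0;

	for (i = 0; i < q->nr_cmds; i++) {
		free(q->cmds[i].buf);	/* would dereference NULL if nr_cmds were stale */
		visited++;
	}
	free(q->cmds);
	q->cmds = NULL;
	q->nr_cmds = 0;
	return visited;
}
```

Without the `q->nr_cmds = 0` line, the release loop would index into a NULL `cmds` pointer after a failed allocation, which mirrors the crash reported above.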

Fixes: 872d26a391 ("nvmet-tcp: add NVMe over TCP target driver")
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-08-26 16:00:52 -07:00
Anuj Gupta
cead0b8991 nvme: rename apptag and appmask to lbat and lbatm
Rename apptag and appmask to lbat and lbatm so that it matches the field
names used in NVMe spec.

Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-08-26 09:51:32 -07:00
Niklas Cassel
03c3d7c743 nvme-rdma: send cntlid in the RDMA_CM_REQUEST Private Data
When sending a RDMA_CM_REQUEST, the NVMe RDMA Transport Specification
allows you to populate the cntlid field in the RDMA_CM_REQUEST Private
Data.

The cntlid is returned by the target on completion of the first
RDMA_CM_REQUEST command (which creates the admin queue).

The cntlid field can then be populated by the host when the I/O queues
are created (using additional RDMA_CM_REQUEST commands), such that the
target can perform extra validation for additional RDMA_CM_REQUEST
commands.

An additional error code and error message are also added, such that
nvme_rdma_cm_msg() will display the proper error message if the target
fails the RDMA_CM_REQUEST command because of this extra validation.

Signed-off-by: Niklas Cassel <cassel@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-08-26 08:11:19 -07:00
Keith Busch
5a6d3a638c nvme: use better description for async reset reason
The NVMe AER notification of a persistent internal error triggers a
reset. The existing warning message just says "due to AER", which can be
confused with the unrelated PCIe AER condition. Just say what the event
was instead of the generic overloaded acronym.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-08-23 09:52:04 -07:00
Jinjie Ruan
f4bd313993 nvmet: Make nvmet_debugfs static
The sparse tool complains as follows:

drivers/nvme/target/debugfs.c:16:15: warning:
	symbol 'nvmet_debugfs' was not declared. Should it be static?

This symbol is not used outside debugfs.c, so mark it static.

Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-08-23 09:50:16 -07:00
Nilay Shroff
fe01751347 nvme: Remove unused field
The "name" field in struct nvme_ctrl is unused, so remove it.
This saves 12 bytes of space for each nvme_ctrl instance
created.

Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-08-22 13:28:40 -07:00
Ming Lei
a54a93d0e3 nvme: move stopping keep-alive into nvme_uninit_ctrl()
Commit 4733b65d82 ("nvme: start keep-alive after admin queue setup")
moved starting keep-alive from nvme_start_ctrl() into
nvme_init_ctrl_finish(), but did not move stopping keep-alive into
nvme_uninit_ctrl(), so keep-alive work can be started and stay pending
after the controller fails to start, and a use-after-free is eventually
triggered if the nvme host driver is unloaded.

This patch fixes the kernel panic seen when running nvme/004 with a
connection failure triggered, by moving stopping keep-alive into
nvme_uninit_ctrl().

This is reasonable because keep-alive is now started in
nvme_init_ctrl_finish().

Fixes: 3af755a468 ("nvme: move nvme_stop_keep_alive() back to original position")
Cc: Hannes Reinecke <hare@suse.de>
Cc: Mark O'Donovan <shiftee@posteo.net>
Reported-by: Changhui Zhong <czhong@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-08-22 13:28:40 -07:00
Hannes Reinecke
ff4a0a4088 nvme-target: do not check authentication status for admin commands twice
nvmet_check_ctrl_status() checks the authentication status, so
we don't need to do that prior to calling it.

Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-08-22 13:25:11 -07:00
Hannes Reinecke
bb2df18958 nvmet-auth: allow to clear DH-HMAC-CHAP keys
As we can set DH-HMAC-CHAP keys, we should also be
able to unset them.

Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-08-22 13:25:11 -07:00
Hannes Reinecke
02a3688c53 nvme-sysfs: add 'tls_keyring' attribute
Add a 'tls_keyring' attribute to display the contents of the
--keyring option from the connect string. Adding this attribute
allows us to recreate the original connect string from sysfs
settings.

Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-08-22 13:25:11 -07:00
Hannes Reinecke
f5eb739747 nvme-sysfs: add 'tls_configured_key' sysfs attribute
There is a difference between the negotiated TLS key (which is
always present for a TLS encrypted connection) and the configured
TLS key (which is specified with the --tls_key command line option).
To differentiate between these two, add a new sysfs attribute
'tls_configured_key' to hold the key specified on the command line.

Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-08-22 13:25:11 -07:00
Hannes Reinecke
1e48b34c9b nvme: split off TLS sysfs attributes into a separate group
Split off TLS sysfs attributes into a separate group to improve
readability and to keep all TLS related handling in one section.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-08-22 13:25:11 -07:00
Hannes Reinecke
c5f2ca52d0 nvme: add a newline to the 'tls_key' sysfs attribute
Print a newline for easier userspace handling.

Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-08-22 13:25:11 -07:00
Hannes Reinecke
5bc46b49c8 nvme-tcp: check for invalidated or revoked key
key_lookup() will always return a key, even if that key is revoked
or invalidated. So check for invalid keys before continuing.

Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-08-22 13:25:07 -07:00
Hannes Reinecke
363895767f nvme-tcp: sanitize TLS key handling
There is a difference between TLS configured (ie the user has
provisioned/requested a key) and TLS enabled (ie the connection
is encrypted with TLS). This becomes important for secure concatenation,
where the initial authentication is run on an unencrypted connection
(ie with TLS configured, but not enabled), and then the queue is reset to
run over TLS (ie TLS configured _and_ enabled).
So to differentiate between those two states store the generated
key in opts->tls_key (as we're using the same TLS key for all queues),
the key serial of the resulting TLS handshake in ctrl->tls_pskid
(to signal that TLS on the admin queue is enabled), and a simple
flag for the queues to indicate that TLS has been enabled.

Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-08-22 13:22:41 -07:00
Hannes Reinecke
79559c7533 nvme-keyring: restrict match length for version '1' identifiers
TP8018 introduced a new TLS PSK identifier version (version 1), which appended
a PSK hash value to the existing identifier (cf NVMe TCP specification v1.1,
section 3.6.1.3 'TLS PSK and PSK Identity Derivation').
An original (version 0) identifier has the form:

NVMe0<type><hmac> <hostnqn> <subsysnqn>

and a version 1 identifier has the form:

NVMe1<type><hmac> <hostnqn> <subsysnqn> <hash>

This patch modifies the lookup algorithm to compare only the first part
of the identifier (excluding the hash value) to handle both version 0 and
version 1 identifiers.
The spec also declares 'version 0' identifiers obsolete, so the lookup
algorithm is modified to prefer v1 identifiers.
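
The "compare only the first part" idea can be sketched as follows. This is a simplified userspace model, not the nvme-keyring lookup code: the helper names are hypothetical, and it assumes the fields are separated by single spaces and that the NQNs themselves contain no spaces:

```c
#include <stdbool.h>
#include <string.h>

/*
 * Length of "NVMe<ver><type><hmac> <hostnqn> <subsysnqn>", i.e. everything
 * before the optional trailing " <hash>" of a version 1 identifier.
 */
static size_t psk_id_match_len(const char *id)
{
	const char *p = strchr(id, ' ');	/* end of the first field */

	if (!p)
		return strlen(id);
	p = strchr(p + 1, ' ');			/* end of <hostnqn> */
	if (!p)
		return strlen(id);
	p = strchr(p + 1, ' ');			/* start of " <hash>", if present */
	return p ? (size_t)(p - id) : strlen(id);
}

/* True if both identifiers agree on everything except the hash suffix. */
static bool psk_id_matches(const char *key_desc, const char *wanted)
{
	size_t klen = psk_id_match_len(key_desc);
	size_t wlen = psk_id_match_len(wanted);

	return klen == wlen && strncmp(key_desc, wanted, wlen) == 0;
}
```

In this model a stored key described as `NVMe1R01 host sub hashvalue` matches a lookup for `NVMe1R01 host sub`, while an identifier with a different version digit does not; the identifier strings here are made up for illustration.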

Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-08-22 13:22:41 -07:00
Stuart Hayes
4e893ca811 nvme_core: scan namespaces asynchronously
Use async function calls to make namespace scanning happen in parallel.

Without the patch, NVMe namespaces are scanned serially, so it can take
a long time for all of a controller's namespaces to become available,
especially over a slower (TCP) interface with a large number of
namespaces.

It is not uncommon to have large numbers (hundreds or thousands) of
namespaces on nvme-of with storage servers.

The time it took for all namespaces to show up after connecting (via
TCP) to a controller with 1002 namespaces was measured on one system:

network latency   without patch   with patch
     0                 6s            1s
    50ms             210s           10s
   100ms             417s           18s

Measurements taken on another system show the effect of the patch on the
time nvme_scan_work() took to complete, when connecting to a linux
nvme-of target with varying numbers of namespaces, on a network of
400us.

namespaces    without patch   with patch
     1            16ms           14ms
     2            24ms           16ms
     4            49ms           22ms
     8           101ms           33ms
    16           207ms           56ms
   100           1.4s           0.6s
  1000          12.9s           2.0s

On the same system, connecting to a local PCIe NVMe drive (a Samsung
PM1733) instead of a network target:

namespaces    without patch   with patch
     1            13ms           12ms
     2            41ms           13ms

Signed-off-by: Stuart Hayes <stuart.w.hayes@gmail.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2024-08-22 10:05:22 -07:00
Kanchan Joshi
b4c1f33a5d nvme: reorganize nvme_ns_head fields
Shuffle a few fields to reduce the holes within nvme_ns_head.
On x86_64, the size is reduced to 1104 bytes from 1120 bytes.
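
The effect of such a reshuffle can be seen with a small generic example. The structs below are hypothetical and unrelated to the real nvme_ns_head layout; they only show how grouping small members removes alignment padding:

```c
#include <stddef.h>
#include <stdint.h>

/* Small members separated by a large one: padding after 'a' and after 'c'. */
struct with_holes {
	uint8_t  a;
	uint64_t b;
	uint8_t  c;
};

/* Same members, largest first: the two small members share one padded slot. */
struct reordered {
	uint64_t b;
	uint8_t  a;
	uint8_t  c;
};

/* Bytes saved per instance by the reordering on the current ABI. */
static size_t bytes_saved(void)
{
	return sizeof(struct with_holes) - sizeof(struct reordered);
}
```

On x86_64 the first layout typically occupies 24 bytes and the second 16; a tool such as pahole can be used to find holes like these in real structures.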

Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-07-31 07:40:10 -07:00
Kanchan Joshi
73d148ccb9 nvme: change data type of lba_shift
u8 fits the need, so stop using int for it.

Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-07-31 07:40:10 -07:00
Kanchan Joshi
6339b7edad nvme: remove a field from nvme_ns_head
pi_offset field is not required to be present in nvme_ns_head.

Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-07-31 07:40:10 -07:00
Kanchan Joshi
7ec5bd247a nvme: remove unused parameter
First parameter of nvme_init_integrity() is unused.
Remove it, and modify the callers.

Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-07-29 07:27:58 -07:00
Ofir Gal
6af7331a70 nvme-tcp: use sendpages_ok() instead of sendpage_ok()
Currently nvme_tcp_try_send_data() uses sendpage_ok() to decide whether to
disable MSG_SPLICE_PAGES. It checks only the first page of the iterator,
but the iterator may represent several contiguous pages.

MSG_SPLICE_PAGES enables skb_splice_from_iter(), which checks all the
pages it sends with sendpage_ok().

When nvme_tcp_try_send_data() sends an iterator whose first page is
sendable but one of whose other pages isn't, skb_splice_from_iter() warns
and aborts the data transfer.

Using the new helper sendpages_ok() in order to disable MSG_SPLICE_PAGES
solves the issue.
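
The difference between checking one page and checking all of them can be modelled as below. This is a userspace sketch with hypothetical types; the per-page predicate is a stand-in assumption and not the kernel's sendpage_ok() or sendpages_ok() implementation:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the page properties relevant to splicing. */
struct model_page {
	bool is_slab;
	int refcount;
};

/* Stand-in per-page check: a page is sendable if it can be safely referenced. */
static bool model_sendpage_ok(const struct model_page *page)
{
	return !page->is_slab && page->refcount >= 1;
}

/* Multi-page check: every page in the contiguous range must be sendable. */
static bool model_sendpages_ok(const struct model_page *pages, size_t nr)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		if (!model_sendpage_ok(&pages[i]))
			return false;
	}
	return true;
}
```

For a two-page range whose second page is not sendable, the single-page check on the first page passes while the multi-page check fails, which is the case where MSG_SPLICE_PAGES has to be disabled.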

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Ofir Gal <ofir.gal@volumez.com>
Link: https://lore.kernel.org/r/20240718084515.3833733-3-ofir.gal@volumez.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-28 16:47:52 -06:00
Jens Axboe
f6bb5254b7 nvme fixes for Linux 6.11
  - Fix request without payloads cleanup (Leon)
  - Use new protection information format (Francis)
  - Improved debug message for lost pci link (Bart)
  - Another apst quirk (Wang)
  - Use appropriate sysfs api for printing chars (Markus)
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE3Fbyvv+648XNRdHTPe3zGtjzRgkFAmajrFcACgkQPe3zGtjz
 Rgk4QRAA357EtsOgDKAbodaePFsGhmfhVEVhizCkw+yj+l73jGWMRMDfbpNsnkii
 OTJaXhQgVPrCBGnmbgHn5GehEKJ/VZwW4fFbF42wvetFXWSYFDPL2yqC3YpaPjNp
 aSJvUM0rw4f+JE52tTPQvVzLQ/M9PZFsJ6sUmoozV5MoPLt4eKKUEHKigCXpXV/p
 AOwiejVVk035WbKNq4R8DoQsa05Yk0Tv5zKsFgmXEZjrnorC0dpqWQjT5HH6V9pt
 eHTA2cxKq9qAHBN1Zm/3HUOmxmJZ1GW3AKLxYM+k0ornnfnO7inlQwNJDsQItXXS
 ZNBELiYIIObVoy6COB03NWMSCcS/TrpfSKJ9s+JOdJt/T+AOVCwQkqpIff6aJTaH
 k4ppVjChmaY3+taIkLQ5nC1zecCZr7hY+xL0ZkUGhlznKn9x2a1zOBZ6tUuabul6
 57JztkeXPyTNZ/t1WhYQQpGQ4MCLXnu81gMRzVKfJcmtMOSrBOO1p8eNjb/LgM4M
 Qpu9VKS33OBrmMEBlhnBvhkFIXxHUU1CjZJkQ2MYm4YLLnaO4VmmOk3tcX1mpID4
 R6GrsXAOBlqjtAyqsTGQJ/+Z5BRp1nhOk/E0APiD+sbgJPiRSJ2HNZpDEbOroYOy
 4IiaQwZbyGSjieBWWe0fMMCBbazk2S9ws57eAbVh5/r6L+DSWTc=
 =1Rzv
 -----END PGP SIGNATURE-----

Merge tag 'nvme-6.11-2024-07-26' of git://git.infradead.org/nvme into block-6.11

Pull NVMe fixes from Keith:

"nvme fixes for Linux 6.11

 - Fix request without payloads cleanup  (Leon)
 - Use new protection information format (Francis)
 - Improved debug message for lost pci link (Bart)
 - Another apst quirk (Wang)
 - Use appropriate sysfs api for printing chars (Markus)"

* tag 'nvme-6.11-2024-07-26' of git://git.infradead.org/nvme:
  nvme-pci: add missing condition check for existence of mapped data
  nvme-core: choose PIF from QPIF if QPIFS supports and PIF is QTYPE
  nvme-pci: Fix the instructions for disabling power management
  nvme: remove redundant bdev local variable
  nvme-fabrics: Use seq_putc() in __nvmf_concat_opt_tokens()
  nvme/pci: Add APST quirk for Lenovo N60z laptop
2024-07-26 08:06:15 -06:00
Leon Romanovsky
c31fad1470 nvme-pci: add missing condition check for existence of mapped data
nvme_map_data() is called only when the request has physical segments,
hence nvme_unmap_data() should be guarded by the same condition to avoid
an invalid dereference.

Fixes: 4aedb70543 ("nvme-pci: split metadata handling from nvme_map_data / nvme_unmap_data")
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-07-25 07:20:54 -07:00
Linus Torvalds
0256994887 for-6.11/block-post-20240722
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmaeY00QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpjPGD/9CPo93+V/ztfzY1J18KhA2CCUh1uuxZIjx
 dLfi07Bo+gyLwB1vaSf0bNy9gM8SzGFSMszSIDTErNq9/F6RvWjXN0CchyQf1Wii
 o2UyQg8JLjT2o1pJSsdJySZQRsG/daWUHzHaX1kD343Cd6OBV2YaVFdYTaXUGg4v
 G1AVh7qFvQhAIg1jV8q2z7QC7PSeuTnvyvY65Z8/iVJe95FayOrtGmDPTaJab8r2
 7uEFiWZk23erzNygVdcSoNIrwWFmRARz5o3IvwJJfEL08hkdoAqu6vD2oCUZspKU
 3g4wU6JrN0QYQpVwIJ9WcwYcoOm6iMm9xwCVMsp8R3KRUU107HjaiEazFDGk4HW4
 ozZTa7leTXnrRqnjVhcQpUvC+1uVLCFN8sSElNY7m2dg0IojnlMz+t3lMiTtaR9N
 Rt6wy5alVQFlb2uhzALuUh6HM1zA98swWySNoP0arTkOT9kjXwwAgn0I+M1s9Uxo
 FaQvM0YnAsb2C8LSpNtZWLaTlRSLTzUsGThLSJMBZueIJ9+BF23i7W7euklCNxjj
 Jl6CykEkEkacOxU6b9PG6qSnUq9JJ+W7gcJVing+ugAFrZDutxy6eJZXVv8wuvCC
 EOxaADpSs2xAaH9V0BMmwO51w0NDWySyGPHB5UBkhNjqOji/oG3FvAITiboQArgS
 FES4jtU1TA==
 =dn4l
 -----END PGP SIGNATURE-----

Merge tag 'for-6.11/block-post-20240722' of git://git.kernel.dk/linux

Pull block integrity mapping updates from Jens Axboe:
 "A set of cleanups and fixes for the block integrity support.

  Sent separately from the main block changes from last week, as they
  depended on later fixes in the 6.10-rc cycle"

* tag 'for-6.11/block-post-20240722' of git://git.kernel.dk/linux:
  block: don't free the integrity payload in bio_integrity_unmap_free_user
  block: don't free submitter owned integrity payload on I/O completion
  block: call bio_integrity_unmap_free_user from blk_rq_unmap_user
  block: don't call bio_uninit from bio_endio
  block: also return bio_integrity_payload * from stubs
  block: split integrity support out of bio.h
2024-07-22 11:04:09 -07:00
Francis Pravin
415fb383ec nvme-core: choose PIF from QPIF if QPIFS supports and PIF is QTYPE
As per TP4141a:
"If the Qualified Protection Information Format Support(QPIFS) bit is
set to 1 and the Protection Information Format(PIF) field is set to 11b
(i.e., Qualified Type), then the pif is as defined in the Qualified
Protection Information Format (QPIF) field."
So, take PIF from QPIF if QPIFS is set and PIF is QTYPE.
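
The selection rule quoted from TP4141a can be sketched as a small
function. This is an illustrative model only: model_effective_pif() and
MODEL_PIF_QTYPE are hypothetical names, and the fields are passed in as
already-extracted values rather than decoded from the identify data.

```c
#include <stdbool.h>
#include <stdint.h>

/* PIF value 11b marks the "Qualified Type" in the text quoted above. */
#define MODEL_PIF_QTYPE 0x3

/* Return the protection information format that should be used. */
uint8_t model_effective_pif(bool qpifs, uint8_t pif, uint8_t qpif)
{
	if (qpifs && pif == MODEL_PIF_QTYPE)
		return qpif;	/* qualified type: the format lives in QPIF */
	return pif;		/* otherwise PIF is used as-is */
}
```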

Signed-off-by: Francis Pravin <francis.p@samsung.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-07-16 07:55:31 -07:00
Linus Torvalds
3e78198862 for-6.11/block-20240710
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmaOTd8QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgppqIEACUr8Vv2FtezvT3OfVSlYWHHLXzkRhwEG5s
 vdk0o7Ow6U54sMjfymbHTgLD0ZOJf3uJ6BI95FQuW41jPzDFVbx4Hy8QzqonMkw9
 1D/YQ4zrVL2mOKBzATbKpoGJzMOzGeoXEueFZ1AYPAX7RrDtP4xPQNfrcfkdE2zF
 LycJN70Vp6lrZZMuI9yb9ts1tf7TFzK0HJANxOAKTgSiPmBmxesjkJlhrdUrgkAU
 qDVyjj7u/ssndBJAb9i6Bl95Do8s9t4DeJq5/6wgKqtf5hClMXzPVB8Wy084gr6E
 rTRsCEhOug3qEZSqfAgAxnd3XFRNc/p2KMUe5YZ4mAqux4hpSmIQQDM/5X5K9vEv
 f4MNqUGlqyqntZx+KPyFpf7kLHFYS1qK4ub0FojWJEY4GrbBPNjjncLJ9+ozR0c8
 kNDaFjMNAjalBee1FxNNH8LdVcd28rrCkPxRLEfO/gvBMUmvJf4ZyKmSED0v5DhY
 vZqKlBqG+wg0EXvdiWEHMDh9Y+q/2XBIkS6NN/Bhh61HNu+XzC838ts1X7lR+4o2
 AM5Vapw+v0q6kFBMRP3IcJI/c0UcIU8EQU7axMyzWtvhog8kx8x01hIj1L4UyYYr
 rUdWrkugBVXJbywFuH/QIJxWxS/z4JdSw5VjASJLIrXy+aANmmG9Wonv95eyhpUv
 5iv+EdRSNA==
 =wVi8
 -----END PGP SIGNATURE-----

Merge tag 'for-6.11/block-20240710' of git://git.kernel.dk/linux

Pull block updates from Jens Axboe:

 - NVMe updates via Keith:
     - Device initialization memory leak fixes (Keith)
     - More constants defined (Weiwen)
     - Target debugfs support (Hannes)
     - PCIe subsystem reset enhancements (Keith)
     - Queue-depth multipath policy (Redhat and PureStorage)
     - Implement get_unique_id (Christoph)
     - Authentication error fixes (Gaosheng)

 - MD updates via Song
     - sync_action fix and refactoring (Yu Kuai)
     - Various small fixes (Christoph Hellwig, Li Nan, and Ofir Gal, Yu
       Kuai, Benjamin Marzinski, Christophe JAILLET, Yang Li)

 - Fix loop detach/open race (Gulam)

 - Fix lower control limit for blk-throttle (Yu)

 - Add module descriptions to various drivers (Jeff)

 - Add support for atomic writes for block devices, and statx reporting
   for same. Includes SCSI and NVMe (John, Prasad, Alan)

 - Add IO priority information to block trace points (Dongliang)

 - Various zone improvements and tweaks (Damien)

 - mq-deadline tag reservation improvements (Bart)

 - Ignore direct reclaim swap writes in writeback throttling (Baokun)

 - Block integrity improvements and fixes (Anuj)

 - Add basic support for rust based block drivers. Has a dummy null_blk
   variant for now (Andreas)

 - Series converting driver settings to queue limits, and cleanups and
   fixes related to that (Christoph)

 - Cleanup for poking too deeply into the bvec internals, in preparation
   for DMA mapping API changes (Christoph)

 - Various minor tweaks and fixes (Jiapeng, John, Kanchan, Mikulas,
   Ming, Zhu, Damien, Christophe, Chaitanya)

* tag 'for-6.11/block-20240710' of git://git.kernel.dk/linux: (206 commits)
  floppy: add missing MODULE_DESCRIPTION() macro
  loop: add missing MODULE_DESCRIPTION() macro
  ublk_drv: add missing MODULE_DESCRIPTION() macro
  xen/blkback: add missing MODULE_DESCRIPTION() macro
  block/rnbd: Constify struct kobj_type
  block: take offset into account in blk_bvec_map_sg again
  block: fix get_max_segment_size() warning
  loop: Don't bother validating blocksize
  virtio_blk: Don't bother validating blocksize
  null_blk: Don't bother validating blocksize
  block: Validate logical block size in blk_validate_limits()
  virtio_blk: Fix default logical block size fallback
  nvmet-auth: fix nvmet_auth hash error handling
  nvme: implement ->get_unique_id
  block: pass a phys_addr_t to get_max_segment_size
  block: add a bvec_phys helper
  blk-lib: check for kill signal in ioctl BLKZEROOUT
  block: limit the Write Zeroes to manually writing zeroes fallback
  block: refacto blkdev_issue_zeroout
  block: move read-only and supported checks into (__)blkdev_issue_zeroout
  ...
2024-07-15 14:20:22 -07:00
Bart Van Assche
92fc2c469e nvme-pci: Fix the instructions for disabling power management
pcie_aspm=off tells the kernel not to modify the ASPM configuration. This
setting does not guarantee that ASPM (Active State Power Management) is
disabled. Hence add pcie_port_pm=off. This disables power management for
all PCIe ports.

This patch has been tested on a workstation with a Samsung SSD 970 EVO Plus
NVMe SSD.

Fixes: 4641a8e6e1 ("nvme-pci: add trouble shooting steps for timeouts")
Cc: Keith Busch <kbusch@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-07-15 13:46:00 -07:00
Israel Rukshin
88c918d1ee nvme: remove redundant bdev local variable
Use disk directly instead of getting it from bdev->bd_disk.

Signed-off-by: Israel Rukshin <israelr@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-07-15 13:44:59 -07:00
Markus Elfring
1a7812b25e nvme-fabrics: Use seq_putc() in __nvmf_concat_opt_tokens()
A single character should be put into a sequence with the dedicated
function, so use seq_putc() for it.

This change was generated using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-07-15 13:43:41 -07:00
WangYuli
ab091ec536 nvme/pci: Add APST quirk for Lenovo N60z laptop
There is a hardware power-saving problem with the Lenovo N60z
board. When it is turned on and left for 10 hours, there is a
20% chance that an NVMe disk will not wake up until reboot.

Link: https://lore.kernel.org/all/2B5581C46AC6E335+9c7a81f1-05fb-4fd0-9fbb-108757c21628@uniontech.com
Signed-off-by: hmy <huanglin@uniontech.com>
Signed-off-by: Wentao Guan <guanwentao@uniontech.com>
Signed-off-by: WangYuli <wangyuli@uniontech.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-07-15 13:43:39 -07:00
Jens Axboe
6b43537fae nvme updates for Linux 6.11
  - Device initialization memory leak fixes (Keith)
  - More constants defined (Weiwen)
  - Target debugfs support (Hannes)
  - PCIe subsystem reset enhancements (Keith)
  - Queue-depth multipath policy (Redhat and PureStorage)
  - Implement get_unique_id (Christoph)
  - Authentication error fixes (Gaosheng)
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE3Fbyvv+648XNRdHTPe3zGtjzRgkFAmaMS6oACgkQPe3zGtjz
 Rgljeg//f47yJNR61V+zePH2prcTUdTA21z2xnEbd1IT40qkpbBIf2mS5R6loBzT
 fm8cGGgd3fdZ3qUnvaMExL//A4aaQCgoKmbtJHLv616KeCu8Iwy8GC0myG2gl05h
 +a96zy8b30FZjRE6H08EBt0JyRZUqjxVcFtvIq1ZlRu2xLWG7f2hg4S7sSKDPKw/
 SZWTeFcY1GRiUupb0tcgaJeqiQ0U7BPuyUuHGnuklMY60ePO5qgtw3D6dmRwLgj+
 pbcRzrp2OWbKCUH1JrW34Ku1gzDbOYFYLL04akAq4rIp4JzsXEgvs7rlyRn0PtDP
 V6gGAvIvb7ktMhD4GvXFQlqT29AoOCYmWr5w2xS6uXAoLZ+t3gK9VBKa+7oRBOeo
 hiZNJy/EdIIuF54nWaFpZ4FBKunNocp7w9BEYkuu/HBm5GW4m9mwLxL66BLLYgPj
 kuC2t4Nc/waO+SrZFrHErtdb+QZNW8IUWIRG3jXLjo6yipGrv+K6lZpOY3HCEXev
 7F0AAOVNFWW+nWv+mEVQkd5lCFrHDjbVX2rRC3z4saKJvDOh69pBaSCKyikZR0bO
 95wz3B//sF2STBa4b/570KMPHJTJfTkKRtaaZvkHPT/0IQi0kmuWM/yKy57Q7NPE
 Ehkk3hfWjLUHU7jNuI5wxky8un7GMZJKArFg3Q1rkQCQ8OxQUXM=
 =0en2
 -----END PGP SIGNATURE-----

Merge tag 'nvme-6.11-2024-07-08' of git://git.infradead.org/nvme into for-6.11/block

Pull NVMe updates from Keith:

"nvme updates for Linux 6.11

 - Device initialization memory leak fixes (Keith)
 - More constants defined (Weiwen)
 - Target debugfs support (Hannes)
 - PCIe subsystem reset enhancements (Keith)
 - Queue-depth multipath policy (Redhat and PureStorage)
 - Implement get_unique_id (Christoph)
 - Authentication error fixes (Gaosheng)"

* tag 'nvme-6.11-2024-07-08' of git://git.infradead.org/nvme: (21 commits)
  nvmet-auth: fix nvmet_auth hash error handling
  nvme: implement ->get_unique_id
  nvme-multipath: implement "queue-depth" iopolicy
  nvme-multipath: prepare for "queue-depth" iopolicy
  nvme-pci: do not directly handle subsys reset fallout
  lpfc_nvmet: implement 'host_traddr'
  nvme-fcloop: implement 'host_traddr'
  nvmet-fc: implement host_traddr()
  nvmet-rdma: implement host_traddr()
  nvmet-tcp: implement host_traddr()
  nvmet: add 'host_traddr' callback for debugfs
  nvmet: add debugfs support
  mailmap: add entry for Weiwen Hu
  nvme: rename CDR/MORE/DNR to NVME_STATUS_*
  nvme: fix status magic numbers
  nvme: rename nvme_sc_to_pr_err to nvme_status_to_pr_err
  nvme: split device add from initialization
  nvme: fc: split controller bringup handling
  nvme: rdma: split controller bringup handling
  nvme: tcp: split controller bringup handling
  ...
2024-07-08 23:57:02 -06:00
Gaosheng Cui
89f58f96d1 nvmet-auth: fix nvmet_auth hash error handling
If nvme_auth_augmented_challenge() fails, or the kmalloc() for shash
fails, the memory allocated for challenge must be freed. Add the error
path out_free_challenge to fix the memory leak.

Fixes: 7a277c37d3 ("nvmet-auth: Diffie-Hellman key exchange support")
Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-07-08 10:28:16 -07:00
Christoph Hellwig
18f03a063d nvme: implement ->get_unique_id
Implement the get_unique_id method to allow pNFS SCSI layout access to
NVMe namespaces.

This is the server side implementation of RFC 9561 "Using the Parallel
NFS (pNFS) SCSI Layout to Access Non-Volatile Memory Express (NVMe)
Storage Devices".

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-07-08 10:25:39 -07:00
Damien Le Moal
f2a7bea237 block: Remove REQ_OP_ZONE_RESET_ALL emulation
Now that device mapper can handle resetting all zones of a mapped zoned
device using REQ_OP_ZONE_RESET_ALL, all zoned block device drivers
support this operation. With this, the request queue feature
BLK_FEAT_ZONE_RESETALL is not necessary and the emulation code in
blk-zone.c can be removed.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240704052816.623865-5-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-05 00:42:04 -06:00
Christoph Hellwig
f8924374fd block: call bio_integrity_unmap_free_user from blk_rq_unmap_user
blk_rq_unmap_user always unmaps a user space pass-through request.  If
such a request has integrity data attached, it must come from a user mapping
as well.  Call bio_integrity_unmap_free_user from blk_rq_unmap_user
and remove the nvme_unmap_bio wrapper in the nvme driver.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240702151047.1746127-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-03 10:21:16 -06:00
Christoph Hellwig
da042a3655 block: split integrity support out of bio.h
Split struct bio_integrity_payload and the related prototypes out of
bio.h into a separate bio-integrity.h header so that it is only pulled
in by the few places that need it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240702151047.1746127-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-03 10:21:15 -06:00
Jens Axboe
1a50d14670 Linux 6.10-rc6
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmaB0NweHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGkvwH/36UJRk/o6wvXnyH
 E6QjCSWo2226APyWks22NjtC3I/8Iqdvkneuh6wG0qL2sXAB078EMjUq5R81bF8H
 wWFBJwetjYTp8GEyLioMEb2wCH/J3R29dLFC4UYTplafXRGP6//xcpJaKmTxcgdR
 31IzvTPXbApZ7L3k1U6rA2bK9PNKcFCOvZlrNMUCuwMrabymHsDfOUt1DqXyg2xp
 zjqiWYBwlklozmgawSWt/mdEgkWuTcAbg+KyqDVQF59s9aj/OOwZ0j+HACq5V8CM
 quTPIAYL6CC9p7uxa69lGr/sgC0Is/BZLPX7RTZAwCgarGvnX+1HUsjDcaFCtrVg
 O6fPUV8=
 =pgUx
 -----END PGP SIGNATURE-----

Merge tag 'v6.10-rc6' into for-6.11/block-post

Pull in v6.10-rc6 to resolve a conflict for the integrity cleanups.

* tag 'v6.10-rc6': (778 commits)
  Linux 6.10-rc6
  ata: ahci: Clean up sysfs file on error
  ata: libata-core: Fix double free on error
  ata,scsi: libata-core: Do not leak memory for ata_port struct members
  ata: libata-core: Fix null pointer dereference on error
  x86-32: fix cmpxchg8b_emu build error with clang
  x86: stop playing stack games in profile_pc()
  i2c: testunit: discard write requests while old command is running
  i2c: testunit: don't erase registers after STOP
  tty: mxser: Remove __counted_by from mxser_board.ports[]
  randomize_kstack: Remove non-functional per-arch entropy filtering
  string: kunit: add missing MODULE_DESCRIPTION() macros
  ata: libata-core: Add ATA_HORKAGE_NOLPM for all Crucial BX SSD1 models
  MAINTAINERS: Update IOMMU tree location
  tools/power turbostat: Add local build_bug.h header for snapshot target
  tools/power turbostat: Fix unc freq columns not showing with '-q' or '-l'
  tools/power turbostat: option '-n' is ambiguous
  drm/drm_file: Fix pid refcounting race
  kallsyms: rework symbol lookup return codes
  gpiolib: cdev: Ignore reconfiguration without direction
  ...

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-03 10:20:05 -06:00
Thomas Song
f227345f0a nvme-multipath: implement "queue-depth" iopolicy
The round-robin path selector is inefficient in cases where there is a
difference in latency between paths.  In the presence of one or more
high latency paths the round-robin selector continues to use the high
latency path equally. This results in a bias towards the highest latency
path and can cause a significant decrease in overall performance as IOs
pile on the highest latency path. This problem is acute with NVMe-oF
controllers.

The queue-depth path selector sends I/O down the path with the lowest
number of requests in its request queue. Paths with lower latency will
clear requests more quickly and have fewer requests queued compared to
higher latency paths. The goal of this path selector is to make more use
of lower latency paths which will bring down overall IO latency and
increase throughput and performance.
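
The selection step described above can be sketched as follows. This is a
simplified user-space model under stated assumptions, not the driver
code: struct model_path and model_pick_path() are hypothetical names,
and the depth is a plain counter rather than whatever accounting the
driver actually keeps.

```c
#include <stddef.h>

/* Hypothetical per-path state: number of requests currently outstanding. */
struct model_path {
	unsigned int depth;
};

/*
 * Return the index of the path with the fewest outstanding requests,
 * or -1 if there are no paths. Ties go to the lowest index.
 */
int model_pick_path(const struct model_path *paths, size_t npaths)
{
	int best = -1;

	for (size_t i = 0; i < npaths; i++)
		if (best < 0 || paths[i].depth < paths[best].depth)
			best = (int)i;
	return best;
}
```

Because a slow path drains its queue more slowly, its depth stays high
and it is picked less often, which is the effect the commit aims for.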

Signed-off-by: Thomas Song <tsong@purestorage.com>
[emilne: commandeered patch developed by Thomas Song @ Pure Storage]
Co-developed-by: Ewan D. Milne <emilne@redhat.com>
Signed-off-by: Ewan D. Milne <emilne@redhat.com>
Co-developed-by: John Meneghini <jmeneghi@redhat.com>
Signed-off-by: John Meneghini <jmeneghi@redhat.com>
Link: https://lore.kernel.org/linux-nvme/20240509202929.831680-1-jmeneghi@redhat.com/
Tested-by: Marco Patalano <mpatalan@redhat.com>
Tested-by: Jyoti Rani <jrani@purestorage.com>
Tested-by: John Meneghini <jmeneghi@redhat.com>
Reviewed-by: Randy Jennings <randyj@purestorage.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-07-02 06:47:19 -07:00
Christoph Hellwig
f3bf25d513 nvme: don't set io_opt if NOWS is zero
NOWS is one of the annoying "0's based values" in NVMe, where 0 means one
and we thus can't detect if it isn't set.  Thus a NOWS value of 0 means
that the Namespace Optimal Write Size is a single LBA, which is clearly
bogus.  Ignore the value in that case and don't propagate an io_opt
value to the block layer.
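
A minimal sketch of the 0's based conversion and the new special case.
The function name model_io_opt() and its parameters are assumptions made
for illustration; the real code derives these values from the identify
data and the queue limits.

```c
#include <stdint.h>

/*
 * Return the optimal I/O size in bytes, or 0 for "no hint".
 * nows is the raw 0's based field, lba_size is the LBA size in bytes.
 */
uint32_t model_io_opt(uint16_t nows, uint32_t lba_size)
{
	if (nows == 0)
		return 0;	/* would mean a single LBA; treat as unset */
	return ((uint32_t)nows + 1) * lba_size;
}
```

With 512-byte LBAs, a raw value of 7 means eight LBAs, so the hint is
4096 bytes, while a raw value of 0 yields no hint at all.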

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20240701051800.1245240-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-01 06:52:42 -06:00
Nathan Chancellor
440e2051c5 nvmet-fc: Remove __counted_by from nvmet_fc_tgt_queue.fod[]
Work for __counted_by on generic pointers in structures (not just
flexible array members) has started landing in Clang 19 (current tip of
tree). During the development of this feature, a restriction was added
to __counted_by to prevent the flexible array member's element type from
including a flexible array member itself such as:

  struct foo {
    int count;
    char buf[];
  };

  struct bar {
    int count;
    struct foo data[] __counted_by(count);
  };

because the size of data cannot be calculated with the standard array
size formula:

  sizeof(struct foo) * count

This restriction was downgraded to a warning but due to CONFIG_WERROR,
it can still break the build. The application of __counted_by on the fod
member of 'struct nvmet_fc_tgt_queue' triggers this restriction,
resulting in:

  drivers/nvme/target/fc.c:151:2: error: 'counted_by' should not be applied to an array with element of unknown size because 'struct nvmet_fc_fcp_iod' is a struct type with a flexible array member. This will be an error in a future compiler version [-Werror,-Wbounds-safety-counted-by-elt-type-unknown-size]
    151 |         struct nvmet_fc_fcp_iod         fod[] __counted_by(sqsize);
        |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  1 error generated.

Remove this use of __counted_by to fix the warning/error. However,
rather than remove it altogether, leave it commented, as it may be
possible to support this in future compiler releases.

Cc: stable@vger.kernel.org
Closes: https://github.com/ClangBuiltLinux/linux/issues/2027
Fixes: ccd3129aca ("nvmet-fc: Annotate struct nvmet_fc_tgt_queue with __counted_by")
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-06-26 10:13:04 -07:00
John Meneghini
3d7c2fd2ea nvme-multipath: prepare for "queue-depth" iopolicy
This patch prepares for the introduction of a new iopolicy by breaking up
the nvme_find_path() code path into sub-routines.

Signed-off-by: John Meneghini <jmeneghi@redhat.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-06-26 09:45:57 -07:00
Mikulas Patocka
cf546dd289 block: change rq_integrity_vec to respect the iterator
If we allocate a bio that is larger than NVMe maximum request size,
attach integrity metadata to it and send it to the NVMe subsystem, the
integrity metadata will be corrupted.

Splitting the bio works correctly. The function bio_split will clone the
bio, trim the iterator of the first bio and advance the iterator of the
second bio.

However, the function rq_integrity_vec has a bug - it returns the first
vector of the bio's metadata and completely disregards the metadata
iterator that was advanced when the bio was split. Thus, the second bio
uses the same metadata as the first bio and this leads to metadata
corruption.

This commit changes rq_integrity_vec, so that it calls mp_bvec_iter_bvec
instead of returning the first vector. mp_bvec_iter_bvec reads the
iterator and uses it to build a bvec for the current position in the
iterator.
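
The bug and the fix can be illustrated with a reduced model of a segment
list and an iterator. All names here (struct model_vec, struct
model_iter, model_first_vec(), model_iter_vec()) are hypothetical; the
kernel's bvec and iterator types carry more state than this sketch.

```c
#include <stddef.h>

struct model_vec {
	size_t offset;	/* start of the segment within its page */
	size_t len;	/* segment length in bytes */
};

struct model_iter {
	size_t idx;	/* index of the current segment */
	size_t done;	/* bytes already consumed in that segment */
};

/* Old behaviour: always the first segment, the iterator is ignored. */
struct model_vec model_first_vec(const struct model_vec *vecs)
{
	return vecs[0];
}

/* New behaviour: the remaining part of the segment the iterator points at. */
struct model_vec model_iter_vec(const struct model_vec *vecs,
				struct model_iter it)
{
	struct model_vec v = vecs[it.idx];

	v.offset += it.done;
	v.len -= it.done;
	return v;
}
```

After a split advances the iterator, model_iter_vec() yields the
metadata for the second half, whereas model_first_vec() would hand both
halves the same vector.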

The "queue_max_integrity_segments(rq->q) > 1" check was removed, because
the updated rq_integrity_vec function works correctly with multiple
segments.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/49d1afaa-f934-6ed2-a678-e0d428c63a65@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-26 09:14:33 -06:00
Keith Busch
210b1f6576 nvme-pci: do not directly handle subsys reset fallout
Scheduling reset_work after an nvme subsystem reset is expected to fail
on pcie, but this also prevents any handling that the platform's pcie
services may provide, which might successfully recover the link without
re-enumeration. Examples of such services include AER, DPC, and power's EEH.

Provide a pci specific operation that safely initiates a subsystem
reset, and instead of scheduling reset work, read back the status
register to trigger a pcie read error.

Since this only affects pci, the other fabrics drivers subscribe to a
generic nvmf subsystem reset that is exactly the same as before. The
loop fabric doesn't use it because nvmet doesn't support setting that
property anyway.

And since we're using the magic NSSR value in two places now, provide a
symbolic define for it.

Reported-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-06-26 07:02:22 -07:00
Hannes Reinecke
bbb443e99c nvme-fcloop: implement 'host_traddr'
Implement the 'host_traddr' callback to display the host transport
address for nvmet debugfs.

Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-06-24 12:53:43 -07:00
Hannes Reinecke
99032e9dba nvmet-fc: implement host_traddr()
Implement callback to display the host transport address by
adding a callback 'host_traddr' for nvmet_fc_target_template.

Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-06-24 12:53:43 -07:00
Hannes Reinecke
c7ea20c3af nvmet-rdma: implement host_traddr()
Implement callback to display the host transport address.

Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-06-24 12:53:43 -07:00
Hannes Reinecke
b4bbe00d21 nvmet-tcp: implement host_traddr()
Implement callback to display the host transport address.

Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-06-24 12:53:43 -07:00
Hannes Reinecke
7e5c3de3f2 nvmet: add 'host_traddr' callback for debugfs
We want to display the transport address of the connected host
in debugfs, but this is a property of the transport.
So add a callback 'host_traddr' to allow the transport drivers
to fill in the data.

Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-06-24 12:53:43 -07:00
Hannes Reinecke
649fd41420 nvmet: add debugfs support
Add a debugfs hierarchy to display the configured subsystems
and the controllers attached to the subsystems.

Suggested-by: Redouane BOUFENGHOUR <redouane.boufenghour@shadow.tech>
Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-06-24 12:53:42 -07:00
Weiwen Hu
dd0b0a4a2c nvme: rename CDR/MORE/DNR to NVME_STATUS_*
The CDR/MORE/DNR fields do not belong to SC in the NVMe spec, so rename
them to NVME_STATUS_* to avoid confusion.

Signed-off-by: Weiwen Hu <huweiwen@linux.alibaba.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-06-24 12:53:42 -07:00
Weiwen Hu
d89a5c6705 nvme: fix status magic numbers
Replace some magic numbers for SC and SCT with enums and macros.

Signed-off-by: Weiwen Hu <huweiwen@linux.alibaba.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-06-24 12:53:42 -07:00
Weiwen Hu
22f19a584d nvme: rename nvme_sc_to_pr_err to nvme_status_to_pr_err
This better matches its semantics.  "sc" is used in the NVMe spec to
specifically refer to the last 8 bits in the status field. We should not
reuse "sc" here.

Signed-off-by: Weiwen Hu <huweiwen@linux.alibaba.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-06-24 12:53:42 -07:00
Keith Busch
1a9e218195 nvme: split device add from initialization
Combining both creates an ambiguous cleanup scenario for the caller if
an error is returned: does the device reference need to be dropped or
did the error occur before the device was initialized? If an error
occurs after the device is added, then the existing cleanup routines
will leak memory.

Furthermore, the nvme core is taking it upon itself to free the device's
kobj name under certain conditions rather than go through the core
device API. We shouldn't be peeking into these implementation details.

Split the device initialization from the addition to make it easier to
know the error handling actions, fix the existing memory leaks, and stop
the device layering violations.

Link: https://lore.kernel.org/linux-nvme/c4050a37-ecc9-462c-9772-65e25166f439@grimberg.me/
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-06-24 12:53:42 -07:00
Keith Busch
72cded7573 nvme: fc: split controller bringup handling
Drivers must call nvme_uninit_ctrl after a successful nvme_init_ctrl.
Split the allocation side out to make the error handling boundary easier
to navigate. The nvme fc driver's error handling had different returns
in the error goto labels, which harms readability.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-06-24 12:53:42 -07:00
Keith Busch
ea47c471a2 nvme: rdma: split controller bringup handling
Drivers must call nvme_uninit_ctrl after a successful nvme_init_ctrl.
Split the allocation side out to make the error handling boundary easier
to navigate. The nvme rdma driver's error handling had different returns
in the error goto labels, which harms readability.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-06-24 12:53:42 -07:00
Keith Busch
10fd7fb676 nvme: tcp: split controller bringup handling
Drivers must call nvme_uninit_ctrl after a successful nvme_init_ctrl.
Split the allocation side out to make the error handling boundary easier
to navigate. The nvme tcp driver's error handling had different returns
in the error goto labels, which harms readability.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-06-24 12:53:42 -07:00
Keith Busch
b9ecbfa455 nvme: apple: fix device reference counting
Drivers must call nvme_uninit_ctrl after a successful nvme_init_ctrl.
Split the allocation side out to make the error handling boundary easier
to navigate. The apple driver had been doing this wrong, leaking the
controller device memory on a tagset failure.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-06-24 12:53:41 -07:00
Hannes Reinecke
0f1f580392 nvmet: make 'tsas' attribute idempotent for RDMA
The RDMA transport defines values for TSAS, but it cannot be changed as
we only support the 'connected' mode.
So to avoid errors during reconfiguration we should allow writing the
current value.

Fixes: 3f123494db ("nvmet: make TCP sectype settable via configfs")
Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-06-21 08:49:10 -07:00
Alan Adamson
5f9bbea02f nvme: Atomic write support
Add support to set block layer request_queue atomic write limits. The
limits will be derived from either the namespace or controller atomic
parameters.

NVMe atomic-related parameters are grouped into "normal" and "power-fail"
(or PF) class of parameter. For atomic write support, only PF parameters
are of interest. The "normal" parameters are concerned with racing reads
and writes (which also applies to PF). See NVM Command Set Specification
Revision 1.0d section 2.1.4 for reference.

Whether to use per namespace or controller atomic parameters is decided by
NSFEAT bit 1 - see Figure 97: Identify – Identify Namespace Data
Structure, NVM Command Set.

NVMe namespaces may define an atomic boundary, whereby no atomic guarantees
are provided for a write which straddles this per-lba space boundary. The
block layer merging policy is such that no merges may occur in which the
resultant request would straddle such a boundary.

Unlike SCSI, NVMe specifies no granularity or alignment rules, apart from
atomic boundary rule. In addition, again unlike SCSI, there is no
dedicated atomic write command - a write which adheres to the atomic size
limit and boundary is implicitly atomic.

If NSFEAT bit 1 is set, the following parameters are of interest:
- NAWUPF (Namespace Atomic Write Unit Power Fail)
- NABSPF (Namespace Atomic Boundary Size Power Fail)
- NABO (Namespace Atomic Boundary Offset)

and we set request_queue limits as follows:
- atomic_write_unit_max = rounddown_pow_of_two(NAWUPF)
- atomic_write_max_bytes = NAWUPF
- atomic_write_boundary = NABSPF

In the unlikely scenario that NABO is non-zero, atomic writes will
not be supported at all as dealing with this adds extra complexity. This
policy may change in future.

In all cases, atomic_write_unit_min is set to the logical block size.

If NSFEAT bit 1 is unset, the following parameter is of interest:
- AWUPF (Atomic Write Unit Power Fail)

and we set request_queue limits as follows:
- atomic_write_unit_max = rounddown_pow_of_two(AWUPF)
- atomic_write_max_bytes = AWUPF
- atomic_write_boundary = 0
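
The mapping above can be sketched in plain C. This is an illustration
only, not the driver code: struct atomic_limits,
atomic_limits_from_nvme() and rounddown_pow2() are hypothetical names,
and the sketch assumes that all inputs have already been converted to
bytes and that the NABO policy is applied only when NSFEAT bit 1 is set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the four limits named in this message. */
struct atomic_limits {
	uint32_t unit_min;  /* atomic_write_unit_min */
	uint32_t unit_max;  /* atomic_write_unit_max */
	uint32_t max_bytes; /* atomic_write_max_bytes */
	uint32_t boundary;  /* atomic_write_boundary */
};

/* Largest power of two that is <= v; v must be non-zero. */
static uint32_t rounddown_pow2(uint32_t v)
{
	uint32_t p = 1;

	assert(v != 0);
	while (p <= v / 2)
		p *= 2;
	return p;
}

/*
 * All sizes are in bytes. ns_params corresponds to NSFEAT bit 1 being set.
 * Returns false (no atomic write support) when NABO is non-zero or when
 * the selected atomic write unit is zero.
 */
static bool atomic_limits_from_nvme(bool ns_params, uint32_t lba_size,
				    uint32_t nawupf, uint32_t nabspf,
				    uint32_t nabo, uint32_t awupf,
				    struct atomic_limits *lim)
{
	uint32_t unit = ns_params ? nawupf : awupf;

	if (ns_params && nabo != 0)
		return false;
	if (unit == 0)
		return false;

	lim->unit_min = lba_size;
	lim->unit_max = rounddown_pow2(unit);
	lim->max_bytes = unit;
	lim->boundary = ns_params ? nabspf : 0;
	return true;
}
```

For example, a namespace reporting a 12 KiB NAWUPF with 4 KiB logical
blocks would get an 8 KiB atomic_write_unit_max in this sketch, while
atomic_write_max_bytes stays at 12 KiB.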

A new function, nvme_valid_atomic_write(), is also called from the
submission path to verify that a request which has been submitted to the
driver will actually be executed atomically. As mentioned, there is no
dedicated NVMe atomic write command (which may error for a command which
exceeds the controller atomic write limits).
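
A check of this kind might look roughly like the sketch below. It is not
the body of nvme_valid_atomic_write(): write_is_atomic() and its
parameters are hypothetical, everything is in bytes, and a boundary of 0
is taken to mean "no boundary".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if a write of len bytes at byte offset `offset` respects the
 * atomic size limit and does not straddle an atomic boundary.
 */
static bool write_is_atomic(uint64_t offset, uint32_t len,
			    uint32_t max_bytes, uint32_t boundary)
{
	/* Precondition of the sketch: offset + len must not overflow. */
	assert(offset <= UINT64_MAX - len);

	if (len == 0 || len > max_bytes)
		return false;
	if (boundary == 0)
		return true;

	/* First and last byte must fall into the same boundary window. */
	return offset / boundary == (offset + len - 1) / boundary;
}
```

With a 64 KiB boundary and an 8 KiB limit, an 8 KiB write starting at
60 KiB is rejected because it crosses the 64 KiB mark, while a 4 KiB
write at the same offset passes.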

Note on NABSPF:
There seems to be some vagueness in the spec as to whether NABSPF applies
for NSFEAT bit 1 being unset. Figure 97 does not explicitly mention NABSPF
and how it is affected by bit 1. However Figure 4 does say to check Figure
97 for info about per-namespace parameters, which NABSPF is, so it is
implied. However currently nvme_update_disk_info() does check namespace
parameter NABO regardless of this bit.

Signed-off-by: Alan Adamson <alan.adamson@oracle.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
jpg: total rewrite
Signed-off-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20240620125359.2684798-11-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-20 15:19:17 -06:00
Jeff Johnson
5a5696a11f nvme-apple: add missing MODULE_DESCRIPTION()
make allmodconfig && make W=1 C=1 reports:
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/nvme/host/nvme-apple.o

Add the missing invocation of the MODULE_DESCRIPTION() macro.

Reviewed-by: Eric Curtin <ecurtin@redhat.com>
Reviewed-by: Sven Peter <sven@svenpeter.dev>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-06-20 10:21:09 -07:00
Christoph Hellwig
8c8f5c85b2 block: move the skip_tagset_quiesce flag to queue_limits
Move the skip_tagset_quiesce flag into the queue_limits feature field so
that it can be set atomically with the queue frozen.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20240617060532.127975-26-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19 07:58:28 -06:00
Christoph Hellwig
9c1e42e3c8 block: move the pci_p2pdma flag to queue_limits
Move the pci_p2pdma flag into the queue_limits feature field so that it
can be set atomically with the queue frozen.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20240617060532.127975-25-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19 07:58:28 -06:00
Christoph Hellwig
a52758a397 block: move the zone_resetall flag to queue_limits
Move the zone_resetall flag into the queue_limits feature field so that
it can be set atomically with the queue frozen.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20240617060532.127975-24-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19 07:58:28 -06:00
Christoph Hellwig
b1fc937a55 block: move the zoned flag into the features field
Move the zoned flags into the features field to reclaim a little
bit of space.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20240617060532.127975-23-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19 07:58:28 -06:00
Christoph Hellwig
8023e144f9 block: move the poll flag to queue_limits
Move the poll flag into the queue_limits feature field so that it can
be set atomically with the queue frozen.

Stacking drivers are simplified in that they now can simply set the
flag, and blk_stack_limits will clear it when the feature is not
supported by any of the underlying devices.
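
A minimal sketch of that stacking rule, assuming the flag is cleared as
soon as one underlying device lacks it. FEAT_POLL, struct toy_limits and
toy_stack_limits() are hypothetical names, not the block layer API:

```c
#include <assert.h>
#include <stddef.h>

#define FEAT_POLL (1u << 0) /* hypothetical flag value */

struct toy_limits {
	unsigned int features;
};

/*
 * Fold one underlying device into the stacked limits: the stacking driver
 * sets FEAT_POLL up front, and it survives only if every underlying device
 * also has it.
 */
static void toy_stack_limits(struct toy_limits *top,
			     const struct toy_limits *bottom)
{
	assert(top != NULL && bottom != NULL);

	if (!(bottom->features & FEAT_POLL))
		top->features &= ~FEAT_POLL;
}
```

The stacking driver calls this once per underlying device and does not
need to inspect the devices itself.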

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20240617060532.127975-22-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19 07:58:28 -06:00
Christoph Hellwig
f76af42f8b block: move the nowait flag to queue_limits
Move the nowait flag into the queue_limits feature field so that it can
be set atomically with the queue frozen.

Stacking drivers are simplified in that they now can simply set the
flag, and blk_stack_limits will clear it when the feature is not
supported by any of the underlying devices.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20240617060532.127975-20-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19 07:58:28 -06:00
Christoph Hellwig
1a02f3a73f block: move the stable_writes flag to queue_limits
Move the stable_writes flag into the queue_limits feature field so that
it can be set atomically with the queue frozen.

The flag is now inherited by blk_stack_limits, which greatly simplifies
the code in dm, and fixes md which previously did not pass on the flag
set on lower devices.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20240617060532.127975-18-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19 07:58:28 -06:00
Christoph Hellwig
cdb2497918 block: move the io_stat flag setting to queue_limits
Move the io_stat flag into the queue_limits feature field so that it can
be set atomically with the queue frozen.

Simplify md and dm to set the flag unconditionally instead of avoiding
setting a simple flag for cases where it already is set by other means,
which is a bit pointless.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20240617060532.127975-17-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19 07:58:28 -06:00
Christoph Hellwig
bd4a633b6f block: move the nonrot flag to queue_limits
Move the nonrot flag into the queue_limits feature field so that it can
be set atomically with the queue frozen.

Use the chance to switch to defaulting to non-rotational and require
the driver to opt into rotational, which matches the polarity of the
sysfs interface.

For the z2ram, ps3vram, 2x memstick, ubiblock and dcssblk the new
rotational flag is not set as they clearly are not rotational despite
this being a behavior change.  There are some other drivers that
unconditionally set the rotational flag to keep the existing behavior
as they arguably can be used on rotational devices even if that is
probably not their main use today (e.g. virtio_blk and drbd).

The flag is automatically inherited in blk_stack_limits matching the
existing behavior in dm and md.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20240617060532.127975-15-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19 07:58:28 -06:00
Christoph Hellwig
1122c0c1cc block: move cache control settings out of queue->flags
Move the cache control settings into the queue_limits so that the flags
can be set atomically with the device queue frozen.

Add new features and flags field for the driver set flags, and internal
(usually sysfs-controlled) flags in the block layer.  Note that we'll
eventually remove enough fields from queue_limits to bring it back to the
previous size.

The disable flag is inverted compared to the previous meaning, which
means it now survives a rescan, similar to the max_sectors and
max_discard_sectors user limits.

The FLUSH and FUA flags are now inherited by blk_stack_limits, which
simplified the code in dm a lot, but also causes a slight behavior
change in that dm-switch and dm-unstripe now advertise a write cache
despite setting num_flush_bios to 0.  The I/O path will handle this
gracefully, but as far as I can tell the lack of num_flush_bios
and thus flush support is a pre-existing data integrity bug in those
targets that really needs fixing, after which a non-zero num_flush_bios
should be required in dm for targets that map to underlying devices.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Ulf Hansson <ulf.hansson@linaro.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20240617060532.127975-14-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19 07:58:28 -06:00
Hannes Reinecke
f31e85a4d7 nvmet: do not return 'reserved' for empty TSAS values
The 'TSAS' value is only defined for TCP and RDMA, but returning
'reserved' for undefined values tricked nvmetcli to try to write
'reserved' when restoring from a config file. This caused an error
and the configuration would not be applied.

Fixes: 3f123494db ("nvmet: make TCP sectype settable via configfs")
Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-06-17 11:29:22 -07:00
Boyang Yu
9570a48847 nvme: fix NVME_NS_DEAC may incorrectly identifying the disk as EXT_LBA.
The value of NVME_NS_DEAC is 3,
which means NVME_NS_METADATA_SUPPORTED | NVME_NS_EXT_LBAS. Provide a
unique value for this feature flag.
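
The overlap is easy to see in a stand-alone sketch. The enum below uses
hypothetical names and only assumes that the two existing flags occupy
bits 0 and 1, as the description above implies:

```c
#include <assert.h>
#include <stdbool.h>

enum toy_ns_features {
	FEAT_EXT_LBAS           = 1 << 0,
	FEAT_METADATA_SUPPORTED = 1 << 1,
	FEAT_DEAC_BROKEN        = 3,      /* overlaps both bits above */
	FEAT_DEAC_FIXED         = 1 << 2, /* a bit of its own */
};

/* Compile-time check that the fixed flag shares no bit with the others. */
static_assert((FEAT_DEAC_FIXED &
	       (FEAT_EXT_LBAS | FEAT_METADATA_SUPPORTED)) == 0,
	      "feature flags must not overlap");

static bool has_feature(unsigned int features, unsigned int flag)
{
	return (features & flag) != 0;
}
```

Setting only the broken DEAC value makes has_feature() report
FEAT_EXT_LBAS as present; with the fixed value it does not.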

Fixes: 1b96f862ec ("nvme: implement the DEAC bit for the Write Zeroes command")
Signed-off-by: Boyang Yu <yuboyang@dapustor.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-06-17 11:24:12 -07:00
Christoph Hellwig
c6e56cf6b2 block: move integrity information into queue_limits
Move the integrity information into the queue limits so that it can be
set atomically with other queue limits, and that the sysfs changes to
the read_verify and write_generate flags are properly synchronized.
This also allows providing a more useful helper to stack the integrity
fields, although it still is separate from the main stacking function
as not all stackable devices want to inherit the integrity settings.
Even with that it greatly simplifies the code in md and dm.

Note that the integrity field is moved as-is into the queue limits.
While there are good arguments for removing the separate blk_integrity
structure, this would cause a lot of churn and might better be done at a
later time if desired.  However the integrity field in the queue_limits
structure is now unconditional so that various ifdefs can be avoided or
replaced with IS_ENABLED().  Given the tiny size of it that seems like
a worthwhile trade off.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240613084839.1044015-13-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-14 10:20:07 -06:00
Christoph Hellwig
3c3e85ddff block: bypass the STABLE_WRITES flag for protection information
Currently registering a checksum-enabled (aka PI) integrity profile sets
the QUEUE_FLAG_STABLE_WRITE flag, and unregistering it clears the flag.
This can incorrectly clear the flag when the driver requires stable
writes even without PI, e.g. in case of iSCSI or NVMe/TCP with data
digest enabled.

Fix this by looking at the csum_type directly in bdev_stable_writes and
not setting the queue flag.  Also remove the blk_queue_stable_writes
helper as the only user in nvme wants to only look at the actual
QUEUE_FLAG_STABLE_WRITE flag as it inherits the integrity configuration
by other means.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20240613084839.1044015-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-14 10:20:06 -06:00
Christoph Hellwig
e9f5f44ad3 block: remove the blk_integrity_profile structure
Block layer integrity configuration is a bit complex right now, as it
indirects through operation vectors for a simple two-dimensional
configuration:

 a) the checksum type of none, ip checksum, crc, crc64
 b) the presence or absence of a reference tag

Remove the integrity profile, and instead add a separate csum_type flag
which replaces the existing ip-checksum field and a new flag that
indicates the presence of the reference tag.

This removes up to two layers of indirect calls, removes the need to
offload the no-op verification of non-PI metadata to a workqueue and
generally simplifies the code. The downside is that block/t10-pi.c now
has to be built into the kernel when CONFIG_BLK_DEV_INTEGRITY is
supported.  Given that both nvme and SCSI require t10-pi.ko, it is loaded
for all usual configurations that enabled CONFIG_BLK_DEV_INTEGRITY
already, though.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240613084839.1044015-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-14 10:20:06 -06:00
Jens Axboe
e3e53683cc nvme fixes for Linux 6.10
- Discard double free on error conditions (Chunguang)
  - Target Fixes (Daniel)
  - Namespace detachment regression fix (Keith)
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE3Fbyvv+648XNRdHTPe3zGtjzRgkFAmZrP4AACgkQPe3zGtjz
 Rgmg9g//SA59e4N7XH72xZxpGXkKOMtvGp1Ku2C4Yf1yzmWcst2tBkYq8ZRBStnj
 Ohb6BkEhKcLL63/PVgXX84AtaDvzJ7qjaQzivK5zjnbtPsspOe2Wieuyx3c/UJxM
 B3g1IV4mbG9qlA9yxaA6DyehyKwy0iJC9Y/bd6MuikeXBsdr/rnJ0Mhu/PRDrn1A
 XdZxu1JCvS4azMes5Iu0L1WEZSexsFEj1DYx376YIWKRfORmho87d1hv4JO+EE1C
 6atmuDhCwqGUZbgHQ8CauS9/OiK+5Pl+LeE2Cbwy9tb1RfB6/GYzVW6c2svAMbvD
 DP/n+/AIjkjw7fgKxoE52LEvCYpFPMoatmWKNSTyfZ35EYYe7aPA+uAo9xHNXdwQ
 D1od01zQ/04V1w8iRu/ASxyksoHqfT9jjg6FjT7QgwIurDYAKjgpdP8KWoQQbYWp
 IaHcTru1plIFecKgq5D3y1EeQEAkPJf1bQWZUMIWVST4/e4dL4KYKtoIlDLwoIei
 PNjeYp4JpG5APJtZM2sb4Y8Xm77T3wMBAJEa768naelAjqzH9GKQpVC/x1yu0Z8Q
 7m8vbvFRkUADdJcEb7Bo4O6zVJhOFSGNfE7M6ozSr4HTrggutHhTdMXHhUyg3OGf
 gmBcb6+OAqt/cuxJ3JsXhtIOdYyFgiwrDzR7cIR1+Nb5rjLvkjI=
 =RK8+
 -----END PGP SIGNATURE-----

Merge tag 'nvme-6.10-2024-06-13' of git://git.infradead.org/nvme into block-6.10

Pull NVMe fixes from Keith:

"nvme fixes for Linux 6.10

 - Discard double free on error conditions (Chunguang)
 - Target Fixes (Daniel)
 - Namespace detachment regression fix (Keith)"

* tag 'nvme-6.10-2024-06-13' of git://git.infradead.org/nvme:
  nvme: fix namespace removal list
  nvmet: always initialize cqe.result
  nvmet-passthru: propagate status from id override functions
  nvme: avoid double free special payload
2024-06-13 14:19:57 -06:00
Keith Busch
ff0ffe5b7c nvme: fix namespace removal list
This function wants to move a subset of a list from one element to the
tail into another list. It also needs to use the srcu synchronize
instead of the regular rcu version. Do this one element at a time
because that's the only way to do it.

Fixes: be647e2c76 ("nvme: use srcu for iterating namespace list")
Reported-by: Venkat Rao Bagalkote <venkat88@linux.vnet.ibm.com>
Tested-by: Venkat Rao Bagalkote <venkat88@linux.vnet.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-06-13 11:47:40 -07:00
Daniel Wagner
cd0c1b8e04 nvmet: always initialize cqe.result
The spec doesn't mandate that the first two double words (aka results)
for the completion queue entry need to be set to 0 when they are not
used (not specified). However, the target implementation returns 0 for
TCP and FC but not for RDMA.

Let's make RDMA behave the same by explicitly initializing the
result field. This prevents leaking any data from the stack.

Signed-off-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-06-12 11:00:08 -07:00
Daniel Wagner
d76584e53f nvmet-passthru: propagate status from id override functions
The id override functions return a status which is not propagated to the
caller.

Fixes: c1fef73f79 ("nvmet: add passthru code to process commands")
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-06-12 11:00:08 -07:00
Chunguang Xu
e5d574ab37 nvme: avoid double free special payload
If a discard request needs to be retried, and that retry fails before
a new special payload is added, a double free will result. Clear the
RQF_SPECIAL_PAYLOAD flag when the request is cleaned up.

Signed-off-by: Chunguang Xu <chunguang.xu@shopee.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-06-12 10:56:50 -07:00
Anuj Gupta
e038ee6189 block: unmap and free user mapped integrity via submitter
The user mapped integrity is copied back and unpinned by
bio_integrity_free which is a low-level routine. Do it via the submitter
rather than doing it in the low-level block layer code, to split the
submitter side from the consumer side of the bio.

Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20240610111144.14647-1-anuj20.g@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-12 11:00:50 -06:00
Weiwen Hu
b1a1fdd709 nvme: fix nvme_pr_* status code parsing
Fix the parsing if extra status bits (e.g. MORE) are present.
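
The idea can be sketched as masking the status before comparing it. The
constants and names below (SC_MASK, SC_MORE, SC_DNR,
SC_RESERVATION_CONFLICT, pr_result_from_status) are assumptions made for
this illustration and are not copied from the driver:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed status layout for this sketch. */
#define SC_MASK                 0x7ffu  /* status code type + status code */
#define SC_MORE                 0x2000u /* extra bit: more info available */
#define SC_DNR                  0x4000u /* extra bit: do not retry */
#define SC_RESERVATION_CONFLICT 0x83u

static_assert((SC_MASK & (SC_MORE | SC_DNR)) == 0,
	      "extra bits must lie outside the status code mask");

enum pr_result { PR_OK, PR_RESERVATION_CONFLICT, PR_IOERR };

/* Classify a status value while ignoring the extra bits. */
static enum pr_result pr_result_from_status(uint16_t status)
{
	switch (status & SC_MASK) {
	case 0:
		return PR_OK;
	case SC_RESERVATION_CONFLICT:
		return PR_RESERVATION_CONFLICT;
	default:
		return PR_IOERR;
	}
}
```

Without the mask, a reservation conflict reported together with MORE or
DNR would compare unequal to the bare status code and fall through to
the generic error case.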

Fixes: 7fb42780d0 ("nvme: Convert NVMe errors to PR errors")
Signed-off-by: Weiwen Hu <huweiwen@linux.alibaba.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-05-31 13:50:59 -07:00
Chunguang Xu
7dc3bfcb4c nvme-fabrics: use reserved tag for reg read/write command
In some scenarios, if user tasks issue too many commands through the
nvme command at the same time, they may exhaust all tags of admin_q. If
a reset (nvme reset or IO timeout) occurs before these commands finish,
the reconnect routine may fail to update nvme regs due to insufficient
tags, which will cause the kernel to hang forever. To work around this
issue, let reg_read32()/reg_read64()/reg_write32() use reserved
tags. This may be safe for nvmf:

1. For the disable ctrl path,  we will not issue connect command
2. For the enable ctrl / fw activate path, since connect and reg_xx()
   are called serially.

So the reserved tags may still be enough while reg_xx() use reserved tags.

Signed-off-by: Chunguang Xu <chunguang.xu@shopee.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-05-31 13:26:15 -07:00
Sagi Grimberg
c758b77d4a nvmet: fix a possible leak when destroy a ctrl during qp establishment
In nvmet_sq_destroy we capture sq->ctrl early and if it is non-NULL we
know that a ctrl was allocated (in the admin connect request handler)
and we need to release pending AERs, clear ctrl->sqs and sq->ctrl
(for nvme-loop primarily), and drop the final reference on the ctrl.

However, a small window is possible where nvmet_sq_destroy starts (as
a result of the client giving up and disconnecting) concurrently with
the nvme admin connect cmd (which may be in an early stage). But *before*
kill_and_confirm of sq->ref (i.e. the admin connect managed to get an sq
live reference). In this case, sq->ctrl was allocated, but only after it
was captured in a local variable in nvmet_sq_destroy.
This prevented the final reference drop on the ctrl.

Solve this by re-capturing the sq->ctrl after all inflight requests have
completed, at which point the sq->ctrl reference is sure to be final, and
move forward based on that.

This issue was observed in an environment with many hosts connecting
multiple ctrls simultaneously, creating a delay in allocating a ctrl
leading up to this race window.

Reported-by: Alex Turin <alex@vastdata.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-05-28 10:01:52 -07:00
Keith Busch
be647e2c76 nvme: use srcu for iterating namespace list
The nvme pci driver synchronizes with all the namespace queues during a
reset to ensure that there's no pending timeout work.

Meanwhile the timeout work potentially iterates those same namespaces to
freeze their queues.

Each of those namespace iterations uses the same read lock. If a write
lock should somehow get between the synchronize and freeze steps, then
forward progress is deadlocked.

We had been relying on the nvme controller state machine to ensure the
reset work wouldn't conflict with timeout work. That guarantee may be a
bit fragile to rely on, so iterate the namespace lists without taking
potentially circular locks, as reported by lockdep.

Link: https://lore.kernel.org/all/20220930001943.zdbvolc3gkekfmcv@shindev/
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-05-28 09:43:32 -07:00
Kundan Kumar
1bd293fcf3 nvme: adjust multiples of NVME_CTRL_PAGE_SIZE in offset
The bio_vec start offset may be relatively large, particularly when a
large folio gets added to the bio. A bigger offset defeats the
single-segment mapping optimization and ends up using the expensive
mempool_alloc instead.

Rather than using the absolute value, reduce bv_offset modulo
NVME_CTRL_PAGE_SIZE when checking whether the segment fits into one or
two PRP entries.
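
A small sketch of the difference, assuming a 4096-byte controller page
size. CTRL_PAGE_SIZE and both helper names are hypothetical and only
model the two-PRP case:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CTRL_PAGE_SIZE 4096u /* assumed controller page size */

static_assert((CTRL_PAGE_SIZE & (CTRL_PAGE_SIZE - 1)) == 0,
	      "page size must be a power of two for the mask below");

/* Old check: uses the absolute offset into the (possibly large) folio. */
static bool fits_two_prps_old(uint32_t bv_offset, uint32_t bv_len)
{
	return bv_offset + bv_len <= 2 * CTRL_PAGE_SIZE;
}

/* New check: only the offset within the controller page matters. */
static bool fits_two_prps_new(uint32_t bv_offset, uint32_t bv_len)
{
	uint32_t in_page = bv_offset & (CTRL_PAGE_SIZE - 1);

	return in_page + bv_len <= 2 * CTRL_PAGE_SIZE;
}
```

A 4 KiB segment that starts 64 KiB + 512 bytes into a large folio fails
the old check but passes the new one, since only 512 bytes of the offset
fall within the first controller page.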

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-05-24 08:59:16 -07:00
Kanchan Joshi
64e3d02b43 nvme: remove sgs and sws
sgs/sws are unused, so remove these from nvme_ns_head structure.

Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-05-24 08:57:40 -07:00
Sagi Grimberg
f97914e35f nvmet: fix ns enable/disable possible hang
When disabling an nvmet namespace, there is a period where the
subsys->lock is released, as the ns disable waits for backend IO to
complete, and the ns percpu ref to be properly killed. The original
intent was to avoid taking the subsystem lock for a prolonged period as
other processes may need to acquire it (for example new incoming
connections).

However, it opens up a window where another process may come in and
enable the ns, (re)initializing the ns percpu_ref, causing the disable
sequence to hang.

Solve this by taking the global nvmet_config_sem over the entire configfs
enable/disable sequence.

Fixes: a07b4970f4 ("nvmet: add a generic NVMe target")
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-05-23 13:44:42 -07:00
Keith Busch
a2e4c5f5f6 nvme-multipath: fix io accounting on failover
There is io stats accounting that needs to be handled, so don't call
blk_mq_end_request() directly. Use the existing nvme_end_req() helper
that already handles everything.

Fixes: d4d957b53d ("nvme-multipath: support io stats on the mpath device")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-05-23 13:44:42 -07:00
Keith Busch
2fe7b42246 nvme: fix multipath batched completion accounting
Batched completions were missing the io stats accounting and bio trace
events. Move the common code to a helper and call it from the batched
and non-batched functions.

Fixes: d4d957b53d ("nvme-multipath: support io stats on the mpath device")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-05-23 13:44:35 -07:00
Nilay Shroff
d3a043733f nvme-multipath: find NUMA path only for online numa-node
In the current native multipath design, when a shared namespace is created,
we loop through each possible numa-node, calculate the NUMA distance of
that node from each nvme controller and then cache the optimal IO path
for future reference while sending IO. The issue with this design is that
we may refer to the NUMA distance table for an offline node which may not
be populated at the time, and so we may inadvertently end up finding and
caching a non-optimal path for IO. Later, when the corresponding
numa-node comes online and the NUMA distance table entry for that
node is created, ideally we should re-calculate the multipath node distance
for the newly added node; however, that doesn't happen unless we
rescan/reset the controller. So essentially, we may keep using a
non-optimal IO path for a node which is made online after the namespace
is created.

This patch fixes the issue by ensuring that when a shared namespace is
created, we calculate the multipath node distance for each online numa-node
instead of each possible numa-node. Later, when a node comes online
and we receive any IO on that newly added node, we calculate the
multipath node distance for it, but this time the NUMA distance
table will already have been populated for the node. Hence we
can correctly calculate the multipath node distance and choose
the optimal path for the IO.

Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-05-21 06:43:08 -07:00
Jens Axboe
803fbb96c1 nvme updates for Linux 6.10
- Fabrics connection retries (Daniel, Hannes)
  - Fabrics logging enhancements (Tokunori)
  - RDMA delete optimization (Sagi)
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE3Fbyvv+648XNRdHTPe3zGtjzRgkFAmZDfbEACgkQPe3zGtjz
 Rgl2UA/+MrR/E1WkSv3D3NepPdNSoay+s6wWHirNnyffAOu7fYpBQbUsnV2aEHNC
 y5m6gUYZ8HGffZAxOLjNsgo42DEj2TJx3Ef/0DgHTiCVLNFVT0C5tkQMHOO79Vn9
 MG0fDj7hBFIdoQTuiFDb64jLkUdFUQe9mAE4hThho1stlCd0HsAaHQyrmDlV3e0x
 cTVMg9OlxvMDe5ER7vuawuVvhrAR5fTOmJEFIKItyZ2Zafsn4vuqjGk18aLAhf5m
 t8IALy3Y6ILe/RSq+QFvHXbX2fh8T+GtUv9a6qKNiYvmIsW8LwQgvsx8EGFU7+/P
 wfY71+Ic1VlDacjn2snLMnTZjBBkDq8iJ9VACCQYycPEHxAUhIShAvykUZkiuyfZ
 hnS6CRXcjv5MCS0syWgfL9avSML50J/dszPH/TNa0UE6QAXw0GSB+gN1z2j9l399
 8lDz4rvwgQw5jaFkuz8ricEMWeAJA7pP07OrfOTsIlGavjxuNFzOXX8cwIHm7veL
 LuLX2Lk+x/4BfPB+e40EsoaHb5ZY0Z4ojBdsUjWpgRU2vlavWu9lxyO7X2imRxMg
 nvfW0fDQ6noVMWggG4wLdj9MHGDlAi3DHsh1a2K0FET0aBTX6AOCiXaM059irsnv
 rfV7igvkMYJLCS1GwYno69GFrldt3zaLTlDEFdtgd1WHFbCuieA=
 =O4P1
 -----END PGP SIGNATURE-----

Merge tag 'nvme-6.10-2024-05-14' of git://git.infradead.org/nvme into block-6.10

Pull NVMe updates and fixes from Keith:

"nvme updates for Linux 6.10

 - Fabrics connection retries (Daniel, Hannes)
 - Fabrics logging enhancements (Tokunori)
 - RDMA delete optimization (Sagi)"

* tag 'nvme-6.10-2024-05-14' of git://git.infradead.org/nvme:
  nvme-rdma, nvme-tcp: include max reconnects for reconnect logging
  nvmet-rdma: Avoid o(n^2) loop in delete_ctrl
  nvme: do not retry authentication failures
  nvme-fabrics: short-circuit reconnect retries
  nvme: return kernel error codes for admin queue connect
  nvmet: return DHCHAP status codes from nvmet_setup_auth()
  nvmet: lock config semaphore when accessing DH-HMAC-CHAP key
2024-05-14 09:14:49 -06:00
Linus Torvalds
0c9f4ac808 for-6.10/block-20240511

Merge tag 'for-6.10/block-20240511' of git://git.kernel.dk/linux

Pull block updates from Jens Axboe:

 - Add a partscan attribute in sysfs, fixing an issue with systemd
   relying on an internal interface that went away.

 - Attempt #2 at making long running discards interruptible. The
   previous attempt went into 6.9, but we ended up mostly reverting it
   as it had issues.

 - Remove old ida_simple API in bcache

 - Support for zoned write plugging, greatly improving the performance
   on zoned devices.

 - Remove the old throttle low interface, which has been experimental
   since 2017 and never made it beyond that and isn't being used.

 - Remove page->index debugging checks in brd, as they haven't caught
   anything, and removing them prepares us for dropping page->index
   from struct page.

 - MD pull request from Song

 - Don't schedule block workers on isolated CPUs

* tag 'for-6.10/block-20240511' of git://git.kernel.dk/linux: (84 commits)
  blk-throttle: delay initialization until configuration
  blk-throttle: remove CONFIG_BLK_DEV_THROTTLING_LOW
  block: fix that util can be greater than 100%
  block: support to account io_ticks precisely
  block: add plug while submitting IO
  bcache: fix variable length array abuse in btree_iter
  bcache: Remove usage of the deprecated ida_simple_xx() API
  md: Revert "md: Fix overflow in is_mddev_idle"
  blk-lib: check for kill signal in ioctl BLKDISCARD
  block: add a bio_await_chain helper
  block: add a blk_alloc_discard_bio helper
  block: add a bio_chain_and_submit helper
  block: move discard checks into the ioctl handler
  block: remove the discard_granularity check in __blkdev_issue_discard
  block/ioctl: prefer different overflow check
  null_blk: Fix the WARNING: modpost: missing MODULE_DESCRIPTION()
  block: fix and simplify blkdevparts= cmdline parsing
  block: refine the EOF check in blkdev_iomap_begin
  block: add a partscan sysfs attribute for disks
  block: add a disk_has_partscan helper
  ...
2024-05-13 13:03:54 -07:00
Linus Torvalds
9961a78594 for-6.10/io_uring-20240511

Merge tag 'for-6.10/io_uring-20240511' of git://git.kernel.dk/linux

Pull io_uring updates from Jens Axboe:

 - Greatly improve send zerocopy performance, by enabling coalescing of
   sent buffers.

   MSG_ZEROCOPY already does this with send(2) and sendmsg(2), but the
   io_uring side did not. In local testing, the crossover point for send
   zerocopy being faster is now around 3000 byte packets, and it
   performs better than the sync syscall variants as well.

   This feature relies on a shared branch with net-next, which was
   pulled into both branches.

 - Unification of how async preparation is done across opcodes.

   Previously, opcodes that required extra memory for async retry would
   allocate that as needed, using on-stack state until that was the
   case. If async retry was needed, the on-stack state was adjusted
   appropriately for a retry and then copied to the allocated memory.

   This led to some fragile and ugly code, particularly for read/write
   handling, and made storage retries more difficult than they needed to
   be. Allocate the memory upfront, as it's cheap from our pools, and
   use that state consistently both initially and also from the retry
   side.

 - Move away from using remap_pfn_range() for mapping the rings.

   This is really not the right interface to use and can cause lifetime
   issues or leaks. Additionally, it means the ring sq/cq arrays need to
   be physically contiguous, which can cause problems in production with
   larger rings when services are restarted, as memory can be very
   fragmented at that point.

   Move to using vm_insert_page(s) for the ring sq/cq arrays, and apply
   the same treatment to mapped ring provided buffers. This also helps
   unify the code we have dealing with allocating and mapping memory.

   Hard to see in the diffstat as we're adding a few features as well,
   but this removes about 400 lines of code from the codebase.

 - Add support for bundles for send/recv.

   When used with provided buffers, bundles support sending or receiving
   more than one buffer at a time, improving the efficiency by only
   needing to call into the networking stack once for multiple sends or
   receives.

 - Tweaks for our accept operations, supporting both a DONTWAIT flag for
   skipping poll arm and retry if we can, and a POLLFIRST flag that the
   application can use to skip the initial accept attempt and rely
   purely on poll for triggering the operation. Both of these have
   identical flags on the receive side already.

 - Make the task_work ctx locking unconditional.

   We had various code paths here that would do a mix of lock/trylock
   and set the task_work state to whether or not it was locked. All of
   that goes away, we lock it unconditionally and get rid of the state
   flag indicating whether it's locked or not.

   The state struct still exists as an empty type, can go away in the
   future.

 - Add support for specifying NOP completion values, allowing it to be
   used for error handling testing.

 - Use set/test bit for io-wq worker flags. Not strictly needed, but
   also doesn't hurt and helps silence a KCSAN warning.

 - Cleanups for io-wq locking and work assignments, closing a tiny race
   where cancelations would not be able to find the work item reliably.

 - Misc fixes, cleanups, and improvements

* tag 'for-6.10/io_uring-20240511' of git://git.kernel.dk/linux: (97 commits)
  io_uring: support to inject result for NOP
  io_uring: fail NOP if non-zero op flags is passed in
  io_uring/net: add IORING_ACCEPT_POLL_FIRST flag
  io_uring/net: add IORING_ACCEPT_DONTWAIT flag
  io_uring/filetable: don't unnecessarily clear/reset bitmap
  io_uring/io-wq: Use set_bit() and test_bit() at worker->flags
  io_uring/msg_ring: cleanup posting to IOPOLL vs !IOPOLL ring
  io_uring: Require zeroed sqe->len on provided-buffers send
  io_uring/notif: disable LAZY_WAKE for linked notifs
  io_uring/net: fix sendzc lazy wake polling
  io_uring/msg_ring: reuse ctx->submitter_task read using READ_ONCE instead of re-reading it
  io_uring/rw: reinstate thread check for retries
  io_uring/notif: implement notification stacking
  io_uring/notif: simplify io_notif_flush()
  net: add callback for setting a ubuf_info to skb
  net: extend ubuf_info callback to ops structure
  io_uring/net: support bundles for recv
  io_uring/net: support bundles for send
  io_uring/kbuf: add helpers for getting/peeking multiple buffers
  io_uring/net: add provided buffer support for IORING_OP_SEND
  ...
2024-05-13 12:48:06 -07:00
Sagi Grimberg
73964c1d07 nvmet-rdma: fix possible bad dereference when freeing rsps
It is possible that the host connected and saw a cm established
event and started sending nvme capsules on the qp, however the
ctrl did not yet see an established event. This is why the
rsp_wait_list exists (for async handling of these cmds, we move
them to a pending list).

Furthermore, it is possible that the ctrl cm times out, resulting
in a connect-error cm event. In this case we hit a bad deref [1]
because in nvmet_rdma_free_rsps we assume that all the responses
are in the free list.

We are freeing the cmds array anyway, so don't even bother to
remove the rsp from the free_list. It is also guaranteed that we
are not racing anything when we are releasing the queue so no
other context accessing this array should be running.

[1]:
--
Workqueue: nvmet-free-wq nvmet_rdma_free_queue_work [nvmet_rdma]
[...]
pc : nvmet_rdma_free_rsps+0x78/0xb8 [nvmet_rdma]
lr : nvmet_rdma_free_queue_work+0x88/0x120 [nvmet_rdma]
 Call trace:
 nvmet_rdma_free_rsps+0x78/0xb8 [nvmet_rdma]
 nvmet_rdma_free_queue_work+0x88/0x120 [nvmet_rdma]
 process_one_work+0x1ec/0x4a0
 worker_thread+0x48/0x490
 kthread+0x158/0x160
 ret_from_fork+0x10/0x18
--

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-05-08 06:17:01 -07:00
Dan Carpenter
d15dcd0f1a nvmet: prevent sprintf() overflow in nvmet_subsys_nsid_exists()
The nsid value is a u32 that comes from nvmet_req_find_ns().  It's
endian data and we're on an error path and both of those raise red
flags.  So let's make this safer.

1) Make the buffer large enough for any u32.
2) Remove the unnecessary initialization.
3) Use snprintf() instead of sprintf() for even more safety.
4) The sprintf() function returns the number of bytes printed, not
   counting the NUL terminator. It is impossible for the return value to
   be <= 0 so delete that check.
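
As a rough user-space illustration of points 1) and 3) above (not the
actual patch): the helper name format_nsid and the constant NSID_BUF_LEN
are made up for this sketch. The largest u32, 4294967295, needs 10 digits
plus the NUL terminator, so 11 bytes are enough for any value.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical constant: 10 decimal digits for UINT32_MAX plus the NUL. */
#define NSID_BUF_LEN 11

/*
 * Hypothetical helper: format a namespace ID with a bounded write.
 * Returns the number of characters written, excluding the NUL terminator.
 */
int format_nsid(char *buf, size_t len, uint32_t nsid)
{
	assert(buf != NULL && len >= NSID_BUF_LEN);
	return snprintf(buf, len, "%" PRIu32, nsid);
}
```

With a buffer of NSID_BUF_LEN bytes the output can never be truncated, and
snprintf() bounds the write even if the size is later changed by mistake.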

Fixes: 505363957f ("nvmet: fix nvme status code when namespace is disabled")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-05-08 06:10:32 -07:00
Tokunori Ikegami
54a76c8732 nvme-rdma, nvme-tcp: include max reconnects for reconnect logging
Make it clear in the reconnect log messages what max reconnects is, as
derived from ctrl loss tmo and reconnect delay.
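
A minimal sketch of how such a reconnect budget could be derived,
assuming (this is not stated in the message above) that it is the
controller-loss timeout divided by the reconnect delay, rounded up, and
that a negative timeout means "retry forever". The function name is
hypothetical.

```c
#include <assert.h>

/*
 * Hypothetical helper: number of reconnect attempts allowed.
 * Assumption: round-up division; a negative timeout yields -1 (unlimited).
 */
int sketch_max_reconnects(int ctrl_loss_tmo, unsigned int reconnect_delay)
{
	assert(reconnect_delay > 0);
	if (ctrl_loss_tmo < 0)
		return -1;
	return (int)(((unsigned int)ctrl_loss_tmo + reconnect_delay - 1) /
		     reconnect_delay);
}
```

Under these assumptions a 600 second timeout with a 10 second delay gives
a budget of 60 attempts, which is the number a log line could print next
to the current attempt count.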

Signed-off-by: Tokunori Ikegami <ikegami.t@gmail.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-05-07 08:50:37 -07:00
Sagi Grimberg
34cfb09cdc nvmet: make nvmet_wq unbound
When deleting many controllers one-by-one, it takes a very
long time as these work elements may serialize, since they are
scheduled on the executing cpu instead of spreading. In general
nvmet_wq can definitely be used for long standing work elements,
so it's better to make it unbound regardless.

Signed-off-by: Sagi Grimberg <sagi.grimberg@vastdata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-05-07 08:07:05 -07:00
Sagi Grimberg
c51a22e63f nvmet-rdma: Avoid o(n^2) loop in delete_ctrl
When deleting a nvmet-rdma ctrl, we essentially loop over all
queues that belong to the controller and schedule a removal of
each. Instead of restarting the loop every time a queue is found,
do a simple safe list traversal.

This avoids unneeded time spent scheduling queue removal in
cases where there are a lot of queues.
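
The idea of a "safe" traversal can be sketched in user space as below.
This is not the nvmet-rdma code: the struct, its fields and the function
name are invented for illustration, and a plain singly linked list stands
in for the kernel list. The point is that the next pointer is saved before
the current node is removed, so the walk finishes in one pass instead of
restarting from the head after every match.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical queue node; ctrl_id identifies the owning controller. */
struct sketch_queue {
	int ctrl_id;
	struct sketch_queue *next;
};

/*
 * Unlink every queue owned by ctrl_id in a single pass.
 * Returns the number of queues removed.
 */
int sketch_remove_ctrl_queues(struct sketch_queue **head, int ctrl_id)
{
	struct sketch_queue **link = head;
	struct sketch_queue *q, *next;
	int removed = 0;

	for (q = *head; q != NULL; q = next) {
		next = q->next;         /* saved first: q may leave the list */
		if (q->ctrl_id == ctrl_id) {
			*link = next;   /* unlink q without restarting */
			q->next = NULL;
			removed++;
		} else {
			link = &q->next;
		}
	}
	return removed;
}
```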

Signed-off-by: Sagi Grimberg <sagi.grimberg@vastdata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-05-07 08:04:02 -07:00
Maurizio Lombardi
4b9a89be21 nvmet-auth: return the error code to the nvmet_auth_ctrl_hash() callers
If nvmet_auth_ctrl_hash() fails, return the error code to its callers.

Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-05-07 07:57:38 -07:00
Sean Anderson
d5887dc6b6 nvme-pci: Add quirk for broken MSIs
Sandisk SN530 NVMe drives have broken MSIs. On systems without MSI-X
support, all commands time out resulting in the following message:

nvme nvme0: I/O tag 12 (100c) QID 0 timeout, completion polled

These timeouts cause the boot to take an excessively-long time (over 20
minutes) while the initial command queue is flushed.

Address this by adding a quirk for drives with buggy MSIs. The lspci
output for this device (recorded on a system with MSI-X support) is:

02:00.0 Non-Volatile memory controller: Sandisk Corp Device 5008 (rev 01) (prog-if 02 [NVM Express])
	Subsystem: Sandisk Corp Device 5008
	Flags: bus master, fast devsel, latency 0, IRQ 16, NUMA node 0
	Memory at f7e00000 (64-bit, non-prefetchable) [size=16K]
	Memory at f7e04000 (64-bit, non-prefetchable) [size=256]
	Capabilities: [80] Power Management version 3
	Capabilities: [90] MSI: Enable- Count=1/32 Maskable- 64bit+
	Capabilities: [b0] MSI-X: Enable+ Count=17 Masked-
	Capabilities: [c0] Express Endpoint, MSI 00
	Capabilities: [100] Advanced Error Reporting
	Capabilities: [150] Device Serial Number 00-00-00-00-00-00-00-00
	Capabilities: [1b8] Latency Tolerance Reporting
	Capabilities: [300] Secondary PCI Express
	Capabilities: [900] L1 PM Substates
	Kernel driver in use: nvme
	Kernel modules: nvme

Cc: <stable@vger.kernel.org>
Signed-off-by: Sean Anderson <sean.anderson@linux.dev>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-05-07 07:55:14 -07:00
Daniel Wagner
0e34bd9605 nvme: do not retry authentication failures
When the key is invalid there is no point in retrying. Because the auth
code returns kernel error codes only, we can't test on the DNR bit.

Signed-off-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-05-01 03:07:20 -07:00
Hannes Reinecke
adfde7ed0b nvme-fabrics: short-circuit reconnect retries
Returning an nvme status from nvme_tcp_setup_ctrl() indicates that the
association was established and we have received a status from the
controller; consequently we should honour the DNR bit. When DNR is set,
any future reconnect attempt will just return the same error, so we can
short-circuit the reconnect attempts and fail the connection directly.
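
A hedged sketch of the decision described above. The function name, its
parameters and the convention that a positive value is an NVMe status
while a negative value is a kernel error code are assumptions made for
this example; the DNR bit value 0x4000 is likewise assumed here rather
than taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed position of the "do not retry" bit in the status word. */
#define SKETCH_SC_DNR 0x4000

/*
 * Hypothetical helper: decide whether another reconnect attempt is useful.
 * status > 0: NVMe status from the controller; status < 0: kernel errno.
 * max_reconnects == -1 means unlimited attempts.
 */
bool sketch_should_reconnect(int status, int attempts, int max_reconnects)
{
	if (status > 0 && (status & SKETCH_SC_DNR))
		return false;   /* controller asked us not to retry */
	if (max_reconnects == -1)
		return true;
	return attempts < max_reconnects;
}
```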

Signed-off-by: Hannes Reinecke <hare@suse.de>
[dwagner: - extended nvme_should_reconnect]
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-05-01 03:07:20 -07:00
Hannes Reinecke
44350336fd nvme: return kernel error codes for admin queue connect
nvmf_connect_admin_queue returns NVMe error status codes and kernel
error codes. Mixing these two domains makes the code hard to
maintain.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-05-01 03:07:20 -07:00
Hannes Reinecke
44e3c25efa nvmet: return DHCHAP status codes from nvmet_setup_auth()
A failure in nvmet_setup_auth() does not mean that the NVMe
authentication command failed, so we should rather return a protocol
error with a 'failure1' response than an NVMe status.

Also update the type used for dhchap_step and dhchap_status to u8 to
avoid confusion with nvme status. Furthermore, split dhchap_status and
nvme status so we don't accidentally mix these return values.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Hannes Reinecke <hare@suse.de>
[dwagner: - use u8 as type for dhchap_{step|status}
          - separate nvme status from dhchap_status]
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-05-01 03:07:20 -07:00
Hannes Reinecke
213cbada7b nvmet: lock config semaphore when accessing DH-HMAC-CHAP key
When the DH-HMAC-CHAP key is accessed via configfs we need to take the
config semaphore as a reconnect might be running at the same time.

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-05-01 03:07:20 -07:00
Hannes Reinecke
50abcc179e nvme-tcp: strict pdu pacing to avoid send stalls on TLS
TLS requires a strict pdu pacing via MSG_EOR to signal the end
of a record and subsequent encryption. If we do not set MSG_EOR
at the end of a sequence the record won't be closed, encryption
doesn't start, and we end up with a send stall as the message
will never be passed on to the TCP layer.
So do not check for the queue status when TLS is enabled but
rather make the MSG_MORE setting dependent on the current
request only.

Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-05-01 02:58:43 -07:00
Sagi Grimberg
505363957f nvmet: fix nvme status code when namespace is disabled
If the user disabled an nvmet namespace, it is removed from the subsystem
namespaces list. When nvmet processes a command directed to an nsid that
was disabled, it cannot differentiate between an nsid that is disabled
and a non-existent namespace, and resorts to returning NVME_SC_INVALID_NS
with the dnr bit set.

This translates to a non-retryable status for the host, which translates
to a user error. We should expect disabled namespaces to not cause an
I/O error in a multipath environment.

Address this by searching a configfs item for the namespace nvmet failed
to find, and if we found one, conclude that the namespace is disabled
(perhaps temporarily). Return NVME_SC_INTERNAL_PATH_ERROR in this case
and keep DNR bit cleared.

Reported-by: Jirong Feng <jirong.feng@easystack.cn>
Tested-by: Jirong Feng <jirong.feng@easystack.cn>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-05-01 02:58:43 -07:00
Sagi Grimberg
6825bdde44 nvmet-tcp: fix possible memory leak when tearing down a controller
When we tear down the controller, we wait for pending I/Os to complete
(sq->ref on all queues to drop to zero) and then we go over the commands,
and free their command buffers in case they are still fetching data from
the host (e.g. processing nvme writes) and have yet to take a reference
on the sq.

However, we may miss the case where commands have failed before executing
and are queued for sending a response that will never be sent because the
queue socket is already down. In this case we may miss deallocating command
buffers.

Solve this by freeing all command buffers, as nvmet_tcp_free_cmd_buffers is
idempotent anyway.

Reported-by: Yi Zhang <yi.zhang@redhat.com>
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-05-01 02:58:42 -07:00
Nilay Shroff
25bb3534ee nvme: cancel pending I/O if nvme controller is in terminal state
If a PCI bus error occurs while I/O is running, the in-flight I/O
cannot complete. Worse, if the user then (logically) hot-unplugs the
nvme disk, the nvme_remove() code path cannot make forward progress
until the in-flight I/O is cancelled, so this sequence of events can
hang the hot-unplug code path indefinitely.
This patch cancels the pending/in-flight I/O from the nvme request
timeout handler when the nvme controller is in a terminal
(DEAD/DELETING/DELETING_NOIO) state, which lets the nvme_remove()
code path make forward progress and finish successfully.

Link: https://lore.kernel.org/all/199be893-5dfa-41e5-b6f2-40ac90ebccc4@linux.ibm.com/
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-05-01 02:58:42 -07:00
Maurizio Lombardi
445f9119e7 nvmet-auth: replace pr_debug() with pr_err() to report an error.
In nvmet_auth_host_hash(), if a mismatch is detected in the hash length
the kernel should print an error.

Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-05-01 02:58:42 -07:00
Maurizio Lombardi
46b8f9f74f nvmet-auth: return the error code to the nvmet_auth_host_hash() callers
If the nvmet_auth_host_hash() function fails, the error code should
be returned to its callers.

Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-05-01 02:58:42 -07:00
Nilay Shroff
863fe60ed2 nvme: find numa distance only if controller has valid numa id
On a system where native nvme multipath is configured and iopolicy
is set to numa, but the nvme controller numa node id is undefined
or -1 (NUMA_NO_NODE), avoid calculating the node distance when
finding the optimal io path. Otherwise we may access the numa distance
table with an invalid index, which may refer to incorrect memory.
So this patch ensures that if the nvme controller numa node id is -1,
then instead of calculating the node distance we set the numa node
distance of that controller to the default of 10 (LOCAL_DISTANCE).
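
The guard can be pictured with the small user-space sketch below. It is
an illustration only: the 2-node distance table, the SKETCH_* constants
(standing in for NUMA_NO_NODE and LOCAL_DISTANCE) and the function name
are invented for this example.

```c
#include <assert.h>

#define SKETCH_NO_NODE        (-1)  /* stands in for NUMA_NO_NODE */
#define SKETCH_LOCAL_DISTANCE 10    /* stands in for LOCAL_DISTANCE */
#define SKETCH_NR_NODES       2

/* Hypothetical distance table, indexed [from][to]. */
static const int sketch_distance[SKETCH_NR_NODES][SKETCH_NR_NODES] = {
	{ 10, 20 },
	{ 20, 10 },
};

/*
 * Return the distance from cpu_node to the controller's node, falling back
 * to the local distance when the controller has no valid node id, so the
 * table is never indexed with -1.
 */
int sketch_path_distance(int cpu_node, int ctrl_node)
{
	if (ctrl_node == SKETCH_NO_NODE)
		return SKETCH_LOCAL_DISTANCE;
	assert(cpu_node >= 0 && cpu_node < SKETCH_NR_NODES);
	assert(ctrl_node >= 0 && ctrl_node < SKETCH_NR_NODES);
	return sketch_distance[cpu_node][ctrl_node];
}
```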

Link: https://lore.kernel.org/all/20240413090614.678353-1-nilay@linux.ibm.com/
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-05-01 02:58:42 -07:00
Damien Le Moal
9b3c08b90f block: Simplify blk_revalidate_disk_zones() interface
The only user of blk_revalidate_disk_zones() second argument was the
SCSI disk driver (sd). Now that this driver does not require this
update_driver_data argument, remove it to simplify the interface of
blk_revalidate_disk_zones(). Also update the function kdoc comment to
be more accurate (i.e. there is no gendisk ->revalidate method).

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Tested-by: Hans Holmberg <hans.holmberg@wdc.com>
Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240408014128.205141-21-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17 08:44:03 -06:00
Damien Le Moal
d2a9b5fdc1 nvmet: zns: Do not reference the gendisk conv_zones_bitmap
The gendisk conventional zone bitmap is going away. So to check for the
presence of conventional zones on a zoned target device, always use
report zones.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Tested-by: Hans Holmberg <hans.holmberg@wdc.com>
Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240408014128.205141-19-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17 08:44:03 -06:00
Jens Axboe
1afdb76038 nvme/io_uring: use helper for polled completions
NVMe is making up issue_flags, which is a no-no in general, and to make
matters worse, they are completely the wrong ones. For a pure polled
request, which it does check for, we're already inside the
ctx->uring_lock when the completions are run off io_do_iopoll(). Hence
the correct flag would be '0' rather than IO_URING_F_UNLOCKED.

Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15 08:10:24 -06:00
Yi Zhang
0bc2e80b9b nvme: fix warn output about shared namespaces without CONFIG_NVME_MULTIPATH
Move the stray '.' that currently follows the newline '\n' at the end of
the line to before the newline character, where it belongs.

Fixes: ce8d78616a ("nvme: warn about shared namespaces without CONFIG_NVME_MULTIPATH")
Signed-off-by: Yi Zhang <yi.zhang@redhat.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-04-11 17:22:13 -07:00
Daniel Wagner
205fb5fa6f nvme-fc: rename free_ctrl callback to match name pattern
Rename nvme_fc_nvme_ctrl_freed to nvme_fc_free_ctrl to match the name
pattern for the callback.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-04-04 08:47:56 -07:00
Daniel Wagner
db67bb39ef nvmet-fc: move RCU read lock to nvmet_fc_assoc_exists
The RCU lock is only needed for the lookup loop and not for the
list_add_tail_rcu call. Thus move it down the call chain into
nvmet_fc_assoc_exists.

While at it also fix the name typo of the function.

Signed-off-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-04-04 08:47:56 -07:00
Hannes Reinecke
95409e277d nvmet: implement unique discovery NQN
Unique discovery NQNs make it possible to differentiate between discovery
services from (typically physically separate) NVMe-oF subsystems.
This is required for establishing secured connections as otherwise
the credentials won't be unique and the integrity of the connection
cannot be guaranteed.
This patch adds a configfs attribute 'discovery_nqn' in the 'nvmet'
configfs directory to specify the unique discovery NQN.

Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-04-04 08:35:49 -07:00
Christoph Hellwig
0551ec93a0 nvme: don't create a multipath node for zero capacity devices
Apparently there are nvme controllers around that report namespaces
in the namespace list which have zero capacity.  Return -ENXIO instead
of -ENODEV from nvme_update_ns_info_block so we don't create a hidden
multipath node for these namespaces but entirely ignore them.

Fixes: 46e7422cda ("nvme: move common logic into nvme_update_ns_info")
Reported-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-04-04 08:33:15 -07:00
Christoph Hellwig
c85c9ab926 nvme: split nvme_update_zone_info
nvme_update_zone_info does (admin queue) I/O to the device and can fail.
We fail to abort the queue limits update if that happens, and we really
should avoid doing I/O with the I/O queue frozen as much as possible anyway.

Split the logic into a helper to query the information that can be
called on an unfrozen queue and one to apply it to the queue limits.

Fixes: 9b130d681443 ("nvme: use the atomic queue limits update API")
Reported-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-04-02 08:21:33 -07:00
Christoph Hellwig
ac229a2d09 nvme-multipath: don't inherit LBA-related fields for the multipath node
Linux 6.9 made the nvme multipath nodes not properly pick up changes when
the LBA size goes smaller after an nvme format.  This is because we now
try to inherit the queue settings for the multipath node entirely from
the individual paths.  That is the right thing to do for I/O size
limitations, which make up most of the queue limits, but it is wrong for
changes to the namespace configuration, where we do want to pick up the
new format, which will eventually show up on all paths once they are
re-queried.

Fix this by not inheriting the block size and related fields and always
updating them.

Fixes: 8f03cfa117 ("nvme: don't use nvme_update_disk_info for the multipath disk")
Reported-by: Nilay Shroff <nilay@linux.ibm.com>
Tested-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-04-02 08:06:55 -07:00
Jens Axboe
0760267809 nvme updates for Linux 6.9

Merge tag 'nvme-6.9-2024-03-21' of git://git.infradead.org/nvme into block-6.9

Pull NVMe fixes from Keith:

"nvme updates for Linux 6.9

 - Make an informative message less ominous (Keith)
 - Enhanced trace decoding (Guixin)
 - TCP updates (Hannes, Li)
 - Fabrics connect deadlock fix (Chunguang)
 - Platform API migration update (Uwe)
 - A new device quirk (Jiawei)"

* tag 'nvme-6.9-2024-03-21' of git://git.infradead.org/nvme:
  nvmet-rdma: remove NVMET_RDMA_REQ_INVALIDATE_RKEY flag
  nvme: remove redundant BUILD_BUG_ON check
  nvme/tcp: Add wq_unbound modparam for nvme_tcp_wq
  nvme-tcp: Export the nvme_tcp_wq to sysfs
  drivers/nvme: Add quirks for device 126f:2262
  nvme: parse format command's lbafu when tracing
  nvme: add tracing of reservation commands
  nvme: parse zns command's zsa and zrasf to string
  nvme: use nvme_disk_is_ns_head helper
  nvme: fix reconnection fail due to reserved tag allocation
  nvmet: add tracing of zns commands
  nvmet: add tracing of authentication commands
  nvme-apple: Convert to platform remove callback returning void
  nvmet-tcp: do not continue for invalid icreq
  nvme: change shutdown timeout setting message
2024-03-21 13:23:07 -06:00
Guixin Liu
910934da94 nvmet-rdma: remove NVMET_RDMA_REQ_INVALIDATE_RKEY flag
We can simply use invalidate_rkey to check instead of adding a flag.

Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-21 10:46:53 -07:00
Guixin Liu
1e1c4bd16e nvme: remove redundant BUILD_BUG_ON check
Remove redundant BUILD_BUG_ON check of struct nvme_dsm_range, it's
already checked in nvme_init_ctrl().

Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-21 10:46:12 -07:00
Li Feng
0c29f9fa46 nvme/tcp: Add wq_unbound modparam for nvme_tcp_wq
The default nvme_tcp_wq will use all CPUs to process tasks. Sometimes it is
necessary to set CPU affinity to improve performance.

A new module parameter wq_unbound is added here. If set to true, users can
configure cpu affinity through
/sys/devices/virtual/workqueue/nvme_tcp_wq/cpumask.

Signed-off-by: Li Feng <fengli@smartx.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-18 13:41:11 -07:00
Li Feng
ec58afb49e nvme-tcp: Export the nvme_tcp_wq to sysfs
Make the workqueue userspace visible for easy viewing and configuration.

Signed-off-by: Li Feng <fengli@smartx.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-18 13:41:11 -07:00
Jiawei Fu (iBug)
e89086c43f drivers/nvme: Add quirks for device 126f:2262
This commit adds NVME_QUIRK_NO_DEEPEST_PS and NVME_QUIRK_BOGUS_NID for
device [126f:2262], which appears to be a generic VID:PID pair used for
many SSDs based on the Silicon Motion SM2262/SM2262EN controller.

Two of my SSDs with this VID:PID pair exhibit the same behavior:

  * They frequently have trouble exiting the deepest power state (5),
    leaving the entire disk unresponsive.
    Verified by setting nvme_core.default_ps_max_latency_us=10000 and
    observing them behaving normally.
  * They produce all-zero nguid and eui64 with `nvme id-ns` command.

The offending products are:

  * HP SSD EX950 1TB
  * HIKVISION C2000Pro 2TB
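
The second symptom above can be checked with a small script. This is only a
sketch: it assumes the `nvme id-ns` output contains "key : value" lines with
`nguid` and `eui64` fields, and the function names are hypothetical.

```python
def parse_fields(text):
    """Parse 'key : value' lines into a dict (assumed output format)."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def has_all_zero_ids(id_ns_text):
    """Return True when both nguid and eui64 are present and all zeros."""
    fields = parse_fields(id_ns_text)
    ids = [fields.get("nguid", ""), fields.get("eui64", "")]
    return all(value != "" and set(value) == {"0"} for value in ids)
```

A device that reports all-zero identifiers in this check shows the behavior
that NVME_QUIRK_BOGUS_NID is added for in this commit.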

Signed-off-by: Jiawei Fu <i@ibugone.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-18 13:31:00 -07:00
Guixin Liu
798edad968 nvme: parse format command's lbafu when tracing
Add parsing of the format command's lbafu to calculate lbaf.

Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-14 11:38:28 -07:00
Guixin Liu
6a0164f9f4 nvme: add tracing of reservation commands
Add detailed parsing of reservation commands to make the trace log
more consistent and human-readable.

Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-14 11:38:28 -07:00
Guixin Liu
8d539f755c nvme: parse zns command's zsa and zrasf to string
Parse the zone mgmt send command's zsa and the receive command's
zrasf into strings to make the trace log more human-readable.

Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-14 11:36:10 -07:00
Guixin Liu
dcad6f5f43 nvme: use nvme_disk_is_ns_head helper
Use the nvme_disk_is_ns_head helper instead of checking fops directly,
and also drop the CONFIG_NVME_MULTIPATH check.

Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-14 11:34:55 -07:00
Chunguang Xu
de105068fe nvme: fix reconnection fail due to reserved tag allocation
We found an issue in a production environment while using NVMe over RDMA:
admin_q reconnect failed forever while the remote target and network were
OK. After digging into it, we found it may be caused by an ABBA deadlock
due to tag allocation. In my case, the tag was held by a keep alive
request waiting inside admin_q. As we quiesced admin_q while resetting the
ctrl, the request was marked as idle and will not be processed before the
reset succeeds. As fabric_q shares a tagset with admin_q, we need a tag
for the connect command while reconnecting to the remote target, but the
only reserved tag was held by the keep alive command waiting inside
admin_q. As a result, we failed to reconnect admin_q forever. In order to
fix this issue, I think we should keep two reserved tags for the admin
queue.
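
The tag starvation described above can be illustrated with a toy model.
This is not the blk-mq tag allocator; the class and function names are
hypothetical, and where the real allocation would wait for a tag, the model
simply reports failure.

```python
class ReservedTags:
    """Toy pool of reserved tags shared by admin_q and fabric_q."""

    def __init__(self, count):
        self.free = count

    def alloc(self):
        """Take one reserved tag; return False if none are left."""
        if self.free == 0:
            return False
        self.free -= 1
        return True


def reconnect_succeeds(reserved_tags):
    """Model a reset while a keep alive request is parked in admin_q."""
    tags = ReservedTags(reserved_tags)

    # The keep alive request takes a reserved tag, then sits in the
    # quiesced admin_q, so its tag is not released until the reset succeeds.
    tags.alloc()

    # The connect command needs a reserved tag from the same shared tagset.
    return tags.alloc()
```

With one reserved tag, reconnect_succeeds(1) is False because the keep
alive request holds the only tag; with two, reconnect_succeeds(2) is True,
which is the effect the fix aims for.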

Fixes: ed01fee283 ("nvme-fabrics: only reserve a single tag")
Signed-off-by: Chunguang Xu <chunguang.xu@shopee.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-14 11:32:39 -07:00
Linus Torvalds
9187210eee Networking changes for 6.9.
Core & protocols
 ----------------
 
  - Large effort by Eric to lower rtnl_lock pressure and remove locks:
 
    - Make commonly used parts of rtnetlink (address, route dumps etc.)
      lockless, protected by RCU instead of rtnl_lock.
 
    - Add a netns exit callback which already holds rtnl_lock,
      allowing netns exit to take rtnl_lock once in the core
      instead of once for each driver / callback.
 
    - Remove locks / serialization in the socket diag interface.
 
    - Remove 6 calls to synchronize_rcu() while holding rtnl_lock.
 
    - Remove the dev_base_lock, depend on RCU where necessary.
 
  - Support busy polling on a per-epoll context basis. Poll length
    and budget parameters can be set independently of system defaults.
 
  - Introduce struct net_hotdata, to make sure read-mostly global config
    variables fit in as few cache lines as possible.
 
  - Add optional per-nexthop statistics to ease monitoring / debug
    of ECMP imbalance problems.
 
  - Support TCP_NOTSENT_LOWAT in MPTCP.
 
  - Ensure that IPv6 temporary addresses' preferred lifetimes are long
    enough, compared to other configured lifetimes, and at least 2 sec.
 
  - Support forwarding of ICMP Error messages in IPSec, per RFC 4301.
 
  - Add support for the independent control state machine for bonding
    per IEEE 802.1AX-2008 5.4.15 in addition to the existing coupled
    control state machine.
 
  - Add "network ID" to MCTP socket APIs to support hosts with multiple
    disjoint MCTP networks.
 
  - Re-use the mono_delivery_time skbuff bit for packets which user
    space wants to be sent at a specified time. Maintain the timing
    information while traversing veth links, bridge etc.
 
  - Take advantage of MSG_SPLICE_PAGES for RxRPC DATA and ACK packets.
 
  - Simplify many places iterating over netdevs by using an xarray
    instead of a hash table walk (hash table remains in place, for
    use on fastpaths).
 
  - Speed up scanning for expired routes by keeping a dedicated list.
 
  - Speed up "generic" XDP by trying harder to avoid large allocations.
 
  - Support attaching arbitrary metadata to netconsole messages.
 
 Things we sprinkled into general kernel code
 --------------------------------------------
 
  - Enforce VM_IOREMAP flag and range in ioremap_page_range and introduce
    VM_SPARSE kind and vm_area_[un]map_pages (used by bpf_arena).
 
  - Rework selftest harness to enable the use of the full range of
    ksft exit code (pass, fail, skip, xfail, xpass).
 
 Netfilter
 ---------
 
  - Allow userspace to define a table that is exclusively owned by a daemon
    (via netlink socket aliveness) without auto-removing this table when
    the userspace program exits. Such table gets marked as orphaned and
    a restarting management daemon can re-attach/regain ownership.
 
  - Speed up element insertions to nftables' concatenated-ranges set type.
    Compact a few related data structures.
 
 BPF
 ---
 
  - Add BPF token support for delegating a subset of BPF subsystem
    functionality from privileged system-wide daemons such as systemd
    through special mount options for userns-bound BPF fs to a trusted
    & unprivileged application.
 
  - Introduce bpf_arena which is sparse shared memory region between BPF
    program and user space where structures inside the arena can have
    pointers to other areas of the arena, and pointers work seamlessly
    for both user-space programs and BPF programs.
 
  - Introduce may_goto instruction that is a contract between the verifier
    and the program. The verifier allows the program to loop assuming it's
    behaving well, but reserves the right to terminate it.
 
  - Extend the BPF verifier to enable static subprog calls in spin lock
    critical sections.
 
  - Support registration of struct_ops types from modules which helps
    projects like fuse-bpf that seeks to implement a new struct_ops type.
 
  - Add support for retrieval of cookies for perf/kprobe multi links.
 
  - Support arbitrary TCP SYN cookie generation / validation in the TC
    layer with BPF to allow creating SYN flood handling in BPF firewalls.
 
  - Add code generation to inline the bpf_kptr_xchg() helper which
    improves performance when stashing/popping the allocated BPF objects.
 
 Wireless
 --------
 
  - Add SPP (signaling and payload protected) AMSDU support.
 
  - Support wider bandwidth OFDMA, as required for EHT operation.
 
 Driver API
 ----------
 
  - Major overhaul of the Energy Efficient Ethernet internals to support
    new link modes (2.5GE, 5GE), share more code between drivers
    (especially those using phylib), and encourage more uniform behavior.
    Convert and clean up drivers.
 
  - Define an API for querying per netdev queue statistics from drivers.
 
  - IPSec: account in global stats for fully offloaded sessions.
 
  - Create a concept of Ethernet PHY Packages at the Device Tree level,
    to allow parameterizing the existing PHY package code.
 
  - Enable Rx hashing (RSS) on GTP protocol fields.
 
 Misc
 ----
 
  - Improvements and refactoring all over networking selftests.
 
  - Create uniform module aliases for TC classifiers, actions,
    and packet schedulers to simplify creating modprobe policies.
 
  - Address all missing MODULE_DESCRIPTION() warnings in networking.
 
  - Extend the Netlink descriptions in YAML to cover message encapsulation
    or "Netlink polymorphism", where interpretation of nested attributes
    depends on link type, classifier type or some other "class type".
 
 Drivers
 -------
 
  - Ethernet high-speed NICs:
    - Add a new driver for Marvell's Octeon PCI Endpoint NIC VF.
    - Intel (100G, ice, idpf):
      - support E825-C devices
    - nVidia/Mellanox:
      - support devices with one port and multiple PCIe links
    - Broadcom (bnxt):
      - support n-tuple filters
      - support configuring the RSS key
    - Wangxun (ngbe/txgbe):
      - implement irq_domain for TXGBE's sub-interrupts
    - Pensando/AMD:
      - support XDP
      - optimize queue submission and wakeup handling (+17% bps)
      - optimize struct layout, saving 28% of memory on queues
 
  - Ethernet NICs embedded and virtual:
    - Google cloud vNIC:
      - refactor driver to perform memory allocations for new queue
        config before stopping and freeing the old queue memory
    - Synopsys (stmmac):
      - obey queueMaxSDU and implement counters required by 802.1Qbv
    - Renesas (ravb):
      - support packet checksum offload
      - suspend to RAM and runtime PM support
 
  - Ethernet switches:
    - nVidia/Mellanox:
      - support for nexthop group statistics
    - Microchip:
      - ksz8: implement PHY loopback
      - add support for KSZ8567, a 7-port 10/100Mbps switch
 
  - PTP:
    - New driver for RENESAS FemtoClock3 Wireless clock generator.
    - Support OCP PTP cards designed and built by Adva.
 
  - CAN:
    - Support recvmsg() flags for own, local and remote traffic
      on CAN BCM sockets.
    - Support for esd GmbH PCIe/402 CAN device family.
    - m_can:
      - Rx/Tx submission coalescing
      - wake on frame Rx
 
  - WiFi:
    - Intel (iwlwifi):
      - enable signaling and payload protected A-MSDUs
      - support wider-bandwidth OFDMA
      - support for new devices
      - bump FW API to 89 for AX devices; 90 for BZ/SC devices
    - MediaTek (mt76):
      - mt7915: newer ADIE version support
      - mt7925: radio temperature sensor support
    - Qualcomm (ath11k):
      - support 6 GHz station power modes: Low Power Indoor (LPI),
        Standard Power (SP) and Very Low Power (VLP)
      - QCA6390 & WCN6855: support 2 concurrent station interfaces
      - QCA2066 support
    - Qualcomm (ath12k):
      - refactoring in preparation for Multi-Link Operation (MLO) support
      - 1024 Block Ack window size support
      - firmware-2.bin support
      - support having multiple identical PCI devices (firmware needs to
        have ATH12K_FW_FEATURE_MULTI_QRTR_ID)
      - QCN9274: support split-PHY devices
      - WCN7850: enable Power Save Mode in station mode
      - WCN7850: P2P support
    - RealTek:
      - rtw88: support for more rtw8811cu and rtw8821cu devices
      - rtw89: support SCAN_RANDOM_SN and SET_SCAN_DWELL
      - rtlwifi: speed up USB firmware initialization
      - rtl8xxxu:
        - RTL8188F: concurrent interface support
        - Channel Switch Announcement (CSA) support in AP mode
    - Broadcom (brcmfmac):
      - per-vendor feature support
      - per-vendor SAE password setup
      - DMI nvram filename quirk for ACEPC W5 Pro
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'net-next-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Jakub Kicinski:
 "Core & protocols:

   - Large effort by Eric to lower rtnl_lock pressure and remove locks:

      - Make commonly used parts of rtnetlink (address, route dumps
        etc) lockless, protected by RCU instead of rtnl_lock.

      - Add a netns exit callback which already holds rtnl_lock,
        allowing netns exit to take rtnl_lock once in the core instead
        of once for each driver / callback.

      - Remove locks / serialization in the socket diag interface.

      - Remove 6 calls to synchronize_rcu() while holding rtnl_lock.

      - Remove the dev_base_lock, depend on RCU where necessary.

   - Support busy polling on a per-epoll context basis. Poll length and
     budget parameters can be set independently of system defaults.

   - Introduce struct net_hotdata, to make sure read-mostly global
     config variables fit in as few cache lines as possible.

   - Add optional per-nexthop statistics to ease monitoring / debug of
     ECMP imbalance problems.

   - Support TCP_NOTSENT_LOWAT in MPTCP.

   - Ensure that IPv6 temporary addresses' preferred lifetimes are long
     enough, compared to other configured lifetimes, and at least 2 sec.

   - Support forwarding of ICMP Error messages in IPSec, per RFC 4301.

   - Add support for the independent control state machine for bonding
     per IEEE 802.1AX-2008 5.4.15 in addition to the existing coupled
     control state machine.

   - Add "network ID" to MCTP socket APIs to support hosts with multiple
     disjoint MCTP networks.

   - Re-use the mono_delivery_time skbuff bit for packets which user
     space wants to be sent at a specified time. Maintain the timing
     information while traversing veth links, bridge etc.

   - Take advantage of MSG_SPLICE_PAGES for RxRPC DATA and ACK packets.

   - Simplify many places iterating over netdevs by using an xarray
     instead of a hash table walk (hash table remains in place, for use
     on fastpaths).

   - Speed up scanning for expired routes by keeping a dedicated list.

   - Speed up "generic" XDP by trying harder to avoid large allocations.

   - Support attaching arbitrary metadata to netconsole messages.

  Things we sprinkled into general kernel code:

   - Enforce VM_IOREMAP flag and range in ioremap_page_range and
     introduce VM_SPARSE kind and vm_area_[un]map_pages (used by
     bpf_arena).

   - Rework selftest harness to enable the use of the full range of ksft
     exit code (pass, fail, skip, xfail, xpass).

  Netfilter:

   - Allow userspace to define a table that is exclusively owned by a
     daemon (via netlink socket aliveness) without auto-removing this
     table when the userspace program exits. Such table gets marked as
     orphaned and a restarting management daemon can re-attach/regain
     ownership.

   - Speed up element insertions to nftables' concatenated-ranges set
     type. Compact a few related data structures.

  BPF:

   - Add BPF token support for delegating a subset of BPF subsystem
     functionality from privileged system-wide daemons such as systemd
     through special mount options for userns-bound BPF fs to a trusted
     & unprivileged application.

   - Introduce bpf_arena which is sparse shared memory region between
     BPF program and user space where structures inside the arena can
     have pointers to other areas of the arena, and pointers work
     seamlessly for both user-space programs and BPF programs.

   - Introduce may_goto instruction that is a contract between the
     verifier and the program. The verifier allows the program to loop
     assuming it's behaving well, but reserves the right to terminate
     it.

   - Extend the BPF verifier to enable static subprog calls in spin lock
     critical sections.

   - Support registration of struct_ops types from modules which helps
     projects like fuse-bpf that seeks to implement a new struct_ops
     type.

   - Add support for retrieval of cookies for perf/kprobe multi links.

   - Support arbitrary TCP SYN cookie generation / validation in the TC
     layer with BPF to allow creating SYN flood handling in BPF
     firewalls.

   - Add code generation to inline the bpf_kptr_xchg() helper which
     improves performance when stashing/popping the allocated BPF
     objects.

  Wireless:

   - Add SPP (signaling and payload protected) AMSDU support.

   - Support wider bandwidth OFDMA, as required for EHT operation.

  Driver API:

   - Major overhaul of the Energy Efficient Ethernet internals to
     support new link modes (2.5GE, 5GE), share more code between
     drivers (especially those using phylib), and encourage more
     uniform behavior. Convert and clean up drivers.

   - Define an API for querying per netdev queue statistics from
     drivers.

   - IPSec: account in global stats for fully offloaded sessions.

   - Create a concept of Ethernet PHY Packages at the Device Tree level,
     to allow parameterizing the existing PHY package code.

   - Enable Rx hashing (RSS) on GTP protocol fields.

  Misc:

   - Improvements and refactoring all over networking selftests.

   - Create uniform module aliases for TC classifiers, actions, and
     packet schedulers to simplify creating modprobe policies.

   - Address all missing MODULE_DESCRIPTION() warnings in networking.

   - Extend the Netlink descriptions in YAML to cover message
     encapsulation or "Netlink polymorphism", where interpretation of
     nested attributes depends on link type, classifier type or some
     other "class type".

  Drivers:

   - Ethernet high-speed NICs:
      - Add a new driver for Marvell's Octeon PCI Endpoint NIC VF.
      - Intel (100G, ice, idpf):
         - support E825-C devices
      - nVidia/Mellanox:
         - support devices with one port and multiple PCIe links
      - Broadcom (bnxt):
         - support n-tuple filters
         - support configuring the RSS key
      - Wangxun (ngbe/txgbe):
         - implement irq_domain for TXGBE's sub-interrupts
      - Pensando/AMD:
         - support XDP
         - optimize queue submission and wakeup handling (+17% bps)
         - optimize struct layout, saving 28% of memory on queues

   - Ethernet NICs embedded and virtual:
      - Google cloud vNIC:
         - refactor driver to perform memory allocations for new queue
           config before stopping and freeing the old queue memory
      - Synopsys (stmmac):
         - obey queueMaxSDU and implement counters required by 802.1Qbv
      - Renesas (ravb):
         - support packet checksum offload
         - suspend to RAM and runtime PM support

   - Ethernet switches:
      - nVidia/Mellanox:
         - support for nexthop group statistics
      - Microchip:
         - ksz8: implement PHY loopback
         - add support for KSZ8567, a 7-port 10/100Mbps switch

   - PTP:
      - New driver for RENESAS FemtoClock3 Wireless clock generator.
      - Support OCP PTP cards designed and built by Adva.

   - CAN:
      - Support recvmsg() flags for own, local and remote traffic on CAN
        BCM sockets.
      - Support for esd GmbH PCIe/402 CAN device family.
      - m_can:
         - Rx/Tx submission coalescing
         - wake on frame Rx

   - WiFi:
      - Intel (iwlwifi):
         - enable signaling and payload protected A-MSDUs
         - support wider-bandwidth OFDMA
         - support for new devices
         - bump FW API to 89 for AX devices; 90 for BZ/SC devices
      - MediaTek (mt76):
         - mt7915: newer ADIE version support
         - mt7925: radio temperature sensor support
      - Qualcomm (ath11k):
         - support 6 GHz station power modes: Low Power Indoor (LPI),
           Standard Power (SP) and Very Low Power (VLP)
         - QCA6390 & WCN6855: support 2 concurrent station interfaces
         - QCA2066 support
      - Qualcomm (ath12k):
         - refactoring in preparation for Multi-Link Operation (MLO)
           support
         - 1024 Block Ack window size support
         - firmware-2.bin support
         - support having multiple identical PCI devices (firmware needs
           to have ATH12K_FW_FEATURE_MULTI_QRTR_ID)
         - QCN9274: support split-PHY devices
         - WCN7850: enable Power Save Mode in station mode
         - WCN7850: P2P support
      - RealTek:
         - rtw88: support for more rtw8811cu and rtw8821cu devices
         - rtw89: support SCAN_RANDOM_SN and SET_SCAN_DWELL
         - rtlwifi: speed up USB firmware initialization
         - rtl8xxxu:
             - RTL8188F: concurrent interface support
             - Channel Switch Announcement (CSA) support in AP mode
      - Broadcom (brcmfmac):
         - per-vendor feature support
         - per-vendor SAE password setup
         - DMI nvram filename quirk for ACEPC W5 Pro"

* tag 'net-next-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2255 commits)
  nexthop: Fix splat with CONFIG_DEBUG_PREEMPT=y
  nexthop: Fix out-of-bounds access during attribute validation
  nexthop: Only parse NHA_OP_FLAGS for dump messages that require it
  nexthop: Only parse NHA_OP_FLAGS for get messages that require it
  bpf: move sleepable flag from bpf_prog_aux to bpf_prog
  bpf: hardcode BPF_PROG_PACK_SIZE to 2MB * num_possible_nodes()
  selftests/bpf: Add kprobe multi triggering benchmarks
  ptp: Move from simple ida to xarray
  vxlan: Remove generic .ndo_get_stats64
  vxlan: Do not alloc tstats manually
  devlink: Add comments to use netlink gen tool
  nfp: flower: handle acti_netdevs allocation failure
  net/packet: Add getsockopt support for PACKET_COPY_THRESH
  net/netlink: Add getsockopt support for NETLINK_LISTEN_ALL_NSID
  selftests/bpf: Add bpf_arena_htab test.
  selftests/bpf: Add bpf_arena_list test.
  selftests/bpf: Add unit tests for bpf_arena_alloc/free_pages
  bpf: Add helper macro bpf_addr_space_cast()
  libbpf: Recognize __arena global variables.
  bpftool: Recognize arena map type
  ...
2024-03-12 17:44:08 -07:00
Linus Torvalds
1ddeeb2a05 for-6.9/block-20240310

Merge tag 'for-6.9/block-20240310' of git://git.kernel.dk/linux

Pull block updates from Jens Axboe:

 - MD pull requests via Song:
      - Cleanup redundant checks (Yu Kuai)
      - Remove deprecated headers (Marc Zyngier, Song Liu)
      - Concurrency fixes (Li Lingfeng)
      - Memory leak fix (Li Nan)
      - Refactor raid1 read_balance (Yu Kuai, Paul Luse)
      - Clean up and fix for md_ioctl (Li Nan)
      - Other small fixes (Gui-Dong Han, Heming Zhao)
      - MD atomic limits (Christoph)

 - NVMe pull request via Keith:
      - RDMA target enhancements (Max)
      - Fabrics fixes (Max, Guixin, Hannes)
      - Atomic queue_limits usage (Christoph)
      - Const use for class_register (Ricardo)
      - Identification error handling fixes (Shin'ichiro, Keith)

 - Improvement and cleanup for cached request handling (Christoph)

 - Moving towards atomic queue limits. Core changes and driver bits so
   far (Christoph)

 - Fix UAF issues in aoeblk (Chun-Yi)

 - Zoned fix and cleanups (Damien)

 - s390 dasd cleanups and fixes (Jan, Miroslav)

 - Block issue timestamp caching (me)

 - noio scope guarding for zoned IO (Johannes)

 - block/nvme PI improvements (Kanchan)

 - Ability to terminate long running discard loop (Keith)

 - bdev revalidation fix (Li)

 - Get rid of old nr_queues hack for kdump kernels (Ming)

 - Support for async deletion of ublk (Ming)

 - Improve IRQ bio recycling (Pavel)

 - Factor in CPU capacity for remote vs local completion (Qais)

 - Add shared_tags configfs entry for null_blk (Shin'ichiro)

 - Fix for a regression in page refcounts introduced by the folio
   unification (Tony)

 - Misc fixes and cleanups (Arnd, Colin, John, Kunwu, Li, Navid,
   Ricardo, Roman, Tang, Uwe)

* tag 'for-6.9/block-20240310' of git://git.kernel.dk/linux: (221 commits)
  block: partitions: only define function mac_fix_string for CONFIG_PPC_PMAC
  block/swim: Convert to platform remove callback returning void
  cdrom: gdrom: Convert to platform remove callback returning void
  block: remove disk_stack_limits
  md: remove mddev->queue
  md: don't initialize queue limits
  md/raid10: use the atomic queue limit update APIs
  md/raid5: use the atomic queue limit update APIs
  md/raid1: use the atomic queue limit update APIs
  md/raid0: use the atomic queue limit update APIs
  md: add queue limit helpers
  md: add a mddev_is_dm helper
  md: add a mddev_add_trace_msg helper
  md: add a mddev_trace_remap helper
  bcache: move calculation of stripe_size and io_opt into bcache_device_init
  virtio_blk: Do not use disk_set_max_open/active_zones()
  aoe: fix the potential use-after-free problem in aoecmd_cfg_pkts
  block: move capacity validation to blkpg_do_ioctl()
  block: prevent division by zero in blk_rq_stat_sum()
  drbd: atomically update queue limits in drbd_reconsider_queue_parameters
  ...
2024-03-11 11:43:44 -07:00
Linus Torvalds
910202f00a vfs-6.9.super

Merge tag 'vfs-6.9.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull block handle updates from Christian Brauner:
 "Last cycle we changed opening of block devices, and opening a block
  device would return a bdev_handle. This allowed us to implement
  support for restricting and forbidding writes to mounted block
  devices. It was accompanied by converting and adding helpers to
  operate on bdev_handles instead of plain block devices.

  That was already a good step forward but ultimately it isn't necessary
  to have special purpose helpers for opening block devices internally
  that return a bdev_handle.

  Fundamentally, opening a block device internally should just be
  equivalent to opening files. So now all internal opens of block
  devices return files just as a userspace open would. Instead of
  introducing a separate indirection into bdev_open_by_*() via struct
  bdev_handle bdev_file_open_by_*() is made to just return a struct
  file. Opening and closing a block device just becomes equivalent to
  opening and closing a file.

  This all works well because internally we already have a pseudo fs for
  block devices and so opening block devices is simple. There's a few
  places where we needed to be careful such as during boot when the
  kernel is supposed to mount the rootfs directly without init doing it.
  Here we need to take care to ensure that we flush out any asynchronous
  file close. That's what we already do for opening, unpacking, and
  closing the initramfs. So nothing new here.

  The equivalence of opening and closing block devices to regular files
  is a win in and of itself. But it also has various other advantages.
  We can remove struct bdev_handle completely. Various low-level helpers
  are now private to the block layer. Other helpers were simply
  removable completely.

  A follow-up series that is already reviewed builds on this and makes it
  possible to remove bdev->bd_inode and allows various clean ups of the
  buffer head code as well. All places where we stashed a bdev_handle
  now just stash a file and use simple accessors to get to the actual
  block device which was already the case for bdev_handle"

* tag 'vfs-6.9.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (35 commits)
  block: remove bdev_handle completely
  block: don't rely on BLK_OPEN_RESTRICT_WRITES when yielding write access
  bdev: remove bdev pointer from struct bdev_handle
  bdev: make struct bdev_handle private to the block layer
  bdev: make bdev_{release, open_by_dev}() private to block layer
  bdev: remove bdev_open_by_path()
  reiserfs: port block device access to file
  ocfs2: port block device access to file
  nfs: port block device access to files
  jfs: port block device access to file
  f2fs: port block device access to files
  ext4: port block device access to file
  erofs: port device access to file
  btrfs: port device access to file
  bcachefs: port block device access to file
  target: port block device access to file
  s390: port block device access to file
  nvme: port block device access to file
  block2mtd: port device access to files
  bcache: port block device access to files
  ...
2024-03-11 10:52:34 -07:00
Guixin Liu
2bc9174309 nvmet: add tracing of zns commands
Add nvme_cmd_zone_append, nvme_cmd_zone_mgmt_send and
nvme_cmd_zone_mgmt_recv parse to nvme target tracing.

Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-08 06:58:20 -08:00
Guixin Liu
8fc3b0f1f4 nvmet: add tracing of authentication commands
Add nvme_fabrics_type_auth_send and nvme_fabrics_type_auth_receive
to the nvme target's tracing facility.

Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-08 06:58:20 -08:00
Uwe Kleine-König
1843671f86 nvme-apple: Convert to platform remove callback returning void
The .remove() callback for a platform driver returns an int which makes
many driver authors wrongly assume it's possible to do error handling by
returning an error code. However the value returned is ignored (apart
from emitting a warning) and this typically results in resource leaks.

To improve here there is a quest to make the remove callback return
void. In the first step of this quest all drivers are converted to
.remove_new(), which already returns void. Eventually after all drivers
are converted, .remove_new() will be renamed to .remove().

Trivially convert this driver from always returning zero in the remove
callback to the void returning variant.

Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-08 06:55:47 -08:00
Hannes Reinecke
0889d13b9e nvmet-tcp: do not continue for invalid icreq
When the length check for an icreq sqe fails we should not
continue processing but rather return immediately as all
other contents of that sqe cannot be relied on.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-08 06:49:57 -08:00
Keith Busch
34485c37ea nvme: change shutdown timeout setting message
User visible messages containing the word "timeout" can be alarming.
This one from nvme is just reporting a potentially informative device
configuration, and everything is working as designed. Change the text to
report the less concerning "D3 entry latency", which is where this value
comes from anyway.

Reported-by: Len Brown <lenb@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-08 06:46:51 -08:00
Keith Busch
7e80eb792b nvme: clear caller pointer on identify failure
The memory allocated for the identification is freed on failure. Set
it to NULL so the caller doesn't have a pointer to that freed address.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-06 06:29:01 -08:00
Shin'ichiro Kawasaki
8d0d244739 nvme: host: fix double-free of struct nvme_id_ns in ns_update_nuse()
When nvme_identify_ns() fails, it frees the pointer to the struct
nvme_id_ns before it returns. However, ns_update_nuse() calls kfree()
for the pointer even when nvme_identify_ns() fails. This results in
KASAN double-free, which was observed with blktests nvme/045 with
proposed patches [1] on the kernel v6.8-rc7. Fix the double-free by
skipping kfree() when nvme_identify_ns() fails.

Link: https://lore.kernel.org/linux-block/20240304161303.19681-1-dwagner@suse.de/ [1]
Fixes: a1a825ab6a ("nvme: add csi, ms and nuse to sysfs")
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-06 06:02:15 -08:00
Ricardo B. Marliere
800bb2b02f nvme: fcloop: make fcloop_class constant
Since commit 43a7206b09 ("driver core: class: make class_register() take
a const *"), the driver core allows for struct class to be in read-only
memory, so move the fcloop_class structure to be declared at build time
placing it into read-only memory, instead of having to be dynamically
allocated at boot time.

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-05 07:56:21 -08:00
Ricardo B. Marliere
3c2bcfd5ac nvme: fabrics: make nvmf_class constant
Since commit 43a7206b09 ("driver core: class: make class_register() take
a const *"), the driver core allows for struct class to be in read-only
memory, so move the nvmf_class structure to be declared at build time
placing it into read-only memory, instead of having to be dynamically
allocated at boot time.

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-05 07:56:19 -08:00
Ricardo B. Marliere
ab21f3d909 nvme: core: constify struct class usage
Since commit 43a7206b09 ("driver core: class: make class_register() take
a const *"), the driver core allows for struct class to be in read-only
memory, so move the structures nvme_class, nvme_subsys_class and
nvme_ns_chr_class to be declared at build time placing them into read-only
memory, instead of having to be dynamically allocated at boot time.

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-05 07:56:03 -08:00
Yunsheng Lin
a0727489ac net: introduce page_frag_cache_drain()
When draining a page_frag_cache, most users perform similar
steps, so introduce an API to avoid code duplication.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-03-05 11:38:14 +01:00
Hannes Reinecke
5f5ea0e491 nvme-fabrics: typo in nvmf_parse_key()
Of course we should use the key if there is no error ...

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-04 08:26:30 -08:00
Christoph Hellwig
f7e0a545f7 nvme-multipath: use atomic queue limits API for stacking limits
Switch to the queue_limits_* helpers to stack the bdev limits, which also
includes updating the readahead settings.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-04 08:24:57 -08:00
Christoph Hellwig
c5be5df721 nvme-multipath: pass queue_limits to blk_alloc_disk
The multipath disk starts out with the stacking default limits.
The one interesting part here is that blk_set_stacking_limits
sets the max_zone_append_sectors to UINT_MAX, which fails the
validation for non-zoned devices.  With the old one call per
limit scheme this was fine because no one verified this weird
mismatch and it was fixed by blk_stack_limits a little later
before I/O could be issued.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-04 08:24:56 -08:00
Christoph Hellwig
e6c9b130d6 nvme: use the atomic queue limits update API
Changes the callchains that update queue_limits to build an on-stack
queue_limits and update it atomically.  Note that for now only the
admin queue actually passes it to the queue allocation function.
Doing the same for the gendisks used for the namespaces will require
a little more work.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-04 08:24:56 -08:00
Christoph Hellwig
27cb91a3a1 nvme: cleanup nvme_configure_metadata
Fold nvme_init_ms into nvme_configure_metadata after splitting up
a little helper to deal with the extended LBA formats.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-04 08:24:56 -08:00
Christoph Hellwig
e5ea00a510 nvme: don't query identify data in configure_metadata
Move reading the Identify Namespace Data Structure, NVM Command Set out
of configure_metadata into the caller.  This allows doing the identify
call outside the frozen I/O queues, and prepares for using data from
the Identify data structure for other purposes.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-04 08:24:56 -08:00
Christoph Hellwig
c6fce9f127 nvme: split out a nvme_identify_ns_nvm helper
Split the logic to query the Identify Namespace Data Structure, NVM
Command Set into a separate helper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-04 08:24:56 -08:00
Christoph Hellwig
46e7422cda nvme: move common logic into nvme_update_ns_info
nvme_update_ns_info_generic and nvme_update_ns_info_block share a
fair amount of logic related to not fully supported namespace
formats and updating the multipath information.  Move this logic
into the common caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-04 08:24:56 -08:00
Christoph Hellwig
d60c23e455 nvme: move setting the write cache flags out of nvme_set_queue_limits
nvme_set_queue_limits is used on the admin queue and all gendisks
including hidden ones that don't support block I/O.  The write cache
setting on the other hand only makes sense for block I/O.  Move the
blk_queue_write_cache call to nvme_update_ns_info_block instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-04 08:24:56 -08:00
Christoph Hellwig
a5b1cd6182 nvme: move a few things out of nvme_update_disk_info
Move setting up the integrity profile and setting the disk capacity out
of nvme_update_disk_info to get nvme_update_disk_info into a shape where
it just sets queue_limits eventually.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-04 08:24:56 -08:00
Christoph Hellwig
8f03cfa117 nvme: don't use nvme_update_disk_info for the multipath disk
Currently nvme_update_ns_info_block calls nvme_update_disk_info both for
the namespace attached disk, and the multipath one (if it exists).  This
is very different from how other stacking drivers work, and leads to
a lot of complexity.

Switch to setting the disk capacity and initializing the integrity
profile, and let blk_stack_limits which already is called just below
deal with updating the other limits.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-04 08:24:56 -08:00
Christoph Hellwig
414c62e2ce nvme: move blk_integrity_unregister into nvme_init_integrity
Move unregistering the existing integrity profile into the helper
dealing with all the other integrity / metadata setup.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-04 08:24:56 -08:00
Christoph Hellwig
f467b48e38 nvme: cleanup the nvme_init_integrity calling conventions
Handle the no metadata support case in nvme_init_integrity as well to
simplify the calling convention and prepare for future changes in the
area.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-04 08:24:55 -08:00
Christoph Hellwig
f404dd928b nvme: move max_integrity_segments handling out of nvme_init_integrity
max_integrity_segments is just a hardware limit and doesn't need to be
in nvme_init_integrity with the PI setup.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-04 08:24:55 -08:00
Christoph Hellwig
1b2f5d5d28 nvme: remove nvme_revalidate_zones
Handle setting the zone size / chunk_sectors and max_append_sectors
limits together with the other ZNS limits, and just open code the
call to blk_revalidate_zones in the current place.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-04 08:24:55 -08:00
Christoph Hellwig
63dfa10043 nvme: move NVME_QUIRK_DEALLOCATE_ZEROES out of nvme_config_discard
Move the handling of the NVME_QUIRK_DEALLOCATE_ZEROES quirk out of
nvme_config_discard so that it is combined with the normal write_zeroes
limit handling.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-04 08:24:55 -08:00
Christoph Hellwig
152694c829 nvme: set max_hw_sectors unconditionally
All transports set a max_hw_sectors value in the nvme_ctrl, so make
the code using it unconditional and clean it up using a little helper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-04 08:24:55 -08:00
Guixin Liu
4999568184 nvme-fabrics: check max outstanding commands
Maxcmd is mandatory for fabrics, so check it early to identify the root
cause instead of waiting for it to propagate to "sqsize" and "allocing
queue".

By the way, change nvme_check_ctrl_fabric_info() to
nvmf_validate_identify_ctrl().

Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-02 15:18:09 -08:00