Commit Graph

8576 Commits

Author SHA1 Message Date
Linus Torvalds
059dd502b2 block-6.13-20241228
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmdwJ8UQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpokHD/wN0fLwLNKF2WFCI/8VXxJ8wAQ4q61hrgKD
 PhqfIY+wsz8RnPnoma7aduVzh+JeDK2FurlMFDUXx/hSQI2CxFZHVFjdpAEcTu2z
 MHJ4xnRHR/XbfsmN3rfLgiKson15cQ7wC9IFUoM1fIDhosfWyt5vu7kvLTbLScql
 bfcoW3iBipLHOb7J272ZDjdD8QgLMtCOZ/o7/GTWZfw0ocei2xloCjg1Gxq6qCu0
 P+POptLQvqMrRwu+hG3Ff/svUM1zu36DfIXrJCrxsi1jV2kFTcXytTtNo6x7DsHR
 2gpKPs/v1TJQSMg7CHtTdPAK7o/7mjZxjTlRDMR95l+F4EEC7nVvi0D+ZaAcCznA
 d9oJ9QMGlB99blYN838o17v0K3shy1BhPJt9NO7jqbls/tD/i4U+wJfMWDjhK6oA
 xmNifLb4ecSTiCnGmOfLnCky4ELSpUoQMPg9vu/FtYd09dxHiUhx+8ZBkTQ7TXLT
 I5MDE+5fh36NaLzlyjJM1jxv5KNQSAd5Ow47zCejvjgvAtFJoEXpfQV7/2n3YYm7
 XVxBHhY/e/mXYCYAQ7hwTfQjIto7NurET0QX0LYpOHBpRLADf3okV0VA4jScwcRK
 SbMe5puCADCFEv1AUBVgs7HSUIo5yiXmBGko7MvHBIC3o7qedGc1YDEW4tYwQWDJ
 czYf97xi1Q==
 =HcC5
 -----END PGP SIGNATURE-----

Merge tag 'block-6.13-20241228' of git://git.kernel.dk/linux

Pull block fix from Jens Axboe:
 "Just a single fix for ublk setup error handling"

* tag 'block-6.13-20241228' of git://git.kernel.dk/linux:
  ublk: detach gendisk from ublk device if add_disk() fails
2024-12-28 11:02:35 -08:00
Ming Lei
75cd4005da ublk: detach gendisk from ublk device if add_disk() fails
Inside ublk_abort_requests(), gendisk is grabbed for aborting all
inflight requests. And ublk_abort_requests() is called when exiting
the uring context or handling timeout.

If add_disk() fails, the gendisk may have been freed when calling
ublk_abort_requests(), so use-after-free can be caused when getting
disk's reference in ublk_abort_requests().

Fixes the bug by detaching gendisk from ublk device if add_disk() fails.

Fixes: bd23f6c2c2 ("ublk: quiesce request queue when aborting queue")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241225110640.351531-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-26 06:42:55 -07:00
Christoph Hellwig
cc76ace465 block: remove BLK_MQ_F_SHOULD_MERGE
BLK_MQ_F_SHOULD_MERGE is set for all tag_sets except those that purely
process passthrough commands (bsg-lib, ufs tmf, various nvme admin
queues) and thus don't even check the flag.  Remove it to simplify the
driver interface.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241219060214.1928848-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-23 08:17:23 -07:00
Daniel Wagner
a5665c3d15 virtio: blk/scsi: replace blk_mq_virtio_map_queues with blk_mq_map_hw_queues
Replace all users of blk_mq_virtio_map_queues with the more generic
blk_mq_map_hw_queues. This in preparation to retire
blk_mq_virtio_map_queues.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Daniel Wagner <wagi@kernel.org>
Link: https://lore.kernel.org/r/20241202-refactor-blk-affinity-helpers-v6-7-27211e9c2cd5@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-23 08:17:23 -07:00
Matthew Wilcox (Oracle)
0e20669a91 null_blk: Remove accesses to page->index
Use page->private to store the index instead of page->index.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20241216160849.31739-1-willy@infradead.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-23 08:17:23 -07:00
Benoît du Garreau
53328a3671 block: rnull: Initialize the module in place
Using `InPlaceModule` avoids an allocation and an indirection.

Signed-off-by: Benoît du Garreau <benoit@dugarreau.fr>
Acked-by: Andreas Hindborg <a.hindborg@kernel.org>
Link: https://lore.kernel.org/r/20241204-rnull_in_place-v1-1-efe3eafac9fb@dugarreau.fr
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-23 08:17:23 -07:00
John Garry
19206d3f5e block: Delete bio_set_prio()
Since commit 43b62ce3ff ("block: move bio io prio to a new field"), macro
bio_set_prio() does nothing but set bio->bi_ioprio. All other places just
set bio->bi_ioprio directly, so replace bio_set_prio() remaining
callsites with setting bio->bi_ioprio directly and delete that macro.

Signed-off-by: John Garry <john.g.garry@oracle.com>
Acked-by: Jack Wang <jinpu.wang@ionos.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20241202111957.2311683-3-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-23 08:17:23 -07:00
Damien Le Moal
b56426bcf8 null_blk: Add rotational feature support
To facilitate testing of kernel functions related to the rotational
feature (BLK_FEAT_ROTATIONAL) of a block device (e.g. NVMe rotational
bit support), add the rotational boolean configfs attribute and module
parameter to the null_blk driver. If set, a null block device will
report being a rotational device through it queue limits features with
the BLK_FEAT_ROTATIONAL flag.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20241126000956.95983-1-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-23 08:17:22 -07:00
Kairui Song
74363ec674 zram: fix uninitialized ZRAM not releasing backing device
Setting backing device is done before ZRAM initialization.  If we set the
backing device, then remove the ZRAM module without initializing the
device, the backing device reference will be leaked and the device will be
hold forever.

Fix this by always reset the ZRAM fully on rmmod or reset store.

Link: https://lkml.kernel.org/r/20241209165717.94215-3-ryncsn@gmail.com
Fixes: 013bf95a83 ("zram: add interface to specif backing device")
Signed-off-by: Kairui Song <kasong@tencent.com>
Reported-by: Desheng Wu <deshengwu@tencent.com>
Suggested-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-18 19:04:44 -08:00
Kairui Song
be48c412f6 zram: refuse to use zero sized block device as backing device
Patch series "zram: fix backing device setup issue", v2.

This series fixes two bugs of backing device setting:

- ZRAM should reject using a zero sized (or the uninitialized ZRAM
  device itself) as the backing device.
- Fix backing device leaking when removing a uninitialized ZRAM
  device.


This patch (of 2):

Setting a zero sized block device as backing device is pointless, and one
can easily create a recursive loop by setting the uninitialized ZRAM
device itself as its own backing device by (zram0 is uninitialized):

    echo /dev/zram0 > /sys/block/zram0/backing_dev

It's definitely a wrong config, and the module will pin itself, kernel
should refuse doing so in the first place.

By refusing to use zero sized device we avoided misuse cases including
this one above.

Link: https://lkml.kernel.org/r/20241209165717.94215-1-ryncsn@gmail.com
Link: https://lkml.kernel.org/r/20241209165717.94215-2-ryncsn@gmail.com
Fixes: 013bf95a83 ("zram: add interface to specif backing device")
Signed-off-by: Kairui Song <kasong@tencent.com>
Reported-by: Desheng Wu <deshengwu@tencent.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-18 19:04:44 -08:00
Ming Lei
7678abee08 virtio-blk: don't keep queue frozen during system suspend
Commit 4ce6e2db00 ("virtio-blk: Ensure no requests in virtqueues before
deleting vqs.") replaces queue quiesce with queue freeze in virtio-blk's
PM callbacks. And the motivation is to drain inflight IOs before suspending.

block layer's queue freeze looks very handy, but it is also easy to cause
deadlock, such as, any attempt to call into bio_queue_enter() may run into
deadlock if the queue is frozen in current context. There are all kinds
of ->suspend() called in suspend context, so keeping queue frozen in the
whole suspend context isn't one good idea. And Marek reported lockdep
warning[1] caused by virtio-blk's freeze queue in virtblk_freeze().

[1] https://lore.kernel.org/linux-block/ca16370e-d646-4eee-b9cc-87277c89c43c@samsung.com/

Given the motivation is to drain in-flight IOs, it can be done by calling
freeze & unfreeze, meantime restore to previous behavior by keeping queue
quiesced during suspend.

Cc: Yi Sun <yi.sun@unisoc.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: virtualization@lists.linux.dev
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Link: https://lore.kernel.org/r/20241112125821.1475793-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-05 10:00:20 -07:00
FUJITA Tomonori
3c93e4e4a2 block: rnull: add missing MODULE_DESCRIPTION
Add the missing description to fix the following warning:

WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/block/rnull_mod.o

Signed-off-by: FUJITA Tomonori <fujita.tomonori@gmail.com>
Acked-by: Andreas Hindborg <a.hindborg@kernel.org>
Link: https://lore.kernel.org/r/20241130094521.193924-1-fujita.tomonori@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-03 06:34:11 -07:00
Linus Torvalds
e70140ba0d Get rid of 'remove_new' relic from platform driver struct
The continual trickle of small conversion patches is grating on me, and
is really not helping.  Just get rid of the 'remove_new' member
function, which is just an alias for the plain 'remove', and had a
comment to that effect:

  /*
   * .remove_new() is a relic from a prototype conversion of .remove().
   * New drivers are supposed to implement .remove(). Once all drivers are
   * converted to not use .remove_new any more, it will be dropped.
   */

This was just a tree-wide 'sed' script that replaced '.remove_new' with
'.remove', with some care taken to turn a subsequent tab into two tabs
to make things line up.

I did do some minimal manual whitespace adjustment for places that used
spaces to line things up.

Then I just removed the old (sic) .remove_new member function, and this
is the end result.  No more unnecessary conversion noise.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-12-01 15:12:43 -08:00
Linus Torvalds
cfd47302ac block-6.13-20242901
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmdJ6jwQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpvgeEADaPL+qmsRyp070K1OI/yTA9Jhf6liEyJ31
 0GPVar5Vt6ObH6/POObrJqbBtAo5asQanFvwyVyLztKYxPHU7sdQaRcD+vvj7q3+
 EhmQZSKM7Spp77awWhRWbeQfUBvdhTeGHjQH/0e60eIrF9KtEL9sM9hVqc8hBD9F
 YtDNWPCk7Rz1PPNYlGEkQ2JmYmaxh3Gn29c/k1cSSo3InEQOFj6x+0Cgz6RjbTx3
 9HfpLhVG3WV5MlZCCwp7KG36aJzlc0nq53x/sC9cg+F17RvL2EwNAOUfLl75/Kp/
 t7PCQSd2ODciiDN9qZW71KGtVtlJ07W048Rk0nB+ogneC0uh4fuIYTidP9D7io7D
 bBMrhDuUpnlPzlOqg0aeedXePQL7TRfT3CTentol6xldqg14n7C4QTQFQMSJCgJf
 gr4YCTwl0RTknXo0A3ja16XwsUq5+2xsSoCTU25TY+wgKiAcc5lN9fhbvPRzbCQC
 u9EQ9I9IFAMqEdnE51sw0x16fLtN2w4/zOkvTF+gD/KooEjSn9lcfeNue7jt1O0/
 gFvFJCdXK/2GgxwHihvsEVdcNeaS8JowNafKUsfOM2G0qWQbY+l2vl/b5PfwecWi
 0knOaqNWlGMwrQ+z+fgsEeFG7X98ninC7tqVZpzoZ7j0x65anH+Jq4q1Egongj0H
 90zclclxjg==
 =6cbB
 -----END PGP SIGNATURE-----

Merge tag 'block-6.13-20242901' of git://git.kernel.dk/linux

Pull more block updates from Jens Axboe:

 - NVMe pull request via Keith:
      - Use correct srcu list traversal (Breno)
      - Scatter-gather support for metadata (Keith)
      - Fabrics shutdown race condition fix (Nilay)
      - Persistent reservations updates (Guixin)

 - Add the required bits for MD atomic write support for raid0/1/10

 - Correct return value for unknown opcode in ublk

 - Fix deadlock with zone revalidation

 - Fix for the io priority request vs bio cleanups

 - Use the correct unsigned int type for various limit helpers

 - Fix for a race in loop

 - Cleanup blk_rq_prep_clone() to prevent uninit-value warning and make
   it easier for actual humans to read

 - Fix potential UAF when iterating tags

 - A few fixes for bfq-iosched UAF issues

 - Fix for brd discard not decrementing the allocated page count

 - Various little fixes and cleanups

* tag 'block-6.13-20242901' of git://git.kernel.dk/linux: (36 commits)
  brd: decrease the number of allocated pages which discarded
  block, bfq: fix bfqq uaf in bfq_limit_depth()
  block: Don't allow an atomic write be truncated in blkdev_write_iter()
  mq-deadline: don't call req_get_ioprio from the I/O completion handler
  block: Prevent potential deadlock in blk_revalidate_disk_zones()
  block: Remove extra part pointer NULLify in blk_rq_init()
  nvme: tuning pr code by using defined structs and macros
  nvme: introduce change ptpl and iekey definition
  block: return bool from get_disk_ro and bdev_read_only
  block: remove a duplicate definition for bdev_read_only
  block: return bool from blk_rq_aligned
  block: return unsigned int from blk_lim_dma_alignment_and_pad
  block: return unsigned int from queue_dma_alignment
  block: return unsigned int from bdev_io_opt
  block: req->bio is always set in the merge code
  block: don't bother checking the data direction for merges
  block: blk-mq: fix uninit-value in blk_rq_prep_clone and refactor
  Revert "block, bfq: merge bfq_release_process_ref() into bfq_put_cooperator()"
  md/raid10: Atomic write support
  md/raid1: Atomic write support
  ...
2024-11-30 15:47:29 -08:00
Zhang Xianwei
82734209be brd: decrease the number of allocated pages which discarded
The number of allocated pages which discarded will not decrease.
Fix it.

Fixes: 9ead7efc6f ("brd: implement discard support")

Signed-off-by: Zhang Xianwei <zhang.xianwei8@zte.com.cn>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241128170056565nPKSz2vsP8K8X2uk2iaDG@zte.com.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-29 08:43:52 -07:00
Linus Torvalds
798bb342e0 Rust changes for v6.13
Toolchain and infrastructure:
 
  - Enable a series of lints, including safety-related ones, e.g. the
    compiler will now warn about missing safety comments, as well as
    unnecessary ones. How safety documentation is organized is a frequent
    source of review comments, thus having the compiler guide new
    developers on where they are expected (and where not) is very nice.
 
  - Start using '#[expect]': an interesting feature in Rust (stabilized
    in 1.81.0) that makes the compiler warn if an expected warning was
    _not_ emitted. This is useful to avoid forgetting cleaning up locally
    ignored diagnostics ('#[allow]'s).
 
  - Introduce '.clippy.toml' configuration file for Clippy, the Rust
    linter, which will allow us to tweak its behaviour. For instance, our
    first use cases are declaring a disallowed macro and, more
    importantly, enabling the checking of private items.
 
  - Lints-related fixes and cleanups related to the items above.
 
  - Migrate from 'receiver_trait' to 'arbitrary_self_types': to get the
    kernel into stable Rust, one of the major pieces of the puzzle is the
    support to write custom types that can be used as 'self', i.e. as
    receivers, since the kernel needs to write types such as 'Arc' that
    common userspace Rust would not. 'arbitrary_self_types' has been
    accepted to become stable, and this is one of the steps required to
    get there.
 
  - Remove usage of the 'new_uninit' unstable feature.
 
  - Use custom C FFI types. Includes a new 'ffi' crate to contain our
    custom mapping, instead of using the standard library 'core::ffi'
    one. The actual remapping will be introduced in a later cycle.
 
  - Map '__kernel_{size_t,ssize_t,ptrdiff_t}' to 'usize'/'isize' instead
    of 32/64-bit integers.
 
  - Fix 'size_t' in bindgen generated prototypes of C builtins.
 
  - Warn on bindgen < 0.69.5 and libclang >= 19.1 due to a double issue
    in the projects, which we managed to trigger with the upcoming
    tracepoint support. It includes a build test since some distributions
    backported the fix (e.g. Debian -- thanks!). All major distributions
    we list should be now OK except Ubuntu non-LTS.
 
 'macros' crate:
 
  - Adapt the build system to be able run the doctests there too; and
    clean up and enable the corresponding doctests.
 
 'kernel' crate:
 
  - Add 'alloc' module with generic kernel allocator support and remove
    the dependency on the Rust standard library 'alloc' and the extension
    traits we used to provide fallible methods with flags.
 
    Add the 'Allocator' trait and its implementations '{K,V,KV}malloc'.
    Add the 'Box' type (a heap allocation for a single value of type 'T'
    that is also generic over an allocator and considers the kernel's GFP
    flags) and its shorthand aliases '{K,V,KV}Box'. Add 'ArrayLayout'
    type. Add 'Vec' (a contiguous growable array type) and its shorthand
    aliases '{K,V,KV}Vec', including iterator support.
 
    For instance, now we may write code such as:
 
        let mut v = KVec::new();
        v.push(1, GFP_KERNEL)?;
        assert_eq!(&v, &[1]);
 
    Treewide, move as well old users to these new types.
 
  - 'sync' module: add global lock support, including the
    'GlobalLockBackend' trait; the 'Global{Lock,Guard,LockedBy}' types
     and the 'global_lock!' macro. Add the 'Lock::try_lock' method.
 
  - 'error' module: optimize 'Error' type to use 'NonZeroI32' and make
    conversion functions public.
 
  - 'page' module: add 'page_align' function.
 
  - Add 'transmute' module with the existing 'FromBytes' and 'AsBytes'
    traits.
 
  - 'block::mq::request' module: improve rendered documentation.
 
  - 'types' module: extend 'Opaque' type documentation and add simple
    examples for the 'Either' types.
 
 drm/panic:
 
  - Clean up a series of Clippy warnings.
 
 Documentation:
 
  - Add coding guidelines for lints and the '#[expect]' feature.
 
  - Add Ubuntu to the list of distributions in the Quick Start guide.
 
 MAINTAINERS:
  - Add Danilo Krummrich as maintainer of the new 'alloc' module.
 
 And a few other small cleanups and fixes.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEPjU5OPd5QIZ9jqqOGXyLc2htIW0FAmdFMIgACgkQGXyLc2ht
 IW16hQ/+KX/jmdGoXtNXx7T6yG6SJ/txPOieGAWfhBf6C3bqkGrU9Gnw/O3VWrxf
 eyj1QLQaIVUkmumWCefeiy9u3xRXx5fpS0tWJOjUtxC5NcS7vCs0AHQs1skIa6H+
 YD6UKDPOy7CB5fVYqo13B5xnFAlciU0dLo6IGB6bB/lSpCudGLE9+nukfn5H3/R1
 DTc3/fbSoYQU6Ij/FKscB+D/A7ojdYaReodhbzNw1lChg1MrJlCpqoQvHPE8ijg+
 UDljHFFvgKdhSQL9GTa3LC7X4DsnihMWzXt14m6mMOqBa6TqF47WUhhgC77pHEI2
 v/Yy8MLq0pdIzT1wFjsqs6opuvXc7K5Yk5Y60HDsDyIyjk2xgOjh6ZlD0EV161gS
 7w1NtaKd/Cn7hnL7Ua51yJDxJTMllne3fTWemhs3Zd63j7ham98yOoiw+6L2QaM4
 C9nW48vfUuTwDuYJ5HU0uSugubuHW3Ng5JEvMcvd4QjmaI1bQNkgVzefR5j3dLw8
 9kEOTzJoxHpu5B7PZVTEd/L95hlmk1csSQObxi7JYCCimMkusF1S+heBzV/SqWD5
 5ioEhCnSKE05fhQs0Uxns1HkcFle8Bn6r3aSAWV6yaR8oF94yHcuaZRUKxKMHw+1
 cmBE2X8Yvtldw+CYDwEGWjKDtwOStbqk+b/ZzP7f7/p56QH9lSg=
 =Kn7b
 -----END PGP SIGNATURE-----

Merge tag 'rust-6.13' of https://github.com/Rust-for-Linux/linux

Pull rust updates from Miguel Ojeda:
 "Toolchain and infrastructure:

   - Enable a series of lints, including safety-related ones, e.g. the
     compiler will now warn about missing safety comments, as well as
     unnecessary ones. How safety documentation is organized is a
     frequent source of review comments, thus having the compiler guide
     new developers on where they are expected (and where not) is very
     nice.

   - Start using '#[expect]': an interesting feature in Rust (stabilized
     in 1.81.0) that makes the compiler warn if an expected warning was
     _not_ emitted. This is useful to avoid forgetting cleaning up
     locally ignored diagnostics ('#[allow]'s).

   - Introduce '.clippy.toml' configuration file for Clippy, the Rust
     linter, which will allow us to tweak its behaviour. For instance,
     our first use cases are declaring a disallowed macro and, more
     importantly, enabling the checking of private items.

   - Lints-related fixes and cleanups related to the items above.

   - Migrate from 'receiver_trait' to 'arbitrary_self_types': to get the
     kernel into stable Rust, one of the major pieces of the puzzle is
     the support to write custom types that can be used as 'self', i.e.
     as receivers, since the kernel needs to write types such as 'Arc'
     that common userspace Rust would not. 'arbitrary_self_types' has
     been accepted to become stable, and this is one of the steps
     required to get there.

   - Remove usage of the 'new_uninit' unstable feature.

   - Use custom C FFI types. Includes a new 'ffi' crate to contain our
     custom mapping, instead of using the standard library 'core::ffi'
     one. The actual remapping will be introduced in a later cycle.

   - Map '__kernel_{size_t,ssize_t,ptrdiff_t}' to 'usize'/'isize'
     instead of 32/64-bit integers.

   - Fix 'size_t' in bindgen generated prototypes of C builtins.

   - Warn on bindgen < 0.69.5 and libclang >= 19.1 due to a double issue
     in the projects, which we managed to trigger with the upcoming
     tracepoint support. It includes a build test since some
     distributions backported the fix (e.g. Debian -- thanks!). All
     major distributions we list should be now OK except Ubuntu non-LTS.

  'macros' crate:

   - Adapt the build system to be able run the doctests there too; and
     clean up and enable the corresponding doctests.

  'kernel' crate:

   - Add 'alloc' module with generic kernel allocator support and remove
     the dependency on the Rust standard library 'alloc' and the
     extension traits we used to provide fallible methods with flags.

     Add the 'Allocator' trait and its implementations '{K,V,KV}malloc'.
     Add the 'Box' type (a heap allocation for a single value of type
     'T' that is also generic over an allocator and considers the
     kernel's GFP flags) and its shorthand aliases '{K,V,KV}Box'. Add
     'ArrayLayout' type. Add 'Vec' (a contiguous growable array type)
     and its shorthand aliases '{K,V,KV}Vec', including iterator
     support.

     For instance, now we may write code such as:

         let mut v = KVec::new();
         v.push(1, GFP_KERNEL)?;
         assert_eq!(&v, &[1]);

     Treewide, move as well old users to these new types.

   - 'sync' module: add global lock support, including the
     'GlobalLockBackend' trait; the 'Global{Lock,Guard,LockedBy}' types
     and the 'global_lock!' macro. Add the 'Lock::try_lock' method.

   - 'error' module: optimize 'Error' type to use 'NonZeroI32' and make
     conversion functions public.

   - 'page' module: add 'page_align' function.

   - Add 'transmute' module with the existing 'FromBytes' and 'AsBytes'
     traits.

   - 'block::mq::request' module: improve rendered documentation.

   - 'types' module: extend 'Opaque' type documentation and add simple
     examples for the 'Either' types.

  drm/panic:

   - Clean up a series of Clippy warnings.

  Documentation:

   - Add coding guidelines for lints and the '#[expect]' feature.

   - Add Ubuntu to the list of distributions in the Quick Start guide.

  MAINTAINERS:

   - Add Danilo Krummrich as maintainer of the new 'alloc' module.

  And a few other small cleanups and fixes"

* tag 'rust-6.13' of https://github.com/Rust-for-Linux/linux: (82 commits)
  rust: alloc: Fix `ArrayLayout` allocations
  docs: rust: remove spurious item in `expect` list
  rust: allow `clippy::needless_lifetimes`
  rust: warn on bindgen < 0.69.5 and libclang >= 19.1
  rust: use custom FFI integer types
  rust: map `__kernel_size_t` and friends also to usize/isize
  rust: fix size_t in bindgen prototypes of C builtins
  rust: sync: add global lock support
  rust: macros: enable the rest of the tests
  rust: macros: enable paste! use from macro_rules!
  rust: enable macros::module! tests
  rust: kbuild: expand rusttest target for macros
  rust: types: extend `Opaque` documentation
  rust: block: fix formatting of `kernel::block::mq::request` module
  rust: macros: fix documentation of the paste! macro
  rust: kernel: fix THIS_MODULE header path in ThisModule doc comment
  rust: page: add Rust version of PAGE_ALIGN
  rust: helpers: remove unnecessary header includes
  rust: exports: improve grammar in commentary
  drm/panic: allow verbose version check
  ...
2024-11-26 14:00:26 -08:00
Linus Torvalds
5c00ff742b - The series "zram: optimal post-processing target selection" from
Sergey Senozhatsky improves zram's post-processing selection algorithm.
   This leads to improved memory savings.
 
 - Wei Yang has gone to town on the mapletree code, contributing several
   series which clean up the implementation:
 
 	- "refine mas_mab_cp()"
 	- "Reduce the space to be cleared for maple_big_node"
 	- "maple_tree: simplify mas_push_node()"
 	- "Following cleanup after introduce mas_wr_store_type()"
 	- "refine storing null"
 
 - The series "selftests/mm: hugetlb_fault_after_madv improvements" from
   David Hildenbrand fixes this selftest for s390.
 
 - The series "introduce pte_offset_map_{ro|rw}_nolock()" from Qi Zheng
   implements some rationaizations and cleanups in the page mapping code.
 
 - The series "mm: optimize shadow entries removal" from Shakeel Butt
   optimizes the file truncation code by speeding up the handling of shadow
   entries.
 
 - The series "Remove PageKsm()" from Matthew Wilcox completes the
   migration of this flag over to being a folio-based flag.
 
 - The series "Unify hugetlb into arch_get_unmapped_area functions" from
   Oscar Salvador implements a bunch of consolidations and cleanups in the
   hugetlb code.
 
 - The series "Do not shatter hugezeropage on wp-fault" from Dev Jain
   takes away the wp-fault time practice of turning a huge zero page into
   small pages.  Instead we replace the whole thing with a THP.  More
   consistent cleaner and potentiall saves a large number of pagefaults.
 
 - The series "percpu: Add a test case and fix for clang" from Andy
   Shevchenko enhances and fixes the kernel's built in percpu test code.
 
 - The series "mm/mremap: Remove extra vma tree walk" from Liam Howlett
   optimizes mremap() by avoiding doing things which we didn't need to do.
 
 - The series "Improve the tmpfs large folio read performance" from
   Baolin Wang teaches tmpfs to copy data into userspace at the folio size
   rather than as individual pages.  A 20% speedup was observed.
 
 - The series "mm/damon/vaddr: Fix issue in
   damon_va_evenly_split_region()" fro Zheng Yejian fixes DAMON splitting.
 
 - The series "memcg-v1: fully deprecate charge moving" from Shakeel Butt
   removes the long-deprecated memcgv2 charge moving feature.
 
 - The series "fix error handling in mmap_region() and refactor" from
   Lorenzo Stoakes cleanup up some of the mmap() error handling and
   addresses some potential performance issues.
 
 - The series "x86/module: use large ROX pages for text allocations" from
   Mike Rapoport teaches x86 to use large pages for read-only-execute
   module text.
 
 - The series "page allocation tag compression" from Suren Baghdasaryan
   is followon maintenance work for the new page allocation profiling
   feature.
 
 - The series "page->index removals in mm" from Matthew Wilcox remove
   most references to page->index in mm/.  A slow march towards shrinking
   struct page.
 
 - The series "damon/{self,kunit}tests: minor fixups for DAMON debugfs
   interface tests" from Andrew Paniakin performs maintenance work for
   DAMON's self testing code.
 
 - The series "mm: zswap swap-out of large folios" from Kanchana Sridhar
   improves zswap's batching of compression and decompression.  It is a
   step along the way towards using Intel IAA hardware acceleration for
   this zswap operation.
 
 - The series "kasan: migrate the last module test to kunit" from
   Sabyrzhan Tasbolatov completes the migration of the KASAN built-in tests
   over to the KUnit framework.
 
 - The series "implement lightweight guard pages" from Lorenzo Stoakes
   permits userapace to place fault-generating guard pages within a single
   VMA, rather than requiring that multiple VMAs be created for this.
   Improved efficiencies for userspace memory allocators are expected.
 
 - The series "memcg: tracepoint for flushing stats" from JP Kobryn uses
   tracepoints to provide increased visibility into memcg stats flushing
   activity.
 
 - The series "zram: IDLE flag handling fixes" from Sergey Senozhatsky
   fixes a zram buglet which potentially affected performance.
 
 - The series "mm: add more kernel parameters to control mTHP" from
   Maíra Canal enhances our ability to control/configuremultisize THP from
   the kernel boot command line.
 
 - The series "kasan: few improvements on kunit tests" from Sabyrzhan
   Tasbolatov has a couple of fixups for the KASAN KUnit tests.
 
 - The series "mm/list_lru: Split list_lru lock into per-cgroup scope"
   from Kairui Song optimizes list_lru memory utilization when lockdep is
   enabled.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZzwFqgAKCRDdBJ7gKXxA
 jkeuAQCkl+BmeYHE6uG0hi3pRxkupseR6DEOAYIiTv0/l8/GggD/Z3jmEeqnZaNq
 xyyenpibWgUoShU2wZ/Ha8FE5WDINwg=
 =JfWR
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2024-11-18-19-27' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - The series "zram: optimal post-processing target selection" from
   Sergey Senozhatsky improves zram's post-processing selection
   algorithm. This leads to improved memory savings.

 - Wei Yang has gone to town on the mapletree code, contributing several
   series which clean up the implementation:
	- "refine mas_mab_cp()"
	- "Reduce the space to be cleared for maple_big_node"
	- "maple_tree: simplify mas_push_node()"
	- "Following cleanup after introduce mas_wr_store_type()"
	- "refine storing null"

 - The series "selftests/mm: hugetlb_fault_after_madv improvements" from
   David Hildenbrand fixes this selftest for s390.

 - The series "introduce pte_offset_map_{ro|rw}_nolock()" from Qi Zheng
   implements some rationaizations and cleanups in the page mapping
   code.

 - The series "mm: optimize shadow entries removal" from Shakeel Butt
   optimizes the file truncation code by speeding up the handling of
   shadow entries.

 - The series "Remove PageKsm()" from Matthew Wilcox completes the
   migration of this flag over to being a folio-based flag.

 - The series "Unify hugetlb into arch_get_unmapped_area functions" from
   Oscar Salvador implements a bunch of consolidations and cleanups in
   the hugetlb code.

 - The series "Do not shatter hugezeropage on wp-fault" from Dev Jain
   takes away the wp-fault time practice of turning a huge zero page
   into small pages. Instead we replace the whole thing with a THP. More
   consistent cleaner and potentiall saves a large number of pagefaults.

 - The series "percpu: Add a test case and fix for clang" from Andy
   Shevchenko enhances and fixes the kernel's built in percpu test code.

 - The series "mm/mremap: Remove extra vma tree walk" from Liam Howlett
   optimizes mremap() by avoiding doing things which we didn't need to
   do.

 - The series "Improve the tmpfs large folio read performance" from
   Baolin Wang teaches tmpfs to copy data into userspace at the folio
   size rather than as individual pages. A 20% speedup was observed.

 - The series "mm/damon/vaddr: Fix issue in
   damon_va_evenly_split_region()" fro Zheng Yejian fixes DAMON
   splitting.

 - The series "memcg-v1: fully deprecate charge moving" from Shakeel
   Butt removes the long-deprecated memcgv2 charge moving feature.

 - The series "fix error handling in mmap_region() and refactor" from
   Lorenzo Stoakes cleanup up some of the mmap() error handling and
   addresses some potential performance issues.

 - The series "x86/module: use large ROX pages for text allocations"
   from Mike Rapoport teaches x86 to use large pages for
   read-only-execute module text.

 - The series "page allocation tag compression" from Suren Baghdasaryan
   is followon maintenance work for the new page allocation profiling
   feature.

 - The series "page->index removals in mm" from Matthew Wilcox remove
   most references to page->index in mm/. A slow march towards shrinking
   struct page.

 - The series "damon/{self,kunit}tests: minor fixups for DAMON debugfs
   interface tests" from Andrew Paniakin performs maintenance work for
   DAMON's self testing code.

 - The series "mm: zswap swap-out of large folios" from Kanchana Sridhar
   improves zswap's batching of compression and decompression. It is a
   step along the way towards using Intel IAA hardware acceleration for
   this zswap operation.

 - The series "kasan: migrate the last module test to kunit" from
   Sabyrzhan Tasbolatov completes the migration of the KASAN built-in
   tests over to the KUnit framework.

 - The series "implement lightweight guard pages" from Lorenzo Stoakes
   permits userapace to place fault-generating guard pages within a
   single VMA, rather than requiring that multiple VMAs be created for
   this. Improved efficiencies for userspace memory allocators are
   expected.

 - The series "memcg: tracepoint for flushing stats" from JP Kobryn uses
   tracepoints to provide increased visibility into memcg stats flushing
   activity.

 - The series "zram: IDLE flag handling fixes" from Sergey Senozhatsky
   fixes a zram buglet which potentially affected performance.

 - The series "mm: add more kernel parameters to control mTHP" from
   Maíra Canal enhances our ability to control/configuremultisize THP
   from the kernel boot command line.

 - The series "kasan: few improvements on kunit tests" from Sabyrzhan
   Tasbolatov has a couple of fixups for the KASAN KUnit tests.

 - The series "mm/list_lru: Split list_lru lock into per-cgroup scope"
   from Kairui Song optimizes list_lru memory utilization when lockdep
   is enabled.

* tag 'mm-stable-2024-11-18-19-27' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (215 commits)
  cma: enforce non-zero pageblock_order during cma_init_reserved_mem()
  mm/kfence: add a new kunit test test_use_after_free_read_nofault()
  zram: fix NULL pointer in comp_algorithm_show()
  memcg/hugetlb: add hugeTLB counters to memcg
  vmstat: call fold_vm_zone_numa_events() before show per zone NUMA event
  mm: mmap_lock: check trace_mmap_lock_$type_enabled() instead of regcount
  zram: ZRAM_DEF_COMP should depend on ZRAM
  MAINTAINERS/MEMORY MANAGEMENT: add document files for mm
  Docs/mm/damon: recommend academic papers to read and/or cite
  mm: define general function pXd_init()
  kmemleak: iommu/iova: fix transient kmemleak false positive
  mm/list_lru: simplify the list_lru walk callback function
  mm/list_lru: split the lock to per-cgroup scope
  mm/list_lru: simplify reparenting and initial allocation
  mm/list_lru: code clean up for reparenting
  mm/list_lru: don't export list_lru_add
  mm/list_lru: don't pass unnecessary key parameters
  kasan: add kunit tests for kmalloc_track_caller, kmalloc_node_track_caller
  kasan: change kasan_atomics kunit test as KUNIT_CASE_SLOW
  kasan: use EXPORT_SYMBOL_IF_KUNIT to export symbols
  ...
2024-11-23 09:58:07 -08:00
Linus Torvalds
c01f664e4c \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmc/WXkACgkQnJ2qBz9k
 QNnwjAf/c8K3Vhw9RuKMtPF0K+gC//0mLsq+WmgrtXfMLvbSymrACnwHFJzpNGeS
 iEqCYlCC7vlqzPXpsVRlFeHpM52oVnE/wFF0Hp1h/Y1oqbRSzur6iSl4epmmBN+K
 AsPoWEXco7ABqtrhoZb0b1n7io9VorHN4nLhO6KWD83nZAawJDWgSw0sNCqcT6to
 vVxR3baP/EhONxNquxXe2lxq26dMilehmTk4AOyYslNYb0iG4r18TPyNb7fmuuKG
 M+nFfMnM9EPH8lnmgx6Mg/X77d/eZoq4pMRmeqSsroB5k/AQJnNrGweNL1+yr7OY
 adWNOMGWdNNQXPFgGbL5yZwNZ64kRA==
 =Eq1B
 -----END PGP SIGNATURE-----

Merge tag 'reiserfs_delete' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull reiserfs removal from Jan Kara:
 "The deprecation period of reiserfs is ending at the end of this year
  so it is time to remove it"

* tag 'reiserfs_delete' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  reiserfs: The last commit
2024-11-21 09:50:18 -08:00
Ming Lei
34c1227035 ublk: fix error code for unsupported command
ENOTSUPP is for kernel use only, and shouldn't be sent to userspace.

Fix it by replacing it with EOPNOTSUPP.

Cc: stable@vger.kernel.org
Fixes: bfbcef0363 ("ublk_drv: move ublk_get_device_from_id into ublk_ctrl_uring_cmd")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241119030646.2319030-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-19 09:19:46 -07:00
OGAWA Hirofumi
b49125574c loop: Fix ABBA locking race
Current loop calls vfs_statfs() while holding the q->limits_lock. If
FS takes some locking in vfs_statfs callback, this may lead to ABBA
locking bug (at least, FAT fs has this issue actually).

So this patch calls vfs_statfs() outside q->limits_locks instead,
because looks like no reason to hold q->limits_locks while getting
discord configs.

Chain exists of:
  &sbi->fat_lock --> &q->q_usage_counter(io)#17 --> &q->limits_lock

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&q->limits_lock);
                               lock(&q->q_usage_counter(io)#17);
                               lock(&q->limits_lock);
  lock(&sbi->fat_lock);

 *** DEADLOCK ***

Reported-by: syzbot+a5d8c609c02f508672cc@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=a5d8c609c02f508672cc
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-19 07:54:56 -07:00
Liu Shixin
f364cdeb38 zram: fix NULL pointer in comp_algorithm_show()
LTP reported a NULL pointer dereference as followed:

 CPU: 7 UID: 0 PID: 5995 Comm: cat Kdump: loaded Not tainted 6.12.0-rc6+ #3
 Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
 pstate: 40400005 (nZcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
 pc : __pi_strcmp+0x24/0x140
 lr : zcomp_available_show+0x60/0x100 [zram]
 sp : ffff800088b93b90
 x29: ffff800088b93b90 x28: 0000000000000001 x27: 0000000000400cc0
 x26: 0000000000000ffe x25: ffff80007b3e2388 x24: 0000000000000000
 x23: ffff80007b3e2390 x22: ffff0004041a9000 x21: ffff80007b3e2900
 x20: 0000000000000000 x19: 0000000000000000 x18: 0000000000000000
 x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
 x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
 x11: 0000000000000000 x10: ffff80007b3e2900 x9 : ffff80007b3cb280
 x8 : 0101010101010101 x7 : 0000000000000000 x6 : 0000000000000000
 x5 : 0000000000000040 x4 : 0000000000000000 x3 : 00656c722d6f7a6c
 x2 : 0000000000000000 x1 : ffff80007b3e2900 x0 : 0000000000000000
 Call trace:
  __pi_strcmp+0x24/0x140
  comp_algorithm_show+0x40/0x70 [zram]
  dev_attr_show+0x28/0x80
  sysfs_kf_seq_show+0x90/0x140
  kernfs_seq_show+0x34/0x48
  seq_read_iter+0x1d4/0x4e8
  kernfs_fop_read_iter+0x40/0x58
  new_sync_read+0x9c/0x168
  vfs_read+0x1a8/0x1f8
  ksys_read+0x74/0x108
  __arm64_sys_read+0x24/0x38
  invoke_syscall+0x50/0x120
  el0_svc_common.constprop.0+0xc8/0xf0
  do_el0_svc+0x24/0x38
  el0_svc+0x38/0x138
  el0t_64_sync_handler+0xc0/0xc8
  el0t_64_sync+0x188/0x190

The zram->comp_algs[ZRAM_PRIMARY_COMP] can be NULL in zram_add() if
comp_algorithm_set() has not been called.  User can access the zram device
by sysfs after device_add_disk(), so there is a time window to trigger the
NULL pointer dereference.  Move it ahead device_add_disk() to make sure
when user can access the zram device, it is ready.  comp_algorithm_set()
is protected by zram->init_lock in other places and no such problem.

Link: https://lkml.kernel.org/r/20241108100147.3776123-1-liushixin2@huawei.com
Fixes: 7ac07a26de ("zram: preparation for multi-zcomp support")
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-14 22:49:19 -08:00
Christoph Hellwig
e70c301fae block: don't reorder requests in blk_add_rq_to_plug
Add requests to the tail of the list instead of the front so that they
are queued up in submission order.

Remove the re-reordering in blk_mq_dispatch_plug_list, virtio_queue_rqs
and nvme_queue_rqs now that the list is ordered as expected.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241113152050.157179-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-13 12:04:58 -07:00
Christoph Hellwig
a3396b9999 block: add a rq_list type
Replace the semi-open coded request list helpers with a proper rq_list
type that mirrors the bio_list and has head and tail pointers.  Besides
better type safety this actually allows to insert at the tail of the
list, which will be useful soon.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241113152050.157179-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-13 12:04:58 -07:00
Christoph Hellwig
7f212e997e virtio_blk: reverse request order in virtio_queue_rqs
blk_mq_flush_plug_list submits requests in the reverse order that they
were submitted, which leads to a rather suboptimal I/O pattern
especially in rotational devices. Fix this by rewriting virtio_queue_rqs
so that it always pops the requests from the passed in request list, and
then adds them to the head of a local submit list. This actually
simplifies the code a bit as it removes the complicated list splicing,
at the cost of extra updates of the rq_next pointer. As that should be
cache hot anyway it should be an easy price to pay.

Fixes: 0e9911fa76 ("virtio-blk: support mq_ops->queue_rqs()")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241113152050.157179-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-13 12:04:26 -07:00
Geert Uytterhoeven
9f3310ccc7 zram: ZRAM_DEF_COMP should depend on ZRAM
When Compressed RAM block device support is disabled, the
CONFIG_ZRAM_DEF_COMP symbol still ends up in the generated config file:

    CONFIG_ZRAM_DEF_COMP="unset-value"

While this causes no real harm, avoid polluting the config file by
adding a dependency on ZRAM.

Link: https://lkml.kernel.org/r/64e05bad68a9bd5cc322efd114a04d25de525940.1730807319.git.geert@linux-m68k.org
Fixes: 917a59e81c ("zram: introduce custom comp backends API")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11 17:22:27 -08:00
Sergey Senozhatsky
d37da422ed zram: clear IDLE flag in mark_idle()
If entry does not fulfill current mark_idle() parameters, e.g.  cutoff
time, then we should clear its ZRAM_IDLE from previous mark_idle()
invocations.

Consider the following case:
- mark_idle() cutoff time 8h
- mark_idle() cutoff time 4h
- writeback() idle - will writeback entries with cutoff time 8h,
  while it should only pick entries with cutoff time 4h

The bug was reported by Shin Kawamura.

Link: https://lkml.kernel.org/r/20241028153629.1479791-3-senozhatsky@chromium.org
Fixes: 755804d169 ("zram: introduce an aged idle interface")
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reported-by: Shin Kawamura <kawasin@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: <stable@vger.kernel.org>
2024-11-11 13:09:35 -08:00
Sergey Senozhatsky
f852190966 zram: clear IDLE flag after recompression
Patch series "zram: IDLE flag handling fixes", v2.

zram can wrongly preserve ZRAM_IDLE flag on its entries which can result
in premature post-processing (writeback and recompression) of such
entries.

This patch (of 2)

Recompression should clear ZRAM_IDLE flag on the entries it has accessed,
because otherwise some entries, specifically those for which recompression
has failed, become immediate candidate entries for another post-processing
(e.g.  writeback).

Consider the following case:
- recompression marks entries IDLE every 4 hours and attempts
  to recompress them
- some entries are incompressible, so we keep them intact and
  hence preserve IDLE flag
- writeback marks entries IDLE every 8 hours and writebacks
  IDLE entries, however we have IDLE entries left from
  recompression, so writeback prematurely writebacks those
  entries.

The bug was reported by Shin Kawamura.

Link: https://lkml.kernel.org/r/20241028153629.1479791-1-senozhatsky@chromium.org
Link: https://lkml.kernel.org/r/20241028153629.1479791-2-senozhatsky@chromium.org
Fixes: 84b33bf788 ("zram: introduce recompress sysfs knob")
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reported-by: Shin Kawamura <kawasin@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: <stable@vger.kernel.org>
2024-11-11 13:09:12 -08:00
Christoph Hellwig
559218d43e block: pre-calculate max_zone_append_sectors
max_zone_append_sectors differs from all other queue limits in that the
final value used is not stored in the queue_limits but needs to be
obtained using queue_limits_max_zone_append_sectors helper.  This not
only adds (tiny) extra overhead to the I/O path, but also can be easily
forgotten in file system code.

Add a new max_hw_zone_append_sectors value to queue_limits which is
set by the driver, and calculate max_zone_append_sectors from that and
the other inputs in blk_validate_zoned_limits, similar to how
max_sectors is calculated to fix this.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241104073955.112324-3-hch@lst.de
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20241108154657.845768-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-11 09:20:36 -07:00
Ming Lei
d369735e02 ublk: fix ublk_ch_mmap() for 64K page size
In ublk_ch_mmap(), queue id is calculated in the following way:

	(vma->vm_pgoff << PAGE_SHIFT) / `max_cmd_buf_size`

'max_cmd_buf_size' is equal to

	`UBLK_MAX_QUEUE_DEPTH * sizeof(struct ublksrv_io_desc)`

and UBLK_MAX_QUEUE_DEPTH is 4096 and part of UAPI, so 'max_cmd_buf_size'
is always page aligned in 4K page size kernel. However, it isn't true in
64K page size kernel.

Fixes the issue by always rounding up 'max_cmd_buf_size' with PAGE_SIZE.

Cc: stable@vger.kernel.org
Fixes: 71f28f3136 ("ublk_drv: add io_uring based userspace block driver")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241111110718.1394001-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-11 08:16:17 -07:00
Li Wang
8e604cac49 loop: fix type of block size
PAGE_SIZE may be 64K, and the max block size can be PAGE_SIZE, so any
variable for holding block size can't be defined as 'unsigned short'.

Unfortunately commit 473516b361 ("loop: use the atomic queue limits
update API") passes 'bsize' with type of 'unsigned short' to
loop_reconfigure_limits(), and causes LTP/ioctl_loop06 test failure:

  12 ioctl_loop06.c:76: TINFO: Using LOOP_SET_BLOCK_SIZE with arg > PAGE_SIZE
  13 ioctl_loop06.c:59: TFAIL: Set block size succeed unexpectedly
  ...
  18 ioctl_loop06.c:76: TINFO: Using LOOP_CONFIGURE with block_size > PAGE_SIZE
  19 ioctl_loop06.c:59: TFAIL: Set block size succeed unexpectedly

Fixes the issue by defining 'block size' variable with 'unsigned int', which is
aligned with block layer's definition.

(improve commit log & add fixes tag)

Fixes: 473516b361 ("loop: use the atomic queue limits update API")
Cc: John Garry <john.g.garry@oracle.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Li Wang <liwang@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20241109022744.1126003-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-09 20:05:33 -07:00
Ming Lei
a471977780 rbd: unfreeze queue after marking disk as dead
Unfreeze queue after returning from blk_mark_disk_dead(), this way at
least allows us to verify queue freeze correctly with lockdep.

Suggested-by: Christoph Hellwig <hch@lst.de>
Cc: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241031133723.303835-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-07 16:27:22 -07:00
Jens Axboe
ab9bc81c1c Revert "block: pre-calculate max_zone_append_sectors"
This causes issue on, at least, nvme-mpath where my boot fails with:

WARNING: CPU: 354 PID: 2729 at block/blk-settings.c:75 blk_validate_limits+0x356/0x380
Modules linked in: tg3(+) nvme usbcore scsi_mod ptp i2c_piix4 libphy nvme_core crc32c_intel scsi_common usb_common pps_core i2c_smbus
CPU: 354 UID: 0 PID: 2729 Comm: kworker/u2061:1 Not tainted 6.12.0-rc6+ #181
Hardware name: Dell Inc. PowerEdge R7625/06444F, BIOS 1.8.3 04/02/2024
Workqueue: async async_run_entry_fn
RIP: 0010:blk_validate_limits+0x356/0x380
Code: f6 47 01 04 75 28 83 bf 94 00 00 00 00 75 39 83 bf 98 00 00 00 00 75 34 83 7f 68 00 75 32 31 c0 83 7f 5c 00 0f 84 9b fd ff ff <0f> 0b eb 13 0f 0b eb 0f 48 c7 c0 74 12 58 92 48 89 c7 e8 13 76 46
RSP: 0018:ffffa8a1dfb93b30 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff9232829c8388 RCX: 0000000000000088
RDX: 0000000000000080 RSI: 0000000000000200 RDI: ffffa8a1dfb93c38
RBP: 000000000000000c R08: 00000000ffffffff R09: 000000000000ffff
R10: 0000000000000000 R11: 0000000000000000 R12: ffff9232829b9000
R13: ffff9232829b9010 R14: ffffa8a1dfb93c38 R15: ffffa8a1dfb93c38
FS:  0000000000000000(0000) GS:ffff923867c80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055c1b92480a8 CR3: 0000002484ff0002 CR4: 0000000000370ef0
Call Trace:
 <TASK>
 ? __warn+0xca/0x1a0
 ? blk_validate_limits+0x356/0x380
 ? report_bug+0x11a/0x1a0
 ? handle_bug+0x5e/0x90
 ? exc_invalid_op+0x16/0x40
 ? asm_exc_invalid_op+0x16/0x20
 ? blk_validate_limits+0x356/0x380
 blk_alloc_queue+0x7a/0x250
 __blk_alloc_disk+0x39/0x80
 nvme_mpath_alloc_disk+0x13d/0x1b0 [nvme_core]
 nvme_scan_ns+0xcc7/0x1010 [nvme_core]
 async_run_entry_fn+0x27/0x120
 process_scheduled_works+0x1a0/0x360
 worker_thread+0x2bc/0x350
 ? pr_cont_work+0x1b0/0x1b0
 kthread+0x111/0x120
 ? kthread_unuse_mm+0x90/0x90
 ret_from_fork+0x30/0x40
 ? kthread_unuse_mm+0x90/0x90
 ret_from_fork_asm+0x11/0x20
 </TASK>
---[ end trace 0000000000000000 ]---

presumably due to max_zone_append_sectors not being cleared to zero,
resulting in blk_validate_zoned_limits() complaining and failing.

This reverts commit 2a8f6153e1.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-07 05:45:34 -07:00
Sergey Senozhatsky
01a9097aa3 zram: do not open-code comp priority 0
A cosmetic change: do not open-code compression priority 0, use
ZRAM_PRIMARY_COMP instead.

Link: https://lkml.kernel.org/r/20241009042908.750260-1-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06 20:11:11 -08:00
Philipp Stanner
91ff97a722 mtip32xx: Replace deprecated PCI functions
pcim_iomap_table() and pcim_request_regions() have been deprecated in
commit e354bb84a4 ("PCI: Deprecate pcim_iomap_table(),
pcim_iomap_regions_request_all()") and commit d140f80f60 ("PCI:
Deprecate pcim_iomap_regions() in favor of pcim_iomap_region()"),
respectively.

Replace these functions with pcim_iomap_region().

Signed-off-by: Philipp Stanner <pstanner@redhat.com>
Link: https://lore.kernel.org/r/20241106145249.108996-2-pstanner@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-06 07:54:50 -07:00
Sergey Senozhatsky
5e99893444 zram: remove UNDER_WB and simplify writeback
We now have only one active post-processing at any time, so we don't have
same race conditions that we had before.  If slot selected for
post-processing gets freed or freed and reallocated it loses its PP_SLOT
flag and there is no way for such a slot to gain PP_SLOT flag again until
current post-processing terminates.

Link: https://lkml.kernel.org/r/20240917021020.883356-8-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05 16:56:23 -08:00
Sergey Senozhatsky
1a1d0f8992 zram: reshuffle zram_free_page() flags operations
Drop some redundant zram_test_flag() calls and re-order zram_clear_flag()
calls.  Plus two small trivial coding style fixes.  No functional changes.

Link: https://lkml.kernel.org/r/20240917021020.883356-7-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05 16:56:23 -08:00
Sergey Senozhatsky
b967fa1ba7 zram: do not mark idle slots that cannot be idle
ZRAM_SAME slots cannot be post-processed (writeback or recompress) so do
not mark them ZRAM_IDLE.  Same with ZRAM_WB slots, they cannot be
ZRAM_IDLE because they are not in zsmalloc pool anymore.

Link: https://lkml.kernel.org/r/20240917021020.883356-6-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05 16:56:22 -08:00
Sergey Senozhatsky
330edc2bc0 zram: rework writeback target selection strategy
Writeback suffers from the same problem as recompression did before -
target slot selection for writeback is just a simple iteration over
zram->table entries (stored pages) which selects suboptimal targets for
writeback.  This is especially problematic for writeback, because we
uncompress objects before writeback so each of them takes 4K out of
limited writeback storage.  For example, when we take a 48 bytes slot and
store it as a 4K object to writeback device we only save 48 bytes of
memory (release from zsmalloc pool).  We naturally want to pick the
largest objects for writeback, because then each writeback will release
the largest amount of memory.

This patch applies the same solution and strategy as for recompression
target selection: pp control (post-process) with 16 buckets of candidate
pp slots.  Slots are assigned to pp buckets based on sizes - the larger
the slot the higher the group index.  This gives us sorted by size lists
of candidate slots (in linear time), so that among post-processing
candidate slots we always select the largest ones first and maximize the
memory saving.

TEST
====

A very simple demonstration: zram is configured with a writeback device. 
A limited writeback (wb_limit 2500 pages) is performed then, with a log of
sizes of slots that were written back.  You can see that patched zram
selects slots for recompression in significantly different manner, which
leads to higher memory savings (see column #2 of mm_stat output).

BASE
----

*** initial state of zram device
/sys/block/zram0/mm_stat
1750327296 619765836 631902208        0 631902208        1        0    34278    34278

*** writeback idle wb_limit 2500
/sys/block/zram0/mm_stat
1750327296 617622333 631578624        0 631902208        1        0    34278    34278

Sizes of selected objects for writeback:
... 193 349 46 46 46 46 852 1002 543 162 107 49 34 34 34 ...

PATCHED
-------

*** initial state of zram device
/sys/block/zram0/mm_stat
1750319104 619760957 631992320        0 631992320        1        0    34278    34278

*** writeback idle wb_limit 2500
/sys/block/zram0/mm_stat
1750319104 612672056 626135040        0 631992320        1        0    34278    34278

Sizes of selected objects for writeback:
... 3667 3580 3581 3580 3581 3581 3581 3231 3211 3203 3231 3246 ...

Note, pp-slots are not strictly sorted, there is a PP_BUCKET_SIZE_RANGE
variation of sizes within particular bucket.

Link: https://lkml.kernel.org/r/20240917021020.883356-5-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05 16:56:22 -08:00
Sergey Senozhatsky
3f909a60ce zram: rework recompress target selection strategy
Target slot selection for recompression is just a simple iteration over
zram->table entries (stored pages) from slot 0 to max slot.  Given that
zram->table slots are written in random order and are not sorted by size,
a simple iteration over slots selects suboptimal targets for
recompression.  This is not a problem if we recompress every single
zram->table slot, but we never do that in reality.  In reality we limit
the number of slots we can recompress (via max_pages parameter) and hence
proper slot selection becomes very important.  The strategy is quite
simple, suppose we have two candidate slots for recompression, one of size
48 bytes and one of size 2800 bytes, and we can recompress only one, then
it certainly makes more sense to pick 2800 entry for recompression. 
Because even if we manage to compress 48 bytes objects even further the
savings are going to be very small.  Potential savings after good
re-compression of 2800 bytes objects are much higher.

This patch reworks slot selection and introduces the strategy described
above: among candidate slots always select the biggest ones first.

For that the patch introduces zram_pp_ctl (post-processing) structure
which holds NUM_PP_BUCKETS pp buckets of slots.  Slots are assigned to a
particular group based on their sizes - the larger the size of the slot
the higher the group index.  This, basically, sorts slots by size in liner
time (we still perform just one iteration over zram->table slots).  When
we select slot for recompression we always first lookup in higher pp
buckets (those that hold the largest slots).  Which achieves the desired
behavior.

TEST
====

A very simple demonstration: zram is configured with zstd, and zstd with
dict as a recompression stream.  A limited (max 4096 pages) recompression
is performed then, with a log of sizes of slots that were recompressed. 
You can see that patched zram selects slots for recompression in
significantly different manner, which leads to higher memory savings (see
column #2 of mm_stat output).

BASE
----

*** initial state of zram device
/sys/block/zram0/mm_stat
1750994944 504491413 514203648        0 514203648        1        0    34204    34204

*** recompress idle max_pages=4096
/sys/block/zram0/mm_stat
1750994944 504262229 514953216        0 514203648        1        0    34204    34204

Sizes of selected objects for recompression:
... 45 58 24 226 91 40 24 24 24 424 2104 93 2078 2078 2078 959 154 ...

PATCHED
-------

*** initial state of zram device
/sys/block/zram0/mm_stat
1750982656 504492801 514170880        0 514170880        1        0    34204    34204

*** recompress idle max_pages=4096
/sys/block/zram0/mm_stat
1750982656 503716710 517586944        0 514170880        1        0    34204    34204

Sizes of selected objects for recompression:
... 3680 3694 3667 3590 3614 3553 3537 3548 3550 3542 3543 3537 ...

Note, pp-slots are not strictly sorted, there is a PP_BUCKET_SIZE_RANGE
variation of sizes within particular bucket.

[senozhatsky@chromium.org: do not skip the first bucket]
  Link: https://lkml.kernel.org/r/20241001085634.1948384-1-senozhatsky@chromium.org
Link: https://lkml.kernel.org/r/20240917021020.883356-4-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05 16:56:22 -08:00
Sergey Senozhatsky
58652f2b6d zram: permit only one post-processing operation at a time
Both recompress and writeback soon will unlock slots during processing,
which makes things too complex wrt possible race-conditions.  We still
want to clear PP_SLOT in slot_free, because this is how we figure out that
slot that was selected for post-processing has been released under us and
when we start post-processing we check if slot still has PP_SLOT set.  At
the same time, theoretically, we can have something like this:

CPU0			    CPU1

recompress
scan slots
set PP_SLOT
unlock slot
			slot_free
			clear PP_SLOT

			allocate PP_SLOT
			writeback
			scan slots
			set PP_SLOT
			unlock slot
select PP-slot
test PP_SLOT

So recompress will not detect that slot has been re-used and re-selected
for concurrent writeback post-processing.

Make sure that we only permit on post-processing operation at a time.  So
now recompress and writeback post-processing don't race against each
other, we only need to handle slot re-use (slot_free and write), which is
handled individually by each pp operation.

Having recompress and writeback competing for the same slots is not
exactly good anyway (can't imagine anyone doing that).

Link: https://lkml.kernel.org/r/20240917021020.883356-3-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05 16:56:22 -08:00
Sergey Senozhatsky
bf779fb9af zram: introduce ZRAM_PP_SLOT flag
Patch series "zram: optimal post-processing target selection", v5.

Problem:
--------
Both recompression and writeback perform a very simple linear scan of all
zram slots in search for post-processing (writeback or recompress)
candidate slots.  This often means that we pick the worst candidate for pp
(post-processing), e.g.  a 48 bytes object for writeback, which is nearly
useless, because it only releases 48 bytes from zsmalloc pool, but
consumes an entire 4K slot in the backing device.  Similarly,
recompression of an 48 bytes objects is unlikely to save more memory that
recompression of a 3000 bytes object.  Both recompression and writeback
consume constrained resources (CPU time, batter, backing device storage
space) and quite often have a (daily) limit on the number of items they
post-process, so we should utilize those constrained resources in the most
optimal way.

Solution:
---------
This patch reworks the way we select pp targets.  We, quite clearly, want
to sort all the candidates and always pick the largest, be it
recompression or writeback.  Especially for writeback, because the larger
object we writeback the more memory we release.  This series introduces
concept of pp buckets and pp scan/selection.

The scan step is a simple iteration over all zram->table entries, just
like what we currently do, but we don't post-process a candidate slot
immediately.  Instead we assign it to a PP (post-processing) bucket.  PP
bucket is, basically, a list which holds pp candidate slots that belong to
the same size class.  PP buckets are 64 bytes apart, slots are not
strictly sorted within a bucket there is a 64 bytes variance.

The select step simply iterates over pp buckets from highest to lowest and
picks all candidate slots a particular buckets contains.  So this gives us
sorted candidates (in linear time) and allows us to select most optimal
(largest) candidates for post-processing first.


This patch (of 7):

This flag indicates that the slot was selected as a candidate slot for
post-processing (pp) and was assigned to a pp bucket.  It does not
necessarily mean that the slot is currently under post-processing, but may
mean so.  The slot can loose its PP_SLOT flag, while still being in the
pp-bucket, if it's accessed or slot_free-ed.

Link: https://lkml.kernel.org/r/20240917021020.883356-1-senozhatsky@chromium.org
Link: https://lkml.kernel.org/r/20240917021020.883356-2-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05 16:56:21 -08:00
Christoph Hellwig
2a8f6153e1 block: pre-calculate max_zone_append_sectors
max_zone_append_sectors differs from all other queue limits in that the
final value used is not stored in the queue_limits but needs to be
obtained using queue_limits_max_zone_append_sectors helper.  This not
only adds (tiny) extra overhead to the I/O path, but also can be easily
forgotten in file system code.

Add a new max_hw_zone_append_sectors value to queue_limits which is
set by the driver, and calculate max_zone_append_sectors from that and
the other inputs in blk_validate_zoned_limits, similar to how
max_sectors is calculated to fix this.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241104073955.112324-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-04 10:34:07 -07:00
John Garry
d47de6ac88 loop: Simplify discard granularity calc
A bdev discard granularity is always at least SECTOR_SIZE, so don't check
for a zero value.

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20241101092215.422428-1-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-01 20:17:52 -06:00
Yang Erkun
826cc42adf brd: defer automatic disk creation until module initialization succeeds
My colleague Wupeng found the following problems during fault injection:

BUG: unable to handle page fault for address: fffffbfff809d073
PGD 6e648067 P4D 123ec8067 PUD 123ec4067 PMD 100e38067 PTE 0
Oops: Oops: 0000 [#1] PREEMPT SMP KASAN NOPTI
CPU: 5 UID: 0 PID: 755 Comm: modprobe Not tainted 6.12.0-rc3+ #17
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
1.16.1-2.fc37 04/01/2014
RIP: 0010:__asan_load8+0x4c/0xa0
...
Call Trace:
 <TASK>
 blkdev_put_whole+0x41/0x70
 bdev_release+0x1a3/0x250
 blkdev_release+0x11/0x20
 __fput+0x1d7/0x4a0
 task_work_run+0xfc/0x180
 syscall_exit_to_user_mode+0x1de/0x1f0
 do_syscall_64+0x6b/0x170
 entry_SYSCALL_64_after_hwframe+0x76/0x7e

loop_init() is calling loop_add() after __register_blkdev() succeeds and
is ignoring disk_add() failure from loop_add(), for loop_add() failure
is not fatal and successfully created disks are already visible to
bdev_open().

brd_init() is currently calling brd_alloc() before __register_blkdev()
succeeds and is releasing successfully created disks when brd_init()
returns an error. This can cause UAF for the latter two case:

case 1:
    T1:
modprobe brd
  brd_init
    brd_alloc(0) // success
      add_disk
        disk_scan_partitions
          bdev_file_open_by_dev // alloc file
          fput // won't free until back to userspace
    brd_alloc(1) // failed since mem alloc error inject
  // error path for modprobe will release code segment
  // back to userspace
  __fput
    blkdev_release
      bdev_release
        blkdev_put_whole
          bdev->bd_disk->fops->release // fops is freed now, UAF!

case 2:
    T1:                            T2:
modprobe brd
  brd_init
    brd_alloc(0) // success
                                   open(/dev/ram0)
    brd_alloc(1) // fail
  // error path for modprobe

                                   close(/dev/ram0)
                                   ...
                                   /* UAF! */
                                   bdev->bd_disk->fops->release

Fix this problem by following what loop_init() does. Besides,
reintroduce brd_devices_mutex to help serialize modifications to
brd_list.

Fixes: 7f9b348cb5 ("brd: convert to blk_alloc_disk/blk_cleanup_disk")
Reported-by: Wupeng Ma <mawupeng1@huawei.com>
Signed-off-by: Yang Erkun <yangerkun@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241030034914.907829-1-yangerkun@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-30 07:31:07 -06:00
John Garry
8d3fd059dd loop: Use bdev limit helpers for configuring discard
Instead of directly looking at the request_queue limits, use the bdev
limits helpers, which is preferable.

Signed-off-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20241030111900.3981223-1-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-30 07:23:57 -06:00
Uday Shankar
59eaa01ce7 ublk: support device recovery without I/O queueing
ublk currently supports the following behaviors on ublk server exit:

A: outstanding I/Os get errors, subsequently issued I/Os get errors
B: outstanding I/Os get errors, subsequently issued I/Os queue
C: outstanding I/Os get reissued, subsequently issued I/Os queue

and the following behaviors for recovery of preexisting block devices by
a future incarnation of the ublk server:

1: ublk devices stopped on ublk server exit (no recovery possible)
2: ublk devices are recoverable using start/end_recovery commands

The userspace interface allows selection of combinations of these
behaviors using flags specified at device creation time, namely:

default behavior: A + 1
UBLK_F_USER_RECOVERY: B + 2
UBLK_F_USER_RECOVERY|UBLK_F_USER_RECOVERY_REISSUE: C + 2

The behavior A + 2 is currently unsupported. Add support for this
behavior under the new flag combination
UBLK_F_USER_RECOVERY|UBLK_F_USER_RECOVERY_FAIL_IO.

Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241007182419.3263186-5-ushankar@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-22 08:16:37 -06:00
Uday Shankar
27b5d4170c ublk: merge stop_work and quiesce_work
Save some lines by merging stop_work and quiesce_work into nosrv_work,
which looks at the recovery flags and does the right thing when the "no
ublk server" condition is detected.

Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241007182419.3263186-4-ushankar@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-22 08:16:37 -06:00
Uday Shankar
3b939b8f71 ublk: refactor recovery configuration flag helpers
ublk currently supports the following behaviors on ublk server exit:

A: outstanding I/Os get errors, subsequently issued I/Os get errors
B: outstanding I/Os get errors, subsequently issued I/Os queue
C: outstanding I/Os get reissued, subsequently issued I/Os queue

and the following behaviors for recovery of preexisting block devices by
a future incarnation of the ublk server:

1: ublk devices stopped on ublk server exit (no recovery possible)
2: ublk devices are recoverable using start/end_recovery commands

The userspace interface allows selection of combinations of these
behaviors using flags specified at device creation time, namely:

default behavior: A + 1
UBLK_F_USER_RECOVERY: B + 2
UBLK_F_USER_RECOVERY|UBLK_F_USER_RECOVERY_REISSUE: C + 2

We can't easily change the userspace interface to allow independent
selection of one of {A, B, C} and one of {1, 2}, but we can refactor the
internal helpers which test for the flags. Replace the existing helpers
with the following set:

ublk_nosrv_should_reissue_outstanding: tests for behavior C
ublk_nosrv_[dev_]should_queue_io: tests for behavior B
ublk_nosrv_should_stop_dev: tests for behavior 1

Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241007182419.3263186-3-ushankar@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-22 08:16:37 -06:00
Uday Shankar
d00c0ea179 ublk: check recovery flags for validity
Setting UBLK_F_USER_RECOVERY_REISSUE without also setting
UBLK_F_USER_RECOVERY is currently silently equivalent to not setting any
recovery flags at all, even though that's obviously not intended. Check
for this case and fail add_dev (with a paranoid warning to aid debugging
any program which might rely on the old behavior) with EINVAL if it is
detected.

Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241007182419.3263186-2-ushankar@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-22 08:16:37 -06:00
Jan Kara
fb6f20ecb1 reiserfs: The last commit
Deprecation period of reiserfs ends with the end of this year so it is
time to remove it from the kernel.

Acked-by: Darrick J. Wong <djwong@kernel.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
2024-10-21 16:29:38 +02:00
Linus Torvalds
f8eacd8ad7 block-6.12-20241018
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmcSk4AQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpuuXD/0UERdP+djJNoXBW5Mv7U5a4rJ7ZgfPL7ku
 z3ZfdnNYGitZhYkVjNQ60TLzXRQyUaIIxMVBWzkb59I6ixmuQbzm/lC55B6s/FIR
 bfT3afe1WRgLCaFbStu91qRs/44Mq4yK6wXcIU7LwutRT/5cqZwelqRZLK7DMFln
 zlGX4zNrCMRUDTr6PLa6CvyY4dmQSL17Ib1ypcKXjGs5YjDntzSrIsKVT1Wayans
 WroGGPG6W7r2c2kn8pe4uPIjZVfMUF2vrdIs0KEYaAQOC7ppEucCgDZMEWRs7kdH
 63hheudJjVSwLF/qYnXNHe/Bz12QCZohPp6UsqRpC8o96Ralgo6Q+FxkXsVelMXW
 JKhtDqYGBDHOQrjrEWN1rnYw/DauEQAgvOtdVfEx2IBzPsG07cB8yv8MNA90H9QH
 KStI7h9qnBEMMNcXX8prOymCHNWAeuF4mbitVrRfSfEVm/0BbQ19qoyGrvwNFgEf
 6T+4Xj/P+FsiLVe8vsgBZDaxEEU5Ifd/rki/QFVk/2z72BBZxmdf2nm51SOM28V7
 HGMHwJI3H8rdmPXvt5Q/ve6GWNOYLO5PSAJgSSe96UStvtsAHGB4eM+LykdnE7cI
 SoytU5KfAM8DD6wnyHIgYuvJyZWrmLoVDrRjym8emc2KrJOe7qg+Ah4ERcNTCnhl
 nw50f27G4w==
 =waNY
 -----END PGP SIGNATURE-----

Merge tag 'block-6.12-20241018' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:

 - NVMe pull request via Keith:
     - Fix target passthrough identifier (Nilay)
     - Fix tcp locking (Hannes)
     - Replace list with sbitmap for tracking RDMA rsp tags (Guixen)
     - Remove unnecessary fallthrough statements (Tokunori)
     - Remove ready-without-media support (Greg)
     - Fix multipath partition scan deadlock (Keith)
     - Fix concurrent PCI reset and remove queue mapping (Maurizio)
     - Fabrics shutdown fixes (Nilay)

 - Fix for a kerneldoc warning (Keith)

 - Fix a race with blk-rq-qos and wakeups (Omar)

 - Cleanup of checking for always-set tag_set (SurajSonawane2415)

 - Fix for a crash with CPU hotplug notifiers (Ming)

 - Don't allow zero-copy ublk on unprivileged device (Ming)

 - Use array_index_nospec() for CDROM (Josh)

 - Remove dead code in drbd (David)

 - Tweaks to elevator loading (Breno)

* tag 'block-6.12-20241018' of git://git.kernel.dk/linux:
  cdrom: Avoid barrier_nospec() in cdrom_ioctl_media_changed()
  nvme: use helper nvme_ctrl_state in nvme_keep_alive_finish function
  nvme: make keep-alive synchronous operation
  nvme-loop: flush off pending I/O while shutting down loop controller
  nvme-pci: fix race condition between reset and nvme_dev_disable()
  ublk: don't allow user copy for unprivileged device
  blk-rq-qos: fix crash on rq_qos_wait vs. rq_qos_wake_function race
  nvme-multipath: defer partition scanning
  blk-mq: setup queue ->tag_set before initializing hctx
  elevator: Remove argument from elevator_find_get
  elevator: do not request_module if elevator exists
  drbd: Remove unused conn_lowest_minor
  nvme: disable CC.CRIME (NVME_CC_CRIME)
  nvme: delete unnecessary fallthru comment
  nvmet-rdma: use sbitmap to replace rsp free list
  block: Fix elevator_get_default() checking for NULL q->tag_set
  nvme: tcp: avoid race between queue_lock lock and destroy
  nvmet-passthru: clear EUID/NGUID/UUID while using loop target
  block: fix blk_rq_map_integrity_sg kernel-doc
2024-10-18 15:53:00 -07:00
Ming Lei
42aafd8b48 ublk: don't allow user copy for unprivileged device
UBLK_F_USER_COPY requires userspace to call write() on ublk char
device for filling request buffer, and unprivileged device can't
be trusted.

So don't allow user copy for unprivileged device.

Cc: stable@vger.kernel.org
Fixes: 1172d5b8be ("ublk: support user copy")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241016134847.2911721-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-16 08:08:18 -06:00
Danilo Krummrich
8373147ce4 rust: treewide: switch to our kernel Box type
Now that we got the kernel `Box` type in place, convert all existing
`Box` users to make use of it.

Reviewed-by: Alice Ryhl <aliceryhl@google.com>
Reviewed-by: Benno Lossin <benno.lossin@proton.me>
Reviewed-by: Gary Guo <gary@garyguo.net>
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Link: https://lore.kernel.org/r/20241004154149.93856-13-dakr@kernel.org
Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
2024-10-15 22:56:59 +02:00
Dr. David Alan Gilbert
1e3fc20000 drbd: Remove unused conn_lowest_minor
conn_lowest_minor() last use was removed by 2011 commit
69a227731a ("drbd: Pass a peer device to a number of fuctions")

Remove it.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Link: https://lore.kernel.org/r/20241010204426.277535-1-linux@treblig.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-11 07:11:44 -06:00
Linus Torvalds
360c1f1f24 block-6.12-20241004
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmcAA+4QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgprKuD/99KWHNFwDyhwPVRlK8ZiCo556DWJ9sT6aJ
 ICt0MAehx2KUgjQTcOoAE0gPAd2ASJA+hisvY5yiMtXNvOgyYXcUfH+UqYoFWSO8
 O7vm7/w/YK3jvuwRwTDSH2s3+L/j3oROHZQSZVAkoi9KcbAaPt5FbFbpqWmAuBgJ
 ymsGWybW27GSOCDh55Jnm4IDaHMUfpcsvEg8Ryl6DyD2C0ZkMy64DmGiopUKCyML
 ceeiVUboNzWYg9qeZswKm1OVSSzZWocy4zDbJYjueVPmr7juvQ4ILxHDqKhTOZpO
 JHBnI1c3i6pOyZqtyaU3Eb/I0b3ZebwFhY3p1+CxkW1FIatrML86/b0ZIxowEiJO
 a8tRHkk2fXaTyuaxTPlkaE9P5cIK/BpEvxn/v8mddVPTwjAAAULINfZAYoGhwWyF
 cSQuN+BOJnxU6DUp66OVKlxhLiZAttBy+kyCPaLUVLyJX7w/VaQz2PwQjzHybjy7
 zNz7M4x4/06bFC3ykmLgIRM+Hqo+El0SE6wx/GHi70GuC+cqkHnfUL4Yscj1t6SD
 Uadul3PYaxVBZ7lcdHy09iD8JxdF1wYxyqiGULi7DPyUqBJ2r4e/TH2x2MxZ3sKA
 CEWBVHZ8yNdfckhybLArbdkQYd+GaiFna/ySUvMjrAdF2Hes42uo9eF4L5tAFtxq
 pKg3IPp0oA==
 =ujN7
 -----END PGP SIGNATURE-----

Merge tag 'block-6.12-20241004' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:

 - Fix another use-after-free in aoe

 - Fixup wrong nested non-saving irq disable/restore in blk-iocost

 - Fixup a kerneldoc complaint introduced by a merge window patch

* tag 'block-6.12-20241004' of git://git.kernel.dk/linux:
  aoe: fix the potential use-after-free problem in more places
  blk_iocost: remove some duplicate irq disable/enables
  block: fix blk_rq_map_integrity_sg kernel-doc
2024-10-04 10:43:44 -07:00
Al Viro
5f60d5f6bb move asm/unaligned.h to linux/unaligned.h
asm/unaligned.h is always an include of asm-generic/unaligned.h;
might as well move that thing to linux/unaligned.h and include
that - there's nothing arch-specific in that header.

auto-generated by the following:

for i in `git grep -l -w asm/unaligned.h`; do
	sed -i -e "s/asm\/unaligned.h/linux\/unaligned.h/" $i
done
for i in `git grep -l -w asm-generic/unaligned.h`; do
	sed -i -e "s/asm-generic\/unaligned.h/linux\/unaligned.h/" $i
done
git mv include/asm-generic/unaligned.h include/linux/unaligned.h
git mv tools/include/asm-generic/unaligned.h tools/include/linux/unaligned.h
sed -i -e "/unaligned.h/d" include/asm-generic/Kbuild
sed -i -e "s/__ASM_GENERIC/__LINUX/" include/linux/unaligned.h tools/include/linux/unaligned.h
2024-10-02 17:23:23 -04:00
Chun-Yi Lee
6d6e54fc71 aoe: fix the potential use-after-free problem in more places
For fixing CVE-2023-6270, f98364e926 ("aoe: fix the potential
use-after-free problem in aoecmd_cfg_pkts") makes tx() calling dev_put()
instead of doing in aoecmd_cfg_pkts(). It avoids that the tx() runs
into use-after-free.

Then Nicolai Stange found more places in aoe have potential use-after-free
problem with tx(). e.g. revalidate(), aoecmd_ata_rw(), resend(), probe()
and aoecmd_cfg_rsp(). Those functions also use aoenet_xmit() to push
packet to tx queue. So they should also use dev_hold() to increase the
refcnt of skb->dev.

On the other hand, moving dev_put() to tx() causes that the refcnt of
skb->dev be reduced to a negative value, because corresponding
dev_hold() are not called in revalidate(), aoecmd_ata_rw(), resend(),
probe(), and aoecmd_cfg_rsp(). This patch fixed this issue.

Cc: stable@vger.kernel.org
Link: https://nvd.nist.gov/vuln/detail/CVE-2023-6270
Fixes: f98364e926 ("aoe: fix the potential use-after-free problem in aoecmd_cfg_pkts")
Reported-by: Nicolai Stange <nstange@suse.com>
Signed-off-by: Chun-Yi Lee <jlee@suse.com>
Link: https://lore.kernel.org/stable/20240624064418.27043-1-jlee%40suse.com
Link: https://lore.kernel.org/r/20241002035458.24401-1-jlee@suse.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-02 07:16:44 -06:00
Linus Torvalds
eee280841e 19 hotfixes. 13 are cc:stable.
There's a focus on fixes for the memfd_pin_folios() work which was added
 into 6.11.  Apart from that, the usual shower of singleton fixes.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZvbhSAAKCRDdBJ7gKXxA
 jp8CAP47txk2c+tBLggog2MkQamADY5l5MT6E3fYq3ghSiKtVQEAnqX3LiQJ02tB
 o9LcPcVrM90QntpKrLP1CpWCVdR+zA8=
 =e0QC
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2024-09-27-09-45' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull  misc fixes from Andrew Morton:
 "19 hotfixes.  13 are cc:stable.

  There's a focus on fixes for the memfd_pin_folios() work which was
  added into 6.11. Apart from that, the usual shower of singleton fixes"

* tag 'mm-hotfixes-stable-2024-09-27-09-45' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  ocfs2: fix uninit-value in ocfs2_get_block()
  zram: don't free statically defined names
  memory tiers: use default_dram_perf_ref_source in log message
  Revert "list: test: fix tests for list_cut_position()"
  kselftests: mm: fix wrong __NR_userfaultfd value
  compiler.h: specify correct attribute for .rodata..c_jump_table
  mm/damon/Kconfig: update DAMON doc URL
  mm: kfence: fix elapsed time for allocated/freed track
  ocfs2: fix deadlock in ocfs2_get_system_file_inode
  ocfs2: reserve space for inline xattr before attaching reflink tree
  mm: migrate: annotate data-race in migrate_folio_unmap()
  mm/hugetlb: simplify refs in memfd_alloc_folio
  mm/gup: fix memfd_pin_folios alloc race panic
  mm/gup: fix memfd_pin_folios hugetlb page allocation
  mm/hugetlb: fix memfd_pin_folios resv_huge_pages leak
  mm/hugetlb: fix memfd_pin_folios free_huge_pages leak
  mm/filemap: fix filemap_get_folios_contig THP panic
  mm: make SPLIT_PTE_PTLOCKS depend on SMP
  tools: fix shared radix-tree build
2024-09-27 10:27:22 -07:00
Al Viro
cb787f4ac0 [tree-wide] finally take no_llseek out
no_llseek had been defined to NULL two years ago, in commit 868941b144
("fs: remove no_llseek")

To quote that commit,

  At -rc1 we'll need do a mechanical removal of no_llseek -

  git grep -l -w no_llseek | grep -v porting.rst | while read i; do
	sed -i '/\<no_llseek\>/d' $i
  done

  would do it.

Unfortunately, that hadn't been done.  Linus, could you do that now, so
that we could finally put that thing to rest? All instances are of the
form
	.llseek = no_llseek,
so it's obviously safe.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-09-27 08:18:43 -07:00
Andrey Skvortsov
486fd58af7 zram: don't free statically defined names
When CONFIG_ZRAM_MULTI_COMP isn't set ZRAM_SECONDARY_COMP can hold
default_compressor, because it's the same offset as ZRAM_PRIMARY_COMP, so
we need to make sure that we don't attempt to kfree() the statically
defined compressor name.

This is detected by KASAN.

==================================================================
  Call trace:
   kfree+0x60/0x3a0
   zram_destroy_comps+0x98/0x198 [zram]
   zram_reset_device+0x22c/0x4a8 [zram]
   reset_store+0x1bc/0x2d8 [zram]
   dev_attr_store+0x44/0x80
   sysfs_kf_write+0xfc/0x188
   kernfs_fop_write_iter+0x28c/0x428
   vfs_write+0x4dc/0x9b8
   ksys_write+0x100/0x1f8
   __arm64_sys_write+0x74/0xb8
   invoke_syscall+0xd8/0x260
   el0_svc_common.constprop.0+0xb4/0x240
   do_el0_svc+0x48/0x68
   el0_svc+0x40/0xc8
   el0t_64_sync_handler+0x120/0x130
   el0t_64_sync+0x190/0x198
==================================================================

Link: https://lkml.kernel.org/r/20240923164843.1117010-1-andrej.skvortzov@gmail.com
Fixes: 684826f827 ("zram: free secondary algorithms names")
Signed-off-by: Andrey Skvortsov <andrej.skvortzov@gmail.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reported-by: Venkat Rao Bagalkote <venkat88@linux.vnet.ibm.com>
Closes: https://lore.kernel.org/lkml/57130e48-dbb6-4047-a8c7-ebf5aaea93f4@linux.vnet.ibm.com/
Tested-by: Venkat Rao Bagalkote <venkat88@linux.vnet.ibm.com>
Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Venkat Rao Bagalkote <venkat88@linux.vnet.ibm.com>
Cc: Chris Li <chrisl@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-26 14:01:44 -07:00
Linus Torvalds
11a299a793 for-6.12/block-20240925
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmb0T5AQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpnfHEADCXqmqZC+xr3sHZH9T1lz9KaFp1FjuBhCw
 bGpUgXQ9aLcqQUWJxmYVer8N2x2+Ds+xq4fm/rP1BfvNgRupqheHBwuLxSrz14EX
 lYmKZ+krMIPTDaLFewmEWflDwmZX0WFgV6nKTMLiO5BMeI4zXCkFGtwYFys2+Cdd
 9zYCFPgGDZUR77Ws5PpyqPVz2MoiNtsjrGmHpEmNZ+rIDzlpVOYgYk27X9ZbvNxC
 /l0KTc9+ayAeG0Kx5jO+m6Hrj3I6ehvM9JZMgpS/tF/jtccD2oVkJFJDlU+Jciv6
 BwVzgyDPGV7sXFT1fnSqDBYYwr/73nzNH0Gk8wn4Jg2LhjmVANVo9eQSOXDTYZI+
 O4HfIHGTIrk75TQd4bhq3dqaylS78pKBI/eQJUli2UNoyLWMrMyE88yh2YJam2Fs
 vJ/MHGxvFRurYbAlqLr33nb3ajvpg+D7XuAYfqHPMc2ZUe28Kza50Dj+luNjfVCu
 3qfR6qBlsdWuABtUS3vneB9jZp5jDnOpVfuBgtcAqIboUjehTXsI7If09Ex/mxLq
 O0KqNwBMfunPOKd5kGXlAgY8LRMfOhNaAAFBlXYUZB2eAadQnqVselTFvHMZkXo7
 wH/l6trd+/Tf+7Rav0YduNIlpVr7IctC+A7ph4zPdIjQxFEySCrC7cvAjel29LyV
 zgWW0Mw/sA==
 =yiWu
 -----END PGP SIGNATURE-----

Merge tag 'for-6.12/block-20240925' of git://git.kernel.dk/linux

Pull more block updates from Jens Axboe:

 - Improve blk-integrity segment counting and merging (Keith)

 - NVMe pull request via Keith:
      - Multipath fixes (Hannes)
      - Sysfs attribute list NULL terminate fix (Shin'ichiro)
      - Remove problematic read-back (Keith)

 - Fix for a regression with the IO scheduler switching freezing from
   6.11 (Damien)

 - Use a raw spinlock for sbitmap, as it may get called from preempt
   disabled context (Ming)

 - Cleanup for bd_claiming waiting, using var_waitqueue() rather than
   the bit waitqueues, as that more accurately describes that it does
   (Neil)

 - Various cleanups (Kanchan, Qiu-ji, David)

* tag 'for-6.12/block-20240925' of git://git.kernel.dk/linux:
  nvme: remove CC register read-back during enabling
  nvme: null terminate nvme_tls_attrs
  nvme-multipath: avoid hang on inaccessible namespaces
  nvme-multipath: system fails to create generic nvme device
  lib/sbitmap: define swap_lock as raw_spinlock_t
  block: Remove unused blk_limits_io_{min,opt}
  drbd: Fix atomicity violation in drbd_uuid_set_bm()
  block: Fix elv_iosched_local_module handling of "none" scheduler
  block: remove bogus union
  block: change wait on bd_claiming to use a var_waitqueue
  blk-integrity: improved sg segment mapping
  block: unexport blk_rq_count_integrity_sg
  nvme-rdma: use request to get integrity segments
  scsi: use request to get integrity segments
  block: provide a request helper for user integrity segments
  blk-integrity: consider entire bio list for merging
  blk-integrity: properly account for segments
  blk-mq: set the nr_integrity_segments from bio
  blk-mq: unconditional nr_integrity_segments
2024-09-25 14:56:40 -07:00
Linus Torvalds
617a814f14 ALong with the usual shower of singleton patches, notable patch series in
this pull request are:
 
 "Align kvrealloc() with krealloc()" from Danilo Krummrich.  Adds
 consistency to the APIs and behaviour of these two core allocation
 functions.  This also simplifies/enables Rustification.
 
 "Some cleanups for shmem" from Baolin Wang.  No functional changes - mode
 code reuse, better function naming, logic simplifications.
 
 "mm: some small page fault cleanups" from Josef Bacik.  No functional
 changes - code cleanups only.
 
 "Various memory tiering fixes" from Zi Yan.  A small fix and a little
 cleanup.
 
 "mm/swap: remove boilerplate" from Yu Zhao.  Code cleanups and
 simplifications and .text shrinkage.
 
 "Kernel stack usage histogram" from Pasha Tatashin and Shakeel Butt.  This
 is a feature, it adds new feilds to /proc/vmstat such as
 
     $ grep kstack /proc/vmstat
     kstack_1k 3
     kstack_2k 188
     kstack_4k 11391
     kstack_8k 243
     kstack_16k 0
 
 which tells us that 11391 processes used 4k of stack while none at all
 used 16k.  Useful for some system tuning things, but partivularly useful
 for "the dynamic kernel stack project".
 
 "kmemleak: support for percpu memory leak detect" from Pavel Tikhomirov.
 Teaches kmemleak to detect leaksage of percpu memory.
 
 "mm: memcg: page counters optimizations" from Roman Gushchin.  "3
 independent small optimizations of page counters".
 
 "mm: split PTE/PMD PT table Kconfig cleanups+clarifications" from David
 Hildenbrand.  Improves PTE/PMD splitlock detection, makes powerpc/8xx work
 correctly by design rather than by accident.
 
 "mm: remove arch_make_page_accessible()" from David Hildenbrand.  Some
 folio conversions which make arch_make_page_accessible() unneeded.
 
 "mm, memcg: cg2 memory{.swap,}.peak write handlers" fro David Finkel.
 Cleans up and fixes our handling of the resetting of the cgroup/process
 peak-memory-use detector.
 
 "Make core VMA operations internal and testable" from Lorenzo Stoakes.
 Rationalizaion and encapsulation of the VMA manipulation APIs.  With a
 view to better enable testing of the VMA functions, even from a
 userspace-only harness.
 
 "mm: zswap: fixes for global shrinker" from Takero Funaki.  Fix issues in
 the zswap global shrinker, resulting in improved performance.
 
 "mm: print the promo watermark in zoneinfo" from Kaiyang Zhao.  Fill in
 some missing info in /proc/zoneinfo.
 
 "mm: replace follow_page() by folio_walk" from David Hildenbrand.  Code
 cleanups and rationalizations (conversion to folio_walk()) resulting in
 the removal of follow_page().
 
 "improving dynamic zswap shrinker protection scheme" from Nhat Pham.  Some
 tuning to improve zswap's dynamic shrinker.  Significant reductions in
 swapin and improvements in performance are shown.
 
 "mm: Fix several issues with unaccepted memory" from Kirill Shutemov.
 Improvements to the new unaccepted memory feature,
 
 "mm/mprotect: Fix dax puds" from Peter Xu.  Implements mprotect on DAX
 PUDs.  This was missing, although nobody seems to have notied yet.
 
 "Introduce a store type enum for the Maple tree" from Sidhartha Kumar.
 Cleanups and modest performance improvements for the maple tree library
 code.
 
 "memcg: further decouple v1 code from v2" from Shakeel Butt.  Move more
 cgroup v1 remnants away from the v2 memcg code.
 
 "memcg: initiate deprecation of v1 features" from Shakeel Butt.  Adds
 various warnings telling users that memcg v1 features are deprecated.
 
 "mm: swap: mTHP swap allocator base on swap cluster order" from Chris Li.
 Greatly improves the success rate of the mTHP swap allocation.
 
 "mm: introduce numa_memblks" from Mike Rapoport.  Moves various disparate
 per-arch implementations of numa_memblk code into generic code.
 
 "mm: batch free swaps for zap_pte_range()" from Barry Song.  Greatly
 improves the performance of munmap() of swap-filled ptes.
 
 "support large folio swap-out and swap-in for shmem" from Baolin Wang.
 With this series we no longer split shmem large folios into simgle-page
 folios when swapping out shmem.
 
 "mm/hugetlb: alloc/free gigantic folios" from Yu Zhao.  Nice performance
 improvements and code reductions for gigantic folios.
 
 "support shmem mTHP collapse" from Baolin Wang.  Adds support for
 khugepaged's collapsing of shmem mTHP folios.
 
 "mm: Optimize mseal checks" from Pedro Falcato.  Fixes an mprotect()
 performance regression due to the addition of mseal().
 
 "Increase the number of bits available in page_type" from Matthew Wilcox.
 Increases the number of bits available in page_type!
 
 "Simplify the page flags a little" from Matthew Wilcox.  Many legacy page
 flags are now folio flags, so the page-based flags and their
 accessors/mutators can be removed.
 
 "mm: store zero pages to be swapped out in a bitmap" from Usama Arif.  An
 optimization which permits us to avoid writing/reading zero-filled zswap
 pages to backing store.
 
 "Avoid MAP_FIXED gap exposure" from Liam Howlett.  Fixes a race window
 which occurs when a MAP_FIXED operqtion is occurring during an unrelated
 vma tree walk.
 
 "mm: remove vma_merge()" from Lorenzo Stoakes.  Major rotorooting of the
 vma_merge() functionality, making ot cleaner, more testable and better
 tested.
 
 "misc fixups for DAMON {self,kunit} tests" from SeongJae Park.  Minor
 fixups of DAMON selftests and kunit tests.
 
 "mm: memory_hotplug: improve do_migrate_range()" from Kefeng Wang.  Code
 cleanups and folio conversions.
 
 "Shmem mTHP controls and stats improvements" from Ryan Roberts.  Cleanups
 for shmem controls and stats.
 
 "mm: count the number of anonymous THPs per size" from Barry Song.  Expose
 additional anon THP stats to userspace for improved tuning.
 
 "mm: finish isolate/putback_lru_page()" from Kefeng Wang: more folio
 conversions and removal of now-unused page-based APIs.
 
 "replace per-quota region priorities histogram buffer with per-context
 one" from SeongJae Park.  DAMON histogram rationalization.
 
 "Docs/damon: update GitHub repo URLs and maintainer-profile" from SeongJae
 Park.  DAMON documentation updates.
 
 "mm/vdpa: correct misuse of non-direct-reclaim __GFP_NOFAIL and improve
 related doc and warn" from Jason Wang: fixes usage of page allocator
 __GFP_NOFAIL and GFP_ATOMIC flags.
 
 "mm: split underused THPs" from Yu Zhao.  Improve THP=always policy - this
 was overprovisioning THPs in sparsely accessed memory areas.
 
 "zram: introduce custom comp backends API" frm Sergey Senozhatsky.  Add
 support for zram run-time compression algorithm tuning.
 
 "mm: Care about shadow stack guard gap when getting an unmapped area" from
 Mark Brown.  Fix up the various arch_get_unmapped_area() implementations
 to better respect guard areas.
 
 "Improve mem_cgroup_iter()" from Kinsey Ho.  Improve the reliability of
 mem_cgroup_iter() and various code cleanups.
 
 "mm: Support huge pfnmaps" from Peter Xu.  Extends the usage of huge
 pfnmap support.
 
 "resource: Fix region_intersects() vs add_memory_driver_managed()" from
 Huang Ying.  Fix a bug in region_intersects() for systems with CXL memory.
 
 "mm: hwpoison: two more poison recovery" from Kefeng Wang.  Teaches a
 couple more code paths to correctly recover from the encountering of
 poisoned memry.
 
 "mm: enable large folios swap-in support" from Barry Song.  Support the
 swapin of mTHP memory into appropriately-sized folios, rather than into
 single-page folios.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZu1BBwAKCRDdBJ7gKXxA
 jlWNAQDYlqQLun7bgsAN4sSvi27VUuWv1q70jlMXTfmjJAvQqwD/fBFVR6IOOiw7
 AkDbKWP2k0hWPiNJBGwoqxdHHx09Xgo=
 =s0T+
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2024-09-20-02-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:
 "Along with the usual shower of singleton patches, notable patch series
  in this pull request are:

   - "Align kvrealloc() with krealloc()" from Danilo Krummrich. Adds
     consistency to the APIs and behaviour of these two core allocation
     functions. This also simplifies/enables Rustification.

   - "Some cleanups for shmem" from Baolin Wang. No functional changes -
     mode code reuse, better function naming, logic simplifications.

   - "mm: some small page fault cleanups" from Josef Bacik. No
     functional changes - code cleanups only.

   - "Various memory tiering fixes" from Zi Yan. A small fix and a
     little cleanup.

   - "mm/swap: remove boilerplate" from Yu Zhao. Code cleanups and
     simplifications and .text shrinkage.

   - "Kernel stack usage histogram" from Pasha Tatashin and Shakeel
     Butt. This is a feature, it adds new feilds to /proc/vmstat such as

       $ grep kstack /proc/vmstat
       kstack_1k 3
       kstack_2k 188
       kstack_4k 11391
       kstack_8k 243
       kstack_16k 0

     which tells us that 11391 processes used 4k of stack while none at
     all used 16k. Useful for some system tuning things, but
     partivularly useful for "the dynamic kernel stack project".

   - "kmemleak: support for percpu memory leak detect" from Pavel
     Tikhomirov. Teaches kmemleak to detect leaksage of percpu memory.

   - "mm: memcg: page counters optimizations" from Roman Gushchin. "3
     independent small optimizations of page counters".

   - "mm: split PTE/PMD PT table Kconfig cleanups+clarifications" from
     David Hildenbrand. Improves PTE/PMD splitlock detection, makes
     powerpc/8xx work correctly by design rather than by accident.

   - "mm: remove arch_make_page_accessible()" from David Hildenbrand.
     Some folio conversions which make arch_make_page_accessible()
     unneeded.

   - "mm, memcg: cg2 memory{.swap,}.peak write handlers" fro David
     Finkel. Cleans up and fixes our handling of the resetting of the
     cgroup/process peak-memory-use detector.

   - "Make core VMA operations internal and testable" from Lorenzo
     Stoakes. Rationalizaion and encapsulation of the VMA manipulation
     APIs. With a view to better enable testing of the VMA functions,
     even from a userspace-only harness.

   - "mm: zswap: fixes for global shrinker" from Takero Funaki. Fix
     issues in the zswap global shrinker, resulting in improved
     performance.

   - "mm: print the promo watermark in zoneinfo" from Kaiyang Zhao. Fill
     in some missing info in /proc/zoneinfo.

   - "mm: replace follow_page() by folio_walk" from David Hildenbrand.
     Code cleanups and rationalizations (conversion to folio_walk())
     resulting in the removal of follow_page().

   - "improving dynamic zswap shrinker protection scheme" from Nhat
     Pham. Some tuning to improve zswap's dynamic shrinker. Significant
     reductions in swapin and improvements in performance are shown.

   - "mm: Fix several issues with unaccepted memory" from Kirill
     Shutemov. Improvements to the new unaccepted memory feature,

   - "mm/mprotect: Fix dax puds" from Peter Xu. Implements mprotect on
     DAX PUDs. This was missing, although nobody seems to have notied
     yet.

   - "Introduce a store type enum for the Maple tree" from Sidhartha
     Kumar. Cleanups and modest performance improvements for the maple
     tree library code.

   - "memcg: further decouple v1 code from v2" from Shakeel Butt. Move
     more cgroup v1 remnants away from the v2 memcg code.

   - "memcg: initiate deprecation of v1 features" from Shakeel Butt.
     Adds various warnings telling users that memcg v1 features are
     deprecated.

   - "mm: swap: mTHP swap allocator base on swap cluster order" from
     Chris Li. Greatly improves the success rate of the mTHP swap
     allocation.

   - "mm: introduce numa_memblks" from Mike Rapoport. Moves various
     disparate per-arch implementations of numa_memblk code into generic
     code.

   - "mm: batch free swaps for zap_pte_range()" from Barry Song. Greatly
     improves the performance of munmap() of swap-filled ptes.

   - "support large folio swap-out and swap-in for shmem" from Baolin
     Wang. With this series we no longer split shmem large folios into
     simgle-page folios when swapping out shmem.

   - "mm/hugetlb: alloc/free gigantic folios" from Yu Zhao. Nice
     performance improvements and code reductions for gigantic folios.

   - "support shmem mTHP collapse" from Baolin Wang. Adds support for
     khugepaged's collapsing of shmem mTHP folios.

   - "mm: Optimize mseal checks" from Pedro Falcato. Fixes an mprotect()
     performance regression due to the addition of mseal().

   - "Increase the number of bits available in page_type" from Matthew
     Wilcox. Increases the number of bits available in page_type!

   - "Simplify the page flags a little" from Matthew Wilcox. Many legacy
     page flags are now folio flags, so the page-based flags and their
     accessors/mutators can be removed.

   - "mm: store zero pages to be swapped out in a bitmap" from Usama
     Arif. An optimization which permits us to avoid writing/reading
     zero-filled zswap pages to backing store.

   - "Avoid MAP_FIXED gap exposure" from Liam Howlett. Fixes a race
     window which occurs when a MAP_FIXED operqtion is occurring during
     an unrelated vma tree walk.

   - "mm: remove vma_merge()" from Lorenzo Stoakes. Major rotorooting of
     the vma_merge() functionality, making ot cleaner, more testable and
     better tested.

   - "misc fixups for DAMON {self,kunit} tests" from SeongJae Park.
     Minor fixups of DAMON selftests and kunit tests.

   - "mm: memory_hotplug: improve do_migrate_range()" from Kefeng Wang.
     Code cleanups and folio conversions.

   - "Shmem mTHP controls and stats improvements" from Ryan Roberts.
     Cleanups for shmem controls and stats.

   - "mm: count the number of anonymous THPs per size" from Barry Song.
     Expose additional anon THP stats to userspace for improved tuning.

   - "mm: finish isolate/putback_lru_page()" from Kefeng Wang: more
     folio conversions and removal of now-unused page-based APIs.

   - "replace per-quota region priorities histogram buffer with
     per-context one" from SeongJae Park. DAMON histogram
     rationalization.

   - "Docs/damon: update GitHub repo URLs and maintainer-profile" from
     SeongJae Park. DAMON documentation updates.

   - "mm/vdpa: correct misuse of non-direct-reclaim __GFP_NOFAIL and
     improve related doc and warn" from Jason Wang: fixes usage of page
     allocator __GFP_NOFAIL and GFP_ATOMIC flags.

   - "mm: split underused THPs" from Yu Zhao. Improve THP=always policy.
     This was overprovisioning THPs in sparsely accessed memory areas.

   - "zram: introduce custom comp backends API" frm Sergey Senozhatsky.
     Add support for zram run-time compression algorithm tuning.

   - "mm: Care about shadow stack guard gap when getting an unmapped
     area" from Mark Brown. Fix up the various arch_get_unmapped_area()
     implementations to better respect guard areas.

   - "Improve mem_cgroup_iter()" from Kinsey Ho. Improve the reliability
     of mem_cgroup_iter() and various code cleanups.

   - "mm: Support huge pfnmaps" from Peter Xu. Extends the usage of huge
     pfnmap support.

   - "resource: Fix region_intersects() vs add_memory_driver_managed()"
     from Huang Ying. Fix a bug in region_intersects() for systems with
     CXL memory.

   - "mm: hwpoison: two more poison recovery" from Kefeng Wang. Teaches
     a couple more code paths to correctly recover from the encountering
     of poisoned memry.

   - "mm: enable large folios swap-in support" from Barry Song. Support
     the swapin of mTHP memory into appropriately-sized folios, rather
     than into single-page folios"

* tag 'mm-stable-2024-09-20-02-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (416 commits)
  zram: free secondary algorithms names
  uprobes: turn xol_area->pages[2] into xol_area->page
  uprobes: introduce the global struct vm_special_mapping xol_mapping
  Revert "uprobes: use vm_special_mapping close() functionality"
  mm: support large folios swap-in for sync io devices
  mm: add nr argument in mem_cgroup_swapin_uncharge_swap() helper to support large folios
  mm: fix swap_read_folio_zeromap() for large folios with partial zeromap
  mm/debug_vm_pgtable: Use pxdp_get() for accessing page table entries
  set_memory: add __must_check to generic stubs
  mm/vma: return the exact errno in vms_gather_munmap_vmas()
  memcg: cleanup with !CONFIG_MEMCG_V1
  mm/show_mem.c: report alloc tags in human readable units
  mm: support poison recovery from copy_present_page()
  mm: support poison recovery from do_cow_fault()
  resource, kunit: add test case for region_intersects()
  resource: make alloc_free_mem_region() works for iomem_resource
  mm: z3fold: deprecate CONFIG_Z3FOLD
  vfio/pci: implement huge_fault support
  mm/arm64: support large pfn mappings
  mm/x86: support large pfn mappings
  ...
2024-09-21 07:29:05 -07:00
Qiu-ji Chen
2f02b5af3a drbd: Fix atomicity violation in drbd_uuid_set_bm()
The violation of atomicity occurs when the drbd_uuid_set_bm function is
executed simultaneously with modifying the value of
device->ldev->md.uuid[UI_BITMAP]. Consider a scenario where, while
device->ldev->md.uuid[UI_BITMAP] passes the validity check when its
value is not zero, the value of device->ldev->md.uuid[UI_BITMAP] is
written to zero. In this case, the check in drbd_uuid_set_bm might refer
to the old value of device->ldev->md.uuid[UI_BITMAP] (before locking),
which allows an invalid value to pass the validity check, resulting in
inconsistency.

To address this issue, it is recommended to include the data validity
check within the locked section of the function. This modification
ensures that the value of device->ldev->md.uuid[UI_BITMAP] does not
change during the validation process, thereby maintaining its integrity.

This possible bug is found by an experimental static analysis tool
developed by our team. This tool analyzes the locking APIs to extract
function pairs that can be concurrently executed, and then analyzes the
instructions in the paired functions to identify possible concurrency
bugs including data races and atomicity violations.

Fixes: 9f2247bb9b ("drbd: Protect accesses to the uuid set with a spinlock")
Cc: stable@vger.kernel.org
Signed-off-by: Qiu-ji Chen <chenqiuji666@gmail.com>
Reviewed-by: Philipp Reisner <philipp.reisner@linbit.com>
Link: https://lore.kernel.org/r/20240913083504.10549-1-chenqiuji666@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-18 04:16:23 -06:00
Jens Axboe
42b16d3ac3 Linux 6.11
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmbm9fQeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGXcwH/A8+IXnrGv+VzYgD
 +mE4hgGGHt4dClcUZ31gQetkkT6xktEVp6pB6JkFO7oEgBiTkJBbYGl6VZtsAIOd
 Fi3jic8ik0uhZLFcxDJcHTceh6Pw8bkhWoh0tkF3bkDRwbppJdG7Khyk8DxTl24w
 ldqh9om2cC7w9IPVx93xTgKgMMZ63qiJyUdTvxEZI3BG8F70smlgZSPskLp2Iktd
 FIJZPcyKM0bhJYwZOpXK0vx5C2cA4oIW4xriHUw4aklv646OBxNKevB2JJAft2uA
 6LyvuLgnYn/OpdFGZ8slvdmhm6hLWft5B1/bWKorUkz7p5YGiySFzpkMVAkNJ6mS
 cRwHJNc=
 =flw3
 -----END PGP SIGNATURE-----

Merge tag 'v6.11' into for-6.12/block

Merge in 6.11 final to get the fix for preventing deadlocks on an
elevator switch, as there's a fixup for that patch.

* tag 'v6.11': (1788 commits)
  Linux 6.11
  Revert "KVM: VMX: Always honor guest PAT on CPUs that support self-snoop"
  pinctrl: pinctrl-cy8c95x0: Fix regcache
  cifs: Fix signature miscalculation
  mm: avoid leaving partial pfn mappings around in error case
  drm/xe/client: add missing bo locking in show_meminfo()
  drm/xe/client: fix deadlock in show_meminfo()
  drm/xe/oa: Enable Xe2+ PES disaggregation
  drm/xe/display: fix compat IS_DISPLAY_STEP() range end
  drm/xe: Fix access_ok check in user_fence_create
  drm/xe: Fix possible UAF in guc_exec_queue_process_msg
  drm/xe: Remove fence check from send_tlb_invalidation
  drm/xe/gt: Remove double include
  net: netfilter: move nf flowtable bpf initialization in nf_flow_table_module_init()
  PCI: Fix potential deadlock in pcim_intx()
  workqueue: Clear worker->pool in the worker thread context
  net: tighten bad gso csum offset check in virtio_net_hdr
  netlink: specs: mptcp: fix port endianness
  net: dpaa: Pad packets to ETH_ZLEN
  mptcp: pm: Fix uaf in __timer_delete_sync
  ...
2024-09-17 08:32:53 -06:00
Sergey Senozhatsky
684826f827 zram: free secondary algorithms names
We need to kfree() secondary algorithms names when reset zram device that
had multi-streams, otherwise we leak memory.

[senozhatsky@chromium.org: kfree(NULL) is legal]
  Link: https://lkml.kernel.org/r/20240917013021.868769-1-senozhatsky@chromium.org
Link: https://lkml.kernel.org/r/20240911025600.3681789-1-senozhatsky@chromium.org
Fixes: 001d927357 ("zram: add recompression algorithm sysfs knob")
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-17 01:07:01 -07:00
Linus Torvalds
26bb0d3f38 for-6.12/block-20240913
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmbkZhQQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpjOKD/0fzd4yOcqxSI9W3OLGd04VrOTJIQa4CRbV
 GmoTq39pOeIDVGug5ekkTpqqHHnuGk+nQhCzD9vsN/eTmC7yZOIr847O2aWzvYEn
 PzFRgmJpoo2E9sr/IsTR5LnJjbaIZhQVkqLH6ZOj9tpKlVwN2SK0nIRVNrAi5zgT
 MaDrto/2OUld+vmA99Rgb23jxM6UBdCPIjuiVa+11Vg9Z3D1tWbBmrsG7OMysyIf
 FbASBeKHqFSO61/ipFCZv6VV1X8zoWEVyT8n4A1yUbbN5rLzPgoQJVbfSqQRXIdr
 cdrKeCbKxl+joSgKS6LKpvnfwRgGF+hgAfpZg4c0vrbZGTQcRhhLFECyh/aVI08F
 p5TOMArhVaX59664gHgSPq4KnGTXOO29dot9N3Jya/ZQnxinjY9r+GVOfLuduPPy
 1B04vab8oAsk4zK7fZbkDxgYUyifwzK/vQ6OqYq2mYdpdIS/AE7T2ou61Bz5mI7I
 /BuucNV0Z96OKlyLEXwXXZjZgNu1TFcq6ARIBJ8L08PY64Fesj5BXabRyXkeNH26
 0exyz9heeJs6OwRGfngXmS24tDSS0k74CeZX3KoePNj69u6KCn346KiU1qgntwwD
 E5F7AEHqCl5FjUEIWB4M1EPlfA8U0MzOL+tkx2xKJAjsU60wAy7jRSyOIcqodpMs
 6UlPcJzgYg==
 =uuLl
 -----END PGP SIGNATURE-----

Merge tag 'for-6.12/block-20240913' of git://git.kernel.dk/linux

Pull block updates from Jens Axboe:

 - MD changes via Song:
      - md-bitmap refactoring (Yu Kuai)
      - raid5 performance optimization (Artur Paszkiewicz)
      - Other small fixes (Yu Kuai, Chen Ni)
      - Add a sysfs entry 'new_level' (Xiao Ni)
      - Improve information reported in /proc/mdstat (Mateusz Kusiak)

 - NVMe changes via Keith:
      - Asynchronous namespace scanning (Stuart)
      - TCP TLS updates (Hannes)
      - RDMA queue controller validation (Niklas)
      - Align field names to the spec (Anuj)
      - Metadata support validation (Puranjay)
      - A syntax cleanup (Shen)
      - Fix a Kconfig linking error (Arnd)
      - New queue-depth quirk (Keith)

 - Add missing unplug trace event (Keith)

 - blk-iocost fixes (Colin, Konstantin)

 - t10-pi modular removal and fixes (Alexey)

 - Fix for potential BLKSECDISCARD overflow (Alexey)

 - bio splitting cleanups and fixes (Christoph)

 - Deal with folios rather than rather than pages, speeding up how the
   block layer handles bigger IOs (Kundan)

 - Use spinlocks rather than bit spinlocks in zram (Sebastian, Mike)

 - Reduce zoned device overhead in ublk (Ming)

 - Add and use sendpages_ok() for drbd and nvme-tcp (Ofir)

 - Fix regression in partition error pointer checking (Riyan)

 - Add support for write zeroes and rotational status in nbd (Wouter)

 - Add Yu Kuai as new BFQ maintainer. The scheduler has been
   unmaintained for quite a while.

 - Various sets of fixes for BFQ (Yu Kuai)

 - Misc fixes and cleanups (Alvaro, Christophe, Li, Md Haris, Mikhail,
   Yang)

* tag 'for-6.12/block-20240913' of git://git.kernel.dk/linux: (120 commits)
  nvme-pci: qdepth 1 quirk
  block: fix potential invalid pointer dereference in blk_add_partition
  blk_iocost: make read-only static array vrate_adj_pct const
  block: unpin user pages belonging to a folio at once
  mm: release number of pages of a folio
  block: introduce folio awareness and add a bigger size from folio
  block: Added folio-ized version of bio_add_hw_page()
  block, bfq: factor out a helper to split bfqq in bfq_init_rq()
  block, bfq: remove local variable 'bfqq_already_existing' in bfq_init_rq()
  block, bfq: remove local variable 'split' in bfq_init_rq()
  block, bfq: remove bfq_log_bfqg()
  block, bfq: merge bfq_release_process_ref() into bfq_put_cooperator()
  block, bfq: fix procress reference leakage for bfqq in merge chain
  block, bfq: fix uaf for accessing waker_bfqq after splitting
  blk-throttle: support prioritized processing of metadata
  blk-throttle: remove last_low_overflow_time
  drbd: Add NULL check for net_conf to prevent dereference in state validation
  nvme-tcp: fix link failure for TCP auth
  blk-mq: add missing unplug trace event
  mtip32xx: Remove redundant null pointer checks in mtip_hw_debugfs_init()
  ...
2024-09-16 13:33:06 +02:00
Mikhail Lobanov
a5e61b50c9 drbd: Add NULL check for net_conf to prevent dereference in state validation
If the net_conf pointer is NULL and the code attempts to access its
fields without a check, it will lead to a null pointer dereference.
Add a NULL check before dereferencing the pointer.

Found by Linux Verification Center (linuxtesting.org) with SVACE.

Fixes: 44ed167da7 ("drbd: rcu_read_lock() and rcu_dereference() for tconn->net_conf")
Cc: stable@vger.kernel.org
Signed-off-by: Mikhail Lobanov <m.lobanov@rosalinux.ru>
Link: https://lore.kernel.org/r/20240909133740.84297-1-m.lobanov@rosalinux.ru
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-10 13:44:06 -06:00
Sergey Senozhatsky
e899007a5e zram: support priority parameter in recompression
recompress device attribute supports alg=NAME parameter so that we can
specify only one particular algorithm we want to perform recompression
with.  However, with algo params we now can have several exactly same
secondary algorithms but each with its own params tuning (e.g.  priority 1
configured to use more aggressive level, and priority 2 configured to use
a pre-trained dictionary).  Support priority=NUM parameter so that we can
correctly determine which secondary algorithm we want to use.

Link: https://lkml.kernel.org/r/20240902105656.1383858-25-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nick Terrell <terrelln@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09 16:39:12 -07:00
Sergey Senozhatsky
6a559ecd6e zram: add dictionary support to zstd backend
This adds support for pre-trained zstd dictionaries [1] Dictionary is
setup in params once (per-comp) and loaded to Cctx and Dctx by reference,
so we don't allocate extra memory.

TEST
====

*** zstd
/sys/block/zram0/mm_stat
1750654976 504565092 514203648        0 514203648        1        0    34204    34204

*** zstd dict=/etc/zstd-dict-amd64
/sys/block/zram0/mm_stat
1750638592 465851259 475373568        0 475373568        1        0    34185    34185

*** zstd level=8 dict=/etc/zstd-dict-amd64
/sys/block/zram0/mm_stat
1750642688 430765171 439955456        0 439955456        1        0    34185    34185

[1] https://github.com/facebook/zstd/blob/dev/programs/zstd.1.md#dictionary-builder

Link: https://lkml.kernel.org/r/20240902105656.1383858-23-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nick Terrell <terrelln@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09 16:39:11 -07:00
Sergey Senozhatsky
1e673c8cf7 zram: add dictionary support to lz4hc
Support pre-trained dictionary param.  Just like lz4, lz4hc doesn't
mandate specific format of the dictionary and zstd --train can be used to
train a dictionary for lz4, according to [1].

TEST
====

*** lz4hc
/sys/block/zram0/mm_stat
1750638592 608954620 621031424        0 621031424        1        0    34288    34288

*** lz4hc dict=/etc/lz4-dict-amd64
/sys/block/zram0/mm_stat
1750671360 505068582 514994176        0 514994176        1        0    34278    34278

[1] https://github.com/lz4/lz4/issues/557

Link: https://lkml.kernel.org/r/20240902105656.1383858-22-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nick Terrell <terrelln@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09 16:39:11 -07:00
Sergey Senozhatsky
fb4f644ee8 zram: add dictionary support to lz4
Support pre-trained dictionary param.  lz4 doesn't mandate specific format
of the dictionary and even zstd --train can be used to train a dictionary
for lz4, according to [1].

TEST
====

*** lz4
/sys/block/zram0/mm_stat
1750654976 664188565 676864000        0 676864000        1        0    34288    34288

*** lz4 dict=/etc/lz4-dict-amd64
/sys/block/zram0/mm_stat
1750638592 619891141 632053760        0 632053760        1        0    34278    34278

*** lz4 level=5 dict=/etc/lz4-dict-amd64
/sys/block/zram0/mm_stat
1750638592 727174243 740810752        0 740810752        1        0    34437    34437

[1] https://github.com/lz4/lz4/issues/557

Link: https://lkml.kernel.org/r/20240902105656.1383858-21-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nick Terrell <terrelln@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09 16:39:11 -07:00
Sergey Senozhatsky
b8f03cb703 zram: move immutable comp params away from per-CPU context
Immutable params never change once comp has been allocated and setup, so
we don't need to store multiple copies of them in each per-CPU backend
context.  Move those to per-comp zcomp_params and pass it to backends
callbacks for requests execution.  Basically, this means parameters
sharing between different contexts.

Also introduce two new backends callbacks: setup_params() and
release_params().  First, we need to validate params in a driver-specific
way; second, driver may want to allocate its specific representation of
the params which is needed to execute requests.

Link: https://lkml.kernel.org/r/20240902105656.1383858-20-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nick Terrell <terrelln@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09 16:39:10 -07:00
Sergey Senozhatsky
6a81bdfeb3 zram: introduce zcomp_ctx structure
Keep run-time driver data (scratch buffers, etc.) in zcomp_ctx structure. 
This structure is allocated per-CPU because drivers (backends) need to
modify its content during requests execution.

We will split mutable and immutable driver data, this is a preparation
path.

Link: https://lkml.kernel.org/r/20240902105656.1383858-19-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nick Terrell <terrelln@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09 16:39:10 -07:00
Sergey Senozhatsky
52c7b4e2ba zram: introduce zcomp_req structure
Encapsulate compression/decompression data in zcomp_req structure.

Link: https://lkml.kernel.org/r/20240902105656.1383858-18-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nick Terrell <terrelln@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09 16:39:10 -07:00
Sergey Senozhatsky
dea77d7aea zram: add support for dict comp config
Handle dict=path algorithm param so that we can read a pre-trained
compression algorithm dictionary which we then pass to the backend
configuration.

Link: https://lkml.kernel.org/r/20240902105656.1383858-17-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nick Terrell <terrelln@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09 16:39:10 -07:00
Sergey Senozhatsky
4eac932103 zram: introduce algorithm_params device attribute
This attribute is used to setup compression algorithms' parameters, so we
can tweak algorithms' characteristics.  At this point only 'level' is
supported (to be extended in the future).

Each call sets up parameters for one particular algorithm, which should be
specified either by the algorithm's priority or algo name.  This is
expected to be called after corresponding algorithm is selected via
comp_algorithm or recomp_algorithm.

 echo "priority=0 level=1" > /sys/block/zram0/algorithm_params
or
 echo "algo=zstd level=1" > /sys/block/zram0/algorithm_params

Link: https://lkml.kernel.org/r/20240902105656.1383858-16-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nick Terrell <terrelln@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09 16:39:09 -07:00
Sergey Senozhatsky
eb826a0190 zram: recalculate zstd compression params once
zstd compression params depends on level, but are constant for a given
instance of zstd compression backend.  Calculate once (during ctx
creation).

Link: https://lkml.kernel.org/r/20240902105656.1383858-15-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nick Terrell <terrelln@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09 16:39:09 -07:00
Sergey Senozhatsky
f2bac7ad18 zram: introduce zcomp_params structure
We will store a per-algorithm parameters there (compression level,
dictionary, dictionary size, etc.).

Link: https://lkml.kernel.org/r/20240902105656.1383858-14-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nick Terrell <terrelln@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09 16:39:09 -07:00
Sergey Senozhatsky
1a78390d87 zram: check that backends array has at least one backend
Make sure that backends array has anything apart from the sentinel NULL
value.

We also select LZO_BACKEND if none backends were selected.

Link: https://lkml.kernel.org/r/20240902105656.1383858-13-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nick Terrell <terrelln@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09 16:39:09 -07:00
Sergey Senozhatsky
1d3100cf14 zram: add 842 compression backend support
Add s/w 842 compression support.

Link: https://lkml.kernel.org/r/20240902105656.1383858-12-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nick Terrell <terrelln@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09 16:39:08 -07:00
Sergey Senozhatsky
84112e314f zram: add zlib compression backend support
Add s/w zlib (inflate/deflate) compression.

Link: https://lkml.kernel.org/r/20240902105656.1383858-11-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nick Terrell <terrelln@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09 16:39:08 -07:00
Sergey Senozhatsky
dbf2763cec zram: pass estimated src size hint to zstd
zram works with PAGE_SIZE buffers, so we always know exact size of the
source buffer and hence can pass estimated_src_size to zstd_get_params().

This hint on x86_64, for example, reduces the size of the work memory
buffer from 1303520 bytes down to 90080 bytes.  Given that compression
streams are per-CPU that's quite some memory saving.

Link: https://lkml.kernel.org/r/20240902105656.1383858-10-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nick Terrell <terrelln@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09 16:39:08 -07:00
Sergey Senozhatsky
73e7d81abb zram: add zstd compression backend support
Add s/w zstd compression.

Link: https://lkml.kernel.org/r/20240902105656.1383858-9-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nick Terrell <terrelln@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09 16:39:08 -07:00
Sergey Senozhatsky
c60a4ef544 zram: add lz4hc compression backend support
Add s/w lz4hc compression support.

Link: https://lkml.kernel.org/r/20240902105656.1383858-8-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nick Terrell <terrelln@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09 16:39:07 -07:00
Sergey Senozhatsky
22d651c3b3 zram: add lz4 compression backend support
Add s/w lz4 compression support.

Link: https://lkml.kernel.org/r/20240902105656.1383858-7-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nick Terrell <terrelln@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09 16:39:07 -07:00
Sergey Senozhatsky
2152247c55 zram: add lzo and lzorle compression backends support
Add s/w lzo/lzorle compression support.

Link: https://lkml.kernel.org/r/20240902105656.1383858-6-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nick Terrell <terrelln@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09 16:39:07 -07:00
Sergey Senozhatsky
917a59e81c zram: introduce custom comp backends API
Moving to custom backends implementation gives us ability to have our own
minimalistic and extendable API, and algorithms tunings becomes possible.

The list of compression backends is empty at this point, we will add
backends in the followup patches.

Link: https://lkml.kernel.org/r/20240902105656.1383858-5-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nick Terrell <terrelln@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09 16:39:07 -07:00
Li Zetao
a02e98bebc mtip32xx: Remove redundant null pointer checks in mtip_hw_debugfs_init()
Since the debugfs_create_dir() never returns a null pointer, checking
the return value for a null pointer is redundant. Since
debugfs_create_file() can deal with a ERR_PTR() style pointer, drop
the check.  Since mtip_hw_debugfs_init does not pay attention to the
return value, its return type can be changed to void.

Signed-off-by: Li Zetao <lizetao1@huawei.com>
Link: https://lore.kernel.org/r/20240907034046.3595268-1-lizetao1@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-07 07:40:49 -06:00
Linus Torvalds
b66f0b119c block-6.11-20240906
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmbbAicQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpqu1D/48quZBeoRXnXoNdAEUIQ36I9f33JzbwqnD
 bGW0++IwV7ZGrKSuy/IQuoosZhhM8ZyyQxha9JontLse0XlZKzXeN0hT3aovngmy
 GP9Nw+EXfcP7cDYjUxrGasbsAnl5p9Orojc1IQms43tAb6KCI4xVunYzkJ9rqcNq
 5Ik8FY46c28ivqlYFjj17NVDZE29AEgBxFa3jGYVusmnUpPkqysl4iQJdMeRAaIQ
 +UaDJst2ZVtDqqD20gvVL/p9xR2JnPoiv3evSXStkpDf7kp+WXC+51t9mcz6TX+b
 OwmnnrwkZuJ9jTzEsE39fDwWyfUr83UKR5jXKI/4SdAkq+GeTHZAyHuavSm5v4ym
 +pITxruaoEL3eVfC6wlXTrsnEXLn5b402Bv8tjT+Q8HBHXRuC4soCGN3sdbFZj5E
 ETS5TNkGbVcYv4mye75QDBkG0DBXXEc9zkM5PIEX8A/iLwNq10QzfAFHwyhydlvt
 mpOrf4M9IvMtHYejmouAmfd/00gHfPxboGLObtNjdZD65HGJP99PRtURj+Ng/gDd
 XbHSX8hSKLhQd8ReQojSasounm0dc+FFaMUFprS96g36G12zq2c3dAaLXTPZeOM3
 JQIu6707SqmxEFiwJ86QT/nvEO6JgBgQy48Y5r8i0++w4Qb4UbfLnAQXOJXdI3nm
 0uYdUCeYMQ==
 =BwIz
 -----END PGP SIGNATURE-----

Merge tag 'block-6.11-20240906' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:
 "Mostly just some fixlets for NVMe, but also a bug fix for the ublk
  driver and an integrity fix"

* tag 'block-6.11-20240906' of git://git.kernel.dk/linux:
  bio-integrity: don't restrict the size of integrity metadata
  ublk_drv: fix NULL pointer dereference in ublk_ctrl_start_recovery()
  nvmet: Identify-Active Namespace ID List command should reject invalid nsid
  nvme: set BLK_FEAT_ZONED for ZNS multipath disks
  nvme-pci: Add sleep quirk for Samsung 990 Evo
  nvme-pci: allocate tagset on reset if necessary
  nvmet-tcp: fix kernel crash if commands allocation fails
  nvme: use better description for async reset reason
  nvmet: Make nvmet_debugfs static
2024-09-06 12:04:06 -07:00
Sebastian Andrzej Siewior
68d20eb60e zram: Shrink zram_table_entry::flags.
The zram_table_entry::flags member is of type long and uses 8 bytes on a
64bit architecture. With a PAGE_SIZE of 256KiB we have PAGE_SHIFT of 18
which in turn leads to __NR_ZRAM_PAGEFLAGS = 27. This still fits in an
ordinary integer.
By reducing the size of `flags' to four bytes, the size of the struct
goes back to 16 bytes. The padding between the lock and ac_time (if
enabled) is also gone.

Make zram_table_entry::flags an unsigned int and update the build test
to reflect the change.

Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20240906141520.730009-4-bigeasy@linutronix.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-06 08:51:08 -06:00
Sebastian Andrzej Siewior
6086aeb49e zram: Remove ZRAM_LOCK
The ZRAM_LOCK was used for locking and after the addition of spinlock_t
the bit set and cleared but there no reader of it.

Remove the ZRAM_LOCK bit.

Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20240906141520.730009-3-bigeasy@linutronix.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-06 08:51:08 -06:00
Mike Galbraith
9518e5bfaa zram: Replace bit spinlocks with a spinlock_t.
The bit spinlock disables preemption. The spinlock_t lock becomes a sleeping
lock on PREEMPT_RT and it can not be acquired in this context. In this locked
section, zs_free() acquires a zs_pool::lock, and there is access to
zram::wb_limit_lock.

Add a spinlock_t for locking. Keep the set/ clear ZRAM_LOCK bit after
the lock has been acquired/ dropped. The size of struct zram_table_entry
increases by 4 bytes due to lock and additional 4 bytes padding with
CONFIG_ZRAM_TRACK_ENTRY_ACTIME enabled.

Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20240906141520.730009-2-bigeasy@linutronix.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-06 08:51:08 -06:00
Wouter Verhelst
296dbc72d2 nbd: correct the maximum value for discard sectors
The version of the NBD protocol implemented by the kernel driver
currently has a 32 bit field for length values. As the NBD protocol uses
bytes as a unit of length, length values larger than 2^32 bytes cannot
be expressed.

Update the max_hw_discard_sectors field to match that.

Signed-off-by: Wouter Verhelst <w@uter.be>
Fixes: 268283244c ("nbd: use the atomic queue limits API in nbd_set_size")
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Cc: Eric Blake <eblake@redhat.Com>
Link: https://lore.kernel.org/r/20240812133032.115134-8-w@uter.be
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-06 08:31:40 -06:00
Wouter Verhelst
41372f5c9a nbd: nbd_bg_flags_show: add NBD_FLAG_ROTATIONAL
Also handle NBD_FLAG_ROTATIONAL in our debug helper function

Signed-off-by: Wouter Verhelst <w@uter.be>
Cc: Eric Blake <eblake@redhat.Com>
Link: https://lore.kernel.org/r/20240812133032.115134-6-w@uter.be
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-06 08:31:40 -06:00
Wouter Verhelst
e49dacc71e nbd: implement the WRITE_ZEROES command
The NBD protocol defines a message for zeroing out a region of an export

Add support to the kernel driver for that message.

Signed-off-by: Wouter Verhelst <w@uter.be>
Cc: Eric Blake <eblake@redhat.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20240812133032.115134-3-w@uter.be
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-06 08:31:40 -06:00
Li Nan
e58f5142f8 ublk_drv: fix NULL pointer dereference in ublk_ctrl_start_recovery()
When two UBLK_CMD_START_USER_RECOVERY commands are submitted, the
first one sets 'ubq->ubq_daemon' to NULL, and the second one triggers
WARN in ublk_queue_reinit() and subsequently a NULL pointer dereference
issue.

Fix it by adding the check in ublk_ctrl_start_recovery() and return
immediately in case of zero 'ub->nr_queues_ready'.

  BUG: kernel NULL pointer dereference, address: 0000000000000028
  RIP: 0010:ublk_ctrl_start_recovery.constprop.0+0x82/0x180
  Call Trace:
   <TASK>
   ? __die+0x20/0x70
   ? page_fault_oops+0x75/0x170
   ? exc_page_fault+0x64/0x140
   ? asm_exc_page_fault+0x22/0x30
   ? ublk_ctrl_start_recovery.constprop.0+0x82/0x180
   ublk_ctrl_uring_cmd+0x4f7/0x6c0
   ? pick_next_task_idle+0x26/0x40
   io_uring_cmd+0x9a/0x1b0
   io_issue_sqe+0x193/0x3f0
   io_wq_submit_work+0x9b/0x390
   io_worker_handle_work+0x165/0x360
   io_wq_worker+0xcb/0x2f0
   ? finish_task_switch.isra.0+0x203/0x290
   ? finish_task_switch.isra.0+0x203/0x290
   ? __pfx_io_wq_worker+0x10/0x10
   ret_from_fork+0x2d/0x50
   ? __pfx_io_wq_worker+0x10/0x10
   ret_from_fork_asm+0x1a/0x30
   </TASK>

Fixes: c732a852b4 ("ublk_drv: add START_USER_RECOVERY and END_USER_RECOVERY support")
Reported-and-tested-by: Changhui Zhong <czhong@redhat.com>
Closes: https://lore.kernel.org/all/CAGVVp+UvLiS+bhNXV-h2icwX1dyybbYHeQUuH7RYqUvMQf6N3w@mail.gmail.com
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Li Nan <linan122@huawei.com>
Link: https://lore.kernel.org/r/20240904031348.4139545-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-04 07:15:38 -06:00
Matthew Wilcox (Oracle)
04cb7502a5 zsmalloc: use all available 24 bits of page_type
Now that we have an extra 8 bits, we don't need to limit ourselves to
supporting a 64KiB page size.  I'm sure both Hexagon users are grateful,
but it does reduce complexity a little.  We can also remove
reset_first_obj_offset() as calling __ClearPageZsmalloc() will now reset
all 32 bits of page_type.

Link: https://lkml.kernel.org/r/20240821173914.2270383-5-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-03 21:15:43 -07:00
Ming Lei
c9ea57c91f nbd: fix race between timeout and normal completion
If request timetout is handled by nbd_requeue_cmd(), normal completion
has to be stopped for avoiding to complete this requeued request, other
use-after-free can be triggered.

Fix the race by clearing NBD_CMD_INFLIGHT in nbd_requeue_cmd(), meantime
make sure that cmd->lock is grabbed for clearing the flag and the
requeue.

Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Yu Kuai <yukuai3@huawei.com>
Fixes: 2895f1831e ("nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240830034145.1827742-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-30 14:46:59 -06:00
Md Haris Iqbal
f6f84be089 block/rnbd-srv: Add sanity check and remove redundant assignment
The bio->bi_iter.bi_size is updated when bio_add_page() is called. So we
do not need to assign msg->bi_size again to it, since its redudant and
can also be harmful. Instead we can use it to add a sanity check, which
checks the locally calculated bi_size, with the one sent in msg.

Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Link: https://lore.kernel.org/r/20240809135346.978320-1-haris.iqbal@ionos.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-28 08:50:35 -06:00
Yang Ruibin
752a59298e pktcdvd: remove unnecessary debugfs_create_dir() error check
Remove the debugfs_create_dir() error check. It's safe to pass in error
pointers to the debugfs API, hence the user isn't supposed to include
error checking of the return values.

Signed-off-by: Yang Ruibin <11162571@vivo.com>
Link: https://lore.kernel.org/r/20240827022741.3410294-1-11162571@vivo.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-27 09:18:08 -06:00
Christophe JAILLET
87599eddc2 drbd: Remove an unused field in struct drbd_device
'next_barrier_nr' is not used in this driver. Remove it.

It was already part of the original commit b411b3637f ("The DRBD driver")
Apparently, it has never been used.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Link: https://lore.kernel.org/r/d5322ef88d1d6f544963ee277cb0b427da8dceef.1724602922.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-26 07:15:53 -06:00
Ming Lei
9327b51c9a ublk: move zone report data out of request pdu
ublk zoned takes 16 bytes in each request pdu just for handling REPORT_ZONE
operation, this way does waste memory since request pdu is allocated
statically.

Store the transient zone report data into one global xarray, and remove
it after the report zone request is completed. This way is reasonable
since report zone is run in slow code path.

Fixes: 29802d7ca3 ("ublk: enable zoned storage support")
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Andreas Hindborg <a.hindborg@samsung.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20240812013624.587587-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-12 10:29:45 -06:00
YueHaibing
f48ada402d drbd: Remove unused extern declarations
Commit b411b3637f ("The DRBD driver") declared but never implemented
drbd_read_remote(), is_valid_ar_handle() and drbd_set_recv_tcq().
And commit 668700b40a ("drbd: Create a dedicated workqueue for sending acks on the control connection")
never implemented drbd_send_ping_wf().

Commit 2451fc3b2b ("drbd: Removed the BIO_RW_BARRIER support form the receiver/epoch code")
leave w_e_reissue() declaration unused.

Commit 8fe605513a ("drbd: Rename drbdd_init() -> drbd_receiver()")
rename drbdd_init() and leave unsued declaration. Also drbd_asender() is removed in
commit 1c03e52083 ("drbd: Rename asender to ack_receiver").

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Link: https://lore.kernel.org/r/20240802095147.2788218-1-yuehaibing@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-02 13:36:19 -06:00
Wouter Verhelst
7543ae2269 nbd: add support for rotational devices
The NBD protocol defines the flag NBD_FLAG_ROTATIONAL to flag that the
export in use should be treated as a rotational device.

Add support for that flag to the kernel driver.

Signed-off-by: Wouter Verhelst <w@uter.be>
Reviewed-by: Eric Blake <eblake@redhat.com>
Link: https://lore.kernel.org/r/20240725164536.1275851-1-w@uter.be
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-28 16:47:52 -06:00
Ofir Gal
7960af373a drbd: use sendpages_ok() instead of sendpage_ok()
Currently _drbd_send_page() use sendpage_ok() in order to enable
MSG_SPLICE_PAGES, it check the first page of the iterator, the iterator
may represent contiguous pages.

MSG_SPLICE_PAGES enables skb_splice_from_iter() which checks all the
pages it sends with sendpage_ok().

When _drbd_send_page() sends an iterator that the first page is
sendable, but one of the other pages isn't skb_splice_from_iter() warns
and aborts the data transfer.

Using the new helper sendpages_ok() in order to enable MSG_SPLICE_PAGES
solves the issue.

Acked-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Signed-off-by: Ofir Gal <ofir.gal@volumez.com>
Link: https://lore.kernel.org/r/20240718084515.3833733-4-ofir.gal@volumez.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-28 16:47:52 -06:00
Linus Torvalds
6342649c33 block-6.11-20240726
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmajsToQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgptHKD/sHZcgf+bt/AOC1XebvCyIYYwWvNaMK2kQG
 7Z3kUknbuyML6QcAWSAhG5Z+6qGoIZPFXy1c7EUjs2LLM0KC13zw2D23ipTYXjyQ
 6cJPtgTY6Z4WrWyI7QVRG/cByYiUUNv20Cantexzv0H28/S1a5ne9ajwzkuJ5ZaM
 9EBIpbb9OtR+HKwzaw8oMng5XwXvxuo+DCX/MnJEHWnVYpRyQbQO7yCWb74zYKs0
 cwQp2FyACbGQ6AAt8c8loXYSObhGh6ofdAJoGwC1REZhgS1bxMHFgtnn5Gfq7EyJ
 MfPLFevZ99pxHgUkd9WR2DObOVAhY6kTOfuT2d26W50WP4PH40iAp5vapbEs1VDi
 /Rjr43PgVlzdGc4+u4O5B+E16mAIOmApBP6p6tgbypGU3+AFLhgKEdX3Tc3GgUqC
 zazucEO96ZCMVEjPhSrLLcWwBjlrHwINIgqAVq878D7b+3Qs7RSF6VLr6HP5y2nL
 B0CeSz9PY2xOjRgVqPrFeQLuYk9fRo6zAM2Ewd0TOgSg17z8XXznk0E2ME1ULvZW
 6Cuz5Pi59l2aqrIhm9gvVmMr+KEAu4YCiYOWI5yhNcz7EalY9SmNTovbBdnpaOTa
 74BrL+6lCviLsP4B29iNsqNai5GhAHYS/obJUITuaTAoZpvn9JjWlVBadZd7unbo
 Wn+53IKuaA==
 =Ci8Z
 -----END PGP SIGNATURE-----

Merge tag 'block-6.11-20240726' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:

 - NVMe pull request via Keith:
     - Fix request without payloads cleanup  (Leon)
     - Use new protection information format (Francis)
     - Improved debug message for lost pci link (Bart)
     - Another apst quirk (Wang)
     - Use appropriate sysfs api for printing chars (Markus)

 - ublk async device deletion fix (Ming)

 - drbd kerneldoc fixups (Simon)

 - Fix deadlock between sd removal and release (Yang)

* tag 'block-6.11-20240726' of git://git.kernel.dk/linux:
  nvme-pci: add missing condition check for existence of mapped data
  ublk: fix UBLK_CMD_DEL_DEV_ASYNC handling
  block: fix deadlock between sd_remove & sd_release
  drbd: Add peer_device to Kernel doc
  nvme-core: choose PIF from QPIF if QPIFS supports and PIF is QTYPE
  nvme-pci: Fix the instructions for disabling power management
  nvme: remove redundant bdev local variable
  nvme-fabrics: Use seq_putc() in __nvmf_concat_opt_tokens()
  nvme/pci: Add APST quirk for Lenovo N60z laptop
2024-07-27 15:28:53 -07:00
Linus Torvalds
6467dfdfc9 A small patchset to address bogus I/O errors and ultimately an
assertion failure in the face of watch errors with -o exclusive
 mappings in RBD marked for stable and some assorted CephFS fixes.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmajtIUTHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzi1fwB/4/CsFZLQC+ybWSqoZVYD01qND0muol
 44Nr6NyKdW302jQhAXchecB6s6L+N0azhVlAKKB8sO9XCifKA8RzuW75Y0+8z78B
 pgB6K7ZOzAIuPIG9mmNbutUHEd24CzGXNA28lEQkrNT8D6UZTENXQRJb1dS2GzQ7
 T2PyjFFoyF0z1bDZ85URHxeyMetEe/TzWUlG1P2QI98V+RP8nK+mGYmdXNGKhH87
 Ltf2pxjsiomtoH4QRm4LX7vwOUY1Ljf4HpSS1p+c5Fova2UTtTDbVfTFbh+ZjuQV
 hlh1ypNLM+igifu3nVeJ/Ga2f71zVFM66tnmpjcY3DxZAp70e1W2HMFD
 =Zfcy
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-6.11-rc1' of https://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:
 "A small patchset to address bogus I/O errors and ultimately an
  assertion failure in the face of watch errors with -o exclusive
  mappings in RBD marked for stable and some assorted CephFS fixes"

* tag 'ceph-for-6.11-rc1' of https://github.com/ceph/ceph-client:
  rbd: don't assume rbd_is_lock_owner() for exclusive mappings
  rbd: don't assume RBD_LOCK_STATE_LOCKED for exclusive mappings
  rbd: rename RBD_LOCK_STATE_RELEASING and releasing_wait
  ceph: fix incorrect kmalloc size of pagevec mempool
  ceph: periodically flush the cap releases
  ceph: convert comma to semicolon in __ceph_dentry_dir_lease_touch()
  ceph: use cap_wait_list only if debugfs is enabled
2024-07-26 10:34:42 -07:00
Ilya Dryomov
3ceccb14f5 rbd: don't assume rbd_is_lock_owner() for exclusive mappings
Expanding on the previous commit, assuming that rbd_is_lock_owner()
always returns true (i.e. that we are either in RBD_LOCK_STATE_LOCKED
or RBD_LOCK_STATE_QUIESCING) if the mapping is exclusive is wrong too.
In case ceph_cls_set_cookie() fails, the lock would be temporarily
released even if the mapping is exclusive, meaning that we can end up
even in RBD_LOCK_STATE_UNLOCKED.

IOW, exclusive mappings are really "just" about disabling automatic
lock transitions (as documented in the man page), not about grabbing
the lock and holding on to it whatever it takes.

Cc: stable@vger.kernel.org
Fixes: 637cd06053 ("rbd: new exclusive lock wait/wake code")
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
2024-07-25 12:18:29 +02:00
Ilya Dryomov
2237ceb71f rbd: don't assume RBD_LOCK_STATE_LOCKED for exclusive mappings
Every time a watch is reestablished after getting lost, we need to
update the cookie which involves quiescing exclusive lock.  For this,
we transition from RBD_LOCK_STATE_LOCKED to RBD_LOCK_STATE_QUIESCING
roughly for the duration of rbd_reacquire_lock() call.  If the mapping
is exclusive and I/O happens to arrive in this time window, it's failed
with EROFS (later translated to EIO) based on the wrong assumption in
rbd_img_exclusive_lock() -- "lock got released?" check there stopped
making sense with commit a2b1da0979 ("rbd: lock should be quiesced on
reacquire").

To make it worse, any such I/O is added to the acquiring list before
EROFS is returned and this sets up for violating rbd_lock_del_request()
precondition that the request is either on the running list or not on
any list at all -- see commit ded080c86b ("rbd: don't move requests
to the running list on errors").  rbd_lock_del_request() ends up
processing these requests as if they were on the running list which
screws up quiescing_wait completion counter and ultimately leads to

    rbd_assert(!completion_done(&rbd_dev->quiescing_wait));

being triggered on the next watch error.

Cc: stable@vger.kernel.org # 06ef84c4e9c4: rbd: rename RBD_LOCK_STATE_RELEASING and releasing_wait
Cc: stable@vger.kernel.org
Fixes: 637cd06053 ("rbd: new exclusive lock wait/wake code")
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
2024-07-25 12:18:28 +02:00
Ilya Dryomov
f5c466a0fd rbd: rename RBD_LOCK_STATE_RELEASING and releasing_wait
... to RBD_LOCK_STATE_QUIESCING and quiescing_wait to recognize that
this state and the associated completion are backing rbd_quiesce_lock(),
which isn't specific to releasing the lock.

While exclusive lock does get quiesced before it's released, it also
gets quiesced before an attempt to update the cookie is made and there
the lock is not released as long as ceph_cls_set_cookie() succeeds.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
2024-07-25 12:18:01 +02:00
Ming Lei
55fbb9a5d6 ublk: fix UBLK_CMD_DEL_DEV_ASYNC handling
In ublk_ctrl_uring_cmd(), ioctl command NR should be used for
matching _IOC_NR(cmd_op).

Fix it by adding one private macro, and this way is clean.

Fixes: 13fe8e6825 ("ublk: add UBLK_CMD_DEL_DEV_ASYNC")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20240724143311.2646330-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-24 09:51:46 -06:00
Simon Horman
b8a4518b5c drbd: Add peer_device to Kernel doc
Add missing documentation of peer_device parameter to Kernel doc.

These parameters were added in commit 8164dd6c8a ("drbd: Add peer
device parameter to whole-bitmap I/O handlers")

Flagged by W=1 builds.

Signed-off-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240723-drbd-doc-v1-1-a04d9b7a9688@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-23 07:31:38 -06:00
Linus Torvalds
fbc90c042c - 875fa64577da ("mm/hugetlb_vmemmap: fix race with speculative PFN
walkers") is known to cause a performance regression
   (https://lore.kernel.org/all/3acefad9-96e5-4681-8014-827d6be71c7a@linux.ibm.com/T/#mfa809800a7862fb5bdf834c6f71a3a5113eb83ff).
   Yu has a fix which I'll send along later via the hotfixes branch.
 
 - In the series "mm: Avoid possible overflows in dirty throttling" Jan
   Kara addresses a couple of issues in the writeback throttling code.
   These fixes are also targetted at -stable kernels.
 
 - Ryusuke Konishi's series "nilfs2: fix potential issues related to
   reserved inodes" does that.  This should actually be in the
   mm-nonmm-stable tree, along with the many other nilfs2 patches.  My bad.
 
 - More folio conversions from Kefeng Wang in the series "mm: convert to
   folio_alloc_mpol()"
 
 - Kemeng Shi has sent some cleanups to the writeback code in the series
   "Add helper functions to remove repeated code and improve readability of
   cgroup writeback"
 
 - Kairui Song has made the swap code a little smaller and a little
   faster in the series "mm/swap: clean up and optimize swap cache index".
 
 - In the series "mm/memory: cleanly support zeropage in
   vm_insert_page*(), vm_map_pages*() and vmf_insert_mixed()" David
   Hildenbrand has reworked the rather sketchy handling of the use of the
   zeropage in MAP_SHARED mappings.  I don't see any runtime effects here -
   more a cleanup/understandability/maintainablity thing.
 
 - Dev Jain has improved selftests/mm/va_high_addr_switch.c's handling of
   higher addresses, for aarch64.  The (poorly named) series is
   "Restructure va_high_addr_switch".
 
 - The core TLB handling code gets some cleanups and possible slight
   optimizations in Bang Li's series "Add update_mmu_tlb_range() to
   simplify code".
 
 - Jane Chu has improved the handling of our
   fake-an-unrecoverable-memory-error testing feature MADV_HWPOISON in the
   series "Enhance soft hwpoison handling and injection".
 
 - Jeff Johnson has sent a billion patches everywhere to add
   MODULE_DESCRIPTION() to everything.  Some landed in this pull.
 
 - In the series "mm: cleanup MIGRATE_SYNC_NO_COPY mode", Kefeng Wang has
   simplified migration's use of hardware-offload memory copying.
 
 - Yosry Ahmed performs more folio API conversions in his series "mm:
   zswap: trivial folio conversions".
 
 - In the series "large folios swap-in: handle refault cases first",
   Chuanhua Han inches us forward in the handling of large pages in the
   swap code.  This is a cleanup and optimization, working toward the end
   objective of full support of large folio swapin/out.
 
 - In the series "mm,swap: cleanup VMA based swap readahead window
   calculation", Huang Ying has contributed some cleanups and a possible
   fixlet to his VMA based swap readahead code.
 
 - In the series "add mTHP support for anonymous shmem" Baolin Wang has
   taught anonymous shmem mappings to use multisize THP.  By default this
   is a no-op - users must opt in vis sysfs controls.  Dramatic
   improvements in pagefault latency are realized.
 
 - David Hildenbrand has some cleanups to our remaining use of
   page_mapcount() in the series "fs/proc: move page_mapcount() to
   fs/proc/internal.h".
 
 - David also has some highmem accounting cleanups in the series
   "mm/highmem: don't track highmem pages manually".
 
 - Build-time fixes and cleanups from John Hubbard in the series
   "cleanups, fixes, and progress towards avoiding "make headers"".
 
 - Cleanups and consolidation of the core pagemap handling from Barry
   Song in the series "mm: introduce pmd|pte_needs_soft_dirty_wp helpers
   and utilize them".
 
 - Lance Yang's series "Reclaim lazyfree THP without splitting" has
   reduced the latency of the reclaim of pmd-mapped THPs under fairly
   common circumstances.  A 10x speedup is seen in a microbenchmark.
 
   It does this by punting to aother CPU but I guess that's a win unless
   all CPUs are pegged.
 
 - hugetlb_cgroup cleanups from Xiu Jianfeng in the series
   "mm/hugetlb_cgroup: rework on cftypes".
 
 - Miaohe Lin's series "Some cleanups for memory-failure" does just that
   thing.
 
 - Is anyone reading this stuff?  If so, email me!
 
 - Someone other than SeongJae has developed a DAMON feature in Honggyu
   Kim's series "DAMON based tiered memory management for CXL memory".
   This adds DAMON features which may be used to help determine the
   efficiency of our placement of CXL/PCIe attached DRAM.
 
 - DAMON user API centralization and simplificatio work in SeongJae
   Park's series "mm/damon: introduce DAMON parameters online commit
   function".
 
 - In the series "mm: page_type, zsmalloc and page_mapcount_reset()"
   David Hildenbrand does some maintenance work on zsmalloc - partially
   modernizing its use of pageframe fields.
 
 - Kefeng Wang provides more folio conversions in the series "mm: remove
   page_maybe_dma_pinned() and page_mkclean()".
 
 - More cleanup from David Hildenbrand, this time in the series
   "mm/memory_hotplug: use PageOffline() instead of PageReserved() for
   !ZONE_DEVICE".  It "enlightens memory hotplug more about PageOffline()
   pages" and permits the removal of some virtio-mem hacks.
 
 - Barry Song's series "mm: clarify folio_add_new_anon_rmap() and
   __folio_add_anon_rmap()" is a cleanup to the anon folio handling in
   preparation for mTHP (multisize THP) swapin.
 
 - Kefeng Wang's series "mm: improve clear and copy user folio"
   implements more folio conversions, this time in the area of large folio
   userspace copying.
 
 - The series "Docs/mm/damon/maintaier-profile: document a mailing tool
   and community meetup series" tells people how to get better involved
   with other DAMON developers.  From SeongJae Park.
 
 - A large series ("kmsan: Enable on s390") from Ilya Leoshkevich does
   that.
 
 - David Hildenbrand sends along more cleanups, this time against the
   migration code.  The series is "mm/migrate: move NUMA hinting fault
   folio isolation + checks under PTL".
 
 - Jan Kara has found quite a lot of strangenesses and minor errors in
   the readahead code.  He addresses this in the series "mm: Fix various
   readahead quirks".
 
 - SeongJae Park's series "selftests/damon: test DAMOS tried regions and
   {min,max}_nr_regions" adds features and addresses errors in DAMON's self
   testing code.
 
 - Gavin Shan has found a userspace-triggerable WARN in the pagecache
   code.  The series "mm/filemap: Limit page cache size to that supported
   by xarray" addresses this.  The series is marked cc:stable.
 
 - Chengming Zhou's series "mm/ksm: cmp_and_merge_page() optimizations
   and cleanup" cleans up and slightly optimizes KSM.
 
 - Roman Gushchin has separated the memcg-v1 and memcg-v2 code - lots of
   code motion.  The series (which also makes the memcg-v1 code
   Kconfigurable) are
 
   "mm: memcg: separate legacy cgroup v1 code and put under config
   option" and
   "mm: memcg: put cgroup v1-specific memcg data under CONFIG_MEMCG_V1"
 
 - Dan Schatzberg's series "Add swappiness argument to memory.reclaim"
   adds an additional feature to this cgroup-v2 control file.
 
 - The series "Userspace controls soft-offline pages" from Jiaqi Yan
   permits userspace to stop the kernel's automatic treatment of excessive
   correctable memory errors.  In order to permit userspace to monitor and
   handle this situation.
 
 - Kefeng Wang's series "mm: migrate: support poison recover from migrate
   folio" teaches the kernel to appropriately handle migration from
   poisoned source folios rather than simply panicing.
 
 - SeongJae Park's series "Docs/damon: minor fixups and improvements"
   does those things.
 
 - In the series "mm/zsmalloc: change back to per-size_class lock"
   Chengming Zhou improves zsmalloc's scalability and memory utilization.
 
 - Vivek Kasireddy's series "mm/gup: Introduce memfd_pin_folios() for
   pinning memfd folios" makes the GUP code use FOLL_PIN rather than bare
   refcount increments.  So these paes can first be moved aside if they
   reside in the movable zone or a CMA block.
 
 - Andrii Nakryiko has added a binary ioctl()-based API to /proc/pid/maps
   for much faster reading of vma information.  The series is "query VMAs
   from /proc/<pid>/maps".
 
 - In the series "mm: introduce per-order mTHP split counters" Lance Yang
   improves the kernel's presentation of developer information related to
   multisize THP splitting.
 
 - Michael Ellerman has developed the series "Reimplement huge pages
   without hugepd on powerpc (8xx, e500, book3s/64)".  This permits
   userspace to use all available huge page sizes.
 
 - In the series "revert unconditional slab and page allocator fault
   injection calls" Vlastimil Babka removes a performance-affecting and not
   very useful feature from slab fault injection.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZp2C+QAKCRDdBJ7gKXxA
 joTkAQDvjqOoFStqk4GU3OXMYB7WCU/ZQMFG0iuu1EEwTVDZ4QEA8CnG7seek1R3
 xEoo+vw0sWWeLV3qzsxnCA1BJ8cTJA8=
 =z0Lf
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2024-07-21-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - In the series "mm: Avoid possible overflows in dirty throttling" Jan
   Kara addresses a couple of issues in the writeback throttling code.
   These fixes are also targetted at -stable kernels.

 - Ryusuke Konishi's series "nilfs2: fix potential issues related to
   reserved inodes" does that. This should actually be in the
   mm-nonmm-stable tree, along with the many other nilfs2 patches. My
   bad.

 - More folio conversions from Kefeng Wang in the series "mm: convert to
   folio_alloc_mpol()"

 - Kemeng Shi has sent some cleanups to the writeback code in the series
   "Add helper functions to remove repeated code and improve readability
   of cgroup writeback"

 - Kairui Song has made the swap code a little smaller and a little
   faster in the series "mm/swap: clean up and optimize swap cache
   index".

 - In the series "mm/memory: cleanly support zeropage in
   vm_insert_page*(), vm_map_pages*() and vmf_insert_mixed()" David
   Hildenbrand has reworked the rather sketchy handling of the use of
   the zeropage in MAP_SHARED mappings. I don't see any runtime effects
   here - more a cleanup/understandability/maintainablity thing.

 - Dev Jain has improved selftests/mm/va_high_addr_switch.c's handling
   of higher addresses, for aarch64. The (poorly named) series is
   "Restructure va_high_addr_switch".

 - The core TLB handling code gets some cleanups and possible slight
   optimizations in Bang Li's series "Add update_mmu_tlb_range() to
   simplify code".

 - Jane Chu has improved the handling of our
   fake-an-unrecoverable-memory-error testing feature MADV_HWPOISON in
   the series "Enhance soft hwpoison handling and injection".

 - Jeff Johnson has sent a billion patches everywhere to add
   MODULE_DESCRIPTION() to everything. Some landed in this pull.

 - In the series "mm: cleanup MIGRATE_SYNC_NO_COPY mode", Kefeng Wang
   has simplified migration's use of hardware-offload memory copying.

 - Yosry Ahmed performs more folio API conversions in his series "mm:
   zswap: trivial folio conversions".

 - In the series "large folios swap-in: handle refault cases first",
   Chuanhua Han inches us forward in the handling of large pages in the
   swap code. This is a cleanup and optimization, working toward the end
   objective of full support of large folio swapin/out.

 - In the series "mm,swap: cleanup VMA based swap readahead window
   calculation", Huang Ying has contributed some cleanups and a possible
   fixlet to his VMA based swap readahead code.

 - In the series "add mTHP support for anonymous shmem" Baolin Wang has
   taught anonymous shmem mappings to use multisize THP. By default this
   is a no-op - users must opt in vis sysfs controls. Dramatic
   improvements in pagefault latency are realized.

 - David Hildenbrand has some cleanups to our remaining use of
   page_mapcount() in the series "fs/proc: move page_mapcount() to
   fs/proc/internal.h".

 - David also has some highmem accounting cleanups in the series
   "mm/highmem: don't track highmem pages manually".

 - Build-time fixes and cleanups from John Hubbard in the series
   "cleanups, fixes, and progress towards avoiding "make headers"".

 - Cleanups and consolidation of the core pagemap handling from Barry
   Song in the series "mm: introduce pmd|pte_needs_soft_dirty_wp helpers
   and utilize them".

 - Lance Yang's series "Reclaim lazyfree THP without splitting" has
   reduced the latency of the reclaim of pmd-mapped THPs under fairly
   common circumstances. A 10x speedup is seen in a microbenchmark.

   It does this by punting to aother CPU but I guess that's a win unless
   all CPUs are pegged.

 - hugetlb_cgroup cleanups from Xiu Jianfeng in the series
   "mm/hugetlb_cgroup: rework on cftypes".

 - Miaohe Lin's series "Some cleanups for memory-failure" does just that
   thing.

 - Someone other than SeongJae has developed a DAMON feature in Honggyu
   Kim's series "DAMON based tiered memory management for CXL memory".
   This adds DAMON features which may be used to help determine the
   efficiency of our placement of CXL/PCIe attached DRAM.

 - DAMON user API centralization and simplificatio work in SeongJae
   Park's series "mm/damon: introduce DAMON parameters online commit
   function".

 - In the series "mm: page_type, zsmalloc and page_mapcount_reset()"
   David Hildenbrand does some maintenance work on zsmalloc - partially
   modernizing its use of pageframe fields.

 - Kefeng Wang provides more folio conversions in the series "mm: remove
   page_maybe_dma_pinned() and page_mkclean()".

 - More cleanup from David Hildenbrand, this time in the series
   "mm/memory_hotplug: use PageOffline() instead of PageReserved() for
   !ZONE_DEVICE". It "enlightens memory hotplug more about PageOffline()
   pages" and permits the removal of some virtio-mem hacks.

 - Barry Song's series "mm: clarify folio_add_new_anon_rmap() and
   __folio_add_anon_rmap()" is a cleanup to the anon folio handling in
   preparation for mTHP (multisize THP) swapin.

 - Kefeng Wang's series "mm: improve clear and copy user folio"
   implements more folio conversions, this time in the area of large
   folio userspace copying.

 - The series "Docs/mm/damon/maintaier-profile: document a mailing tool
   and community meetup series" tells people how to get better involved
   with other DAMON developers. From SeongJae Park.

 - A large series ("kmsan: Enable on s390") from Ilya Leoshkevich does
   that.

 - David Hildenbrand sends along more cleanups, this time against the
   migration code. The series is "mm/migrate: move NUMA hinting fault
   folio isolation + checks under PTL".

 - Jan Kara has found quite a lot of strangenesses and minor errors in
   the readahead code. He addresses this in the series "mm: Fix various
   readahead quirks".

 - SeongJae Park's series "selftests/damon: test DAMOS tried regions and
   {min,max}_nr_regions" adds features and addresses errors in DAMON's
   self testing code.

 - Gavin Shan has found a userspace-triggerable WARN in the pagecache
   code. The series "mm/filemap: Limit page cache size to that supported
   by xarray" addresses this. The series is marked cc:stable.

 - Chengming Zhou's series "mm/ksm: cmp_and_merge_page() optimizations
   and cleanup" cleans up and slightly optimizes KSM.

 - Roman Gushchin has separated the memcg-v1 and memcg-v2 code - lots of
   code motion. The series (which also makes the memcg-v1 code
   Kconfigurable) are "mm: memcg: separate legacy cgroup v1 code and put
   under config option" and "mm: memcg: put cgroup v1-specific memcg
   data under CONFIG_MEMCG_V1"

 - Dan Schatzberg's series "Add swappiness argument to memory.reclaim"
   adds an additional feature to this cgroup-v2 control file.

 - The series "Userspace controls soft-offline pages" from Jiaqi Yan
   permits userspace to stop the kernel's automatic treatment of
   excessive correctable memory errors. In order to permit userspace to
   monitor and handle this situation.

 - Kefeng Wang's series "mm: migrate: support poison recover from
   migrate folio" teaches the kernel to appropriately handle migration
   from poisoned source folios rather than simply panicing.

 - SeongJae Park's series "Docs/damon: minor fixups and improvements"
   does those things.

 - In the series "mm/zsmalloc: change back to per-size_class lock"
   Chengming Zhou improves zsmalloc's scalability and memory
   utilization.

 - Vivek Kasireddy's series "mm/gup: Introduce memfd_pin_folios() for
   pinning memfd folios" makes the GUP code use FOLL_PIN rather than
   bare refcount increments. So these paes can first be moved aside if
   they reside in the movable zone or a CMA block.

 - Andrii Nakryiko has added a binary ioctl()-based API to
   /proc/pid/maps for much faster reading of vma information. The series
   is "query VMAs from /proc/<pid>/maps".

 - In the series "mm: introduce per-order mTHP split counters" Lance
   Yang improves the kernel's presentation of developer information
   related to multisize THP splitting.

 - Michael Ellerman has developed the series "Reimplement huge pages
   without hugepd on powerpc (8xx, e500, book3s/64)". This permits
   userspace to use all available huge page sizes.

 - In the series "revert unconditional slab and page allocator fault
   injection calls" Vlastimil Babka removes a performance-affecting and
   not very useful feature from slab fault injection.

* tag 'mm-stable-2024-07-21-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (411 commits)
  mm/mglru: fix ineffective protection calculation
  mm/zswap: fix a white space issue
  mm/hugetlb: fix kernel NULL pointer dereference when migrating hugetlb folio
  mm/hugetlb: fix possible recursive locking detected warning
  mm/gup: clear the LRU flag of a page before adding to LRU batch
  mm/numa_balancing: teach mpol_to_str about the balancing mode
  mm: memcg1: convert charge move flags to unsigned long long
  alloc_tag: fix page_ext_get/page_ext_put sequence during page splitting
  lib: reuse page_ext_data() to obtain codetag_ref
  lib: add missing newline character in the warning message
  mm/mglru: fix overshooting shrinker memory
  mm/mglru: fix div-by-zero in vmpressure_calc_level()
  mm/kmemleak: replace strncpy() with strscpy()
  mm, page_alloc: put should_fail_alloc_page() back behing CONFIG_FAIL_PAGE_ALLOC
  mm, slab: put should_failslab() back behind CONFIG_SHOULD_FAILSLAB
  mm: ignore data-race in __swap_writepage
  hugetlbfs: ensure generic_hugetlb_get_unmapped_area() returns higher address than mmap_min_addr
  mm: shmem: rename mTHP shmem counters
  mm: swap_state: use folio_alloc_mpol() in __read_swap_cache_async()
  mm/migrate: putback split folios when numa hint migration fails
  ...
2024-07-21 17:15:46 -07:00
Linus Torvalds
f4f92db439 virtio: features, fixes, cleanups
Several new features here:
 
 - Virtio find vqs API has been reworked
   (required to fix the scalability issue we have with
    adminq, which I hope to merge later in the cycle)
 
 - vDPA driver for Marvell OCTEON
 
 - virtio fs performance improvement
 
 - mlx5 migration speedups
 
 Fixes, cleanups all over the place.
 
 Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQFDBAABCAAtFiEEXQn9CHHI+FuUyooNKB8NuNKNVGkFAmaXjQQPHG1zdEByZWRo
 YXQuY29tAAoJECgfDbjSjVRpnIsH/jVNqAQbe/vaBQdNMdnsA+P9A9unLbYRxYCQ
 tN73mQRIXKtnZHBRAEbMGq52HPYg8HlN2HJSgyNo6I6t8VD+PiOco7m+3GpmqEcW
 aXPOPl0BAbVoDgyutxRuuodP8Z61lBx0mG6iOxpzTXOPGlpQqtPCFHO8YnodqnPf
 tMix/5uAqgZKV2siCbw5DtzwEc0gDHU8qsD0/nyoS5nBDF9yh/ardr5P/qiyFDQH
 atCNYTOhIFU83pLAaw0fpCGbkt7gxf+5RpWVx3wkYww+/MwvYhsveRvQyaGbBz3n
 WDtET3SOtVTta98OAGIKCq/2z8f6mYXBP7vXapBgnJG3vwS/poQ=
 =LYua
 -----END PGP SIGNATURE-----

Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost

Pull virtio updates from Michael Tsirkin:
 "Several new features here:

   - Virtio find vqs API has been reworked (required to fix the
     scalability issue we have with adminq, which I hope to merge later
     in the cycle)

   - vDPA driver for Marvell OCTEON

   - virtio fs performance improvement

   - mlx5 migration speedups

  Fixes, cleanups all over the place"

* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (56 commits)
  virtio: rename virtio_find_vqs_info() to virtio_find_vqs()
  virtio: remove unused virtio_find_vqs() and virtio_find_vqs_ctx() helpers
  virtio: convert the rest virtio_find_vqs() users to virtio_find_vqs_info()
  virtio_balloon: convert to use virtio_find_vqs_info()
  virtiofs: convert to use virtio_find_vqs_info()
  scsi: virtio_scsi: convert to use virtio_find_vqs_info()
  virtio_net: convert to use virtio_find_vqs_info()
  virtio_crypto: convert to use virtio_find_vqs_info()
  virtio_console: convert to use virtio_find_vqs_info()
  virtio_blk: convert to use virtio_find_vqs_info()
  virtio: rename find_vqs_info() op to find_vqs()
  virtio: remove the original find_vqs() op
  virtio: call virtio_find_vqs_info() from virtio_find_single_vq() directly
  virtio: convert find_vqs() op implementations to find_vqs_info()
  virtio_pci: convert vp_*find_vqs() ops to find_vqs_info()
  virtio: introduce virtio_queue_info struct and find_vqs_info() config op
  virtio: make virtio_find_single_vq() call virtio_find_vqs()
  virtio: make virtio_find_vqs() call virtio_find_vqs_ctx()
  caif_virtio: use virtio_find_single_vq() for single virtqueue finding
  vdpa/mlx5: Don't enable non-active VQs in .set_vq_ready()
  ...
2024-07-19 11:57:55 -07:00
Jiri Pirko
6c85d6b653 virtio: rename virtio_find_vqs_info() to virtio_find_vqs()
Since the original virtio_find_vqs() is no longer present, rename
virtio_find_vqs_info() back to virtio_find_vqs().

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Message-Id: <20240708074814.1739223-20-jiri@resnulli.us>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-07-17 05:20:58 -04:00
Jiri Pirko
0c60458b18 virtio_blk: convert to use virtio_find_vqs_info()
Instead of passing separate names and callbacks arrays
to virtio_find_vqs(), allocate one of virtual_queue_info structs and
pass it to virtio_find_vqs_info().

Suggested-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Message-Id: <20240708074814.1739223-11-jiri@resnulli.us>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-07-17 05:20:57 -04:00
Linus Torvalds
3e78198862 for-6.11/block-20240710
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmaOTd8QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgppqIEACUr8Vv2FtezvT3OfVSlYWHHLXzkRhwEG5s
 vdk0o7Ow6U54sMjfymbHTgLD0ZOJf3uJ6BI95FQuW41jPzDFVbx4Hy8QzqonMkw9
 1D/YQ4zrVL2mOKBzATbKpoGJzMOzGeoXEueFZ1AYPAX7RrDtP4xPQNfrcfkdE2zF
 LycJN70Vp6lrZZMuI9yb9ts1tf7TFzK0HJANxOAKTgSiPmBmxesjkJlhrdUrgkAU
 qDVyjj7u/ssndBJAb9i6Bl95Do8s9t4DeJq5/6wgKqtf5hClMXzPVB8Wy084gr6E
 rTRsCEhOug3qEZSqfAgAxnd3XFRNc/p2KMUe5YZ4mAqux4hpSmIQQDM/5X5K9vEv
 f4MNqUGlqyqntZx+KPyFpf7kLHFYS1qK4ub0FojWJEY4GrbBPNjjncLJ9+ozR0c8
 kNDaFjMNAjalBee1FxNNH8LdVcd28rrCkPxRLEfO/gvBMUmvJf4ZyKmSED0v5DhY
 vZqKlBqG+wg0EXvdiWEHMDh9Y+q/2XBIkS6NN/Bhh61HNu+XzC838ts1X7lR+4o2
 AM5Vapw+v0q6kFBMRP3IcJI/c0UcIU8EQU7axMyzWtvhog8kx8x01hIj1L4UyYYr
 rUdWrkugBVXJbywFuH/QIJxWxS/z4JdSw5VjASJLIrXy+aANmmG9Wonv95eyhpUv
 5iv+EdRSNA==
 =wVi8
 -----END PGP SIGNATURE-----

Merge tag 'for-6.11/block-20240710' of git://git.kernel.dk/linux

Pull block updates from Jens Axboe:

 - NVMe updates via Keith:
     - Device initialization memory leak fixes (Keith)
     - More constants defined (Weiwen)
     - Target debugfs support (Hannes)
     - PCIe subsystem reset enhancements (Keith)
     - Queue-depth multipath policy (Redhat and PureStorage)
     - Implement get_unique_id (Christoph)
     - Authentication error fixes (Gaosheng)

 - MD updates via Song
     - sync_action fix and refactoring (Yu Kuai)
     - Various small fixes (Christoph Hellwig, Li Nan, and Ofir Gal, Yu
       Kuai, Benjamin Marzinski, Christophe JAILLET, Yang Li)

 - Fix loop detach/open race (Gulam)

 - Fix lower control limit for blk-throttle (Yu)

 - Add module descriptions to various drivers (Jeff)

 - Add support for atomic writes for block devices, and statx reporting
   for same. Includes SCSI and NVMe (John, Prasad, Alan)

 - Add IO priority information to block trace points (Dongliang)

 - Various zone improvements and tweaks (Damien)

 - mq-deadline tag reservation improvements (Bart)

 - Ignore direct reclaim swap writes in writeback throttling (Baokun)

 - Block integrity improvements and fixes (Anuj)

 - Add basic support for rust based block drivers. Has a dummy null_blk
   variant for now (Andreas)

 - Series converting driver settings to queue limits, and cleanups and
   fixes related to that (Christoph)

 - Cleanup for poking too deeply into the bvec internals, in preparation
   for DMA mapping API changes (Christoph)

 - Various minor tweaks and fixes (Jiapeng, John, Kanchan, Mikulas,
   Ming, Zhu, Damien, Christophe, Chaitanya)

* tag 'for-6.11/block-20240710' of git://git.kernel.dk/linux: (206 commits)
  floppy: add missing MODULE_DESCRIPTION() macro
  loop: add missing MODULE_DESCRIPTION() macro
  ublk_drv: add missing MODULE_DESCRIPTION() macro
  xen/blkback: add missing MODULE_DESCRIPTION() macro
  block/rnbd: Constify struct kobj_type
  block: take offset into account in blk_bvec_map_sg again
  block: fix get_max_segment_size() warning
  loop: Don't bother validating blocksize
  virtio_blk: Don't bother validating blocksize
  null_blk: Don't bother validating blocksize
  block: Validate logical block size in blk_validate_limits()
  virtio_blk: Fix default logical block size fallback
  nvmet-auth: fix nvmet_auth hash error handling
  nvme: implement ->get_unique_id
  block: pass a phys_addr_t to get_max_segment_size
  block: add a bvec_phys helper
  blk-lib: check for kill signal in ioctl BLKZEROOUT
  block: limit the Write Zeroes to manually writing zeroes fallback
  block: refacto blkdev_issue_zeroout
  block: move read-only and supported checks into (__)blkdev_issue_zeroout
  ...
2024-07-15 14:20:22 -07:00
Jeff Johnson
3c1743a685 floppy: add missing MODULE_DESCRIPTION() macro
make allmodconfig && make W=1 C=1 reports:
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/block/floppy.o

Add the missing invocation of the MODULE_DESCRIPTION() macro.

Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Reviewed-by: Denis Efremov <efremov@linux.com>
Link: https://lore.kernel.org/r/20240602-md-block-floppy-v1-1-bc628ea5eb84@quicinc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-10 00:22:03 -06:00
Jeff Johnson
7d4425d2c9 loop: add missing MODULE_DESCRIPTION() macro
make allmodconfig && make W=1 C=1 reports:
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/block/loop.o

Add the missing invocation of the MODULE_DESCRIPTION() macro.

Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Link: https://lore.kernel.org/r/20240602-md-block-loop-v1-1-b9b7e2603e72@quicinc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-10 00:21:50 -06:00
Jeff Johnson
c25a271c29 ublk_drv: add missing MODULE_DESCRIPTION() macro
make allmodconfig && make W=1 C=1 reports:
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/block/ublk_drv.o

Add the missing invocation of the MODULE_DESCRIPTION() macro.

Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20240602-md-block-ublk_drv-v1-1-995474cafff0@quicinc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-10 00:21:33 -06:00
Jeff Johnson
4c33e39f62 xen/blkback: add missing MODULE_DESCRIPTION() macro
make allmodconfig && make W=1 C=1 reports:
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/block/xen-blkback/xen-blkback.o

Add the missing invocation of the MODULE_DESCRIPTION() macro.

Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Link: https://lore.kernel.org/r/20240602-md-block-xen-blkback-v1-1-6ff5b58bdee1@quicinc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-10 00:21:18 -06:00
Christophe JAILLET
e4eaca5e30 block/rnbd: Constify struct kobj_type
'struct kobj_type' is not modified in this driver. It is only used with
kobject_init_and_add() which takes a "const struct kobj_type *" parameter.

Constifying this structure moves some data to a read-only section, so
increase overall security.

On a x86_64, with allmodconfig, as an example:
Before:
======
   text	   data	    bss	    dec	    hex	filename
   4082	    792	      8	   4882	   1312	drivers/block/rnbd/rnbd-srv-sysfs.o

After:
=====
   text	   data	    bss	    dec	    hex	filename
   4210	    672	      8	   4890	   131a	drivers/block/rnbd/rnbd-srv-sysfs.o

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/e3d454173ffad30726c9351810d3aa7b75122711.1720462252.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-09 11:52:03 -06:00
John Garry
9423c653fe loop: Don't bother validating blocksize
The block queue limits validation does this for us now.

The loop_configure() -> WARN_ON_ONCE() call is dropped, as an invalid
block size would trigger this now. We don't want userspace to be able to
directly trigger WARNs.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Link: https://lore.kernel.org/r/20240708091651.177447-6-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-09 00:00:17 -06:00
John Garry
af28172291 virtio_blk: Don't bother validating blocksize
The block queue limits validation does this for us now.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20240708091651.177447-5-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-09 00:00:17 -06:00
John Garry
addc3a68de null_blk: Don't bother validating blocksize
The block queue limits validation does this for us now.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Link: https://lore.kernel.org/r/20240708091651.177447-4-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-09 00:00:17 -06:00
John Garry
4ff3d01275 virtio_blk: Fix default logical block size fallback
If we fail to read a logical block size in virtblk_read_limits() ->
virtio_cread_feature(), then we default to what is in
lim->logical_block_size, but that would be 0.

We can deal with lim->logical_block_size = 0 later in the
blk_mq_alloc_disk(), but the code in virtblk_read_limits() needs a proper
default, so give a default of SECTOR_SIZE.

Fixes: 27e32cd23f ("block: pass a queue_limits argument to blk_mq_alloc_disk")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20240708091651.177447-2-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-09 00:00:17 -06:00
Damien Le Moal
f2a7bea237 block: Remove REQ_OP_ZONE_RESET_ALL emulation
Now that device mapper can handle resetting all zones of a mapped zoned
device using REQ_OP_ZONE_RESET_ALL, all zoned block device drivers
support this operation. With this, the request queue feature
BLK_FEAT_ZONE_RESETALL is not necessary and the emulation code in
blk-zone.c can be removed.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240704052816.623865-5-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-05 00:42:04 -06:00
Damien Le Moal
f4d5dc33c8 null_blk: Introduce the zone_full parameter
Allow creating a zoned null_blk device with the initial state of its
sequential write required zones to be FULL. This is convenient to avoid
having to first write these zones to perform read performance evaluation
or test zone management operations such as zone reset (and zone reset
all).

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240704052816.623865-2-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-05 00:42:04 -06:00
Christoph Hellwig
dd54fd4e17 loop: remove the unused inode variable in loop_configure
Remove the inode variable now that the last user is gone.

Fixes: a17ece76bc ("loop: regularize upgrading the block size for direct I/O")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240705053114.2042976-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-05 00:27:56 -06:00
Zhu Yanjun
a18df07b7d null_blk: don't initialize static 'g_virt_boundary' to false
No functional changes intended.

Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Link: https://lore.kernel.org/r/20240704010638.324349-1-yanjun.zhu@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-04 02:05:54 -06:00
David Hildenbrand
43d746dc49 mm/zsmalloc: use a proper page type
Let's clean it up: use a proper page type and store our data (offset into
a page) in the lower 16 bit as documented.

We won't be able to support 256 KiB base pages, which is acceptable. 
Teach Kconfig to handle that cleanly using a new CONFIG_HAVE_ZSMALLOC.

Based on this, we should do a proper "struct zsdesc" conversion, as
proposed in [1].

This removes the last _mapcount/page_type offender.

[1] https://lore.kernel.org/all/20231130101242.2590384-1-42.hyeyoo@gmail.com/

Link: https://lkml.kernel.org/r/20240529111904.2069608-4-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Tested-by: Sergey Senozhatsky <senozhatsky@chromium.org>	[zram/zsmalloc workloads]
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03 19:30:16 -07:00
Christoph Hellwig
98d34c0872 xen-blkfront: fix sector_size propagation to the block layer
Ensure that info->sector_size and info->physical_sector_size are set
before the call to blkif_set_queue_limits by doing away with the
local variables and arguments that propagate them.

Thanks to Marek Marczykowski-Górecki and Jürgen Groß for root causing
the issue.

Fixes: ba3f67c116 ("xen-blkfront: atomically update queue limits")
Reported-by: Rusty Bird <rustybird@net-c.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Link: https://lore.kernel.org/r/20240625055238.7934-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-02 08:58:12 -06:00
Damien Le Moal
1c0b3fca38 null_blk: Fix description of the fua parameter
The description of the fua module parameter is defined using
MODULE_PARM_DESC() with the first argument passed being "zoned". That is
the wrong name, obviously. Fix that by using the correct "fua" parameter
name so that "modinfo null_blk" displays correct information.

Fixes: f4f84586c8 ("null_blk: Introduce fua attribute")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20240702073234.206458-1-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-02 08:57:59 -06:00
Christoph Hellwig
caffa7cdce rnbd-cnt: don't set QUEUE_FLAG_SAME_FORCE
QUEUE_FLAG_SAME_FORCE has been set by rnbd-cnt since the initial
merge.  There is no good reason for a driver to force exact core
delivery, which is tunable for very specific workloads and not a
driver setting.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jack Wang <jinpu.wang@ionos.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20240627124926.512662-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-28 10:37:34 -06:00
Christoph Hellwig
40988f1590 rnbd: don't set QUEUE_FLAG_SAME_COMP
QUEUE_FLAG_SAME_COMP is already set by default.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jack Wang <jinpu.wang@ionos.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20240627124926.512662-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-28 10:37:34 -06:00
Christoph Hellwig
667ea36378 loop: don't set QUEUE_FLAG_NOMERGES
QUEUE_FLAG_NOMERGES isn't really a driver interface, but a user tunable.
There also isn't any good reason to set it in the loop driver.

The original commit adding it (5b5e20f421 "block: loop: set
QUEUE_FLAG_NOMERGES for request queue of loop") claims that "It doesn't
make sense to enable merge because the I/O submitted to backing file is
handled page by page."  which of course isn't true for multi-page bvec
now, and it never has been for direct I/O, for which commit 40326d8a33
("block/loop: allow request merge for directio mode") alredy disabled
the nomerges flag.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20240627124926.512662-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-28 10:37:34 -06:00
Gulam Mohamed
18048c1af7 loop: Fix a race between loop detach and loop open
1. Userspace sends the command "losetup -d" which uses the open() call
   to open the device
2. Kernel receives the ioctl command "LOOP_CLR_FD" which calls the
   function loop_clr_fd()
3. If LOOP_CLR_FD is the first command received at the time, then the
   AUTOCLEAR flag is not set and deletion of the
   loop device proceeds ahead and scans the partitions (drop/add
   partitions)

        if (disk_openers(lo->lo_disk) > 1) {
                lo->lo_flags |= LO_FLAGS_AUTOCLEAR;
                loop_global_unlock(lo, true);
                return 0;
        }

 4. Before scanning partitions, it will check to see if any partition of
    the loop device is currently opened
 5. If any partition is opened, then it will return EBUSY:

    if (disk->open_partitions)
                return -EBUSY;
 6. So, after receiving the "LOOP_CLR_FD" command and just before the above
    check for open_partitions, if any other command
    (like blkid) opens any partition of the loop device, then the partition
    scan will not proceed and EBUSY is returned as shown in above code
 7. But in "__loop_clr_fd()", this EBUSY error is not propagated
 8. We have noticed that this is causing the partitions of the loop to
    remain stale even after the loop device is detached resulting in the
    IO errors on the partitions

Fix:

Defer the detach of loop device to release function, which is called when
the last close happens, by setting the lo_flags to LO_FLAGS_AUTOCLEAR at
the time of detach i.e in loop_clr_fd() function.

Test case involves the following two scripts:

script1.sh:

while [ 1 ];
do
        losetup -P -f /home/opt/looptest/test10.img
        blkid /dev/loop0p1
done

script2.sh:

while [ 1 ];
do
        losetup -d /dev/loop0
done

Without fix, the following IO errors have been observed:

kernel: __loop_clr_fd: partition scan of loop0 failed (rc=-16)
kernel: I/O error, dev loop0, sector 20971392 op 0x0:(READ) flags 0x80700
        phys_seg 1 prio class 0
kernel: I/O error, dev loop0, sector 108868 op 0x0:(READ) flags 0x0
        phys_seg 1 prio class 0
kernel: Buffer I/O error on dev loop0p1, logical block 27201, async page
        read

Signed-off-by: Gulam Mohamed <gulam.mohamed@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240618164042.343777-1-gulam.mohamed@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-27 16:10:32 -06:00
Jeff Johnson
876835b128 brd: add missing MODULE_DESCRIPTION() macro
make allmodconfig && make W=1 C=1 reports:
modpost: missing MODULE_DESCRIPTION() in drivers/block/brd.o

Add the missing invocation of the MODULE_DESCRIPTION() macro.

Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Link: https://lore.kernel.org/r/20240602-md-block-brd-v1-1-e71338e131b6@quicinc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-24 08:38:52 -06:00
Damien Le Moal
4ac9056e4b null_blk: Do not set disk->nr_zones
In null_register_zoned_dev(), there is no need to set disk->nr_zones as
the now uncoditional call to blk_revalidate_disk_zones() will do that.
So remove the assignment using bdev_nr_zones().

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20240621031506.759397-2-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-21 08:26:35 -06:00
Jens Axboe
69c34f07e4 Merge branch 'for-6.11/block-limits' into for-6.11/block
Merge in last round of queue limits changes from Christoph.

* for-6.11/block-limits: (26 commits)
  block: move the bounce flag into the features field
  block: move the skip_tagset_quiesce flag to queue_limits
  block: move the pci_p2pdma flag to queue_limits
  block: move the zone_resetall flag to queue_limits
  block: move the zoned flag into the features field
  block: move the poll flag to queue_limits
  block: move the dax flag to queue_limits
  block: move the nowait flag to queue_limits
  block: move the synchronous flag to queue_limits
  block: move the stable_writes flag to queue_limits
  block: move the io_stat flag setting to queue_limits
  block: move the add_random flag to queue_limits
  block: move the nonrot flag to queue_limits
  block: move cache control settings out of queue->flags
  block: remove blk_flush_policy
  block: freeze the queue in queue_attr_store
  nbd: move setting the cache control flags to __nbd_set_size
  virtio_blk: remove virtblk_update_cache_mode
  loop: fold loop_update_rotational into loop_reconfigure_limits
  loop: also use the default block size from an underlying block device
  ...

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19 08:14:49 -06:00
Christoph Hellwig
a52758a397 block: move the zone_resetall flag to queue_limits
Move the zone_resetall flag into the queue_limits feature field so that
it can be set atomically with the queue frozen.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20240617060532.127975-24-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19 07:58:28 -06:00
Christoph Hellwig
b1fc937a55 block: move the zoned flag into the features field
Move the zoned flags into the features field to reclaim a little
bit of space.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20240617060532.127975-23-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19 07:58:28 -06:00
Christoph Hellwig
f76af42f8b block: move the nowait flag to queue_limits
Move the nowait flag into the queue_limits feature field so that it can
be set atomically with the queue frozen.

Stacking drivers are simplified in that they now can simply set the
flag, and blk_stack_limits will clear it when the features is not
supported by any of the underlying devices.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20240617060532.127975-20-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19 07:58:28 -06:00
Christoph Hellwig
aadd5c59c9 block: move the synchronous flag to queue_limits
Move the synchronous flag into the queue_limits feature field so that it
can be set atomically with the queue frozen.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20240617060532.127975-19-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19 07:58:28 -06:00
Christoph Hellwig
1a02f3a73f block: move the stable_writes flag to queue_limits
Move the stable_writes flag into the queue_limits feature field so that
it can be set atomically with the queue frozen.

The flag is now inherited by blk_stack_limits, which greatly simplifies
the code in dm, and fixed md which previously did not pass on the flag
set on lower devices.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20240617060532.127975-18-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19 07:58:28 -06:00
Christoph Hellwig
39a9f1c334 block: move the add_random flag to queue_limits
Move the add_random flag into the queue_limits feature field so that it
can be set atomically with the queue frozen.

Note that this also removes code from dm to clear the flag based on
the underlying devices, which can't be reached as dm devices will
always start out without the flag set.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20240617060532.127975-16-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19 07:58:28 -06:00
Christoph Hellwig
bd4a633b6f block: move the nonrot flag to queue_limits
Move the nonrot flag into the queue_limits feature field so that it can
be set atomically with the queue frozen.

Use the chance to switch to defaulting to non-rotational and require
the driver to opt into rotational, which matches the polarity of the
sysfs interface.

For the z2ram, ps3vram, 2x memstick, ubiblock and dcssblk the new
rotational flag is not set as they clearly are not rotational despite
this being a behavior change.  There are some other drivers that
unconditionally set the rotational flag to keep the existing behavior
as they arguably can be used on rotational devices even if that is
probably not their main use today (e.g. virtio_blk and drbd).

The flag is automatically inherited in blk_stack_limits matching the
existing behavior in dm and md.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20240617060532.127975-15-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19 07:58:28 -06:00
Christoph Hellwig
1122c0c1cc block: move cache control settings out of queue->flags
Move the cache control settings into the queue_limits so that the flags
can be set atomically with the device queue frozen.

Add new features and flags field for the driver set flags, and internal
(usually sysfs-controlled) flags in the block layer.  Note that we'll
eventually remove enough field from queue_limits to bring it back to the
previous size.

The disable flag is inverted compared to the previous meaning, which
means it now survives a rescan, similar to the max_sectors and
max_discard_sectors user limits.

The FLUSH and FUA flags are now inherited by blk_stack_limits, which
simplified the code in dm a lot, but also causes a slight behavior
change in that dm-switch and dm-unstripe now advertise a write cache
despite setting num_flush_bios to 0.  The I/O path will handle this
gracefully, but as far as I can tell the lack of num_flush_bios
and thus flush support is a pre-existing data integrity bug in those
targets that really needs fixing, after which a non-zero num_flush_bios
should be required in dm for targets that map to underlying devices.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Ulf Hansson <ulf.hansson@linaro.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20240617060532.127975-14-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19 07:58:28 -06:00
Christoph Hellwig
6b377787a3 nbd: move setting the cache control flags to __nbd_set_size
Move setting the cache control flags in nbd in preparation for moving
these flags into the queue_limits structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20240617060532.127975-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19 07:58:28 -06:00
Christoph Hellwig
bbe5c84122 virtio_blk: remove virtblk_update_cache_mode
virtblk_update_cache_mode boils down to a single call to
blk_queue_write_cache.  Remove it in preparation for moving the cache
control flags into the queue_limits.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20240617060532.127975-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19 07:58:28 -06:00
Christoph Hellwig
97dd4a43d6 loop: fold loop_update_rotational into loop_reconfigure_limits
This prepares for moving the rotational flag into the queue_limits and
also fixes it for the case where the loop device is backed by a block
device.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20240617060532.127975-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19 07:58:28 -06:00
Christoph Hellwig
4ce37fe093 loop: also use the default block size from an underlying block device
Fix the code in loop_reconfigure_limits to pick a default block size for
O_DIRECT file descriptors to also work when the loop device sits on top
of a block device and not just on a regular file on a block device based
file system.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20240617060532.127975-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19 07:58:28 -06:00
Christoph Hellwig
a17ece76bc loop: regularize upgrading the block size for direct I/O
The LOOP_CONFIGURE path automatically upgrades the block size to that
of the underlying file for O_DIRECT file descriptors, but the
LOOP_SET_BLOCK_SIZE path does not.  Fix this by lifting the code to
pick the block size into common code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20240617060532.127975-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19 07:58:28 -06:00
Christoph Hellwig
ae0d40ff49 loop: always update discard settings in loop_reconfigure_limits
Simplify loop_reconfigure_limits by always updating the discard limits.
This adds a little more work to loop_set_block_size, but doesn't change
the outcome as the discard flag won't change.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20240617060532.127975-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19 07:58:28 -06:00
Christoph Hellwig
c9055b44ab loop: stop using loop_reconfigure_limits in __loop_clr_fd
__loop_clr_fd wants to clear all settings on the device.  Prepare for
moving more settings into the block limits by open coding
loop_reconfigure_limits.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20240617060532.127975-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19 07:58:27 -06:00
Christoph Hellwig
dd9300e9ea xen-blkfront: don't disable cache flushes when they fail
blkfront always had a robust negotiation protocol for detecting a write
cache.  Stop simply disabling cache flushes in the block layer as the
flags handling is moving to the atomic queue limits API that needs
user context to freeze the queue for that.  Instead handle the case
of the feature flags cleared inside of blkfront.  This removes old
debug code to check for such a mismatch which was previously impossible
to hit, including the check for passthrough requests that blkfront
never used to start with.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20240617060532.127975-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19 07:58:27 -06:00
Jeff Johnson
465478bb00 z2ram: add missing MODULE_DESCRIPTION() macro
With ARCH=m68k, make allmodconfig && make W=1 C=1 reports:
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/block/z2ram.o

Add the missing invocation of the MODULE_DESCRIPTION() macro.

Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20240617-md-m68k-drivers-block-v1-3-b200599a315e@quicinc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19 07:54:00 -06:00
Jeff Johnson
ba8df22e25 ataflop: add missing MODULE_DESCRIPTION() macro
With ARCH=m68k, make allmodconfig && make W=1 C=1 reports:
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/block/ataflop.o

Add the missing invocation of the MODULE_DESCRIPTION() macro.

Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20240617-md-m68k-drivers-block-v1-2-b200599a315e@quicinc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19 07:54:00 -06:00
Jeff Johnson
28d8c13830 amiflop: add missing MODULE_DESCRIPTION() macro
With ARCH=m68k, make allmodconfig && make W=1 C=1 reports:
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/block/amiflop.o

Add the missing invocation of the MODULE_DESCRIPTION() macro.

Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20240617-md-m68k-drivers-block-v1-1-b200599a315e@quicinc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19 07:54:00 -06:00
Jens Axboe
e3e72fe4cb Merge branch 'for-6.11/block-limits' into for-6.11/block
Pull in block limits branch, which exists as a shared branch for both
the block and SCSI tree.

* for-6.11/block-limits: (26 commits)
  block: move integrity information into queue_limits
  block: invert the BLK_INTEGRITY_{GENERATE,VERIFY} flags
  block: bypass the STABLE_WRITES flag for protection information
  block: don't require stable pages for non-PI metadata
  block: use kstrtoul in flag_store
  block: factor out flag_{store,show} helper for integrity
  block: remove the blk_flush_integrity call in blk_integrity_unregister
  block: remove the blk_integrity_profile structure
  dm-integrity: use the nop integrity profile
  md/raid1: don't free conf on raid0_run failure
  md/raid0: don't free conf on raid0_run failure
  block: initialize integrity buffer to zero before writing it to media
  block: add special APIs for run-time disabling of discard and friends
  block: remove unused queue limits API
  sr: convert to the atomic queue limits API
  sd: convert to the atomic queue limits API
  sd: cleanup zoned queue limits initialization
  sd: factor out a sd_discard_mode helper
  sd: simplify the disable case in sd_config_discard
  sd: add a sd_disable_write_same helper
  ...
2024-06-14 10:22:08 -06:00
Christoph Hellwig
73e3715ed1 block: add special APIs for run-time disabling of discard and friends
A few drivers optimistically try to support discard, write zeroes and
secure erase and disable the features from the I/O completion handler
if the hardware can't support them.  This disable can't be done using
the atomic queue limits API because the I/O completion handlers can't
take sleeping locks or freeze the queue.  Keep the existing clearing
of the relevant field to zero, but replace the old blk_queue_max_*
APIs with new disable APIs that force the value to 0.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240531074837.1648501-15-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-14 10:19:44 -06:00
Christoph Hellwig
a23634644a block: take io_opt and io_min into account for max_sectors
The soft max_sectors limit is normally capped by the hardware limits and
an arbitrary upper limit enforced by the kernel, but can be modified by
the user.  A few drivers want to increase this limit (nbd, rbd) or
adjust it up or down based on hardware capabilities (sd).

Change blk_validate_limits to default max_sectors to the optimal I/O
size, or upgrade it to the preferred minimal I/O size if that is
larger than the kernel default if no optimal I/O size is provided based
on the logic in the SD driver.

This keeps the existing kernel default for drivers that do not provide
an io_opt or very big io_min value, but picks a much more useful
default for those who provide these hints, and allows to remove the
hacks to set the user max_sectors limit in nbd, rbd and sd.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Acked-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240531074837.1648501-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-14 10:19:44 -06:00
Christoph Hellwig
a00d4bfce7 rbd: increase io_opt again
Commit 16d80c54ad ("rbd: set io_min, io_opt and discard_granularity to
alloc_size") lowered the io_opt size for rbd from objset_bytes which is
4MB for typical setup to alloc_size which is typically 64KB.

The commit mostly talks about discard behavior and does mention io_min
in passing.  Reducing io_opt means reducing the readahead size, which
seems counter-intuitive given that rbd currently abuses the user
max_sectors setting to actually increase the I/O size.  Switch back
to the old setting to allow larger reads (the readahead size despite it's
name actually limits the size of any buffered read) and to prepare
for using io_opt in the max_sectors calculation and getting drivers out
of the business of overriding the max_user_sectors value.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240531074837.1648501-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-14 10:19:44 -06:00
Andreas Hindborg
bc5b533b91 rust: block: add rnull, Rust null_blk implementation
This patch adds an initial version of the Rust null block driver.

Signed-off-by: Andreas Hindborg <a.hindborg@samsung.com>
Reviewed-by: Benno Lossin <benno.lossin@proton.me>
Link: https://lore.kernel.org/r/20240611114551.228679-3-nmi@metaspace.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-14 07:45:04 -06:00
Cyril Hrubis
5f75e081ab loop: Disable fallocate() zero and discard if not supported
If fallcate is implemented but zero and discard operations are not
supported by the filesystem the backing file is on we continue to fill
dmesg with errors from the blk_mq_end_request() since each time we call
fallocate() on the loop device the EOPNOTSUPP error from lo_fallocate()
ends up propagated into the block layer. In the end syscall succeeds
since the blkdev_issue_zeroout() falls back to writing zeroes which
makes the errors even more misleading and confusing.

How to reproduce:

1. make sure /tmp is mounted as tmpfs
2. dd if=/dev/zero of=/tmp/disk.img bs=1M count=100
3. losetup /dev/loop0 /tmp/disk.img
4. mkfs.ext2 /dev/loop0
5. dmesg |tail

[710690.898214] operation not supported error, dev loop0, sector 204672 op 0x9:(WRITE_ZEROES) flags 0x8000800 phys_seg 0 prio class 0
[710690.898279] operation not supported error, dev loop0, sector 522 op 0x9:(WRITE_ZEROES) flags 0x8000800 phys_seg 0 prio class 0
[710690.898603] operation not supported error, dev loop0, sector 16906 op 0x9:(WRITE_ZEROES) flags 0x8000800 phys_seg 0 prio class 0
[710690.898917] operation not supported error, dev loop0, sector 32774 op 0x9:(WRITE_ZEROES) flags 0x8000800 phys_seg 0 prio class 0
[710690.899218] operation not supported error, dev loop0, sector 49674 op 0x9:(WRITE_ZEROES) flags 0x8000800 phys_seg 0 prio class 0
[710690.899484] operation not supported error, dev loop0, sector 65542 op 0x9:(WRITE_ZEROES) flags 0x8000800 phys_seg 0 prio class 0
[710690.899743] operation not supported error, dev loop0, sector 82442 op 0x9:(WRITE_ZEROES) flags 0x8000800 phys_seg 0 prio class 0
[710690.900015] operation not supported error, dev loop0, sector 98310 op 0x9:(WRITE_ZEROES) flags 0x8000800 phys_seg 0 prio class 0
[710690.900276] operation not supported error, dev loop0, sector 115210 op 0x9:(WRITE_ZEROES) flags 0x8000800 phys_seg 0 prio class 0
[710690.900546] operation not supported error, dev loop0, sector 131078 op 0x9:(WRITE_ZEROES) flags 0x8000800 phys_seg 0 prio class 0

This patch changes the lo_fallocate() to clear the flags for zero and
discard operations if we get EOPNOTSUPP from the backing file fallocate
callback, that way we at least stop spewing errors after the first
unsuccessful try.

CC: Jan Kara <jack@suse.cz>
Signed-off-by: Cyril Hrubis <chrubis@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240613163817.22640-1-chrubis@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-14 06:21:25 -06:00
Christoph Hellwig
957df9af72 nbd: Remove __force casts
Make it again possible for sparse to verify that blk_status_t and Unix
error codes are used in the proper context by making nbd_send_cmd()
return a blk_status_t instead of an integer.

No functionality has been changed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
[ bvanassche: added description and made two small formatting changes ]
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20240604221531.327131-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-12 14:47:07 -06:00
Andreas Hindborg
c462ecd659 null_blk: fix validation of block size
Block size should be between 512 and PAGE_SIZE and be a power of 2. The current
check does not validate this, so update the check.

Without this patch, null_blk would Oops due to a null pointer deref when
loaded with bs=1536 [1].

Link: https://lore.kernel.org/all/87wmn8mocd.fsf@metaspace.dk/

Signed-off-by: Andreas Hindborg <a.hindborg@samsung.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20240603192645.977968-1-nmi@metaspace.dk
[axboe: remove unnecessary braces and != 0 check]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-05 12:12:54 -06:00
Damien Le Moal
b164316808 null_blk: Do not allow runt zone with zone capacity smaller then zone size
A zoned device with a smaller last zone together with a zone capacity
smaller than the zone size does make any sense as that does not
correspond to any possible setup for a real device:
1) For ZNS and zoned UFS devices, all zones are always the same size.
2) For SMR HDDs, all zones always have the same capacity.
In other words, if we have a smaller last runt zone, then this zone
capacity should always be equal to the zone size.

Add a check in null_init_zoned_dev() to prevent a configuration to have
both a smaller zone size and a zone capacity smaller than the zone size.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Niklas Cassel <cassel@kernel.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20240530054035.491497-2-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-30 15:03:52 -06:00
Damien Le Moal
233e27b4d2 null_blk: Print correct max open zones limit in null_init_zoned_dev()
When changing the maximum number of open zones, print that number
instead of the total number of zones.

Fixes: dc4d137ee3 ("null_blk: add support for max open/active zone limit for zoned devices")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Niklas Cassel <cassel@kernel.org>
Link: https://lore.kernel.org/r/20240528062852.437599-1-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-28 06:54:21 -06:00
Damien Le Moal
d9ff882b54 null_blk: Fix return value of nullb_device_power_store()
When powering on a null_blk device that is not already on, the return
value ret that is initialized to be count is reused to check the return
value of null_add_dev(), leading to nullb_device_power_store() to return
null_add_dev() return value (0 on success) instead of "count".
So make sure to set ret to be equal to count when there are no errors.

Fixes: a2db328b08 ("null_blk: fix null-ptr-dereference while configuring 'power' and 'submit_queues'")
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20240527043445.235267-1-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-27 13:56:58 -06:00
Linus Torvalds
b4d88a60fe block-6.10-20240523
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmZPaegQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgplkkD/4h1vxr2a6jg44TEUJ9f59rIOELuYHXJdpt
 5m7r8UWcy7LF6HfmMgSeHV/7Gr1bBw6jh1eMubZRt9pZJ1sSGnc6vQdrOU+RnG9k
 F9i0qogAD2WXClQPAxvHGC1KD1quSdeiKME0hNJdGA6SsV4cYnDVeR8O6SQbaomD
 KPeGGBdjvrygRFhyDBFDACWK3GuD5POlbswUOwASYNrAb4OrQsj+bX/QXkuOXir9
 n/NW/RfiQqAvI4m51yzaMqfFWw+s0irhXNfchl3i8RBMvDFBRNEkgtDN4y2rUynK
 +FaDeAwGXR51/qL9gr0ZScXAY6Q7f/B9FkrTUZR7S1lD3JsLXiS+uOefXEljKsDd
 RpNUc0sX3RjaSu1uNiUD/H4v+umvR+r3uuAyH6OXstCQt+98SJUbQvZuzphVGC60
 iM8W+NRsaYZUhjN4LBj0NBGgCiidHanm22GCPADWN1fxZbjRWUoA886sZXTqmmMj
 +GGqpPU3pbGtj09ysaJpLKxu1TbD3QmcCUVPWQ8+DKt8PGGDDa+vIRXV8xswwQDg
 DyZoq0s/s00DzCXiPsbvVyKwXCJ1XSB0sEq0gvjDfGXb+5h6T+lH2irbcjBxUlwq
 qbofAmk6PVjxeWMUP4NXE04oK5Itc/l20LT9ECFPWzMdc1ht31TsqmxldHLIpDqp
 KUeacOh94A==
 =Btam
 -----END PGP SIGNATURE-----

Merge tag 'block-6.10-20240523' of git://git.kernel.dk/linux

Pull more block updates from Jens Axboe:
 "Followup block updates, mostly due to NVMe being a bit late to the
  party. But nothing major in there, so not a big deal.

  In detail, this contains:

   - NVMe pull request via Keith:
       - Fabrics connection retries (Daniel, Hannes)
       - Fabrics logging enhancements (Tokunori)
       - RDMA delete optimization (Sagi)

   - ublk DMA alignment fix (me)

   - null_blk sparse warning fixes (Bart)

   - Discard support for brd (Keith)

   - blk-cgroup list corruption fixes (Ming)

   - blk-cgroup stat propagation fix (Waiman)

   - Regression fix for plugging stall with md (Yu)

   - Misc fixes or cleanups (David, Jeff, Justin)"

* tag 'block-6.10-20240523' of git://git.kernel.dk/linux: (24 commits)
  null_blk: fix null-ptr-dereference while configuring 'power' and 'submit_queues'
  blk-throttle: remove unused struct 'avg_latency_bucket'
  block: fix lost bio for plug enabled bio based device
  block: t10-pi: add MODULE_DESCRIPTION()
  blk-mq: add helper for checking if one CPU is mapped to specified hctx
  blk-cgroup: Properly propagate the iostat update up the hierarchy
  blk-cgroup: fix list corruption from reorder of WRITE ->lqueued
  blk-cgroup: fix list corruption from resetting io stat
  cdrom: rearrange last_media_change check to avoid unintentional overflow
  nbd: Fix signal handling
  nbd: Remove a local variable from nbd_send_cmd()
  nbd: Improve the documentation of the locking assumptions
  nbd: Remove superfluous casts
  nbd: Use NULL to represent a pointer
  brd: implement discard support
  null_blk: Fix two sparse warnings
  ublk_drv: set DMA alignment mask to 3
  nvme-rdma, nvme-tcp: include max reconnects for reconnect logging
  nvmet-rdma: Avoid o(n^2) loop in delete_ctrl
  nvme: do not retry authentication failures
  ...
2024-05-23 13:44:47 -07:00
Linus Torvalds
d6a326d694 tracing: Remove second argument of __assign_str()
The __assign_str() macro logic of the TRACE_EVENT() macro was optimized so
 that it no longer needs the second argument. The __assign_str() is always
 matched with __string() field that takes a field name and the source for
 that field:
 
   __string(field, source)
 
 The TRACE_EVENT() macro logic will save off the source value and then use
 that value to copy into the ring buffer via the __assign_str(). Before
 commit c1fa617cae ("tracing: Rework __assign_str() and __string() to not
 duplicate getting the string"), the __assign_str() needed the second
 argument which would perform the same logic as the __string() source
 parameter did. Not only would this add overhead, but it was error prone as
 if the __assign_str() source produced something different, it may not have
 allocated enough for the string in the ring buffer (as the __string()
 source was used to determine how much to allocate)
 
 Now that the __assign_str() just uses the same string that was used in
 __string() it no longer needs the source parameter. It can now be removed.
 -----BEGIN PGP SIGNATURE-----
 
 iIkEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZk9RMBQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qur+AP9jbSYaGhzZdJ7a3HGA8M4l6JNju8nC
 GcX1JpJT4z1qvgD3RkoNvP87etDAUAqmbVhVWnUHCY/vTqr9uB/gqmG6Ag==
 =Y+6f
 -----END PGP SIGNATURE-----

Merge tag 'trace-assign-str-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing cleanup from Steven Rostedt:
 "Remove second argument of __assign_str()

  The __assign_str() macro logic of the TRACE_EVENT() macro was
  optimized so that it no longer needs the second argument. The
  __assign_str() is always matched with __string() field that takes a
  field name and the source for that field:

    __string(field, source)

  The TRACE_EVENT() macro logic will save off the source value and then
  use that value to copy into the ring buffer via the __assign_str().

  Before commit c1fa617cae ("tracing: Rework __assign_str() and
  __string() to not duplicate getting the string"), the __assign_str()
  needed the second argument which would perform the same logic as the
  __string() source parameter did. Not only would this add overhead, but
  it was error prone as if the __assign_str() source produced something
  different, it may not have allocated enough for the string in the ring
  buffer (as the __string() source was used to determine how much to
  allocate)

  Now that the __assign_str() just uses the same string that was used in
  __string() it no longer needs the source parameter. It can now be
  removed"

* tag 'trace-assign-str-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing/treewide: Remove second parameter of __assign_str()
2024-05-23 12:28:01 -07:00
Linus Torvalds
2ef32ad224 virtio: features, fixes, cleanups
Several new features here:
 
 - virtio-net is finally supported in vduse.
 
 - Virtio (balloon and mem) interaction with suspend is improved
 
 - vhost-scsi now handles signals better/faster.
 
 Fixes, cleanups all over the place.
 
 Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQFDBAABCAAtFiEEXQn9CHHI+FuUyooNKB8NuNKNVGkFAmZN570PHG1zdEByZWRo
 YXQuY29tAAoJECgfDbjSjVRp2JUH/1K3fZOHymop6Y5Z3USFS7YdlF+dniedY/vg
 TKyWERkXOlxq1d9DVxC0mN7tk72DweuWI0YJjLXofrEW1VuW29ecSbyFXxpeWJls
 b7ErffxDAFRas5jkMCngD8TuFnbEegU0mGP5kbiHpEndBydQ2hH99Gg0x7swW+cE
 xsvU5zonCCLwLGIP2DrVrn9qGOHtV6o8eZfVKDVXfvicn3lFBkUSxlwEYsO9RMup
 aKxV4FT2Pb1yBicwBK4TH1oeEXqEGy1YLEn+kAHRbgoC/5L0/LaiqrkzwzwwOIPj
 uPGkacf8CIbX0qZo5EzD8kvfcYL1xhU3eT9WBmpp2ZwD+4bINd4=
 =nax1
 -----END PGP SIGNATURE-----

Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost

Pull virtio updates from Michael Tsirkin:
 "Several new features here:

   - virtio-net is finally supported in vduse

   - virtio (balloon and mem) interaction with suspend is improved

   - vhost-scsi now handles signals better/faster

  And fixes, cleanups all over the place"

* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (48 commits)
  virtio-pci: Check if is_avq is NULL
  virtio: delete vq in vp_find_vqs_msix() when request_irq() fails
  MAINTAINERS: add Eugenio Pérez as reviewer
  vhost-vdpa: Remove usage of the deprecated ida_simple_xx() API
  vp_vdpa: don't allocate unused msix vectors
  sound: virtio: drop owner assignment
  fuse: virtio: drop owner assignment
  scsi: virtio: drop owner assignment
  rpmsg: virtio: drop owner assignment
  nvdimm: virtio_pmem: drop owner assignment
  wifi: mac80211_hwsim: drop owner assignment
  vsock/virtio: drop owner assignment
  net: 9p: virtio: drop owner assignment
  net: virtio: drop owner assignment
  net: caif: virtio: drop owner assignment
  misc: nsm: drop owner assignment
  iommu: virtio: drop owner assignment
  drm/virtio: drop owner assignment
  gpio: virtio: drop owner assignment
  firmware: arm_scmi: virtio: drop owner assignment
  ...
2024-05-23 12:04:36 -07:00
Yu Kuai
a2db328b08 null_blk: fix null-ptr-dereference while configuring 'power' and 'submit_queues'
Writing 'power' and 'submit_queues' concurrently will trigger kernel
panic:

Test script:

modprobe null_blk nr_devices=0
mkdir -p /sys/kernel/config/nullb/nullb0
while true; do echo 1 > submit_queues; echo 4 > submit_queues; done &
while true; do echo 1 > power; echo 0 > power; done

Test result:

BUG: kernel NULL pointer dereference, address: 0000000000000148
Oops: 0000 [#1] PREEMPT SMP
RIP: 0010:__lock_acquire+0x41d/0x28f0
Call Trace:
 <TASK>
 lock_acquire+0x121/0x450
 down_write+0x5f/0x1d0
 simple_recursive_removal+0x12f/0x5c0
 blk_mq_debugfs_unregister_hctxs+0x7c/0x100
 blk_mq_update_nr_hw_queues+0x4a3/0x720
 nullb_update_nr_hw_queues+0x71/0xf0 [null_blk]
 nullb_device_submit_queues_store+0x79/0xf0 [null_blk]
 configfs_write_iter+0x119/0x1e0
 vfs_write+0x326/0x730
 ksys_write+0x74/0x150

This is because del_gendisk() can concurrent with
blk_mq_update_nr_hw_queues():

nullb_device_power_store	nullb_apply_submit_queues
 null_del_dev
 del_gendisk
				 nullb_update_nr_hw_queues
				  if (!dev->nullb)
				  // still set while gendisk is deleted
				   return 0
				  blk_mq_update_nr_hw_queues
 dev->nullb = NULL

Fix this problem by resuing the global mutex to protect
nullb_device_power_store() and nullb_update_nr_hw_queues() from configfs.

Fixes: 45919fbfe1 ("null_blk: Enable modifying 'submit_queues' after an instance has been configured")
Reported-and-tested-by: Yi Zhang <yi.zhang@redhat.com>
Closes: https://lore.kernel.org/all/CAHj4cs9LgsHLnjg8z06LQ3Pr5cax-+Ps+xT7AP7TPnEjStuwZA@mail.gmail.com/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Link: https://lore.kernel.org/r/20240523153934.1937851-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-23 06:58:13 -06:00
Steven Rostedt (Google)
2c92ca849f tracing/treewide: Remove second parameter of __assign_str()
With the rework of how the __string() handles dynamic strings where it
saves off the source string in field in the helper structure[1], the
assignment of that value to the trace event field is stored in the helper
value and does not need to be passed in again.

This means that with:

  __string(field, mystring)

Which use to be assigned with __assign_str(field, mystring), no longer
needs the second parameter and it is unused. With this, __assign_str()
will now only get a single parameter.

There's over 700 users of __assign_str() and because coccinelle does not
handle the TRACE_EVENT() macro I ended up using the following sed script:

  git grep -l __assign_str | while read a ; do
      sed -e 's/\(__assign_str([^,]*[^ ,]\) *,[^;]*/\1)/' $a > /tmp/test-file;
      mv /tmp/test-file $a;
  done

I then searched for __assign_str() that did not end with ';' as those
were multi line assignments that the sed script above would fail to catch.

Note, the same updates will need to be done for:

  __assign_str_len()
  __assign_rel_str()
  __assign_rel_str_len()

I tested this with both an allmodconfig and an allyesconfig (build only for both).

[1] https://lore.kernel.org/linux-trace-kernel/20240222211442.634192653@goodmis.org/

Link: https://lore.kernel.org/linux-trace-kernel/20240516133454.681ba6a0@rorschach.local.home

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Julia Lawall <Julia.Lawall@inria.fr>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: Jani Nikula <jani.nikula@intel.com>
Acked-by: Christian König <christian.koenig@amd.com> for the amdgpu parts.
Acked-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> #for
Acked-by: Rafael J. Wysocki <rafael@kernel.org> # for thermal
Acked-by: Takashi Iwai <tiwai@suse.de>
Acked-by: Darrick J. Wong <djwong@kernel.org>	# xfs
Tested-by: Guenter Roeck <linux@roeck-us.net>
2024-05-22 20:14:47 -04:00
Krzysztof Kozlowski
bdb8e2f8e8 virtio_blk: drop owner assignment
virtio core already sets the .owner, so driver does not need to.

Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>

Message-Id: <20240331-module-owner-virtio-v2-6-98f04bfaf46a@linaro.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
2024-05-22 08:31:16 -04:00
Linus Torvalds
b6394d6f71 Assorted commits that had missed the last merge window...
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZkzp/gAKCRBZ7Krx/gZQ
 63KFAQCsKv3XdcF+2BO+QuwPvR6eAvDxFjrFEcQFyyOXgFVLaAD/UMM0HcEFWxBb
 PCPvyKVP22wF9PbodkrKJn8DRdtRZwM=
 =jvWv
 -----END PGP SIGNATURE-----

Merge tag 'pull-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull misc vfs updates from Al Viro:
 "Assorted commits that had missed the last merge window..."

* tag 'pull-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  remove call_{read,write}_iter() functions
  do_dentry_open(): kill inode argument
  kernel_file_open(): get rid of inode argument
  get_file_rcu(): no need to check for NULL separately
  fd_is_open(): move to fs/file.c
  close_on_exec(): pass files_struct instead of fdtable
2024-05-21 13:11:44 -07:00
Linus Torvalds
5ad8b6ad9a getting rid of bogus set_blocksize() uses, switching it
to struct file * and verifying that caller has device
 opened exclusively.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZkwkfQAKCRBZ7Krx/gZQ
 62C3AQDW5vuXNx2+KDPma5YStjFpPLC0xtSyAS5D3YANjtyRFgD/TOcCarq7rvBt
 KubxHVFsfW+eu6ASeaoMRB83w5OIzwk=
 =Liix
 -----END PGP SIGNATURE-----

Merge tag 'pull-set_blocksize' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull vfs blocksize updates from Al Viro:
 "This gets rid of bogus set_blocksize() uses, switches it over
  to be based on a 'struct file *' and verifies that the caller
  has the device opened exclusively"

* tag 'pull-set_blocksize' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  make set_blocksize() fail unless block device is opened exclusive
  set_blocksize(): switch to passing struct file *
  btrfs_get_bdev_and_sb(): call set_blocksize() only for exclusive opens
  swsusp: don't bother with setting block size
  zram: don't bother with reopening - just use O_EXCL for open
  swapon(2): open swap with O_EXCL
  swapon(2)/swapoff(2): don't bother with block size
  pktcdvd: sort set_blocksize() calls out
  bcache_register(): don't bother with set_blocksize()
2024-05-21 08:34:51 -07:00
Linus Torvalds
61307b7be4 The usual shower of singleton fixes and minor series all over MM,
documented (hopefully adequately) in the respective changelogs.  Notable
 series include:
 
 - Lucas Stach has provided some page-mapping
   cleanup/consolidation/maintainability work in the series "mm/treewide:
   Remove pXd_huge() API".
 
 - In the series "Allow migrate on protnone reference with
   MPOL_PREFERRED_MANY policy", Donet Tom has optimized mempolicy's
   MPOL_PREFERRED_MANY mode, yielding almost doubled performance in one
   test.
 
 - In their series "Memory allocation profiling" Kent Overstreet and
   Suren Baghdasaryan have contributed a means of determining (via
   /proc/allocinfo) whereabouts in the kernel memory is being allocated:
   number of calls and amount of memory.
 
 - Matthew Wilcox has provided the series "Various significant MM
   patches" which does a number of rather unrelated things, but in largely
   similar code sites.
 
 - In his series "mm: page_alloc: freelist migratetype hygiene" Johannes
   Weiner has fixed the page allocator's handling of migratetype requests,
   with resulting improvements in compaction efficiency.
 
 - In the series "make the hugetlb migration strategy consistent" Baolin
   Wang has fixed a hugetlb migration issue, which should improve hugetlb
   allocation reliability.
 
 - Liu Shixin has hit an I/O meltdown caused by readahead in a
   memory-tight memcg.  Addressed in the series "Fix I/O high when memory
   almost met memcg limit".
 
 - In the series "mm/filemap: optimize folio adding and splitting" Kairui
   Song has optimized pagecache insertion, yielding ~10% performance
   improvement in one test.
 
 - Baoquan He has cleaned up and consolidated the early zone
   initialization code in the series "mm/mm_init.c: refactor
   free_area_init_core()".
 
 - Baoquan has also redone some MM initializatio code in the series
   "mm/init: minor clean up and improvement".
 
 - MM helper cleanups from Christoph Hellwig in his series "remove
   follow_pfn".
 
 - More cleanups from Matthew Wilcox in the series "Various page->flags
   cleanups".
 
 - Vlastimil Babka has contributed maintainability improvements in the
   series "memcg_kmem hooks refactoring".
 
 - More folio conversions and cleanups in Matthew Wilcox's series
 
 	"Convert huge_zero_page to huge_zero_folio"
 	"khugepaged folio conversions"
 	"Remove page_idle and page_young wrappers"
 	"Use folio APIs in procfs"
 	"Clean up __folio_put()"
 	"Some cleanups for memory-failure"
 	"Remove page_mapping()"
 	"More folio compat code removal"
 
 - David Hildenbrand chipped in with "fs/proc/task_mmu: convert hugetlb
   functions to work on folis".
 
 - Code consolidation and cleanup work related to GUP's handling of
   hugetlbs in Peter Xu's series "mm/gup: Unify hugetlb, part 2".
 
 - Rick Edgecombe has developed some fixes to stack guard gaps in the
   series "Cover a guard gap corner case".
 
 - Jinjiang Tu has fixed KSM's behaviour after a fork+exec in the series
   "mm/ksm: fix ksm exec support for prctl".
 
 - Baolin Wang has implemented NUMA balancing for multi-size THPs.  This
   is a simple first-cut implementation for now.  The series is "support
   multi-size THP numa balancing".
 
 - Cleanups to vma handling helper functions from Matthew Wilcox in the
   series "Unify vma_address and vma_pgoff_address".
 
 - Some selftests maintenance work from Dev Jain in the series
   "selftests/mm: mremap_test: Optimizations and style fixes".
 
 - Improvements to the swapping of multi-size THPs from Ryan Roberts in
   the series "Swap-out mTHP without splitting".
 
 - Kefeng Wang has significantly optimized the handling of arm64's
   permission page faults in the series
 
 	"arch/mm/fault: accelerate pagefault when badaccess"
 	"mm: remove arch's private VM_FAULT_BADMAP/BADACCESS"
 
 - GUP cleanups from David Hildenbrand in "mm/gup: consistently call it
   GUP-fast".
 
 - hugetlb fault code cleanups from Vishal Moola in "Hugetlb fault path to
   use struct vm_fault".
 
 - selftests build fixes from John Hubbard in the series "Fix
   selftests/mm build without requiring "make headers"".
 
 - Memory tiering fixes/improvements from Ho-Ren (Jack) Chuang in the
   series "Improved Memory Tier Creation for CPUless NUMA Nodes".  Fixes
   the initialization code so that migration between different memory types
   works as intended.
 
 - David Hildenbrand has improved follow_pte() and fixed an errant driver
   in the series "mm: follow_pte() improvements and acrn follow_pte()
   fixes".
 
 - David also did some cleanup work on large folio mapcounts in his
   series "mm: mapcount for large folios + page_mapcount() cleanups".
 
 - Folio conversions in KSM in Alex Shi's series "transfer page to folio
   in KSM".
 
 - Barry Song has added some sysfs stats for monitoring multi-size THP's
   in the series "mm: add per-order mTHP alloc and swpout counters".
 
 - Some zswap cleanups from Yosry Ahmed in the series "zswap same-filled
   and limit checking cleanups".
 
 - Matthew Wilcox has been looking at buffer_head code and found the
   documentation to be lacking.  The series is "Improve buffer head
   documentation".
 
 - Multi-size THPs get more work, this time from Lance Yang.  His series
   "mm/madvise: enhance lazyfreeing with mTHP in madvise_free" optimizes
   the freeing of these things.
 
 - Kemeng Shi has added more userspace-visible writeback instrumentation
   in the series "Improve visibility of writeback".
 
 - Kemeng Shi then sent some maintenance work on top in the series "Fix
   and cleanups to page-writeback".
 
 - Matthew Wilcox reduces mmap_lock traffic in the anon vma code in the
   series "Improve anon_vma scalability for anon VMAs".  Intel's test bot
   reported an improbable 3x improvement in one test.
 
 - SeongJae Park adds some DAMON feature work in the series
 
 	"mm/damon: add a DAMOS filter type for page granularity access recheck"
 	"selftests/damon: add DAMOS quota goal test"
 
 - Also some maintenance work in the series
 
 	"mm/damon/paddr: simplify page level access re-check for pageout"
 	"mm/damon: misc fixes and improvements"
 
 - David Hildenbrand has disabled some known-to-fail selftests ni the
   series "selftests: mm: cow: flag vmsplice() hugetlb tests as XFAIL".
 
 - memcg metadata storage optimizations from Shakeel Butt in "memcg:
   reduce memory consumption by memcg stats".
 
 - DAX fixes and maintenance work from Vishal Verma in the series
   "dax/bus.c: Fixups for dax-bus locking".
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZkgQYwAKCRDdBJ7gKXxA
 jrdKAP9WVJdpEcXxpoub/vVE0UWGtffr8foifi9bCwrQrGh5mgEAx7Yf0+d/oBZB
 nvA4E0DcPrUAFy144FNM0NTCb7u9vAw=
 =V3R/
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2024-05-17-19-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull mm updates from Andrew Morton:
 "The usual shower of singleton fixes and minor series all over MM,
  documented (hopefully adequately) in the respective changelogs.
  Notable series include:

   - Lucas Stach has provided some page-mapping cleanup/consolidation/
     maintainability work in the series "mm/treewide: Remove pXd_huge()
     API".

   - In the series "Allow migrate on protnone reference with
     MPOL_PREFERRED_MANY policy", Donet Tom has optimized mempolicy's
     MPOL_PREFERRED_MANY mode, yielding almost doubled performance in
     one test.

   - In their series "Memory allocation profiling" Kent Overstreet and
     Suren Baghdasaryan have contributed a means of determining (via
     /proc/allocinfo) whereabouts in the kernel memory is being
     allocated: number of calls and amount of memory.

   - Matthew Wilcox has provided the series "Various significant MM
     patches" which does a number of rather unrelated things, but in
     largely similar code sites.

   - In his series "mm: page_alloc: freelist migratetype hygiene"
     Johannes Weiner has fixed the page allocator's handling of
     migratetype requests, with resulting improvements in compaction
     efficiency.

   - In the series "make the hugetlb migration strategy consistent"
     Baolin Wang has fixed a hugetlb migration issue, which should
     improve hugetlb allocation reliability.

   - Liu Shixin has hit an I/O meltdown caused by readahead in a
     memory-tight memcg. Addressed in the series "Fix I/O high when
     memory almost met memcg limit".

   - In the series "mm/filemap: optimize folio adding and splitting"
     Kairui Song has optimized pagecache insertion, yielding ~10%
     performance improvement in one test.

   - Baoquan He has cleaned up and consolidated the early zone
     initialization code in the series "mm/mm_init.c: refactor
     free_area_init_core()".

   - Baoquan has also redone some MM initializatio code in the series
     "mm/init: minor clean up and improvement".

   - MM helper cleanups from Christoph Hellwig in his series "remove
     follow_pfn".

   - More cleanups from Matthew Wilcox in the series "Various
     page->flags cleanups".

   - Vlastimil Babka has contributed maintainability improvements in the
     series "memcg_kmem hooks refactoring".

   - More folio conversions and cleanups in Matthew Wilcox's series:
	"Convert huge_zero_page to huge_zero_folio"
	"khugepaged folio conversions"
	"Remove page_idle and page_young wrappers"
	"Use folio APIs in procfs"
	"Clean up __folio_put()"
	"Some cleanups for memory-failure"
	"Remove page_mapping()"
	"More folio compat code removal"

   - David Hildenbrand chipped in with "fs/proc/task_mmu: convert
     hugetlb functions to work on folis".

   - Code consolidation and cleanup work related to GUP's handling of
     hugetlbs in Peter Xu's series "mm/gup: Unify hugetlb, part 2".

   - Rick Edgecombe has developed some fixes to stack guard gaps in the
     series "Cover a guard gap corner case".

   - Jinjiang Tu has fixed KSM's behaviour after a fork+exec in the
     series "mm/ksm: fix ksm exec support for prctl".

   - Baolin Wang has implemented NUMA balancing for multi-size THPs.
     This is a simple first-cut implementation for now. The series is
     "support multi-size THP numa balancing".

   - Cleanups to vma handling helper functions from Matthew Wilcox in
     the series "Unify vma_address and vma_pgoff_address".

   - Some selftests maintenance work from Dev Jain in the series
     "selftests/mm: mremap_test: Optimizations and style fixes".

   - Improvements to the swapping of multi-size THPs from Ryan Roberts
     in the series "Swap-out mTHP without splitting".

   - Kefeng Wang has significantly optimized the handling of arm64's
     permission page faults in the series
	"arch/mm/fault: accelerate pagefault when badaccess"
	"mm: remove arch's private VM_FAULT_BADMAP/BADACCESS"

   - GUP cleanups from David Hildenbrand in "mm/gup: consistently call
     it GUP-fast".

   - hugetlb fault code cleanups from Vishal Moola in "Hugetlb fault
     path to use struct vm_fault".

   - selftests build fixes from John Hubbard in the series "Fix
     selftests/mm build without requiring "make headers"".

   - Memory tiering fixes/improvements from Ho-Ren (Jack) Chuang in the
     series "Improved Memory Tier Creation for CPUless NUMA Nodes".
     Fixes the initialization code so that migration between different
     memory types works as intended.

   - David Hildenbrand has improved follow_pte() and fixed an errant
     driver in the series "mm: follow_pte() improvements and acrn
     follow_pte() fixes".

   - David also did some cleanup work on large folio mapcounts in his
     series "mm: mapcount for large folios + page_mapcount() cleanups".

   - Folio conversions in KSM in Alex Shi's series "transfer page to
     folio in KSM".

   - Barry Song has added some sysfs stats for monitoring multi-size
     THP's in the series "mm: add per-order mTHP alloc and swpout
     counters".

   - Some zswap cleanups from Yosry Ahmed in the series "zswap
     same-filled and limit checking cleanups".

   - Matthew Wilcox has been looking at buffer_head code and found the
     documentation to be lacking. The series is "Improve buffer head
     documentation".

   - Multi-size THPs get more work, this time from Lance Yang. His
     series "mm/madvise: enhance lazyfreeing with mTHP in madvise_free"
     optimizes the freeing of these things.

   - Kemeng Shi has added more userspace-visible writeback
     instrumentation in the series "Improve visibility of writeback".

   - Kemeng Shi then sent some maintenance work on top in the series
     "Fix and cleanups to page-writeback".

   - Matthew Wilcox reduces mmap_lock traffic in the anon vma code in
     the series "Improve anon_vma scalability for anon VMAs". Intel's
     test bot reported an improbable 3x improvement in one test.

   - SeongJae Park adds some DAMON feature work in the series
	"mm/damon: add a DAMOS filter type for page granularity access recheck"
	"selftests/damon: add DAMOS quota goal test"

   - Also some maintenance work in the series
	"mm/damon/paddr: simplify page level access re-check for pageout"
	"mm/damon: misc fixes and improvements"

   - David Hildenbrand has disabled some known-to-fail selftests ni the
     series "selftests: mm: cow: flag vmsplice() hugetlb tests as
     XFAIL".

   - memcg metadata storage optimizations from Shakeel Butt in "memcg:
     reduce memory consumption by memcg stats".

   - DAX fixes and maintenance work from Vishal Verma in the series
     "dax/bus.c: Fixups for dax-bus locking""

* tag 'mm-stable-2024-05-17-19-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (426 commits)
  memcg, oom: cleanup unused memcg_oom_gfp_mask and memcg_oom_order
  selftests/mm: hugetlb_madv_vs_map: avoid test skipping by querying hugepage size at runtime
  mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_wp
  mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_fault
  selftests: cgroup: add tests to verify the zswap writeback path
  mm: memcg: make alloc_mem_cgroup_per_node_info() return bool
  mm/damon/core: fix return value from damos_wmark_metric_value
  mm: do not update memcg stats for NR_{FILE/SHMEM}_PMDMAPPED
  selftests: cgroup: remove redundant enabling of memory controller
  Docs/mm/damon/maintainer-profile: allow posting patches based on damon/next tree
  Docs/mm/damon/maintainer-profile: change the maintainer's timezone from PST to PT
  Docs/mm/damon/design: use a list for supported filters
  Docs/admin-guide/mm/damon/usage: fix wrong schemes effective quota update command
  Docs/admin-guide/mm/damon/usage: fix wrong example of DAMOS filter matching sysfs file
  selftests/damon: classify tests for functionalities and regressions
  selftests/damon/_damon_sysfs: use 'is' instead of '==' for 'None'
  selftests/damon/_damon_sysfs: find sysfs mount point from /proc/mounts
  selftests/damon/_damon_sysfs: check errors from nr_schemes file reads
  mm/damon/core: initialize ->esz_bp from damos_quota_init_priv()
  selftests/damon: add a test for DAMOS quota goal
  ...
2024-05-19 09:21:03 -07:00
Bart Van Assche
e56d4b633f nbd: Fix signal handling
Both nbd_send_cmd() and nbd_handle_cmd() return either a negative error
number or a positive blk_status_t value. nbd_queue_rq() converts these
return values into a blk_status_t value. There is a bug in the conversion
code: if nbd_send_cmd() returns BLK_STS_RESOURCE, nbd_queue_rq() should
return BLK_STS_RESOURCE instead of BLK_STS_OK. Fix this, move the
conversion code into nbd_handle_cmd() and fix the remaining sparse warnings.

This patch fixes the following sparse warnings:

drivers/block/nbd.c:673:32: warning: incorrect type in return expression (different base types)
drivers/block/nbd.c:673:32:    expected int
drivers/block/nbd.c:673:32:    got restricted blk_status_t [usertype]
drivers/block/nbd.c:714:48: warning: incorrect type in return expression (different base types)
drivers/block/nbd.c:714:48:    expected int
drivers/block/nbd.c:714:48:    got restricted blk_status_t [usertype]
drivers/block/nbd.c:1120:21: warning: incorrect type in assignment (different base types)
drivers/block/nbd.c:1120:21:    expected int [assigned] ret
drivers/block/nbd.c:1120:21:    got restricted blk_status_t [usertype]
drivers/block/nbd.c:1125:16: warning: incorrect type in return expression (different base types)
drivers/block/nbd.c:1125:16:    expected restricted blk_status_t
drivers/block/nbd.c:1125:16:    got int [assigned] ret

Cc: Christoph Hellwig <hch@lst.de>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Yu Kuai <yukuai3@huawei.com>
Cc: Markus Pargmann <mpa@pengutronix.de>
Fixes: fc17b6534e ("blk-mq: switch ->queue_rq return value to blk_status_t")
Cc: stable@vger.kernel.org
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20240510202313.25209-6-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-14 07:22:35 -06:00
Bart Van Assche
f6cb9a2c3d nbd: Remove a local variable from nbd_send_cmd()
blk_rq_bytes() returns an unsigned int while 'size' has type unsigned long.
This is confusing. Improve code readability by removing the local variable
'size'.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Yu Kuai <yukuai3@huawei.com>
Cc: Markus Pargmann <mpa@pengutronix.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20240510202313.25209-5-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-14 07:22:35 -06:00
Bart Van Assche
2a6751e052 nbd: Improve the documentation of the locking assumptions
Document locking assumptions with lockdep_assert_held() instead of source
code comments. The advantage of lockdep_assert_held() is that it is
verified at runtime if lockdep is enabled in the kernel config.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Yu Kuai <yukuai3@huawei.com>
Cc: Markus Pargmann <mpa@pengutronix.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20240510202313.25209-4-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-14 07:22:35 -06:00
Bart Van Assche
40639e9a0f nbd: Remove superfluous casts
In Linux kernel code it is preferred not to use a cast when converting a
void pointer to another pointer type.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Yu Kuai <yukuai3@huawei.com>
Cc: Markus Pargmann <mpa@pengutronix.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20240510202313.25209-3-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-14 07:22:35 -06:00
Keith Busch
9ead7efc6f brd: implement discard support
The ramdisk memory utilization can only go up when data is written to
new pages. Implement discard to provide the possibility to reduce memory
usage for pages no longer in use. Aligned discards will free the
associated pages, if any, and determinisitically return zeroed data
until written again.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20240429102308.147627-1-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-13 21:47:51 -06:00
Bart Van Assche
25260555b1 null_blk: Fix two sparse warnings
Fix the following sparse warnings:

drivers/block/null_blk/main.c:1243:35: warning: incorrect type in return expression (different base types)
drivers/block/null_blk/main.c:1243:35:    expected int
drivers/block/null_blk/main.c:1243:35:    got restricted blk_status_t
drivers/block/null_blk/main.c:1291:30: warning: incorrect type in return expression (different base types)
drivers/block/null_blk/main.c:1291:30:    expected restricted blk_status_t
drivers/block/null_blk/main.c:1291:30:    got int

Cc: Christoph Hellwig <hch@lst.de>
Cc: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20240510201816.24921-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-13 17:51:17 -06:00
Jens Axboe
928b607d1a ublk_drv: set DMA alignment mask to 3
By default, this will be 511, as that's the block layer default. But
drivers these days can support memory alignments that aren't tied to
the sector sizes, instead just being limited by what the DMA engine
supports. An example is NVMe, where it's generally set to a 32-bit or
64-bit boundary. As ublk itself doesn't really care, just set it low
enough that we don't run into issues with NVMe where the required
O_DIRECT memory alignment is now more restrictive on ublk than it is
on the underlying device.

This was triggered by spurious -EINVAL returns on O_DIRECT IO on a
setup with ublk managing NVMe devices, which previously worked just
fine on the NVMe device itself. With the alignment relaxed, the test
works fine.

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-13 17:51:06 -06:00
Linus Torvalds
0c9f4ac808 for-6.10/block-20240511
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmY/YgsQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpvi0EACwnFRtYioizBH0x7QUHTBcIr0IhACd5gfz
 bm+uwlDUtf6G6lupHdJT9gOVB2z2z1m2Pz//8RuUVWw3Eqw2+rfgG8iJd+yo7IaV
 DpX3WaM4NnBvB7FKOKHlMPvGuf7KgbZ3uPm3x8cbrn/axMmkZ6ljxTixJ3p5t4+s
 xRsef/lVdG71DkXIFgTKATB86yNRJNlRQTbL+sZW22vdXdtfyBbOgR1sBuFfp7Hd
 g/uocZM/z0ahM6JH/5R2IX2ttKXMIBZLA8HRkJdvYqg022cj4js2YyRCPU3N6jQN
 MtN4TpJV5I++8l6SPQOOhaDNrK/6zFtDQpwG0YBiKKj3nQDgVbWWb8ejYTIUv4MP
 SrEto4MVBEqg5N65VwYYhIf45rmueFyJp6z0Vqv6Owur5nuww/YIFknmoMa/WDMd
 V8dIU3zL72FZDbPjIBjxHeqAGz9OgzEVafled7pi0Xbw6wqiB4kZihlMGXlD+WBy
 Yd6xo8PX4i5+d2LLKKPxpW1X0eJlKYJ/4dnYCoFN8LmXSiPJnMx2pYrV+NqMxy4X
 Thr8lxswLQC7j9YBBuIeDl8NB9N5FZZLvaC6I25QKq045M2ckJ+VrounsQb3vGwJ
 72nlxxBZL8wz3sasgX9Pc1Cez9AqYbM+UZahq8ezPY5y3Jh0QfRw/MOk1ZaDNC8V
 CNOHBH0E+Q==
 =HnjE
 -----END PGP SIGNATURE-----

Merge tag 'for-6.10/block-20240511' of git://git.kernel.dk/linux

Pull block updates from Jens Axboe:

 - Add a partscan attribute in sysfs, fixing an issue with systemd
   relying on an internal interface that went away.

 - Attempt #2 at making long running discards interruptible. The
   previous attempt went into 6.9, but we ended up mostly reverting it
   as it had issues.

 - Remove old ida_simple API in bcache

 - Support for zoned write plugging, greatly improving the performance
   on zoned devices.

 - Remove the old throttle low interface, which has been experimental
   since 2017 and never made it beyond that and isn't being used.

 - Remove page->index debugging checks in brd, as it hasn't caught
   anything and prepares us for removing in struct page.

 - MD pull request from Song

 - Don't schedule block workers on isolated CPUs

* tag 'for-6.10/block-20240511' of git://git.kernel.dk/linux: (84 commits)
  blk-throttle: delay initialization until configuration
  blk-throttle: remove CONFIG_BLK_DEV_THROTTLING_LOW
  block: fix that util can be greater than 100%
  block: support to account io_ticks precisely
  block: add plug while submitting IO
  bcache: fix variable length array abuse in btree_iter
  bcache: Remove usage of the deprecated ida_simple_xx() API
  md: Revert "md: Fix overflow in is_mddev_idle"
  blk-lib: check for kill signal in ioctl BLKDISCARD
  block: add a bio_await_chain helper
  block: add a blk_alloc_discard_bio helper
  block: add a bio_chain_and_submit helper
  block: move discard checks into the ioctl handler
  block: remove the discard_granularity check in __blkdev_issue_discard
  block/ioctl: prefer different overflow check
  null_blk: Fix the WARNING: modpost: missing MODULE_DESCRIPTION()
  block: fix and simplify blkdevparts= cmdline parsing
  block: refine the EOF check in blkdev_iomap_begin
  block: add a partscan sysfs attribute for disks
  block: add a disk_has_partscan helper
  ...
2024-05-13 13:03:54 -07:00
Zhu Yanjun
9e6727f824 null_blk: Fix the WARNING: modpost: missing MODULE_DESCRIPTION()
No functional changes intended.

Fixes: f2298c0403 ("null_blk: multi queue aware block test driver")
Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20240506075538.6064-1-yanjun.zhu@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-06 06:50:09 -06:00
Al Viro
ead083aeee set_blocksize(): switch to passing struct file *
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-05-02 17:39:44 -04:00
Al Viro
ebb0173df2 zram: don't bother with reopening - just use O_EXCL for open
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-05-02 17:39:26 -04:00
Al Viro
3a52c03d1e pktcdvd: sort set_blocksize() calls out
1) it doesn't make any sense to have ->open() call set_blocksize() on the
device being opened - the caller will override that anyway.

2) setting block size on underlying device, OTOH, ought to be done when
we are opening it exclusive - i.e. as part of pkt_open_dev().  Having
it done at setup time doesn't guarantee us anything about the state
at the time we start talking to it.  Worse, if you happen to have
the underlying device containing e.g. ext2 with 4Kb blocks that
is currently mounted r/o, that set_blocksize() will confuse the hell
out of filesystem.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-05-02 17:23:30 -04:00
Uday Shankar
eaf4a9b19b ublk: remove segment count and size limits
ublk_drv currently creates block devices with the default max_segments
and max_segment_size limits of BLK_MAX_SEGMENTS (128) and
BLK_MAX_SEGMENT_SIZE (65536) respectively. These defaults can
artificially constrain the I/O size seen by the ublk server - for
example, suppose that the ublk server has configured itself to accept
I/Os up to 1M and the application is also issuing 1M sized I/Os. If the
I/O buffer used by the application is backed by 4K pages, the buffer
could consist of up to 1M / 4K = 256 physically discontiguous segments
(even if the buffer is virtually contiguous). As such, the I/O could
exceed the default max_segments limit and get split. This can cause
unnecessary performance issues if the ublk server is optimized to handle
1M I/Os. The block layer's segment count/size limits exist to model
hardware constraints which don't exist in ublk_drv's case, so just
remove those limits for the block devices created by ublk_drv.

Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Reviewed-by: Riley Thomasson <riley@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20240430211623.2802036-1-ushankar@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-30 15:36:50 -06:00
Sergey Senozhatsky
34efe1c3b6 zram: add max_pages param to recompression
Introduce "max_pages" param to recompress device attribute which sets an
upper limit on the number of entries (pages) zram attempts to recompress
(in this particular recompression call).  S/W recompression can be quite
expensive so limiting the number of pages recompress touches can be quite
helpful.

Link: https://lkml.kernel.org/r/20240329094050.2815699-1-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Acked-by: Brian Geffon <bgeffon@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25 20:56:30 -07:00
Zhu Yanjun
07d1b99825 null_blk: Fix missing mutex_destroy() at module removal
When a mutex lock is not used any more, the function mutex_destroy
should be called to mark the mutex lock uninitialized.

Fixes: f2298c0403 ("null_blk: multi queue aware block test driver")
Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Link: https://lore.kernel.org/r/20240425171635.4227-1-yanjun.zhu@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-25 14:15:00 -06:00
Damien Le Moal
e994ff5b55 null_blk: Simplify null_zone_write()
In null_zone_write, we do not need to first check if the target zone
condition is FULL, READONLY or OFFLINE: for theses conditions, the check
of the command sector against the zone write pointer will always result
in the command failing. Remove these checks.

We still however need to check that the target zone write pointer is not
invalid for zone append operations. To do so, add the macro
NULL_ZONE_INVALID_WP and use it in null_set_zone_cond() when changing a
zone to READONLY or OFFLINE condition.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20240411085502.728558-4-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17 08:44:07 -06:00
Damien Le Moal
3bdde0701e null_blk: Do zone resource management only if necessary
For zoned null_blk devices setup without any limit on the maximum number
of open and active zones, there is no need to count the number of zones
that are implicitly open, explicitly open and closed. This is indicated
by the boolean field need_zone_res_mgmt of sturct nullb_device.

Modify the zone management functions null_reset_zone(),
null_finish_zone(), null_open_zone() and null_close_zone() to manage
the zone condition counters only if the device need_zone_res_mgmt field
is true. With this change, the function __null_close_zone() is removed
and integrated into the 2 caller sites directly, with the
null_close_imp_open_zone() call site greatly simplified as this function
closes zones that are known to be in the implicit open condition.

null_zone_write() is modified in a similar manner to do zone condition
accouting only when the device need_zone_res_mgmt field is true.

With these changes, the inline helpers null_lock_zone_res() and
null_unlock_zone_res() are removed and replaced with direct calls to
spin_lock()/spin_unlock().

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20240411085502.728558-3-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17 08:44:07 -06:00
Damien Le Moal
cb9e5273f6 null_blk: Have all null_handle_xxx() return a blk_status_t
Modify the null_handle_flush() and null_handle_rq() functions to return
a blk_status_t instead of an errno to simplify the call sites of these
functions and to be consistant with other null_handle_xxx() functions.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20240411085502.728558-2-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17 08:44:07 -06:00
Damien Le Moal
9b3c08b90f block: Simplify blk_revalidate_disk_zones() interface
The only user of blk_revalidate_disk_zones() second argument was the
SCSI disk driver (sd). Now that this driver does not require this
update_driver_data argument, remove it to simplify the interface of
blk_revalidate_disk_zones(). Also update the function kdoc comment to
be more accurate (i.e. there is no gendisk ->revalidate method).

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Tested-by: Hans Holmberg <hans.holmberg@wdc.com>
Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240408014128.205141-21-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17 08:44:03 -06:00
Damien Le Moal
f4f84586c8 null_blk: Introduce fua attribute
Add the fua configfs attribute and module parameter to allow
configuring if the device supports FUA or not. Using this attribute
has an effect on the null_blk device only if memory backing is enabled
together with a write cache (cache_size option).

This new attribute allows configuring a null_blk device with a write
cache but without FUA support. This is convenient to test the block
layer flush machinery.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Tested-by: Hans Holmberg <hans.holmberg@wdc.com>
Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240408014128.205141-18-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17 08:44:03 -06:00
Damien Le Moal
997a1f08b4 null_blk: Introduce zone_append_max_sectors attribute
Add the zone_append_max_sectors configfs attribute and module parameter
to allow configuring the maximum number of 512B sectors of zone append
operations. This attribute is meaningful only for zoned null block
devices.

If not specified, the default is unchanged and the zoned device max
append sectors limit is set to the device max sectors limit.
If a non 0 value is used for this attribute, which is the default,
then native support for zone append operations is enabled.
Setting a 0 value disables native zone append operations support to
instead use the block layer emulation.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Tested-by: Hans Holmberg <hans.holmberg@wdc.com>
Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240408014128.205141-17-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17 08:44:03 -06:00
Damien Le Moal
b66f79b706 null_blk: Do not request ELEVATOR_F_ZBD_SEQ_WRITE elevator feature
With zone write plugging enabled at the block layer level, a zoned
device can only ever see at most a single write operation per zone.
There is thus no need to request a block scheduler with strick per-zone
sequential write ordering control through the ELEVATOR_F_ZBD_SEQ_WRITE
feature. Removing this allows using a zoned null_blk device with any
scheduler, including "none".

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Tested-by: Hans Holmberg <hans.holmberg@wdc.com>
Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240408014128.205141-16-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17 08:44:03 -06:00
Damien Le Moal
11be0cb5fe ublk_drv: Do not request ELEVATOR_F_ZBD_SEQ_WRITE elevator feature
With zone write plugging enabled at the block layer level, any zone
device can only ever see at most a single write operation per zone.
There is thus no need to request a block scheduler with strick per-zone
sequential write ordering control through the ELEVATOR_F_ZBD_SEQ_WRITE
feature. Removing this allows using a zoned ublk device with any
scheduler, including "none".

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Tested-by: Hans Holmberg <hans.holmberg@wdc.com>
Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240408014128.205141-15-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17 08:44:03 -06:00
Miklos Szeredi
7c98f7cb8f remove call_{read,write}_iter() functions
These have no clear purpose.  This is effectively a revert of commit
bb7462b6fd ("vfs: use helpers for calling f_op->{read,write}_iter()").

The patch was created with the help of a coccinelle script.

Fixes: bb7462b6fd ("vfs: use helpers for calling f_op->{read,write}_iter()")
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-04-15 16:03:25 -04:00
Damien Le Moal
fbbd5d3ad9 nullblk: Fix cleanup order in null_add_dev() error path
In null_add_dev(), if an error happen after initializing the resources
for a zoned null block device, we must free these resources before
exiting the function. To ensure this, move the out_cleanup_zone label
after out_cleanup_disk as we jump to this latter label if an error
happens after calling null_init_zoned_dev().

Fixes: e440626b1c ("null_blk: pass queue_limits to blk_mq_alloc_disk")
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240330005300.1503252-1-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-02 07:43:24 -06:00
Matthew Wilcox (Oracle)
7d8d35791b brd: Remove use of page->index
This debugging check will become more costly in the future when we shrink
struct page.  It has not proven to be useful, so simply remove it.

This lets us use __xa_insert instead of __xa_cmpxchg() as we no longer
need to know about the page that is currently stored in the XArray.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20240315181212.2573753-1-willy@infradead.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-01 11:53:36 -06:00
Linus Torvalds
e3111d9c3f block-6.9-20240322
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmX9ucMQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpiAYD/9S0w2fH9i1l4zGjpzDLI1ZpjmvKWt/k0+R
 I2DTHuRbs4RG5F0tFiQ6HigMrCD4rSmrnapt3CCdgLNwuzwouHMuYNL6BdhWKIxL
 hQ+krGuUchPfxvLnn1UI9CDqx/uIG6PKPI/N5+P4JLQeNi99S1rRw+RhDdQHwTKw
 QQgH6EBomdxbRjsdcPe9ZJcWy8mU14HAQ6gCu5P6M5VwjcMfqyus2uXJUGvgLDhD
 ZbIsdpjtOZ9r47rzkOeaZsUiTn1smr5CZYfH3e5Ab7p3T3JU/VUlu6wvSj+tKnK+
 0ZhW51do0phk5zCUCjkXxWBdiqEPmf9XTYnWegp/2iYU/SmM1gn96K+oI7TkCfxX
 PSEDO0ekRo7EAa6aZA5AGUzyPdk00JL8GIdPLQnuRe2Lcb1uwHEcaKNGjrDhfx9L
 +14uq9H3kDzS8dUlQsPn9TzgzAjQ/mXdpyKZhUxRr9VLSCkh1mx9wyM1ecXLS7/7
 B79KZ4aMt9OVTXhGuElGQtFB+DGofBs3a+2/bqbakpw+qH1BceqY5oxchVjkp0hy
 FHLH5akYFZ+XIughceCgTRP2PIHwYsgOypOMX3LnKc5prcUmX8X2hesIHqEwaxEM
 6zT32DAZyD8NanvVHSV0wdkmzq4wKm72syOJMcqX4qE3ayoAaU3m7oMuRbx41t1O
 hGKyel48Rw==
 =vcLT
 -----END PGP SIGNATURE-----

Merge tag 'block-6.9-20240322' of git://git.kernel.dk/linux

Pull more block updates from Jens Axboe:

 - NVMe pull request via Keith:
     - Make an informative message less ominous (Keith)
     - Enhanced trace decoding (Guixin)
     - TCP updates (Hannes, Li)
     - Fabrics connect deadlock fix (Chunguang)
     - Platform API migration update (Uwe)
     - A new device quirk (Jiawei)

 - Remove dead assignment in fd (Yufeng)

* tag 'block-6.9-20240322' of git://git.kernel.dk/linux:
  nvmet-rdma: remove NVMET_RDMA_REQ_INVALIDATE_RKEY flag
  nvme: remove redundant BUILD_BUG_ON check
  floppy: remove duplicated code in redo_fd_request()
  nvme/tcp: Add wq_unbound modparam for nvme_tcp_wq
  nvme-tcp: Export the nvme_tcp_wq to sysfs
  drivers/nvme: Add quirks for device 126f:2262
  nvme: parse format command's lbafu when tracing
  nvme: add tracing of reservation commands
  nvme: parse zns command's zsa and zrasf to string
  nvme: use nvme_disk_is_ns_head helper
  nvme: fix reconnection fail due to reserved tag allocation
  nvmet: add tracing of zns commands
  nvmet: add tracing of authentication commands
  nvme-apple: Convert to platform remove callback returning void
  nvmet-tcp: do not continue for invalid icreq
  nvme: change shutdown timeout setting message
2024-03-22 12:46:07 -07:00
Yufeng Wang
50171b8667 floppy: remove duplicated code in redo_fd_request()
duplicated code in redo_fd_request(),
unlock_fdc() function has the same code "do_floppy = NULL" inside.

Signed-off-by: Yufeng Wang <wangyufeng@kylinos.cn>
Suggested-by: Denis Efremov <efremov@linux.com>
Link: https://lore.kernel.org/r/20240319014219.7812-1-wangyufeng@kylinos.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-18 20:19:56 -06:00
Linus Torvalds
e5eb28f6d1 - Kuan-Wei Chiu has developed the well-named series "lib min_heap: Min
heap optimizations".
 
 - Kuan-Wei Chiu has also sped up the library sorting code in the series
   "lib/sort: Optimize the number of swaps and comparisons".
 
 - Alexey Gladkov has added the ability for code running within an IPC
   namespace to alter its IPC and MQ limits.  The series is "Allow to
   change ipc/mq sysctls inside ipc namespace".
 
 - Geert Uytterhoeven has contributed some dhrystone maintenance work in
   the series "lib: dhry: miscellaneous cleanups".
 
 - Ryusuke Konishi continues nilfs2 maintenance work in the series
 
 	"nilfs2: eliminate kmap and kmap_atomic calls"
 	"nilfs2: fix kernel bug at submit_bh_wbc()"
 
 - Nathan Chancellor has updated our build tools requirements in the
   series "Bump the minimum supported version of LLVM to 13.0.1".
 
 - Muhammad Usama Anjum continues with the selftests maintenance work in
   the series "selftests/mm: Improve run_vmtests.sh".
 
 - Oleg Nesterov has done some maintenance work against the signal code
   in the series "get_signal: minor cleanups and fix".
 
 Plus the usual shower of singleton patches in various parts of the tree.
 Please see the individual changelogs for details.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZfMnvgAKCRDdBJ7gKXxA
 jjKMAP4/Upq07D4wjkMVPb+QrkipbbLpdcgJ++q3z6rba4zhPQD+M3SFriIJk/Xh
 tKVmvihFxfAhdDthseXcIf1nBjMALwY=
 =8rVc
 -----END PGP SIGNATURE-----

Merge tag 'mm-nonmm-stable-2024-03-14-09-36' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull non-MM updates from Andrew Morton:

 - Kuan-Wei Chiu has developed the well-named series "lib min_heap: Min
   heap optimizations".

 - Kuan-Wei Chiu has also sped up the library sorting code in the series
   "lib/sort: Optimize the number of swaps and comparisons".

 - Alexey Gladkov has added the ability for code running within an IPC
   namespace to alter its IPC and MQ limits. The series is "Allow to
   change ipc/mq sysctls inside ipc namespace".

 - Geert Uytterhoeven has contributed some dhrystone maintenance work in
   the series "lib: dhry: miscellaneous cleanups".

 - Ryusuke Konishi continues nilfs2 maintenance work in the series

	"nilfs2: eliminate kmap and kmap_atomic calls"
	"nilfs2: fix kernel bug at submit_bh_wbc()"

 - Nathan Chancellor has updated our build tools requirements in the
   series "Bump the minimum supported version of LLVM to 13.0.1".

 - Muhammad Usama Anjum continues with the selftests maintenance work in
   the series "selftests/mm: Improve run_vmtests.sh".

 - Oleg Nesterov has done some maintenance work against the signal code
   in the series "get_signal: minor cleanups and fix".

Plus the usual shower of singleton patches in various parts of the tree.
Please see the individual changelogs for details.

* tag 'mm-nonmm-stable-2024-03-14-09-36' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (77 commits)
  nilfs2: prevent kernel bug at submit_bh_wbc()
  nilfs2: fix failure to detect DAT corruption in btree and direct mappings
  ocfs2: enable ocfs2_listxattr for special files
  ocfs2: remove SLAB_MEM_SPREAD flag usage
  assoc_array: fix the return value in assoc_array_insert_mid_shortcut()
  buildid: use kmap_local_page()
  watchdog/core: remove sysctl handlers from public header
  nilfs2: use div64_ul() instead of do_div()
  mul_u64_u64_div_u64: increase precision by conditionally swapping a and b
  kexec: copy only happens before uchunk goes to zero
  get_signal: don't initialize ksig->info if SIGNAL_GROUP_EXIT/group_exec_task
  get_signal: hide_si_addr_tag_bits: fix the usage of uninitialized ksig
  get_signal: don't abuse ksig->info.si_signo and ksig->sig
  const_structs.checkpatch: add device_type
  Normalise "name (ad@dr)" MODULE_AUTHORs to "name <ad@dr>"
  dyndbg: replace kstrdup() + strchr() with kstrdup_and_replace()
  list: leverage list_is_head() for list_entry_is_head()
  nilfs2: MAINTAINERS: drop unreachable project mirror site
  smp: make __smp_processor_id() 0-argument macro
  fat: fix uninitialized field in nostale filehandles
  ...
2024-03-14 18:03:09 -07:00
Linus Torvalds
902861e34c - Sumanth Korikkar has taught s390 to allocate hotplug-time page frames
from hotplugged memory rather than only from main memory.  Series
   "implement "memmap on memory" feature on s390".
 
 - More folio conversions from Matthew Wilcox in the series
 
 	"Convert memcontrol charge moving to use folios"
 	"mm: convert mm counter to take a folio"
 
 - Chengming Zhou has optimized zswap's rbtree locking, providing
   significant reductions in system time and modest but measurable
   reductions in overall runtimes.  The series is "mm/zswap: optimize the
   scalability of zswap rb-tree".
 
 - Chengming Zhou has also provided the series "mm/zswap: optimize zswap
   lru list" which provides measurable runtime benefits in some
   swap-intensive situations.
 
 - And Chengming Zhou further optimizes zswap in the series "mm/zswap:
   optimize for dynamic zswap_pools".  Measured improvements are modest.
 
 - zswap cleanups and simplifications from Yosry Ahmed in the series "mm:
   zswap: simplify zswap_swapoff()".
 
 - In the series "Add DAX ABI for memmap_on_memory", Vishal Verma has
   contributed several DAX cleanups as well as adding a sysfs tunable to
   control the memmap_on_memory setting when the dax device is hotplugged
   as system memory.
 
 - Johannes Weiner has added the large series "mm: zswap: cleanups",
   which does that.
 
 - More DAMON work from SeongJae Park in the series
 
 	"mm/damon: make DAMON debugfs interface deprecation unignorable"
 	"selftests/damon: add more tests for core functionalities and corner cases"
 	"Docs/mm/damon: misc readability improvements"
 	"mm/damon: let DAMOS feeds and tame/auto-tune itself"
 
 - In the series "mm/mempolicy: weighted interleave mempolicy and sysfs
   extension" Rakie Kim has developed a new mempolicy interleaving policy
   wherein we allocate memory across nodes in a weighted fashion rather
   than uniformly.  This is beneficial in heterogeneous memory environments
   appearing with CXL.
 
 - Christophe Leroy has contributed some cleanup and consolidation work
   against the ARM pagetable dumping code in the series "mm: ptdump:
   Refactor CONFIG_DEBUG_WX and check_wx_pages debugfs attribute".
 
 - Luis Chamberlain has added some additional xarray selftesting in the
   series "test_xarray: advanced API multi-index tests".
 
 - Muhammad Usama Anjum has reworked the selftest code to make its
   human-readable output conform to the TAP ("Test Anything Protocol")
   format.  Amongst other things, this opens up the use of third-party
   tools to parse and process out selftesting results.
 
 - Ryan Roberts has added fork()-time PTE batching of THP ptes in the
   series "mm/memory: optimize fork() with PTE-mapped THP".  Mainly
   targeted at arm64, this significantly speeds up fork() when the process
   has a large number of pte-mapped folios.
 
 - David Hildenbrand also gets in on the THP pte batching game in his
   series "mm/memory: optimize unmap/zap with PTE-mapped THP".  It
   implements batching during munmap() and other pte teardown situations.
   The microbenchmark improvements are nice.
 
 - And in the series "Transparent Contiguous PTEs for User Mappings" Ryan
   Roberts further utilizes arm's pte's contiguous bit ("contpte
   mappings").  Kernel build times on arm64 improved nicely.  Ryan's series
   "Address some contpte nits" provides some followup work.
 
 - In the series "mm/hugetlb: Restore the reservation" Breno Leitao has
   fixed an obscure hugetlb race which was causing unnecessary page faults.
   He has also added a reproducer under the selftest code.
 
 - In the series "selftests/mm: Output cleanups for the compaction test",
   Mark Brown did what the title claims.
 
 - Kinsey Ho has added the series "mm/mglru: code cleanup and refactoring".
 
 - Even more zswap material from Nhat Pham.  The series "fix and extend
   zswap kselftests" does as claimed.
 
 - In the series "Introduce cpu_dcache_is_aliasing() to fix DAX
   regression" Mathieu Desnoyers has cleaned up and fixed rather a mess in
   our handling of DAX on archiecctures which have virtually aliasing data
   caches.  The arm architecture is the main beneficiary.
 
 - Lokesh Gidra's series "per-vma locks in userfaultfd" provides dramatic
   improvements in worst-case mmap_lock hold times during certain
   userfaultfd operations.
 
 - Some page_owner enhancements and maintenance work from Oscar Salvador
   in his series
 
 	"page_owner: print stacks and their outstanding allocations"
 	"page_owner: Fixup and cleanup"
 
 - Uladzislau Rezki has contributed some vmalloc scalability improvements
   in his series "Mitigate a vmap lock contention".  It realizes a 12x
   improvement for a certain microbenchmark.
 
 - Some kexec/crash cleanup work from Baoquan He in the series "Split
   crash out from kexec and clean up related config items".
 
 - Some zsmalloc maintenance work from Chengming Zhou in the series
 
 	"mm/zsmalloc: fix and optimize objects/page migration"
 	"mm/zsmalloc: some cleanup for get/set_zspage_mapping()"
 
 - Zi Yan has taught the MM to perform compaction on folios larger than
   order=0.  This a step along the path to implementaton of the merging of
   large anonymous folios.  The series is named "Enable >0 order folio
   memory compaction".
 
 - Christoph Hellwig has done quite a lot of cleanup work in the
   pagecache writeback code in his series "convert write_cache_pages() to
   an iterator".
 
 - Some modest hugetlb cleanups and speedups in Vishal Moola's series
   "Handle hugetlb faults under the VMA lock".
 
 - Zi Yan has changed the page splitting code so we can split huge pages
   into sizes other than order-0 to better utilize large folios.  The
   series is named "Split a folio to any lower order folios".
 
 - David Hildenbrand has contributed the series "mm: remove
   total_mapcount()", a cleanup.
 
 - Matthew Wilcox has sought to improve the performance of bulk memory
   freeing in his series "Rearrange batched folio freeing".
 
 - Gang Li's series "hugetlb: parallelize hugetlb page init on boot"
   provides large improvements in bootup times on large machines which are
   configured to use large numbers of hugetlb pages.
 
 - Matthew Wilcox's series "PageFlags cleanups" does that.
 
 - Qi Zheng's series "minor fixes and supplement for ptdesc" does that
   also.  S390 is affected.
 
 - Cleanups to our pagemap utility functions from Peter Xu in his series
   "mm/treewide: Replace pXd_large() with pXd_leaf()".
 
 - Nico Pache has fixed a few things with our hugepage selftests in his
   series "selftests/mm: Improve Hugepage Test Handling in MM Selftests".
 
 - Also, of course, many singleton patches to many things.  Please see
   the individual changelogs for details.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZfJpPQAKCRDdBJ7gKXxA
 joxeAP9TrcMEuHnLmBlhIXkWbIR4+ki+pA3v+gNTlJiBhnfVSgD9G55t1aBaRplx
 TMNhHfyiHYDTx/GAV9NXW84tasJSDgA=
 =TG55
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2024-03-13-20-04' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - Sumanth Korikkar has taught s390 to allocate hotplug-time page frames
   from hotplugged memory rather than only from main memory. Series
   "implement "memmap on memory" feature on s390".

 - More folio conversions from Matthew Wilcox in the series

	"Convert memcontrol charge moving to use folios"
	"mm: convert mm counter to take a folio"

 - Chengming Zhou has optimized zswap's rbtree locking, providing
   significant reductions in system time and modest but measurable
   reductions in overall runtimes. The series is "mm/zswap: optimize the
   scalability of zswap rb-tree".

 - Chengming Zhou has also provided the series "mm/zswap: optimize zswap
   lru list" which provides measurable runtime benefits in some
   swap-intensive situations.

 - And Chengming Zhou further optimizes zswap in the series "mm/zswap:
   optimize for dynamic zswap_pools". Measured improvements are modest.

 - zswap cleanups and simplifications from Yosry Ahmed in the series
   "mm: zswap: simplify zswap_swapoff()".

 - In the series "Add DAX ABI for memmap_on_memory", Vishal Verma has
   contributed several DAX cleanups as well as adding a sysfs tunable to
   control the memmap_on_memory setting when the dax device is
   hotplugged as system memory.

 - Johannes Weiner has added the large series "mm: zswap: cleanups",
   which does that.

 - More DAMON work from SeongJae Park in the series

	"mm/damon: make DAMON debugfs interface deprecation unignorable"
	"selftests/damon: add more tests for core functionalities and corner cases"
	"Docs/mm/damon: misc readability improvements"
	"mm/damon: let DAMOS feeds and tame/auto-tune itself"

 - In the series "mm/mempolicy: weighted interleave mempolicy and sysfs
   extension" Rakie Kim has developed a new mempolicy interleaving
   policy wherein we allocate memory across nodes in a weighted fashion
   rather than uniformly. This is beneficial in heterogeneous memory
   environments appearing with CXL.

 - Christophe Leroy has contributed some cleanup and consolidation work
   against the ARM pagetable dumping code in the series "mm: ptdump:
   Refactor CONFIG_DEBUG_WX and check_wx_pages debugfs attribute".

 - Luis Chamberlain has added some additional xarray selftesting in the
   series "test_xarray: advanced API multi-index tests".

 - Muhammad Usama Anjum has reworked the selftest code to make its
   human-readable output conform to the TAP ("Test Anything Protocol")
   format. Amongst other things, this opens up the use of third-party
   tools to parse and process out selftesting results.

 - Ryan Roberts has added fork()-time PTE batching of THP ptes in the
   series "mm/memory: optimize fork() with PTE-mapped THP". Mainly
   targeted at arm64, this significantly speeds up fork() when the
   process has a large number of pte-mapped folios.

 - David Hildenbrand also gets in on the THP pte batching game in his
   series "mm/memory: optimize unmap/zap with PTE-mapped THP". It
   implements batching during munmap() and other pte teardown
   situations. The microbenchmark improvements are nice.

 - And in the series "Transparent Contiguous PTEs for User Mappings"
   Ryan Roberts further utilizes arm's pte's contiguous bit ("contpte
   mappings"). Kernel build times on arm64 improved nicely. Ryan's
   series "Address some contpte nits" provides some followup work.

 - In the series "mm/hugetlb: Restore the reservation" Breno Leitao has
   fixed an obscure hugetlb race which was causing unnecessary page
   faults. He has also added a reproducer under the selftest code.

 - In the series "selftests/mm: Output cleanups for the compaction
   test", Mark Brown did what the title claims.

 - Kinsey Ho has added the series "mm/mglru: code cleanup and
   refactoring".

 - Even more zswap material from Nhat Pham. The series "fix and extend
   zswap kselftests" does as claimed.

 - In the series "Introduce cpu_dcache_is_aliasing() to fix DAX
   regression" Mathieu Desnoyers has cleaned up and fixed rather a mess
   in our handling of DAX on archiecctures which have virtually aliasing
   data caches. The arm architecture is the main beneficiary.

 - Lokesh Gidra's series "per-vma locks in userfaultfd" provides
   dramatic improvements in worst-case mmap_lock hold times during
   certain userfaultfd operations.

 - Some page_owner enhancements and maintenance work from Oscar Salvador
   in his series

	"page_owner: print stacks and their outstanding allocations"
	"page_owner: Fixup and cleanup"

 - Uladzislau Rezki has contributed some vmalloc scalability
   improvements in his series "Mitigate a vmap lock contention". It
   realizes a 12x improvement for a certain microbenchmark.

 - Some kexec/crash cleanup work from Baoquan He in the series "Split
   crash out from kexec and clean up related config items".

 - Some zsmalloc maintenance work from Chengming Zhou in the series

	"mm/zsmalloc: fix and optimize objects/page migration"
	"mm/zsmalloc: some cleanup for get/set_zspage_mapping()"

 - Zi Yan has taught the MM to perform compaction on folios larger than
   order=0. This a step along the path to implementaton of the merging
   of large anonymous folios. The series is named "Enable >0 order folio
   memory compaction".

 - Christoph Hellwig has done quite a lot of cleanup work in the
   pagecache writeback code in his series "convert write_cache_pages()
   to an iterator".

 - Some modest hugetlb cleanups and speedups in Vishal Moola's series
   "Handle hugetlb faults under the VMA lock".

 - Zi Yan has changed the page splitting code so we can split huge pages
   into sizes other than order-0 to better utilize large folios. The
   series is named "Split a folio to any lower order folios".

 - David Hildenbrand has contributed the series "mm: remove
   total_mapcount()", a cleanup.

 - Matthew Wilcox has sought to improve the performance of bulk memory
   freeing in his series "Rearrange batched folio freeing".

 - Gang Li's series "hugetlb: parallelize hugetlb page init on boot"
   provides large improvements in bootup times on large machines which
   are configured to use large numbers of hugetlb pages.

 - Matthew Wilcox's series "PageFlags cleanups" does that.

 - Qi Zheng's series "minor fixes and supplement for ptdesc" does that
   also. S390 is affected.

 - Cleanups to our pagemap utility functions from Peter Xu in his series
   "mm/treewide: Replace pXd_large() with pXd_leaf()".

 - Nico Pache has fixed a few things with our hugepage selftests in his
   series "selftests/mm: Improve Hugepage Test Handling in MM
   Selftests".

 - Also, of course, many singleton patches to many things. Please see
   the individual changelogs for details.

* tag 'mm-stable-2024-03-13-20-04' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (435 commits)
  mm/zswap: remove the memcpy if acomp is not sleepable
  crypto: introduce: acomp_is_async to expose if comp drivers might sleep
  memtest: use {READ,WRITE}_ONCE in memory scanning
  mm: prohibit the last subpage from reusing the entire large folio
  mm: recover pud_leaf() definitions in nopmd case
  selftests/mm: skip the hugetlb-madvise tests on unmet hugepage requirements
  selftests/mm: skip uffd hugetlb tests with insufficient hugepages
  selftests/mm: dont fail testsuite due to a lack of hugepages
  mm/huge_memory: skip invalid debugfs new_order input for folio split
  mm/huge_memory: check new folio order when split a folio
  mm, vmscan: retry kswapd's priority loop with cache_trim_mode off on failure
  mm: add an explicit smp_wmb() to UFFDIO_CONTINUE
  mm: fix list corruption in put_pages_list
  mm: remove folio from deferred split list before uncharging it
  filemap: avoid unnecessary major faults in filemap_fault()
  mm,page_owner: drop unnecessary check
  mm,page_owner: check for null stack_record before bumping its refcount
  mm: swap: fix race between free_swap_and_cache() and swapoff()
  mm/treewide: align up pXd_leaf() retval across archs
  mm/treewide: drop pXd_large()
  ...
2024-03-14 17:43:30 -07:00
Linus Torvalds
1ddeeb2a05 for-6.9/block-20240310
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmXuFO4QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpq33D/9hyNyBce2A9iyo026eK8EqLDoed6BPzuvB
 kLKj5tsGvX4YlfuswvP86M5dgibTASXclnfUK394TijW/JPOfJ3mNhi9gMnHzRoK
 ZaR1di0Lum56dY1FkpMmWiGmE4fB79PAtXYKtajOkuoIcNzylncEAAACUY4/Ouhg
 Cm+LMg2prcc+m9g8rKDNQ51pUFg4U21KAUTl35XLMUAaQk1ahW3EDEVYhweC/zwE
 V/5hJsv8UY72+oQGY2Dc/YgQk/Zj4ZDh7C+oHR9XeB/ro99kr3/Vopagu0gBMLZi
 Rq6qqz6PVMhVcuz8uN2rsTQKXmXhsBn9/adsl4AKtdxcW5D5moWb5BLq1P0WQylc
 nzMxa1d6cVcTKZpaUQQv3Rj6ZMrLuDwP277UYHfn5x1oPWYRZCG7FtHuOo1gNcpG
 DrSNwVG6BSDcbABqI+MIS2oD1JoUMyevjwT7e2hOXukZhc6GLO5F3ODWE5j3KnCR
 S/aGSAmcdR4fTcgavULqWdQVt7SYl4f1IxT8KrUirJGVhc2LgahaWj69ooklVHoU
 fPDFRiruwJ5YkH4RWCSDm9mi4kAz6eUf+f4yE06wZOFOb2fT8/1ZK2Snpz2KeXuZ
 INO0RejtFzT8L0OUlu7dBmF20y6rgAYt87lR8mIt71yuuATIrVhzlX1VdsvhdrAo
 VLHGV1Ncgw==
 =WlVL
 -----END PGP SIGNATURE-----

Merge tag 'for-6.9/block-20240310' of git://git.kernel.dk/linux

Pull block updates from Jens Axboe:

 - MD pull requests via Song:
      - Cleanup redundant checks (Yu Kuai)
      - Remove deprecated headers (Marc Zyngier, Song Liu)
      - Concurrency fixes (Li Lingfeng)
      - Memory leak fix (Li Nan)
      - Refactor raid1 read_balance (Yu Kuai, Paul Luse)
      - Clean up and fix for md_ioctl (Li Nan)
      - Other small fixes (Gui-Dong Han, Heming Zhao)
      - MD atomic limits (Christoph)

 - NVMe pull request via Keith:
      - RDMA target enhancements (Max)
      - Fabrics fixes (Max, Guixin, Hannes)
      - Atomic queue_limits usage (Christoph)
      - Const use for class_register (Ricardo)
      - Identification error handling fixes (Shin'ichiro, Keith)

 - Improvement and cleanup for cached request handling (Christoph)

 - Moving towards atomic queue limits. Core changes and driver bits so
   far (Christoph)

 - Fix UAF issues in aoeblk (Chun-Yi)

 - Zoned fix and cleanups (Damien)

 - s390 dasd cleanups and fixes (Jan, Miroslav)

 - Block issue timestamp caching (me)

 - noio scope guarding for zoned IO (Johannes)

 - block/nvme PI improvements (Kanchan)

 - Ability to terminate long running discard loop (Keith)

 - bdev revalidation fix (Li)

 - Get rid of old nr_queues hack for kdump kernels (Ming)

 - Support for async deletion of ublk (Ming)

 - Improve IRQ bio recycling (Pavel)

 - Factor in CPU capacity for remote vs local completion (Qais)

 - Add shared_tags configfs entry for null_blk (Shin'ichiro

 - Fix for a regression in page refcounts introduced by the folio
   unification (Tony)

 - Misc fixes and cleanups (Arnd, Colin, John, Kunwu, Li, Navid,
   Ricardo, Roman, Tang, Uwe)

* tag 'for-6.9/block-20240310' of git://git.kernel.dk/linux: (221 commits)
  block: partitions: only define function mac_fix_string for CONFIG_PPC_PMAC
  block/swim: Convert to platform remove callback returning void
  cdrom: gdrom: Convert to platform remove callback returning void
  block: remove disk_stack_limits
  md: remove mddev->queue
  md: don't initialize queue limits
  md/raid10: use the atomic queue limit update APIs
  md/raid5: use the atomic queue limit update APIs
  md/raid1: use the atomic queue limit update APIs
  md/raid0: use the atomic queue limit update APIs
  md: add queue limit helpers
  md: add a mddev_is_dm helper
  md: add a mddev_add_trace_msg helper
  md: add a mddev_trace_remap helper
  bcache: move calculation of stripe_size and io_opt into bcache_device_init
  virtio_blk: Do not use disk_set_max_open/active_zones()
  aoe: fix the potential use-after-free problem in aoecmd_cfg_pkts
  block: move capacity validation to blkpg_do_ioctl()
  block: prevent division by zero in blk_rq_stat_sum()
  drbd: atomically update queue limits in drbd_reconsider_queue_parameters
  ...
2024-03-11 11:43:44 -07:00
Linus Torvalds
910202f00a vfs-6.9.super
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZem4DwAKCRCRxhvAZXjc
 ooTRAQDRI6Qz6wJym5Yblta8BScMGbt/SgrdgkoCvT6y83MtqwD+Nv/AZQzi3A3l
 9NdULtniW1reuCYkc8R7dYM8S+yAwAc=
 =Y1qX
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.9.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull block handle updates from Christian Brauner:
 "Last cycle we changed opening of block devices, and opening a block
  device would return a bdev_handle. This allowed us to implement
  support for restricting and forbidding writes to mounted block
  devices. It was accompanied by converting and adding helpers to
  operate on bdev_handles instead of plain block devices.

  That was already a good step forward but ultimately it isn't necessary
  to have special purpose helpers for opening block devices internally
  that return a bdev_handle.

  Fundamentally, opening a block device internally should just be
  equivalent to opening files. So now all internal opens of block
  devices return files just as a userspace open would. Instead of
  introducing a separate indirection into bdev_open_by_*() via struct
  bdev_handle bdev_file_open_by_*() is made to just return a struct
  file. Opening and closing a block device just becomes equivalent to
  opening and closing a file.

  This all works well because internally we already have a pseudo fs for
  block devices and so opening block devices is simple. There's a few
  places where we needed to be careful such as during boot when the
  kernel is supposed to mount the rootfs directly without init doing it.
  Here we need to take care to ensure that we flush out any asynchronous
  file close. That's what we already do for opening, unpacking, and
  closing the initramfs. So nothing new here.

  The equivalence of opening and closing block devices to regular files
  is a win in and of itself. But it also has various other advantages.
  We can remove struct bdev_handle completely. Various low-level helpers
  are now private to the block layer. Other helpers were simply
  removable completely.

  A follow-up series that is already reviewed build on this and makes it
  possible to remove bdev->bd_inode and allows various clean ups of the
  buffer head code as well. All places where we stashed a bdev_handle
  now just stash a file and use simple accessors to get to the actual
  block device which was already the case for bdev_handle"

* tag 'vfs-6.9.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (35 commits)
  block: remove bdev_handle completely
  block: don't rely on BLK_OPEN_RESTRICT_WRITES when yielding write access
  bdev: remove bdev pointer from struct bdev_handle
  bdev: make struct bdev_handle private to the block layer
  bdev: make bdev_{release, open_by_dev}() private to block layer
  bdev: remove bdev_open_by_path()
  reiserfs: port block device access to file
  ocfs2: port block device access to file
  nfs: port block device access to files
  jfs: port block device access to file
  f2fs: port block device access to files
  ext4: port block device access to file
  erofs: port device access to file
  btrfs: port device access to file
  bcachefs: port block device access to file
  target: port block device access to file
  s390: port block device access to file
  nvme: port block device access to file
  block2mtd: port device access to files
  bcache: port block device access to files
  ...
2024-03-11 10:52:34 -07:00
Uwe Kleine-König
d8d6608b76 block/swim: Convert to platform remove callback returning void
The .remove() callback for a platform driver returns an int which makes
many driver authors wrongly assume it's possible to do error handling by
returning an error code. However the value returned is ignored (apart
from emitting a warning) and this typically results in resource leaks.

To improve here there is a quest to make the remove callback return
void. In the first step of this quest all drivers are converted to
.remove_new(), which already returns void. Eventually after all drivers
are converted, .remove_new() will be renamed to .remove().

Trivially convert this driver from always returning zero in the remove
callback to the void returning variant.

Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Link: https://lore.kernel.org/r/a00aea8201ea85ae726411bb0fb015ea026ff40a.1709886922.git.u.kleine-koenig@pengutronix.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-08 09:28:04 -07:00
Ahelenia Ziemiańska
6a57a21943 Normalise "name (ad@dr)" MODULE_AUTHORs to "name <ad@dr>"
Found with git grep 'MODULE_AUTHOR(".*([^)]*@'
Fixed with
  sed -i '/MODULE_AUTHOR(".*([^)]*@/{s/ (/ </g;s/)"/>"/;s/)and/> and/}' \
    $(git grep -l 'MODULE_AUTHOR(".*([^)]*@')

Also:
  in drivers/media/usb/siano/smsusb.c normalise ", INC" to ", Inc";
     this is what every other MODULE_AUTHOR for this company says,
     and it's what the header says
  in drivers/sbus/char/openprom.c normalise a double-spaced separator;
     this is clearly copied from the copyright header,
     where the names are aligned on consecutive lines thusly:
      * Linux/SPARC PROM Configuration Driver
      * Copyright (C) 1996 Thomas K. Dyas (tdyas@noc.rutgers.edu)
      * Copyright (C) 1996 Eddie C. Dost  (ecd@skynet.be)
     but the authorship branding is single-line

Link: https://lkml.kernel.org/r/mk3geln4azm5binjjlfsgjepow4o73domjv6ajybws3tz22vb3@tarta.nabijaczleweli.xyz
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-06 13:07:39 -08:00
Damien Le Moal
0e46064ebe virtio_blk: Do not use disk_set_max_open/active_zones()
In virtblk_read_zoned_limits(), setting a zoned block device maximum
number of open and active zones using the functions
disk_set_max_open_zones() and disk_set_max_active_zones() is incorrect
as setting the limits for the request queue is now done atomically when
the gendisk is created (with blk_mq_alloc_disk()). The value set by the
disk_set_max_open/active_zones() functions will be overwritten.
Fix this by setting the maximum number of open and active zones directly
in the queue_limits structure passed to virtblk_read_zoned_limits().

Fixes: 8b83725656 ("virtio_blk: pass queue_limits to blk_mq_alloc_disk")
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20240301192639.410183-2-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-06 08:33:58 -07:00
Chun-Yi Lee
f98364e926 aoe: fix the potential use-after-free problem in aoecmd_cfg_pkts
This patch is against CVE-2023-6270. The description of cve is:

  A flaw was found in the ATA over Ethernet (AoE) driver in the Linux
  kernel. The aoecmd_cfg_pkts() function improperly updates the refcnt on
  `struct net_device`, and a use-after-free can be triggered by racing
  between the free on the struct and the access through the `skbtxq`
  global queue. This could lead to a denial of service condition or
  potential code execution.

In aoecmd_cfg_pkts(), it always calls dev_put(ifp) when skb initial
code is finished. But the net_device ifp will still be used in
later tx()->dev_queue_xmit() in kthread. Which means that the
dev_put(ifp) should NOT be called in the success path of skb
initial code in aoecmd_cfg_pkts(). Otherwise tx() may run into
use-after-free because the net_device is freed.

This patch removed the dev_put(ifp) in the success path in
aoecmd_cfg_pkts(), and added dev_put() after skb xmit in tx().

Link: https://nvd.nist.gov/vuln/detail/CVE-2023-6270
Fixes: 7562f876cd ("[NET]: Rework dev_base via list_head (v3)")
Signed-off-by: Chun-Yi Lee <jlee@suse.com>
Link: https://lore.kernel.org/r/20240305082048.25526-1-jlee@suse.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-06 08:32:46 -07:00
Christoph Hellwig
e6dfe748f0 drbd: atomically update queue limits in drbd_reconsider_queue_parameters
Switch drbd_reconsider_queue_parameters to set up the queue parameters
in an on-stack queue_limits structure and apply the atomically.  Remove
various helpers that have become so trivial that they can be folded into
drbd_reconsider_queue_parameters.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240305134041.137006-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-06 08:30:34 -07:00
Christoph Hellwig
5eaee6e9c8 drbd: split out a drbd_discard_supported helper
Add a helper to check if discard is supported for a given connection /
backing device combination.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Philipp Reisner <philipp.reisner@linbit.com>
Reviewed-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Tested-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Link: https://lore.kernel.org/r/20240306140332.623759-7-philipp.reisner@linbit.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-06 08:30:34 -07:00
Christoph Hellwig
e3992e02c9 drbd: don't set max_write_zeroes_sectors in decide_on_discard_support
fixup_write_zeroes always overrides the max_write_zeroes_sectors value
a little further down the callchain, so don't bother to setup a limit
in decide_on_discard_support.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Philipp Reisner <philipp.reisner@linbit.com>
Reviewed-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Tested-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Link: https://lore.kernel.org/r/20240306140332.623759-6-philipp.reisner@linbit.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-06 08:30:34 -07:00
Christoph Hellwig
e16344e506 drbd: merge drbd_setup_queue_param into drbd_reconsider_queue_parameters
drbd_setup_queue_param is only called by drbd_reconsider_queue_parameters
and there is no really clear boundary of responsibilities between the
two.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Philipp Reisner <philipp.reisner@linbit.com>
Reviewed-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Tested-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Link: https://lore.kernel.org/r/20240306140332.623759-5-philipp.reisner@linbit.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-06 08:30:34 -07:00
Christoph Hellwig
2828908d5c drbd: refactor the backing dev max_segments calculation
Factor out a drbd_backing_dev_max_segments helper that checks the
backing device limitation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Philipp Reisner <philipp.reisner@linbit.com>
Reviewed-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Tested-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Link: https://lore.kernel.org/r/20240306140332.623759-4-philipp.reisner@linbit.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-06 08:30:34 -07:00
Christoph Hellwig
342d81fde2 drbd: refactor drbd_reconsider_queue_parameters
Split out a drbd_max_peer_bio_size helper for the peer I/O size,
and condense the various checks to a nested min3(..., max())) instead
of using a lot of local variables.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240305134041.137006-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-06 08:30:34 -07:00
Christoph Hellwig
aa067325c0 drbd: pass the max_hw_sectors limit to blk_alloc_disk
Pass a queue_limits structure with the max_hw_sectors limit to
blk_alloc_disk instead of updating the limit on the allocated gendisk.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240305134041.137006-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-06 08:30:34 -07:00
Kefeng Wang
9602e0ce98 zram: zcomp: remove zcomp_set_max_streams() declaration
The zcomp_set_max_streams() is removed from commit 43209ea2d1
("zram: remove max_comp_streams internals"), remove the declaration.

Link: https://lkml.kernel.org/r/20240223035548.2591882-2-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-04 17:01:17 -08:00
Christoph Hellwig
268283244c nbd: use the atomic queue limits API in nbd_set_size
Use queue_limits_start_update / queue_limits_commit_update to update
all the limits in one go and with proper sanity checking.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240229143846.1047223-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-01 09:08:22 -07:00
Christoph Hellwig
242a49e5c8 nbd: freeze the queue for queue limits updates
nbd currently updates the logical and physical block sizes as well
as the discard_sectors on a live queue.  Freeze the queue first to
make sure there are not commands in flight that can see torn or
inconsistent limits.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240229143846.1047223-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-01 09:08:22 -07:00
Christoph Hellwig
7ea201f2cc nbd: don't clear discard_sectors in nbd_config_put
nbd_config_put currently clears discard_sectors when unusing a device.
This is pretty odd behavior and different from the sector size
configuration which is simply left in places and then reconfigured when
nbd_set_size is as part of configuring the device.  Change nbd_set_size
to clear discard_sectors if discard is not supported so that all the
queue limits changes are handled in one place.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240229143846.1047223-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-01 09:08:22 -07:00
Christoph Hellwig
eabf5dfc2d pktcdvd: don't set max_hw_sectors on the underlying device
pktcdvd sets max_hw_sectors on the queue of the underlying device that
it doesn't own (and doesn't reset it ever) since the driver was merged.
This can create all kinds of problems as the underlying driver doesn't
even know about it changing the limit.

As the state purpose is to not create I/Os larger than a single frame,
and pktcdvd never builds bios larger than that, just set REQ_NOMERGE
on the bios it submits so that largers I/Os never get built.

Note: I don't have packet writing hardware, so this is compile tested
only.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240229144408.1047967-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-01 09:08:00 -07:00
Ming Lei
13fe8e6825 ublk: add UBLK_CMD_DEL_DEV_ASYNC
The current command UBLK_CMD_DEL_DEV won't return until the device is
released, this way looks more reliable, but makes userspace more
difficult to implement, especially about orders: unmap command
buffer(which holds one ublkc reference), ublkc close,
io_uring_file_unregister, ublkb close.

Add UBLK_CMD_DEL_DEV_ASYNC so that device deletion won't wait release,
then userspace needn't worry about the above order. Actually both loop
and nbd is deleted in this async way.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20240223075539.89945-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-28 18:47:08 -07:00
Ming Lei
1221b9e982 ublk: improve getting & putting ublk device
Firstly convert get_device() and put_device() into ublk_get_device()
and ublk_put_device().

Secondly annotate ublk_get_device() & ublk_put_device() as noinline
for trace, especially it is often to trigger device deletion hang
when incorrect order is used on ublkc mmap, ublkc close,
io_uring_sqe_unregister_file, ublkb close.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20240223075539.89945-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-28 18:47:08 -07:00
Christoph Hellwig
ba3f67c116 xen-blkfront: atomically update queue limits
Pass the initial queue limits to blk_mq_alloc_disk and use the
blkif_set_queue_limits API to update the limits on reconnect.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Link: https://lore.kernel.org/r/20240221125845.3610668-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-27 09:33:08 -07:00
Christoph Hellwig
4f81b87d91 xen-blkfront: don't redundantly set max_sements in blkif_recover
blkif_set_queue_limits already sets the max_sements limits, so don't do
it a second time.  Also remove a comment about a long fixe bug in
blk_mq_update_nr_hw_queues.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Link: https://lore.kernel.org/r/20240221125845.3610668-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-27 09:33:08 -07:00
Christoph Hellwig
738be13632 xen-blkfront: rely on the default discard granularity
The block layer now sets the discard granularity to the physical
block size default.  Take advantage of that in xen-blkfront and only
set the discard granularity if explicitly specified.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Link: https://lore.kernel.org/r/20240221125845.3610668-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-27 09:33:08 -07:00
Christoph Hellwig
4a718d7dba xen-blkfront: set max_discard/secure erase limits to UINT_MAX
Currently xen-blkfront set the max discard limit to the capacity of
the device, which is suboptimal when the capacity changes.  Just set
it to UINT_MAX, which has the same effect and is simpler.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Link: https://lore.kernel.org/r/20240221125845.3610668-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-27 09:33:08 -07:00
Christian Brauner
be914f8fd2
zram: port block device access to file
Link: https://lore.kernel.org/r/20240123-vfs-bdev-file-v2-12-adbd023e19cc@kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-25 12:05:24 +01:00
Christian Brauner
217759bbb9
xen: port block device access to file
Link: https://lore.kernel.org/r/20240123-vfs-bdev-file-v2-11-adbd023e19cc@kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-25 12:05:23 +01:00
Christian Brauner
a34606a9aa
rnbd: port block device access to file
Link: https://lore.kernel.org/r/20240123-vfs-bdev-file-v2-10-adbd023e19cc@kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-25 12:05:23 +01:00
Christian Brauner
05fb1dbc82
pktcdvd: port block device access to file
Link: https://lore.kernel.org/r/20240123-vfs-bdev-file-v2-9-adbd023e19cc@kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-25 12:05:23 +01:00
Christian Brauner
20e6a8d0dc
drbd: port block device access to file
Link: https://lore.kernel.org/r/20240123-vfs-bdev-file-v2-8-adbd023e19cc@kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-25 12:05:23 +01:00
Barry Song
45866e0e21 zram: do not allocate physically contiguous strm buffers
Currently zram allocates 2 physically contiguous pages per-CPU's
compression stream (we may have up to 4 streams per-CPU).  Since those
buffers are per-CPU we allocate them from CPU hotplug path, which may have
higher risks of failed allocations on devices with fragmented memory.

Switch to virtually contiguous allocations - crypto comp does not seem
impose requirements on compression working buffers to be physically
contiguous.

Link: https://lkml.kernel.org/r/20240213065400.6561-1-21cnbao@gmail.com
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22 10:24:59 -08:00
Mark-PK Tsai
80ba4caf8b zram: use copy_page for full page copy
Some architectures, such as arm, have implemented optimized copy_page for
full page copying.

Replace the full page memcpy with copy_page to take advantage of the
optimization.

Link: https://lkml.kernel.org/r/20231007070554.8657-1-mark-pk.tsai@mediatek.com
Signed-off-by: Mark-PK Tsai <mark-pk.tsai@mediatek.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Matthias Brugger <matthias.bgg@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: YJ Chiang <yj.chiang@mediatek.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22 10:24:55 -08:00
John Garry
0f225f8787 null_blk: Delete nullb.{queue_depth, nr_queues}
Since commit 8b631f9cf0 ("null_blk: remove the bio based I/O path"),
struct nullb members queue_depth and nr_queues are only ever written, so
delete them.

With that, null_exit_hctx() can also be deleted.

Signed-off-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20240222083420.6026-1-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-22 10:08:47 -07:00
Christoph Hellwig
4068550870 pktcdvd: set queue limits at disk allocation time
Remove pkt_init_queue and just pass the two parameters directly to
blk_alloc_disk.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240222073647.3776769-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-22 10:08:20 -07:00
Christoph Hellwig
6f420d6a2d pktcdvd: stop setting q->queuedata
The two users can get the private data from the gendisk with one less
pointer dereference, and we can drop the useless q parameter from
pkt_make_request_write.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240222073647.3776769-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-22 10:08:20 -07:00
Christoph Hellwig
e440626b1c null_blk: pass queue_limits to blk_mq_alloc_disk
Pass the queue limits directly to blk_mq_alloc_disk instead of
setting them one at a time.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Tested-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20240220093248.3290292-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-20 06:21:27 -07:00
Christoph Hellwig
0a39e550c1 null_blk: remove null_gendisk_register
null_gendisk_register isn't a very useful abstraction given that it
doesn't even allocate the gendisk.  Merge it into the only caller
instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Tested-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20240220093248.3290292-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-20 06:21:27 -07:00
Christoph Hellwig
72ca28765f null_blk: refactor tag_set setup
Move the tagset initialization out of null_add_dev into a new
null_setup_tagset helper, and move the shared vs local differences
out of null_init_tag_set into the callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Tested-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20240220093248.3290292-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-20 06:21:27 -07:00
Christoph Hellwig
e32b085536 null_blk: initialize the tag_set timeout in null_init_tag_set
Otherwise it will be reset to the always same value when initializing a
device using the shared tag_set.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Tested-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20240220093248.3290292-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-20 06:21:27 -07:00
Christoph Hellwig
8b631f9cf0 null_blk: remove the bio based I/O path
The bio based I/O path complicates null_blk and also make various
data structures, including the per-command one way bigger than
required for the main request based interface.   As the bio-based
path is mostly used by stacking drivers and simple memory based
drivers, and brd is a good example driver for the latter there is
no need to have a bio based path in null_blk.  Remove the path
to simplify the driver and make future block layer API changes
simpler by not having to deal with the complex two API setup in
null_blk.

Note that the queue_mode field in struct nullb_device is kept as
that is simpler than having two different places to check the
value and fully open coding the debugfs helpers as the existing
ones won't work without a named struct member.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Tested-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20240220093248.3290292-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-20 06:21:27 -07:00
Christoph Hellwig
494ea040bc ublk: pass queue_limits to blk_mq_alloc_disk
Pass the limits ublk imposes directly to blk_mq_alloc_disk instead of
setting them one at a time.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240215070300.2200308-17-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-19 16:59:32 -07:00
Christoph Hellwig
d0fa9a8b0a sunvdc: pass queue_limits to blk_mq_alloc_disk
Pass the few limits sunvdc imposes directly to blk_mq_alloc_disk instead
of setting them one at a time.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240215070300.2200308-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-19 16:59:31 -07:00
Christoph Hellwig
e6ed9892f1 rnbd-clt: pass queue_limits to blk_mq_alloc_disk
Pass the limits rnbd-clt imposes directly to blk_mq_alloc_disk instead
of setting them one at a time.

While at it don't set an explicit number of discard segments, as 1 is
the default (which most drivers rely on).

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jack Wang <jinpu.wang@ionos.com>
Link: https://lore.kernel.org/r/20240215070300.2200308-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-19 16:59:31 -07:00
Christoph Hellwig
24f30b770c rbd: pass queue_limits to blk_mq_alloc_disk
Pass the limits rbd imposes directly to blk_mq_alloc_disk instead
of setting them one at a time.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240215070300.2200308-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-19 16:59:31 -07:00
Christoph Hellwig
a7f18b74db ps3disk: pass queue_limits to blk_mq_alloc_disk
Pass the few limits ps3disk imposes directly to blk_mq_alloc_disk instead
of setting them one at a time.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240215070300.2200308-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-19 16:59:31 -07:00
Christoph Hellwig
9a0d497028 nbd: pass queue_limits to blk_mq_alloc_disk
Pass the few limits nbd imposes directly to blk_mq_alloc_disk instead
of setting them one at a time.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240215070300.2200308-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-19 16:59:31 -07:00
Christoph Hellwig
68c3135fb5 mtip: pass queue_limits to blk_mq_alloc_disk
Pass the few limits mtip imposes directly to blk_mq_alloc_disk instead
of setting them one at a time and drop the pointless setting of a io_min
that is equal to the physical block size.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240215070300.2200308-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-19 16:59:31 -07:00
Christoph Hellwig
48bc8c7ba6 floppy: pass queue_limits to blk_mq_alloc_disk
Pass the few limits floppy imposes directly to blk_mq_alloc_disk instead
of setting them one at a time.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Denis Efremov <efremov@linux.com>
Link: https://lore.kernel.org/r/20240215070300.2200308-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-19 16:59:31 -07:00
Christoph Hellwig
9999200f58 aoe: pass queue_limits to blk_mq_alloc_disk
Pass the few limits aoe imposes directly to blk_mq_alloc_disk instead
of setting them one at a time and improve the way the default
max_hw_sectors is initialized while we're at it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240215070300.2200308-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-19 16:59:31 -07:00
Christoph Hellwig
4190b3f291 zram: pass queue_limits to blk_mq_alloc_disk
Pass the queue limits directly to blk_alloc_disk instead of setting them
one at a time.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Link: https://lore.kernel.org/r/20240215071055.2201424-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-19 16:58:24 -07:00
Christoph Hellwig
cc7f05c7ec n64cart: pass queue_limits to blk_mq_alloc_disk
Pass the queue limits directly to blk_alloc_disk instead of setting them
one at a time.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Link: https://lore.kernel.org/r/20240215071055.2201424-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-19 16:58:24 -07:00
Christoph Hellwig
b5baaba4ce brd: pass queue_limits to blk_mq_alloc_disk
Pass the queue limits directly to blk_alloc_disk instead of setting them
one at a time.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Link: https://lore.kernel.org/r/20240215071055.2201424-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-19 16:58:23 -07:00
Christoph Hellwig
74fa8f9c55 block: pass a queue_limits argument to blk_alloc_disk
Pass a queue_limits to blk_alloc_disk and apply it if non-NULL.  This
will allow allocating queues with valid queue limits instead of setting
the values one at a time later.

Also change blk_alloc_disk to return an ERR_PTR instead of just NULL
which can't distinguish errors.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Link: https://lore.kernel.org/r/20240215071055.2201424-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-19 16:58:23 -07:00
Navid Emamdoost
31edf4bbe0 nbd: null check for nla_nest_start
nla_nest_start() may fail and return NULL. Insert a check and set errno
based on other call sites within the same source code.

Signed-off-by: Navid Emamdoost <navid.emamdoost@gmail.com>
Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
Fixes: 47d902b90a ("nbd: add a status netlink command")
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20240218042534.it.206-kees@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-18 06:01:18 -07:00
Christoph Hellwig
473516b361 loop: use the atomic queue limits update API
Pass the default limits to blk_mq_alloc_disk and then use the
queue_limits_{start,commit}_update API to change the limits in an
atomic way on existing loop gendisks.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20240213073425.1621680-16-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-13 08:56:59 -07:00
Christoph Hellwig
02aed4a1f2 loop: pass queue_limits to blk_mq_alloc_disk
Pass the max_hw_sector limit loop sets at initialization time directly to
blk_mq_alloc_disk instead of updating it right after the allocation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20240213073425.1621680-15-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-13 08:56:59 -07:00
Christoph Hellwig
65bdd16f8c loop: cleanup loop_config_discard
Initialize the local variables for the discard max sectors and
granularity to zero as a sensible default, and then merge the
calls assigning them to the queue limits.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20240213073425.1621680-14-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-13 08:56:59 -07:00
Christoph Hellwig
8b83725656 virtio_blk: pass queue_limits to blk_mq_alloc_disk
Call virtblk_read_limits and most of virtblk_probe_zoned_device before
allocating the gendisk and thus request_queue and make them read into
a queue_limits structure instead.  Pass this initialized queue_limits
to blk_mq_alloc_disk to set the queue up with the right parameters
from the start and only leave a few final touches for zoned devices
to be done just before adding the disk.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20240213073425.1621680-13-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-13 08:56:59 -07:00
Christoph Hellwig
718628adfc virtio_blk: split virtblk_probe
Split out a virtblk_read_limits helper that just reads the various
queue limits to separate it from the higher level probing logic.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20240213073425.1621680-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-13 08:56:59 -07:00
Christoph Hellwig
27e32cd23f block: pass a queue_limits argument to blk_mq_alloc_disk
Pass a queue_limits to blk_mq_alloc_disk and apply it if non-NULL.  This
will allow allocating queues with valid queue limits instead of setting
the values one at a time later.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20240213073425.1621680-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-13 08:56:59 -07:00
Arnd Bergmann
fe0b1e9a73 drbd: fix function cast warnings in state machine
There are four state machines in drbd that use a common infrastructure, with
a cast to an incompatible function type in REMEMBER_STATE_CHANGE that clang-16
now warns about:

drivers/block/drbd/drbd_state.c:1632:3: error: cast from 'int (*)(struct sk_buff *, unsigned int, struct drbd_resource_state_change *, enum drbd_notification_type)' to 'typeof (last_func)' (aka 'int (*)(struct sk_buff *, unsigned int, void *, enum drbd_notification_type)') converts to incompatible function type [-Werror,-Wcast-function-type-strict]
 1632 |                 REMEMBER_STATE_CHANGE(notify_resource_state_change,
      |                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 1633 |                                       resource_state_change, NOTIFY_CHANGE);
      |                                       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/block/drbd/drbd_state.c:1619:17: note: expanded from macro 'REMEMBER_STATE_CHANGE'
 1619 |            last_func = (typeof(last_func))func; \
      |                        ^~~~~~~~~~~~~~~~~~~~~~~
drivers/block/drbd/drbd_state.c:1641:4: error: cast from 'int (*)(struct sk_buff *, unsigned int, struct drbd_connection_state_change *, enum drbd_notification_type)' to 'typeof (last_func)' (aka 'int (*)(struct sk_buff *, unsigned int, void *, enum drbd_notification_type)') converts to incompatible function type [-Werror,-Wcast-function-type-strict]
 1641 |                         REMEMBER_STATE_CHANGE(notify_connection_state_change,
      |                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 1642 |                                               connection_state_change, NOTIFY_CHANGE);
      |                                               ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Change these all to actually expect a void pointer to be passed, which
matches the caller.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20240213100354.457128-1-arnd@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-13 08:55:40 -07:00
Arnd Bergmann
7789bf0552 floppy: fix function pointer cast warnings
clang-16 complains about a control flow integrity (kcfi) violation
casting between incompatible pointers:

drivers/block/floppy.c:2001:11: error: cast from 'void (*)(void)' to 'done_f' (aka 'void (*)(int)') converts to incompatible function type [-Werror,-Wcast-function-type-strict]
 2001 |         .done           = (done_f)empty
      |                           ^~~~~~~~~~~~~

Just add another empty function with the correct prototype as a
workaround.

The warning is for code that was added before the start of the normal
git history, but I tracked it done to an early change in the reconstructed
linux-history.git.

Fixes: 598a477afe06 ("Import 1.1.41")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20240213095918.455478-1-arnd@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-13 08:55:33 -07:00
Linus Torvalds
a5b6244cf8 block-6.8-2024-02-10
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmXHhIoQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpiKfD/4vi+EUfKGmhmXx0Sh8iDCt3g4gUH+FqtpV
 gqC4FG8gWraNi1WY98ZVtXhlaQ0ALKuPbe7pNfYpxTn8zRdtixrOBxYDvNJLAJLH
 Q2LnviBZBTrRar4d51O+d9eGC2FwbihwEW4asFn6kIlKjmJ9DT8qUYBhwLh5AcBd
 rQ2yhk9KOlQIIN/Z0gjiexXM2WFxWbeHBWKWzTgxJL7q81wAfdwMP0IkNLcASaWU
 P48YHkIhwqFfSk6QqPrjmLr5P08jd2xEbr3DA/4unuto9iQZPoS0h+k1kauA048w
 ZassSiBfIaOGuv0xQg2bYHwdoGazW0fNeyWNQjE6qDaC7ECE8oBKE0fMyhTZ5Xgo
 0d86bqlL5913PDzjp5DXeGvpSZ7hLV393TyV3yAspHosAA5cHBqNQMiJD1okM99N
 wKfmXnH2CXE29ckfRyaN+M4Ywg+gpbyMNIfQM7N9If4GuQcyxC3M67RkT2GrK2MB
 ZJ/af6pnh/d68mqteJB5gV5r8t163uoVFcbSjJhwlVVgp1Sp3+h04wsB2sDyp62h
 Guu9043fuT6zzT3EhySjPvmkKoNu6cjNofuTwNoaVVRcloRhTafn7V3sTlfjrpOP
 woWnGv5VOSnuunOGQ/bxLXvXbrnEQ5ziW1S4qwr+oi+FjP5Ae9eOsjzld4Vl7cZH
 ABqOf0UEsg==
 =0mLS
 -----END PGP SIGNATURE-----

Merge tag 'block-6.8-2024-02-10' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:

 - NVMe pull request via Keith:
     - Update a potentially stale firmware attribute (Maurizio)
     - Fixes for the recent verbose error logging (Keith, Chaitanya)
     - Protection information payload size fix for passthrough (Francis)

 - Fix for a queue freezing issue in virtblk (Yi)

 - blk-iocost underflow fix (Tejun)

 - blk-wbt task detection fix (Jan)

* tag 'block-6.8-2024-02-10' of git://git.kernel.dk/linux:
  virtio-blk: Ensure no requests in virtqueues before deleting vqs.
  blk-iocost: Fix an UBSAN shift-out-of-bounds warning
  nvme: use ns->head->pi_size instead of t10_pi_tuple structure size
  nvme-core: fix comment to reflect right functions
  nvme: move passthrough logging attribute to head
  blk-wbt: Fix detection of dirty-throttled tasks
  nvme-host: fix the updating of the firmware version
2024-02-10 08:02:48 -08:00
Shin'ichiro Kawasaki
14509b748f null_blk: add configfs variable shared_tags
Allow setting shared_tags through configfs, which could only be set as a
module parameter. For that purpose, delay tag_set initialization from
null_init() to null_add_dev(). Refer tag_set.ops as the flag to check if
tag_set is initialized or not.

The following parameters can not be set through configfs yet:

    timeout
    requeue
    init_hctx

Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20240130042134.2463659-1-shinichiro.kawasaki@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-08 11:31:23 -07:00
Yi Sun
4ce6e2db00 virtio-blk: Ensure no requests in virtqueues before deleting vqs.
Ensure no remaining requests in virtqueues before resetting vdev and
deleting virtqueues. Otherwise these requests will never be completed.
It may cause the system to become unresponsive.

Function blk_mq_quiesce_queue() can ensure that requests have become
in_flight status, but it cannot guarantee that requests have been
processed by the device. Virtqueues should never be deleted before
all requests become complete status.

Function blk_mq_freeze_queue() ensure that all requests in virtqueues
become complete status. And no requests can enter in virtqueues.

Signed-off-by: Yi Sun <yi.sun@unisoc.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Link: https://lore.kernel.org/r/20240129085250.1550594-1-yi.sun@unisoc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-08 11:30:26 -07:00
Ricardo B. Marliere
052618c71c block: rbd: make rbd_bus_type const
Now that the driver core can properly handle constant struct bus_type,
move the rbd_bus_type variable to be a constant structure as well,
placing it into read-only memory which can not be modified at runtime.

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Alex Elder <elder@linaro.org>
Link: https://lore.kernel.org/r/20240204-bus_cleanup-block-v1-1-fc77afd8d7cc@marliere.net
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-06 10:37:45 -07:00
Linus Torvalds
914e17088e block-6.8-2024-01-26
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmWz+D8QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpjprEACQAJWX6h+bSmbxst3AkXZoNLCPq7EsripI
 igbUcdrkKmnUL+4/r1qM6sE7I7nIILBWzSsFgawC91tq2/9KWq7P7FeuEHdiRmsS
 RwXuTdJiQBvrDy2k59I0vtiE+zrbrhmLGcdIg4xbKl8bg17nA6KaICf4oxCqkPZp
 R+WvcZJvfKpN8bnd3bjJOK2vwYL26HIPhUU0KyWPAc4Nx7K5pe9RT+4kWY4vmklY
 XqOxfnoPiKBKlXApaSDRTFITPhg/usq6vfPgtADvdGt7ieMbN1YplLNO82kHjUA9
 sp/biSBtOT12JiH+biRFcSjNPyBOzu13Opshi0Ou4uNTtyH8FXqXMVLGfl/uDRgW
 khkcY8rfRXWpbPrMmqlNExeZ47ZztSNKoqVXCV1yNUOTlGQvFbhiRDOUtG3O+wPD
 df8MrPc5P5jVVUGhDSTHZtXlYbkjsmodIVSsI/YFwZ3XB0tz9NqTZtO7INa0BZku
 llIgDzBhk1NPqlZCpaLNvEFUWmjoHqktwyDqe3BuWpwjUOZnLvVKic2RkK7Y8ufF
 MG6qn9VZ4qrb2rxmsM4yd7V3nNRpnmVWlN6feIP3Rlb+M2+nGAyAw+pZxJ6iF+Hi
 bdvbP6UnaA5iZqSASifjDl8WCSllymPDwMPZGDsaEi/5CW8JMpdYwlCrBrY19hlm
 8Wq6YGCFXQ==
 =A7z9
 -----END PGP SIGNATURE-----

Merge tag 'block-6.8-2024-01-26' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:

 - RCU warning fix for md (Mikulas)

 - Fix for an aoe issue that lockdep rightfully complained about
   (Maksim)

 - Fix for an error code change in partitioning that caused a regression
   with some tools (Li)

 - Fix for a data direction warning with bi-direction commands
   (Christian)

* tag 'block-6.8-2024-01-26' of git://git.kernel.dk/linux:
  md: fix a suspicious RCU usage warning
  aoe: avoid potential deadlock at set_capacity
  block: Fix WARNING in _copy_from_iter
  block: Move checking GENHD_FL_NO_PART to bdev_add_partition()
2024-01-26 15:19:43 -08:00
Maksim Kiselev
e169bd4fb2 aoe: avoid potential deadlock at set_capacity
Move set_capacity() outside of the section procected by (&d->lock).
To avoid possible interrupt unsafe locking scenario:

        CPU0                    CPU1
        ----                    ----
[1] lock(&bdev->bd_size_lock);
                                local_irq_disable();
                            [2] lock(&d->lock);
                            [3] lock(&bdev->bd_size_lock);
   <Interrupt>
[4]  lock(&d->lock);

  *** DEADLOCK ***

Where [1](&bdev->bd_size_lock) hold by zram_add()->set_capacity().
[2]lock(&d->lock) hold by aoeblk_gdalloc(). And aoeblk_gdalloc()
is trying to acquire [3](&bdev->bd_size_lock) at set_capacity() call.
In this situation an attempt to acquire [4]lock(&d->lock) from
aoecmd_cfg_rsp() will lead to deadlock.

So the simplest solution is breaking lock dependency
[2](&d->lock) -> [3](&bdev->bd_size_lock) by moving set_capacity()
outside.

Signed-off-by: Maksim Kiselev <bigunclemax@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240124072436.3745720-2-bigunclemax@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-01-24 08:30:43 -07:00
Ilya Dryomov
ded080c86b rbd: don't move requests to the running list on errors
The running list is supposed to contain requests that are pinning the
exclusive lock, i.e. those that must be flushed before exclusive lock
is released.  When wake_lock_waiters() is called to handle an error,
requests on the acquiring list are failed with that error and no
flushing takes place.  Briefly moving them to the running list is not
only pointless but also harmful: if exclusive lock gets acquired
before all of their state machines are scheduled and go through
rbd_lock_del_request(), we trigger

    rbd_assert(list_empty(&rbd_dev->running_list));

in rbd_try_acquire_lock().

Cc: stable@vger.kernel.org
Fixes: 637cd06053 ("rbd: new exclusive lock wait/wake code")
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
2024-01-22 00:14:10 +01:00
Christophe JAILLET
cd30e8bde2 rbd: remove usage of the deprecated ida_simple_*() API
ida_alloc() and ida_free() should be preferred to the deprecated
ida_simple_get() and ida_simple_remove().

Note that the upper limit of ida_simple_get() is exclusive, while that
of ida_alloc_max() is inclusive, so 1 has been subtracted.

[ idryomov: tweak changelog ]

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-01-22 00:14:10 +01:00
Linus Torvalds
9d1694dc91 for-6.8/block-2024-01-18
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmWpoCgQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpqIUEADFvJdC2izkPzYzsOrMK5Rt1H7vaHGKhbA+
 zWCuQaa1xQd8bazq+NVnQpbzgclkE/WodTCNfNXcTTjzeQEmcZC888llP3Y9vwyP
 XfEKH7fSaeKvGigJLro1oPe3YV7/t89F5ol3BoZayfzJF8GEU9BXRWzgOkZzijnk
 xdm5wUyn/GknksMuQQraZ+U6bQRFLBOulzoaQeMD6Dosx+uRlM4WvAJawC+uOV6R
 qPT2BVSfYGzmgEKvoaphw0FMkUhFBMDHfXTpQBi5tIzTKOaof8tynYEGz0FHZWeh
 V0JEEp+3jLWFxFXeEcXgBVPJPE8J0DzGm9g17/uwC2Yhmlbw4FKZVRvGG+PpeUso
 D5aqhqm3w0x7HgZ7JKwy/aUctADYvjVcSVzPHTaFK0aCSYCIAXxqv4p7fOoxPqyx
 T32IUHTzGtkCdqzv/xFdtTYhTNM2vyzzbbWj5lXgCBqHsXOVbCh8UM2p+9ec2Umq
 Fo1XF9eoCDe6Sn4s15hJ5G4DEhKGOKkHluvRUdM+0selA5b0sNOeUqlAf2v+0ve3
 Pv3e3X4NPssNIEcsDHf5pc3zGC+LXRS0oFvfIvDESBjwXc3iHIMl+SkjyS57P4Fd
 RKrHEUUiACuCKO/IWqFYLiNBNHnP3RmV5gSxIZr9QJhFSwOzP+/+4++TCdF5vdAV
 amhv+0PdCw==
 =DLW9
 -----END PGP SIGNATURE-----

Merge tag 'for-6.8/block-2024-01-18' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:

 - NVMe pull request via Keith:
      - tcp, fc, and rdma target fixes (Maurizio, Daniel, Hannes,
        Christoph)
      - discard fixes and improvements (Christoph)
      - timeout debug improvements (Keith, Max)
      - various cleanups (Daniel, Max, Giuxen)
      - trace event string fixes (Arnd)
      - shadow doorbell setup on reset fix (William)
      - a write zeroes quirk for SK Hynix (Jim)

 - MD pull request via Song:
      - Sparse warning since v6.0 (Bart)
      - /proc/mdstat regression since v6.7 (Yu Kuai)

 - Use symbolic error value (Christian)

 - IO Priority documentation update (Christian)

 - Fix for accessing queue limits without having entered the queue
   (Christoph, me)

 - Fix for loop dio support (Christoph)

 - Move null_blk off deprecated ida interface (Christophe)

 - Ensure nbd initializes full msghdr (Eric)

 - Fix for a regression with the folio conversion, which is now easier
   to hit because of an unrelated change (Matthew)

 - Remove redundant check in virtio-blk (Li)

 - Fix for a potential hang in sbitmap (Ming)

 - Fix for partial zone appending (Damien)

 - Misc changes and fixes (Bart, me, Kemeng, Dmitry)

* tag 'for-6.8/block-2024-01-18' of git://git.kernel.dk/linux: (45 commits)
  Documentation: block: ioprio: Update schedulers
  loop: fix the the direct I/O support check when used on top of block devices
  blk-mq: Remove the hctx 'run' debugfs attribute
  nbd: always initialize struct msghdr completely
  block: Fix iterating over an empty bio with bio_for_each_folio_all
  block: bio-integrity: fix kcalloc() arguments order
  virtio_blk: remove duplicate check if queue is broken in virtblk_done
  sbitmap: remove stale comment in sbq_calc_wake_batch
  block: Correct a documentation comment in blk-cgroup.c
  null_blk: Remove usage of the deprecated ida_simple_xx() API
  block: ensure we hold a queue reference when using queue limits
  blk-mq: rename blk_mq_can_use_cached_rq
  block: print symbolic error name instead of error code
  blk-mq: fix IO hang from sbitmap wakeup race
  nvmet-rdma: avoid circular locking dependency on install_queue()
  nvmet-tcp: avoid circular locking dependency on install_queue()
  nvme-pci: set doorbell config before unquiescing
  block: fix partial zone append completion handling in req_bio_endio()
  block/iocost: silence warning on 'last_period' potentially being unused
  md/raid1: Use blk_opf_t for read and write operations
  ...
2024-01-18 18:22:40 -08:00
Christoph Hellwig
baa7d53607 loop: fix the the direct I/O support check when used on top of block devices
__loop_update_dio only checks the alignment requirement for block backed
file systems, but misses them for the case where the loop device is
created directly on top of another block device.  Due to this creating
a loop device with default option plus the direct I/O flag on a > 512 byte
sector size file system will lead to incorrect I/O being submitted to the
lower block device and a lot of error from the lock layer.  This can
be seen with xfstests generic/563.

Fix the code in __loop_update_dio by factoring the alignment check into
a helper, and calling that also for the struct block_device of a block
device inode.

Also remove the TODO comment talking about dynamically switching between
buffered and direct I/O, which is a would be a recipe for horrible
performance and occasional data loss.

Fixes: 2e5ab5f379 ("block: loop: prepare for supporing direct IO")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20240117175901.871796-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-01-18 08:20:52 -07:00
Eric Dumazet
78fbb92af2 nbd: always initialize struct msghdr completely
syzbot complains that msg->msg_get_inq value can be uninitialized [1]

struct msghdr got many new fields recently, we should always make
sure their values is zero by default.

[1]
 BUG: KMSAN: uninit-value in tcp_recvmsg+0x686/0xac0 net/ipv4/tcp.c:2571
  tcp_recvmsg+0x686/0xac0 net/ipv4/tcp.c:2571
  inet_recvmsg+0x131/0x580 net/ipv4/af_inet.c:879
  sock_recvmsg_nosec net/socket.c:1044 [inline]
  sock_recvmsg+0x12b/0x1e0 net/socket.c:1066
  __sock_xmit+0x236/0x5c0 drivers/block/nbd.c:538
  nbd_read_reply drivers/block/nbd.c:732 [inline]
  recv_work+0x262/0x3100 drivers/block/nbd.c:863
  process_one_work kernel/workqueue.c:2627 [inline]
  process_scheduled_works+0x104e/0x1e70 kernel/workqueue.c:2700
  worker_thread+0xf45/0x1490 kernel/workqueue.c:2781
  kthread+0x3ed/0x540 kernel/kthread.c:388
  ret_from_fork+0x66/0x80 arch/x86/kernel/process.c:147
  ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:242

Local variable msg created at:
  __sock_xmit+0x4c/0x5c0 drivers/block/nbd.c:513
  nbd_read_reply drivers/block/nbd.c:732 [inline]
  recv_work+0x262/0x3100 drivers/block/nbd.c:863

CPU: 1 PID: 7465 Comm: kworker/u5:1 Not tainted 6.7.0-rc7-syzkaller-00041-gf016f7547aee #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 11/17/2023
Workqueue: nbd5-recv recv_work

Fixes: f94fd25cb0 ("tcp: pass back data left in socket after receive")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: stable@vger.kernel.org
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Cc: nbd@other.debian.org
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240112132657.647112-1-edumazet@google.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-01-17 08:47:46 -07:00
Li RongQing
04036d49c4 virtio_blk: remove duplicate check if queue is broken in virtblk_done
virtqueue_enable_cb() will call virtqueue_poll() which will check if
queue is broken at beginning, so remove the virtqueue_is_broken() call

Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-01-15 09:33:41 -07:00
Christophe JAILLET
95931a245b null_blk: Remove usage of the deprecated ida_simple_xx() API
ida_alloc() and ida_free() should be preferred to the deprecated
ida_simple_get() and ida_simple_remove().

This is less verbose.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Link: https://lore.kernel.org/r/bf257b1078475a415cdc3344c6a750842946e367.1705222845.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-01-14 07:37:44 -07:00
Linus Torvalds
4c72e2b8c4 for-6.8/io_uring-2024-01-08
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmWcOk0QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpq2wEACgP1Gvfb2cR65B/6tkLJ6neKnYOPVSYx9F
 4Mv6KS306vmOsE67LqynhtAe/Hgfv0NKADN+mKLLV8IdwhneLAwFxRoHl8QPO2dW
 DIxohDMLIqzY/SOUBqBHMS8b5lEr79UVh8vXjAuuYCBTcHBndi4hj0S0zkwf5YiK
 Vma6ty+TnzmOv+c6+OCgZZlwFhkxD1pQBM5iOlFRAkdt3Er/2e8FqthK0/od7Fgy
 Ii9xdxYZU9KTsM4R+IWhqZ1rj1qUdtMPSGzFRbzFt3b6NdANRZT9MUlxvSkuHPIE
 P/9anpmgu41gd2JbHuyVnw+4jdZMhhZUPUYtphmQ5n35rcFZjv1gnJalH/Nw/DTE
 78bDxxP0+35bkuq5MwHfvA3imVsGnvnFx5MlZBoqvv+VK4S3Q6E2+ARQZdiG+2Is
 8Uyjzzt0nq2waUO3H1JLkgiM9J9LFqaYQLE68u569m4NCfRSpoZva9KVmNpre2K0
 9WdGy2gfzaYAQkGud2TULzeLiEbWfvZAL0o43jYXkVPRKmnffCEphf2X6C2IDV/3
 5ZR4bandRRC6DVnE+8n1MB06AYtpJPB/w3rplON+l/V5Gnb7wRNwiUrBr/F15OOy
 OPbAnP6k56wY/JRpgzdbNqwGi8EvGWX6t3Kqjp2mczhs2as8M193FKS2xppl1TYl
 BPyTsAfbdQ==
 =5v/U
 -----END PGP SIGNATURE-----

Merge tag 'for-6.8/io_uring-2024-01-08' of git://git.kernel.dk/linux

Pull io_uring updates from Jens Axboe:
 "Mostly just come fixes and cleanups, but one feature as well. In
  detail:

   - Harden the check for handling IOPOLL based on return (Pavel)

   - Various minor optimizations (Pavel)

   - Drop remnants of SCM_RIGHTS fd passing support, now that it's no
     longer supported since 6.7 (me)

   - Fix for a case where bytes_done wasn't initialized properly on a
     failure condition for read/write requests (me)

   - Move the register related code to a separate file (me)

   - Add support for returning the provided ring buffer head (me)

   - Add support for adding a direct descriptor to the normal file table
     (me, Christian Brauner)

   - Fix for ensuring pending task_work for a ring with DEFER_TASKRUN is
     run even if we timeout waiting (me)"

* tag 'for-6.8/io_uring-2024-01-08' of git://git.kernel.dk/linux:
  io_uring: ensure local task_work is run on wait timeout
  io_uring/kbuf: add method for returning provided buffer ring head
  io_uring/rw: ensure io->bytes_done is always initialized
  io_uring: drop any code related to SCM_RIGHTS
  io_uring/unix: drop usage of io_uring socket
  io_uring/register: move io_uring_register(2) related code to register.c
  io_uring/openclose: add support for IORING_OP_FIXED_FD_INSTALL
  io_uring/cmd: inline io_uring_cmd_get_task
  io_uring/cmd: inline io_uring_cmd_do_in_task_lazy
  io_uring: split out cmd api into a separate header
  io_uring: optimise ltimeout for inline execution
  io_uring: don't check iopoll if request completes
2024-01-11 14:19:23 -08:00
Linus Torvalds
01d550f0fc for-6.8/block-2024-01-08
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmWcIOIQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpn6hD/9oO7U75PuxUwYYHZ9Uzxpw6gQ0LEmeyJmE
 NQYCkfYHVq3IsgOdF7elI9v3qtr6v8V8CdB7cByrnn3DgwsMuiTKZZ0dK7vH37PO
 DX+/xn349e8oH7RdRo7f3m95g1YbHfpfnj0Rc4mjTDV72Jr/HlLTVgGTQg8DEnCR
 wBIFmeuBHHgeeLh87gsWLAP7ReReiy9V1uqpDFsko2/4BxRAM/8eedkwcAxD8aEy
 rd+dT/SBQj2cOdQMUeExT3gWjwzHh6ZHx3f1WCLK5fdck6BogH2hBUeri6F/H98L
 HoaXjBZYBTH68hB/mnO5I4g1ZlrVM74Vp7JPa3e1SFFtyEi6lsyrk2J3GoNh0E7r
 pXqH5kAcaJwBsBrbRGuvEyGbn9RLTaN5Gvseud0VE4oMruyodTniQaHXuIGackgz
 sMavMho4486EUWPaF7gIBdLNK1hO13w+IDZ4+3oBxhudMqdgZbk4iYpOCqQ7QY5G
 2vkzAE/sZ+aVNXeaIQOI8dE5clBy8gJ+6+t8dm3DY1r1xdbcnU40iZ8/fri3h69r
 vHs9bpQnVWZF0gEyEflY1pkcAPpIkvMmWCR7Ehy5YCkIfa+qfSL05o3dicpWovLP
 N+gCtpkhTK2AvmUWsUMypMLRvoSOImyCIiobrr3qNBaUdgRP8xKfUa72RuRp8cGl
 Vrj5oAiE3w==
 =YAfp
 -----END PGP SIGNATURE-----

Merge tag 'for-6.8/block-2024-01-08' of git://git.kernel.dk/linux

Pull block updates from Jens Axboe:
 "Pretty quiet round this time around. This contains:

   - NVMe updates via Keith:
        - nvme fabrics spec updates (Guixin, Max)
        - nvme target udpates (Guixin, Evan)
        - nvme attribute refactoring (Daniel)
        - nvme-fc numa fix (Keith)

   - MD updates via Song:
        - Fix/Cleanup RCU usage from conf->disks[i].rdev (Yu Kuai)
        - Fix raid5 hang issue (Junxiao Bi)
        - Add Yu Kuai as Reviewer of the md subsystem
        - Remove deprecated flavors (Song Liu)
        - raid1 read error check support (Li Nan)
        - Better handle events off-by-1 case (Alex Lyakas)

   - Efficiency improvements for passthrough (Kundan)

   - Support for mapping integrity data directly (Keith)

   - Zoned write fix (Damien)

   - rnbd fixes (Kees, Santosh, Supriti)

   - Default to a sane discard size granularity (Christoph)

   - Make the default max transfer size naming less confusing
     (Christoph)

   - Remove support for deprecated host aware zoned model (Christoph)

   - Misc fixes (me, Li, Matthew, Min, Ming, Randy, liyouhong, Daniel,
     Bart, Christoph)"

* tag 'for-6.8/block-2024-01-08' of git://git.kernel.dk/linux: (78 commits)
  block: Treat sequential write preferred zone type as invalid
  block: remove disk_clear_zoned
  sd: remove the !ZBC && blk_queue_is_zoned case in sd_read_block_characteristics
  drivers/block/xen-blkback/common.h: Fix spelling typo in comment
  blk-cgroup: fix rcu lockdep warning in blkg_lookup()
  blk-cgroup: don't use removal safe list iterators
  block: floor the discard granularity to the physical block size
  mtd_blkdevs: use the default discard granularity
  bcache: use the default discard granularity
  zram: use the default discard granularity
  null_blk: use the default discard granularity
  nbd: use the default discard granularity
  ubd: use the default discard granularity
  block: default the discard granularity to sector size
  bcache: discard_granularity should not be smaller than a sector
  block: remove two comments in bio_split_discard
  block: rename and document BLK_DEF_MAX_SECTORS
  loop: don't abuse BLK_DEF_MAX_SECTORS
  aoe: don't abuse BLK_DEF_MAX_SECTORS
  null_blk: don't cap max_hw_sectors to BLK_DEF_MAX_SECTORS
  ...
2024-01-11 13:58:04 -08:00
Linus Torvalds
fb46e22a9e Many singleton patches against the MM code. The patch series which
are included in this merge do the following:
 
 - Peng Zhang has done some mapletree maintainance work in the
   series
 
 	"maple_tree: add mt_free_one() and mt_attr() helpers"
 	"Some cleanups of maple tree"
 
 - In the series "mm: use memmap_on_memory semantics for dax/kmem"
   Vishal Verma has altered the interworking between memory-hotplug
   and dax/kmem so that newly added 'device memory' can more easily
   have its memmap placed within that newly added memory.
 
 - Matthew Wilcox continues folio-related work (including a few
   fixes) in the patch series
 
 	"Add folio_zero_tail() and folio_fill_tail()"
 	"Make folio_start_writeback return void"
 	"Fix fault handler's handling of poisoned tail pages"
 	"Convert aops->error_remove_page to ->error_remove_folio"
 	"Finish two folio conversions"
 	"More swap folio conversions"
 
 - Kefeng Wang has also contributed folio-related work in the series
 
 	"mm: cleanup and use more folio in page fault"
 
 - Jim Cromie has improved the kmemleak reporting output in the
   series "tweak kmemleak report format".
 
 - In the series "stackdepot: allow evicting stack traces" Andrey
   Konovalov to permits clients (in this case KASAN) to cause
   eviction of no longer needed stack traces.
 
 - Charan Teja Kalla has fixed some accounting issues in the page
   allocator's atomic reserve calculations in the series "mm:
   page_alloc: fixes for high atomic reserve caluculations".
 
 - Dmitry Rokosov has added to the samples/ dorectory some sample
   code for a userspace memcg event listener application.  See the
   series "samples: introduce cgroup events listeners".
 
 - Some mapletree maintanance work from Liam Howlett in the series
   "maple_tree: iterator state changes".
 
 - Nhat Pham has improved zswap's approach to writeback in the
   series "workload-specific and memory pressure-driven zswap
   writeback".
 
 - DAMON/DAMOS feature and maintenance work from SeongJae Park in
   the series
 
 	"mm/damon: let users feed and tame/auto-tune DAMOS"
 	"selftests/damon: add Python-written DAMON functionality tests"
 	"mm/damon: misc updates for 6.8"
 
 - Yosry Ahmed has improved memcg's stats flushing in the series
   "mm: memcg: subtree stats flushing and thresholds".
 
 - In the series "Multi-size THP for anonymous memory" Ryan Roberts
   has added a runtime opt-in feature to transparent hugepages which
   improves performance by allocating larger chunks of memory during
   anonymous page faults.
 
 - Matthew Wilcox has also contributed some cleanup and maintenance
   work against eh buffer_head code int he series "More buffer_head
   cleanups".
 
 - Suren Baghdasaryan has done work on Andrea Arcangeli's series
   "userfaultfd move option".  UFFDIO_MOVE permits userspace heap
   compaction algorithms to move userspace's pages around rather than
   UFFDIO_COPY'a alloc/copy/free.
 
 - Stefan Roesch has developed a "KSM Advisor", in the series
   "mm/ksm: Add ksm advisor".  This is a governor which tunes KSM's
   scanning aggressiveness in response to userspace's current needs.
 
 - Chengming Zhou has optimized zswap's temporary working memory
   use in the series "mm/zswap: dstmem reuse optimizations and
   cleanups".
 
 - Matthew Wilcox has performed some maintenance work on the
   writeback code, both code and within filesystems.  The series is
   "Clean up the writeback paths".
 
 - Andrey Konovalov has optimized KASAN's handling of alloc and
   free stack traces for secondary-level allocators, in the series
   "kasan: save mempool stack traces".
 
 - Andrey also performed some KASAN maintenance work in the series
   "kasan: assorted clean-ups".
 
 - David Hildenbrand has gone to town on the rmap code.  Cleanups,
   more pte batching, folio conversions and more.  See the series
   "mm/rmap: interface overhaul".
 
 - Kinsey Ho has contributed some maintenance work on the MGLRU
   code in the series "mm/mglru: Kconfig cleanup".
 
 - Matthew Wilcox has contributed lruvec page accounting code
   cleanups in the series "Remove some lruvec page accounting
   functions".
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZZyF2wAKCRDdBJ7gKXxA
 jjWjAP42LHvGSjp5M+Rs2rKFL0daBQsrlvy6/jCHUequSdWjSgEAmOx7bc5fbF27
 Oa8+DxGM9C+fwqZ/7YxU2w/WuUmLPgU=
 =0NHs
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2024-01-08-15-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:
 "Many singleton patches against the MM code. The patch series which are
  included in this merge do the following:

   - Peng Zhang has done some mapletree maintainance work in the series

	'maple_tree: add mt_free_one() and mt_attr() helpers'
	'Some cleanups of maple tree'

   - In the series 'mm: use memmap_on_memory semantics for dax/kmem'
     Vishal Verma has altered the interworking between memory-hotplug
     and dax/kmem so that newly added 'device memory' can more easily
     have its memmap placed within that newly added memory.

   - Matthew Wilcox continues folio-related work (including a few fixes)
     in the patch series

	'Add folio_zero_tail() and folio_fill_tail()'
	'Make folio_start_writeback return void'
	'Fix fault handler's handling of poisoned tail pages'
	'Convert aops->error_remove_page to ->error_remove_folio'
	'Finish two folio conversions'
	'More swap folio conversions'

   - Kefeng Wang has also contributed folio-related work in the series

	'mm: cleanup and use more folio in page fault'

   - Jim Cromie has improved the kmemleak reporting output in the series
     'tweak kmemleak report format'.

   - In the series 'stackdepot: allow evicting stack traces' Andrey
     Konovalov to permits clients (in this case KASAN) to cause eviction
     of no longer needed stack traces.

   - Charan Teja Kalla has fixed some accounting issues in the page
     allocator's atomic reserve calculations in the series 'mm:
     page_alloc: fixes for high atomic reserve caluculations'.

   - Dmitry Rokosov has added to the samples/ dorectory some sample code
     for a userspace memcg event listener application. See the series
     'samples: introduce cgroup events listeners'.

   - Some mapletree maintanance work from Liam Howlett in the series
     'maple_tree: iterator state changes'.

   - Nhat Pham has improved zswap's approach to writeback in the series
     'workload-specific and memory pressure-driven zswap writeback'.

   - DAMON/DAMOS feature and maintenance work from SeongJae Park in the
     series

	'mm/damon: let users feed and tame/auto-tune DAMOS'
	'selftests/damon: add Python-written DAMON functionality tests'
	'mm/damon: misc updates for 6.8'

   - Yosry Ahmed has improved memcg's stats flushing in the series 'mm:
     memcg: subtree stats flushing and thresholds'.

   - In the series 'Multi-size THP for anonymous memory' Ryan Roberts
     has added a runtime opt-in feature to transparent hugepages which
     improves performance by allocating larger chunks of memory during
     anonymous page faults.

   - Matthew Wilcox has also contributed some cleanup and maintenance
     work against eh buffer_head code int he series 'More buffer_head
     cleanups'.

   - Suren Baghdasaryan has done work on Andrea Arcangeli's series
     'userfaultfd move option'. UFFDIO_MOVE permits userspace heap
     compaction algorithms to move userspace's pages around rather than
     UFFDIO_COPY'a alloc/copy/free.

   - Stefan Roesch has developed a 'KSM Advisor', in the series 'mm/ksm:
     Add ksm advisor'. This is a governor which tunes KSM's scanning
     aggressiveness in response to userspace's current needs.

   - Chengming Zhou has optimized zswap's temporary working memory use
     in the series 'mm/zswap: dstmem reuse optimizations and cleanups'.

   - Matthew Wilcox has performed some maintenance work on the writeback
     code, both code and within filesystems. The series is 'Clean up the
     writeback paths'.

   - Andrey Konovalov has optimized KASAN's handling of alloc and free
     stack traces for secondary-level allocators, in the series 'kasan:
     save mempool stack traces'.

   - Andrey also performed some KASAN maintenance work in the series
     'kasan: assorted clean-ups'.

   - David Hildenbrand has gone to town on the rmap code. Cleanups, more
     pte batching, folio conversions and more. See the series 'mm/rmap:
     interface overhaul'.

   - Kinsey Ho has contributed some maintenance work on the MGLRU code
     in the series 'mm/mglru: Kconfig cleanup'.

   - Matthew Wilcox has contributed lruvec page accounting code cleanups
     in the series 'Remove some lruvec page accounting functions'"

* tag 'mm-stable-2024-01-08-15-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (361 commits)
  mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER
  mm, treewide: introduce NR_PAGE_ORDERS
  selftests/mm: add separate UFFDIO_MOVE test for PMD splitting
  selftests/mm: skip test if application doesn't has root privileges
  selftests/mm: conform test to TAP format output
  selftests: mm: hugepage-mmap: conform to TAP format output
  selftests/mm: gup_test: conform test to TAP format output
  mm/selftests: hugepage-mremap: conform test to TAP format output
  mm/vmstat: move pgdemote_* out of CONFIG_NUMA_BALANCING
  mm: zsmalloc: return -ENOSPC rather than -EINVAL in zs_malloc while size is too large
  mm/memcontrol: remove __mod_lruvec_page_state()
  mm/khugepaged: use a folio more in collapse_file()
  slub: use a folio in __kmalloc_large_node
  slub: use folio APIs in free_large_kmalloc()
  slub: use alloc_pages_node() in alloc_slab_page()
  mm: remove inc/dec lruvec page state functions
  mm: ratelimit stat flush from workingset shrinker
  kasan: stop leaking stack trace handles
  mm/mglru: remove CONFIG_TRANSPARENT_HUGEPAGE
  mm/mglru: add dummy pmd_dirty()
  ...
2024-01-09 11:18:47 -08:00
Kirill A. Shutemov
5e0a760b44 mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER
commit 23baf831a3 ("mm, treewide: redefine MAX_ORDER sanely") has
changed the definition of MAX_ORDER to be inclusive.  This has caused
issues with code that was not yet upstream and depended on the previous
definition.

To draw attention to the altered meaning of the define, rename MAX_ORDER
to MAX_PAGE_ORDER.

Link: https://lkml.kernel.org/r/20231228144704.14033-2-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-01-08 15:27:15 -08:00
Linus Torvalds
5db8752c3b vfs-6.8.iov_iter
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZZUzBQAKCRCRxhvAZXjc
 ot+3AQCZw1PBD4azVxFMWH76qwlAGoVIFug4+ogKU/iUa4VLygEA2FJh1vLJw5iI
 LpgBEIUTPVkwtzinAW94iJJo1Vr7NAI=
 =p6PB
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.8.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs iov_iter cleanups from Christian Brauner:
 "This contains a minor cleanup. The patches drop an unused argument
  from import_single_range() allowing to replace import_single_range()
  with import_ubuf() and dropping import_single_range() completely"

* tag 'vfs-6.8.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  iov_iter: replace import_single_range() with import_ubuf()
  iov_iter: remove unused 'iov' argument from import_single_range()
2024-01-08 11:43:04 -08:00
Linus Torvalds
bb93c5ed45 vfs-6.8.rw
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZZUzXQAKCRCRxhvAZXjc
 ogOtAQDpqUp1zY4dV/dZisCJ5xarZTsSZ1AvgmcxZBtS0NhbdgEAshWvYGA9ryS/
 ChL5jjtjjZDLhRA//reoFHTQIrdp2w8=
 =bF+R
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.8.rw' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs rw updates from Christian Brauner:
 "This contains updates from Amir for read-write backing file helpers
  for stacking filesystems such as overlayfs:

   - Fanotify is currently in the process of introducing pre content
     events. Roughly, a new permission event will be added indicating
     that it is safe to write to the file being accessed. These events
     are used by hierarchical storage managers to e.g., fill the content
     of files on first access.

     During that work we noticed that our current permission checking is
     inconsistent in rw_verify_area() and remap_verify_area().
     Especially in the splice code permission checking is done multiple
     times. For example, one time for the whole range and then again for
     partial ranges inside the iterator.

     In addition, we mostly do permission checking before we call
     file_start_write() except for a few places where we call it after.
     For pre-content events we need such permission checking to be done
     before file_start_write(). So this is a nice reason to clean this
     all up.

     After this series, all permission checking is done before
     file_start_write().

     As part of this cleanup we also massaged the splice code a bit. We
     got rid of a few helpers because we are alredy drowning in special
     read-write helpers. We also cleaned up the return types for splice
     helpers.

   - Introduce generic read-write helpers for backing files. This lifts
     some overlayfs code to common code so it can be used by the FUSE
     passthrough work coming in over the next cycles. Make Amir and
     Miklos the maintainers for this new subsystem of the vfs"

* tag 'vfs-6.8.rw' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (30 commits)
  fs: fix __sb_write_started() kerneldoc formatting
  fs: factor out backing_file_mmap() helper
  fs: factor out backing_file_splice_{read,write}() helpers
  fs: factor out backing_file_{read,write}_iter() helpers
  fs: prepare for stackable filesystems backing file helpers
  fsnotify: optionally pass access range in file permission hooks
  fsnotify: assert that file_start_write() is not held in permission hooks
  fsnotify: split fsnotify_perm() into two hooks
  fs: use splice_copy_file_range() inline helper
  splice: return type ssize_t from all helpers
  fs: use do_splice_direct() for nfsd/ksmbd server-side-copy
  fs: move file_start_write() into direct_splice_actor()
  fs: fork splice_file_range() from do_splice_direct()
  fs: create {sb,file}_write_not_started() helpers
  fs: create file_write_started() helper
  fs: create __sb_write_started() helper
  fs: move kiocb_start_write() into vfs_iocb_iter_write()
  fs: move permission hook out of do_iter_read()
  fs: move permission hook out of do_iter_write()
  fs: move file_start_write() into vfs_iter_write()
  ...
2024-01-08 11:11:51 -08:00
liyouhong
e3d7581cb1 drivers/block/xen-blkback/common.h: Fix spelling typo in comment
Fix spelling typo in comment.

Reported-by: k2ci <kernel-bot@kylinos.cn>
Signed-off-by: liyouhong <liyouhong@kylinos.cn>
Reviewed-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/20231226095701.172080-1-liyouhong@kylinos.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-01-04 16:10:29 -07:00
Christoph Hellwig
3753039def zram: use the default discard granularity
The discard granularity now defaults to a single sector, so don't set
that value explicitly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20231228075545.362768-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-29 08:44:12 -07:00
Christoph Hellwig
724325477f null_blk: use the default discard granularity
The discard granularity now defaults to a single sector, so don't set
that value explicitly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20231228075545.362768-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-29 08:44:12 -07:00
Christoph Hellwig
1e2ab2e8a9 nbd: use the default discard granularity
The discard granularity now defaults to a single sector, so don't set
that value explicitly.  Also don't bother clearing it as a discard
granularity without discard_sectors doesn't mean anything.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20231228075545.362768-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-29 08:44:12 -07:00
Christoph Hellwig
3d77976c3a loop: don't abuse BLK_DEF_MAX_SECTORS
BLK_DEF_MAX_SECTORS despite the confusing name is the default cap for
the max_sectors limits.  Don't use it to initialize max_hw_setors, which
is a hardware / driver capacility.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20231227092305.279567-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-27 10:46:01 -07:00
Christoph Hellwig
3888b2ee62 aoe: don't abuse BLK_DEF_MAX_SECTORS
BLK_DEF_MAX_SECTORS despite the confusing name is the default cap for
the max_sectors limits.  Don't use it to initialize max_hw_setors, which
is a hardware / driver capacility.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20231227092305.279567-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-27 10:46:01 -07:00
Christoph Hellwig
9a9525de86 null_blk: don't cap max_hw_sectors to BLK_DEF_MAX_SECTORS
null_blk has some rather odd capping of the max_hw_sectors value to
BLK_DEF_MAX_SECTORS, which doesn't make sense - max_hw_sector is the
hardware limit, and BLK_DEF_MAX_SECTORS despite the confusing name is the
default cap for the max_sectors field used for normal file system I/O.

Remove all the capping, and simply leave it to the block layer or
user to take up or not all of that for file system I/O.

Fixes: ea17fd354c ("null_blk: Allow controlling max_hw_sectors limit")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20231227092305.279567-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-27 10:46:01 -07:00
Christoph Hellwig
34c7db44b4 loop: don't update discard limits from loop_set_status
loop_set_status doesn't change anything relevant to the discard and
write_zeroes setting, so don't bother calling loop_config_discard.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20231227082020.249427-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-27 10:43:46 -07:00
Randy Dunlap
8aabc11c8f drbd: actlog: fix kernel-doc warnings and spelling
Fix all kernel-doc warnings in drbd_actlog.c:

drbd_actlog.c:963: warning: No description found for return value of 'drbd_rs_begin_io'
drbd_actlog.c:1015: warning: Function parameter or member 'peer_device' not described in 'drbd_try_rs_begin_io'
drbd_actlog.c:1015: warning: Excess function parameter 'device' description in 'drbd_try_rs_begin_io'
drbd_actlog.c:1015: warning: No description found for return value of 'drbd_try_rs_begin_io'
drbd_actlog.c:1197: warning: No description found for return value of 'drbd_rs_del_all'

Fix one spelling error (s/ore/or/).

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Lars Ellenberg <lars.ellenberg@linbit.com>
Cc: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Cc:  <drbd-dev@lists.linbit.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc:  <linux-block@vger.kernel.org>
Link: https://lore.kernel.org/r/20231222061909.8791-1-rdunlap@infradead.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-22 07:18:35 -07:00
Christoph Hellwig
d73e93b4df block: simplify disk_set_zoned
Only use disk_set_zoned to actually enable zoned device support.
For clearing it, call disk_clear_zoned, which is renamed from
disk_clear_zone_settings and now directly clears the zoned flag as
well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20231217165359.604246-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-19 20:17:43 -07:00
Christoph Hellwig
7437bb73f0 block: remove support for the host aware zone model
When zones were first added the SCSI and ATA specs, two different
models were supported (in addition to the drive managed one that
is invisible to the host):

 - host managed where non-conventional zones there is strict requirement
   to write at the write pointer, or else an error is returned
 - host aware where a write point is maintained if writes always happen
   at it, otherwise it is left in an under-defined state and the
   sequential write preferred zones behave like conventional zones
   (probably very badly performing ones, though)

Not surprisingly this lukewarm model didn't prove to be very useful and
was finally removed from the ZBC and SBC specs (NVMe never implemented
it).  Due to to the easily disappearing write pointer host software
could never rely on the write pointer to actually be useful for say
recovery.

Fortunately only a few HDD prototypes shipped using this model which
never made it to mass production.  Drop the support before it is too
late.  Note that any such host aware prototype HDD can still be used
with Linux as we'll now treat it as a conventional HDD.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20231217165359.604246-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-19 20:17:43 -07:00