- The 4 patch series "mm: ksm: prevent KSM from breaking merging of new
VMAs" from Lorenzo Stoakes addresses an issue with KSM's
PR_SET_MEMORY_MERGE mode: newly mapped VMAs were not eligible for
merging with existing adjacent VMAs.
- The 4 patch series "mm/damon: introduce DAMON_STAT for simple and
practical access monitoring" from SeongJae Park adds a new kernel module
which simplifies the setup and usage of DAMON in production
environments.
- The 6 patch series "stop passing a writeback_control to swap/shmem
writeout" from Christoph Hellwig is a cleanup to the writeback code
which removes a couple of pointers from struct writeback_control.
- The 7 patch series "drivers/base/node.c: optimization and cleanups"
from Donet Tom contains largely uncorrelated cleanups to the NUMA node
setup and management code.
- The 4 patch series "mm: userfaultfd: assorted fixes and cleanups" from
Tal Zussman does some maintenance work on the userfaultfd code.
- The 5 patch series "Readahead tweaks for larger folios" from Ryan
Roberts implements some tuneups for pagecache readahead when it is
reading into order>0 folios.
- The 4 patch series "selftests/mm: Tweaks to the cow test" from Mark
Brown provides some cleanups and consistency improvements to the
selftests code.
- The 4 patch series "Optimize mremap() for large folios" from Dev Jain
does that. A 37% reduction in execution time was measured in a
memset+mremap+munmap microbenchmark.
- The 5 patch series "Remove zero_user()" from Matthew Wilcox expunges
zero_user() in favor of the more modern memzero_page().
- The 3 patch series "mm/huge_memory: vmf_insert_folio_*() and
vmf_insert_pfn_pud() fixes" from David Hildenbrand addresses some warts
which David noticed in the huge page code. These were not known to be
causing any issues at this time.
- The 3 patch series "mm/damon: use alloc_migrate_target() for
DAMOS_MIGRATE_{HOT,COLD" from SeongJae Park provides some cleanup and
consolidation work in DAMON.
- The 3 patch series "use vm_flags_t consistently" from Lorenzo Stoakes
uses vm_flags_t in places where we were inappropriately using other
types.
- The 3 patch series "mm/memfd: Reserve hugetlb folios before
allocation" from Vivek Kasireddy increases the reliability of large page
allocation in the memfd code.
- The 14 patch series "mm: Remove pXX_devmap page table bit and pfn_t
type" from Alistair Popple removes several now-unneeded PFN_* flags.
- The 5 patch series "mm/damon: decouple sysfs from core" from SeongJae
Park implememnts some cleanup and maintainability work in the DAMON
sysfs layer.
- The 5 patch series "madvise cleanup" from Lorenzo Stoakes does quite a
lot of cleanup/maintenance work in the madvise() code.
- The 4 patch series "madvise anon_name cleanups" from Vlastimil Babka
provides additional cleanups on top or Lorenzo's effort.
- The 11 patch series "Implement numa node notifier" from Oscar Salvador
creates a standalone notifier for NUMA node memory state changes.
Previously these were lumped under the more general memory on/offline
notifier.
- The 6 patch series "Make MIGRATE_ISOLATE a standalone bit" from Zi Yan
cleans up the pageblock isolation code and fixes a potential issue which
doesn't seem to cause any problems in practice.
- The 5 patch series "selftests/damon: add python and drgn based DAMON
sysfs functionality tests" from SeongJae Park adds additional drgn- and
python-based DAMON selftests which are more comprehensive than the
existing selftest suite.
- The 5 patch series "Misc rework on hugetlb faulting path" from Oscar
Salvador fixes a rather obscure deadlock in the hugetlb fault code and
follows that fix with a series of cleanups.
- The 3 patch series "cma: factor out allocation logic from
__cma_declare_contiguous_nid" from Mike Rapoport rationalizes and cleans
up the highmem-specific code in the CMA allocator.
- The 28 patch series "mm/migration: rework movable_ops page migration
(part 1)" from David Hildenbrand provides cleanups and
future-preparedness to the migration code.
- The 2 patch series "mm/damon: add trace events for auto-tuned
monitoring intervals and DAMOS quota" from SeongJae Park adds some
tracepoints to some DAMON auto-tuning code.
- The 6 patch series "mm/damon: fix misc bugs in DAMON modules" from
SeongJae Park does that.
- The 6 patch series "mm/damon: misc cleanups" from SeongJae Park also
does what it claims.
- The 4 patch series "mm: folio_pte_batch() improvements" from David
Hildenbrand cleans up the large folio PTE batching code.
- The 13 patch series "mm/damon/vaddr: Allow interleaving in
migrate_{hot,cold} actions" from SeongJae Park facilitates dynamic
alteration of DAMON's inter-node allocation policy.
- The 3 patch series "Remove unmap_and_put_page()" from Vishal Moola
provides a couple of page->folio conversions.
- The 4 patch series "mm: per-node proactive reclaim" from Davidlohr
Bueso implements a per-node control of proactive reclaim - beyond the
current memcg-based implementation.
- The 14 patch series "mm/damon: remove damon_callback" from SeongJae
Park replaces the damon_callback interface with a more general and
powerful damon_call()+damos_walk() interface.
- The 10 patch series "mm/mremap: permit mremap() move of multiple VMAs"
from Lorenzo Stoakes implements a number of mremap cleanups (of course)
in preparation for adding new mremap() functionality: newly permit the
remapping of multiple VMAs when the user is specifying MREMAP_FIXED. It
still excludes some specialized situations where this cannot be
performed reliably.
- The 3 patch series "drop hugetlb_free_pgd_range()" from Anthony Yznaga
switches some sparc hugetlb code over to the generic version and removes
the thus-unneeded hugetlb_free_pgd_range().
- The 4 patch series "mm/damon/sysfs: support periodic and automated
stats update" from SeongJae Park augments the present
userspace-requested update of DAMON sysfs monitoring files. Automatic
update is now provided, along with a tunable to control the update
interval.
- The 4 patch series "Some randome fixes and cleanups to swapfile" from
Kemeng Shi does what is claims.
- The 4 patch series "mm: introduce snapshot_page" from Luiz Capitulino
and David Hildenbrand provides (and uses) a means by which debug-style
functions can grab a copy of a pageframe and inspect it locklessly
without tripping over the races inherent in operating on the live
pageframe directly.
- The 6 patch series "use per-vma locks for /proc/pid/maps reads" from
Suren Baghdasaryan addresses the large contention issues which can be
triggered by reads from that procfs file. Latencies are reduced by more
than half in some situations. The series also introduces several new
selftests for the /proc/pid/maps interface.
- The 6 patch series "__folio_split() clean up" from Zi Yan cleans up
__folio_split()!
- The 7 patch series "Optimize mprotect() for large folios" from Dev
Jain provides some quite large (>3x) speedups to mprotect() when dealing
with large folios.
- The 2 patch series "selftests/mm: reuse FORCE_READ to replace "asm
volatile("" : "+r" (XXX));" and some cleanup" from wang lian does some
cleanup work in the selftests code.
- The 3 patch series "tools/testing: expand mremap testing" from Lorenzo
Stoakes extends the mremap() selftest in several ways, including adding
more checking of Lorenzo's recently added "permit mremap() move of
multiple VMAs" feature.
- The 22 patch series "selftests/damon/sysfs.py: test all parameters"
from SeongJae Park extends the DAMON sysfs interface selftest so that it
tests all possible user-requested parameters. Rather than the present
minimal subset.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaIqcCgAKCRDdBJ7gKXxA
jkVBAQCCn9DR1QP0CRk961ot0cKzOgioSc0aA03DPb2KXRt2kQEAzDAz0ARurFhL
8BzbvI0c+4tntHLXvIlrC33n9KWAOQM=
=XsFy
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2025-07-30-15-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
"As usual, many cleanups. The below blurbiage describes 42 patchsets.
21 of those are partially or fully cleanup work. "cleans up",
"cleanup", "maintainability", "rationalizes", etc.
I never knew the MM code was so dirty.
"mm: ksm: prevent KSM from breaking merging of new VMAs" (Lorenzo Stoakes)
addresses an issue with KSM's PR_SET_MEMORY_MERGE mode: newly
mapped VMAs were not eligible for merging with existing adjacent
VMAs.
"mm/damon: introduce DAMON_STAT for simple and practical access monitoring" (SeongJae Park)
adds a new kernel module which simplifies the setup and usage of
DAMON in production environments.
"stop passing a writeback_control to swap/shmem writeout" (Christoph Hellwig)
is a cleanup to the writeback code which removes a couple of
pointers from struct writeback_control.
"drivers/base/node.c: optimization and cleanups" (Donet Tom)
contains largely uncorrelated cleanups to the NUMA node setup and
management code.
"mm: userfaultfd: assorted fixes and cleanups" (Tal Zussman)
does some maintenance work on the userfaultfd code.
"Readahead tweaks for larger folios" (Ryan Roberts)
implements some tuneups for pagecache readahead when it is reading
into order>0 folios.
"selftests/mm: Tweaks to the cow test" (Mark Brown)
provides some cleanups and consistency improvements to the
selftests code.
"Optimize mremap() for large folios" (Dev Jain)
does that. A 37% reduction in execution time was measured in a
memset+mremap+munmap microbenchmark.
"Remove zero_user()" (Matthew Wilcox)
expunges zero_user() in favor of the more modern memzero_page().
"mm/huge_memory: vmf_insert_folio_*() and vmf_insert_pfn_pud() fixes" (David Hildenbrand)
addresses some warts which David noticed in the huge page code.
These were not known to be causing any issues at this time.
"mm/damon: use alloc_migrate_target() for DAMOS_MIGRATE_{HOT,COLD" (SeongJae Park)
provides some cleanup and consolidation work in DAMON.
"use vm_flags_t consistently" (Lorenzo Stoakes)
uses vm_flags_t in places where we were inappropriately using other
types.
"mm/memfd: Reserve hugetlb folios before allocation" (Vivek Kasireddy)
increases the reliability of large page allocation in the memfd
code.
"mm: Remove pXX_devmap page table bit and pfn_t type" (Alistair Popple)
removes several now-unneeded PFN_* flags.
"mm/damon: decouple sysfs from core" (SeongJae Park)
implememnts some cleanup and maintainability work in the DAMON
sysfs layer.
"madvise cleanup" (Lorenzo Stoakes)
does quite a lot of cleanup/maintenance work in the madvise() code.
"madvise anon_name cleanups" (Vlastimil Babka)
provides additional cleanups on top or Lorenzo's effort.
"Implement numa node notifier" (Oscar Salvador)
creates a standalone notifier for NUMA node memory state changes.
Previously these were lumped under the more general memory
on/offline notifier.
"Make MIGRATE_ISOLATE a standalone bit" (Zi Yan)
cleans up the pageblock isolation code and fixes a potential issue
which doesn't seem to cause any problems in practice.
"selftests/damon: add python and drgn based DAMON sysfs functionality tests" (SeongJae Park)
adds additional drgn- and python-based DAMON selftests which are
more comprehensive than the existing selftest suite.
"Misc rework on hugetlb faulting path" (Oscar Salvador)
fixes a rather obscure deadlock in the hugetlb fault code and
follows that fix with a series of cleanups.
"cma: factor out allocation logic from __cma_declare_contiguous_nid" (Mike Rapoport)
rationalizes and cleans up the highmem-specific code in the CMA
allocator.
"mm/migration: rework movable_ops page migration (part 1)" (David Hildenbrand)
provides cleanups and future-preparedness to the migration code.
"mm/damon: add trace events for auto-tuned monitoring intervals and DAMOS quota" (SeongJae Park)
adds some tracepoints to some DAMON auto-tuning code.
"mm/damon: fix misc bugs in DAMON modules" (SeongJae Park)
does that.
"mm/damon: misc cleanups" (SeongJae Park)
also does what it claims.
"mm: folio_pte_batch() improvements" (David Hildenbrand)
cleans up the large folio PTE batching code.
"mm/damon/vaddr: Allow interleaving in migrate_{hot,cold} actions" (SeongJae Park)
facilitates dynamic alteration of DAMON's inter-node allocation
policy.
"Remove unmap_and_put_page()" (Vishal Moola)
provides a couple of page->folio conversions.
"mm: per-node proactive reclaim" (Davidlohr Bueso)
implements a per-node control of proactive reclaim - beyond the
current memcg-based implementation.
"mm/damon: remove damon_callback" (SeongJae Park)
replaces the damon_callback interface with a more general and
powerful damon_call()+damos_walk() interface.
"mm/mremap: permit mremap() move of multiple VMAs" (Lorenzo Stoakes)
implements a number of mremap cleanups (of course) in preparation
for adding new mremap() functionality: newly permit the remapping
of multiple VMAs when the user is specifying MREMAP_FIXED. It still
excludes some specialized situations where this cannot be performed
reliably.
"drop hugetlb_free_pgd_range()" (Anthony Yznaga)
switches some sparc hugetlb code over to the generic version and
removes the thus-unneeded hugetlb_free_pgd_range().
"mm/damon/sysfs: support periodic and automated stats update" (SeongJae Park)
augments the present userspace-requested update of DAMON sysfs
monitoring files. Automatic update is now provided, along with a
tunable to control the update interval.
"Some randome fixes and cleanups to swapfile" (Kemeng Shi)
does what is claims.
"mm: introduce snapshot_page" (Luiz Capitulino and David Hildenbrand)
provides (and uses) a means by which debug-style functions can grab
a copy of a pageframe and inspect it locklessly without tripping
over the races inherent in operating on the live pageframe
directly.
"use per-vma locks for /proc/pid/maps reads" (Suren Baghdasaryan)
addresses the large contention issues which can be triggered by
reads from that procfs file. Latencies are reduced by more than
half in some situations. The series also introduces several new
selftests for the /proc/pid/maps interface.
"__folio_split() clean up" (Zi Yan)
cleans up __folio_split()!
"Optimize mprotect() for large folios" (Dev Jain)
provides some quite large (>3x) speedups to mprotect() when dealing
with large folios.
"selftests/mm: reuse FORCE_READ to replace "asm volatile("" : "+r" (XXX));" and some cleanup" (wang lian)
does some cleanup work in the selftests code.
"tools/testing: expand mremap testing" (Lorenzo Stoakes)
extends the mremap() selftest in several ways, including adding
more checking of Lorenzo's recently added "permit mremap() move of
multiple VMAs" feature.
"selftests/damon/sysfs.py: test all parameters" (SeongJae Park)
extends the DAMON sysfs interface selftest so that it tests all
possible user-requested parameters. Rather than the present minimal
subset"
* tag 'mm-stable-2025-07-30-15-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (370 commits)
MAINTAINERS: add missing headers to mempory policy & migration section
MAINTAINERS: add missing file to cgroup section
MAINTAINERS: add MM MISC section, add missing files to MISC and CORE
MAINTAINERS: add missing zsmalloc file
MAINTAINERS: add missing files to page alloc section
MAINTAINERS: add missing shrinker files
MAINTAINERS: move memremap.[ch] to hotplug section
MAINTAINERS: add missing mm_slot.h file THP section
MAINTAINERS: add missing interval_tree.c to memory mapping section
MAINTAINERS: add missing percpu-internal.h file to per-cpu section
mm/page_alloc: remove trace_mm_alloc_contig_migrate_range_info()
selftests/damon: introduce _common.sh to host shared function
selftests/damon/sysfs.py: test runtime reduction of DAMON parameters
selftests/damon/sysfs.py: test non-default parameters runtime commit
selftests/damon/sysfs.py: generalize DAMON context commit assertion
selftests/damon/sysfs.py: generalize monitoring attributes commit assertion
selftests/damon/sysfs.py: generalize DAMOS schemes commit assertion
selftests/damon/sysfs.py: test DAMOS filters commitment
selftests/damon/sysfs.py: generalize DAMOS scheme commit assertion
selftests/damon/sysfs.py: test DAMOS destinations commitment
...
Patch series "Remove zero_user()".
The zero_user() API is almost unused these days. Finish the job of
removing it.
This patch (of 5):
memzero_page() is the new name for zero_user().
Link: https://lkml.kernel.org/r/20250612143443.2848197-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20250612143443.2848197-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Alex Markuze <amarkuze@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Xiubo Li <xiubli@redhat.com>
Cc: Dan Carpenter <dan.carpenter@linaro.org>
Cc: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
To get out of the DMA mapping helpers having to check every segment for
it's P2P status, ensure that bios either contain P2P transfers or non-P2P
transfers, and that a P2P bio only contains ranges from a single device.
This means we do the page zone access in the bio add path where it should
be still page hot, and will only have do the fairly expensive P2P topology
lookup once per bio down in the DMA mapping path, and only for already
marked bios.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/20250625113531.522027-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmgwnGYQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpq9aD/4iqOts77xhWWLrOJWkkhOcV5rREeyppq8X
MKYul9S4cc4Uin9Xou9a+nab31QBQEk3nsN3kX9o3yAXvkh6yUm36HD8qYNW/46q
IUkwRQQJ0COyTnexMZQNTbZPQDIYcenXmQxOcrEJ5jC1Jcz0sOKHsgekL+ab3kCy
fLnuz2ozvjGDMala/NmE8fN5qSlj4qQABHgbamwlwfo4aWu07cwfqn5G/FCYJgDO
xUvsnTVclom2g4G+7eSSvGQI1QyAxl5QpviPnj/TEgfFBFnhbCSoBTEY6ecqhlfW
6u59MF/Uw8E+weiuGY4L87kDtBhjQs3UMSLxCuwH7MxXb25ff7qB4AIkcFD0kKFH
3V5NtwqlU7aQT0xOjGxaHhfPwjLD+FVss4ARmuHS09/Kn8egOW9yROPyetnuH84R
Oz0Ctnt1IPLFjvGeg3+rt9fjjS9jWOXLITb9Q6nX9gnCt7orCwIYke8YCpmnJyhn
i+fV4CWYIQBBRKxIT0E/GhJxZOmL0JKpomnbpP2dH8npemnsTCuvtfdrK9gfhH2X
chBVqCPY8MNU5zKfzdEiavPqcm9392lMzOoOXW2pSC1eAKqnAQ86ZT3r7rLntqE8
75LxHcvaQIsnpyG+YuJVHvoiJ83TbqZNpyHwNaQTYhDmdYpp2d/wTtTQywX4DuXb
Y6NDJw5+kQ==
=1PNK
-----END PGP SIGNATURE-----
Merge tag 'for-6.16/block-20250523' of git://git.kernel.dk/linux
Pull block updates from Jens Axboe:
- ublk updates:
- Add support for updating the size of a ublk instance
- Zero-copy improvements
- Auto-registering of buffers for zero-copy
- Series simplifying and improving GET_DATA and request lookup
- Series adding quiesce support
- Lots of selftests additions
- Various cleanups
- NVMe updates via Christoph:
- add per-node DMA pools and use them for PRP/SGL allocations
(Caleb Sander Mateos, Keith Busch)
- nvme-fcloop refcounting fixes (Daniel Wagner)
- support delayed removal of the multipath node and optionally
support the multipath node for private namespaces (Nilay Shroff)
- support shared CQs in the PCI endpoint target code (Wilfred
Mallawa)
- support admin-queue only authentication (Hannes Reinecke)
- use the crc32c library instead of the crypto API (Eric Biggers)
- misc cleanups (Christoph Hellwig, Marcelo Moreira, Hannes
Reinecke, Leon Romanovsky, Gustavo A. R. Silva)
- MD updates via Yu:
- Fix that normal IO can be starved by sync IO, found by mkfs on
newly created large raid5, with some clean up patches for bdev
inflight counters
- Clean up brd, getting rid of atomic kmaps and bvec poking
- Add loop driver specifically for zoned IO testing
- Eliminate blk-rq-qos calls with a static key, if not enabled
- Improve hctx locking for when a plug has IO for multiple queues
pending
- Remove block layer bouncing support, which in turn means we can
remove the per-node bounce stat as well
- Improve blk-throttle support
- Improve delay support for blk-throttle
- Improve brd discard support
- Unify IO scheduler switching. This should also fix a bunch of lockdep
warnings we've been seeing, after enabling lockdep support for queue
freezing/unfreezeing
- Add support for block write streams via FDP (flexible data placement)
on NVMe
- Add a bunch of block helpers, facilitating the removal of a bunch of
duplicated boilerplate code
- Remove obsolete BLK_MQ pci and virtio Kconfig options
- Add atomic/untorn write support to blktrace
- Various little cleanups and fixes
* tag 'for-6.16/block-20250523' of git://git.kernel.dk/linux: (186 commits)
selftests: ublk: add test for UBLK_F_QUIESCE
ublk: add feature UBLK_F_QUIESCE
selftests: ublk: add test case for UBLK_U_CMD_UPDATE_SIZE
traceevent/block: Add REQ_ATOMIC flag to block trace events
ublk: run auto buf unregisgering in same io_ring_ctx with registering
io_uring: add helper io_uring_cmd_ctx_handle()
ublk: remove io argument from ublk_auto_buf_reg_fallback()
ublk: handle ublk_set_auto_buf_reg() failure correctly in ublk_fetch()
selftests: ublk: add test for covering UBLK_AUTO_BUF_REG_FALLBACK
selftests: ublk: support UBLK_F_AUTO_BUF_REG
ublk: support UBLK_AUTO_BUF_REG_FALLBACK
ublk: register buffer to local io_uring with provided buf index via UBLK_F_AUTO_BUF_REG
ublk: prepare for supporting to register request buffer automatically
ublk: convert to refcount_t
selftests: ublk: make IO & device removal test more stressful
nvme: rename nvme_mpath_shutdown_disk to nvme_mpath_remove_disk
nvme: introduce multipath_always_on module param
nvme-multipath: introduce delayed removal of the multipath head node
nvme-pci: derive and better document max segments limits
nvme-pci: use struct_size for allocation struct nvme_dev
...
bvec_try_merge_page currently returns if the added page fragment is
within the same page as the last page in the last current bio_vec.
This information is used by __bio_iov_iter_get_pages so that we always
have a single folio pin per page even when the page is split over
multiple __bio_iov_iter_get_pages calls.
Threading this through the entire lowlevel add page to bio logic is
annoying and inefficient and leads to less code sharing than otherwise
possible. Instead add code to __bio_iov_iter_get_pages that checks if
the bio_vecs did not change and thus a merge into the last segment must
have happened, and if there is an offset into the page for the currently
added fragment, because if yes we must have already had a previous
fragment of the same page in the last bio_vec. While this is still a bit
ugly, it keeps the logic in the one place that needs it and allows for
more code sharing.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250512042354.514329-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The REPORT ZONES buffer size is currently limited by the HBA's maximum
segment count to ensure the buffer can be mapped. However, the block
layer further limits the number of iovec entries to 1024 when allocating
a bio.
To avoid allocation of buffers too large to be mapped, further restrict
the maximum buffer size to BIO_MAX_INLINE_VECS.
Replace the UIO_MAXIOV symbolic name with the more contextually
appropriate BIO_MAX_INLINE_VECS.
Fixes: b091ac6168 ("sd_zbc: Fix report zones buffer allocation")
Cc: stable@vger.kernel.org
Signed-off-by: Steve Siwinski <ssiwinski@atto.com>
Link: https://lore.kernel.org/r/20250508200122.243129-1-ssiwinski@atto.com
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Add a helper to add a vmalloc region to a bio, abstracting away the
vmalloc addresses from the underlying pages and another one wrapping
it for the simple case where all data fits into a single bio.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250507120451.4000627-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a helper to perform synchronous I/O on a kernel direct map range.
Currently this is implemented in various places in usually not very
efficient ways, so provide a generic helper instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250507120451.4000627-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a helper to add a directly mapped kernel virtual address to a
bio so that callers don't have to convert to pages or folios.
For now only the _nofail variant is provided as that is what all the
obvious callers want.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250507120451.4000627-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add the ability to pass a write stream for placement control in the bio.
The new field fits in an existing hole, so does not change the size of
the struct.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20250506121732.8211-3-joshi.k@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
>4GB folio is possible on some ARCHs, such as aarch64, 16GB hugepage
is supported, then 'offset' of folio can't be held in 'unsigned int',
cause warning in bio_add_folio_nofail() and IO failure.
Fix it by adjusting 'page' & trimming 'offset' so that `->bi_offset` won't
be overflow, and folio can be added to bio successfully.
Fixes: ed9832bc08 ("block: introduce folio awareness and add a bigger size from folio")
Cc: Kundan Kumar <kundan.kumar@samsung.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Gavin Shan <gshan@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/20250312145136.2891229-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Many of the fields in struct bio_integrity_payload are only needed for
the default integrity buffer in the block layer, and the variable
sized array at the end of the structure makes it very hard to embed
into caller allocated structures.
Reduce struct bio_integrity_payload to the minimal structure needed in
common code and create two separate containing structures for the
automatically generated payload and the caller allocated payload.
The latter is a simple wrapper for struct bio_integrity_payload and
the bvecs, while the former contains the additional fields moved out
of struct bio_integrity_payload.
Always use a dedicated mempool for automatic integrity metadata
instead of depending on bio_set that is submitter controlled and thus
often doesn't have the mempool initialized and stop using mempools for
the submitter buffers as they aren't in the NOIO I/O submission path
where we need to guarantee forward progress.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Tested-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20250225154449.422989-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Device mapper bioset often has big bio_slab size, which can be more than
1000, then 8byte can't hold the slab name any more, cause the kmem_cache
allocation warning of 'kmem_cache of name 'bio-108' already exists'.
Fix the warning by extending bio_slab->name to 12 bytes, but fix output
of /proc/slabinfo
Reported-by: Guangwu Zhang <guazhang@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250228132656.2838008-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This is disallowed.
This check will now be relevant since the device mapper personalities
will start to support atomic writes, and they use this function.
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Link: https://lore.kernel.org/r/20250116170301.474130-3-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Lift bio_split_rw_at into blk_rq_append_bio so that it validates the
hardware limits. With this all passthrough callers can simply add
bio_add_page to build the bio and delay checking for exceeding of limits
to this point instead of doing it for each page.
While this looks like adding a new expensive loop over all bio_vecs,
blk_rq_append_bio is already doing that just to counter the number of
segments.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Link: https://lore.kernel.org/r/20250103073417.459715-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Make bio_iov_bvec_set() accept a pointer to const iov_iter, which means
that we can drop the undesirable casting to struct iov_iter pointer in
blk_rq_map_user_bvec().
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241202115727.2320401-1-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This is disallowed.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20241111112150.3756529-3-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Instead of returning an inconclusive value of NULL for an error in calling
bio_split(), return a ERR_PTR() always.
Also remove the BUG_ON() calls, and WARN_ON_ONCE() instead. Indeed, since
almost all callers don't check the return code from bio_split(), we'll
crash anyway (for those failures).
Fix up the only user which checks bio_split() return code today (directly
or indirectly), blk_crypto_fallback_split_bio_if_needed(). The md/bcache
code does check the return code in cached_dev_cache_miss() ->
bio_next_split() -> bio_split(), but only to see if there was a split, so
there would be no change in behaviour here (when returning a ERR_PTR()).
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20241111112150.3756529-2-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This is only used by the nvmet zns passthrough code, which can trivially
just use bio_add_pc_page and do the sanity check for the max zone append
limit itself.
All future zoned file systems should follow the btrfs lead and let the
upper layers fill up bios unlimited by hardware constraints and split
them to the limits in the I/O submission handler.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20241030051859.280923-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This code is unused, and all future zoned file systems should follow
the btrfs lead of splitting the bios themselves to the zoned limits
in the I/O submission handler, because if they didn't they would be
hit by commit ed9832bc08 ("block: introduce folio awareness and add
a bigger size from folio") breaking this code when the zone append
limit (that is usually the max_hw_sectors limit) is smaller than the
largest possible folio size.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20241030051859.280923-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use newly added mm function unpin_user_folio() to put refs by npages
count.
Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com>
Tested-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20240911064935.5630-5-kundan.kumar@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a bigger size from folio to bio and skip merge processing for pages.
Fetch the offset of page within a folio. Depending on the size of folio
and folio_offset, fetch a larger length. This length may consist of
multiple contiguous pages if folio is multiorder.
Using the length calculate number of pages which will be added to bio and
increment the loop counter to skip those pages.
This technique helps to avoid overhead of merging pages which belong to
same large order folio.
Also folio-ize the functions bio_iov_add_page() and
bio_iov_add_zone_append_page()
Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com>
Tested-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/20240911064935.5630-3-kundan.kumar@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Added new bio_add_hw_folio() function as a wrapper around
bio_add_hw_page(). This is a prep patch.
Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com>
Tested-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/20240911064935.5630-2-kundan.kumar@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmaeY00QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpjPGD/9CPo93+V/ztfzY1J18KhA2CCUh1uuxZIjx
dLfi07Bo+gyLwB1vaSf0bNy9gM8SzGFSMszSIDTErNq9/F6RvWjXN0CchyQf1Wii
o2UyQg8JLjT2o1pJSsdJySZQRsG/daWUHzHaX1kD343Cd6OBV2YaVFdYTaXUGg4v
G1AVh7qFvQhAIg1jV8q2z7QC7PSeuTnvyvY65Z8/iVJe95FayOrtGmDPTaJab8r2
7uEFiWZk23erzNygVdcSoNIrwWFmRARz5o3IvwJJfEL08hkdoAqu6vD2oCUZspKU
3g4wU6JrN0QYQpVwIJ9WcwYcoOm6iMm9xwCVMsp8R3KRUU107HjaiEazFDGk4HW4
ozZTa7leTXnrRqnjVhcQpUvC+1uVLCFN8sSElNY7m2dg0IojnlMz+t3lMiTtaR9N
Rt6wy5alVQFlb2uhzALuUh6HM1zA98swWySNoP0arTkOT9kjXwwAgn0I+M1s9Uxo
FaQvM0YnAsb2C8LSpNtZWLaTlRSLTzUsGThLSJMBZueIJ9+BF23i7W7euklCNxjj
Jl6CykEkEkacOxU6b9PG6qSnUq9JJ+W7gcJVing+ugAFrZDutxy6eJZXVv8wuvCC
EOxaADpSs2xAaH9V0BMmwO51w0NDWySyGPHB5UBkhNjqOji/oG3FvAITiboQArgS
FES4jtU1TA==
=dn4l
-----END PGP SIGNATURE-----
Merge tag 'for-6.11/block-post-20240722' of git://git.kernel.dk/linux
Pull block integrity mapping updates from Jens Axboe:
"A set of cleanups and fixes for the block integrity support.
Sent separately from the main block changes from last week, as they
depended on later fixes in the 6.10-rc cycle"
* tag 'for-6.11/block-post-20240722' of git://git.kernel.dk/linux:
block: don't free the integrity payload in bio_integrity_unmap_free_user
block: don't free submitter owned integrity payload on I/O completion
block: call bio_integrity_unmap_free_user from blk_rq_unmap_user
block: don't call bio_uninit from bio_endio
block: also return bio_integrity_payload * from stubs
block: split integrity support out of bio.h
Get callers out of poking into bvec internals a bit more. Not a huge win
right now, but with the proposed new DMA mapping API we might end up with
a lot more of this otherwise.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20240706075228.2350978-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit b222dd2fdd ("block: call bio_uninit in bio_endio") added a call
to bio_uninit in bio_endio to work around callers that use bio_init but
fail to call bio_uninit after they are done to release the resources.
While this is an abuse of the bio_init API we still have quite a few of
those left. But this early uninit causes a problem for integrity data,
as at least some users need the bio_integrity_payload. Right now the
only one is the NVMe passthrough which archives this by adding a special
case to skip the freeing if the BIP_INTEGRITY_USER flag is set.
Sort this out by only putting bi_blkg in bio_endio as that is the cause
of the actual leaks - the few users of the crypto context and integrity
data all properly call bio_uninit, usually through bio_put for
dynamically allocated bios.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240702151047.1746127-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Split struct bio_integrity_payload and the related prototypes out of
bio.h into a separate bio-integrity.h header so that it is only pulled
in by the few places that need it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240702151047.1746127-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmZCE4MACgkQxWXV+ddt
WDudtQ//WjXcHtY3I6NJtDhPsIOG3Qjg9mA0shp73X4djJtZoGCdgL7dq+fTp5lk
Wu6/XY5g+CSttTgwF4eyHgUSJOptKWY0XQDWxX5VR8WCM2qmUZ7SedlrBED9GNDM
rN/3egmc74OGwnqyQq3I/2qYLByXFj66tsvW3UBjLNB8vMHajjw1idj9ujipioHq
ySStPCHkPMwuhEzw9+CTe3W47VUSb5Ug3XDhAZXvxT99oDHn1m+CxKQwcona/IPH
1El8PmZ7JetaT9ZO3DICBICfCyo+2SSy/KXYypXXE+nzNZhbhC0V9N7Uqm1c91C0
aRglsJZCXmHBD4BPLvkls6CqEIvMc7FvcNCqQlrbRT6PlfX91/XaeDq4l3RUcuPn
mGShsdHUiwbPMWYVwqVUKd0IPiktF1R7yigTjYSkEFJTL6HFTrBqV/2fAMUsMfPc
8gyzYMCPQld73WmrnXZQPKvmzO/LvE0gS5cPapokGwoXstq9n3iYd4ypN0wN6sif
1jwy3efNzWXXMYV0WzcihKwFMm2fqp/pl9bXq/zwn2CunfIX4WTsaQ2NmJf81jqF
qFNjlr8S3qO7AvIOs+R2XY9E3VjfzeDADzvjpQy5J/ZYbcHBcxxdYDhg+QGhe5nB
eNmR51oL1pHSjU2M8PxATL8JxKkX2BvX6u64lVojaw4rxUlyFC0=
=MMpE
-----END PGP SIGNATURE-----
Merge tag 'for-6.10-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"This update brings a few minor performance improvements, otherwise
there's a lot of refactoring, cleanups and other sort of not user
visible changes.
Performance improvements:
- inline b-tree locking functions, improvement in metadata-heavy
changes
- relax locking on a range that's being reflinked, allows read
operations to run in parallel
- speed up NOCOW write checks (throughput +9% on a sample test)
- extent locking ranges have been reduced in several places, namely
around delayed ref processing
Core:
- more page to folio conversions:
- relocation
- send
- compression
- inline extent handling
- super block write and wait
- extent_map structure optimizations:
- reduced structure size
- code simplifications
- add shrinker for allocated objects, the numbers can go high and
could exhaust memory on smaller systems (reported) as they may
not get an opportunity to be freed fast enough
- extent locking optimizations:
- reduce locking ranges where it does not seem to be necessary and
are safe due to other means of synchronization
- potential improvements due to lower contention,
allocation/freeing and state management operations of extent
state tracking structures
- delayed ref cleanups and simplifications
- updated trace points
- improved error handling, warnings and assertions
- cleanups and refactoring, unification of error handling paths"
* tag 'for-6.10-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (122 commits)
btrfs: qgroup: fix initialization of auto inherit array
btrfs: count super block write errors in device instead of tracking folio error state
btrfs: use the folio iterator in btrfs_end_super_write()
btrfs: convert super block writes to folio in write_dev_supers()
btrfs: convert super block writes to folio in wait_dev_supers()
bio: Export bio_add_folio_nofail to modules
btrfs: remove duplicate included header from fs.h
btrfs: add a cached state to extent_clear_unlock_delalloc
btrfs: push extent lock down in submit_one_async_extent
btrfs: push lock_extent down in cow_file_range()
btrfs: move can_cow_file_range_inline() outside of the extent lock
btrfs: push lock_extent into cow_file_range_inline
btrfs: push extent lock into cow_file_range
btrfs: push extent lock into run_delalloc_cow
btrfs: remove unlock_extent from run_delalloc_compressed
btrfs: push extent lock down in run_delalloc_nocow
btrfs: adjust while loop condition in run_delalloc_nocow
btrfs: push extent lock into run_delalloc_nocow
btrfs: push the extent lock into btrfs_run_delalloc_range
btrfs: lock extent when doing inline extent in compression
...
One the one hand, it's marked EXPERIMENTAL since 2017, and looks like
there are no users since then, and no testers and no developers, it's
just not active at all.
On the other hand, even if the config is disabled, there are still many
fields in throtl_grp and throtl_data and many functions that are only
used for throtl low.
At last, currently blk-throtl is initialized during disk initialization,
and destroyed during disk removal, and it exposes many functions to be
called directly from block layer.
Remove throtl low to make code much more cleaner and follow up work much
easier.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20240509121107.3195568-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Several modules use __bio_add_page() today and may need to be converted
to bio_add_folio_nofail().
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add a helper to wait for an entire chain of bios to complete.
[hch: split from a larger patch, moved and changed the name now that it
is non-static]
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240506042027.2289826-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This is basically blk_next_bio just with the bio allocation moved
to the caller to allow for more flexible bio handling in the caller.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240506042027.2289826-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Zone write plugging implements a per-zone "plug" for write operations
to control the submission and execution order of write operations to
sequential write required zones of a zoned block device. Per-zone
plugging guarantees that at any time there is at most only one write
request per zone being executed. This mechanism is intended to replace
zone write locking which implements a similar per-zone write throttling
at the scheduler level, but is implemented only by mq-deadline.
Unlike zone write locking which operates on requests, zone write
plugging operates on BIOs. A zone write plug is simply a BIO list that
is atomically manipulated using a spinlock and a kblockd submission
work. A write BIO to a zone is "plugged" to delay its execution if a
write BIO for the same zone was already issued, that is, if a write
request for the same zone is being executed. The next plugged BIO is
unplugged and issued once the write request completes.
This mechanism allows to:
- Untangle zone write ordering from block IO schedulers. This allows
removing the restriction on using mq-deadline for writing to zoned
block devices. Any block IO scheduler, including "none" can be used.
- Zone write plugging operates on BIOs instead of requests. Plugged
BIOs waiting for execution thus do not hold scheduling tags and thus
are not preventing other BIOs from executing (reads or writes to
other zones). Depending on the workload, this can significantly
improve the device use (higher queue depth operation) and
performance.
- Both blk-mq (request based) zoned devices and BIO-based zoned devices
(e.g. device mapper) can use zone write plugging. It is mandatory
for the former but optional for the latter. BIO-based drivers can
use zone write plugging to implement write ordering guarantees, or
the drivers can implement their own if needed.
- The code is less invasive in the block layer and is mostly limited to
blk-zoned.c with some small changes in blk-mq.c, blk-merge.c and
bio.c.
Zone write plugging is implemented using struct blk_zone_wplug. This
structure includes a spinlock, a BIO list and a work structure to
handle the submission of plugged BIOs. Zone write plugs structures are
managed using a per-disk hash table.
Plugging of zone write BIOs is done using the function
blk_zone_write_plug_bio() which returns false if a BIO execution does
not need to be delayed and true otherwise. This function is called
from blk_mq_submit_bio() after a BIO is split to avoid large BIOs
spanning multiple zones which would cause mishandling of zone write
plugs. This ichange enables by default zone write plugging for any mq
request-based block device. BIO-based device drivers can also use zone
write plugging by expliclty calling blk_zone_write_plug_bio() in their
->submit_bio method. For such devices, the driver must ensure that a
BIO passed to blk_zone_write_plug_bio() is already split and not
straddling zone boundaries.
Only write and write zeroes BIOs are plugged. Zone write plugging does
not introduce any significant overhead for other operations. A BIO that
is being handled through zone write plugging is flagged using the new
BIO flag BIO_ZONE_WRITE_PLUGGING. A request handling a BIO flagged with
this new flag is flagged with the new RQF_ZONE_WRITE_PLUGGING flag.
The completion of BIOs and requests flagged trigger respectively calls
to the functions blk_zone_write_bio_endio() and
blk_zone_write_complete_request(). The latter function is used to
trigger submission of the next plugged BIO using the zone plug work.
blk_zone_write_bio_endio() does the same for BIO-based devices.
This ensures that at any time, at most one request (blk-mq devices) or
one BIO (BIO-based devices) is being executed for any zone. The
handling of zone write plugs using a per-zone plug spinlock maximizes
parallelism and device usage by allowing multiple zones to be writen
simultaneously without lock contention.
Zone write plugging ignores flush BIOs without data. Hovever, any flush
BIO that has data is always plugged so that the write part of the flush
sequence is serialized with other regular writes.
Given that any BIO handled through zone write plugging will be the only
BIO in flight for the target zone when it is executed, the unplugging
and submission of a BIO will have no chance of successfully merging with
plugged requests or requests in the scheduler. To overcome this
potential performance degradation, blk_mq_submit_bio() calls the
function blk_zone_write_plug_attempt_merge() to try to merge other
plugged BIOs with the one just unplugged and submitted. Successful
merging is signaled using blk_zone_write_plug_bio_merged(), called from
bio_attempt_back_merge(). Furthermore, to avoid recalculating the number
of segments of plugged BIOs to attempt merging, the number of segments
of a plugged BIO is saved using the new struct bio field
__bi_nr_segments. To avoid growing the size of struct bio, this field is
added as a union with the bio_cookie field. This is safe to do as
polling is always disabled for plugged BIOs.
When BIOs are plugged in a zone write plug, the device request queue
usage counter is always incremented. This reference is kept and reused
for blk-mq devices when the plugged BIO is unplugged and submitted
again using submit_bio_noacct_nocheck(). For this case, the unplugged
BIO is already flagged with BIO_ZONE_WRITE_PLUGGING and
blk_mq_submit_bio() proceeds directly to allocating a new request for
the BIO, re-using the usage reference count taken when the BIO was
plugged. This extra reference count is dropped in
blk_zone_write_plug_attempt_merge() for any plugged BIO that is
successfully merged. Given that BIO-based devices will not take this
path, the extra reference is dropped after a plugged BIO is unplugged
and submitted.
Zone write plugs are dynamically allocated and managed using a hash
table (an array of struct hlist_head) with RCU protection.
A zone write plug is allocated when a write BIO is received for the
zone and not freed until the zone is fully written, reset or finished.
To detect when a zone write plug can be freed, the write state of each
zone is tracked using a write pointer offset which corresponds to the
offset of a zone write pointer relative to the zone start. Write
operations always increment this write pointer offset. Zone reset
operations set it to 0 and zone finish operations set it to the zone
size.
If a write error happens, the wp_offset value of a zone write plug may
become incorrect and out of sync with the device managed write pointer.
This is handled using the zone write plug flag BLK_ZONE_WPLUG_ERROR.
The function blk_zone_wplug_handle_error() is called from the new disk
zone write plug work when this flag is set. This function executes a
report zone to update the zone write pointer offset to the current
value as indicated by the device. The disk zone write plug work is
scheduled whenever a BIO flagged with BIO_ZONE_WRITE_PLUGGING completes
with an error or when bio_zone_wplug_prepare_bio() detects an unaligned
write. Once scheduled, the disk zone write plugs work keeps running
until all zone errors are handled.
To match the new data structures used for zoned disks, the function
disk_free_zone_bitmaps() is renamed to the more generic
disk_free_zone_resources(). The function disk_init_zone_resources() is
also introduced to initialize zone write plugs resources when a gendisk
is allocated.
In order to guarantee that the user can simultaneously write up to a
number of zones equal to a device max active zone limit or max open zone
limit, zone write plugs are allocated using a mempool sized to the
maximum of these 2 device limits. For a device that does not have
active and open zone limits, 128 is used as the default mempool size.
If a change to the device active and open zone limits is detected, the
disk mempool is resized when blk_revalidate_disk_zones() is executed.
This commit contains contributions from Christoph Hellwig <hch@lst.de>.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Tested-by: Hans Holmberg <hans.holmberg@wdc.com>
Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20240408014128.205141-8-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This is the same in two places, and another will be added soon. Create a
helper for it.
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20240223155910.3622666-4-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When enlisting a bio into ->free_list_irq we protect the list by
disabling irqs. It's likely they're already disabled and performance of
local_irq_{save,restore}() is decent, but it's not zero cost.
Let's only use the irq cache when when we're serving a hard irq, which
allows to remove local_irq_{save,restore}(), and fall back to bio_free()
in all left cases.
Profiles indicate that the bio_put() cost is reduced by ~3.5 times
(1.76% -> 0.49%), and total throughput of a CPU bound benchmark improve
by around 1% (t/io_uring with high QD and several drives).
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/36d207540b7046c653cc16e5ff08fe7234b19f81.1707314970.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
bio_put_percpu_cache() puts all non-iopoll bios into the irq-safe list,
which entails disabling irqs. The overhead of that is not that bad when
interrupts are already off but getting worse otherwise. We can optimise
it when we're in the task context by using ->free_list directly just as
the IOPOLL path does.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4774e1a0f905f96c63174b0f3e4f79f0d9b63246.1707314970.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Restore support for passing data lifetime information from filesystems to
block drivers. This patch reverts commit b179c98f76 ("block: Remove
request.write_hint") and commit c75e707fe1 ("block: remove the
per-bio/request write hint").
This patch does not modify the size of struct bio because the new
bi_write_hint member fills a hole in struct bio. pahole reports the
following for struct bio on an x86_64 system with this patch applied:
/* size: 112, cachelines: 2, members: 20 */
/* sum members: 110, holes: 1, sum holes: 2 */
/* last cacheline: 48 bytes */
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20240202203926.2478590-7-bvanassche@acm.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
bio_add_hw_page currently always fails or succeeds. This is fine for
the existing callers that always add PAGE_SIZE worth given that the
max_segment_size and max_sectors must always allow at least a page
worth of data. But when we want to add it for bigger amounts of data
this means it can also fail when adding the data to a bio, and creating
a fallback for that becomes really annoying in the callers.
Make use of the existing API design that allows to return a smaller
length than the one passed in and add up to max_segment_size worth
of data from a larger input. All the existing callers are fine with
this - not because they handle this return correctly, but because they
never pass more than a page in.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20231204173419.782378-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reordered a check to avoid a possible overflow when adding len to bv_len.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20231204173419.782378-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The special casing was originally added in pre-git history; reproducing
the commit log here:
> commit a318a92567d77
> Author: Andrew Morton <akpm@osdl.org>
> Date: Sun Sep 21 01:42:22 2003 -0700
>
> [PATCH] Speed up direct-io hugetlbpage handling
>
> This patch short-circuits all the direct-io page dirtying logic for
> higher-order pages. Without this, we pointlessly bounce BIOs up to
> keventd all the time.
In the last twenty years, compound pages have become used for more than
just hugetlb. Rewrite these functions to operate on folios instead
of pages and remove the special case for hugetlbfs; I don't think
it's needed any more (and if it is, we can put it back in as a call
to folio_test_hugetlb()).
This was found by inspection; as far as I can tell, this bug can lead
to pages used as the destination of a direct I/O read not being marked
as dirty. If those pages are then reclaimed by the MM without being
dirtied for some other reason, they won't be written out. Then when
they're faulted back in, they will not contain the data they should.
It'll take a pretty unusual setup to produce this problem with several
races all going the wrong way.
This problem predates the folio work; it could for example have been
triggered by mmaping a THP in tmpfs and using that as the target of an
O_DIRECT read.
Fixes: 800d8c63b2 ("shmem: add huge pages support")
Cc: <stable@vger.kernel.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
bio_iov_iter_get_pages() trims the IO based on the block size of the
block device the IO will be issued to.
However, bcachefs is a multi device filesystem; when we're creating the
bio we don't yet know which block device the bio will be submitted to -
we have to handle the alignment checks elsewhere.
Thus this is needed to avoid a null ptr deref.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Link: https://lore.kernel.org/r/20230813182636.2966159-3-kent.overstreet@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
- bio_set_pages_dirty(), bio_check_pages_dirty() - dio path
- blk_status_to_str() - error messages
- bio_add_folio() - this should definitely be exported for everyone,
it's the modern version of bio_add_page()
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Cc: linux-block@vger.kernel.org
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Link: https://lore.kernel.org/r/20230813182636.2966159-2-kent.overstreet@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This will be used for multi-page configuration for integrity payload.
Cc: Christoph Hellwig <hch@lst.de>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jinyoung Choi <j-young.choi@samsung.com>
Tested-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20230803024827epcms2p838d9e9131492c86a159fff25d195658f@epcms2p8
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is no good reason to pass the bio to bio_try_merge_hw_seg. Just
pass the current bvec and rename the function to bvec_try_merge_hw_page.
This will allow reusing this function for supporting multi-page integrity
payload bvecs.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jinyoung Choi <j-young.choi@samsung.com>
Link: https://lore.kernel.org/r/20230724165433.117645-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The update of bi_size is the only thing in __bio_try_merge_page that
needs a bio. Move it to the callers, and merge __bio_try_merge_page
and page_is_mergeable into a single bvec_try_merge_page that only takes
the current bvec instead of a full bio. This will allow reusing this
function for supporting multi-page integrity payload bvecs.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jinyoung Choi <j-young.choi@samsung.com>
Link: https://lore.kernel.org/r/20230724165433.117645-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
bio_add_page already checks that there is space in bi_size a little
earlier. So after we failed to add to an existing segment, just check
that there is another one available instead of duplicating the bi_size
check.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jinyoung Choi <j-young.choi@samsung.com>
Link: https://lore.kernel.org/r/20230724165433.117645-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>