Commit Graph

1663 Commits

Author SHA1 Message Date
Linus Torvalds
beace86e61 Summary of significant series in this pull request:
- The 4 patch series "mm: ksm: prevent KSM from breaking merging of new
   VMAs" from Lorenzo Stoakes addresses an issue with KSM's
   PR_SET_MEMORY_MERGE mode: newly mapped VMAs were not eligible for
   merging with existing adjacent VMAs.
 
 - The 4 patch series "mm/damon: introduce DAMON_STAT for simple and
   practical access monitoring" from SeongJae Park adds a new kernel module
   which simplifies the setup and usage of DAMON in production
   environments.
 
 - The 6 patch series "stop passing a writeback_control to swap/shmem
   writeout" from Christoph Hellwig is a cleanup to the writeback code
   which removes a couple of pointers from struct writeback_control.
 
 - The 7 patch series "drivers/base/node.c: optimization and cleanups"
   from Donet Tom contains largely uncorrelated cleanups to the NUMA node
   setup and management code.
 
 - The 4 patch series "mm: userfaultfd: assorted fixes and cleanups" from
   Tal Zussman does some maintenance work on the userfaultfd code.
 
 - The 5 patch series "Readahead tweaks for larger folios" from Ryan
   Roberts implements some tuneups for pagecache readahead when it is
   reading into order>0 folios.
 
 - The 4 patch series "selftests/mm: Tweaks to the cow test" from Mark
   Brown provides some cleanups and consistency improvements to the
   selftests code.
 
 - The 4 patch series "Optimize mremap() for large folios" from Dev Jain
   does that.  A 37% reduction in execution time was measured in a
   memset+mremap+munmap microbenchmark.
 
 - The 5 patch series "Remove zero_user()" from Matthew Wilcox expunges
   zero_user() in favor of the more modern memzero_page().
 
 - The 3 patch series "mm/huge_memory: vmf_insert_folio_*() and
   vmf_insert_pfn_pud() fixes" from David Hildenbrand addresses some warts
   which David noticed in the huge page code.  These were not known to be
   causing any issues at this time.
 
 - The 3 patch series "mm/damon: use alloc_migrate_target() for
   DAMOS_MIGRATE_{HOT,COLD" from SeongJae Park provides some cleanup and
   consolidation work in DAMON.
 
 - The 3 patch series "use vm_flags_t consistently" from Lorenzo Stoakes
   uses vm_flags_t in places where we were inappropriately using other
   types.
 
 - The 3 patch series "mm/memfd: Reserve hugetlb folios before
   allocation" from Vivek Kasireddy increases the reliability of large page
   allocation in the memfd code.
 
 - The 14 patch series "mm: Remove pXX_devmap page table bit and pfn_t
   type" from Alistair Popple removes several now-unneeded PFN_* flags.
 
 - The 5 patch series "mm/damon: decouple sysfs from core" from SeongJae
   Park implememnts some cleanup and maintainability work in the DAMON
   sysfs layer.
 
 - The 5 patch series "madvise cleanup" from Lorenzo Stoakes does quite a
   lot of cleanup/maintenance work in the madvise() code.
 
 - The 4 patch series "madvise anon_name cleanups" from Vlastimil Babka
   provides additional cleanups on top or Lorenzo's effort.
 
 - The 11 patch series "Implement numa node notifier" from Oscar Salvador
   creates a standalone notifier for NUMA node memory state changes.
   Previously these were lumped under the more general memory on/offline
   notifier.
 
 - The 6 patch series "Make MIGRATE_ISOLATE a standalone bit" from Zi Yan
   cleans up the pageblock isolation code and fixes a potential issue which
   doesn't seem to cause any problems in practice.
 
 - The 5 patch series "selftests/damon: add python and drgn based DAMON
   sysfs functionality tests" from SeongJae Park adds additional drgn- and
   python-based DAMON selftests which are more comprehensive than the
   existing selftest suite.
 
 - The 5 patch series "Misc rework on hugetlb faulting path" from Oscar
   Salvador fixes a rather obscure deadlock in the hugetlb fault code and
   follows that fix with a series of cleanups.
 
 - The 3 patch series "cma: factor out allocation logic from
   __cma_declare_contiguous_nid" from Mike Rapoport rationalizes and cleans
   up the highmem-specific code in the CMA allocator.
 
 - The 28 patch series "mm/migration: rework movable_ops page migration
   (part 1)" from David Hildenbrand provides cleanups and
   future-preparedness to the migration code.
 
 - The 2 patch series "mm/damon: add trace events for auto-tuned
   monitoring intervals and DAMOS quota" from SeongJae Park adds some
   tracepoints to some DAMON auto-tuning code.
 
 - The 6 patch series "mm/damon: fix misc bugs in DAMON modules" from
   SeongJae Park does that.
 
 - The 6 patch series "mm/damon: misc cleanups" from SeongJae Park also
   does what it claims.
 
 - The 4 patch series "mm: folio_pte_batch() improvements" from David
   Hildenbrand cleans up the large folio PTE batching code.
 
 - The 13 patch series "mm/damon/vaddr: Allow interleaving in
   migrate_{hot,cold} actions" from SeongJae Park facilitates dynamic
   alteration of DAMON's inter-node allocation policy.
 
 - The 3 patch series "Remove unmap_and_put_page()" from Vishal Moola
   provides a couple of page->folio conversions.
 
 - The 4 patch series "mm: per-node proactive reclaim" from Davidlohr
   Bueso implements a per-node control of proactive reclaim - beyond the
   current memcg-based implementation.
 
 - The 14 patch series "mm/damon: remove damon_callback" from SeongJae
   Park replaces the damon_callback interface with a more general and
   powerful damon_call()+damos_walk() interface.
 
 - The 10 patch series "mm/mremap: permit mremap() move of multiple VMAs"
   from Lorenzo Stoakes implements a number of mremap cleanups (of course)
   in preparation for adding new mremap() functionality: newly permit the
   remapping of multiple VMAs when the user is specifying MREMAP_FIXED.  It
   still excludes some specialized situations where this cannot be
   performed reliably.
 
 - The 3 patch series "drop hugetlb_free_pgd_range()" from Anthony Yznaga
   switches some sparc hugetlb code over to the generic version and removes
   the thus-unneeded hugetlb_free_pgd_range().
 
 - The 4 patch series "mm/damon/sysfs: support periodic and automated
   stats update" from SeongJae Park augments the present
   userspace-requested update of DAMON sysfs monitoring files.  Automatic
   update is now provided, along with a tunable to control the update
   interval.
 
 - The 4 patch series "Some randome fixes and cleanups to swapfile" from
   Kemeng Shi does what is claims.
 
 - The 4 patch series "mm: introduce snapshot_page" from Luiz Capitulino
   and David Hildenbrand provides (and uses) a means by which debug-style
   functions can grab a copy of a pageframe and inspect it locklessly
   without tripping over the races inherent in operating on the live
   pageframe directly.
 
 - The 6 patch series "use per-vma locks for /proc/pid/maps reads" from
   Suren Baghdasaryan addresses the large contention issues which can be
   triggered by reads from that procfs file.  Latencies are reduced by more
   than half in some situations.  The series also introduces several new
   selftests for the /proc/pid/maps interface.
 
 - The 6 patch series "__folio_split() clean up" from Zi Yan cleans up
   __folio_split()!
 
 - The 7 patch series "Optimize mprotect() for large folios" from Dev
   Jain provides some quite large (>3x) speedups to mprotect() when dealing
   with large folios.
 
 - The 2 patch series "selftests/mm: reuse FORCE_READ to replace "asm
   volatile("" : "+r" (XXX));" and some cleanup" from wang lian does some
   cleanup work in the selftests code.
 
 - The 3 patch series "tools/testing: expand mremap testing" from Lorenzo
   Stoakes extends the mremap() selftest in several ways, including adding
   more checking of Lorenzo's recently added "permit mremap() move of
   multiple VMAs" feature.
 
 - The 22 patch series "selftests/damon/sysfs.py: test all parameters"
   from SeongJae Park extends the DAMON sysfs interface selftest so that it
   tests all possible user-requested parameters.  Rather than the present
   minimal subset.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaIqcCgAKCRDdBJ7gKXxA
 jkVBAQCCn9DR1QP0CRk961ot0cKzOgioSc0aA03DPb2KXRt2kQEAzDAz0ARurFhL
 8BzbvI0c+4tntHLXvIlrC33n9KWAOQM=
 =XsFy
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2025-07-30-15-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:
 "As usual, many cleanups. The below blurbiage describes 42 patchsets.
  21 of those are partially or fully cleanup work. "cleans up",
  "cleanup", "maintainability", "rationalizes", etc.

  I never knew the MM code was so dirty.

  "mm: ksm: prevent KSM from breaking merging of new VMAs" (Lorenzo Stoakes)
     addresses an issue with KSM's PR_SET_MEMORY_MERGE mode: newly
     mapped VMAs were not eligible for merging with existing adjacent
     VMAs.

  "mm/damon: introduce DAMON_STAT for simple and practical access monitoring" (SeongJae Park)
     adds a new kernel module which simplifies the setup and usage of
     DAMON in production environments.

  "stop passing a writeback_control to swap/shmem writeout" (Christoph Hellwig)
     is a cleanup to the writeback code which removes a couple of
     pointers from struct writeback_control.

  "drivers/base/node.c: optimization and cleanups" (Donet Tom)
     contains largely uncorrelated cleanups to the NUMA node setup and
     management code.

  "mm: userfaultfd: assorted fixes and cleanups" (Tal Zussman)
     does some maintenance work on the userfaultfd code.

  "Readahead tweaks for larger folios" (Ryan Roberts)
     implements some tuneups for pagecache readahead when it is reading
     into order>0 folios.

  "selftests/mm: Tweaks to the cow test" (Mark Brown)
     provides some cleanups and consistency improvements to the
     selftests code.

  "Optimize mremap() for large folios" (Dev Jain)
     does that. A 37% reduction in execution time was measured in a
     memset+mremap+munmap microbenchmark.

  "Remove zero_user()" (Matthew Wilcox)
     expunges zero_user() in favor of the more modern memzero_page().

  "mm/huge_memory: vmf_insert_folio_*() and vmf_insert_pfn_pud() fixes" (David Hildenbrand)
     addresses some warts which David noticed in the huge page code.
     These were not known to be causing any issues at this time.

  "mm/damon: use alloc_migrate_target() for DAMOS_MIGRATE_{HOT,COLD" (SeongJae Park)
     provides some cleanup and consolidation work in DAMON.

  "use vm_flags_t consistently" (Lorenzo Stoakes)
     uses vm_flags_t in places where we were inappropriately using other
     types.

  "mm/memfd: Reserve hugetlb folios before allocation" (Vivek Kasireddy)
     increases the reliability of large page allocation in the memfd
     code.

  "mm: Remove pXX_devmap page table bit and pfn_t type" (Alistair Popple)
     removes several now-unneeded PFN_* flags.

  "mm/damon: decouple sysfs from core" (SeongJae Park)
     implememnts some cleanup and maintainability work in the DAMON
     sysfs layer.

  "madvise cleanup" (Lorenzo Stoakes)
     does quite a lot of cleanup/maintenance work in the madvise() code.

  "madvise anon_name cleanups" (Vlastimil Babka)
     provides additional cleanups on top or Lorenzo's effort.

  "Implement numa node notifier" (Oscar Salvador)
     creates a standalone notifier for NUMA node memory state changes.
     Previously these were lumped under the more general memory
     on/offline notifier.

  "Make MIGRATE_ISOLATE a standalone bit" (Zi Yan)
     cleans up the pageblock isolation code and fixes a potential issue
     which doesn't seem to cause any problems in practice.

  "selftests/damon: add python and drgn based DAMON sysfs functionality tests" (SeongJae Park)
     adds additional drgn- and python-based DAMON selftests which are
     more comprehensive than the existing selftest suite.

  "Misc rework on hugetlb faulting path" (Oscar Salvador)
     fixes a rather obscure deadlock in the hugetlb fault code and
     follows that fix with a series of cleanups.

  "cma: factor out allocation logic from __cma_declare_contiguous_nid" (Mike Rapoport)
     rationalizes and cleans up the highmem-specific code in the CMA
     allocator.

  "mm/migration: rework movable_ops page migration (part 1)" (David Hildenbrand)
     provides cleanups and future-preparedness to the migration code.

  "mm/damon: add trace events for auto-tuned monitoring intervals and DAMOS quota" (SeongJae Park)
     adds some tracepoints to some DAMON auto-tuning code.

  "mm/damon: fix misc bugs in DAMON modules" (SeongJae Park)
     does that.

  "mm/damon: misc cleanups" (SeongJae Park)
     also does what it claims.

  "mm: folio_pte_batch() improvements" (David Hildenbrand)
     cleans up the large folio PTE batching code.

  "mm/damon/vaddr: Allow interleaving in migrate_{hot,cold} actions" (SeongJae Park)
     facilitates dynamic alteration of DAMON's inter-node allocation
     policy.

  "Remove unmap_and_put_page()" (Vishal Moola)
     provides a couple of page->folio conversions.

  "mm: per-node proactive reclaim" (Davidlohr Bueso)
     implements a per-node control of proactive reclaim - beyond the
     current memcg-based implementation.

  "mm/damon: remove damon_callback" (SeongJae Park)
     replaces the damon_callback interface with a more general and
     powerful damon_call()+damos_walk() interface.

  "mm/mremap: permit mremap() move of multiple VMAs" (Lorenzo Stoakes)
     implements a number of mremap cleanups (of course) in preparation
     for adding new mremap() functionality: newly permit the remapping
     of multiple VMAs when the user is specifying MREMAP_FIXED. It still
     excludes some specialized situations where this cannot be performed
     reliably.

  "drop hugetlb_free_pgd_range()" (Anthony Yznaga)
     switches some sparc hugetlb code over to the generic version and
     removes the thus-unneeded hugetlb_free_pgd_range().

  "mm/damon/sysfs: support periodic and automated stats update" (SeongJae Park)
     augments the present userspace-requested update of DAMON sysfs
     monitoring files. Automatic update is now provided, along with a
     tunable to control the update interval.

  "Some randome fixes and cleanups to swapfile" (Kemeng Shi)
     does what is claims.

  "mm: introduce snapshot_page" (Luiz Capitulino and David Hildenbrand)
     provides (and uses) a means by which debug-style functions can grab
     a copy of a pageframe and inspect it locklessly without tripping
     over the races inherent in operating on the live pageframe
     directly.

  "use per-vma locks for /proc/pid/maps reads" (Suren Baghdasaryan)
     addresses the large contention issues which can be triggered by
     reads from that procfs file. Latencies are reduced by more than
     half in some situations. The series also introduces several new
     selftests for the /proc/pid/maps interface.

  "__folio_split() clean up" (Zi Yan)
     cleans up __folio_split()!

  "Optimize mprotect() for large folios" (Dev Jain)
     provides some quite large (>3x) speedups to mprotect() when dealing
     with large folios.

  "selftests/mm: reuse FORCE_READ to replace "asm volatile("" : "+r" (XXX));" and some cleanup" (wang lian)
     does some cleanup work in the selftests code.

  "tools/testing: expand mremap testing" (Lorenzo Stoakes)
     extends the mremap() selftest in several ways, including adding
     more checking of Lorenzo's recently added "permit mremap() move of
     multiple VMAs" feature.

  "selftests/damon/sysfs.py: test all parameters" (SeongJae Park)
     extends the DAMON sysfs interface selftest so that it tests all
     possible user-requested parameters. Rather than the present minimal
     subset"

* tag 'mm-stable-2025-07-30-15-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (370 commits)
  MAINTAINERS: add missing headers to mempory policy & migration section
  MAINTAINERS: add missing file to cgroup section
  MAINTAINERS: add MM MISC section, add missing files to MISC and CORE
  MAINTAINERS: add missing zsmalloc file
  MAINTAINERS: add missing files to page alloc section
  MAINTAINERS: add missing shrinker files
  MAINTAINERS: move memremap.[ch] to hotplug section
  MAINTAINERS: add missing mm_slot.h file THP section
  MAINTAINERS: add missing interval_tree.c to memory mapping section
  MAINTAINERS: add missing percpu-internal.h file to per-cpu section
  mm/page_alloc: remove trace_mm_alloc_contig_migrate_range_info()
  selftests/damon: introduce _common.sh to host shared function
  selftests/damon/sysfs.py: test runtime reduction of DAMON parameters
  selftests/damon/sysfs.py: test non-default parameters runtime commit
  selftests/damon/sysfs.py: generalize DAMON context commit assertion
  selftests/damon/sysfs.py: generalize monitoring attributes commit assertion
  selftests/damon/sysfs.py: generalize DAMOS schemes commit assertion
  selftests/damon/sysfs.py: test DAMOS filters commitment
  selftests/damon/sysfs.py: generalize DAMOS scheme commit assertion
  selftests/damon/sysfs.py: test DAMOS destinations commitment
  ...
2025-07-31 14:57:54 -07:00
Linus Torvalds
6e11664f14 for-6.17/block-20250728
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmiHdZ8QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgptRED/9o3dQ1QHL5yNM/AyCCGox0V4zra8qGS/Vc
 cBWpAVrmPGRw0IYlLZENtN9PdwKcbMzJq3l6cxeC7dBnAZP0AxTzP4YYJYUNVsqo
 WtJ3d/k5+cVp0OyOp4uabaqNeMeLoPk9/JXe1Ml2KxtDmHtj5yee0JRh7zlPZmZj
 tsrpIUTeHgAPn6yR1EI+0ybx/mjCb05Mv2Y8gF5hkUPA2PuON+MTFixJmqoy2ySh
 n+22mz/prqlyOSYh/VVv1+9jcQ94wMjcW0JIpg9lM3Kg8BCPU4IetvO1UiX6X33v
 154zEh2aJJDBx+yORS4BM4JMXjRZI7lYea2dkHM8Cajctu1Wpja9bNwnK9ibXvEc
 WtyBwztleLbAZef25fA/W87JE23fGa/r3nwIb2cF4QqkAFslCvhjA93WkOzNJCgQ
 qsWOrlCh3IK2NUu4b1Ncs3ZHOPvc51+zzjMzC6SUr54xhrxDK+gngDPhRy7XDqWJ
 DTMpIlr366o8GdJqnib0/e/CPBrThS6Vl6u0tgLnNbwdpK1svgo/uHW5ksKvDqHX
 kGEIhyRRJJC+4wyl4dsYKXa2twcyFrlWdAE+pZguEC2nZRYqYl9uXftOtvfp1x0y
 /skDX0FIDjvyjRqCLcqF03FSGqwCGS8WuWXZjPhVhcfz47NvbHeFDh1G/jMzsbpj
 S9zrPve/DQ==
 =e86T
 -----END PGP SIGNATURE-----

Merge tag 'for-6.17/block-20250728' of git://git.kernel.dk/linux

Pull block updates from Jens Axboe:

 - MD pull request via Yu:
      - call del_gendisk synchronously (Xiao)
      - cleanup unused variable (John)
      - cleanup workqueue flags (Ryo)
      - fix faulty rdev can't be removed during resync (Qixing)

 - NVMe pull request via Christoph:
      - try PCIe function level reset on init failure (Keith Busch)
      - log TLS handshake failures at error level (Maurizio Lombardi)
      - pci-epf: do not complete commands twice if nvmet_req_init()
        fails (Rick Wertenbroek)
      - misc cleanups (Alok Tiwari)

 - Removal of the pktcdvd driver

   This has been more than a decade coming at this point, and some
   recently revealed breakages that had it causing issues even for cases
   where it isn't required made me re-pull the trigger on this one. It's
   known broken and nobody has stepped up to maintain the code

 - Series for ublk supporting batch commands, enabling the use of
   multishot where appropriate

 - Speed up ublk exit handling

 - Fix for the two-stage elevator fixing which could leak data

 - Convert NVMe to use the new IOVA based API

 - Increase default max transfer size to something more reasonable

 - Series fixing write operations on zoned DM devices

 - Add tracepoints for zoned block device operations

 - Prep series working towards improving blk-mq queue management in the
   presence of isolated CPUs

 - Don't allow updating of the block size of a loop device that is
   currently under exclusively ownership/open

 - Set chunk sectors from stacked device stripe size and use it for the
   atomic write size limit

 - Switch to folios in bcache read_super()

 - Fix for CD-ROM MRW exit flush handling

 - Various tweaks, fixes, and cleanups

* tag 'for-6.17/block-20250728' of git://git.kernel.dk/linux: (94 commits)
  block: restore two stage elevator switch while running nr_hw_queue update
  cdrom: Call cdrom_mrw_exit from cdrom_release function
  sunvdc: Balance device refcount in vdc_port_mpgroup_check
  nvme-pci: try function level reset on init failure
  dm: split write BIOs on zone boundaries when zone append is not emulated
  block: use chunk_sectors when evaluating stacked atomic write limits
  dm-stripe: limit chunk_sectors to the stripe size
  md/raid10: set chunk_sectors limit
  md/raid0: set chunk_sectors limit
  block: sanitize chunk_sectors for atomic write limits
  ilog2: add max_pow_of_two_factor()
  nvmet: pci-epf: Do not complete commands twice if nvmet_req_init() fails
  nvme-tcp: log TLS handshake failures at error level
  docs: nvme: fix grammar in nvme-pci-endpoint-target.rst
  nvme: fix typo in status code constant for self-test in progress
  nvmet: remove redundant assignment of error code in nvmet_ns_enable()
  nvme: fix incorrect variable in io cqes error message
  nvme: fix multiple spelling and grammar issues in host drivers
  block: fix blk_zone_append_update_request_bio() kernel-doc
  md/raid10: fix set but not used variable in sync_request_write()
  ...
2025-07-28 16:43:54 -07:00
Joanne Koong
595d7ebeaf fuse: remove page alignment check for writeback len
Remove incorrect page alignment check for the writeback len arg in
fuse_iomap_writeback_range().  len will always be block-aligned as
passed in by iomap.

On regular fuse filesystems, i_blkbits is set to PAGE_SHIFT so this is
not a problem but for fuseblk filesystems, the block size is set to a
default of 512 bytes or a block size passed in at mount time.

Please note that non-page-aligned lengths are fine for the logic in
fuse_iomap_writeback_range().  The check was originally added as a
safeguard to detect conspicuously wrong ranges.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Fixes: ef7e7cbb32 ("fuse: use iomap for writeback")
Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
Link: https://lore.kernel.org/linux-fsdevel/CA+G9fYs5AdVM-T2Tf3LciNCwLZEHetcnSkHsjZajVwwpM2HmJw@mail.gmail.com/
Reported-by: Sasha Levin <sashal@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2025-07-28 16:14:18 -07:00
Linus Torvalds
b5d760d53a vfs-6.17-rc1.iomap
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaINCtwAKCRCRxhvAZXjc
 ogPuAQChc4tCjlNp+yAwbSmuzWooKTN8PHI6v+3ftjdaKSy9AgD/Yya1i8aBYBA8
 9HBtIKGAqvcgNB3por7yN+GJ8fxb/Ag=
 =YmLL
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.17-rc1.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs iomap updates from Christian Brauner:

 - Refactor the iomap writeback code and split the generic and ioend/bio
   based writeback code.

   There are two methods that define the split between the generic
   writeback code, and the implemementation of it, and all knowledge of
   ioends and bios now sits below that layer.

 - Add fuse iomap support for buffered writes and dirty folio writeback.

   This is needed so that granular uptodate and dirty tracking can be
   used in fuse when large folios are enabled. This has two big
   advantages. For writes, instead of the entire folio needing to be
   read into the page cache, only the relevant portions need to be. For
   writeback, only the dirty portions need to be written back instead of
   the entire folio.

* tag 'vfs-6.17-rc1.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  fuse: refactor writeback to use iomap_writepage_ctx inode
  fuse: hook into iomap for invalidating and checking partial uptodateness
  fuse: use iomap for folio laundering
  fuse: use iomap for writeback
  fuse: use iomap for buffered writes
  iomap: build the writeback code without CONFIG_BLOCK
  iomap: add read_folio_range() handler for buffered writes
  iomap: improve argument passing to iomap_read_folio_sync
  iomap: replace iomap_folio_ops with iomap_write_ops
  iomap: export iomap_writeback_folio
  iomap: move folio_unlock out of iomap_writeback_folio
  iomap: rename iomap_writepage_map to iomap_writeback_folio
  iomap: move all ioend handling to ioend.c
  iomap: add public helpers for uptodate state manipulation
  iomap: hide ioends from the generic writeback code
  iomap: refactor the writeback interface
  iomap: cleanup the pending writeback tracking in iomap_writepage_map_blocks
  iomap: pass more arguments using the iomap writeback context
  iomap: header diet
2025-07-28 16:09:03 -07:00
Linus Torvalds
57fcb7d930 vfs-6.17-rc1.fileattr
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaINCpgAKCRCRxhvAZXjc
 oqfFAQDcy3rROUF3W34KcSi7rDmaKVSX53d1tUoqH+1zDRpSlwEAriKDNC1ybudp
 YAnxVzkRHjHs1296WIuwKq5lfhJ60Q4=
 =geAl
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.17-rc1.fileattr' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull fileattr updates from Christian Brauner:
 "This introduces the new file_getattr() and file_setattr() system calls
  after lengthy discussions.

  Both system calls serve as successors and extensible companions to
  the FS_IOC_FSGETXATTR and FS_IOC_FSSETXATTR system calls which have
  started to show their age in addition to being named in a way that
  makes it easy to conflate them with extended attribute related
  operations.

  These syscalls allow userspace to set filesystem inode attributes on
  special files. One of the usage examples is the XFS quota projects.

  XFS has project quotas which could be attached to a directory. All new
  inodes in these directories inherit project ID set on parent
  directory.

  The project is created from userspace by opening and calling
  FS_IOC_FSSETXATTR on each inode. This is not possible for special
  files such as FIFO, SOCK, BLK etc. Therefore, some inodes are left
  with empty project ID. Those inodes then are not shown in the quota
  accounting but still exist in the directory. This is not critical but
  in the case when special files are created in the directory with
  already existing project quota, these new inodes inherit extended
  attributes. This creates a mix of special files with and without
  attributes. Moreover, special files with attributes don't have a
  possibility to become clear or change the attributes. This, in turn,
  prevents userspace from re-creating quota project on these existing
  files.

  In addition, these new system calls allow the implementation of
  additional attributes that we couldn't or didn't want to fit into the
  legacy ioctls anymore"

* tag 'vfs-6.17-rc1.fileattr' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  fs: tighten a sanity check in file_attr_to_fileattr()
  tree-wide: s/struct fileattr/struct file_kattr/g
  fs: introduce file_getattr and file_setattr syscalls
  fs: prepare for extending file_get/setattr()
  fs: make vfs_fileattr_[get|set] return -EOPNOTSUPP
  selinux: implement inode_file_[g|s]etattr hooks
  lsm: introduce new hooks for setting/getting inode fsxattr
  fs: split fileattr related helpers into separate file
2025-07-28 15:24:14 -07:00
Linus Torvalds
7879d7aff0 vfs-6.17-rc1.misc
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaIM/KwAKCRCRxhvAZXjc
 opT+AP407JwhRSBjUEmHg5JzUyDoivkOySdnthunRjaBKD8rlgEApM6SOIZYucU7
 cPC3ZY6ORFM6Mwaw+iDW9lasM5ucHQ8=
 =CHha
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.17-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull misc VFS updates from Christian Brauner:
 "This contains the usual selections of misc updates for this cycle.

  Features:

   - Add ext4 IOCB_DONTCACHE support

     This refactors the address_space_operations write_begin() and
     write_end() callbacks to take const struct kiocb * as their first
     argument, allowing IOCB flags such as IOCB_DONTCACHE to propagate
     to the filesystem's buffered I/O path.

     Ext4 is updated to implement handling of the IOCB_DONTCACHE flag
     and advertises support via the FOP_DONTCACHE file operation flag.

     Additionally, the i915 driver's shmem write paths are updated to
     bypass the legacy write_begin/write_end interface in favor of
     directly calling write_iter() with a constructed synchronous kiocb.
     Another i915 change replaces a manual write loop with
     kernel_write() during GEM shmem object creation.

  Cleanups:

   - don't duplicate vfs_open() in kernel_file_open()

   - proc_fd_getattr(): don't bother with S_ISDIR() check

   - fs/ecryptfs: replace snprintf with sysfs_emit in show function

   - vfs: Remove unnecessary list_for_each_entry_safe() from
     evict_inodes()

   - filelock: add new locks_wake_up_waiter() helper

   - fs: Remove three arguments from block_write_end()

   - VFS: change old_dir and new_dir in struct renamedata to dentrys

   - netfs: Remove unused declaration netfs_queue_write_request()

  Fixes:

   - eventpoll: Fix semi-unbounded recursion

   - eventpoll: fix sphinx documentation build warning

   - fs/read_write: Fix spelling typo

   - fs: annotate data race between poll_schedule_timeout() and
     pollwake()

   - fs/pipe: set FMODE_NOWAIT in create_pipe_files()

   - docs/vfs: update references to i_mutex to i_rwsem

   - fs/buffer: remove comment about hard sectorsize

   - fs/buffer: remove the min and max limit checks in __getblk_slow()

   - fs/libfs: don't assume blocksize <= PAGE_SIZE in
     generic_check_addressable

   - fs_context: fix parameter name in infofc() macro

   - fs: Prevent file descriptor table allocations exceeding INT_MAX"

* tag 'vfs-6.17-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (24 commits)
  netfs: Remove unused declaration netfs_queue_write_request()
  eventpoll: fix sphinx documentation build warning
  ext4: support uncached buffered I/O
  mm/pagemap: add write_begin_get_folio() helper function
  fs: change write_begin/write_end interface to take struct kiocb *
  drm/i915: Refactor shmem_pwrite() to use kiocb and write_iter
  drm/i915: Use kernel_write() in shmem object create
  eventpoll: Fix semi-unbounded recursion
  vfs: Remove unnecessary list_for_each_entry_safe() from evict_inodes()
  fs/libfs: don't assume blocksize <= PAGE_SIZE in generic_check_addressable
  fs/buffer: remove the min and max limit checks in __getblk_slow()
  fs: Prevent file descriptor table allocations exceeding INT_MAX
  fs: Remove three arguments from block_write_end()
  fs/ecryptfs: replace snprintf with sysfs_emit in show function
  fs: annotate suspected data race between poll_schedule_timeout() and pollwake()
  docs/vfs: update references to i_mutex to i_rwsem
  fs/buffer: remove comment about hard sectorsize
  fs_context: fix parameter name in infofc() macro
  VFS: change old_dir and new_dir in struct renamedata to dentrys
  proc_fd_getattr(): don't bother with S_ISDIR() check
  ...
2025-07-28 11:22:56 -07:00
Linus Torvalds
1959e18cc0 Removing subtrees of kernel filesystems is done in quite a few
places; unfortunately, it's easy to get wrong.  A number of open-coded
 attempts are out there, with varying amount of bogosities.
 
 	simple_recursive_removal() had been introduced for doing that with
 all precautions needed; it does an equivalent of rm -rf, with sufficient
 locking, eviction of anything mounted on top of the subtree, etc.
 
 	This series converts a bunch of open-coded instances to using that.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCaIRCUwAKCRBZ7Krx/gZQ
 66XWAP9BNyHcvl9uV/ku/mswYiRBxYoVogciIKeugwYTVLuTJgEA7jdh1eyLkvbS
 rwbL7XD+Q35/vXZHEet+RLCGH3ae6wc=
 =yaKF
 -----END PGP SIGNATURE-----

Merge tag 'pull-simple_recursive_removal' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull simple_recursive_removal() update from Al Viro:
 "Removing subtrees of kernel filesystems is done in quite a few places;
  unfortunately, it's easy to get wrong. A number of open-coded attempts
  are out there, with varying amount of bogosities.

  simple_recursive_removal() had been introduced for doing that with all
  precautions needed; it does an equivalent of rm -rf, with sufficient
  locking, eviction of anything mounted on top of the subtree, etc.

  This series converts a bunch of open-coded instances to using that"

* tag 'pull-simple_recursive_removal' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  functionfs, gadgetfs: use simple_recursive_removal()
  kill binderfs_remove_file()
  fuse_ctl: use simple_recursive_removal()
  pstore: switch to locked_recursive_removal()
  binfmt_misc: switch to locked_recursive_removal()
  spufs: switch to locked_recursive_removal()
  add locked_recursive_removal()
  better lockdep annotations for simple_recursive_removal()
  simple_recursive_removal(): saner interaction with fsnotify
2025-07-28 09:43:51 -07:00
Linus Torvalds
11fe69fbd5 Current exclusion rules for ->d_flags stores are rather unpleasant.
The basic rules are simple:
 	* stores to dentry->d_flags are OK under dentry->d_lock.
 	* stores to dentry->d_flags are OK in the dentry constructor, before
 becomes potentially visible to other threads.
 Unfortunately, there's a couple of exceptions to that, and that's where the
 headache comes from.
 
 	Main PITA comes from d_set_d_op(); that primitive sets ->d_op
 of dentry and adjusts the flags that correspond to presence of individual
 methods.  It's very easy to misuse; existing uses _are_ safe, but proof
 of correctness is brittle.
 
 	Use in __d_alloc() is safe (we are within a constructor), but we
 might as well precalculate the initial value of ->d_flags when we set
 the default ->d_op for given superblock and set ->d_flags directly
 instead of messing with that helper.
 
 	The reasons why other uses are safe are bloody convoluted; I'm not going
 to reproduce it here.  See https://lore.kernel.org/all/20250224010624.GT1977892@ZenIV/
 for gory details, if you care.  The critical part is using d_set_d_op() only
 just prior to d_splice_alias(), which makes a combination of d_splice_alias()
 with setting ->d_op, etc. a natural replacement primitive.  Better yet, if
 we go that way, it's easy to take setting ->d_op and modifying ->d_flags
 under ->d_lock, which eliminates the headache as far as ->d_flags exclusion
 rules are concerned.  Other exceptions are minor and easy to deal with.
 
 	What this series does:
 * d_set_d_op() is no longer available; new primitive (d_splice_alias_ops())
 is provided, equivalent to combination of d_set_d_op() and d_splice_alias().
 * new field of struct super_block - ->s_d_flags.  Default value of ->d_flags
 to be used when allocating dentries on this filesystem.
 * new primitive for setting ->s_d_op: set_default_d_op().  Replaces stores
 to ->s_d_op at mount time.  All in-tree filesystems converted; out-of-tree
 ones will get caught by compiler (->s_d_op is renamed, so stores to it will
 be caught).  ->s_d_flags is set by the same primitive to match the ->s_d_op.
 * a lot of filesystems had ->s_d_op->d_delete equal to always_delete_dentry;
 that is equivalent to setting DCACHE_DONTCACHE in ->d_flags, so such filesystems
 can bloody well set that bit in ->s_d_flags and drop ->d_delete() from
 dentry_operations.  In quite a few cases that results in empty dentry_operations,
 which means that we can get rid of those.
 * kill simple_dentry_operations - not needed anymore.
 * massage d_alloc_parallel() to get rid of the other exception wrt ->d_flags
 stores - we can set DCACHE_PAR_LOOKUP as soon as we allocate the new dentry;
 no need to delay that until we commit to using the sucker.
 
 As the result, ->d_flags stores are all either under ->d_lock or done before
 the dentry becomes visible in any shared data structures.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCaIQ/tQAKCRBZ7Krx/gZQ
 66AhAQDgQ+S224x5YevNXc9mDoGUBMF4OG0n0fIla9rfdL4I6wEAqpOWMNDcVPCZ
 GwYOvJ9YuqNdz+MyprAI18Yza4GOmgs=
 =rTYB
 -----END PGP SIGNATURE-----

Merge tag 'pull-dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull dentry d_flags updates from Al Viro:
 "The current exclusion rules for dentry->d_flags stores are rather
  unpleasant. The basic rules are simple:

   - stores to dentry->d_flags are OK under dentry->d_lock

   - stores to dentry->d_flags are OK in the dentry constructor, before
     becomes potentially visible to other threads

  Unfortunately, there's a couple of exceptions to that, and that's
  where the headache comes from.

  The main PITA comes from d_set_d_op(); that primitive sets ->d_op of
  dentry and adjusts the flags that correspond to presence of individual
  methods. It's very easy to misuse; existing uses _are_ safe, but proof
  of correctness is brittle.

  Use in __d_alloc() is safe (we are within a constructor), but we might
  as well precalculate the initial value of 'd_flags' when we set the
  default ->d_op for given superblock and set 'd_flags' directly instead
  of messing with that helper.

  The reasons why other uses are safe are bloody convoluted; I'm not
  going to reproduce it here. See [1] for gory details, if you care. The
  critical part is using d_set_d_op() only just prior to
  d_splice_alias(), which makes a combination of d_splice_alias() with
  setting ->d_op, etc a natural replacement primitive.

  Better yet, if we go that way, it's easy to take setting ->d_op and
  modifying 'd_flags' under ->d_lock, which eliminates the headache as
  far as 'd_flags' exclusion rules are concerned. Other exceptions are
  minor and easy to deal with.

  What this series does:

   - d_set_d_op() is no longer available; instead a new primitive
     (d_splice_alias_ops()) is provided, equivalent to combination of
     d_set_d_op() and d_splice_alias().

   - new field of struct super_block - 's_d_flags'. This sets the
     default value of 'd_flags' to be used when allocating dentries on
     this filesystem.

   - new primitive for setting 's_d_op': set_default_d_op(). This
     replaces stores to 's_d_op' at mount time.

     All in-tree filesystems converted; out-of-tree ones will get caught
     by the compiler ('s_d_op' is renamed, so stores to it will be
     caught). 's_d_flags' is set by the same primitive to match the
     's_d_op'.

   - a lot of filesystems had sb->s_d_op->d_delete equal to
     always_delete_dentry; that is equivalent to setting
     DCACHE_DONTCACHE in 'd_flags', so such filesystems can bloody well
     set that bit in 's_d_flags' and drop 'd_delete()' from
     dentry_operations.

     In quite a few cases that results in empty dentry_operations, which
     means that we can get rid of those.

   - kill simple_dentry_operations - not needed anymore

   - massage d_alloc_parallel() to get rid of the other exception wrt
     'd_flags' stores - we can set DCACHE_PAR_LOOKUP as soon as we
     allocate the new dentry; no need to delay that until we commit to
     using the sucker.

  As the result, 'd_flags' stores are all either under ->d_lock or done
  before the dentry becomes visible in any shared data structures"

Link: https://lore.kernel.org/all/20250224010624.GT1977892@ZenIV/ [1]

* tag 'pull-dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (21 commits)
  configfs: use DCACHE_DONTCACHE
  debugfs: use DCACHE_DONTCACHE
  efivarfs: use DCACHE_DONTCACHE instead of always_delete_dentry()
  9p: don't bother with always_delete_dentry
  ramfs, hugetlbfs, mqueue: set DCACHE_DONTCACHE
  kill simple_dentry_operations
  devpts, sunrpc, hostfs: don't bother with ->d_op
  shmem: no dentry retention past the refcount reaching zero
  d_alloc_parallel(): set DCACHE_PAR_LOOKUP earlier
  make d_set_d_op() static
  simple_lookup(): just set DCACHE_DONTCACHE
  tracefs: Add d_delete to remove negative dentries
  set_default_d_op(): calculate the matching value for ->d_flags
  correct the set of flags forbidden at d_set_d_op() time
  split d_flags calculation out of d_set_d_op()
  new helper: set_default_d_op()
  fuse: no need for special dentry_operations for root dentry
  switch procfs from d_set_d_op() to d_splice_alias_ops()
  new helper: d_splice_alias_ops()
  procfs: kill ->proc_dops
  ...
2025-07-28 09:17:57 -07:00
Joanne Koong
6e2f4d8a61
fuse: refactor writeback to use iomap_writepage_ctx inode
struct iomap_writepage_ctx includes a pointer to the file inode. In
writeback, use that instead of also passing the inode into
fuse_fill_wb_data.

No functional changes.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://lore.kernel.org/20250715202122.2282532-6-joannelkoong@gmail.com
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-17 09:55:19 +02:00
Joanne Koong
707c5d3471
fuse: hook into iomap for invalidating and checking partial uptodateness
Hook into iomap_invalidate_folio() so that if the entire folio is being
invalidated during truncation, the dirty state is cleared and the folio
doesn't get written back. As well the folio's corresponding ifs struct
will get freed.

Hook into iomap_is_partially_uptodate() since iomap tracks uptodateness
granularly when it does buffered writes.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://lore.kernel.org/20250715202122.2282532-5-joannelkoong@gmail.com
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-17 09:55:19 +02:00
Joanne Koong
1097a87dcb
fuse: use iomap for folio laundering
Use iomap for folio laundering, which will do granular dirty
writeback when laundering a large folio.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://lore.kernel.org/20250715202122.2282532-4-joannelkoong@gmail.com
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-17 09:55:18 +02:00
Joanne Koong
ef7e7cbb32
fuse: use iomap for writeback
Use iomap for dirty folio writeback in ->writepages().
This allows for granular dirty writeback of large folios.

Only the dirty portions of the large folio will be written instead of
having to write out the entire folio. For example if there is a 1 MB
large folio and only 2 bytes in it are dirty, only the page for those
dirty bytes will be written out.

.dirty_folio needs to be set to iomap_dirty_folio so that the bitmap
iomap uses for dirty tracking correctly reflects dirty regions that need
to be written back.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://lore.kernel.org/20250715202122.2282532-3-joannelkoong@gmail.com
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-17 09:55:18 +02:00
Joanne Koong
a4c9ab1d49
fuse: use iomap for buffered writes
Have buffered writes go through iomap. This has two advantages:
* granular large folio synchronous reads
* granular large folio dirty tracking

If for example there is a 1 MB large folio and a write issued at pos 1
to pos 1 MB - 2, only the head and tail pages will need to be read in
and marked uptodate instead of the entire folio needing to be read in.
Non-relevant trailing pages are also skipped (eg if for a 1 MB large
folio a write is issued at pos 1 to 4099, only the first two pages are
read in and the ones after that are skipped).

iomap also has granular dirty tracking. This is useful in that when it
comes to writeback time, only the dirty portions of the large folio will
be written instead of having to write out the entire folio. For example
if there is a 1 MB large folio and only 2 bytes in it are dirty, only
the page for those dirty bytes get written out. Please note that
granular writeback is only done once fuse also uses iomap in writeback
(separate commit).

.release_folio needs to be set to iomap_release_folio so that any
allocated iomap ifs structs get freed.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://lore.kernel.org/20250715202122.2282532-2-joannelkoong@gmail.com
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-17 09:55:15 +02:00
Taotao Chen
e9d8e2bf23
fs: change write_begin/write_end interface to take struct kiocb *
Change the address_space_operations callbacks write_begin() and
write_end() to take struct kiocb * as the first argument instead of
struct file *.

Update all affected function prototypes, implementations, call sites,
and related documentation across VFS, filesystems, and block layer.

Part of a series refactoring address_space_operations write_begin and
write_end callbacks to use struct kiocb for passing write context and
flags.

Signed-off-by: Taotao Chen <chentaotao@didiglobal.com>
Link: https://lore.kernel.org/20250716093559.217344-4-chentaotao@didiglobal.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-16 14:48:18 +02:00
Alistair Popple
21aa65bf82 mm: remove callers of pfn_t functionality
All PFN_* pfn_t flags have been removed.  Therefore there is no longer a
need for the pfn_t type and all uses can be replaced with normal pfns.

Link: https://lkml.kernel.org/r/bbedfa576c9822f8032494efbe43544628698b1f.1750323463.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Balbir Singh <balbirs@nvidia.com>
Cc: Björn Töpel <bjorn@kernel.org>
Cc: Björn Töpel <bjorn@rivosinc.com>
Cc: Chunyan Zhang <zhang.lyra@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Deepak Gupta <debug@rivosinc.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Inki Dae <m.szyprowski@samsung.com>
Cc: John Groves <john@groves.net>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09 22:42:19 -07:00
Christoph Hellwig
a8fb49c6ab mm: remove the for_reclaim field from struct writeback_control
This field is now only set to one in the i915 gem code that only calls
writeback_iter on it, which ignores the flag.  All other checks are thuse
dead code and the field can be removed.

Link: https://lkml.kernel.org/r/20250610054959.2057526-7-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09 22:41:58 -07:00
Linus Torvalds
2eb7f03acf vfs-6.16-rc5.fixes
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaGeHBAAKCRCRxhvAZXjc
 omJNAQCnHIDuiscCUFeevb5sMNqws6td2kexX8reLxbdzzTrFgEAwAKxy5BVhNlg
 NusCZ2taYmenAK+HjI3JEw6c/3IKqwE=
 =NxGx
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.16-rc5.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs fixes from Christian Brauner:

 - Fix a regression caused by the anonymous inode rework. Making them
   regular files causes various places in the kernel to tip over
   starting with io_uring.

   Revert to the former status quo and port our assertion to be based on
   checking the inode so we don't lose the valuable VFS_*_ON_*()
   assertions that have already helped discover weird behavior our
   outright bugs.

 - Fix the the upper bound calculation in fuse_fill_write_pages()

 - Fix priority inversion issues in the eventpoll code

 - Make secretmen use anon_inode_make_secure_inode() to avoid bypassing
   the LSM layer

 - Fix a netfs hang due to missing case in final DIO read result
   collection

 - Fix a double put of the netfs_io_request struct

 - Provide some helpers to abstract out NETFS_RREQ_IN_PROGRESS flag
   wrangling

 - Fix infinite looping in netfs_wait_for_pause/request()

 - Fix a netfs ref leak on an extra subrequest inserted into a request's
   list of subreqs

 - Fix various cifs RPC callbacks to set NETFS_SREQ_NEED_RETRY if a
   subrequest fails retriably

 - Fix a cifs warning in the workqueue code when reconnecting a channel

 - Fix the updating of i_size in netfs to avoid a race between testing
   if we should have extended the file with a DIO write and changing
   i_size

 - Merge the places in netfs that update i_size on write

 - Fix coredump socket selftests

* tag 'vfs-6.16-rc5.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  anon_inode: rework assertions
  netfs: Update tracepoints in a number of ways
  netfs: Renumber the NETFS_RREQ_* flags to make traces easier to read
  netfs: Merge i_size update functions
  netfs: Fix i_size updating
  smb: client: set missing retry flag in cifs_writev_callback()
  smb: client: set missing retry flag in cifs_readv_callback()
  smb: client: set missing retry flag in smb2_writev_callback()
  netfs: Fix ref leak on inserted extra subreq in write retry
  netfs: Fix looping in wait functions
  netfs: Provide helpers to perform NETFS_RREQ_IN_PROGRESS flag wangling
  netfs: Fix double put of request
  netfs: Fix hang due to missing case in final DIO read result collection
  eventpoll: Fix priority inversion problem
  fuse: fix fuse_fill_write_pages() upper bound calculation
  fs: export anon_inode_make_secure_inode() and fix secretmem LSM bypass
  selftests/coredump: Fix "socket_detect_userspace_client" test failure
2025-07-04 09:06:49 -07:00
Christian Brauner
ca115d7e75
tree-wide: s/struct fileattr/struct file_kattr/g
Now that we expose struct file_attr as our uapi struct rename all the
internal struct to struct file_kattr to clearly communicate that it is a
kernel internal struct. This is similar to struct mount_{k}attr and
others.

Link: https://lore.kernel.org/20250703-restlaufzeit-baurecht-9ed44552b481@brauner
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-04 16:14:39 +02:00
Al Viro
fcaac5b427 fuse_ctl: use simple_recursive_removal()
easier that way - no need to keep that array of dentry references, etc.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-07-02 22:36:52 -04:00
Andrey Albershteyn
474b155adf
fs: make vfs_fileattr_[get|set] return -EOPNOTSUPP
Future patches will add new syscalls which use these functions. As
this interface won't be used for ioctls only, the EOPNOSUPP is more
appropriate return code.

This patch converts return code from ENOIOCTLCMD to EOPNOSUPP for
vfs_fileattr_get and vfs_fileattr_set. To save old behavior translate
EOPNOSUPP back for current users - overlayfs, encryptfs and fs/ioctl.c.

Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
Link: https://lore.kernel.org/20250630-xattrat-syscall-v6-4-c4e3bc35227b@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-02 14:29:10 +02:00
Daniel Wagner
b6139a6abf lib/group_cpus: Let group_cpu_evenly() return the number of initialized masks
group_cpu_evenly() might have allocated less groups then requested:

group_cpu_evenly()
  __group_cpus_evenly()
    alloc_nodes_groups()
      # allocated total groups may be less than numgrps when
      # active total CPU number is less then numgrps

In this case, the caller will do an out of bound access because the
caller assumes the masks returned has numgrps.

Return the number of groups created so the caller can limit the access
range accordingly.

Acked-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Daniel Wagner <wagi@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250617-isolcpus-queue-counters-v1-1-13923686b54b@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-01 10:24:11 -06:00
Haiyue Wang
befd9a71d8 fuse: fix runtime warning on truncate_folio_batch_exceptionals()
The WARN_ON_ONCE is introduced on truncate_folio_batch_exceptionals() to
capture whether the filesystem has removed all DAX entries or not.

And the fix has been applied on the filesystem xfs and ext4 by the commit
0e2f80afcf ("fs/dax: ensure all pages are idle prior to filesystem
unmount").

Apply the missed fix on filesystem fuse to fix the runtime warning:

[    2.011450] ------------[ cut here ]------------
[    2.011873] WARNING: CPU: 0 PID: 145 at mm/truncate.c:89 truncate_folio_batch_exceptionals+0x272/0x2b0
[    2.012468] Modules linked in:
[    2.012718] CPU: 0 UID: 1000 PID: 145 Comm: weston Not tainted 6.16.0-rc2-WSL2-STABLE #2 PREEMPT(undef)
[    2.013292] RIP: 0010:truncate_folio_batch_exceptionals+0x272/0x2b0
[    2.013704] Code: 48 63 d0 41 29 c5 48 8d 1c d5 00 00 00 00 4e 8d 6c 2a 01 49 c1 e5 03 eb 09 48 83 c3 08 49 39 dd 74 83 41 f6 44 1c 08 01 74 ef <0f> 0b 49 8b 34 1e 48 89 ef e8 10 a2 17 00 eb df 48 8b 7d 00 e8 35
[    2.014845] RSP: 0018:ffffa47ec33f3b10 EFLAGS: 00010202
[    2.015279] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[    2.015884] RDX: 0000000000000000 RSI: ffffa47ec33f3ca0 RDI: ffff98aa44f3fa80
[    2.016377] RBP: ffff98aa44f3fbf0 R08: ffffa47ec33f3ba8 R09: 0000000000000000
[    2.016942] R10: 0000000000000001 R11: 0000000000000000 R12: ffffa47ec33f3ca0
[    2.017437] R13: 0000000000000008 R14: ffffa47ec33f3ba8 R15: 0000000000000000
[    2.017972] FS:  000079ce006afa40(0000) GS:ffff98aade441000(0000) knlGS:0000000000000000
[    2.018510] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    2.018987] CR2: 000079ce03e74000 CR3: 000000010784f006 CR4: 0000000000372eb0
[    2.019518] Call Trace:
[    2.019729]  <TASK>
[    2.019901]  truncate_inode_pages_range+0xd8/0x400
[    2.020280]  ? timerqueue_add+0x66/0xb0
[    2.020574]  ? get_nohz_timer_target+0x2a/0x140
[    2.020904]  ? timerqueue_add+0x66/0xb0
[    2.021231]  ? timerqueue_del+0x2e/0x50
[    2.021646]  ? __remove_hrtimer+0x39/0x90
[    2.022017]  ? srso_alias_untrain_ret+0x1/0x10
[    2.022497]  ? psi_group_change+0x136/0x350
[    2.023046]  ? _raw_spin_unlock+0xe/0x30
[    2.023514]  ? finish_task_switch.isra.0+0x8d/0x280
[    2.024068]  ? __schedule+0x532/0xbd0
[    2.024551]  fuse_evict_inode+0x29/0x190
[    2.025131]  evict+0x100/0x270
[    2.025641]  ? _atomic_dec_and_lock+0x39/0x50
[    2.026316]  ? __pfx_generic_delete_inode+0x10/0x10
[    2.026843]  __dentry_kill+0x71/0x180
[    2.027335]  dput+0xeb/0x1b0
[    2.027725]  __fput+0x136/0x2b0
[    2.028054]  __x64_sys_close+0x3d/0x80
[    2.028469]  do_syscall_64+0x6d/0x1b0
[    2.028832]  ? clear_bhb_loop+0x30/0x80
[    2.029182]  ? clear_bhb_loop+0x30/0x80
[    2.029533]  ? clear_bhb_loop+0x30/0x80
[    2.029902]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[    2.030423] RIP: 0033:0x79ce03d0d067
[    2.030820] Code: b8 ff ff ff ff e9 3e ff ff ff 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 41 c3 48 83 ec 18 89 7c 24 0c e8 c3 a7 f8 ff
[    2.032354] RSP: 002b:00007ffef0498948 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[    2.032939] RAX: ffffffffffffffda RBX: 00007ffef0498960 RCX: 000079ce03d0d067
[    2.033612] RDX: 0000000000000003 RSI: 0000000000001000 RDI: 000000000000000d
[    2.034289] RBP: 00007ffef0498a30 R08: 000000000000000d R09: 0000000000000000
[    2.034944] R10: 00007ffef0498978 R11: 0000000000000246 R12: 0000000000000001
[    2.035610] R13: 00007ffef0498960 R14: 000079ce03e09ce0 R15: 0000000000000003
[    2.036301]  </TASK>
[    2.036532] ---[ end trace 0000000000000000 ]---

Link: https://lkml.kernel.org/r/20250621171507.3770-1-haiyuewa@163.com
Fixes: bde708f1a6 ("fs/dax: always remove DAX page-cache entries when breaking layouts")
Signed-off-by: Haiyue Wang <haiyuewa@163.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-25 15:55:03 -07:00
Joanne Koong
dbee298cb7
fuse: fix fuse_fill_write_pages() upper bound calculation
This fixes a bug in commit 63c69ad3d1 ("fuse: refactor
fuse_fill_write_pages()") where max_pages << PAGE_SHIFT is mistakenly
used as the calculation for the max_pages upper limit but there's the
possibility that copy_folio_from_iter_atomic() may copy over bytes
from the iov_iter that are less than the full length of the folio,
which would lead to exceeding max_pages.

This commit fixes it by adding a 'ap->num_folios < max_folios' check.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://lore.kernel.org/20250614000114.910380-1-joannelkoong@gmail.com
Fixes: 63c69ad3d1 ("fuse: refactor fuse_fill_write_pages()")
Tested-by: Brian Foster <bfoster@redhat.com>
Reported-by: Brian Foster <bfoster@redhat.com>
Closes: https://lore.kernel.org/linux-fsdevel/aEq4haEQScwHIWK6@bfoster/
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-24 11:07:19 +02:00
Al Viro
05fb0e6664 new helper: set_default_d_op()
... to be used instead of manually assigning to ->s_d_op.
All in-tree filesystem converted (and field itself is renamed,
so any out-of-tree ones in need of conversion will be caught
by compiler).

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-10 22:21:16 -04:00
Al Viro
4bd9f3fd87 fuse: no need for special dentry_operations for root dentry
->d_revalidate() is never called for root anyway...

Reviewed-by: Christian Brauner <brauner@kernel.org>
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-10 22:15:18 -04:00
Linus Torvalds
2619a6d413 fuse update for 6.16
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCaD2Y1wAKCRDh3BK/laaZ
 PFSHAP4q1+mOlQfZJPH/PFDwa+F0QW/uc3szXatS0888nxui/gEAsIeyyJlf+Mr8
 /1JPXxCqcapRFw9xsS0zioiK54Elfww=
 =2KxA
 -----END PGP SIGNATURE-----

Merge tag 'fuse-update-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse

Pull fuse updates from Miklos Szeredi:

 - Remove tmp page copying in writeback path (Joanne).

   This removes ~300 lines and with that a lot of complexity related to
   avoiding reclaim related deadlock. The old mechanism is replaced with
   a mapping flag that tells the MM not to block reclaim waiting for
   writeback to complete. The MM parts have been reviewed/acked by
   respective maintainers.

 - Convert more code to handle large folios (Joanne). This still just
   adds the code to deal with large folios and does not enable them yet.

 - Allow invalidating all cached lookups atomically (Luis Henriques).
   This feature is useful for CernVMFS, which currently does this
   iteratively.

 - Align write prefaulting in fuse with generic one (Dave Hansen)

 - Fix race causing invalid data to be cached when setting attributes on
   different nodes of a distributed fs (Guang Yuan Wu)

 - Update documentation for passthrough (Chen Linxuan)

 - Add fdinfo about the device number associated with an opened
   /dev/fuse instance (Chen Linxuan)

 - Increase readdir buffer size (Miklos). This depends on a patch to VFS
   readdir code that was already merged through Christians tree.

 - Optimize io-uring request expiration (Joanne)

 - Misc cleanups

* tag 'fuse-update-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (25 commits)
  fuse: increase readdir buffer size
  readdir: supply dir_context.count as readdir buffer size hint
  fuse: don't allow signals to interrupt getdents copying
  fuse: support large folios for writeback
  fuse: support large folios for readahead
  fuse: support large folios for queued writes
  fuse: support large folios for stores
  fuse: support large folios for symlinks
  fuse: support large folios for folio reads
  fuse: support large folios for writethrough writes
  fuse: refactor fuse_fill_write_pages()
  fuse: support large folios for retrieves
  fuse: support copying large folios
  fs: fuse: add dev id to /dev/fuse fdinfo
  docs: filesystems: add fuse-passthrough.rst
  MAINTAINERS: update filter of FUSE documentation
  fuse: fix race between concurrent setattrs from multiple nodes
  fuse: remove tmp folio for writebacks and internal rb tree
  mm: skip folio reclaim in legacy memcg contexts for deadlockable mappings
  fuse: optimize over-io-uring request expiration check
  ...
2025-06-02 15:31:05 -07:00
Linus Torvalds
00c010e130 - The 11 patch series "Add folio_mk_pte()" from Matthew Wilcox
simplifies the act of creating a pte which addresses the first page in a
   folio and reduces the amount of plumbing which architecture must
   implement to provide this.
 
 - The 8 patch series "Misc folio patches for 6.16" from Matthew Wilcox
   is a shower of largely unrelated folio infrastructure changes which
   clean things up and better prepare us for future work.
 
 - The 3 patch series "memory,x86,acpi: hotplug memory alignment
   advisement" from Gregory Price adds early-init code to prevent x86 from
   leaving physical memory unused when physical address regions are not
   aligned to memory block size.
 
 - The 2 patch series "mm/compaction: allow more aggressive proactive
   compaction" from Michal Clapinski provides some tuning of the (sadly,
   hard-coded (more sadly, not auto-tuned)) thresholds for our invokation
   of proactive compaction.  In a simple test case, the reduction of a guest
   VM's memory consumption was dramatic.
 
 - The 8 patch series "Minor cleanups and improvements to swap freeing
   code" from Kemeng Shi provides some code cleaups and a small efficiency
   improvement to this part of our swap handling code.
 
 - The 6 patch series "ptrace: introduce PTRACE_SET_SYSCALL_INFO API"
   from Dmitry Levin adds the ability for a ptracer to modify syscalls
   arguments.  At this time we can alter only "system call information that
   are used by strace system call tampering, namely, syscall number,
   syscall arguments, and syscall return value.
 
   This series should have been incorporated into mm.git's "non-MM"
   branch, but I goofed.
 
 - The 3 patch series "fs/proc: extend the PAGEMAP_SCAN ioctl to report
   guard regions" from Andrei Vagin extends the info returned by the
   PAGEMAP_SCAN ioctl against /proc/pid/pagemap.  This permits CRIU to more
   efficiently get at the info about guard regions.
 
 - The 2 patch series "Fix parameter passed to page_mapcount_is_type()"
   from Gavin Shan implements that fix.  No runtime effect is expected
   because validate_page_before_insert() happens to fix up this error.
 
 - The 3 patch series "kernel/events/uprobes: uprobe_write_opcode()
   rewrite" from David Hildenbrand basically brings uprobe text poking into
   the current decade.  Remove a bunch of hand-rolled implementation in
   favor of using more current facilities.
 
 - The 3 patch series "mm/ptdump: Drop assumption that pxd_val() is u64"
   from Anshuman Khandual provides enhancements and generalizations to the
   pte dumping code.  This might be needed when 128-bit Page Table
   Descriptors are enabled for ARM.
 
 - The 12 patch series "Always call constructor for kernel page tables"
   from Kevin Brodsky "ensures that the ctor/dtor is always called for
   kernel pgtables, as it already is for user pgtables".  This permits the
   addition of more functionality such as "insert hooks to protect page
   tables".  This change does result in various architectures performing
   unnecesary work, but this is fixed up where it is anticipated to occur.
 
 - The 9 patch series "Rust support for mm_struct, vm_area_struct, and
   mmap" from Alice Ryhl adds plumbing to permit Rust access to core MM
   structures.
 
 - The 3 patch series "fix incorrectly disallowed anonymous VMA merges"
   from Lorenzo Stoakes takes advantage of some VMA merging opportunities
   which we've been missing for 15 years.
 
 - The 4 patch series "mm/madvise: batch tlb flushes for MADV_DONTNEED
   and MADV_FREE" from SeongJae Park optimizes process_madvise()'s TLB
   flushing.  Instead of flushing each address range in the provided iovec,
   we batch the flushing across all the iovec entries.  The syscall's cost
   was approximately halved with a microbenchmark which was designed to
   load this particular operation.
 
 - The 6 patch series "Track node vacancy to reduce worst case allocation
   counts" from Sidhartha Kumar makes the maple tree smarter about its node
   preallocation.  stress-ng mmap performance increased by single-digit
   percentages and the amount of unnecessarily preallocated memory was
   dramaticelly reduced.
 
 - The 3 patch series "mm/gup: Minor fix, cleanup and improvements" from
   Baoquan He removes a few unnecessary things which Baoquan noted when
   reading the code.
 
 - The 3 patch series ""Enhance sysfs handling for memory hotplug in
   weighted interleave" from Rakie Kim "enhances the weighted interleave
   policy in the memory management subsystem by improving sysfs handling,
   fixing memory leaks, and introducing dynamic sysfs updates for memory
   hotplug support".  Fixes things on error paths which we are unlikely to
   hit.
 
 - The 7 patch series "mm/damon: auto-tune DAMOS for NUMA setups
   including tiered memory" from SeongJae Park introduces new DAMOS quota
   goal metrics which eliminate the manual tuning which is required when
   utilizing DAMON for memory tiering.
 
 - The 5 patch series "mm/vmalloc.c: code cleanup and improvements" from
   Baoquan He provides cleanups and small efficiency improvements which
   Baoquan found via code inspection.
 
 - The 2 patch series "vmscan: enforce mems_effective during demotion"
   from Gregory Price "changes reclaim to respect cpuset.mems_effective
   during demotion when possible".  because "presently, reclaim explicitly
   ignores cpuset.mems_effective when demoting, which may cause the cpuset
   settings to violated." "This is useful for isolating workloads on a
   multi-tenant system from certain classes of memory more consistently."
 
 - The 2 patch series ""Clean up split_huge_pmd_locked() and remove
   unnecessary folio pointers" from Gavin Guo provides minor cleanups and
   efficiency gains in in the huge page splitting and migrating code.
 
 - The 3 patch series "Use kmem_cache for memcg alloc" from Huan Yang
   creates a slab cache for `struct mem_cgroup', yielding improved memory
   utilization.
 
 - The 4 patch series "add max arg to swappiness in memory.reclaim and
   lru_gen" from Zhongkun He adds a new "max" argument to the "swappiness="
   argument for memory.reclaim MGLRU's lru_gen.  This directs proactive
   reclaim to reclaim from only anon folios rather than file-backed folios.
 
 - The 17 patch series "kexec: introduce Kexec HandOver (KHO)" from Mike
   Rapoport is the first step on the path to permitting the kernel to
   maintain existing VMs while replacing the host kernel via file-based
   kexec.  At this time only memblock's reserve_mem is preserved.
 
 - The 7 patch series "mm: Introduce for_each_valid_pfn()" from David
   Woodhouse provides and uses a smarter way of looping over a pfn range.
   By skipping ranges of invalid pfns.
 
 - The 2 patch series "sched/numa: Skip VMA scanning on memory pinned to
   one NUMA node via cpuset.mems" from Libo Chen removes a lot of pointless
   VMA scanning when a task is pinned a single NUMA mode.  Dramatic
   performance benefits were seen in some real world cases.
 
 - The 2 patch series "JFS: Implement migrate_folio for
   jfs_metapage_aops" from Shivank Garg addresses a warning which occurs
   during memory compaction when using JFS.
 
 - The 4 patch series "move all VMA allocation, freeing and duplication
   logic to mm" from Lorenzo Stoakes moves some VMA code from kernel/fork.c
   into the more appropriate mm/vma.c.
 
 - The 6 patch series "mm, swap: clean up swap cache mapping helper" from
   Kairui Song provides code consolidation and cleanups related to the
   folio_index() function.
 
 - The 2 patch series "mm/gup: Cleanup memfd_pin_folios()" from Vishal
   Moola does that.
 
 - The 8 patch series "memcg: Fix test_memcg_min/low test failures" from
   Waiman Long addresses some bogus failures which are being reported by
   the test_memcontrol selftest.
 
 - The 3 patch series "eliminate mmap() retry merge, add .mmap_prepare
   hook" from Lorenzo Stoakes commences the deprecation of
   file_operations.mmap() in favor of the new
   file_operations.mmap_prepare().  The latter is more restrictive and
   prevents drivers from messing with things in ways which, amongst other
   problems, may defeat VMA merging.
 
 - The 4 patch series "memcg: decouple memcg and objcg stocks"" from
   Shakeel Butt decouples the per-cpu memcg charge cache from the objcg's
   one.  This is a step along the way to making memcg and objcg charging
   NMI-safe, which is a BPF requirement.
 
 - The 6 patch series "mm/damon: minor fixups and improvements for code,
   tests, and documents" from SeongJae Park is "yet another batch of
   miscellaneous DAMON changes.  Fix and improve minor problems in code,
   tests and documents."
 
 - The 7 patch series "memcg: make memcg stats irq safe" from Shakeel
   Butt converts memcg stats to be irq safe.  Another step along the way to
   making memcg charging and stats updates NMI-safe, a BPF requirement.
 
 - The 4 patch series "Let unmap_hugepage_range() and several related
   functions take folio instead of page" from Fan Ni provides folio
   conversions in the hugetlb code.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaDt5qgAKCRDdBJ7gKXxA
 ju6XAP9nTiSfRz8Cz1n5LJZpFKEGzLpSihCYyR6P3o1L9oe3mwEAlZ5+XAwk2I5x
 Qqb/UGMEpilyre1PayQqOnct3aSL9Ao=
 =tYYm
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2025-05-31-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - "Add folio_mk_pte()" from Matthew Wilcox simplifies the act of
   creating a pte which addresses the first page in a folio and reduces
   the amount of plumbing which architecture must implement to provide
   this.

 - "Misc folio patches for 6.16" from Matthew Wilcox is a shower of
   largely unrelated folio infrastructure changes which clean things up
   and better prepare us for future work.

 - "memory,x86,acpi: hotplug memory alignment advisement" from Gregory
   Price adds early-init code to prevent x86 from leaving physical
   memory unused when physical address regions are not aligned to memory
   block size.

 - "mm/compaction: allow more aggressive proactive compaction" from
   Michal Clapinski provides some tuning of the (sadly, hard-coded (more
   sadly, not auto-tuned)) thresholds for our invokation of proactive
   compaction. In a simple test case, the reduction of a guest VM's
   memory consumption was dramatic.

 - "Minor cleanups and improvements to swap freeing code" from Kemeng
   Shi provides some code cleaups and a small efficiency improvement to
   this part of our swap handling code.

 - "ptrace: introduce PTRACE_SET_SYSCALL_INFO API" from Dmitry Levin
   adds the ability for a ptracer to modify syscalls arguments. At this
   time we can alter only "system call information that are used by
   strace system call tampering, namely, syscall number, syscall
   arguments, and syscall return value.

   This series should have been incorporated into mm.git's "non-MM"
   branch, but I goofed.

 - "fs/proc: extend the PAGEMAP_SCAN ioctl to report guard regions" from
   Andrei Vagin extends the info returned by the PAGEMAP_SCAN ioctl
   against /proc/pid/pagemap. This permits CRIU to more efficiently get
   at the info about guard regions.

 - "Fix parameter passed to page_mapcount_is_type()" from Gavin Shan
   implements that fix. No runtime effect is expected because
   validate_page_before_insert() happens to fix up this error.

 - "kernel/events/uprobes: uprobe_write_opcode() rewrite" from David
   Hildenbrand basically brings uprobe text poking into the current
   decade. Remove a bunch of hand-rolled implementation in favor of
   using more current facilities.

 - "mm/ptdump: Drop assumption that pxd_val() is u64" from Anshuman
   Khandual provides enhancements and generalizations to the pte dumping
   code. This might be needed when 128-bit Page Table Descriptors are
   enabled for ARM.

 - "Always call constructor for kernel page tables" from Kevin Brodsky
   ensures that the ctor/dtor is always called for kernel pgtables, as
   it already is for user pgtables.

   This permits the addition of more functionality such as "insert hooks
   to protect page tables". This change does result in various
   architectures performing unnecesary work, but this is fixed up where
   it is anticipated to occur.

 - "Rust support for mm_struct, vm_area_struct, and mmap" from Alice
   Ryhl adds plumbing to permit Rust access to core MM structures.

 - "fix incorrectly disallowed anonymous VMA merges" from Lorenzo
   Stoakes takes advantage of some VMA merging opportunities which we've
   been missing for 15 years.

 - "mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE" from
   SeongJae Park optimizes process_madvise()'s TLB flushing.

   Instead of flushing each address range in the provided iovec, we
   batch the flushing across all the iovec entries. The syscall's cost
   was approximately halved with a microbenchmark which was designed to
   load this particular operation.

 - "Track node vacancy to reduce worst case allocation counts" from
   Sidhartha Kumar makes the maple tree smarter about its node
   preallocation.

   stress-ng mmap performance increased by single-digit percentages and
   the amount of unnecessarily preallocated memory was dramaticelly
   reduced.

 - "mm/gup: Minor fix, cleanup and improvements" from Baoquan He removes
   a few unnecessary things which Baoquan noted when reading the code.

 - ""Enhance sysfs handling for memory hotplug in weighted interleave"
   from Rakie Kim "enhances the weighted interleave policy in the memory
   management subsystem by improving sysfs handling, fixing memory
   leaks, and introducing dynamic sysfs updates for memory hotplug
   support". Fixes things on error paths which we are unlikely to hit.

 - "mm/damon: auto-tune DAMOS for NUMA setups including tiered memory"
   from SeongJae Park introduces new DAMOS quota goal metrics which
   eliminate the manual tuning which is required when utilizing DAMON
   for memory tiering.

 - "mm/vmalloc.c: code cleanup and improvements" from Baoquan He
   provides cleanups and small efficiency improvements which Baoquan
   found via code inspection.

 - "vmscan: enforce mems_effective during demotion" from Gregory Price
   changes reclaim to respect cpuset.mems_effective during demotion when
   possible. because presently, reclaim explicitly ignores
   cpuset.mems_effective when demoting, which may cause the cpuset
   settings to violated.

   This is useful for isolating workloads on a multi-tenant system from
   certain classes of memory more consistently.

 - "Clean up split_huge_pmd_locked() and remove unnecessary folio
   pointers" from Gavin Guo provides minor cleanups and efficiency gains
   in in the huge page splitting and migrating code.

 - "Use kmem_cache for memcg alloc" from Huan Yang creates a slab cache
   for `struct mem_cgroup', yielding improved memory utilization.

 - "add max arg to swappiness in memory.reclaim and lru_gen" from
   Zhongkun He adds a new "max" argument to the "swappiness=" argument
   for memory.reclaim MGLRU's lru_gen.

   This directs proactive reclaim to reclaim from only anon folios
   rather than file-backed folios.

 - "kexec: introduce Kexec HandOver (KHO)" from Mike Rapoport is the
   first step on the path to permitting the kernel to maintain existing
   VMs while replacing the host kernel via file-based kexec. At this
   time only memblock's reserve_mem is preserved.

 - "mm: Introduce for_each_valid_pfn()" from David Woodhouse provides
   and uses a smarter way of looping over a pfn range. By skipping
   ranges of invalid pfns.

 - "sched/numa: Skip VMA scanning on memory pinned to one NUMA node via
   cpuset.mems" from Libo Chen removes a lot of pointless VMA scanning
   when a task is pinned a single NUMA mode.

   Dramatic performance benefits were seen in some real world cases.

 - "JFS: Implement migrate_folio for jfs_metapage_aops" from Shivank
   Garg addresses a warning which occurs during memory compaction when
   using JFS.

 - "move all VMA allocation, freeing and duplication logic to mm" from
   Lorenzo Stoakes moves some VMA code from kernel/fork.c into the more
   appropriate mm/vma.c.

 - "mm, swap: clean up swap cache mapping helper" from Kairui Song
   provides code consolidation and cleanups related to the folio_index()
   function.

 - "mm/gup: Cleanup memfd_pin_folios()" from Vishal Moola does that.

 - "memcg: Fix test_memcg_min/low test failures" from Waiman Long
   addresses some bogus failures which are being reported by the
   test_memcontrol selftest.

 - "eliminate mmap() retry merge, add .mmap_prepare hook" from Lorenzo
   Stoakes commences the deprecation of file_operations.mmap() in favor
   of the new file_operations.mmap_prepare().

   The latter is more restrictive and prevents drivers from messing with
   things in ways which, amongst other problems, may defeat VMA merging.

 - "memcg: decouple memcg and objcg stocks"" from Shakeel Butt decouples
   the per-cpu memcg charge cache from the objcg's one.

   This is a step along the way to making memcg and objcg charging
   NMI-safe, which is a BPF requirement.

 - "mm/damon: minor fixups and improvements for code, tests, and
   documents" from SeongJae Park is yet another batch of miscellaneous
   DAMON changes. Fix and improve minor problems in code, tests and
   documents.

 - "memcg: make memcg stats irq safe" from Shakeel Butt converts memcg
   stats to be irq safe. Another step along the way to making memcg
   charging and stats updates NMI-safe, a BPF requirement.

 - "Let unmap_hugepage_range() and several related functions take folio
   instead of page" from Fan Ni provides folio conversions in the
   hugetlb code.

* tag 'mm-stable-2025-05-31-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (285 commits)
  mm: pcp: increase pcp->free_count threshold to trigger free_high
  mm/hugetlb: convert use of struct page to folio in __unmap_hugepage_range()
  mm/hugetlb: refactor __unmap_hugepage_range() to take folio instead of page
  mm/hugetlb: refactor unmap_hugepage_range() to take folio instead of page
  mm/hugetlb: pass folio instead of page to unmap_ref_private()
  memcg: objcg stock trylock without irq disabling
  memcg: no stock lock for cpu hot-unplug
  memcg: make __mod_memcg_lruvec_state re-entrant safe against irqs
  memcg: make count_memcg_events re-entrant safe against irqs
  memcg: make mod_memcg_state re-entrant safe against irqs
  memcg: move preempt disable to callers of memcg_rstat_updated
  memcg: memcg_rstat_updated re-entrant safe against irqs
  mm: khugepaged: decouple SHMEM and file folios' collapse
  selftests/eventfd: correct test name and improve messages
  alloc_tag: check mem_profiling_support in alloc_tag_init
  Docs/damon: update titles and brief introductions to explain DAMOS
  selftests/damon/_damon_sysfs: read tried regions directories in order
  mm/damon/tests/core-kunit: add a test for damos_set_filters_default_reject()
  mm/damon/paddr: remove unused variable, folio_list, in damon_pa_stat()
  mm/damon/sysfs-schemes: fix wrong comment on damons_sysfs_quota_goal_metric_strs
  ...
2025-05-31 15:44:16 -07:00
Linus Torvalds
0f70f5b08a automount wart removal
Calling conventions of ->d_automount() made saner (flagday change)
 vfs_submount() is gone - its sole remaining user (trace_automount) had
 been switched to saner primitives.
 
 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCaDoRWQAKCRBZ7Krx/gZQ
 6wxMAQCzuMc2GiGBMXzeK4SGA7d5rsK71unf+zczOd8NvbTImQEAs1Cu3u3bF3pq
 EmHQWFTKBpBf+RHsLSoDHwUA+9THowM=
 =GXLi
 -----END PGP SIGNATURE-----

Merge tag 'pull-automount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull automount updates from Al Viro:
 "Automount wart removal

  A bunch of odd boilerplate gone from instances - the reason for
  those was the need to protect the yet-to-be-attched mount from
  mark_mounts_for_expiry() deciding to take it out.

  But that's easy to detect and take care of in mark_mounts_for_expiry()
  itself; no need to have every instance simulate mount being busy by
  grabbing an extra reference to it, with finish_automount() undoing
  that once it attaches that mount.

  Should've done it that way from the very beginning... This is a
  flagday change, thankfully there are very few instances.

  vfs_submount() is gone - its sole remaining user (trace_automount)
  had been switched to saner primitives"

* tag 'pull-automount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  kill vfs_submount()
  saner calling conventions for ->d_automount()
2025-05-30 15:38:29 -07:00
Miklos Szeredi
dabb903910 fuse: increase readdir buffer size
Increase the buffer size to the count requested by userspace.  This
improves performance.

Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Bernd Schubert <bschubert@ddn.com>
2025-05-29 12:31:24 +02:00
Miklos Szeredi
c31f91c6af fuse: don't allow signals to interrupt getdents copying
When getting the directory contents, the entries are first fetched to a
kernel buffer, then they are copied to userspace with dir_emit().  This
second phase is non-blocking as long as the userspace buffer is not paged
out, making it interruptible makes zero sense.

Overload d_type as flags, since it only uses 4 bits from 32.

Reviewed-by: Bernd Schubert <bschubert@ddn.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-05-29 12:31:23 +02:00
Joanne Koong
f3cb8bd908 fuse: support large folios for writeback
Add support for folios larger than one page size for writeback.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-05-29 12:31:23 +02:00
Joanne Koong
906354c87f fuse: support large folios for readahead
Add support for folios larger than one page size for readahead.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-05-29 12:31:23 +02:00
Joanne Koong
ff7c3ee484 fuse: support large folios for queued writes
Add support for folios larger than one page size for queued writes.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Bernd Schubert <bschubert@ddn.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-05-29 12:31:23 +02:00
Joanne Koong
c91440c89f fuse: support large folios for stores
Add support for folios larger than one page size for stores.
Also change variable naming from "this_num" to "nr_bytes".

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-05-29 12:31:23 +02:00
Joanne Koong
cacc0645bc fuse: support large folios for symlinks
Support large folios for symlinks and change the name from
fuse_getlink_page() to fuse_getlink_folio().

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Bernd Schubert <bschubert@ddn.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-05-29 12:31:23 +02:00
Joanne Koong
351a24eb48 fuse: support large folios for folio reads
Add support for folios larger than one page size for folio reads into
the page cache.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Bernd Schubert <bschubert@ddn.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-05-29 12:31:23 +02:00
Joanne Koong
d60a6015e1 fuse: support large folios for writethrough writes
Add support for folios larger than one page size for writethrough
writes.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-05-29 12:31:23 +02:00
Joanne Koong
63c69ad3d1 fuse: refactor fuse_fill_write_pages()
Refactor the logic in fuse_fill_write_pages() for copying out write
data. This will make the future change for supporting large folios for
writes easier. No functional changes.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Bernd Schubert <bschubert@ddn.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-05-29 12:31:23 +02:00
Joanne Koong
3568a95693 fuse: support large folios for retrieves
Add support for folios larger than one page size for retrieves.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Bernd Schubert <bschubert@ddn.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-05-29 12:31:23 +02:00
Joanne Koong
394244b24f fuse: support copying large folios
Currently, all folios associated with fuse are one page size. As part of
the work to enable large folios, this commit adds support for copying
to/from folios larger than one page size.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Bernd Schubert <bschubert@ddn.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-05-29 12:30:30 +02:00
Linus Torvalds
181d8e399f vfs-6.16-rc1.misc
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaDBPTwAKCRCRxhvAZXjc
 om0+AQDMxKLweJXplqQQ7jxuvW2dEa60YpE2EalEKWGg9YA3KgEA3nI4kyKMKn7Y
 PRFXgIcKvhs62oJLKsq8SGQUqExqvAE=
 =atEw
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.16-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull misc vfs updates from Christian Brauner:
 "This contains the usual selections of misc updates for this cycle.

  Features:

   - Use folios for symlinks in the page cache

     FUSE already uses folios for its symlinks. Mirror that conversion
     in the generic code and the NFS code. That lets us get rid of a few
     folio->page->folio conversions in this path, and some of the few
     remaining users of read_cache_page() / read_mapping_page()

   - Try and make a few filesystem operations killable on the VFS
     inode->i_mutex level

   - Add sysctl vfs_cache_pressure_denom for bulk file operations

     Some workloads need to preserve more dentries than we currently
     allow through out sysctl interface

     A HDFS servers with 12 HDDs per server, on a HDFS datanode startup
     involves scanning all files and caching their metadata (including
     dentries and inodes) in memory. Each HDD contains approximately 2
     million files, resulting in a total of ~20 million cached dentries
     after initialization

     To minimize dentry reclamation, they set vfs_cache_pressure to 1.
     Despite this configuration, memory pressure conditions can still
     trigger reclamation of up to 50% of cached dentries, reducing the
     cache from 20 million to approximately 10 million entries. During
     the subsequent cache rebuild period, any HDFS datanode restart
     operation incurs substantial latency penalties until full cache
     recovery completes

     To maintain service stability, more dentries need to be preserved
     during memory reclamation. The current minimum reclaim ratio (1/100
     of total dentries) remains too aggressive for such workload. This
     patch introduces vfs_cache_pressure_denom for more granular cache
     pressure control

     The configuration [vfs_cache_pressure=1,
     vfs_cache_pressure_denom=10000] effectively maintains the full 20
     million dentry cache under memory pressure, preventing datanode
     restart performance degradation

   - Avoid some jumps in inode_permission() using likely()/unlikely()

   - Avid a memory access which is most likely a cache miss when
     descending into devcgroup_inode_permission()

   - Add fastpath predicts for stat() and fdput()

   - Anonymous inodes currently don't come with a proper mode causing
     issues in the kernel when we want to add useful VFS debug assert.
     Fix that by giving them a proper mode and masking it off when we
     report it to userspace which relies on them not having any mode

   - Anonymous inodes currently allow to change inode attributes because
     the VFS falls back to simple_setattr() if i_op->setattr isn't
     implemented. This means the ownership and mode for every single
     user of anon_inode_inode can be changed. Block that as it's either
     useless or actively harmful. If specific ownership is needed the
     respective subsystem should allocate anonymous inodes from their
     own private superblock

   - Raise SB_I_NODEV and SB_I_NOEXEC on the anonymous inode superblock

   - Add proper tests for anonymous inode behavior

   - Make it easy to detect proper anonymous inodes and to ensure that
     we can detect them in codepaths such as readahead()

  Cleanups:

   - Port pidfs to the new anon_inode_{g,s}etattr() helpers

   - Try to remove the uselib() system call

   - Add unlikely branch hint return path for poll

   - Add unlikely branch hint on return path for core_sys_select

   - Don't allow signals to interrupt getdents copying for fuse

   - Provide a size hint to dir_context for during readdir()

   - Use writeback_iter directly in mpage_writepages

   - Update compression and mtime descriptions in initramfs
     documentation

   - Update main netfs API document

   - Remove useless plus one in super_cache_scan()

   - Remove unnecessary NULL-check guards during setns()

   - Add separate separate {get,put}_cgroup_ns no-op cases

  Fixes:

   - Fix typo in root= kernel parameter description

   - Use KERN_INFO for infof()|info_plog()|infofc()

   - Correct comments of fs_validate_description()

   - Mark an unlikely if condition with unlikely() in
     vfs_parse_monolithic_sep()

   - Delete macro fsparam_u32hex()

   - Remove unused and problematic validate_constant_table()

   - Fix potential unsigned integer underflow in fs_name()

   - Make file-nr output the total allocated file handles"

* tag 'vfs-6.16-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (43 commits)
  fs: Pass a folio to page_put_link()
  nfs: Use a folio in nfs_get_link()
  fs: Convert __page_get_link() to use a folio
  fs/read_write: make default_llseek() killable
  fs/open: make do_truncate() killable
  fs/open: make chmod_common() and chown_common() killable
  include/linux/fs.h: add inode_lock_killable()
  readdir: supply dir_context.count as readdir buffer size hint
  vfs: Add sysctl vfs_cache_pressure_denom for bulk file operations
  fuse: don't allow signals to interrupt getdents copying
  Documentation: fix typo in root= kernel parameter description
  include/cgroup: separate {get,put}_cgroup_ns no-op case
  kernel/nsproxy: remove unnecessary guards
  fs: use writeback_iter directly in mpage_writepages
  fs: remove useless plus one in super_cache_scan()
  fs: add S_ANON_INODE
  fs: remove uselib() system call
  device_cgroup: avoid access to ->i_rdev in the common case in devcgroup_inode_permission()
  fs/fs_parse: Remove unused and problematic validate_constant_table()
  fs: touch up predicts in inode_permission()
  ...
2025-05-26 09:02:39 -07:00
Matthew Wilcox (Oracle)
4ec373b74e
fs: Pass a folio to page_put_link()
All callers now have a folio.  Pass it to page_put_link(), saving a
hidden call to compound_head().  Also add kernel-doc for page_get_link()
and page_put_link().

Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Link: https://lore.kernel.org/20250514171316.3002934-4-willy@infradead.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-15 12:14:29 +02:00
Miklos Szeredi
8d9117009d
fuse: don't allow signals to interrupt getdents copying
When getting the directory contents, the entries are first fetched to a
kernel buffer, then they are copied to userspace with dir_emit().  This
second phase is non-blocking as long as the userspace buffer is not paged
out, making it interruptible makes zero sense.

Overload d_type as flags, since it only uses 4 bits from 32.

Reviewed-by: Bernd Schubert <bschubert@ddn.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Link: https://lore.kernel.org/20250513112335.1473177-1-mszeredi@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-15 11:12:11 +02:00
Chen Linxuan
f09222980d fs: fuse: add dev id to /dev/fuse fdinfo
This commit add fuse connection device id to
fdinfo of opened /dev/fuse files.

Related discussions can be found at links below.

Link: https://lore.kernel.org/all/CAJfpegvEYUgEbpATpQx8NqVR33Mv-VK96C+gbTag1CEUeBqvnA@mail.gmail.com/
Signed-off-by: Chen Linxuan <chenlinxuan@uniontech.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-05-15 09:43:19 +02:00
Kairui Song
74e6ee62a8 fuse: drop usage of folio_index
Patch series "mm, swap: clean up swap cache mapping helper", v3.

This series removes usage of folio_index usage in fs/, and remove swap
cache checking in folio_contains.

Currently, the swap cache is already no longer directly exposed to fs, and
swap cache will be more different from page cache.  Clean up the helpers
first to simplify the code and eliminate the helpers used for resolving
circular header dependency issue between filemap and swap headers, and
prepare for further changes.


This patch (of 6):

folio_index is only needed for mixed usage of page cache and swap cache,
for pure page cache usage, the caller can just use folio->index instead.

It can't be a swap cache folio here.  Swap mapping may only call into fs
through `swap_rw` but fuse does not use that method for SWAP.

Link: https://lkml.kernel.org/r/20250430181052.55698-1-ryncsn@gmail.com
Link: https://lkml.kernel.org/r/20250430181052.55698-2-ryncsn@gmail.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Joanne Koong <joannelkoong@gmail.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Chao Yu <chao@kernel.org>
Cc: Chris Mason <clm@fb.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Qu Wenruo <wqu@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12 23:50:50 -07:00
Guang Yuan Wu
69efbff69f fuse: fix race between concurrent setattrs from multiple nodes
When mounting a user-space filesystem on multiple clients, after
concurrent ->setattr() calls from different node, stale inode
attributes may be cached in some node.

This is caused by fuse_setattr() racing with
fuse_reverse_inval_inode().

When filesystem server receives setattr request, the client node
with valid iattr cached will be required to update the fuse_inode's
attr_version and invalidate the cache by fuse_reverse_inval_inode(),
and at the next call to ->getattr() they will be fetched from user
space.

The race scenario is:
1. client-1 sends setattr (iattr-1) request to server
2. client-1 receives the reply from server
3. before client-1 updates iattr-1 to the cached attributes by
   fuse_change_attributes_common(), server receives another setattr
   (iattr-2) request from client-2
4. server requests client-1 to update the inode attr_version and
   invalidate the cached iattr, and iattr-1 becomes staled
5. client-2 receives the reply from server, and caches iattr-2
6. continue with step 2, client-1 invokes
   fuse_change_attributes_common(), and caches iattr-1

The issue has been observed from concurrent of chmod, chown, or
truncate, which all invoke ->setattr() call.

The solution is to use fuse_inode's attr_version to check whether
the attributes have been modified during the setattr request's
lifetime.  If so, mark the attributes as invalid in the function
fuse_change_attributes_common().

Signed-off-by: Guang Yuan Wu <gwu@ddn.com>
Reviewed-by: Bernd Schubert <bschubert@ddn.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-05-12 09:40:41 +02:00
Al Viro
006ff7498f saner calling conventions for ->d_automount()
Currently the calling conventions for ->d_automount() instances have
an odd wart - returned new mount to be attached is expected to have
refcount 2.

That kludge is intended to make sure that mark_mounts_for_expiry() called
before we get around to attaching that new mount to the tree won't decide
to take it out.  finish_automount() drops the extra reference after it's
done with attaching mount to the tree - or drops the reference twice in
case of error.  ->d_automount() instances have rather counterintuitive
boilerplate in them.

There's a much simpler approach: have mark_mounts_for_expiry() skip the
mounts that are yet to be mounted.  And to hell with grabbing/dropping
those extra references.  Makes for simpler correctness analysis, at that...

Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Acked-by: David Howells <dhowells@redhat.com>
Tested-by: David Howells <dhowells@redhat.com>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-05-05 13:42:49 -04:00
Joanne Koong
0c58a97f91 fuse: remove tmp folio for writebacks and internal rb tree
In the current FUSE writeback design (see commit 3be5a52b30
("fuse: support writable mmap")), a temp page is allocated for every
dirty page to be written back, the contents of the dirty page are copied
over to the temp page, and the temp page gets handed to the server to
write back.

This is done so that writeback may be immediately cleared on the dirty
page, and this in turn is done in order to mitigate the following
deadlock scenario that may arise if reclaim waits on writeback on the
dirty page to complete:
* single-threaded FUSE server is in the middle of handling a request
  that needs a memory allocation
* memory allocation triggers direct reclaim
* direct reclaim waits on a folio under writeback
* the FUSE server can't write back the folio since it's stuck in
  direct reclaim

With a recent change that added AS_WRITEBACK_MAY_DEADLOCK_ON_RECLAIM and
mitigates the situation described above, FUSE writeback does not need
to use temp pages if it sets AS_WRITEBACK_MAY_DEADLOCK_ON_RECLAIM on its
inode mappings.

This commit sets AS_WRITEBACK_MAY_DEADLOCK_ON_RECLAIM on the inode
mappings and removes the temporary pages + extra copying and the internal
rb tree.

fio benchmarks --
(using averages observed from 10 runs, throwing away outliers)

Setup:
sudo mount -t tmpfs -o size=30G tmpfs ~/tmp_mount
 ./libfuse/build/example/passthrough_ll -o writeback -o max_threads=4 -o source=~/tmp_mount ~/fuse_mount

fio --name=writeback --ioengine=sync --rw=write --bs={1k,4k,1M} --size=2G
--numjobs=2 --ramp_time=30 --group_reporting=1 --directory=/root/fuse_mount

        bs =  1k          4k            1M
Before  351 MiB/s     1818 MiB/s     1851 MiB/s
After   341 MiB/s     2246 MiB/s     2685 MiB/s
% diff        -3%          23%         45%

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-04-15 14:23:28 +02:00
Joanne Koong
4fea593e62 fuse: optimize over-io-uring request expiration check
Currently, when checking whether a request has timed out, we check
fpq processing, but fuse-over-io-uring has one fpq per core and 256
entries in the processing table. For systems where there are a
large number of cores, this may be too much overhead.

Instead of checking the fpq processing list, check ent_w_req_queue
and ent_in_userspace.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Bernd Schubert <bernd@bsbernd.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-04-15 13:43:38 +02:00
Joanne Koong
03a3617f92 fuse: use boolean bit-fields in struct fuse_copy_state
Refactor struct fuse_copy_state to use boolean bit-fields to improve
clarity/readability and be consistent with other fuse structs that use
bit-fields for boolean state (eg fuse_fs_context, fuse_args).

No functional changes.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-04-15 12:59:17 +02:00