Highlights include:
Stable fixes:
- NFS don't inherit NFS filesystem capabilities when crossing from one
filesystem to another.
Bugfixes:
- NFS wakeup of __nfs_lookup_revalidate() needs memory barriers.
- NFS improve bounds checking in nfs_fh_to_dentry().
- NFS Fix allocation errors when writing to a NFS file backed loopback
device.
- NFSv4: More listxattr fixes
- SUNRPC: fix client handling of TLS alerts.
- pNFS block/scsi layout fix for an uninitialised pointer dereference.
- pNFS block/scsi layout fixes for the extent encoding, stripe mapping,
and disk offset overflows.
- pNFS layoutcommit work around for RPC size limitations.
- pNFS/flexfiles avoid looping when handling fatal errors after layoutget.
- localio: fix various race conditions.
Features and cleanups:
- Add NFSv4 support for retrieving the btime.
- NFS: Allow folio migration for the case of mode == MIGRATE_SYNC.
- NFS: Support using a kernel keyring to store TLS certificates.
- NFSv4: Speed up delegation lookup using a hash table.
- Assorted cleanups to remove unused variables and struct fields.
- Assorted new tracepoints to improve debugging.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAmiWBC8ACgkQZwvnipYK
APKTfRAAn3ETKN15+yR6wYr/wiibaL6sRzQVo8OzFSI9hEVxljX6kK2HEHXIV93T
F8bjUMB24KK+Eim8zIeLf4Ke7ldbRqtbiYLJox/I12TtQ6yaiFF+xDm7Fyc7UwcT
ZUl1UnGeNY30RfQ1n8O4O/suBOsTJy1rpWBWynGeQZLiNHFVDoxH4OgCXGZ5579p
3GACtToDDH9lgCBbKLM3J0nrcW5Or6BidFxT+zN/FXqeroepuvEcloiwJY7N4f/o
DW436v7ep92WlaJfuypeMmdusx6+vVaYJEKw+B+UjS3tRjbDmhj2FL3su4dQVCqU
JVW7TwGFL2zwfjTZjfp43ACN16goqRhX7DnTQkgD1mDnnhENqOsa0rqyIS3Wla4d
W9phfGmOo2FwuVOUXH2L4k7cfPIsktsZ0s/xg+5UfcsoG2yxUxnY9HaWQGFw6fnN
Fr9B7gUmaO6o1qpZ3emjecoOdQM9IqxMo39P9/72J9pWFstO9Js/kmGXcfsDo2IE
z2ZYd+roj2ylcLD9mCJ94tjNsAx0ytvUl1fxNFTRLToQ3Nti+jNy22i4jrCgWfyW
5jwCjjO87CJlJC0kixEPtQXAB6HQSraZw3Qmok1Ywez3fp4+j+QDFYP7OPJPQ0LY
C3R1uzvUQKqsbsE8cwYQBrG0VNzZrET6x1dxunVtcYGOkU8s4YA=
=HCFp
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-6.17-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
"Highlights include:
Stable fixes:
- don't inherit NFS filesystem capabilities when crossing from one
filesystem to another
Bugfixes:
- NFS wakeup of __nfs_lookup_revalidate() needs memory barriers
- NFS improve bounds checking in nfs_fh_to_dentry()
- NFS Fix allocation errors when writing to a NFS file backed
loopback device
- NFSv4: More listxattr fixes
- SUNRPC: fix client handling of TLS alerts
- pNFS block/scsi layout fix for an uninitialised pointer
dereference
- pNFS block/scsi layout fixes for the extent encoding, stripe
mapping, and disk offset overflows
- pNFS layoutcommit work around for RPC size limitations
- pNFS/flexfiles avoid looping when handling fatal errors after
layoutget
- localio: fix various race conditions
Features and cleanups:
- Add NFSv4 support for retrieving the btime
- NFS: Allow folio migration for the case of mode == MIGRATE_SYNC
- NFS: Support using a kernel keyring to store TLS certificates
- NFSv4: Speed up delegation lookup using a hash table
- Assorted cleanups to remove unused variables and struct fields
- Assorted new tracepoints to improve debugging"
* tag 'nfs-for-6.17-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (44 commits)
NFS/localio: nfs_uuid_put() fix the wake up after unlinking the file
NFS/localio: nfs_uuid_put() fix races with nfs_open/close_local_fh()
NFS/localio: nfs_close_local_fh() fix check for file closed
NFSv4: Remove duplicate lookups, capability probes and fsinfo calls
NFS: Fix the setting of capabilities when automounting a new filesystem
sunrpc: fix client side handling of tls alerts
nfs/localio: use read_seqbegin() rather than read_seqbegin_or_lock()
NFS: Fixup allocation flags for nfsiod's __GFP_NORETRY
NFSv4.2: another fix for listxattr
NFS: Fix filehandle bounds checking in nfs_fh_to_dentry()
SUNRPC: Silence warnings about parameters not being described
NFS: Clean up pnfs_put_layout_hdr()/pnfs_destroy_layout_final()
NFS: Fix wakeup of __nfs_lookup_revalidate() in unblock_revalidate()
NFS: use a hash table for delegation lookup
NFS: track active delegations per-server
NFS: move the delegation_watermark module parameter
NFS: cleanup nfs_inode_reclaim_delegation
NFS: cleanup error handling in nfs4_server_common_setup
pNFS/flexfiles: don't attempt pnfs on fatal DS errors
NFS: drop __exit from nfs_exit_keyring
...
When crossing into a new filesystem, the NFSv4 client will look up the
new directory, and then call nfs4_server_capabilities() as well as
nfs4_do_fsinfo() at least twice.
This patch removes the duplicate calls, and reduces the initial lookup
to retrieve just a minimal set of attributes.
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Capabilities cannot be inherited when we cross into a new filesystem.
They need to be reset to the minimal defaults, and then probed for
again.
Fixes: 54ceac4515 ("NFS: Share NFS superblocks per-protocol per-server per-FSID")
Cc: stable@vger.kernel.org
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
The usage of read_seqbegin_or_lock() in nfs_copy_boot_verifier()
is wrong. "seq" is always even and thus "or_lock" has no effect.
nfs_copy_boot_verifier() just copies 8 bytes and is supposed to be
very rare operation, so we do not need the adaptive locking in this case.
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Link: https://lore.kernel.org/r/20250731081038.3478-1-lirongqing@baidu.com
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
- The 4 patch series "mm: ksm: prevent KSM from breaking merging of new
VMAs" from Lorenzo Stoakes addresses an issue with KSM's
PR_SET_MEMORY_MERGE mode: newly mapped VMAs were not eligible for
merging with existing adjacent VMAs.
- The 4 patch series "mm/damon: introduce DAMON_STAT for simple and
practical access monitoring" from SeongJae Park adds a new kernel module
which simplifies the setup and usage of DAMON in production
environments.
- The 6 patch series "stop passing a writeback_control to swap/shmem
writeout" from Christoph Hellwig is a cleanup to the writeback code
which removes a couple of pointers from struct writeback_control.
- The 7 patch series "drivers/base/node.c: optimization and cleanups"
from Donet Tom contains largely uncorrelated cleanups to the NUMA node
setup and management code.
- The 4 patch series "mm: userfaultfd: assorted fixes and cleanups" from
Tal Zussman does some maintenance work on the userfaultfd code.
- The 5 patch series "Readahead tweaks for larger folios" from Ryan
Roberts implements some tuneups for pagecache readahead when it is
reading into order>0 folios.
- The 4 patch series "selftests/mm: Tweaks to the cow test" from Mark
Brown provides some cleanups and consistency improvements to the
selftests code.
- The 4 patch series "Optimize mremap() for large folios" from Dev Jain
does that. A 37% reduction in execution time was measured in a
memset+mremap+munmap microbenchmark.
- The 5 patch series "Remove zero_user()" from Matthew Wilcox expunges
zero_user() in favor of the more modern memzero_page().
- The 3 patch series "mm/huge_memory: vmf_insert_folio_*() and
vmf_insert_pfn_pud() fixes" from David Hildenbrand addresses some warts
which David noticed in the huge page code. These were not known to be
causing any issues at this time.
- The 3 patch series "mm/damon: use alloc_migrate_target() for
DAMOS_MIGRATE_{HOT,COLD" from SeongJae Park provides some cleanup and
consolidation work in DAMON.
- The 3 patch series "use vm_flags_t consistently" from Lorenzo Stoakes
uses vm_flags_t in places where we were inappropriately using other
types.
- The 3 patch series "mm/memfd: Reserve hugetlb folios before
allocation" from Vivek Kasireddy increases the reliability of large page
allocation in the memfd code.
- The 14 patch series "mm: Remove pXX_devmap page table bit and pfn_t
type" from Alistair Popple removes several now-unneeded PFN_* flags.
- The 5 patch series "mm/damon: decouple sysfs from core" from SeongJae
Park implememnts some cleanup and maintainability work in the DAMON
sysfs layer.
- The 5 patch series "madvise cleanup" from Lorenzo Stoakes does quite a
lot of cleanup/maintenance work in the madvise() code.
- The 4 patch series "madvise anon_name cleanups" from Vlastimil Babka
provides additional cleanups on top or Lorenzo's effort.
- The 11 patch series "Implement numa node notifier" from Oscar Salvador
creates a standalone notifier for NUMA node memory state changes.
Previously these were lumped under the more general memory on/offline
notifier.
- The 6 patch series "Make MIGRATE_ISOLATE a standalone bit" from Zi Yan
cleans up the pageblock isolation code and fixes a potential issue which
doesn't seem to cause any problems in practice.
- The 5 patch series "selftests/damon: add python and drgn based DAMON
sysfs functionality tests" from SeongJae Park adds additional drgn- and
python-based DAMON selftests which are more comprehensive than the
existing selftest suite.
- The 5 patch series "Misc rework on hugetlb faulting path" from Oscar
Salvador fixes a rather obscure deadlock in the hugetlb fault code and
follows that fix with a series of cleanups.
- The 3 patch series "cma: factor out allocation logic from
__cma_declare_contiguous_nid" from Mike Rapoport rationalizes and cleans
up the highmem-specific code in the CMA allocator.
- The 28 patch series "mm/migration: rework movable_ops page migration
(part 1)" from David Hildenbrand provides cleanups and
future-preparedness to the migration code.
- The 2 patch series "mm/damon: add trace events for auto-tuned
monitoring intervals and DAMOS quota" from SeongJae Park adds some
tracepoints to some DAMON auto-tuning code.
- The 6 patch series "mm/damon: fix misc bugs in DAMON modules" from
SeongJae Park does that.
- The 6 patch series "mm/damon: misc cleanups" from SeongJae Park also
does what it claims.
- The 4 patch series "mm: folio_pte_batch() improvements" from David
Hildenbrand cleans up the large folio PTE batching code.
- The 13 patch series "mm/damon/vaddr: Allow interleaving in
migrate_{hot,cold} actions" from SeongJae Park facilitates dynamic
alteration of DAMON's inter-node allocation policy.
- The 3 patch series "Remove unmap_and_put_page()" from Vishal Moola
provides a couple of page->folio conversions.
- The 4 patch series "mm: per-node proactive reclaim" from Davidlohr
Bueso implements a per-node control of proactive reclaim - beyond the
current memcg-based implementation.
- The 14 patch series "mm/damon: remove damon_callback" from SeongJae
Park replaces the damon_callback interface with a more general and
powerful damon_call()+damos_walk() interface.
- The 10 patch series "mm/mremap: permit mremap() move of multiple VMAs"
from Lorenzo Stoakes implements a number of mremap cleanups (of course)
in preparation for adding new mremap() functionality: newly permit the
remapping of multiple VMAs when the user is specifying MREMAP_FIXED. It
still excludes some specialized situations where this cannot be
performed reliably.
- The 3 patch series "drop hugetlb_free_pgd_range()" from Anthony Yznaga
switches some sparc hugetlb code over to the generic version and removes
the thus-unneeded hugetlb_free_pgd_range().
- The 4 patch series "mm/damon/sysfs: support periodic and automated
stats update" from SeongJae Park augments the present
userspace-requested update of DAMON sysfs monitoring files. Automatic
update is now provided, along with a tunable to control the update
interval.
- The 4 patch series "Some randome fixes and cleanups to swapfile" from
Kemeng Shi does what is claims.
- The 4 patch series "mm: introduce snapshot_page" from Luiz Capitulino
and David Hildenbrand provides (and uses) a means by which debug-style
functions can grab a copy of a pageframe and inspect it locklessly
without tripping over the races inherent in operating on the live
pageframe directly.
- The 6 patch series "use per-vma locks for /proc/pid/maps reads" from
Suren Baghdasaryan addresses the large contention issues which can be
triggered by reads from that procfs file. Latencies are reduced by more
than half in some situations. The series also introduces several new
selftests for the /proc/pid/maps interface.
- The 6 patch series "__folio_split() clean up" from Zi Yan cleans up
__folio_split()!
- The 7 patch series "Optimize mprotect() for large folios" from Dev
Jain provides some quite large (>3x) speedups to mprotect() when dealing
with large folios.
- The 2 patch series "selftests/mm: reuse FORCE_READ to replace "asm
volatile("" : "+r" (XXX));" and some cleanup" from wang lian does some
cleanup work in the selftests code.
- The 3 patch series "tools/testing: expand mremap testing" from Lorenzo
Stoakes extends the mremap() selftest in several ways, including adding
more checking of Lorenzo's recently added "permit mremap() move of
multiple VMAs" feature.
- The 22 patch series "selftests/damon/sysfs.py: test all parameters"
from SeongJae Park extends the DAMON sysfs interface selftest so that it
tests all possible user-requested parameters. Rather than the present
minimal subset.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaIqcCgAKCRDdBJ7gKXxA
jkVBAQCCn9DR1QP0CRk961ot0cKzOgioSc0aA03DPb2KXRt2kQEAzDAz0ARurFhL
8BzbvI0c+4tntHLXvIlrC33n9KWAOQM=
=XsFy
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2025-07-30-15-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
"As usual, many cleanups. The below blurbiage describes 42 patchsets.
21 of those are partially or fully cleanup work. "cleans up",
"cleanup", "maintainability", "rationalizes", etc.
I never knew the MM code was so dirty.
"mm: ksm: prevent KSM from breaking merging of new VMAs" (Lorenzo Stoakes)
addresses an issue with KSM's PR_SET_MEMORY_MERGE mode: newly
mapped VMAs were not eligible for merging with existing adjacent
VMAs.
"mm/damon: introduce DAMON_STAT for simple and practical access monitoring" (SeongJae Park)
adds a new kernel module which simplifies the setup and usage of
DAMON in production environments.
"stop passing a writeback_control to swap/shmem writeout" (Christoph Hellwig)
is a cleanup to the writeback code which removes a couple of
pointers from struct writeback_control.
"drivers/base/node.c: optimization and cleanups" (Donet Tom)
contains largely uncorrelated cleanups to the NUMA node setup and
management code.
"mm: userfaultfd: assorted fixes and cleanups" (Tal Zussman)
does some maintenance work on the userfaultfd code.
"Readahead tweaks for larger folios" (Ryan Roberts)
implements some tuneups for pagecache readahead when it is reading
into order>0 folios.
"selftests/mm: Tweaks to the cow test" (Mark Brown)
provides some cleanups and consistency improvements to the
selftests code.
"Optimize mremap() for large folios" (Dev Jain)
does that. A 37% reduction in execution time was measured in a
memset+mremap+munmap microbenchmark.
"Remove zero_user()" (Matthew Wilcox)
expunges zero_user() in favor of the more modern memzero_page().
"mm/huge_memory: vmf_insert_folio_*() and vmf_insert_pfn_pud() fixes" (David Hildenbrand)
addresses some warts which David noticed in the huge page code.
These were not known to be causing any issues at this time.
"mm/damon: use alloc_migrate_target() for DAMOS_MIGRATE_{HOT,COLD" (SeongJae Park)
provides some cleanup and consolidation work in DAMON.
"use vm_flags_t consistently" (Lorenzo Stoakes)
uses vm_flags_t in places where we were inappropriately using other
types.
"mm/memfd: Reserve hugetlb folios before allocation" (Vivek Kasireddy)
increases the reliability of large page allocation in the memfd
code.
"mm: Remove pXX_devmap page table bit and pfn_t type" (Alistair Popple)
removes several now-unneeded PFN_* flags.
"mm/damon: decouple sysfs from core" (SeongJae Park)
implememnts some cleanup and maintainability work in the DAMON
sysfs layer.
"madvise cleanup" (Lorenzo Stoakes)
does quite a lot of cleanup/maintenance work in the madvise() code.
"madvise anon_name cleanups" (Vlastimil Babka)
provides additional cleanups on top or Lorenzo's effort.
"Implement numa node notifier" (Oscar Salvador)
creates a standalone notifier for NUMA node memory state changes.
Previously these were lumped under the more general memory
on/offline notifier.
"Make MIGRATE_ISOLATE a standalone bit" (Zi Yan)
cleans up the pageblock isolation code and fixes a potential issue
which doesn't seem to cause any problems in practice.
"selftests/damon: add python and drgn based DAMON sysfs functionality tests" (SeongJae Park)
adds additional drgn- and python-based DAMON selftests which are
more comprehensive than the existing selftest suite.
"Misc rework on hugetlb faulting path" (Oscar Salvador)
fixes a rather obscure deadlock in the hugetlb fault code and
follows that fix with a series of cleanups.
"cma: factor out allocation logic from __cma_declare_contiguous_nid" (Mike Rapoport)
rationalizes and cleans up the highmem-specific code in the CMA
allocator.
"mm/migration: rework movable_ops page migration (part 1)" (David Hildenbrand)
provides cleanups and future-preparedness to the migration code.
"mm/damon: add trace events for auto-tuned monitoring intervals and DAMOS quota" (SeongJae Park)
adds some tracepoints to some DAMON auto-tuning code.
"mm/damon: fix misc bugs in DAMON modules" (SeongJae Park)
does that.
"mm/damon: misc cleanups" (SeongJae Park)
also does what it claims.
"mm: folio_pte_batch() improvements" (David Hildenbrand)
cleans up the large folio PTE batching code.
"mm/damon/vaddr: Allow interleaving in migrate_{hot,cold} actions" (SeongJae Park)
facilitates dynamic alteration of DAMON's inter-node allocation
policy.
"Remove unmap_and_put_page()" (Vishal Moola)
provides a couple of page->folio conversions.
"mm: per-node proactive reclaim" (Davidlohr Bueso)
implements a per-node control of proactive reclaim - beyond the
current memcg-based implementation.
"mm/damon: remove damon_callback" (SeongJae Park)
replaces the damon_callback interface with a more general and
powerful damon_call()+damos_walk() interface.
"mm/mremap: permit mremap() move of multiple VMAs" (Lorenzo Stoakes)
implements a number of mremap cleanups (of course) in preparation
for adding new mremap() functionality: newly permit the remapping
of multiple VMAs when the user is specifying MREMAP_FIXED. It still
excludes some specialized situations where this cannot be performed
reliably.
"drop hugetlb_free_pgd_range()" (Anthony Yznaga)
switches some sparc hugetlb code over to the generic version and
removes the thus-unneeded hugetlb_free_pgd_range().
"mm/damon/sysfs: support periodic and automated stats update" (SeongJae Park)
augments the present userspace-requested update of DAMON sysfs
monitoring files. Automatic update is now provided, along with a
tunable to control the update interval.
"Some randome fixes and cleanups to swapfile" (Kemeng Shi)
does what is claims.
"mm: introduce snapshot_page" (Luiz Capitulino and David Hildenbrand)
provides (and uses) a means by which debug-style functions can grab
a copy of a pageframe and inspect it locklessly without tripping
over the races inherent in operating on the live pageframe
directly.
"use per-vma locks for /proc/pid/maps reads" (Suren Baghdasaryan)
addresses the large contention issues which can be triggered by
reads from that procfs file. Latencies are reduced by more than
half in some situations. The series also introduces several new
selftests for the /proc/pid/maps interface.
"__folio_split() clean up" (Zi Yan)
cleans up __folio_split()!
"Optimize mprotect() for large folios" (Dev Jain)
provides some quite large (>3x) speedups to mprotect() when dealing
with large folios.
"selftests/mm: reuse FORCE_READ to replace "asm volatile("" : "+r" (XXX));" and some cleanup" (wang lian)
does some cleanup work in the selftests code.
"tools/testing: expand mremap testing" (Lorenzo Stoakes)
extends the mremap() selftest in several ways, including adding
more checking of Lorenzo's recently added "permit mremap() move of
multiple VMAs" feature.
"selftests/damon/sysfs.py: test all parameters" (SeongJae Park)
extends the DAMON sysfs interface selftest so that it tests all
possible user-requested parameters. Rather than the present minimal
subset"
* tag 'mm-stable-2025-07-30-15-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (370 commits)
MAINTAINERS: add missing headers to mempory policy & migration section
MAINTAINERS: add missing file to cgroup section
MAINTAINERS: add MM MISC section, add missing files to MISC and CORE
MAINTAINERS: add missing zsmalloc file
MAINTAINERS: add missing files to page alloc section
MAINTAINERS: add missing shrinker files
MAINTAINERS: move memremap.[ch] to hotplug section
MAINTAINERS: add missing mm_slot.h file THP section
MAINTAINERS: add missing interval_tree.c to memory mapping section
MAINTAINERS: add missing percpu-internal.h file to per-cpu section
mm/page_alloc: remove trace_mm_alloc_contig_migrate_range_info()
selftests/damon: introduce _common.sh to host shared function
selftests/damon/sysfs.py: test runtime reduction of DAMON parameters
selftests/damon/sysfs.py: test non-default parameters runtime commit
selftests/damon/sysfs.py: generalize DAMON context commit assertion
selftests/damon/sysfs.py: generalize monitoring attributes commit assertion
selftests/damon/sysfs.py: generalize DAMOS schemes commit assertion
selftests/damon/sysfs.py: test DAMOS filters commitment
selftests/damon/sysfs.py: generalize DAMOS scheme commit assertion
selftests/damon/sysfs.py: test DAMOS destinations commitment
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaINCgQAKCRCRxhvAZXjc
os+nAP9LFHUwWO6EBzHJJGEVjJvvzsbzqeYrRFamYiMc5ulPJwD+KW4RIgJa/MWO
pcYE40CacaekD8rFWwYUyszpgmv6ewc=
=wCwp
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.17-rc1.mmap_prepare' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull mmap_prepare updates from Christian Brauner:
"Last cycle we introduce f_op->mmap_prepare() in c84bf6dd2b ("mm:
introduce new .mmap_prepare() file callback").
This is preferred to the existing f_op->mmap() hook as it does require
a VMA to be established yet, thus allowing the mmap logic to invoke
this hook far, far earlier, prior to inserting a VMA into the virtual
address space, or performing any other heavy handed operations.
This allows for much simpler unwinding on error, and for there to be a
single attempt at merging a VMA rather than having to possibly
reattempt a merge based on potentially altered VMA state.
Far more importantly, it prevents inappropriate manipulation of
incompletely initialised VMA state, which is something that has been
the cause of bugs and complexity in the past.
The intent is to gradually deprecate f_op->mmap, and in that vein this
series coverts the majority of file systems to using f_op->mmap_prepare.
Prerequisite steps are taken - firstly ensuring all checks for mmap
capabilities use the file_has_valid_mmap_hooks() helper rather than
directly checking for f_op->mmap (which is now not a valid check) and
secondly updating daxdev_mapping_supported() to not require a VMA
parameter to allow ext4 and xfs to be converted.
Commit bb666b7c27 ("mm: add mmap_prepare() compatibility layer for
nested file systems") handles the nasty edge-case of nested file
systems like overlayfs, which introduces a compatibility shim to allow
f_op->mmap_prepare() to be invoked from an f_op->mmap() callback.
This allows for nested filesystems to continue to function correctly
with all file systems regardless of which callback is used. Once we
finally convert all file systems, this shim can be removed.
As a result, ecryptfs, fuse, and overlayfs remain unaltered so they
can nest all other file systems.
We additionally do not update resctl - as this requires an update to
remap_pfn_range() (or an alternative to it) which we defer to a later
series, equally we do not update cramfs which needs a mixed mapping
insertion with the same issue, nor do we update procfs, hugetlbfs,
syfs or kernfs all of which require VMAs for internal state and hooks.
We shall return to all of these later"
* tag 'vfs-6.17-rc1.mmap_prepare' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
doc: update porting, vfs documentation to describe mmap_prepare()
fs: replace mmap hook with .mmap_prepare for simple mappings
fs: convert most other generic_file_*mmap() users to .mmap_prepare()
fs: convert simple use of generic_file_*_mmap() to .mmap_prepare()
mm/filemap: introduce generic_file_*_mmap_prepare() helpers
fs/xfs: transition from deprecated .mmap hook to .mmap_prepare
fs/ext4: transition from deprecated .mmap hook to .mmap_prepare
fs/dax: make it possible to check dev dax support without a VMA
fs: consistently use can_mmap_file() helper
mm/nommu: use file_has_valid_mmap_hooks() helper
mm: rename call_mmap/mmap_prepare to vfs_mmap/mmap_prepare
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaIM/KwAKCRCRxhvAZXjc
opT+AP407JwhRSBjUEmHg5JzUyDoivkOySdnthunRjaBKD8rlgEApM6SOIZYucU7
cPC3ZY6ORFM6Mwaw+iDW9lasM5ucHQ8=
=CHha
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.17-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc VFS updates from Christian Brauner:
"This contains the usual selections of misc updates for this cycle.
Features:
- Add ext4 IOCB_DONTCACHE support
This refactors the address_space_operations write_begin() and
write_end() callbacks to take const struct kiocb * as their first
argument, allowing IOCB flags such as IOCB_DONTCACHE to propagate
to the filesystem's buffered I/O path.
Ext4 is updated to implement handling of the IOCB_DONTCACHE flag
and advertises support via the FOP_DONTCACHE file operation flag.
Additionally, the i915 driver's shmem write paths are updated to
bypass the legacy write_begin/write_end interface in favor of
directly calling write_iter() with a constructed synchronous kiocb.
Another i915 change replaces a manual write loop with
kernel_write() during GEM shmem object creation.
Cleanups:
- don't duplicate vfs_open() in kernel_file_open()
- proc_fd_getattr(): don't bother with S_ISDIR() check
- fs/ecryptfs: replace snprintf with sysfs_emit in show function
- vfs: Remove unnecessary list_for_each_entry_safe() from
evict_inodes()
- filelock: add new locks_wake_up_waiter() helper
- fs: Remove three arguments from block_write_end()
- VFS: change old_dir and new_dir in struct renamedata to dentrys
- netfs: Remove unused declaration netfs_queue_write_request()
Fixes:
- eventpoll: Fix semi-unbounded recursion
- eventpoll: fix sphinx documentation build warning
- fs/read_write: Fix spelling typo
- fs: annotate data race between poll_schedule_timeout() and
pollwake()
- fs/pipe: set FMODE_NOWAIT in create_pipe_files()
- docs/vfs: update references to i_mutex to i_rwsem
- fs/buffer: remove comment about hard sectorsize
- fs/buffer: remove the min and max limit checks in __getblk_slow()
- fs/libfs: don't assume blocksize <= PAGE_SIZE in
generic_check_addressable
- fs_context: fix parameter name in infofc() macro
- fs: Prevent file descriptor table allocations exceeding INT_MAX"
* tag 'vfs-6.17-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (24 commits)
netfs: Remove unused declaration netfs_queue_write_request()
eventpoll: fix sphinx documentation build warning
ext4: support uncached buffered I/O
mm/pagemap: add write_begin_get_folio() helper function
fs: change write_begin/write_end interface to take struct kiocb *
drm/i915: Refactor shmem_pwrite() to use kiocb and write_iter
drm/i915: Use kernel_write() in shmem object create
eventpoll: Fix semi-unbounded recursion
vfs: Remove unnecessary list_for_each_entry_safe() from evict_inodes()
fs/libfs: don't assume blocksize <= PAGE_SIZE in generic_check_addressable
fs/buffer: remove the min and max limit checks in __getblk_slow()
fs: Prevent file descriptor table allocations exceeding INT_MAX
fs: Remove three arguments from block_write_end()
fs/ecryptfs: replace snprintf with sysfs_emit in show function
fs: annotate suspected data race between poll_schedule_timeout() and pollwake()
docs/vfs: update references to i_mutex to i_rwsem
fs/buffer: remove comment about hard sectorsize
fs_context: fix parameter name in infofc() macro
VFS: change old_dir and new_dir in struct renamedata to dentrys
proc_fd_getattr(): don't bother with S_ISDIR() check
...
APIs provided to the rest of the kernel.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCaIRDbQAKCRBZ7Krx/gZQ
63n6APwNnJXwgtSDi9N0FfHOlYqYSCaCjezVLbq+GR8K+r4wowD/TX/A4Qbyjjic
/VG8VbYe6fRaD53vp1giGI/dJiTI2Qg=
=Ta4H
-----END PGP SIGNATURE-----
Merge tag 'pull-rpc_pipefs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull rpc_pipefs updates from Al Viro:
"Massage rpc_pipefs to use saner primitives and clean up the APIs
provided to the rest of the kernel"
* tag 'pull-rpc_pipefs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
rpc_create_client_dir(): return 0 or -E...
rpc_create_client_dir(): don't bother with rpc_populate()
rpc_new_dir(): the last argument is always NULL
rpc_pipe: expand the calls of rpc_mkdir_populate()
rpc_gssd_dummy_populate(): don't bother with rpc_populate()
rpc_mkpipe_dentry(): switch to simple_start_creating()
rpc_pipe: saner primitive for creating regular files
rpc_pipe: saner primitive for creating subdirectories
rpc_pipe: don't overdo directory locking
rpc_mkpipe_dentry(): saner calling conventions
rpc_unlink(): saner calling conventions
rpc_populate(): lift cleanup into callers
rpc_unlink(): use simple_recursive_removal()
rpc_{rmdir_,}depopulate(): use simple_recursive_removal() instead
rpc_pipe: clean failure exits in fill_super
new helper: simple_start_creating()
The basic rules are simple:
* stores to dentry->d_flags are OK under dentry->d_lock.
* stores to dentry->d_flags are OK in the dentry constructor, before
becomes potentially visible to other threads.
Unfortunately, there's a couple of exceptions to that, and that's where the
headache comes from.
Main PITA comes from d_set_d_op(); that primitive sets ->d_op
of dentry and adjusts the flags that correspond to presence of individual
methods. It's very easy to misuse; existing uses _are_ safe, but proof
of correctness is brittle.
Use in __d_alloc() is safe (we are within a constructor), but we
might as well precalculate the initial value of ->d_flags when we set
the default ->d_op for given superblock and set ->d_flags directly
instead of messing with that helper.
The reasons why other uses are safe are bloody convoluted; I'm not going
to reproduce it here. See https://lore.kernel.org/all/20250224010624.GT1977892@ZenIV/
for gory details, if you care. The critical part is using d_set_d_op() only
just prior to d_splice_alias(), which makes a combination of d_splice_alias()
with setting ->d_op, etc. a natural replacement primitive. Better yet, if
we go that way, it's easy to take setting ->d_op and modifying ->d_flags
under ->d_lock, which eliminates the headache as far as ->d_flags exclusion
rules are concerned. Other exceptions are minor and easy to deal with.
What this series does:
* d_set_d_op() is no longer available; new primitive (d_splice_alias_ops())
is provided, equivalent to combination of d_set_d_op() and d_splice_alias().
* new field of struct super_block - ->s_d_flags. Default value of ->d_flags
to be used when allocating dentries on this filesystem.
* new primitive for setting ->s_d_op: set_default_d_op(). Replaces stores
to ->s_d_op at mount time. All in-tree filesystems converted; out-of-tree
ones will get caught by compiler (->s_d_op is renamed, so stores to it will
be caught). ->s_d_flags is set by the same primitive to match the ->s_d_op.
* a lot of filesystems had ->s_d_op->d_delete equal to always_delete_dentry;
that is equivalent to setting DCACHE_DONTCACHE in ->d_flags, so such filesystems
can bloody well set that bit in ->s_d_flags and drop ->d_delete() from
dentry_operations. In quite a few cases that results in empty dentry_operations,
which means that we can get rid of those.
* kill simple_dentry_operations - not needed anymore.
* massage d_alloc_parallel() to get rid of the other exception wrt ->d_flags
stores - we can set DCACHE_PAR_LOOKUP as soon as we allocate the new dentry;
no need to delay that until we commit to using the sucker.
As the result, ->d_flags stores are all either under ->d_lock or done before
the dentry becomes visible in any shared data structures.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCaIQ/tQAKCRBZ7Krx/gZQ
66AhAQDgQ+S224x5YevNXc9mDoGUBMF4OG0n0fIla9rfdL4I6wEAqpOWMNDcVPCZ
GwYOvJ9YuqNdz+MyprAI18Yza4GOmgs=
=rTYB
-----END PGP SIGNATURE-----
Merge tag 'pull-dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull dentry d_flags updates from Al Viro:
"The current exclusion rules for dentry->d_flags stores are rather
unpleasant. The basic rules are simple:
- stores to dentry->d_flags are OK under dentry->d_lock
- stores to dentry->d_flags are OK in the dentry constructor, before
becomes potentially visible to other threads
Unfortunately, there's a couple of exceptions to that, and that's
where the headache comes from.
The main PITA comes from d_set_d_op(); that primitive sets ->d_op of
dentry and adjusts the flags that correspond to presence of individual
methods. It's very easy to misuse; existing uses _are_ safe, but proof
of correctness is brittle.
Use in __d_alloc() is safe (we are within a constructor), but we might
as well precalculate the initial value of 'd_flags' when we set the
default ->d_op for given superblock and set 'd_flags' directly instead
of messing with that helper.
The reasons why other uses are safe are bloody convoluted; I'm not
going to reproduce it here. See [1] for gory details, if you care. The
critical part is using d_set_d_op() only just prior to
d_splice_alias(), which makes a combination of d_splice_alias() with
setting ->d_op, etc a natural replacement primitive.
Better yet, if we go that way, it's easy to take setting ->d_op and
modifying 'd_flags' under ->d_lock, which eliminates the headache as
far as 'd_flags' exclusion rules are concerned. Other exceptions are
minor and easy to deal with.
What this series does:
- d_set_d_op() is no longer available; instead a new primitive
(d_splice_alias_ops()) is provided, equivalent to combination of
d_set_d_op() and d_splice_alias().
- new field of struct super_block - 's_d_flags'. This sets the
default value of 'd_flags' to be used when allocating dentries on
this filesystem.
- new primitive for setting 's_d_op': set_default_d_op(). This
replaces stores to 's_d_op' at mount time.
All in-tree filesystems converted; out-of-tree ones will get caught
by the compiler ('s_d_op' is renamed, so stores to it will be
caught). 's_d_flags' is set by the same primitive to match the
's_d_op'.
- a lot of filesystems had sb->s_d_op->d_delete equal to
always_delete_dentry; that is equivalent to setting
DCACHE_DONTCACHE in 'd_flags', so such filesystems can bloody well
set that bit in 's_d_flags' and drop 'd_delete()' from
dentry_operations.
In quite a few cases that results in empty dentry_operations, which
means that we can get rid of those.
- kill simple_dentry_operations - not needed anymore
- massage d_alloc_parallel() to get rid of the other exception wrt
'd_flags' stores - we can set DCACHE_PAR_LOOKUP as soon as we
allocate the new dentry; no need to delay that until we commit to
using the sucker.
As the result, 'd_flags' stores are all either under ->d_lock or done
before the dentry becomes visible in any shared data structures"
Link: https://lore.kernel.org/all/20250224010624.GT1977892@ZenIV/ [1]
* tag 'pull-dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (21 commits)
configfs: use DCACHE_DONTCACHE
debugfs: use DCACHE_DONTCACHE
efivarfs: use DCACHE_DONTCACHE instead of always_delete_dentry()
9p: don't bother with always_delete_dentry
ramfs, hugetlbfs, mqueue: set DCACHE_DONTCACHE
kill simple_dentry_operations
devpts, sunrpc, hostfs: don't bother with ->d_op
shmem: no dentry retention past the refcount reaching zero
d_alloc_parallel(): set DCACHE_PAR_LOOKUP earlier
make d_set_d_op() static
simple_lookup(): just set DCACHE_DONTCACHE
tracefs: Add d_delete to remove negative dentries
set_default_d_op(): calculate the matching value for ->d_flags
correct the set of flags forbidden at d_set_d_op() time
split d_flags calculation out of d_set_d_op()
new helper: set_default_d_op()
fuse: no need for special dentry_operations for root dentry
switch procfs from d_set_d_op() to d_splice_alias_ops()
new helper: d_splice_alias_ops()
procfs: kill ->proc_dops
...
If the NFS client is doing writeback from a workqueue context, avoid using
__GFP_NORETRY for allocations if the task has set PF_MEMALLOC_NOIO or
PF_MEMALLOC_NOFS. The combination of these flags makes memory allocation
failures much more likely.
We've seen those allocation failures show up when the loopback driver is
doing writeback from a workqueue to a file on NFS, where memory allocation
failure results in errors or corruption within the loopback device's
filesystem.
Suggested-by: Trond Myklebust <trondmy@kernel.org>
Fixes: 0bae835b63 ("NFS: Avoid writeback threads getting stuck in mempool_alloc()")
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Laurence Oberman <loberman@redhat.com>
Tested-by: Laurence Oberman <loberman@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/f83ac1155a4bc670f2663959a7a068571e06afd9.1752111622.git.bcodding@redhat.com
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Currently, when the server supports NFS4.1 security labels then
security.selinux label in included twice. Instead, only add it
when the server doesn't possess security label support.
Fixes: 243fea1346 ("NFSv4.2: fix listxattr to return selinux security label")
Signed-off-by: Olga Kornievskaia <okorniev@redhat.com>
Link: https://lore.kernel.org/r/20250722205641.79394-1-okorniev@redhat.com
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
The function needs to check the minimal filehandle length before it can
access the embedded filehandle.
Reported-by: zhangjian <zhangjian496@huawei.com>
Fixes: 20fa190272 ("nfs: add export operations")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
nfs_delegation_find_inode currently has to walk the entire list of
delegations per inode, which can become pretty large, and can become even
larger when increasing the delegation watermark.
Add a hash table to speed up the delegation lookup, sized as a fraction
of the delegation watermark.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250718081509.2607553-6-hch@lst.de
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
The active delegation watermark was added to avoid overloading servers.
Track the active delegation per-server instead of globally so that clients
talking to multiple servers aren't limited by the global limit.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20250718081509.2607553-5-hch@lst.de
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Keep the module_param_named next to the variable declaration instead of
somewhere unrelated, following the best practice in the rest of the
kernel.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20250718081509.2607553-4-hch@lst.de
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reduce a level of indentation for most of the code in this function.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20250718081509.2607553-3-hch@lst.de
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
When an applications get killed (SIGTERM/SIGINT) while pNFS client performs a connection
to DS, client ends in an infinite loop of connect-disconnect. This
source of the issue, it that flexfilelayoutdev#nfs4_ff_layout_prepare_ds gets an error
on nfs4_pnfs_ds_connect with status ERESTARTSYS, which is set by rpc_signal_task, but
the error is treated as transient, thus retried.
The issue is reproducible with Ctrl+C the following script(there should be ~1000 files in
a directory, client should must not have any connections to DSes):
```
echo 3 > /proc/sys/vm/drop_caches
for i in *
do
head -1 $i
done
```
The change aims to propagate the nfs4_ff_layout_prepare_ds error state
to the caller that can decide whatever this is a retryable error or not.
Signed-off-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de>
Link: https://lore.kernel.org/r/20250627071751.189663-1-tigran.mkrtchyan@desy.de
Fixes: 260f32adb8 ("pNFS/flexfiles: Check the result of nfs4_pnfs_ds_connect")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Otherwise built-in NFS can lead to sectіon mismatches.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250714062450.1468117-1-hch@lst.de
Fixes: 87268f7a4f ("nfs: create a kernel keyring")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Passed the partially filled out structure to nfs4_set_client instead of
11 arguments that then get stashed into the structure.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Change the address_space_operations callbacks write_begin() and
write_end() to take struct kiocb * as the first argument instead of
struct file *.
Update all affected function prototypes, implementations, call sites,
and related documentation across VFS, filesystems, and block layer.
Part of a series refactoring address_space_operations write_begin and
write_end callbacks to use struct kiocb for passing write context and
flags.
Signed-off-by: Taotao Chen <chentaotao@didiglobal.com>
Link: https://lore.kernel.org/20250716093559.217344-4-chentaotao@didiglobal.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
At the end of the isect translation, disc_addr represents the physical
disk offset. Thus, end calculated from disk_addr is also a physical disk
offset. Therefore, range checking should be done using map->disk_offset,
not map->start.
Signed-off-by: Sergey Bashirov <sergeybashirov@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250702133226.212537-1-sergeybashirov@gmail.com
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
When there are too many block extents for a layoutcommit, they may not
all fit into the maximum-sized RPC. This patch allows the generic pnfs
code to properly handle -ENOSPC returned by the block/scsi layout driver
and trigger additional layoutcommits if necessary.
Co-developed-by: Konstantin Evtushenko <koevtushenko@yandex.com>
Signed-off-by: Konstantin Evtushenko <koevtushenko@yandex.com>
Signed-off-by: Sergey Bashirov <sergeybashirov@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250630183537.196479-5-sergeybashirov@gmail.com
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
The ext_tree_encode_commit() function may be called multiple times for
the same file, layout, and last written byte if the provided buffer is
not large enough to encode all extents in it.
The first problem is that the last written byte field must be zeroed
only on a successful call, otherwise we will lose its actual value and
get an integer overflow on the next encoding attempt.
The second problem is that we can't count and encode in one pass. The
extent state changes during encoding, so if we return -ENOSPC but have
already encoded some extents into a small buffer, they will not be
re-encoded into a new larger buffer on the next try. As a result, the
client never commits these extents to the server.
Co-developed-by: Konstantin Evtushenko <koevtushenko@yandex.com>
Signed-off-by: Konstantin Evtushenko <koevtushenko@yandex.com>
Signed-off-by: Sergey Bashirov <sergeybashirov@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250630183537.196479-3-sergeybashirov@gmail.com
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
The error occurs on the third attempt to encode extents. When function
ext_tree_prepare_commit() reallocates a larger buffer to retry encoding
extents, the "layoutupdate_pages" page array is initialized only after the
retry loop. But ext_tree_free_commitdata() is called on every iteration
and tries to put pages in the array, thus dereferencing uninitialized
pointers.
An additional problem is that there is no limit on the maximum possible
buffer_size. When there are too many extents, the client may create a
layoutcommit that is larger than the maximum possible RPC size accepted
by the server.
During testing, we observed two typical scenarios. First, one memory page
for extents is enough when we work with small files, append data to the
end of the file, or preallocate extents before writing. But when we fill
a new large file without preallocating, the number of extents can be huge,
and counting the number of written extents in ext_tree_encode_commit()
does not help much. Since this number increases even more between
unlocking and locking of ext_tree, the reallocated buffer may not be
large enough again and again.
Co-developed-by: Konstantin Evtushenko <koevtushenko@yandex.com>
Signed-off-by: Konstantin Evtushenko <koevtushenko@yandex.com>
Signed-off-by: Sergey Bashirov <sergeybashirov@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250630183537.196479-2-sergeybashirov@gmail.com
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
nfs_umount() has been unused since 2013's
commit 4580a92d44 ("NFS: Use server-recommended security flavor by
default (NFSv3)")
Remove it.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Link: https://lore.kernel.org/r/20250218215250.263709-1-linux@treblig.org
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Create a kernel .nfs keyring similar to the nvme .nvme one. Unlike for
a userspace-created keyrind, tlshd is a possesor of the keys with this
and thus the keys don't need user read permissions.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Link: https://lore.kernel.org/r/20250515115107.33052-3-hch@lst.de
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Allow tlshd to use a per-mount key from the kernel keyring similar
to NVMe over TCP.
Note that tlshd expects keys and certificates stored in the kernel
keyring to be in DER format, not the PEM format used for file based keys
and certificates, so they need to be converted before they are added
to the keyring, which is a bit unexpected.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Link: https://lore.kernel.org/r/20250515115107.33052-2-hch@lst.de
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
When the mode is MIGRATE_SYNC, we are allowed to call nfs_wb_folio()
under the folio lock.
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Add new tracepoints in the NFSv4 match_stateid minorversion op that show
the info in both stateids.
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20250618-nfs-tracepoints-v2-4-540c9fb48da2@kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Add a tracepoint in the function that decides whether to return a
delegation to the server.
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20250618-nfs-tracepoints-v2-3-540c9fb48da2@kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
We have tracepoints for setting a delegation and reclaiming them. Add a
tracepoint for when the delegation is being detached from the inode.
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20250618-nfs-tracepoints-v2-2-540c9fb48da2@kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Managing the cache_validity flags is the deep voodoo of NFS cache
coherency. Let's have a little extra visibility into that value via the
nfs_inode_event tracepoints.
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20250618-nfs-tracepoints-v2-1-540c9fb48da2@kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
The last code that was using this was removed via commit ca0daa277a
("NFS: Cache aggressively when file is open for writing") which was
merged in v4.8-rc1, so it can be removed completely.
Signed-off-by: Anthony Iliopoulos <ailiop@suse.com>
Link: https://lore.kernel.org/r/20250613094439.82338-3-ailiop@suse.com
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
The wpages field is not serving any purpose since commit c63c7b0513
("NFS: Fix a race when doing NFS write coalescing") which was merged in
v2.6.22-rc1. Remove it completely.
Signed-off-by: Anthony Iliopoulos <ailiop@suse.com>
Link: https://lore.kernel.org/r/20250613094439.82338-2-ailiop@suse.com
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Rather than open-coding this function call it to make intention clear
and to use "correct" nesting levels (parent and child are for
directories).
This is purely cosmetic with no expected change in behaviour.
Signed-off-by: NeilBrown <neil@brown.name>
Link: https://lore.kernel.org/r/174942091741.608730.3327223511347232829@noble.neil.brown.name
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Add tracking of the create time (a.k.a. btime) along with corresponding
bitfields, request, and decode xdr routines.
Signed-off-by: Anne Marie Merritt <annemarie.merritt@primarydata.com>
Signed-off-by: Lance Shelton <lance.shelton@hammerspace.com>
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/1e3677b0655fa2bbaba0817b41d111d94a06e5ee.1748515333.git.bcodding@redhat.com
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
We need to be able to track more than 32 attributes per inode.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Lance Shelton <lance.shelton@hammerspace.com>
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/1e3405fca54efd0be7c91c1da77917b94f5dfcc4.1748515333.git.bcodding@redhat.com
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
This field is now only set to one in the i915 gem code that only calls
writeback_iter on it, which ignores the flag. All other checks are thuse
dead code and the field can be removed.
Link: https://lkml.kernel.org/r/20250610054959.2057526-7-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Instead of returning a dentry or ERR_PTR(-E...), return 0 and store
dentry into pipe->dentry on success and return -E... on failure.
Callers are happier that way...
NOTE: dummy rpc_pipe is getting ->dentry set; we never access that,
since we
1) never call rpc_unlink() for it (dentry is taken out by
->kill_sb())
2) never call rpc_queue_upcall() for it (writing to that
sucker fails; no downcalls are ever submitted, so no replies are
going to arrive)
IOW, having that ->dentry set (and left dangling) is harmless,
if ugly; cleaner solution will take more massage.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
1) pass it pipe instead of pipe->dentry
2) zero pipe->dentry afterwards
3) it always returns 0; why bother?
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Allow the flexfiles error handling to recognise NFS level errors (as
opposed to RPC level errors) and handle them separately. The main
motivator is the NFSERR_PERM errors that get returned if the NFS client
connects to the data server through a port number that is lower than
1024. In that case, the client should disconnect and retry a READ on a
different data server, or it should retry a WRITE after reconnecting.
Reviewed-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de>
Fixes: d67ae825a5 ("pnfs/flexfiles: Add the FlexFile Layout Driver")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
We found a few different systems hung up in writeback waiting on the same
page lock, and one task waiting on the NFS_LAYOUT_DRAIN bit in
pnfs_update_layout(), however the pnfs_layout_hdr's plh_outstanding count
was zero.
It seems most likely that this is another race between the waiter and waker
similar to commit ed0172af5d ("SUNRPC: Fix a race to wake a sync task").
Fix it up by applying the advised barrier.
Fixes: 880265c77a ("pNFS: Avoid a live lock condition in pnfs_update_layout()")
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>