- The 2 patch series "squashfs: Remove page->mapping references" from
Matthew Wilcox gets us closer to being able to remove page->mapping.
- The 5 patch series "relayfs: misc changes" from Jason Xing does some
maintenance and minor feature addition work in relayfs.
- The 5 patch series "kdump: crashkernel reservation from CMA" from Jiri
Bohac switches us from static preallocation of the kdump crashkernel's
working memory over to dynamic allocation. So the difficulty of
a-priori estimation of the second kernel's needs is removed and the
first kernel obtains extra memory.
- The 5 patch series "generalize panic_print's dump function to be used
by other kernel parts" from Feng Tang implements some consolidation and
rationalizatio of the various ways in which a faiing kernel splats
information at the operator.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaI+82gAKCRDdBJ7gKXxA
jj4JAP9xb+w9DrBY6sa+7KTPIb+aTqQ7Zw3o9O2m+riKQJv6jAEA6aEwRnDA0451
fDT5IqVlCWGvnVikdZHSnvhdD7TGsQ0=
=rT71
-----END PGP SIGNATURE-----
Merge tag 'mm-nonmm-stable-2025-08-03-12-47' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton:
"Significant patch series in this pull request:
- "squashfs: Remove page->mapping references" (Matthew Wilcox) gets
us closer to being able to remove page->mapping
- "relayfs: misc changes" (Jason Xing) does some maintenance and
minor feature addition work in relayfs
- "kdump: crashkernel reservation from CMA" (Jiri Bohac) switches
us from static preallocation of the kdump crashkernel's working
memory over to dynamic allocation. So the difficulty of a-priori
estimation of the second kernel's needs is removed and the first
kernel obtains extra memory
- "generalize panic_print's dump function to be used by other
kernel parts" (Feng Tang) implements some consolidation and
rationalization of the various ways in which a failing kernel
splats information at the operator
* tag 'mm-nonmm-stable-2025-08-03-12-47' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (80 commits)
tools/getdelays: add backward compatibility for taskstats version
kho: add test for kexec handover
delaytop: enhance error logging and add PSI feature description
samples: Kconfig: fix spelling mistake "instancess" -> "instances"
fat: fix too many log in fat_chain_add()
scripts/spelling.txt: add notifer||notifier to spelling.txt
xen/xenbus: fix typo "notifer"
net: mvneta: fix typo "notifer"
drm/xe: fix typo "notifer"
cxl: mce: fix typo "notifer"
KVM: x86: fix typo "notifer"
MAINTAINERS: add maintainers for delaytop
ucount: use atomic_long_try_cmpxchg() in atomic_long_inc_below()
ucount: fix atomic_long_inc_below() argument type
kexec: enable CMA based contiguous allocation
stackdepot: make max number of pools boot-time configurable
lib/xxhash: remove unused functions
init/Kconfig: restore CONFIG_BROKEN help text
lib/raid6: update recov_rvv.c zero page usage
docs: update docs after introducing delaytop
...
The usage of read_seqbegin_or_lock() in nfs_copy_boot_verifier()
is wrong. "seq" is always even and thus "or_lock" has no effect.
nfs_copy_boot_verifier() just copies 8 bytes and is supposed to be
very rare operation, so we do not need the adaptive locking in this case.
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Link: https://lore.kernel.org/r/20250731081038.3478-1-lirongqing@baidu.com
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
SMB3.1.1 POSIX mounts require symlinks to be created natively with
IO_REPARSE_TAG_SYMLINK reparse point.
Cc: linux-cifs@vger.kernel.org
Cc: Ralph Boehme <slow@samba.org>
Cc: David Howells <dhowells@redhat.com>
Cc: <stable@vger.kernel.org>
Reported-by: Matthew Richardson <m.richardson@ed.ac.uk>
Closes: https://marc.info/?i=1124e7cd-6a46-40a6-9f44-b7664a66654b@ed.ac.uk
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmiLj1MACgkQiiy9cAdy
T1H2UQv9ErlPpB5zI5nmwCqKg3G2uGQ9yXk1iwIPKzwqA3ypPGRz1KpC80TN2SBA
XdjtDyTlLVasxCMomzodHCloEQSJ4HYWcOL4uzVhhbiX/KqVoWofb5D5QKzkiWvK
HkIgFCOipXxpg4871+O2fmxxpm+D6dJHz8ffARwb6VTmp1bSBEXR8UTEVXo7LJAK
jFZv9BPEkKkdnIZw/otbV1eyp8xxtEPZRqMgCsMgdjp1L03ujnFe0COxHnQjYC5o
MygdVslmuuLswsDoEgZmzVLH7ubAUUhbYzkL6xd0c6zFf1jE67uKDQqOu0kYZCjr
/dTh33WM4vDjjOGBm6XMFBEKp8+CLHKEMl8qj48MKhGpeGaCSoFP5/5cl+BRZV6D
sZqJy3cZw3hApQeJHwBjWPdujczmPQ1/BtWkyDIccC//v9rlovmVhd3t2VM7BylS
5rGgpa9OcFR1Dg+U63ONggTPgdFqEcmuLRXFnEMi+n8TwahIaVkrPz6AmG+8gLyH
u9tw3DLo
=fjzp
-----END PGP SIGNATURE-----
Merge tag 'v6.17-rc-part1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client updates from Steve French:
- Fix network namespace refcount leak
- Multichannel reconnect fix
- Perf improvement to not do unneeded EA query on native symlinks
- Performance improvement for directory leases to allow extending lease
for actively queried directories
- Improve debugging of directory leases by adding pseudofile to show
them
- Five minor mount cleanup patches
- Minor directory lease cleanup patch
- Allow creating special files via reparse points over SMB1
- Two minor improvements to FindFirst over SMB1
- Two NTLMSSP session setup fixes
* tag 'v6.17-rc-part1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
smb3 client: add way to show directory leases for improved debugging
smb: client: get rid of kstrdup() when parsing iocharset mount option
smb: client: get rid of kstrdup() when parsing domain mount option
smb: client: get rid of kstrdup() when parsing pass2 mount option
smb: client: get rid of kstrdup() when parsing pass mount option
smb: client: get rid of kstrdup() when parsing user mount option
cifs: Add support for creating reparse points over SMB1
cifs: Do not query WSL EAs for native SMB symlink
cifs: Optimize CIFSFindFirst() response when not searching
cifs: Fix calling CIFSFindFirst() for root path without msearch
smb: client: fix session setup against servers that require SPN
smb: client: allow parsing zero-length AV pairs
cifs: add new field to track the last access time of cfid
smb: change return type of cached_dir_lease_break() to bool
cifs: reset iface weights when we cannot find a candidate
smb: client: fix netns refcount leak after net_passive changes
An infinite loop may occur if the following conditions occur due to
file system corruption.
(1) Condition for exfat_count_dir_entries() to loop infinitely.
- The cluster chain includes a loop.
- There is no UNUSED entry in the cluster chain.
(2) Condition for exfat_create_upcase_table() to loop infinitely.
- The cluster chain of the root directory includes a loop.
- There are no UNUSED entry and up-case table entry in the cluster
chain of the root directory.
(3) Condition for exfat_load_bitmap() to loop infinitely.
- The cluster chain of the root directory includes a loop.
- There are no UNUSED entry and bitmap entry in the cluster chain
of the root directory.
(4) Condition for exfat_find_dir_entry() to loop infinitely.
- The cluster chain includes a loop.
- The unused directory entries were exhausted by some operation.
(5) Condition for exfat_check_dir_empty() to loop infinitely.
- The cluster chain includes a loop.
- The unused directory entries were exhausted by some operation.
- All files and sub-directories under the directory are deleted.
This commit adds checks to break the above infinite loop.
Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Test: androbench by default setting, use 64GB sdcard.
the random write speed:
without this patch 3.5MB/s
with this patch 7MB/s
After patch "11a347fb6cef", the random write speed decreased significantly.
the .write_iter() interface had been modified, and check the differences
with generic_file_write_iter(), when calling generic_write_sync() and
exfat_file_write_iter() to call vfs_fsync_range(), the fdatasync flag is
wrong, and make not use the fdatasync mode, and make random write speed
decreased. So use generic_write_sync() instead of vfs_fsync_range().
Fixes: 11a347fb6c ("exfat: change to get file size from DataLength")
Signed-off-by: Zhengxu Zhang <zhengxu.zhang@unisoc.com>
Acked-by: Yuezhang Mo <Yuezhang.Mo@sony.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
- The 4 patch series "mm: ksm: prevent KSM from breaking merging of new
VMAs" from Lorenzo Stoakes addresses an issue with KSM's
PR_SET_MEMORY_MERGE mode: newly mapped VMAs were not eligible for
merging with existing adjacent VMAs.
- The 4 patch series "mm/damon: introduce DAMON_STAT for simple and
practical access monitoring" from SeongJae Park adds a new kernel module
which simplifies the setup and usage of DAMON in production
environments.
- The 6 patch series "stop passing a writeback_control to swap/shmem
writeout" from Christoph Hellwig is a cleanup to the writeback code
which removes a couple of pointers from struct writeback_control.
- The 7 patch series "drivers/base/node.c: optimization and cleanups"
from Donet Tom contains largely uncorrelated cleanups to the NUMA node
setup and management code.
- The 4 patch series "mm: userfaultfd: assorted fixes and cleanups" from
Tal Zussman does some maintenance work on the userfaultfd code.
- The 5 patch series "Readahead tweaks for larger folios" from Ryan
Roberts implements some tuneups for pagecache readahead when it is
reading into order>0 folios.
- The 4 patch series "selftests/mm: Tweaks to the cow test" from Mark
Brown provides some cleanups and consistency improvements to the
selftests code.
- The 4 patch series "Optimize mremap() for large folios" from Dev Jain
does that. A 37% reduction in execution time was measured in a
memset+mremap+munmap microbenchmark.
- The 5 patch series "Remove zero_user()" from Matthew Wilcox expunges
zero_user() in favor of the more modern memzero_page().
- The 3 patch series "mm/huge_memory: vmf_insert_folio_*() and
vmf_insert_pfn_pud() fixes" from David Hildenbrand addresses some warts
which David noticed in the huge page code. These were not known to be
causing any issues at this time.
- The 3 patch series "mm/damon: use alloc_migrate_target() for
DAMOS_MIGRATE_{HOT,COLD" from SeongJae Park provides some cleanup and
consolidation work in DAMON.
- The 3 patch series "use vm_flags_t consistently" from Lorenzo Stoakes
uses vm_flags_t in places where we were inappropriately using other
types.
- The 3 patch series "mm/memfd: Reserve hugetlb folios before
allocation" from Vivek Kasireddy increases the reliability of large page
allocation in the memfd code.
- The 14 patch series "mm: Remove pXX_devmap page table bit and pfn_t
type" from Alistair Popple removes several now-unneeded PFN_* flags.
- The 5 patch series "mm/damon: decouple sysfs from core" from SeongJae
Park implememnts some cleanup and maintainability work in the DAMON
sysfs layer.
- The 5 patch series "madvise cleanup" from Lorenzo Stoakes does quite a
lot of cleanup/maintenance work in the madvise() code.
- The 4 patch series "madvise anon_name cleanups" from Vlastimil Babka
provides additional cleanups on top or Lorenzo's effort.
- The 11 patch series "Implement numa node notifier" from Oscar Salvador
creates a standalone notifier for NUMA node memory state changes.
Previously these were lumped under the more general memory on/offline
notifier.
- The 6 patch series "Make MIGRATE_ISOLATE a standalone bit" from Zi Yan
cleans up the pageblock isolation code and fixes a potential issue which
doesn't seem to cause any problems in practice.
- The 5 patch series "selftests/damon: add python and drgn based DAMON
sysfs functionality tests" from SeongJae Park adds additional drgn- and
python-based DAMON selftests which are more comprehensive than the
existing selftest suite.
- The 5 patch series "Misc rework on hugetlb faulting path" from Oscar
Salvador fixes a rather obscure deadlock in the hugetlb fault code and
follows that fix with a series of cleanups.
- The 3 patch series "cma: factor out allocation logic from
__cma_declare_contiguous_nid" from Mike Rapoport rationalizes and cleans
up the highmem-specific code in the CMA allocator.
- The 28 patch series "mm/migration: rework movable_ops page migration
(part 1)" from David Hildenbrand provides cleanups and
future-preparedness to the migration code.
- The 2 patch series "mm/damon: add trace events for auto-tuned
monitoring intervals and DAMOS quota" from SeongJae Park adds some
tracepoints to some DAMON auto-tuning code.
- The 6 patch series "mm/damon: fix misc bugs in DAMON modules" from
SeongJae Park does that.
- The 6 patch series "mm/damon: misc cleanups" from SeongJae Park also
does what it claims.
- The 4 patch series "mm: folio_pte_batch() improvements" from David
Hildenbrand cleans up the large folio PTE batching code.
- The 13 patch series "mm/damon/vaddr: Allow interleaving in
migrate_{hot,cold} actions" from SeongJae Park facilitates dynamic
alteration of DAMON's inter-node allocation policy.
- The 3 patch series "Remove unmap_and_put_page()" from Vishal Moola
provides a couple of page->folio conversions.
- The 4 patch series "mm: per-node proactive reclaim" from Davidlohr
Bueso implements a per-node control of proactive reclaim - beyond the
current memcg-based implementation.
- The 14 patch series "mm/damon: remove damon_callback" from SeongJae
Park replaces the damon_callback interface with a more general and
powerful damon_call()+damos_walk() interface.
- The 10 patch series "mm/mremap: permit mremap() move of multiple VMAs"
from Lorenzo Stoakes implements a number of mremap cleanups (of course)
in preparation for adding new mremap() functionality: newly permit the
remapping of multiple VMAs when the user is specifying MREMAP_FIXED. It
still excludes some specialized situations where this cannot be
performed reliably.
- The 3 patch series "drop hugetlb_free_pgd_range()" from Anthony Yznaga
switches some sparc hugetlb code over to the generic version and removes
the thus-unneeded hugetlb_free_pgd_range().
- The 4 patch series "mm/damon/sysfs: support periodic and automated
stats update" from SeongJae Park augments the present
userspace-requested update of DAMON sysfs monitoring files. Automatic
update is now provided, along with a tunable to control the update
interval.
- The 4 patch series "Some randome fixes and cleanups to swapfile" from
Kemeng Shi does what is claims.
- The 4 patch series "mm: introduce snapshot_page" from Luiz Capitulino
and David Hildenbrand provides (and uses) a means by which debug-style
functions can grab a copy of a pageframe and inspect it locklessly
without tripping over the races inherent in operating on the live
pageframe directly.
- The 6 patch series "use per-vma locks for /proc/pid/maps reads" from
Suren Baghdasaryan addresses the large contention issues which can be
triggered by reads from that procfs file. Latencies are reduced by more
than half in some situations. The series also introduces several new
selftests for the /proc/pid/maps interface.
- The 6 patch series "__folio_split() clean up" from Zi Yan cleans up
__folio_split()!
- The 7 patch series "Optimize mprotect() for large folios" from Dev
Jain provides some quite large (>3x) speedups to mprotect() when dealing
with large folios.
- The 2 patch series "selftests/mm: reuse FORCE_READ to replace "asm
volatile("" : "+r" (XXX));" and some cleanup" from wang lian does some
cleanup work in the selftests code.
- The 3 patch series "tools/testing: expand mremap testing" from Lorenzo
Stoakes extends the mremap() selftest in several ways, including adding
more checking of Lorenzo's recently added "permit mremap() move of
multiple VMAs" feature.
- The 22 patch series "selftests/damon/sysfs.py: test all parameters"
from SeongJae Park extends the DAMON sysfs interface selftest so that it
tests all possible user-requested parameters. Rather than the present
minimal subset.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaIqcCgAKCRDdBJ7gKXxA
jkVBAQCCn9DR1QP0CRk961ot0cKzOgioSc0aA03DPb2KXRt2kQEAzDAz0ARurFhL
8BzbvI0c+4tntHLXvIlrC33n9KWAOQM=
=XsFy
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2025-07-30-15-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
"As usual, many cleanups. The below blurbiage describes 42 patchsets.
21 of those are partially or fully cleanup work. "cleans up",
"cleanup", "maintainability", "rationalizes", etc.
I never knew the MM code was so dirty.
"mm: ksm: prevent KSM from breaking merging of new VMAs" (Lorenzo Stoakes)
addresses an issue with KSM's PR_SET_MEMORY_MERGE mode: newly
mapped VMAs were not eligible for merging with existing adjacent
VMAs.
"mm/damon: introduce DAMON_STAT for simple and practical access monitoring" (SeongJae Park)
adds a new kernel module which simplifies the setup and usage of
DAMON in production environments.
"stop passing a writeback_control to swap/shmem writeout" (Christoph Hellwig)
is a cleanup to the writeback code which removes a couple of
pointers from struct writeback_control.
"drivers/base/node.c: optimization and cleanups" (Donet Tom)
contains largely uncorrelated cleanups to the NUMA node setup and
management code.
"mm: userfaultfd: assorted fixes and cleanups" (Tal Zussman)
does some maintenance work on the userfaultfd code.
"Readahead tweaks for larger folios" (Ryan Roberts)
implements some tuneups for pagecache readahead when it is reading
into order>0 folios.
"selftests/mm: Tweaks to the cow test" (Mark Brown)
provides some cleanups and consistency improvements to the
selftests code.
"Optimize mremap() for large folios" (Dev Jain)
does that. A 37% reduction in execution time was measured in a
memset+mremap+munmap microbenchmark.
"Remove zero_user()" (Matthew Wilcox)
expunges zero_user() in favor of the more modern memzero_page().
"mm/huge_memory: vmf_insert_folio_*() and vmf_insert_pfn_pud() fixes" (David Hildenbrand)
addresses some warts which David noticed in the huge page code.
These were not known to be causing any issues at this time.
"mm/damon: use alloc_migrate_target() for DAMOS_MIGRATE_{HOT,COLD" (SeongJae Park)
provides some cleanup and consolidation work in DAMON.
"use vm_flags_t consistently" (Lorenzo Stoakes)
uses vm_flags_t in places where we were inappropriately using other
types.
"mm/memfd: Reserve hugetlb folios before allocation" (Vivek Kasireddy)
increases the reliability of large page allocation in the memfd
code.
"mm: Remove pXX_devmap page table bit and pfn_t type" (Alistair Popple)
removes several now-unneeded PFN_* flags.
"mm/damon: decouple sysfs from core" (SeongJae Park)
implememnts some cleanup and maintainability work in the DAMON
sysfs layer.
"madvise cleanup" (Lorenzo Stoakes)
does quite a lot of cleanup/maintenance work in the madvise() code.
"madvise anon_name cleanups" (Vlastimil Babka)
provides additional cleanups on top or Lorenzo's effort.
"Implement numa node notifier" (Oscar Salvador)
creates a standalone notifier for NUMA node memory state changes.
Previously these were lumped under the more general memory
on/offline notifier.
"Make MIGRATE_ISOLATE a standalone bit" (Zi Yan)
cleans up the pageblock isolation code and fixes a potential issue
which doesn't seem to cause any problems in practice.
"selftests/damon: add python and drgn based DAMON sysfs functionality tests" (SeongJae Park)
adds additional drgn- and python-based DAMON selftests which are
more comprehensive than the existing selftest suite.
"Misc rework on hugetlb faulting path" (Oscar Salvador)
fixes a rather obscure deadlock in the hugetlb fault code and
follows that fix with a series of cleanups.
"cma: factor out allocation logic from __cma_declare_contiguous_nid" (Mike Rapoport)
rationalizes and cleans up the highmem-specific code in the CMA
allocator.
"mm/migration: rework movable_ops page migration (part 1)" (David Hildenbrand)
provides cleanups and future-preparedness to the migration code.
"mm/damon: add trace events for auto-tuned monitoring intervals and DAMOS quota" (SeongJae Park)
adds some tracepoints to some DAMON auto-tuning code.
"mm/damon: fix misc bugs in DAMON modules" (SeongJae Park)
does that.
"mm/damon: misc cleanups" (SeongJae Park)
also does what it claims.
"mm: folio_pte_batch() improvements" (David Hildenbrand)
cleans up the large folio PTE batching code.
"mm/damon/vaddr: Allow interleaving in migrate_{hot,cold} actions" (SeongJae Park)
facilitates dynamic alteration of DAMON's inter-node allocation
policy.
"Remove unmap_and_put_page()" (Vishal Moola)
provides a couple of page->folio conversions.
"mm: per-node proactive reclaim" (Davidlohr Bueso)
implements a per-node control of proactive reclaim - beyond the
current memcg-based implementation.
"mm/damon: remove damon_callback" (SeongJae Park)
replaces the damon_callback interface with a more general and
powerful damon_call()+damos_walk() interface.
"mm/mremap: permit mremap() move of multiple VMAs" (Lorenzo Stoakes)
implements a number of mremap cleanups (of course) in preparation
for adding new mremap() functionality: newly permit the remapping
of multiple VMAs when the user is specifying MREMAP_FIXED. It still
excludes some specialized situations where this cannot be performed
reliably.
"drop hugetlb_free_pgd_range()" (Anthony Yznaga)
switches some sparc hugetlb code over to the generic version and
removes the thus-unneeded hugetlb_free_pgd_range().
"mm/damon/sysfs: support periodic and automated stats update" (SeongJae Park)
augments the present userspace-requested update of DAMON sysfs
monitoring files. Automatic update is now provided, along with a
tunable to control the update interval.
"Some randome fixes and cleanups to swapfile" (Kemeng Shi)
does what is claims.
"mm: introduce snapshot_page" (Luiz Capitulino and David Hildenbrand)
provides (and uses) a means by which debug-style functions can grab
a copy of a pageframe and inspect it locklessly without tripping
over the races inherent in operating on the live pageframe
directly.
"use per-vma locks for /proc/pid/maps reads" (Suren Baghdasaryan)
addresses the large contention issues which can be triggered by
reads from that procfs file. Latencies are reduced by more than
half in some situations. The series also introduces several new
selftests for the /proc/pid/maps interface.
"__folio_split() clean up" (Zi Yan)
cleans up __folio_split()!
"Optimize mprotect() for large folios" (Dev Jain)
provides some quite large (>3x) speedups to mprotect() when dealing
with large folios.
"selftests/mm: reuse FORCE_READ to replace "asm volatile("" : "+r" (XXX));" and some cleanup" (wang lian)
does some cleanup work in the selftests code.
"tools/testing: expand mremap testing" (Lorenzo Stoakes)
extends the mremap() selftest in several ways, including adding
more checking of Lorenzo's recently added "permit mremap() move of
multiple VMAs" feature.
"selftests/damon/sysfs.py: test all parameters" (SeongJae Park)
extends the DAMON sysfs interface selftest so that it tests all
possible user-requested parameters. Rather than the present minimal
subset"
* tag 'mm-stable-2025-07-30-15-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (370 commits)
MAINTAINERS: add missing headers to mempory policy & migration section
MAINTAINERS: add missing file to cgroup section
MAINTAINERS: add MM MISC section, add missing files to MISC and CORE
MAINTAINERS: add missing zsmalloc file
MAINTAINERS: add missing files to page alloc section
MAINTAINERS: add missing shrinker files
MAINTAINERS: move memremap.[ch] to hotplug section
MAINTAINERS: add missing mm_slot.h file THP section
MAINTAINERS: add missing interval_tree.c to memory mapping section
MAINTAINERS: add missing percpu-internal.h file to per-cpu section
mm/page_alloc: remove trace_mm_alloc_contig_migrate_range_info()
selftests/damon: introduce _common.sh to host shared function
selftests/damon/sysfs.py: test runtime reduction of DAMON parameters
selftests/damon/sysfs.py: test non-default parameters runtime commit
selftests/damon/sysfs.py: generalize DAMON context commit assertion
selftests/damon/sysfs.py: generalize monitoring attributes commit assertion
selftests/damon/sysfs.py: generalize DAMOS schemes commit assertion
selftests/damon/sysfs.py: test DAMOS filters commitment
selftests/damon/sysfs.py: generalize DAMOS scheme commit assertion
selftests/damon/sysfs.py: test DAMOS destinations commitment
...
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmiHoygACgkQnJ2qBz9k
QNnQCwf9Et0BcGgtcE9bUPJMlK5AEJ2nsEGT24ajRPcJ9DtcpO5KY8htshPwvRBc
KeSW3bjYaBluKK6LmbvDbaDi4lBwFNKh+dfi8PeVwBECNn2DLNv+EOB6VPgxREWw
zmC3SL2oh7eGTtJq0h1XeI/s/AFYukOsjKcjoKmT3NyxkCwxYUoY3E1mePutBsW6
/2CjVdQoMUgw1kLcezdccOJi6IkAk5Vfr8gIRFucQiR49YIICYk17bhYBH/Gx5PJ
eJkqncq8FLv8gLaaHq5W8lC05nVnpNDFxUxKiN6YcuYUrUC7jFi7ad1EriZP3Zsr
y8pDAmVUHgzpiMYiPHhP2zG8xK06TQ==
=ZyXo
-----END PGP SIGNATURE-----
Merge tag 'fsnotify_for_v6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull fsnotify updates from Jan Kara:
"A couple of small improvements for fsnotify subsystem.
The most interesting is probably Amir's change modifying the meaning
of fsnotify fmode bits (and I spell it out specifically because I know
you care about those). There's no change for the common cases of no
fsnotify watches or no permission event watches. But when there are
permission watches (either for open or for pre-content events) but no
FAN_ACCESS_PERM watch (which nobody uses in practice) we are now able
optimize away unnecessary cache loads from the read path"
* tag 'fsnotify_for_v6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
fsnotify: optimize FMODE_NONOTIFY_PERM for the common cases
fsnotify: merge file_set_fsnotify_mode_from_watchers() with open perm hook
samples: fix building fs-monitor on musl systems
fanotify: sanitize handle_type values when reporting fid
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEIodevzQLVs53l6BhNqiEXrVAjGQFAmiLd+AACgkQNqiEXrVA
jGTZjxAAjJGErKMS4XwC10Cpyy7en97xQF8qHnGWr0nIss7dvdN0zw8oUBCZFuso
uhtWddAcEwuWK+VcNj1TdEf3tSyxbjGkMA9Gh7amv7eI7YVNrFz7OZBSWMsd2DPP
ZJzPfExDIGEygWP5KDPd3qrCIA+hMwwJmpCY0sO7zZbrH8aftOUJQe5Gi/a+43f4
n+vTE1d/0h0PK/Vvy1Ceiql7V0HdJ7SWOfBfTTTQivdzdk8UFqgDoC27WJ1DpfQR
XTTEIoqizSeSV5YXDqwF6d7QQ3gROvzQlk0J7r3CUJ4gYHF0hDEFmHStYnkz8ZKB
Z+KMMEnd2Mt/NUFBI6+tYnMzAVj5fq4XLTfe7GqM7ZL+z2JRH+kAT7SGQJ0fabPA
FlNIA7fpLd1c0uuZghgS1rdY2uLYJEY2gFumcn9T0fbMsE10hSbfrkbldXCvkMpN
z1pLHj0yvG1yXUOXjgvdkpJEGpdp63Dq7qNq5yAbXCURzH08nAQQ2Fx9vX86WNta
8vkBrx/6X2uIE7Cd4WOSj3IYb4LY9BSBc9fScVd7D5kkwN0qAHTJNSdMUO0vpvjx
+wwYMhfIVAreacrmSVsCAh+Zaf7zg/VwS+5GZ+trBx5/KNeLVI5f/QbOXcDhHWm3
Y/2oQ8TAZgcPoyIROmZ/bQD4T+v48IESNrE54yt5WUazAeIxqYo=
=jTbL
-----END PGP SIGNATURE-----
Merge tag 'jfs-6.17' of github.com:kleikamp/linux-shaggy
Pull jfs updates from Dave Kleikamp:
"Fixes and cleanups for JFS filesystem"
* tag 'jfs-6.17' of github.com:kleikamp/linux-shaggy:
jfs: fix metapage reference count leak in dbAllocCtl
jfs: stop using write_cache_pages
jfs: truncate good inode pages when hard link is 0
jfs: jfs_xtree: replace XT_GETPAGE macro with xt_getpage()
jfs: Regular file corruption check
jfs: upper bound check of tree index in dbAllocAG
Change scnprintf to sysfs_emit in sysfs code.
Change sprintf to scnprintf in debugfs code.
Refactor debugfs mask-to-string code for readability and slightly
improved functionality.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEWLOQAZkv9m2xtUCf/Vh1C1u0YnAFAmiJMgUACgkQ/Vh1C1u0
YnB27xAAgnJMTPfbrRF2El/WwpceZQzMtXyuzlBNgr5r3P04afEl38ztBMrwGKr1
mZlPciTQiCZElAhUHUOCG/pswMnALjBobS0rUGITKwWAv+Ttzr6InhwHc9w7aYm/
RBfVGaRMObAIiKpH5YSMwqVhkXsMk+G9Ekx/snx31LgBN6JiCgPHBAAAo5fhKixa
ifyLuxsRJZAfgMzhMxuPVkjuNWrE8ee0e9bSkrNY+hMMKNtaLuhIAyuo7XZcAwCx
1qysJiTaxrHhn5A2Fc/s6GzhWMlXoNu76OniBpKDbHUGNKhFRXgb9yrb5k1ezgwv
CMsVuPPi3p7gyUFPKGtUGZARqeKSoq2pvg5WR50DvgRFx4PyyPC5o/98sBo+ASJS
Tvh5Kcuze1RqoyZIvj1/1eLYxxwywRVYjlurE1C8qEPTa+v6RkkCRk7V6vN++U2u
cO4/2MtDLphsuGDJdC8/bzfhaBlUGQ2FuK6nVvLhNAvSeAA8qEPOxgQQZqDMc710
9KdbeL2kuisUZggplB7vmlmJfT/Qd6tTK1vvIkHNIgczLQTwIjVYtCblJvdbidBl
5GbLWFLq2rifR3U9s3exny9dpuFEF1Xm6YgVdCf0nSCg60F9UM2Og4Jt0enPzOTz
Js+GCMOQsGDa2E/eZf0yr782KxTdP4uuhECP+Z/Yk1VSEdzEBow=
=GCSM
-----END PGP SIGNATURE-----
Merge tag 'for-linus-6.17-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux
Pull orangefs updates from Mike Marshall:
"Fixes for string handling in debugfs and sysfs:
- change scnprintf to sysfs_emit in sysfs code.
- change sprintf to scnprintf in debugfs code.
- refactor debugfs mask-to-string code for readability and slightly
improved functionality"
* tag 'for-linus-6.17-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux:
fs/orangefs: Allow 2 more characters in do_c_string()
fs: orangefs: replace scnprintf() with sysfs_emit()
fs/orangefs: use snprintf() instead of sprintf()
UBIFS:
- No longer use write_cache_pages()
UBI:
- Removal of an unused function
-----BEGIN PGP SIGNATURE-----
iQJKBAABCAA0FiEEdgfidid8lnn52cLTZvlZhesYu8EFAmiJJ9UWHHJpY2hhcmRA
c2lnbWEtc3Rhci5hdAAKCRBm+VmF6xi7wf4gEACYUeCKRPJOc6qoCnhsqONKymE/
jVXxCWOBqpoyF0wtt0BXcS46FpXzKufb2U2BHL6j2/v77qXBYv/KenT1PJe9llXH
CNoGQb/zgr7cGFULetE12iaTJYxcIyaWfCk1pH7+n8JWeLwTQE66P1PdK5uEzHEp
5gtDyOic8hSIShLBC01/Td9FVRD3A79DhZUP2WDYxsiZGbA93IvE0RpV2x6iBnOG
uOzcHAZfnnlO22GoCqcQe5xaqOo3f4mJch6StEDLAs7EN1jvZMSH4wjM6Sbl+R4Z
3rzLE0LwqItVZCA9yBsZWZLvaGWhS44UH6HXiluTh0a6NzSoyOAg2U+4vU3frfbY
Kbnum9Rujc3NaVyHHHKmpDWEglj1bk8oN0EXVIOx887JkLjJF0FkSlo+gXp+cA+f
C0DpsSdJp6XhJsD60dp4gnas8stISmEN0xs/qgICXMo2WByIZlLtsOgBNwfocD8t
PtckkemupulEnSb9GJ/niAJttACSQFI9DCfnZibj3PYu3OeubSD30mJs/SY6rXOR
XS9jejSKxdyhyboL9PwuSPVbD2rj/uD/uL8R2ypGbrwNyQGvqUgBQUQjnhNDiOvt
CHwCtkmoupWxJscmu34K91zrN19i/9Nw4g+/2SOBz6NvVmIknTtuBQfBL57f2ykq
tZlNwdV/e/a2WCx7dg==
=/AwZ
-----END PGP SIGNATURE-----
Merge tag 'ubifs-for-linus-6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs
Pull UBI and UBIFS updates from Richard Weinberger:
"UBIFS:
- No longer use write_cache_pages()
UBI:
- Remove an unused function"
* tag 'ubifs-for-linus-6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs:
ubifs: stop using write_cache_pages
mtd: ubi: Remove unused ubi_flush
- Better scalability for ext4 block allocation
- Fix insufficient credits when writing back large folios
Miscellaneous bug fixes, especially when handling exteded attriutes,
inline data, and fast commit.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmiIQEoACgkQ8vlZVpUN
gaPB9wf/QursT7eLjx9Gz+4PYNWPKptBERQtmmDAnNYxDlEQ28+CHdMdEeiIPPoP
IW1DIHfR7VaTI2K7gy6D5632VAhDDKiXBpIYu1yh3KPClAxjTZbhrif8J5UBXj1K
ZwmCeLDF40jijua4rVKq3Fqf4iTJUyU2NqLpvcze7BZg7FwstXiNJrZ3DjAwi1BW
j/5veWwh/KrNMzT5u0+RpMs4FBrdXQXvwSe/4pSx6d75r6WAdzhgUMy09os1wAWU
3N0JU+R5hAG6iFfbWQRURB6oLMmmxl4x2F7r5BvM27uQtELNLNcxBKZhMW97HpiE
uSwKgo/59DKpWX0xQ2x/yugQIzd62w==
=oPHD
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus_6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"Major ext4 changes for 6.17:
- Better scalability for ext4 block allocation
- Fix insufficient credits when writing back large folios
Miscellaneous bug fixes, especially when handling exteded attriutes,
inline data, and fast commit"
* tag 'ext4_for_linus_6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (39 commits)
ext4: do not BUG when INLINE_DATA_FL lacks system.data xattr
ext4: implement linear-like traversal across order xarrays
ext4: refactor choose group to scan group
ext4: convert free groups order lists to xarrays
ext4: factor out ext4_mb_scan_group()
ext4: factor out ext4_mb_might_prefetch()
ext4: factor out __ext4_mb_scan_group()
ext4: fix largest free orders lists corruption on mb_optimize_scan switch
ext4: fix zombie groups in average fragment size lists
ext4: merge freed extent with existing extents before insertion
ext4: convert sbi->s_mb_free_pending to atomic_t
ext4: fix typo in CR_GOAL_LEN_SLOW comment
ext4: get rid of some obsolete EXT4_MB_HINT flags
ext4: utilize multiple global goals to reduce contention
ext4: remove unnecessary s_md_lock on update s_mb_last_group
ext4: remove unnecessary s_mb_last_start
ext4: separate stream goal hits from s_bal_goals for better tracking
ext4: add ext4_try_lock_group() to skip busy groups
ext4: initialize superblock fields in the kballoc-test.c kunit tests
ext4: refactor the inline directory conversion and new directory codepaths
...
When looking at performance issues around directory caching, or debugging
directory lease issues, it is helpful to be able to display the current
directory leases (as we can e.g. or open files). Create pseudo-file
/proc/fs/cifs/open_dirs that displays current directory leases. Here
is sample output:
cat /proc/fs/cifs/open_dirs
Version:1
Format:
<tree id> <sess id> <persistent fid> <path>
Num entries: 3
0xce4c1c68 0x7176aa54 0xd95ef58e \dira valid file info, valid dirents
0xce4c1c68 0x7176aa54 0xd031e211 \dir5 valid file info, valid dirents
0xce4c1c68 0x7176aa54 0x96533a90 \dir1 valid file info
Reviewed-by: Bharath SM <bharathsm@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Let's drop the inode from the donation list when there is no other
open file.
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmiINnEACgkQ6rmadz2v
bToBnA/9F+A3R6rTwGk4HK3xpfc/nm2Tanl3oRN7S2ub/mskDOtWSIyG6cVFZ0UG
1fK6IkByyRIpAF/5qhdlw8drRXHkQtGLA0lP2L9llm4X1mHLofB18y9OeLrDE1WN
KwNP06+IGX9W802lCGSIXOY+VmRscVfXSMokyQt2ilHplKjOnDqJcYkWupi3T2rC
mz79FY9aEl2YrIcpj9RXz+8nwP49pZBuW2P0IM5PAIj4BJBXShrUp8T1nz94okNe
NFsnAyRxjWpUT0McEgtA9WvpD9lZqujYD8Qp0KlGZWmI3vNpV5d9S1+dBcEb1n7q
dyNMkTF3oRrJhhg4VqoHc6fVpzSEoZ9ZxV5Hx4cs+ganH75D4YbdGqx/7mR3DUgH
MZh6rHF1pGnK7TAm7h5gl3ZRAOkZOaahbe1i01NKo9CEe5fSh3AqMyzJYoyGHRKi
xDN39eQdWBNA+hm1VkbK2Bv93Rbjrka2Kj+D3sSSO9Bo/u3ntcknr7LW39idKz62
Q8dkKHcCEtun7gjk0YXPF013y81nEohj1C+52BmJ2l5JitM57xfr6YOaQpu7DPDE
AJbHx6ASxKdyEETecd0b+cXUPQ349zmRXy0+CDMAGKpBicC0H0mHhL14cwOY1Hfu
EIpIjmIJGI3JNF6T5kybcQGSBOYebdV0FFgwSllzPvuYt7YsHCs=
=/O3j
-----END PGP SIGNATURE-----
Merge tag 'bpf-next-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Pull bpf updates from Alexei Starovoitov:
- Remove usermode driver (UMD) framework (Thomas Weißschuh)
- Introduce Strongly Connected Component (SCC) in the verifier to
detect loops and refine register liveness (Eduard Zingerman)
- Allow 'void *' cast using bpf_rdonly_cast() and corresponding
'__arg_untrusted' for global function parameters (Eduard Zingerman)
- Improve precision for BPF_ADD and BPF_SUB operations in the verifier
(Harishankar Vishwanathan)
- Teach the verifier that constant pointer to a map cannot be NULL
(Ihor Solodrai)
- Introduce BPF streams for error reporting of various conditions
detected by BPF runtime (Kumar Kartikeya Dwivedi)
- Teach the verifier to insert runtime speculation barrier (lfence on
x86) to mitigate speculative execution instead of rejecting the
programs (Luis Gerhorst)
- Various improvements for 'veristat' (Mykyta Yatsenko)
- For CONFIG_DEBUG_KERNEL config warn on internal verifier errors to
improve bug detection by syzbot (Paul Chaignon)
- Support BPF private stack on arm64 (Puranjay Mohan)
- Introduce bpf_cgroup_read_xattr() kfunc to read xattr of cgroup's
node (Song Liu)
- Introduce kfuncs for read-only string opreations (Viktor Malik)
- Implement show_fdinfo() for bpf_links (Tao Chen)
- Reduce verifier's stack consumption (Yonghong Song)
- Implement mprog API for cgroup-bpf programs (Yonghong Song)
* tag 'bpf-next-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (192 commits)
selftests/bpf: Migrate fexit_noreturns case into tracing_failure test suite
selftests/bpf: Add selftest for attaching tracing programs to functions in deny list
bpf: Add log for attaching tracing programs to functions in deny list
bpf: Show precise rejected function when attaching fexit/fmod_ret to __noreturn functions
bpf: Fix various typos in verifier.c comments
bpf: Add third round of bounds deduction
selftests/bpf: Test invariants on JSLT crossing sign
selftests/bpf: Test cross-sign 64bits range refinement
selftests/bpf: Update reg_bound range refinement logic
bpf: Improve bounds when s64 crosses sign boundary
bpf: Simplify bounds refinement from s32
selftests/bpf: Enable private stack tests for arm64
bpf, arm64: JIT support for private stack
bpf: Move bpf_jit_get_prog_name() to core.c
bpf, arm64: Fix fp initialization for exception boundary
umd: Remove usermode driver framework
bpf/preload: Don't select USERMODE_DRIVER
selftests/bpf: Fix test dynptr/test_dynptr_memset_xdp_chunks failure
selftests/bpf: Fix test dynptr/test_dynptr_copy_xdp failure
selftests/bpf: Increase xdp data size for arm64 64K page size
...
Core & protocols
----------------
- Wrap datapath globals into net_aligned_data, to avoid false sharing.
- Preserve MSG_ZEROCOPY in forwarding (e.g. out of a container).
- Add SO_INQ and SCM_INQ support to AF_UNIX.
- Add SIOCINQ support to AF_VSOCK.
- Add TCP_MAXSEG sockopt to MPTCP.
- Add IPv6 force_forwarding sysctl to enable forwarding per interface.
- Make TCP validation of whether packet fully fits in the receive
window and the rcv_buf more strict. With increased use of HW
aggregation a single "packet" can be multiple 100s of kB.
- Add MSG_MORE flag to optimize large TCP transmissions via sockmap,
improves latency up to 33% for sockmap users.
- Convert TCP send queue handling from tasklet to BH workque.
- Improve BPF iteration over TCP sockets to see each socket exactly once.
- Remove obsolete and unused TCP RFC3517/RFC6675 loss recovery code.
- Support enabling kernel threads for NAPI processing on per-NAPI
instance basis rather than a whole device. Fully stop the kernel NAPI
thread when threaded NAPI gets disabled. Previously thread would stick
around until ifdown due to tricky synchronization.
- Allow multicast routing to take effect on locally-generated packets.
- Add output interface argument for End.X in segment routing.
- MCTP: add support for gateway routing, improve bind() handling.
- Don't require rtnl_lock when fetching an IPv6 neighbor over Netlink.
- Add a new neighbor flag ("extern_valid"), which cedes refresh
responsibilities to userspace. This is needed for EVPN multi-homing
where a neighbor entry for a multi-homed host needs to be synced
across all the VTEPs among which the host is multi-homed.
- Support NUD_PERMANENT for proxy neighbor entries.
- Add a new queuing discipline for IETF RFC9332 DualQ Coupled AQM.
- Add sequence numbers to netconsole messages. Unregister netconsole's
console when all net targets are removed. Code refactoring.
Add a number of selftests.
- Align IPSec inbound SA lookup to RFC 4301. Only SPI and protocol
should be used for an inbound SA lookup.
- Support inspecting ref_tracker state via DebugFS.
- Don't force bonding advertisement frames tx to ~333 ms boundaries.
Add broadcast_neighbor option to send ARP/ND on all bonded links.
- Allow providing upcall pid for the 'execute' command in openvswitch.
- Remove DCCP support from Netfilter's conntrack.
- Disallow multiple packet duplications in the queuing layer.
- Prevent use of deprecated iptables code on PREEMPT_RT.
Driver API
----------
- Support RSS and hashing configuration over ethtool Netlink.
- Add dedicated ethtool callbacks for getting and setting hashing fields.
- Add support for power budget evaluation strategy in PSE /
Power-over-Ethernet. Generate Netlink events for overcurrent etc.
- Support DPLL phase offset monitoring across all device inputs.
Support providing clock reference and SYNC over separate DPLL
inputs.
- Support traffic classes in devlink rate API for bandwidth management.
- Remove rtnl_lock dependency from UDP tunnel port configuration.
Device drivers
--------------
- Add a new Broadcom driver for 800G Ethernet (bnge).
- Add a standalone driver for Microchip ZL3073x DPLL.
- Remove IBM's NETIUCV device driver.
- Ethernet high-speed NICs:
- Broadcom (bnxt):
- support zero-copy Tx of DMABUF memory
- take page size into account for page pool recycling rings
- Intel (100G, ice, idpf):
- idpf: XDP and AF_XDP support preparations
- idpf: add flow steering
- add link_down_events statistic
- clean up the TSPLL code
- preparations for live VM migration
- nVidia/Mellanox:
- support zero-copy Rx/Tx interfaces (DMABUF and io_uring)
- optimize context memory usage for matchers
- expose serial numbers in devlink info
- support PCIe congestion metrics
- Meta (fbnic):
- add 25G, 50G, and 100G link modes to phylink
- support dumping FW logs
- Marvell/Cavium:
- support for CN20K generation of the Octeon chips
- Amazon:
- add HW clock (without timestamping, just hypervisor time access)
- Ethernet virtual:
- VirtIO net:
- support segmentation of UDP-tunnel-encapsulated packets
- Google (gve):
- support packet timestamping and clock synchronization
- Microsoft vNIC:
- add handler for device-originated servicing events
- allow dynamic MSI-X vector allocation
- support Tx bandwidth clamping
- Ethernet NICs consumer, and embedded:
- AMD:
- amd-xgbe: hardware timestamping and PTP clock support
- Broadcom integrated MACs (bcmgenet, bcmasp):
- use napi_complete_done() return value to support NAPI polling
- add support for re-starting auto-negotiation
- Broadcom switches (b53):
- support BCM5325 switches
- add bcm63xx EPHY power control
- Synopsys (stmmac):
- lots of code refactoring and cleanups
- TI:
- icssg-prueth: read firmware-names from device tree
- icssg: PRP offload support
- Microchip:
- lan78xx: convert to PHYLINK for improved PHY and MAC management
- ksz: add KSZ8463 switch support
- Intel:
- support similar queue priority scheme in multi-queue and
time-sensitive networking (taprio)
- support packet pre-emption in both
- RealTek (r8169):
- enable EEE at 5Gbps on RTL8126
- Airoha:
- add PPPoE offload support
- MDIO bus controller for Airoha AN7583
- Ethernet PHYs:
- support for the IPQ5018 internal GE PHY
- micrel KSZ9477 switch-integrated PHYs:
- add MDI/MDI-X control support
- add RX error counters
- add cable test support
- add Signal Quality Indicator (SQI) reporting
- dp83tg720: improve reset handling and reduce link recovery time
- support bcm54811 (and its MII-Lite interface type)
- air_en8811h: support resume/suspend
- support PHY counters for QCA807x and QCA808x
- support WoL for QCA807x
- CAN drivers:
- rcar_canfd: support for Transceiver Delay Compensation
- kvaser: report FW versions via devlink dev info
- WiFi:
- extended regulatory info support (6 GHz)
- add statistics and beacon monitor for Multi-Link Operation (MLO)
- support S1G aggregation, improve S1G support
- add Radio Measurement action fields
- support per-radio RTS threshold
- some work around how FIPS affects wifi, which was wrong (RC4 is used
by TKIP, not only WEP)
- improvements for unsolicited probe response handling
- WiFi drivers:
- RealTek (rtw88):
- IBSS mode for SDIO devices
- RealTek (rtw89):
- BT coexistence for MLO/WiFi7
- concurrent station + P2P support
- support for USB devices RTL8851BU/RTL8852BU
- Intel (iwlwifi):
- use embedded PNVM in (to be released) FW images to fix
compatibility issues
- many cleanups (unused FW APIs, PCIe code, WoWLAN)
- some FIPS interoperability
- MediaTek (mt76):
- firmware recovery improvements
- more MLO work
- Qualcomm/Atheros (ath12k):
- fix scan on multi-radio devices
- more EHT/Wi-Fi 7 features
- encapsulation/decapsulation offload
- Broadcom (brcm80211):
- support SDIO 43751 device
- Bluetooth:
- hci_event: add support for handling LE BIG Sync Lost event
- ISO: add socket option to report packet seqnum via CMSG
- ISO: support SCM_TIMESTAMPING for ISO TS
- Bluetooth drivers:
- intel_pcie: support Function Level Reset
- nxpuart: add support for 4M baudrate
- nxpuart: implement powerup sequence, reset, FW dump, and FW loading
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmiFgLgACgkQMUZtbf5S
IrvafxAAnQRwYBoIG+piCILx6z5pRvBGHkmEQ4AQgSCFuq2eO3ubwMFIqEybfma1
5+QFjUZAV3OgGgKRBS2KGWxtSzdiF+/JGV1VOIN67sX3Mm0a2QgjA4n5CgKL0FPr
o6BEzjX5XwG1zvGcBNQ5BZ19xUUKjoZQgTtnea8sZ57Fsp5RtRgmYRqoewNvNk/n
uImh0NFsDVb0UeOpSzC34VD9l1dJvLGdui4zJAjno/vpvmT1DkXjoK419J/r52SS
X+5WgsfJ6DkjHqVN1tIhhK34yWqBOcwGFZJgEnWHMkFIl2FqRfFKMHyqtfLlVnLA
mnIpSyz8Sq2AHtx0TlgZ3At/Ri8p5+yYJgHOXcDKyABa8y8Zf4wrycmr6cV9JLuL
z54nLEVnJuvfDVDVJjsLYdJXyhMpZFq6+uAItdxKaw8Ugp/QqG4QtoRj+XIHz4ZW
z6OohkCiCzTwEISFK+pSTxPS30eOxq43kCspcvuLiwCCStJBRkRb5GdZA4dm7LA+
1Od4ADAkHjyrFtBqTyyC2scX8UJ33DlAIpAYyIeS6w9Cj9EXxtp1z33IAAAZ03MW
jJwIaJuc8bK2fWKMmiG7ucIXjPo4t//KiWlpkwwqLhPbjZgfDAcxq1AC2TLoqHBL
y4EOgKpHDCMAghSyiFIAn2JprGcEt8dp+11B0JRXIn4Pm/eYDH8=
=lqbe
-----END PGP SIGNATURE-----
Merge tag 'net-next-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
"Core & protocols:
- Wrap datapath globals into net_aligned_data, to avoid false sharing
- Preserve MSG_ZEROCOPY in forwarding (e.g. out of a container)
- Add SO_INQ and SCM_INQ support to AF_UNIX
- Add SIOCINQ support to AF_VSOCK
- Add TCP_MAXSEG sockopt to MPTCP
- Add IPv6 force_forwarding sysctl to enable forwarding per interface
- Make TCP validation of whether packet fully fits in the receive
window and the rcv_buf more strict. With increased use of HW
aggregation a single "packet" can be multiple 100s of kB
- Add MSG_MORE flag to optimize large TCP transmissions via sockmap,
improves latency up to 33% for sockmap users
- Convert TCP send queue handling from tasklet to BH workque
- Improve BPF iteration over TCP sockets to see each socket exactly
once
- Remove obsolete and unused TCP RFC3517/RFC6675 loss recovery code
- Support enabling kernel threads for NAPI processing on per-NAPI
instance basis rather than a whole device. Fully stop the kernel
NAPI thread when threaded NAPI gets disabled. Previously thread
would stick around until ifdown due to tricky synchronization
- Allow multicast routing to take effect on locally-generated packets
- Add output interface argument for End.X in segment routing
- MCTP: add support for gateway routing, improve bind() handling
- Don't require rtnl_lock when fetching an IPv6 neighbor over Netlink
- Add a new neighbor flag ("extern_valid"), which cedes refresh
responsibilities to userspace. This is needed for EVPN multi-homing
where a neighbor entry for a multi-homed host needs to be synced
across all the VTEPs among which the host is multi-homed
- Support NUD_PERMANENT for proxy neighbor entries
- Add a new queuing discipline for IETF RFC9332 DualQ Coupled AQM
- Add sequence numbers to netconsole messages. Unregister
netconsole's console when all net targets are removed. Code
refactoring. Add a number of selftests
- Align IPSec inbound SA lookup to RFC 4301. Only SPI and protocol
should be used for an inbound SA lookup
- Support inspecting ref_tracker state via DebugFS
- Don't force bonding advertisement frames tx to ~333 ms boundaries.
Add broadcast_neighbor option to send ARP/ND on all bonded links
- Allow providing upcall pid for the 'execute' command in openvswitch
- Remove DCCP support from Netfilter's conntrack
- Disallow multiple packet duplications in the queuing layer
- Prevent use of deprecated iptables code on PREEMPT_RT
Driver API:
- Support RSS and hashing configuration over ethtool Netlink
- Add dedicated ethtool callbacks for getting and setting hashing
fields
- Add support for power budget evaluation strategy in PSE /
Power-over-Ethernet. Generate Netlink events for overcurrent etc
- Support DPLL phase offset monitoring across all device inputs.
Support providing clock reference and SYNC over separate DPLL
inputs
- Support traffic classes in devlink rate API for bandwidth
management
- Remove rtnl_lock dependency from UDP tunnel port configuration
Device drivers:
- Add a new Broadcom driver for 800G Ethernet (bnge)
- Add a standalone driver for Microchip ZL3073x DPLL
- Remove IBM's NETIUCV device driver
- Ethernet high-speed NICs:
- Broadcom (bnxt):
- support zero-copy Tx of DMABUF memory
- take page size into account for page pool recycling rings
- Intel (100G, ice, idpf):
- idpf: XDP and AF_XDP support preparations
- idpf: add flow steering
- add link_down_events statistic
- clean up the TSPLL code
- preparations for live VM migration
- nVidia/Mellanox:
- support zero-copy Rx/Tx interfaces (DMABUF and io_uring)
- optimize context memory usage for matchers
- expose serial numbers in devlink info
- support PCIe congestion metrics
- Meta (fbnic):
- add 25G, 50G, and 100G link modes to phylink
- support dumping FW logs
- Marvell/Cavium:
- support for CN20K generation of the Octeon chips
- Amazon:
- add HW clock (without timestamping, just hypervisor time access)
- Ethernet virtual:
- VirtIO net:
- support segmentation of UDP-tunnel-encapsulated packets
- Google (gve):
- support packet timestamping and clock synchronization
- Microsoft vNIC:
- add handler for device-originated servicing events
- allow dynamic MSI-X vector allocation
- support Tx bandwidth clamping
- Ethernet NICs consumer, and embedded:
- AMD:
- amd-xgbe: hardware timestamping and PTP clock support
- Broadcom integrated MACs (bcmgenet, bcmasp):
- use napi_complete_done() return value to support NAPI polling
- add support for re-starting auto-negotiation
- Broadcom switches (b53):
- support BCM5325 switches
- add bcm63xx EPHY power control
- Synopsys (stmmac):
- lots of code refactoring and cleanups
- TI:
- icssg-prueth: read firmware-names from device tree
- icssg: PRP offload support
- Microchip:
- lan78xx: convert to PHYLINK for improved PHY and MAC management
- ksz: add KSZ8463 switch support
- Intel:
- support similar queue priority scheme in multi-queue and
time-sensitive networking (taprio)
- support packet pre-emption in both
- RealTek (r8169):
- enable EEE at 5Gbps on RTL8126
- Airoha:
- add PPPoE offload support
- MDIO bus controller for Airoha AN7583
- Ethernet PHYs:
- support for the IPQ5018 internal GE PHY
- micrel KSZ9477 switch-integrated PHYs:
- add MDI/MDI-X control support
- add RX error counters
- add cable test support
- add Signal Quality Indicator (SQI) reporting
- dp83tg720: improve reset handling and reduce link recovery time
- support bcm54811 (and its MII-Lite interface type)
- air_en8811h: support resume/suspend
- support PHY counters for QCA807x and QCA808x
- support WoL for QCA807x
- CAN drivers:
- rcar_canfd: support for Transceiver Delay Compensation
- kvaser: report FW versions via devlink dev info
- WiFi:
- extended regulatory info support (6 GHz)
- add statistics and beacon monitor for Multi-Link Operation (MLO)
- support S1G aggregation, improve S1G support
- add Radio Measurement action fields
- support per-radio RTS threshold
- some work around how FIPS affects wifi, which was wrong (RC4 is
used by TKIP, not only WEP)
- improvements for unsolicited probe response handling
- WiFi drivers:
- RealTek (rtw88):
- IBSS mode for SDIO devices
- RealTek (rtw89):
- BT coexistence for MLO/WiFi7
- concurrent station + P2P support
- support for USB devices RTL8851BU/RTL8852BU
- Intel (iwlwifi):
- use embedded PNVM in (to be released) FW images to fix
compatibility issues
- many cleanups (unused FW APIs, PCIe code, WoWLAN)
- some FIPS interoperability
- MediaTek (mt76):
- firmware recovery improvements
- more MLO work
- Qualcomm/Atheros (ath12k):
- fix scan on multi-radio devices
- more EHT/Wi-Fi 7 features
- encapsulation/decapsulation offload
- Broadcom (brcm80211):
- support SDIO 43751 device
- Bluetooth:
- hci_event: add support for handling LE BIG Sync Lost event
- ISO: add socket option to report packet seqnum via CMSG
- ISO: support SCM_TIMESTAMPING for ISO TS
- Bluetooth drivers:
- intel_pcie: support Function Level Reset
- nxpuart: add support for 4M baudrate
- nxpuart: implement powerup sequence, reset, FW dump, and FW loading"
* tag 'net-next-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1742 commits)
dpll: zl3073x: Fix build failure
selftests: bpf: fix legacy netfilter options
ipv6: annotate data-races around rt->fib6_nsiblings
ipv6: fix possible infinite loop in fib6_info_uses_dev()
ipv6: prevent infinite loop in rt6_nlmsg_size()
ipv6: add a retry logic in net6_rt_notify()
vrf: Drop existing dst reference in vrf_ip6_input_dst
net/sched: taprio: align entry index attr validation with mqprio
net: fsl_pq_mdio: use dev_err_probe
selftests: rtnetlink.sh: remove esp4_offload after test
vsock: remove unnecessary null check in vsock_getname()
igb: xsk: solve negative overflow of nb_pkts in zerocopy mode
stmmac: xsk: fix negative overflow of budget in zerocopy mode
dt-bindings: ieee802154: Convert at86rf230.txt yaml format
net: dsa: microchip: Disable PTP function of KSZ8463
net: dsa: microchip: Setup fiber ports for KSZ8463
net: dsa: microchip: Write switch MAC address differently for KSZ8463
net: dsa: microchip: Use different registers for KSZ8463
net: dsa: microchip: Add KSZ8463 switch support to KSZ DSA driver
dt-bindings: net: dsa: microchip: Add KSZ8463 switch support
...
- DEBUGFS
- Remove unneeded debugfs_file_{get,put}() instances
- Remove last remnants of debugfs_real_fops()
- Allow storing non-const void * in struct debugfs_inode_info::aux
- SYSFS
- Switch back to attribute_group::bin_attrs (treewide)
- Switch back to bin_attribute::read()/write() (treewide)
- Constify internal references to 'struct bin_attribute'
- Support cache-ids for device-tree systems
- Add arch hook arch_compact_of_hwid()
- Use arch_compact_of_hwid() to compact MPIDR values on arm64
- Rust
- Device
- Introduce CoreInternal device context (for bus internal methods)
- Provide generic drvdata accessors for bus devices
- Provide Driver::unbind() callbacks
- Use the infrastructure above for auxiliary, PCI and platform
- Implement Device::as_bound()
- Rename Device::as_ref() to Device::from_raw() (treewide)
- Implement fwnode and device property abstractions
- Implement example usage in the Rust platform sample driver
- Devres
- Remove the inner reference count (Arc) and use pin-init instead
- Replace Devres::new_foreign_owned() with devres::register()
- Require T to be Send in Devres<T>
- Initialize the data kept inside a Devres last
- Provide an accessor for the Devres associated Device
- Device ID
- Add support for ACPI device IDs and driver match tables
- Split up generic device ID infrastructure
- Use generic device ID infrastructure in net::phy
- DMA
- Implement the dma::Device trait
- Add DMA mask accessors to dma::Device
- Implement dma::Device for PCI and platform devices
- Use DMA masks from the DMA sample module
- I/O
- Implement abstraction for resource regions (struct resource)
- Implement resource-based ioremap() abstractions
- Provide platform device accessors for I/O (remap) requests
- Misc
- Support fallible PinInit types in Revocable
- Implement Wrapper<T> for Opaque<T>
- Merge pin-init blanket dependencies (for Devres)
- Misc
- Fix OF node leak in auxiliary_device_create()
- Use util macros in device property iterators
- Improve kobject sample code
- Add device_link_test() for testing device link flags
- Fix typo in Documentation/ABI/testing/sysfs-kernel-address_bits
- Hint to prefer container_of_const() over container_of()
-----BEGIN PGP SIGNATURE-----
iHQEABYKAB0WIQS2q/xV6QjXAdC7k+1FlHeO1qrKLgUCaIjkhwAKCRBFlHeO1qrK
LpXuAP9RWwfD9ZGgQZ9OsMk/0pZ2mDclaK97jcmI9TAeSxeZMgD1FHnOMTY7oSIi
iG7Muq0yLD+A5gk9HUnMUnFNrngWCg==
=jgRj
-----END PGP SIGNATURE-----
Merge tag 'driver-core-6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core
Pull driver core updates from Danilo Krummrich:
"debugfs:
- Remove unneeded debugfs_file_{get,put}() instances
- Remove last remnants of debugfs_real_fops()
- Allow storing non-const void * in struct debugfs_inode_info::aux
sysfs:
- Switch back to attribute_group::bin_attrs (treewide)
- Switch back to bin_attribute::read()/write() (treewide)
- Constify internal references to 'struct bin_attribute'
Support cache-ids for device-tree systems:
- Add arch hook arch_compact_of_hwid()
- Use arch_compact_of_hwid() to compact MPIDR values on arm64
Rust:
- Device:
- Introduce CoreInternal device context (for bus internal methods)
- Provide generic drvdata accessors for bus devices
- Provide Driver::unbind() callbacks
- Use the infrastructure above for auxiliary, PCI and platform
- Implement Device::as_bound()
- Rename Device::as_ref() to Device::from_raw() (treewide)
- Implement fwnode and device property abstractions
- Implement example usage in the Rust platform sample driver
- Devres:
- Remove the inner reference count (Arc) and use pin-init instead
- Replace Devres::new_foreign_owned() with devres::register()
- Require T to be Send in Devres<T>
- Initialize the data kept inside a Devres last
- Provide an accessor for the Devres associated Device
- Device ID:
- Add support for ACPI device IDs and driver match tables
- Split up generic device ID infrastructure
- Use generic device ID infrastructure in net::phy
- DMA:
- Implement the dma::Device trait
- Add DMA mask accessors to dma::Device
- Implement dma::Device for PCI and platform devices
- Use DMA masks from the DMA sample module
- I/O:
- Implement abstraction for resource regions (struct resource)
- Implement resource-based ioremap() abstractions
- Provide platform device accessors for I/O (remap) requests
- Misc:
- Support fallible PinInit types in Revocable
- Implement Wrapper<T> for Opaque<T>
- Merge pin-init blanket dependencies (for Devres)
Misc:
- Fix OF node leak in auxiliary_device_create()
- Use util macros in device property iterators
- Improve kobject sample code
- Add device_link_test() for testing device link flags
- Fix typo in Documentation/ABI/testing/sysfs-kernel-address_bits
- Hint to prefer container_of_const() over container_of()"
* tag 'driver-core-6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core: (84 commits)
rust: io: fix broken intra-doc links to `platform::Device`
rust: io: fix broken intra-doc link to missing `flags` module
rust: io: mem: enable IoRequest doc-tests
rust: platform: add resource accessors
rust: io: mem: add a generic iomem abstraction
rust: io: add resource abstraction
rust: samples: dma: set DMA mask
rust: platform: implement the `dma::Device` trait
rust: pci: implement the `dma::Device` trait
rust: dma: add DMA addressing capabilities
rust: dma: implement `dma::Device` trait
rust: net::phy Change module_phy_driver macro to use module_device_table macro
rust: net::phy represent DeviceId as transparent wrapper over mdio_device_id
rust: device_id: split out index support into a separate trait
device: rust: rename Device::as_ref() to Device::from_raw()
arm64: cacheinfo: Provide helper to compress MPIDR value into u32
cacheinfo: Add arch hook to compress CPU h/w id into 32 bits for cache-id
cacheinfo: Set cache 'id' based on DT data
container_of: Document container_of() is not to be used in new code
driver core: auxiliary bus: fix OF node leak
...
Add this to control GC algorithm for boost GC.
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Add a sysfs knob to set a multiplier for the background GC migration
window when F2FS Garbage Collection is boosted.
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In dbAllocCtl(), read_metapage() increases the reference count of the
metapage. However, when dp->tree.budmin < 0, the function returns -EIO
without calling release_metapage() to decrease the reference count,
leading to a memory leak.
Add release_metapage(mp) before the error return to properly manage
the metapage reference count and prevent the leak.
Fixes: a5f5e4698f ("jfs: fix shift-out-of-bounds in dbSplit")
Signed-off-by: Zheng Yu <zheng.yu@northwestern.edu>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Simplify how fscrypt uses the crypto API, resulting in some
significant performance improvements:
- Drop the incomplete and problematic support for asynchronous
algorithms. These drivers are bug-prone, and it turns out they are
actually much slower than the CPU-based code as well.
- Allocate crypto requests on the stack instead of the heap. This
improves encryption and decryption performance, especially for
filenames. It also eliminates a point of failure during I/O.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCaIZ+fhQcZWJpZ2dlcnNA
a2VybmVsLm9yZwAKCRDzXCl4vpKOK+dkAQDrHUTj9dGZI/cQ/TjP0kmOv9XfYAfj
HOQDRikTX+Ip4QEA6L8FS8lJYf9EMznTvTPOkP7hXpwqzuf00vJWr+ySmQs=
=N9vo
-----END PGP SIGNATURE-----
Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/linux
Pull fscrypt updates from Eric Biggers:
"Simplify how fscrypt uses the crypto API, resulting in some
significant performance improvements:
- Drop the incomplete and problematic support for asynchronous
algorithms. These drivers are bug-prone, and it turns out they are
actually much slower than the CPU-based code as well.
- Allocate crypto requests on the stack instead of the heap. This
improves encryption and decryption performance, especially for
filenames. This also eliminates a point of failure during I/O"
* tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/linux:
ceph: Remove gfp_t argument from ceph_fscrypt_encrypt_*()
fscrypt: Remove gfp_t argument from fscrypt_encrypt_block_inplace()
fscrypt: Remove gfp_t argument from fscrypt_crypt_data_unit()
fscrypt: Switch to sync_skcipher and on-stack requests
fscrypt: Drop FORBID_WEAK_KEYS flag for AES-ECB
fscrypt: Don't use asynchronous CryptoAPI algorithms
fscrypt: Don't use problematic non-inline crypto engines
fscrypt: Drop obsolete recommendation to enable optimized SHA-512
fscrypt: Explicitly include <linux/export.h>
Convert fsverity and apparmor to use the SHA-2 library functions
instead of crypto_shash. This is simpler and also slightly faster.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCaIZ+MhQcZWJpZ2dlcnNA
a2VybmVsLm9yZwAKCRDzXCl4vpKOK6LlAP42XrHo9hMqCnVs7uCtDVOk2kBUB7ml
m7rkVqQLiEbgAAEAiFvQYPjiaUedTi+bH45+qP/O/LxY7y5laIby7QKa1wc=
=X4Xs
-----END PGP SIGNATURE-----
Merge tag 'libcrypto-conversions-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux
Pull crypto library conversions from Eric Biggers:
"Convert fsverity and apparmor to use the SHA-2 library functions
instead of crypto_shash. This is simpler and also slightly faster"
* tag 'libcrypto-conversions-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux:
fsverity: Switch from crypto_shash to SHA-2 library
fsverity: Explicitly include <linux/export.h>
apparmor: use SHA-256 library API instead of crypto_shash API
Updates for the kernel's CRC (cyclic redundancy check) code:
- Reorganize the architecture-optimized CRC code. It now lives in
lib/crc/$(SRCARCH)/ rather than arch/$(SRCARCH)/lib/, and it is no
longer artificially split into separate generic and arch modules.
This allows better inlining and dead code elimination. The generic
CRC code is also no longer exported, simplifying the API. (This
mirrors the similar changes to SHA-1 and SHA-2 in lib/crypto/,
which can be found in the "Crypto library updates" pull request.)
- Improve crc32c() performance on newer x86_64 CPUs on long messages
by enabling the VPCLMULQDQ optimized code.
- Simplify the crypto_shash wrappers for crc32_le() and crc32c().
Register just one shash algorithm for each that uses the (fully
optimized) library functions, instead of unnecessarily providing
direct access to the generic CRC code.
- Remove unused and obsolete drivers for hardware CRC engines.
- Remove CRC-32 combination functions that are no longer used.
- Add kerneldoc for crc32_le(), crc32_be(), and crc32c().
- Convert the crc32() macro to an inline function.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCaIZ8rRQcZWJpZ2dlcnNA
a2VybmVsLm9yZwAKCRDzXCl4vpKOK3yOAP9OuoCirD42ZHNSgQeGTzhhZ2jCHiPN
BPvHChwtE2MSRwEA0ddNX36aOiEKmpjog3TMllOIBz7wBrwZV7KgoX75+AU=
=uAY8
-----END PGP SIGNATURE-----
Merge tag 'crc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux
Pull CRC updates from Eric Biggers:
- Reorganize the architecture-optimized CRC code
It now lives in lib/crc/$(SRCARCH)/ rather than arch/$(SRCARCH)/lib/,
and it is no longer artificially split into separate generic and arch
modules. This allows better inlining and dead code elimination
The generic CRC code is also no longer exported, simplifying the API.
(This mirrors the similar changes to SHA-1 and SHA-2 in lib/crypto/,
which can be found in the "Crypto library updates" pull request)
- Improve crc32c() performance on newer x86_64 CPUs on long messages by
enabling the VPCLMULQDQ optimized code
- Simplify the crypto_shash wrappers for crc32_le() and crc32c()
Register just one shash algorithm for each that uses the (fully
optimized) library functions, instead of unnecessarily providing
direct access to the generic CRC code
- Remove unused and obsolete drivers for hardware CRC engines
- Remove CRC-32 combination functions that are no longer used
- Add kerneldoc for crc32_le(), crc32_be(), and crc32c()
- Convert the crc32() macro to an inline function
* tag 'crc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux: (26 commits)
lib/crc: x86/crc32c: Enable VPCLMULQDQ optimization where beneficial
lib/crc: x86: Reorganize crc-pclmul static_call initialization
lib/crc: crc64: Add include/linux/crc64.h to kernel-api.rst
lib/crc: crc32: Change crc32() from macro to inline function and remove cast
nvmem: layouts: Switch from crc32() to crc32_le()
lib/crc: crc32: Document crc32_le(), crc32_be(), and crc32c()
lib/crc: Explicitly include <linux/export.h>
lib/crc: Remove ARCH_HAS_* kconfig symbols
lib/crc: x86: Migrate optimized CRC code into lib/crc/
lib/crc: sparc: Migrate optimized CRC code into lib/crc/
lib/crc: s390: Migrate optimized CRC code into lib/crc/
lib/crc: riscv: Migrate optimized CRC code into lib/crc/
lib/crc: powerpc: Migrate optimized CRC code into lib/crc/
lib/crc: mips: Migrate optimized CRC code into lib/crc/
lib/crc: loongarch: Migrate optimized CRC code into lib/crc/
lib/crc: arm64: Migrate optimized CRC code into lib/crc/
lib/crc: arm: Migrate optimized CRC code into lib/crc/
lib/crc: Prepare for arch-optimized code in subdirs of lib/crc/
lib/crc: Move files into lib/crc/
lib/crc32: Remove unused combination support
...
- Introduce and start using TRAILING_OVERLAP() helper for fixing
embedded flex array instances (Gustavo A. R. Silva)
- mux: Convert mux_control_ops to a flex array member in mux_chip
(Thorsten Blum)
- string: Group str_has_prefix() and strstarts() (Andy Shevchenko)
- Remove KCOV instrumentation from __init and __head (Ritesh Harjani,
Kees Cook)
- Refactor and rename stackleak feature to support Clang
- Add KUnit test for seq_buf API
- Fix KUnit fortify test under LTO
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRSPkdeREjth1dHnSE2KwveOeQkuwUCaIfUkgAKCRA2KwveOeQk
uypLAP92r6f47sWcOw/5B9aVffX6Bypsb7dqBJQpCNxI5U1xcAEAiCrZ98UJyOeQ
JQgnXd4N67K4EsS2JDc+FutRn3Yi+A8=
=+5Bq
-----END PGP SIGNATURE-----
Merge tag 'hardening-v6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull hardening updates from Kees Cook:
- Introduce and start using TRAILING_OVERLAP() helper for fixing
embedded flex array instances (Gustavo A. R. Silva)
- mux: Convert mux_control_ops to a flex array member in mux_chip
(Thorsten Blum)
- string: Group str_has_prefix() and strstarts() (Andy Shevchenko)
- Remove KCOV instrumentation from __init and __head (Ritesh Harjani,
Kees Cook)
- Refactor and rename stackleak feature to support Clang
- Add KUnit test for seq_buf API
- Fix KUnit fortify test under LTO
* tag 'hardening-v6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (22 commits)
sched/task_stack: Add missing const qualifier to end_of_stack()
kstack_erase: Support Clang stack depth tracking
kstack_erase: Add -mgeneral-regs-only to silence Clang warnings
init.h: Disable sanitizer coverage for __init and __head
kstack_erase: Disable kstack_erase for all of arm compressed boot code
x86: Handle KCOV __init vs inline mismatches
arm64: Handle KCOV __init vs inline mismatches
s390: Handle KCOV __init vs inline mismatches
arm: Handle KCOV __init vs inline mismatches
mips: Handle KCOV __init vs inline mismatch
powerpc/mm/book3s64: Move kfence and debug_pagealloc related calls to __init section
configs/hardening: Enable CONFIG_INIT_ON_FREE_DEFAULT_ON
configs/hardening: Enable CONFIG_KSTACK_ERASE
stackleak: Split KSTACK_ERASE_CFLAGS from GCC_PLUGINS_CFLAGS
stackleak: Rename stackleak_track_stack to __sanitizer_cov_stack_depth
stackleak: Rename STACKLEAK to KSTACK_ERASE
seq_buf: Introduce KUnit tests
string: Group str_has_prefix() and strstarts()
kunit/fortify: Add back "volatile" for sizeof() constants
acpi: nfit: intel: avoid multiple -Wflex-array-member-not-at-end warnings
...
- Introduce regular REGSET note macros arch-wide (Dave Martin)
- Remove arbitrary 4K limitation of program header size (Yin Fengwei)
- Reorder function qualifiers for copy_clone_args_from_user() (Dishank Jogi)
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRSPkdeREjth1dHnSE2KwveOeQkuwUCaIVKiAAKCRA2KwveOeQk
u4zBAP4zUNj2+XyixVPXCzv+Hkle6zWs7yrzdA2yLxe8Qtwj5AD+N2I6MUGcCFGW
W+uWxlWTtGLDqh1CplIUqTlxMi39Og4=
=vYnE
-----END PGP SIGNATURE-----
Merge tag 'execve-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull execve updates from Kees Cook:
- Introduce regular REGSET note macros arch-wide (Dave Martin)
- Remove arbitrary 4K limitation of program header size (Yin Fengwei)
- Reorder function qualifiers for copy_clone_args_from_user() (Dishank Jogi)
* tag 'execve-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (25 commits)
fork: reorder function qualifiers for copy_clone_args_from_user
binfmt_elf: remove the 4k limitation of program header size
binfmt_elf: Warn on missing or suspicious regset note names
xtensa: ptrace: Use USER_REGSET_NOTE_TYPE() to specify regset note names
um: ptrace: Use USER_REGSET_NOTE_TYPE() to specify regset note names
x86/ptrace: Use USER_REGSET_NOTE_TYPE() to specify regset note names
sparc: ptrace: Use USER_REGSET_NOTE_TYPE() to specify regset note names
sh: ptrace: Use USER_REGSET_NOTE_TYPE() to specify regset note names
s390/ptrace: Use USER_REGSET_NOTE_TYPE() to specify regset note names
riscv: ptrace: Use USER_REGSET_NOTE_TYPE() to specify regset note names
powerpc/ptrace: Use USER_REGSET_NOTE_TYPE() to specify regset note names
parisc: ptrace: Use USER_REGSET_NOTE_TYPE() to specify regset note names
openrisc: ptrace: Use USER_REGSET_NOTE_TYPE() to specify regset note names
nios2: ptrace: Use USER_REGSET_NOTE_TYPE() to specify regset note names
MIPS: ptrace: Use USER_REGSET_NOTE_TYPE() to specify regset note names
m68k: ptrace: Use USER_REGSET_NOTE_TYPE() to specify regset note names
LoongArch: ptrace: Use USER_REGSET_NOTE_TYPE() to specify regset note names
hexagon: ptrace: Use USER_REGSET_NOTE_TYPE() to specify regset note names
csky: ptrace: Use USER_REGSET_NOTE_TYPE() to specify regset note names
arm64: ptrace: Use USER_REGSET_NOTE_TYPE() to specify regset note names
...
- Use ZONEFS_SUPER_SIZE instead of PAGE_SIZE to read from disk the
super block (Johannes).
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQSRPv8tYSvhwAzJdzjdoc3SxdoYdgUCaIcLzAAKCRDdoc3SxdoY
du1PAQDZoyBI+zwM73r8vh26n1ftOEDbXc2RnfLbj+x9D+C52wD9GQY4VYP5DZ6u
K04UYehzCpAHRA2FEtfgEQCGr5mfsA0=
=Veb6
-----END PGP SIGNATURE-----
Merge tag 'zonefs-6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs
Pull zonefs update from Damien Le Moal:
- Use ZONEFS_SUPER_SIZE instead of PAGE_SIZE to read from disk the
super block (Johannes).
* tag 'zonefs-6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs:
zonefs: use ZONEFS_SUPER_SIZE instead of PAGE_SIZE
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmiHdZ8QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgptRED/9o3dQ1QHL5yNM/AyCCGox0V4zra8qGS/Vc
cBWpAVrmPGRw0IYlLZENtN9PdwKcbMzJq3l6cxeC7dBnAZP0AxTzP4YYJYUNVsqo
WtJ3d/k5+cVp0OyOp4uabaqNeMeLoPk9/JXe1Ml2KxtDmHtj5yee0JRh7zlPZmZj
tsrpIUTeHgAPn6yR1EI+0ybx/mjCb05Mv2Y8gF5hkUPA2PuON+MTFixJmqoy2ySh
n+22mz/prqlyOSYh/VVv1+9jcQ94wMjcW0JIpg9lM3Kg8BCPU4IetvO1UiX6X33v
154zEh2aJJDBx+yORS4BM4JMXjRZI7lYea2dkHM8Cajctu1Wpja9bNwnK9ibXvEc
WtyBwztleLbAZef25fA/W87JE23fGa/r3nwIb2cF4QqkAFslCvhjA93WkOzNJCgQ
qsWOrlCh3IK2NUu4b1Ncs3ZHOPvc51+zzjMzC6SUr54xhrxDK+gngDPhRy7XDqWJ
DTMpIlr366o8GdJqnib0/e/CPBrThS6Vl6u0tgLnNbwdpK1svgo/uHW5ksKvDqHX
kGEIhyRRJJC+4wyl4dsYKXa2twcyFrlWdAE+pZguEC2nZRYqYl9uXftOtvfp1x0y
/skDX0FIDjvyjRqCLcqF03FSGqwCGS8WuWXZjPhVhcfz47NvbHeFDh1G/jMzsbpj
S9zrPve/DQ==
=e86T
-----END PGP SIGNATURE-----
Merge tag 'for-6.17/block-20250728' of git://git.kernel.dk/linux
Pull block updates from Jens Axboe:
- MD pull request via Yu:
- call del_gendisk synchronously (Xiao)
- cleanup unused variable (John)
- cleanup workqueue flags (Ryo)
- fix faulty rdev can't be removed during resync (Qixing)
- NVMe pull request via Christoph:
- try PCIe function level reset on init failure (Keith Busch)
- log TLS handshake failures at error level (Maurizio Lombardi)
- pci-epf: do not complete commands twice if nvmet_req_init()
fails (Rick Wertenbroek)
- misc cleanups (Alok Tiwari)
- Removal of the pktcdvd driver
This has been more than a decade coming at this point, and some
recently revealed breakages that had it causing issues even for cases
where it isn't required made me re-pull the trigger on this one. It's
known broken and nobody has stepped up to maintain the code
- Series for ublk supporting batch commands, enabling the use of
multishot where appropriate
- Speed up ublk exit handling
- Fix for the two-stage elevator fixing which could leak data
- Convert NVMe to use the new IOVA based API
- Increase default max transfer size to something more reasonable
- Series fixing write operations on zoned DM devices
- Add tracepoints for zoned block device operations
- Prep series working towards improving blk-mq queue management in the
presence of isolated CPUs
- Don't allow updating of the block size of a loop device that is
currently under exclusively ownership/open
- Set chunk sectors from stacked device stripe size and use it for the
atomic write size limit
- Switch to folios in bcache read_super()
- Fix for CD-ROM MRW exit flush handling
- Various tweaks, fixes, and cleanups
* tag 'for-6.17/block-20250728' of git://git.kernel.dk/linux: (94 commits)
block: restore two stage elevator switch while running nr_hw_queue update
cdrom: Call cdrom_mrw_exit from cdrom_release function
sunvdc: Balance device refcount in vdc_port_mpgroup_check
nvme-pci: try function level reset on init failure
dm: split write BIOs on zone boundaries when zone append is not emulated
block: use chunk_sectors when evaluating stacked atomic write limits
dm-stripe: limit chunk_sectors to the stripe size
md/raid10: set chunk_sectors limit
md/raid0: set chunk_sectors limit
block: sanitize chunk_sectors for atomic write limits
ilog2: add max_pow_of_two_factor()
nvmet: pci-epf: Do not complete commands twice if nvmet_req_init() fails
nvme-tcp: log TLS handshake failures at error level
docs: nvme: fix grammar in nvme-pci-endpoint-target.rst
nvme: fix typo in status code constant for self-test in progress
nvmet: remove redundant assignment of error code in nvmet_ns_enable()
nvme: fix incorrect variable in io cqes error message
nvme: fix multiple spelling and grammar issues in host drivers
block: fix blk_zone_append_update_request_bio() kernel-doc
md/raid10: fix set but not used variable in sync_request_write()
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmiHcP8QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgphu+EACp+MNKBIx7iIM1arbmafEmroaR9Whi5CJu
8BczAKRVPNhICPXAjr6UOmRuD4nBzPysPNi73xNRdXdnd/dnOL27zMqF/0JCJsd6
V4Xt7pXcanJ4BoeRDv9P332h6/IVo1ulxqEzhWJt2KKoq4cWbSwlvk18NPOdGIkp
PcMPARf15d4KbLprdST5Dzrg2uqZ3jZ4Y4zvIMKery56AZCkp5ZrtPFttw4jhTff
FQzIjY/hXXDFURubv75EHFg7KC9rcZa4CuU40duTHmsXD/EsSsa0eAJMmiVAGjhM
PXK/70ZOoxLK1zU7zN+iAMUnVL7onoNvo8g9e7/FJNuHV25am8aUx4DzzP3ci+He
+2C4PpOKx9Egz0oKmC6hl9eMg/muPd/3j8uxQlGOvz38icgeirQB/ebr1KLtBkM9
pKQmCsGaIzncXWddcTMCayEj3wbvn2lgWSqSgZuwTt1AXqSTxLOmTywCYcXDBdli
2ejh/Hk45TALnBhiiDWdOT2euh2EEP1ADOZzVsZzeR29OzLqJDRjRnitb40NUU60
+ny2HOcplIN6aah+QdFwK31FieVKDY/ufjt1yGvwCP+kIxUEDSYi+NRRldmwznxy
8UZwFyvd8wOxUc+iG/wCK7ccY+MtYyaZAE4ok2QEvQvMUTBJE/LvI4bkd77fx7Sy
2cMeDukZ/w==
=EGH5
-----END PGP SIGNATURE-----
Merge tag 'for-6.17/io_uring-20250728' of git://git.kernel.dk/linux
Pull io_uring updates from Jens Axboe:
- Optimization to avoid reference counts on non-cloned registered
buffers. This is how these buffers were handled prior to having
cloning support, and we can still use that approach as long as the
buffers haven't been cloned to another ring.
- Cleanup and improvement for uring_cmd, where btrfs was the only user
of storing allocated data for the lifetime of the uring_cmd. Clean
that up so we can get rid of the need to do that.
- Avoid unnecessary memory copies in uring_cmd usage. This is
particularly important as a lot of uring_cmd usage necessitates the
use of 128b SQEs.
- A few updates for recv multishot, where it's now possible to add
fairness limits for limiting how much is transferred for each retry
loop. Additionally, recv multishot now supports an overall cap as
well, where once reached the multishot recv will terminate. The
latter is useful for buffer management and juggling many recv streams
at the same time.
- Add support for returning the TX timestamps via a new socket command.
This feature can work in either singleshot or multishot mode, where
the latter triggers a completion whenever new timestamps are
available. This is an alternative to using the existing error queue.
- Add support for an io_uring "mock" file, which is the start of being
able to do 100% targeted testing in terms of exercising io_uring
request handling. The idea is to have a file type that can be
anything the tester would like, and behave exactly how you want it to
behave in terms of hitting the code paths you want.
- Improve zcrx by using sgtables to de-duplicate and improve dma
address handling.
- Prep work for supporting larger pages for zcrx.
- Various little improvements and fixes.
* tag 'for-6.17/io_uring-20250728' of git://git.kernel.dk/linux: (42 commits)
io_uring/zcrx: fix leaking pages on sg init fail
io_uring/zcrx: don't leak pages on account failure
io_uring/zcrx: fix null ifq on area destruction
io_uring: fix breakage in EXPERT menu
io_uring/cmd: remove struct io_uring_cmd_data
btrfs/ioctl: store btrfs_uring_encoded_data in io_btrfs_cmd
io_uring/cmd: introduce IORING_URING_CMD_REISSUE flag
io_uring/zcrx: account area memory
io_uring: export io_[un]account_mem
io_uring/net: Support multishot receive len cap
io_uring: deduplicate wakeup handling
io_uring/net: cast min_not_zero() type
io_uring/poll: cleanup apoll freeing
io_uring/net: allow multishot receive per-invocation cap
io_uring/net: move io_sr_msg->retry_flags to io_sr_msg->flags
io_uring/net: use passed in 'len' in io_recv_buf_select()
io_uring/zcrx: prepare fallback for larger pages
io_uring/zcrx: assert area type in io_zcrx_iov_page
io_uring/zcrx: allocate sgtable for umem areas
io_uring/zcrx: introduce io_populate_area_dma
...
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmiHqYkACgkQiiy9cAdy
T1FfyQwAi+luMNvR8p9rH3zNVk+WN/a4vV3suJFuc6Bl6kXFfeNU0DFIB2ZCqAmz
lk0N5B/dAd71IpNkTtIX2nZiJpRIBzTy5xAnwARq2LdrDi6ipNSKOv4qfOEMeufl
ehfkqJL/9YQWJv/78Tntfuub2V3t479lpVRDp5hGAwZB4i9KZcVew7ifJJhCoUSh
9NxcRD6AefeZTZaoTyKXayDCTjEn9N7Q6sXOTsSnLSX+e2nIDobFh+Sm1sDcSHMh
2BH9a2Eg9CAZkLj6/53sIYUR4LLJjFMeDRpsGa4RZW0QA23WYJqR+R2Ouc5eCkBD
OS79J5GAbag0RqujLqbl4QXFJRJgW6epB6cCOCjqqo5fQPQa9TIPem2iB5T1NHTg
yMiuJk+qRX1vqyt8QI05yromSYKY7QBqgKzfkjv7054N1XfyUK/99Y+lX5t2nMZ7
Fodxdaxrmk78cUpyaCbTLlW7zrkvMK6hQRQ2h++Q2RrYZ8mo+ciR13B6nftR3sdA
mewSF+yO
=DRH3
-----END PGP SIGNATURE-----
Merge tag 'v6.17-rc-smb3-server-fixes' of git://git.samba.org/ksmbd
Pull smb server updates from Steve French:
- Fix mtime/ctime reporting issue
- Auth fixes, including two session setup race bugs reported by ZDI
- Locking improvement in query directory
- Fix for potential deadlock in creating hardlinks
- Improvements to path name processing
* tag 'v6.17-rc-smb3-server-fixes' of git://git.samba.org/ksmbd:
ksmbd: fix corrupted mtime and ctime in smb2_open
ksmbd: fix Preauh_HashValue race condition
ksmbd: check return value of xa_store() in krb5_authenticate
ksmbd: fix null pointer dereference error in generate_encryptionkey
smb/server: add ksmbd_vfs_kern_path()
smb/server: avoid deadlock when linking with ReplaceIfExists
smb/server: simplify ksmbd_vfs_kern_path_locked()
smb/server: use lookup_one_unlocked()
- hfs: fix general protection fault in hfs_find_init()
- hfs: fix slab-out-of-bounds in hfs_bnode_read()
- hfsplus: fix slab-out-of-bounds in hfsplus_bnode_read()
- hfsplus: fix slab-out-of-bounds read in hfsplus_uni2asc()
- hfsplus: don't use BUG_ON() in hfsplus_create_attributes_file()
- hfsplus: don't set REQ_SYNC for hfsplus_submit_bio()
- hfsplus: remove mutex_lock check in hfsplus_free_extents
- hfs: make splice write available again
- hfsplus: make splice write available again
- hfs: fix not erasing deleted b-tree node issue
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQT4wVoLCG92poNnMFAhI4xTh21NnQUCaIQQ0wAKCRAhI4xTh21N
nW3yAQDMhJcNyjP1j2dhNRq8l2PO6jDJqLhxAYGKwWMwv1GTvQD5AaOUSeMQbmcs
hNkMtjzb7OlfBLUthvrWlaCfLKWCmAk=
=dI94
-----END PGP SIGNATURE-----
Merge tag 'hfs-v6.17-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/vdubeyko/hfs
Pull hfs/hfsplus updates from Viacheslav Dubeyko:
"Johannes Thumshirn has made nice cleanup in hfsplus_submit_bio().
Tetsuo Handa has fixed the syzbot reported issue in
hfsplus_create_attributes_file() for the case of corruption the
Attributes File's metadata.
Yangtao Li has fixed the syzbot reported issue by removing the
uneccessary WARN_ON() in hfsplus_free_extents().
Other fixes:
- restore generic/001 successful execution by erasing deleted b-tree
nodes
- eliminate slab-out-of-bounds issue in hfs_bnode_read() and
hfsplus_bnode_read() by checking correctness of offset and length
when accessing b-tree node contents
- eliminate slab-out-of-bounds read in hfsplus_uni2asc() if the
b-tree node record has corrupted length of a name that could be
bigger than HFSPLUS_MAX_STRLEN
- eliminate general protection fault in hfs_find_init() for the case
of initial b-tree object creation"
* tag 'hfs-v6.17-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/vdubeyko/hfs:
hfs: fix general protection fault in hfs_find_init()
hfs: fix slab-out-of-bounds in hfs_bnode_read()
hfsplus: fix slab-out-of-bounds in hfsplus_bnode_read()
hfsplus: fix slab-out-of-bounds read in hfsplus_uni2asc()
hfsplus: don't use BUG_ON() in hfsplus_create_attributes_file()
hfsplus: don't set REQ_SYNC for hfsplus_submit_bio()
hfsplus: remove mutex_lock check in hfsplus_free_extents
hfs: make splice write available again
hfsplus: make splice write available again
hfs: fix not erasing deleted b-tree node issue
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmiHny8ACgkQnJ2qBz9k
QNl/dQf/Wh/wsQwnYbZ1P1Mk1aGq+5xLAZJrt8dY8umHfCklWzjrmrpbxV11KbSX
sxnAzRRGP/GlP9Atb6J4oBH2odIY57aKZcfeA64FYYM7yDo3ZvNQvNe+Il3Wr5Zn
UBCGxr6mbeGt1GjBiP77kZzgLeHNnDKBR8Eu9i6zqYBsPk6wBM83oC2g+Ala++vM
leb+uph2fL0rGqXu07LUpQDeLadBjhqRkzdgKOYJ6OXmckWASpYBkQlnCNTnAPYv
AvDod+Mh5UNR0Sq+zB/EYEQn6OTFq+IB6MpGypQjVKxACUt3pTlRyraKWpvakR6r
tih1TyOS5z1wZXBcoU2EtA9N/xYnjw==
=Kj02
-----END PGP SIGNATURE-----
Merge tag 'fs_for_v6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull udf and ext2 updates from Jan Kara:
"A few udf and ext2 fixes and cleanups"
* tag 'fs_for_v6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
udf: Verify partition map count
udf: stop using write_cache_pages
ext2: Handle fiemap on empty files to prevent EINVAL
Remove incorrect page alignment check for the writeback len arg in
fuse_iomap_writeback_range(). len will always be block-aligned as
passed in by iomap.
On regular fuse filesystems, i_blkbits is set to PAGE_SHIFT so this is
not a problem but for fuseblk filesystems, the block size is set to a
default of 512 bytes or a block size passed in at mount time.
Please note that non-page-aligned lengths are fine for the logic in
fuse_iomap_writeback_range(). The check was originally added as a
safeguard to detect conspicuously wrong ranges.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Fixes: ef7e7cbb32 ("fuse: use iomap for writeback")
Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
Link: https://lore.kernel.org/linux-fsdevel/CA+G9fYs5AdVM-T2Tf3LciNCwLZEHetcnSkHsjZajVwwpM2HmJw@mail.gmail.com/
Reported-by: Sasha Levin <sashal@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaINCtwAKCRCRxhvAZXjc
ogPuAQChc4tCjlNp+yAwbSmuzWooKTN8PHI6v+3ftjdaKSy9AgD/Yya1i8aBYBA8
9HBtIKGAqvcgNB3por7yN+GJ8fxb/Ag=
=YmLL
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.17-rc1.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs iomap updates from Christian Brauner:
- Refactor the iomap writeback code and split the generic and ioend/bio
based writeback code.
There are two methods that define the split between the generic
writeback code, and the implemementation of it, and all knowledge of
ioends and bios now sits below that layer.
- Add fuse iomap support for buffered writes and dirty folio writeback.
This is needed so that granular uptodate and dirty tracking can be
used in fuse when large folios are enabled. This has two big
advantages. For writes, instead of the entire folio needing to be
read into the page cache, only the relevant portions need to be. For
writeback, only the dirty portions need to be written back instead of
the entire folio.
* tag 'vfs-6.17-rc1.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fuse: refactor writeback to use iomap_writepage_ctx inode
fuse: hook into iomap for invalidating and checking partial uptodateness
fuse: use iomap for folio laundering
fuse: use iomap for writeback
fuse: use iomap for buffered writes
iomap: build the writeback code without CONFIG_BLOCK
iomap: add read_folio_range() handler for buffered writes
iomap: improve argument passing to iomap_read_folio_sync
iomap: replace iomap_folio_ops with iomap_write_ops
iomap: export iomap_writeback_folio
iomap: move folio_unlock out of iomap_writeback_folio
iomap: rename iomap_writepage_map to iomap_writeback_folio
iomap: move all ioend handling to ioend.c
iomap: add public helpers for uptodate state manipulation
iomap: hide ioends from the generic writeback code
iomap: refactor the writeback interface
iomap: cleanup the pending writeback tracking in iomap_writepage_map_blocks
iomap: pass more arguments using the iomap writeback context
iomap: header diet
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaINCsAAKCRCRxhvAZXjc
op1/AQCYRmE6MsFclZ/6Qhpd8Xxl6jYaw0VuSIGneh/HA5EmqQEAiE3/Q0paC1HB
PHryCsVau1yOfJtE1P05/3JLA73hWA4=
=MBdP
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.17-rc1.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull superblock callback update from Christian Brauner:
"Currently all filesystems which implement super_operations::shutdown()
can not afford losing a device.
Thus fs_bdev_mark_dead() will just call the ->shutdown() callback for
the involved filesystem.
But it will no longer be the case, as multi-device filesystems like
btrfs can handle certain device loss without the need to shutdown the
whole filesystem.
To allow those multi-device filesystems to be integrated to use
fs_holder_ops:
- Add a new super_operations::remove_bdev() callback
- Try ->remove_bdev() callback first inside fs_bdev_mark_dead().
If the callback returned 0, meaning the fs can handling the device
loss, then exit without doing anything else.
If there is no such callback or the callback returned non-zero
value, continue to shutdown the filesystem as usual.
This means the new remove_bdev() should only do the check on whether
the operation can continue, and if so do the fs specific handlings.
The shutdown handling should still be handled by the existing
->shutdown() callback.
For all existing filesystems with shutdown callback, there is no
change to the code nor behavior.
Btrfs is going to implement both the ->remove_bdev() and ->shutdown()
callbacks soon"
* tag 'vfs-6.17-rc1.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: add a new remove_bdev() callback
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaINCpgAKCRCRxhvAZXjc
oqfFAQDcy3rROUF3W34KcSi7rDmaKVSX53d1tUoqH+1zDRpSlwEAriKDNC1ybudp
YAnxVzkRHjHs1296WIuwKq5lfhJ60Q4=
=geAl
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.17-rc1.fileattr' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull fileattr updates from Christian Brauner:
"This introduces the new file_getattr() and file_setattr() system calls
after lengthy discussions.
Both system calls serve as successors and extensible companions to
the FS_IOC_FSGETXATTR and FS_IOC_FSSETXATTR system calls which have
started to show their age in addition to being named in a way that
makes it easy to conflate them with extended attribute related
operations.
These syscalls allow userspace to set filesystem inode attributes on
special files. One of the usage examples is the XFS quota projects.
XFS has project quotas which could be attached to a directory. All new
inodes in these directories inherit project ID set on parent
directory.
The project is created from userspace by opening and calling
FS_IOC_FSSETXATTR on each inode. This is not possible for special
files such as FIFO, SOCK, BLK etc. Therefore, some inodes are left
with empty project ID. Those inodes then are not shown in the quota
accounting but still exist in the directory. This is not critical but
in the case when special files are created in the directory with
already existing project quota, these new inodes inherit extended
attributes. This creates a mix of special files with and without
attributes. Moreover, special files with attributes don't have a
possibility to become clear or change the attributes. This, in turn,
prevents userspace from re-creating quota project on these existing
files.
In addition, these new system calls allow the implementation of
additional attributes that we couldn't or didn't want to fit into the
legacy ioctls anymore"
* tag 'vfs-6.17-rc1.fileattr' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: tighten a sanity check in file_attr_to_fileattr()
tree-wide: s/struct fileattr/struct file_kattr/g
fs: introduce file_getattr and file_setattr syscalls
fs: prepare for extending file_get/setattr()
fs: make vfs_fileattr_[get|set] return -EOPNOTSUPP
selinux: implement inode_file_[g|s]etattr hooks
lsm: introduce new hooks for setting/getting inode fsxattr
fs: split fileattr related helpers into separate file
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaINCjwAKCRCRxhvAZXjc
osnVAQCv4rM7sF4yJvGlm1myIJcJy5Sabk2q31qMdI1VHmkcOwD+Mxs7d1aByTS8
/6djhVleq6lcT2LpP9j8YI3Rb+x30QY=
=PF3o
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.17-rc1.bpf' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs bpf updates from Christian Brauner:
"These changes allow bpf to read extended attributes from cgroupfs.
This is useful in redirecting AF_UNIX socket connections based on
cgroup membership of the socket. One use-case is the ability to
implement log namespaces in systemd so services and containers are
redirected to different journals"
* tag 'vfs-6.17-rc1.bpf' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
selftests/kernfs: test xattr retrieval
selftests/bpf: Add tests for bpf_cgroup_read_xattr
bpf: Mark cgroup_subsys_state->cgroup RCU safe
bpf: Introduce bpf_cgroup_read_xattr to read xattr of cgroup's node
kernfs: remove iattr_mutex
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaINCiQAKCRCRxhvAZXjc
orltAQDq3y1anYETz5/FD6P2gXY1W5hXdSm3EHHeacQ1JjTXvgEA2g1lWO7J4anf
oOVE8aSvMow/FOjivLZBYmI65pkYJAE=
=oDKB
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.17-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull pidfs updates from Christian Brauner:
- persistent info
Persist exit and coredump information independent of whether anyone
currently holds a pidfd for the struct pid.
The current scheme allocated pidfs dentries on-demand repeatedly.
This scheme is reaching it's limits as it makes it impossible to pin
information that needs to be available after the task has exited or
coredumped and that should not be lost simply because the pidfd got
closed temporarily. The next opener should still see the stashed
information.
This is also a prerequisite for supporting extended attributes on
pidfds to allow attaching meta information to them.
If someone opens a pidfd for a struct pid a pidfs dentry is allocated
and stashed in pid->stashed. Once the last pidfd for the struct pid
is closed the pidfs dentry is released and removed from pid->stashed.
So if 10 callers create a pidfs dentry for the same struct pid
sequentially, i.e., each closing the pidfd before the other creates a
new one then a new pidfs dentry is allocated every time.
Because multiple tasks acquiring and releasing a pidfd for the same
struct pid can race with each another a task may still find a valid
pidfs entry from the previous task in pid->stashed and reuse it. Or
it might find a dead dentry in there and fail to reuse it and so
stashes a new pidfs dentry. Multiple tasks may race to stash a new
pidfs dentry but only one will succeed, the other ones will put their
dentry.
The current scheme aims to ensure that a pidfs dentry for a struct
pid can only be created if the task is still alive or if a pidfs
dentry already existed before the task was reaped and so exit
information has been was stashed in the pidfs inode.
That's great except that it's buggy. If a pidfs dentry is stashed in
pid->stashed after pidfs_exit() but before __unhash_process() is
called we will return a pidfd for a reaped task without exit
information being available.
The pidfds_pid_valid() check does not guard against this race as it
doens't sync at all with pidfs_exit(). The pid_has_task() check might
be successful simply because we're before __unhash_process() but
after pidfs_exit().
Introduce a new scheme where the lifetime of information associated
with a pidfs entry (coredump and exit information) isn't bound to the
lifetime of the pidfs inode but the struct pid itself.
The first time a pidfs dentry is allocated for a struct pid a struct
pidfs_attr will be allocated which will be used to store exit and
coredump information.
If all pidfs for the pidfs dentry are closed the dentry and inode can
be cleaned up but the struct pidfs_attr will stick until the struct
pid itself is freed. This will ensure minimal memory usage while
persisting relevant information.
The new scheme has various advantages. First, it allows to close the
race where we end up handing out a pidfd for a reaped task for which
no exit information is available. Second, it minimizes memory usage.
Third, it allows to remove complex lifetime tracking via dentries
when registering a struct pid with pidfs. There's no need to get or
put a reference. Instead, the lifetime of exit and coredump
information associated with a struct pid is bound to the lifetime of
struct pid itself.
- extended attributes
Now that we have a way to persist information for pidfs dentries we
can start supporting extended attributes on pidfds. This will allow
userspace to attach meta information to tasks.
One natural extension would be to introduce a custom pidfs.* extended
attribute space and allow for the inheritance of extended attributes
across fork() and exec().
The first simple scheme will allow privileged userspace to set
trusted extended attributes on pidfs inodes.
- Allow autonomous pidfs file handles
Various filesystems such as pidfs and drm support opening file
handles without having to require a file descriptor to identify the
filesystem. The filesystem are global single instances and can be
trivially identified solely on the information encoded in the file
handle.
This makes it possible to not have to keep or acquire a sentinal file
descriptor just to pass it to open_by_handle_at() to identify the
filesystem. That's especially useful when such sentinel file
descriptor cannot or should not be acquired.
For pidfs this means a file handle can function as full replacement
for storing a pid in a file. Instead a file handle can be stored and
reopened purely based on the file handle.
Such autonomous file handles can be opened with or without specifying
a a file descriptor. If no proper file descriptor is used the
FD_PIDFS_ROOT sentinel must be passed. This allows us to define
further special negative fd sentinels in the future.
Userspace can trivially test for support by trying to open the file
handle with an invalid file descriptor.
- Allow pidfds for reaped tasks with SCM_PIDFD messages
This is a logical continuation of the earlier work to create pidfds
for reaped tasks through the SO_PEERPIDFD socket option merged in
923ea4d448 ("Merge patch series "net, pidfs: enable handing out
pidfds for reaped sk->sk_peer_pid"").
- Two minor fixes:
* Fold fs_struct->{lock,seq} into a seqlock
* Don't bother with path_{get,put}() in unix_open_file()
* tag 'vfs-6.17-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (37 commits)
don't bother with path_get()/path_put() in unix_open_file()
fold fs_struct->{lock,seq} into a seqlock
selftests: net: extend SCM_PIDFD test to cover stale pidfds
af_unix: enable handing out pidfds for reaped tasks in SCM_PIDFD
af_unix: stash pidfs dentry when needed
af_unix/scm: fix whitespace errors
af_unix: introduce and use scm_replace_pid() helper
af_unix: introduce unix_skb_to_scm helper
af_unix: rework unix_maybe_add_creds() to allow sleep
selftests/pidfd: decode pidfd file handles withou having to specify an fd
fhandle, pidfs: support open_by_handle_at() purely based on file handle
uapi/fcntl: add FD_PIDFS_ROOT
uapi/fcntl: add FD_INVALID
fcntl/pidfd: redefine PIDFD_SELF_THREAD_GROUP
uapi/fcntl: mark range as reserved
fhandle: reflow get_path_anchor()
pidfs: add pidfs_root_path() helper
fhandle: rename to get_path_anchor()
fhandle: hoist copy_from_user() above get_path_from_fd()
fhandle: raise FILEID_IS_DIR in handle_type
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaINCgQAKCRCRxhvAZXjc
os+nAP9LFHUwWO6EBzHJJGEVjJvvzsbzqeYrRFamYiMc5ulPJwD+KW4RIgJa/MWO
pcYE40CacaekD8rFWwYUyszpgmv6ewc=
=wCwp
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.17-rc1.mmap_prepare' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull mmap_prepare updates from Christian Brauner:
"Last cycle we introduce f_op->mmap_prepare() in c84bf6dd2b ("mm:
introduce new .mmap_prepare() file callback").
This is preferred to the existing f_op->mmap() hook as it does require
a VMA to be established yet, thus allowing the mmap logic to invoke
this hook far, far earlier, prior to inserting a VMA into the virtual
address space, or performing any other heavy handed operations.
This allows for much simpler unwinding on error, and for there to be a
single attempt at merging a VMA rather than having to possibly
reattempt a merge based on potentially altered VMA state.
Far more importantly, it prevents inappropriate manipulation of
incompletely initialised VMA state, which is something that has been
the cause of bugs and complexity in the past.
The intent is to gradually deprecate f_op->mmap, and in that vein this
series coverts the majority of file systems to using f_op->mmap_prepare.
Prerequisite steps are taken - firstly ensuring all checks for mmap
capabilities use the file_has_valid_mmap_hooks() helper rather than
directly checking for f_op->mmap (which is now not a valid check) and
secondly updating daxdev_mapping_supported() to not require a VMA
parameter to allow ext4 and xfs to be converted.
Commit bb666b7c27 ("mm: add mmap_prepare() compatibility layer for
nested file systems") handles the nasty edge-case of nested file
systems like overlayfs, which introduces a compatibility shim to allow
f_op->mmap_prepare() to be invoked from an f_op->mmap() callback.
This allows for nested filesystems to continue to function correctly
with all file systems regardless of which callback is used. Once we
finally convert all file systems, this shim can be removed.
As a result, ecryptfs, fuse, and overlayfs remain unaltered so they
can nest all other file systems.
We additionally do not update resctl - as this requires an update to
remap_pfn_range() (or an alternative to it) which we defer to a later
series, equally we do not update cramfs which needs a mixed mapping
insertion with the same issue, nor do we update procfs, hugetlbfs,
syfs or kernfs all of which require VMAs for internal state and hooks.
We shall return to all of these later"
* tag 'vfs-6.17-rc1.mmap_prepare' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
doc: update porting, vfs documentation to describe mmap_prepare()
fs: replace mmap hook with .mmap_prepare for simple mappings
fs: convert most other generic_file_*mmap() users to .mmap_prepare()
fs: convert simple use of generic_file_*_mmap() to .mmap_prepare()
mm/filemap: introduce generic_file_*_mmap_prepare() helpers
fs/xfs: transition from deprecated .mmap hook to .mmap_prepare
fs/ext4: transition from deprecated .mmap hook to .mmap_prepare
fs/dax: make it possible to check dev dax support without a VMA
fs: consistently use can_mmap_file() helper
mm/nommu: use file_has_valid_mmap_hooks() helper
mm: rename call_mmap/mmap_prepare to vfs_mmap/mmap_prepare
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaINCeQAKCRCRxhvAZXjc
otqEAP9bWFExQtnzrNR+1s4UBfPVDAaTJzDnBWj6z0+Idw9oegEAoxF2ifdCPnR4
t/xWiM4FmSA+9pwvP3U5z3sOReDDsgo=
=WMMB
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.17-rc1.fallocate' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull fallocate updates from Christian Brauner:
"fallocate() currently supports creating preallocated files
efficiently. However, on most filesystems fallocate() will preallocate
blocks in an unwriten state even if FALLOC_FL_ZERO_RANGE is specified.
The extent state must later be converted to a written state when the
user writes data into this range, which can trigger numerous metadata
changes and journal I/O. This may leads to significant write
amplification and performance degradation in synchronous write mode.
At the moment, the only method to avoid this is to create an empty
file and write zero data into it (for example, using 'dd' with a large
block size). However, this method is slow and consumes a considerable
amount of disk bandwidth.
Now that more and more flash-based storage devices are available it is
possible to efficiently write zeros to SSDs using the unmap write
zeroes command if the devices do not write physical zeroes to the
media.
For example, if SCSI SSDs support the UMMAP bit or NVMe SSDs support
the DEAC bit[1], the write zeroes command does not write actual data
to the device, instead, NVMe converts the zeroed range to a
deallocated state, which works fast and consumes almost no disk write
bandwidth.
This series implements the BLK_FEAT_WRITE_ZEROES_UNMAP feature and
BLK_FLAG_WRITE_ZEROES_UNMAP_DISABLED flag for SCSI, NVMe and
device-mapper drivers, and add the FALLOC_FL_WRITE_ZEROES and
STATX_ATTR_WRITE_ZEROES_UNMAP support for ext4 and raw bdev devices.
fallocate() is subsequently extended with the FALLOC_FL_WRITE_ZEROES
flag. FALLOC_FL_WRITE_ZEROES zeroes a specified file range in such a
way that subsequent writes to that range do not require further
changes to the file mapping metadata. This flag is beneficial for
subsequent pure overwriting within this range, as it can save on block
allocation and, consequently, significant metadata changes"
* tag 'vfs-6.17-rc1.fallocate' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
ext4: add FALLOC_FL_WRITE_ZEROES support
block: add FALLOC_FL_WRITE_ZEROES support
block: factor out common part in blkdev_fallocate()
fs: introduce FALLOC_FL_WRITE_ZEROES to fallocate
dm: clear unmap write zeroes limits when disabling write zeroes
scsi: sd: set max_hw_wzeroes_unmap_sectors if device supports SD_ZERO_*_UNMAP
nvmet: set WZDS and DRB if device enables unmap write zeroes operation
nvme: set max_hw_wzeroes_unmap_sectors if device supports DEAC bit
block: introduce max_{hw|user}_wzeroes_unmap_sectors to queue limits
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaINCcgAKCRCRxhvAZXjc
ose6AQDUhwws5T7FYbqQRZC7tc19xJ4CJN2MH6WQsRJ8PrXMtQD/dY/KVPGtOZgb
+fFGcOPkO9c+D9WUNXjcGtGMv+fsegc=
=iV8T
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.17-rc1.async.dir' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull async directory updates from Christian Brauner:
"This contains preparatory changes for the asynchronous directory
locking scheme.
While the locking scheme is still very much controversial and we're
still far away from landing any actual changes in that area the
preparatory work that we've been upstreaming for a while now has been
very useful. This is another set of minor changes and cleanups"
* tag 'vfs-6.17-rc1.async.dir' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
exportfs: use lookup_one_unlocked()
coda: use iterate_dir() in coda_readdir()
VFS: Minor fixes for porting.rst
VFS: merge lookup_one_qstr_excl_raw() back into lookup_one_qstr_excl()
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaINCaAAKCRCRxhvAZXjc
ouTCAQCrNCM3h9MpgcMDDUbi9b+lIR11JlvtWNGRUACv3RSNLgEA1vm30+u+JM87
KpVYg1RrkXDyMFXXuzy7UWpzLcBRywI=
=L0eW
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.17-rc1.nsfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull namespace updates from Christian Brauner:
"This contains namespace updates. This time specifically for nsfs:
- Userspace heavily relies on the root inode numbers for namespaces
to identify the initial namespaces. That's already a hard
dependency. So we cannot change that anymore. Move the initial
inode numbers to a public header and align the only two namespaces
that currently don't do that with all the other namespaces.
- The root inode of /proc having a fixed inode number has been part
of the core kernel ABI since its inception, and recently some
userspace programs (mainly container runtimes) have started to
explicitly depend on this behaviour.
The main reason this is useful to userspace is that by checking
that a suspect /proc handle has fstype PROC_SUPER_MAGIC and is
PROCFS_ROOT_INO, they can then use openat2() together with
RESOLVE_{NO_{XDEV,MAGICLINK},BENEATH} to ensure that there isn't a
bind-mount that replaces some procfs file with a different one.
This kind of attack has lead to security issues in container
runtimes in the past (such as CVE-2019-19921) and libraries like
libpathrs[1] use this feature of procfs to provide safe procfs
handling functions"
* tag 'vfs-6.17-rc1.nsfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
uapi: export PROCFS_ROOT_INO
mntns: use stable inode number for initial mount ns
netns: use stable inode number for initial mount ns
nsfs: move root inode number to uapi
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaINakQAKCRCRxhvAZXjc
okGZAP9CUQfiiT3DUq0pAeuXR2BjpjM8hnNTlO7REC/AmoDWcQD/SDZWfjP2uhtk
TgGlT1fS5cVcRf72+8JBtT7LGmDB7wA=
=5vdH
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.17-rc1.ovl' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull overlayfs updates from Christian Brauner:
"This contains overlayfs updates for this cycle.
The changes for overlayfs in here are primarily focussed on preparing
for some proposed changes to directory locking.
Overlayfs currently will sometimes lock a directory on the upper
filesystem and do a few different things while holding the lock. This
is incompatible with the new potential scheme.
This series narrows the region of code protected by the directory
lock, taking it multiple times when necessary. This theoretically
opens up the possibilty of other changes happening on the upper
filesytem between the unlock and the lock. To some extent the patches
guard against that by checking the dentries still have the expect
parent after retaking the lock. In general, concurrent changes to the
upper and lower filesystems aren't supported properly anyway"
* tag 'vfs-6.17-rc1.ovl' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (25 commits)
ovl: properly print correct variable
ovl: rename ovl_cleanup_unlocked() to ovl_cleanup()
ovl: change ovl_create_real() to receive dentry parent
ovl: narrow locking in ovl_check_rename_whiteout()
ovl: narrow locking in ovl_whiteout()
ovl: change ovl_cleanup_and_whiteout() to take rename lock as needed
ovl: narrow locking on ovl_remove_and_whiteout()
ovl: change ovl_workdir_cleanup() to take dir lock as needed.
ovl: narrow locking in ovl_workdir_cleanup_recurse()
ovl: narrow locking in ovl_indexdir_cleanup()
ovl: narrow locking in ovl_workdir_create()
ovl: narrow locking in ovl_cleanup_index()
ovl: narrow locking in ovl_cleanup_whiteouts()
ovl: narrow locking in ovl_rename()
ovl: simplify gotos in ovl_rename()
ovl: narrow locking in ovl_create_over_whiteout()
ovl: narrow locking in ovl_clear_empty()
ovl: narrow locking in ovl_create_upper()
ovl: narrow the locked region in ovl_copy_up_workdir()
ovl: Call ovl_create_temp() without lock held.
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaINAYAAKCRCRxhvAZXjc
opJiAQDXGs+gQcxJ+4BpV4QszT2OJC19oI/f5AQ4PWMJdHgr4AEA7fc6NbBrpmW7
L/tbdAwIiWp8bL1Q8Wy7Q2qldHtcggM=
=KbD9
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.17-rc1.coredump' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull coredump updates from Christian Brauner:
"This contains an extension to the coredump socket and a proper rework
of the coredump code.
- This extends the coredump socket to allow the coredump server to
tell the kernel how to process individual coredumps. This allows
for fine-grained coredump management. Userspace can decide to just
let the kernel write out the coredump, or generate the coredump
itself, or just reject it.
* COREDUMP_KERNEL
The kernel will write the coredump data to the socket.
* COREDUMP_USERSPACE
The kernel will not write coredump data but will indicate to the
parent that a coredump has been generated. This is used when
userspace generates its own coredumps.
* COREDUMP_REJECT
The kernel will skip generating a coredump for this task.
* COREDUMP_WAIT
The kernel will prevent the task from exiting until the coredump
server has shutdown the socket connection.
The flexible coredump socket can be enabled by using the "@@"
prefix instead of the single "@" prefix for the regular coredump
socket:
@@/run/systemd/coredump.socket
- Cleanup the coredump code properly while we have to touch it
anyway.
Split out each coredump mode in a separate helper so it's easy to
grasp what is going on and make the code easier to follow. The core
coredump function should now be very trivial to follow"
* tag 'vfs-6.17-rc1.coredump' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (31 commits)
cleanup: add a scoped version of CLASS()
coredump: add coredump_skip() helper
coredump: avoid pointless variable
coredump: order auto cleanup variables at the top
coredump: add coredump_cleanup()
coredump: auto cleanup prepare_creds()
cred: add auto cleanup method
coredump: directly return
coredump: auto cleanup argv
coredump: add coredump_write()
coredump: use a single helper for the socket
coredump: move pipe specific file check into coredump_pipe()
coredump: split pipe coredumping into coredump_pipe()
coredump: move core_pipe_count to global variable
coredump: prepare to simplify exit paths
coredump: split file coredumping into coredump_file()
coredump: rename do_coredump() to vfs_coredump()
selftests/coredump: make sure invalid paths are rejected
coredump: validate socket path in coredump_parse()
coredump: don't allow ".." in coredump socket path
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaIM/KwAKCRCRxhvAZXjc
opT+AP407JwhRSBjUEmHg5JzUyDoivkOySdnthunRjaBKD8rlgEApM6SOIZYucU7
cPC3ZY6ORFM6Mwaw+iDW9lasM5ucHQ8=
=CHha
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.17-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc VFS updates from Christian Brauner:
"This contains the usual selections of misc updates for this cycle.
Features:
- Add ext4 IOCB_DONTCACHE support
This refactors the address_space_operations write_begin() and
write_end() callbacks to take const struct kiocb * as their first
argument, allowing IOCB flags such as IOCB_DONTCACHE to propagate
to the filesystem's buffered I/O path.
Ext4 is updated to implement handling of the IOCB_DONTCACHE flag
and advertises support via the FOP_DONTCACHE file operation flag.
Additionally, the i915 driver's shmem write paths are updated to
bypass the legacy write_begin/write_end interface in favor of
directly calling write_iter() with a constructed synchronous kiocb.
Another i915 change replaces a manual write loop with
kernel_write() during GEM shmem object creation.
Cleanups:
- don't duplicate vfs_open() in kernel_file_open()
- proc_fd_getattr(): don't bother with S_ISDIR() check
- fs/ecryptfs: replace snprintf with sysfs_emit in show function
- vfs: Remove unnecessary list_for_each_entry_safe() from
evict_inodes()
- filelock: add new locks_wake_up_waiter() helper
- fs: Remove three arguments from block_write_end()
- VFS: change old_dir and new_dir in struct renamedata to dentrys
- netfs: Remove unused declaration netfs_queue_write_request()
Fixes:
- eventpoll: Fix semi-unbounded recursion
- eventpoll: fix sphinx documentation build warning
- fs/read_write: Fix spelling typo
- fs: annotate data race between poll_schedule_timeout() and
pollwake()
- fs/pipe: set FMODE_NOWAIT in create_pipe_files()
- docs/vfs: update references to i_mutex to i_rwsem
- fs/buffer: remove comment about hard sectorsize
- fs/buffer: remove the min and max limit checks in __getblk_slow()
- fs/libfs: don't assume blocksize <= PAGE_SIZE in
generic_check_addressable
- fs_context: fix parameter name in infofc() macro
- fs: Prevent file descriptor table allocations exceeding INT_MAX"
* tag 'vfs-6.17-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (24 commits)
netfs: Remove unused declaration netfs_queue_write_request()
eventpoll: fix sphinx documentation build warning
ext4: support uncached buffered I/O
mm/pagemap: add write_begin_get_folio() helper function
fs: change write_begin/write_end interface to take struct kiocb *
drm/i915: Refactor shmem_pwrite() to use kiocb and write_iter
drm/i915: Use kernel_write() in shmem object create
eventpoll: Fix semi-unbounded recursion
vfs: Remove unnecessary list_for_each_entry_safe() from evict_inodes()
fs/libfs: don't assume blocksize <= PAGE_SIZE in generic_check_addressable
fs/buffer: remove the min and max limit checks in __getblk_slow()
fs: Prevent file descriptor table allocations exceeding INT_MAX
fs: Remove three arguments from block_write_end()
fs/ecryptfs: replace snprintf with sysfs_emit in show function
fs: annotate suspected data race between poll_schedule_timeout() and pollwake()
docs/vfs: update references to i_mutex to i_rwsem
fs/buffer: remove comment about hard sectorsize
fs_context: fix parameter name in infofc() macro
VFS: change old_dir and new_dir in struct renamedata to dentrys
proc_fd_getattr(): don't bother with S_ISDIR() check
...
this is getting too much for one merge window as it is.
* mount hash conflicts rudiments are gone now - we do not allow
multiple mounts with the same parent/mountpoint to be
hashed at the same time.
* struct mount changes
mnt_umounting is gone;
mnt_slave_list/mnt_slave is an hlist now;
overmounts are kept track of by explicit pointer in mount;
a bunch of flags moved out of mnt_flags to a new field,
with only namespace_sem for protection;
mnt_expiry is protected by mount_lock now (instead of
namespace_sem);
MNT_LOCKED is used only for mounts that need to remain
attached to their parents to prevent mountpoint exposure -
no more overloading it for absolute roots;
all mnt_list uses are transient now - it's used only to
represent temporary sets during umount_tree().
* mount refcounting change
children no longer pin parents for any mounts, whether they'd
passed through umount_tree() or not.
* struct mountpoint changes
refcount is no more; what matters is ->m_list emptiness;
instead of temporary bumping the refcount, we insert a new object
(pinned_mountpoint) into ->m_list;
new calling conventions for lock_mount() and friends.
* do_move_mount()/attach_recursive_mnt() seriously cleaned up.
* globals in fs/pnode.c are gone.
* propagate_mnt(), change_mnt_propagation() and propagate_umount() cleaned up
(in the last case - pretty much completely rewritten).
* freeing of emptied mnt_namespace is done in namespace_unlock()
for one thing, there are subtle ordering requirements there;
for another it simplifies cleanups.
* assorted cleanups.
* restore the machinery for long-term mounts from accumulated bitrot.
This is going to get a followup come next cycle, when #work.fs_context
with its change of vfs_fs_parse_string() calling conventions goes
into -next.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCaIR2dQAKCRBZ7Krx/gZQ
6/SzAP4x+Fjjc5Tm2UNgGW5dptDY5s9O5RuFauo1MM6rcrekagEApTarcMlPnZvC
mj1TVJFNfdVhZyTXnz5ocHhGX1udmgU=
=qT69
-----END PGP SIGNATURE-----
Merge tag 'pull-mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs mount updates from Al Viro:
- mount hash conflicts rudiments are gone now - we do not allow
multiple mounts with the same parent/mountpoint to be hashed at the
same time.
- 'struct mount' changes:
- mnt_umounting is gone
- mnt_slave_list/mnt_slave is an hlist now
- overmounts are kept track of by explicit pointer in mount
- a bunch of flags moved out of mnt_flags to a new field, with
only namespace_sem for protection
- mnt_expiry is protected by mount_lock now (instead of
namespace_sem)
- MNT_LOCKED is used only for mounts that need to remain attached
to their parents to prevent mountpoint exposure - no more
overloading it for absolute roots
- all mnt_list uses are transient now - it's used only to
represent temporary sets during umount_tree()
- mount refcounting change: children no longer pin parents for any
mounts, whether they'd passed through umount_tree() or not
- 'struct mountpoint' changes:
- refcount is no more; what matters is ->m_list emptiness
- instead of temporary bumping the refcount, we insert a new
object (pinned_mountpoint) into ->m_list
- new calling conventions for lock_mount() and friends
- do_move_mount()/attach_recursive_mnt() seriously cleaned up
- globals in fs/pnode.c are gone
- propagate_mnt(), change_mnt_propagation() and propagate_umount()
cleaned up (in the last case - pretty much completely rewritten).
- freeing of emptied mnt_namespace is done in namespace_unlock(). For
one thing, there are subtle ordering requirements there; for another
it simplifies cleanups.
- assorted cleanups
- restore the machinery for long-term mounts from accumulated bitrot.
This is going to get a followup come next cycle, when the change of
vfs_fs_parse_string() calling conventions goes into -next
* tag 'pull-mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (48 commits)
statmount_mnt_basic(): simplify the logics for group id
invent_group_ids(): zero ->mnt_group_id always implies !IS_MNT_SHARED()
get rid of CL_SHARE_TO_SLAVE
take freeing of emptied mnt_namespace to namespace_unlock()
copy_tree(): don't link the mounts via mnt_list
change_mnt_propagation(): move ->mnt_master assignment into MS_SLAVE case
mnt_slave_list/mnt_slave: turn into hlist_head/hlist_node
turn do_make_slave() into transfer_propagation()
do_make_slave(): choose new master sanely
change_mnt_propagation(): do_make_slave() is a no-op unless IS_MNT_SHARED()
change_mnt_propagation() cleanups, step 1
propagate_mnt(): fix comment and convert to kernel-doc, while we are at it
propagate_mnt(): get rid of last_dest
fs/pnode.c: get rid of globals
propagate_one(): fold into the sole caller
propagate_one(): separate the "what should be the master for this copy" part
propagate_one(): separate the "do we need secondary here?" logics
propagate_mnt(): handle all peer groups in the same loop
propagate_one(): get rid of dest_master
mount: separate the flags accessed only under namespace_sem
...
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCaIRLdAAKCRBZ7Krx/gZQ
6ysfAQDtJazbwSdu5MFLZs0YBv757xiWvsYBWWmgEPedtTRYnAEAnawe/IkQ4IAd
9c5h4IPLOK9wwwFWgHY60L1pwlwrSAY=
=B0Yx
-----END PGP SIGNATURE-----
Merge tag 'pull-ceph-d_name-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull ceph dentry->d_name fixes from Al Viro:
"Stuff that had fallen through the cracks back in February; ceph folks
tested that pile and said they prefer to have it go through my tree..."
* tag 'pull-ceph-d_name-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
ceph: fix a race with rename() in ceph_mdsc_build_path()
prep for ceph_encode_encrypted_fname() fixes
[ceph] parse_longname(): strrchr() expects NUL-terminated string
APIs provided to the rest of the kernel.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCaIRDbQAKCRBZ7Krx/gZQ
63n6APwNnJXwgtSDi9N0FfHOlYqYSCaCjezVLbq+GR8K+r4wowD/TX/A4Qbyjjic
/VG8VbYe6fRaD53vp1giGI/dJiTI2Qg=
=Ta4H
-----END PGP SIGNATURE-----
Merge tag 'pull-rpc_pipefs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull rpc_pipefs updates from Al Viro:
"Massage rpc_pipefs to use saner primitives and clean up the APIs
provided to the rest of the kernel"
* tag 'pull-rpc_pipefs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
rpc_create_client_dir(): return 0 or -E...
rpc_create_client_dir(): don't bother with rpc_populate()
rpc_new_dir(): the last argument is always NULL
rpc_pipe: expand the calls of rpc_mkdir_populate()
rpc_gssd_dummy_populate(): don't bother with rpc_populate()
rpc_mkpipe_dentry(): switch to simple_start_creating()
rpc_pipe: saner primitive for creating regular files
rpc_pipe: saner primitive for creating subdirectories
rpc_pipe: don't overdo directory locking
rpc_mkpipe_dentry(): saner calling conventions
rpc_unlink(): saner calling conventions
rpc_populate(): lift cleanup into callers
rpc_unlink(): use simple_recursive_removal()
rpc_{rmdir_,}depopulate(): use simple_recursive_removal() instead
rpc_pipe: clean failure exits in fill_super
new helper: simple_start_creating()
places; unfortunately, it's easy to get wrong. A number of open-coded
attempts are out there, with varying amount of bogosities.
simple_recursive_removal() had been introduced for doing that with
all precautions needed; it does an equivalent of rm -rf, with sufficient
locking, eviction of anything mounted on top of the subtree, etc.
This series converts a bunch of open-coded instances to using that.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCaIRCUwAKCRBZ7Krx/gZQ
66XWAP9BNyHcvl9uV/ku/mswYiRBxYoVogciIKeugwYTVLuTJgEA7jdh1eyLkvbS
rwbL7XD+Q35/vXZHEet+RLCGH3ae6wc=
=yaKF
-----END PGP SIGNATURE-----
Merge tag 'pull-simple_recursive_removal' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull simple_recursive_removal() update from Al Viro:
"Removing subtrees of kernel filesystems is done in quite a few places;
unfortunately, it's easy to get wrong. A number of open-coded attempts
are out there, with varying amount of bogosities.
simple_recursive_removal() had been introduced for doing that with all
precautions needed; it does an equivalent of rm -rf, with sufficient
locking, eviction of anything mounted on top of the subtree, etc.
This series converts a bunch of open-coded instances to using that"
* tag 'pull-simple_recursive_removal' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
functionfs, gadgetfs: use simple_recursive_removal()
kill binderfs_remove_file()
fuse_ctl: use simple_recursive_removal()
pstore: switch to locked_recursive_removal()
binfmt_misc: switch to locked_recursive_removal()
spufs: switch to locked_recursive_removal()
add locked_recursive_removal()
better lockdep annotations for simple_recursive_removal()
simple_recursive_removal(): saner interaction with fsnotify
w/ "mode=lfs" mount option, generic/299 will cause system panic as below:
------------[ cut here ]------------
kernel BUG at fs/f2fs/segment.c:2835!
Call Trace:
<TASK>
f2fs_allocate_data_block+0x6f4/0xc50
f2fs_map_blocks+0x970/0x1550
f2fs_iomap_begin+0xb2/0x1e0
iomap_iter+0x1d6/0x430
__iomap_dio_rw+0x208/0x9a0
f2fs_file_write_iter+0x6b3/0xfa0
aio_write+0x15d/0x2e0
io_submit_one+0x55e/0xab0
__x64_sys_io_submit+0xa5/0x230
do_syscall_64+0x84/0x2f0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0010:new_curseg+0x70f/0x720
The root cause of we run out-of-space is: in f2fs_map_blocks(), f2fs may
trigger foreground gc only if it allocates any physical block, it will be
a little bit later when there is multiple threads writing data w/
aio/dio/bufio method in parallel, since we always use OPU in lfs mode, so
f2fs_map_blocks() does block allocations aggressively.
In order to fix this issue, let's give a chance to trigger foreground
gc in prior to block allocation in f2fs_map_blocks().
Fixes: 36abef4e79 ("f2fs: introduce mode=lfs mount option")
Cc: Daeho Jeong <daehojeong@google.com>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In lfs mode, dirty data needs OPU, we'd better calculate lower_p and
upper_p w/ them during has_not_enough_free_secs(), otherwise we may
encounter out-of-space issue due to we missed to reclaim enough
free section w/ foreground gc.
Fixes: 36abef4e79 ("f2fs: introduce mode=lfs mount option")
Cc: Daeho Jeong <daehojeong@google.com>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Commit 1acd73edbb ("f2fs: fix to account dirty data in __get_secs_required()")
missed to calculate upper_p w/ data_secs, fix it.
Fixes: 1acd73edbb ("f2fs: fix to account dirty data in __get_secs_required()")
Cc: Daeho Jeong <daehojeong@google.com>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When we need to alloc nat entry and set it dirty, we can directly add it to
dirty set list(or initialize its list_head for new_ne) instead of adding it
to clean list and make a move. Introduce init_dirty flag to do it.
Signed-off-by: wangzijie <wangzijie1@honor.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
__lookup_nat_cache follows LRU manner to move clean nat entry, when nat
entries are going to be dirty, no need to move them to tail of lru list.
Introduce a parameter 'for_dirty' to avoid it.
Signed-off-by: wangzijie <wangzijie1@honor.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The basic rules are simple:
* stores to dentry->d_flags are OK under dentry->d_lock.
* stores to dentry->d_flags are OK in the dentry constructor, before
becomes potentially visible to other threads.
Unfortunately, there's a couple of exceptions to that, and that's where the
headache comes from.
Main PITA comes from d_set_d_op(); that primitive sets ->d_op
of dentry and adjusts the flags that correspond to presence of individual
methods. It's very easy to misuse; existing uses _are_ safe, but proof
of correctness is brittle.
Use in __d_alloc() is safe (we are within a constructor), but we
might as well precalculate the initial value of ->d_flags when we set
the default ->d_op for given superblock and set ->d_flags directly
instead of messing with that helper.
The reasons why other uses are safe are bloody convoluted; I'm not going
to reproduce it here. See https://lore.kernel.org/all/20250224010624.GT1977892@ZenIV/
for gory details, if you care. The critical part is using d_set_d_op() only
just prior to d_splice_alias(), which makes a combination of d_splice_alias()
with setting ->d_op, etc. a natural replacement primitive. Better yet, if
we go that way, it's easy to take setting ->d_op and modifying ->d_flags
under ->d_lock, which eliminates the headache as far as ->d_flags exclusion
rules are concerned. Other exceptions are minor and easy to deal with.
What this series does:
* d_set_d_op() is no longer available; new primitive (d_splice_alias_ops())
is provided, equivalent to combination of d_set_d_op() and d_splice_alias().
* new field of struct super_block - ->s_d_flags. Default value of ->d_flags
to be used when allocating dentries on this filesystem.
* new primitive for setting ->s_d_op: set_default_d_op(). Replaces stores
to ->s_d_op at mount time. All in-tree filesystems converted; out-of-tree
ones will get caught by compiler (->s_d_op is renamed, so stores to it will
be caught). ->s_d_flags is set by the same primitive to match the ->s_d_op.
* a lot of filesystems had ->s_d_op->d_delete equal to always_delete_dentry;
that is equivalent to setting DCACHE_DONTCACHE in ->d_flags, so such filesystems
can bloody well set that bit in ->s_d_flags and drop ->d_delete() from
dentry_operations. In quite a few cases that results in empty dentry_operations,
which means that we can get rid of those.
* kill simple_dentry_operations - not needed anymore.
* massage d_alloc_parallel() to get rid of the other exception wrt ->d_flags
stores - we can set DCACHE_PAR_LOOKUP as soon as we allocate the new dentry;
no need to delay that until we commit to using the sucker.
As the result, ->d_flags stores are all either under ->d_lock or done before
the dentry becomes visible in any shared data structures.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCaIQ/tQAKCRBZ7Krx/gZQ
66AhAQDgQ+S224x5YevNXc9mDoGUBMF4OG0n0fIla9rfdL4I6wEAqpOWMNDcVPCZ
GwYOvJ9YuqNdz+MyprAI18Yza4GOmgs=
=rTYB
-----END PGP SIGNATURE-----
Merge tag 'pull-dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull dentry d_flags updates from Al Viro:
"The current exclusion rules for dentry->d_flags stores are rather
unpleasant. The basic rules are simple:
- stores to dentry->d_flags are OK under dentry->d_lock
- stores to dentry->d_flags are OK in the dentry constructor, before
becomes potentially visible to other threads
Unfortunately, there's a couple of exceptions to that, and that's
where the headache comes from.
The main PITA comes from d_set_d_op(); that primitive sets ->d_op of
dentry and adjusts the flags that correspond to presence of individual
methods. It's very easy to misuse; existing uses _are_ safe, but proof
of correctness is brittle.
Use in __d_alloc() is safe (we are within a constructor), but we might
as well precalculate the initial value of 'd_flags' when we set the
default ->d_op for given superblock and set 'd_flags' directly instead
of messing with that helper.
The reasons why other uses are safe are bloody convoluted; I'm not
going to reproduce it here. See [1] for gory details, if you care. The
critical part is using d_set_d_op() only just prior to
d_splice_alias(), which makes a combination of d_splice_alias() with
setting ->d_op, etc a natural replacement primitive.
Better yet, if we go that way, it's easy to take setting ->d_op and
modifying 'd_flags' under ->d_lock, which eliminates the headache as
far as 'd_flags' exclusion rules are concerned. Other exceptions are
minor and easy to deal with.
What this series does:
- d_set_d_op() is no longer available; instead a new primitive
(d_splice_alias_ops()) is provided, equivalent to combination of
d_set_d_op() and d_splice_alias().
- new field of struct super_block - 's_d_flags'. This sets the
default value of 'd_flags' to be used when allocating dentries on
this filesystem.
- new primitive for setting 's_d_op': set_default_d_op(). This
replaces stores to 's_d_op' at mount time.
All in-tree filesystems converted; out-of-tree ones will get caught
by the compiler ('s_d_op' is renamed, so stores to it will be
caught). 's_d_flags' is set by the same primitive to match the
's_d_op'.
- a lot of filesystems had sb->s_d_op->d_delete equal to
always_delete_dentry; that is equivalent to setting
DCACHE_DONTCACHE in 'd_flags', so such filesystems can bloody well
set that bit in 's_d_flags' and drop 'd_delete()' from
dentry_operations.
In quite a few cases that results in empty dentry_operations, which
means that we can get rid of those.
- kill simple_dentry_operations - not needed anymore
- massage d_alloc_parallel() to get rid of the other exception wrt
'd_flags' stores - we can set DCACHE_PAR_LOOKUP as soon as we
allocate the new dentry; no need to delay that until we commit to
using the sucker.
As the result, 'd_flags' stores are all either under ->d_lock or done
before the dentry becomes visible in any shared data structures"
Link: https://lore.kernel.org/all/20250224010624.GT1977892@ZenIV/ [1]
* tag 'pull-dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (21 commits)
configfs: use DCACHE_DONTCACHE
debugfs: use DCACHE_DONTCACHE
efivarfs: use DCACHE_DONTCACHE instead of always_delete_dentry()
9p: don't bother with always_delete_dentry
ramfs, hugetlbfs, mqueue: set DCACHE_DONTCACHE
kill simple_dentry_operations
devpts, sunrpc, hostfs: don't bother with ->d_op
shmem: no dentry retention past the refcount reaching zero
d_alloc_parallel(): set DCACHE_PAR_LOOKUP earlier
make d_set_d_op() static
simple_lookup(): just set DCACHE_DONTCACHE
tracefs: Add d_delete to remove negative dentries
set_default_d_op(): calculate the matching value for ->d_flags
correct the set of flags forbidden at d_set_d_op() time
split d_flags calculation out of d_set_d_op()
new helper: set_default_d_op()
fuse: no need for special dentry_operations for root dentry
switch procfs from d_set_d_op() to d_splice_alias_ops()
new helper: d_splice_alias_ops()
procfs: kill ->proc_dops
...
The most unlikely watched permission event is FAN_ACCESS_PERM, because
at the time that it was introduced there were no evictable ignore mark,
so subscribing to FAN_ACCESS_PERM would have incured a very high
overhead.
Yet, when we set the fmode to FMODE_NOTIFY_HSM(), we never skip trying
to send FAN_ACCESS_PERM, which is almost always a waste of cycles.
We got to this logic because of bundling FAN_OPEN*_PERM and
FAN_ACCESS_PERM in the same category and because FAN_OPEN_PERM is a
commonly used event.
By open coding fsnotify_open_perm() in fsnotify_open_perm_and_set_mode(),
we no longer need to regard FAN_OPEN*_PERM when calculating fmode.
This leaves the case of having pre-content events and not having any
other permission event in the object masks a more likely case than the
other way around.
Rework the fmode macros and code so that their meaning now refers only
to hooks on an already open file:
- FMODE_NOTIFY_NONE() skip all events
- FMODE_NOTIFY_ACCESS_PERM() send all permission events including
FAN_ACCESS_PERM
- FMODE_NOTIFY_HSM() send pre-content permission events
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250708143641.418603-3-amir73il@gmail.com
Create helper fsnotify_open_perm_and_set_mode() that moves the
fsnotify_open_perm() hook into file_set_fsnotify_mode_from_watchers().
This will allow some more optimizations.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250708143641.418603-2-amir73il@gmail.com
If the NFS client is doing writeback from a workqueue context, avoid using
__GFP_NORETRY for allocations if the task has set PF_MEMALLOC_NOIO or
PF_MEMALLOC_NOFS. The combination of these flags makes memory allocation
failures much more likely.
We've seen those allocation failures show up when the loopback driver is
doing writeback from a workqueue to a file on NFS, where memory allocation
failure results in errors or corruption within the loopback device's
filesystem.
Suggested-by: Trond Myklebust <trondmy@kernel.org>
Fixes: 0bae835b63 ("NFS: Avoid writeback threads getting stuck in mempool_alloc()")
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Laurence Oberman <loberman@redhat.com>
Tested-by: Laurence Oberman <loberman@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/f83ac1155a4bc670f2663959a7a068571e06afd9.1752111622.git.bcodding@redhat.com
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
NFSD is finally able to offer write delegations to clients that open
files with O_WRONLY, thanks to patches from Dai Ngo. We're expecting
this to accelerate a few interesting corner cases.
The cap on the number of operations per NFSv4 COMPOUND has been
lifted. Now, clients that send COMPOUNDs containing dozens of
operations (for example, a long stream of LOOKUP operations to walk
a pathname in a single round trip) will no longer be rejected.
This release re-enables the ability for NFSD to perform NFSv4.2 COPY
operations asynchronously. This feature has been disabled to
mitigate the risk of denial-of-service when too many such requests
arrive.
Many thanks to the contributors, reviewers, testers, and bug
reporters who participated during the v6.17 development cycle.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmiFJAcACgkQM2qzM29m
f5fOvA/+I1W3iAXMeuS4MdBD+20976XZNazXKXXfJE9ay/0I7rXka0uD9HH+cTnU
3wY1p+jjTs+Tatc5A39MjuS9a6o23FnHZB7IOimL+9ASRjBgjXisOyb7yEnfcA4s
9NjM5sMHskmrNpLX5kDPNHzTMdaozGl/uSDKg5WSAU/NMrtAT9c9snx4bO5A6mdk
48XPkP5++aBKGehsPqI0WGOeSzGKI7dc/kJS9F8kIbBCAMJSbIY7PKly+y+fbJkk
eMapUX257DCRQejA6hnFff0/x1NnR2tC8lQAZE1c7P5D9CV+1UEAQWK4/OOD2aeQ
hY9Ieb7CFZRot3VDGnnrYjLbApiZCY9m10ukDTykPErJ4ZEWEjUtMN7oAhRN3/Ie
O2NKvyVo4bOI5zHf4iCIVNp/hDHs01FoMfJfQYRACpBtsIKm+1pn4uTJtrezhJvn
qsvctMEMtXwZDlntwhQwU54XJyyGq7gJwuRAZ5xgW6WWQQI+NNKUm2XZu3YwJZF+
4Ji2vj6kRpS46HWG0VRUX12hXdDZdwFjcZ7eXZiSL3gZJ3xuEJDQ3jyyRwfe5t+8
W6eQRW9Sq1gN4OLwWjfltqs9l52XYfw0jitmX8Y98l1K05a4X74iIPRC5s97HV0E
XfvW+jRS4+x7tRp4wwcI2cGTPRTdK8xjmRWM3l2PQzgeG3AHUs8=
=rC5G
-----END PGP SIGNATURE-----
Merge tag 'nfsd-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd updates from Chuck Lever:
"NFSD is finally able to offer write delegations to clients that open
files with O_WRONLY, thanks to patches from Dai Ngo. We're expecting
this to accelerate a few interesting corner cases.
The cap on the number of operations per NFSv4 COMPOUND has been
lifted. Now, clients that send COMPOUNDs containing dozens of
operations (for example, a long stream of LOOKUP operations to walk a
pathname in a single round trip) will no longer be rejected.
This release re-enables the ability for NFSD to perform NFSv4.2 COPY
operations asynchronously. This feature has been disabled to mitigate
the risk of denial-of-service when too many such requests arrive.
Many thanks to the contributors, reviewers, testers, and bug reporters
who participated during the v6.17 development cycle"
* tag 'nfsd-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (32 commits)
nfsd: Drop dprintk in blocklayout xdr functions
sunrpc: make svc_tcp_sendmsg() take a signed sentp pointer
sunrpc: rearrange struct svc_rqst for fewer cachelines
sunrpc: return better error in svcauth_gss_accept() on alloc failure
sunrpc: reset rq_accept_statp when starting a new RPC
sunrpc: remove SVC_SYSERR
sunrpc: fix handling of unknown auth status codes
NFSD: Simplify struct knfsd_fh
NFSD: Access a knfsd_fh's fsid by pointer
Revert "NFSD: Force all NFSv4.2 COPY requests to be synchronous"
NFSD: Avoid multiple -Wflex-array-member-not-at-end warnings
NFSD: Use vfs_iocb_iter_write()
NFSD: Use vfs_iocb_iter_read()
NFSD: Clean up kdoc for nfsd_open_local_fh()
NFSD: Clean up kdoc for nfsd_file_put_local()
NFSD: Remove definition for trace_nfsd_ctl_maxconn
NFSD: Remove definition for trace_nfsd_file_gc_recent
NFSD: Remove definitions for unused trace_nfsd_file_lru trace points
NFSD: Remove definition for trace_nfsd_file_unhash_and_queue
nfsd: Use correct error code when decoding extents
...
- Prevent cluster nodes from trying to recover their own filesystems
during a withdraw.
- Add two missing migrate_folio aops and an additional exhash directory
consistency check (both triggered by syzbot bug reports).
- Sanitize how dlm results are processed and clean up a few quirks in
the glock code.
- Minor stuff: Get rid of the GIF_ALLOC_FAILED flag; use SECTOR_SIZE and
SECTOR_SHIFT.
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEEJZs3krPW0xkhLMTc1b+f6wMTZToFAmiHiqQUHGFncnVlbmJh
QHJlZGhhdC5jb20ACgkQ1b+f6wMTZTrK9w/7BU//ySUwhfK2bXVxhXNgkMACv1UG
MXt08FgmVAg6CLVQzsCxT2J/vBnauY9KaEzNZTspMMZtHJLpIoy095bcPLO1SHFh
AbnLIcfX4wbdPJumeI6DJXSdeInVTBCjaPanHdZ9BvCcl9vSJ5WXUzHVfos2dcSl
P6xKhAOICE7T/ML03o81KtHEypblj3E34i7y1fIBJN2F2/eKCywQOmtYPa6ertIA
WNKiH6OR1YloQF4ddYkU7vpeLgQFItkHIDqGrr2R8FEByprBrs1FbkJKZGQHRZuq
0g5rRdA7KOjc/pJPPl0aiaSlPm6DQx6TJD+suDwerKpu0HZo65QuWCvVq6Dipkp5
Y0dgP1aPZw/LAcJ86BDI/OhBCPEQI4xa92mCPU9OofvgbjhtJi3a5l2GVZtmbbBW
Yp6o+rUCYYlYSzP0FDM9hIdYcpNdDC3v7u8IZ1bd9bgflIgoGb7TEqm8geZ4XcyC
chqjkRcffNRoxuon+Csfv/P/fSvaNxCqaCKTa+4+iewIFTYwpTj8pQqjw/cI6w7Y
8q6nLBFPvPU/nms6kQfOvG0YN1NpemFHp7uvdqwmkqDXMBw7x2mR5+J2bvl6XtCG
yuliqWNYcIvxRG4zlbK+nw4j6CQ71V4aTQVt0XJ9wKYhHLZTowOEm/nkat4O285G
BAwyQjpCULSAQFk=
=Sfn8
-----END PGP SIGNATURE-----
Merge tag 'gfs2-for-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 updates from Andreas Gruenbacher:
- Prevent cluster nodes from trying to recover their own filesystems
during a withdraw
- Add two missing migrate_folio aops and an additional exhash directory
consistency check (both triggered by syzbot bug reports)
- Sanitize how dlm results are processed and clean up a few quirks in
the glock code
- Minor stuff: Get rid of the GIF_ALLOC_FAILED flag; use SECTOR_SIZE
and SECTOR_SHIFT
* tag 'gfs2-for-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: No more self recovery
gfs2: Validate i_depth for exhash directories
gfs2: Set .migrate_folio in gfs2_{rgrp,meta}_aops
gfs2: a minor finish_xmote cleanup
gfs2: simplify finish_xmote
gfs2: sanitize the gdlm_ast -> finish_xmote interface
gfs2: Minor do_xmote cancelation fix
gfs2: Remove GIF_ALLOC_FAILED flag
gfs2: Use SECTOR_SIZE and SECTOR_SHIFT
Signed-off-by: Carlos Maiolino <cem@kernel.org>
-----BEGIN PGP SIGNATURE-----
iJUEABMJAB0WIQSmtYVZ/MfVMGUq1GNcsMJ8RxYuYwUCaIcswwAKCRBcsMJ8RxYu
Y5UqAYCWhEmZNTd0aN4kR3xqGMV+/3bubml4TXB8HvIv35gaOrFCgQkytVJLD74e
A0WtR+4BgJOIKmYfz4AAx6vbsoMWevya1f879i7NI7SYuI0/oB1klRtApZF0Slwv
OKJsN/s6Sg==
=/Q43
-----END PGP SIGNATURE-----
Merge tag 'xfs-merge-6.17' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs updates from Carlos Maiolino:
"This doesn't contain any new features. It mostly is a collection of
clean ups and code refactoring that I preferred to postpone to the
merge window.
It includes removal of several unused tracepoints, refactoring key
comparing routines under the B-Trees management and cleanup of xfs
journaling code"
* tag 'xfs-merge-6.17' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (44 commits)
xfs: don't use a xfs_log_iovec for ri_buf in log recovery
xfs: don't use a xfs_log_iovec for attr_item names and values
xfs: use better names for size members in xfs_log_vec
xfs: cleanup the ordered item logic in xlog_cil_insert_format_items
xfs: don't pass the old lv to xfs_cil_prepare_item
xfs: remove unused trace event xfs_reflink_cow_enospc
xfs: remove unused trace event xfs_discard_rtrelax
xfs: remove unused trace event xfs_log_cil_return
xfs: remove unused trace event xfs_dqreclaim_dirty
fs/xfs: replace strncpy with memtostr_pad()
xfs: Remove unused label in xfs_dax_notify_dev_failure
xfs: improve the comments in xfs_select_zone_nowait
xfs: improve the comments in xfs_max_open_zones
xfs: stop passing an inode to the zone space reservation helpers
xfs: rename oz_write_pointer to oz_allocated
xfs: use a uint32_t to cache i_used_blocks in xfs_init_zone
xfs: improve the xg_active_ref check in xfs_group_free
xfs: remove the xlog_ticket_t typedef
xfs: remove xrep_trans_{alloc,cancel}_hook_dummy
xfs: return the allocated transaction from xchk_trans_alloc_empty
...
Currently, when the server supports NFS4.1 security labels then
security.selinux label in included twice. Instead, only add it
when the server doesn't possess security label support.
Fixes: 243fea1346 ("NFSv4.2: fix listxattr to return selinux security label")
Signed-off-by: Olga Kornievskaia <okorniev@redhat.com>
Link: https://lore.kernel.org/r/20250722205641.79394-1-okorniev@redhat.com
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
- Add support for metadata compression;
- Enable readahead for directories to improve readdir performance;
- Minor fixes and cleanups.
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEQ0A6bDUS9Y+83NPFUXZn5Zlu5qoFAmiGgHARHHhpYW5nQGtl
cm5lbC5vcmcACgkQUXZn5Zlu5qrDRA/8CYk5EJVvk5+AtpJJobxYsKR08ERs9QUi
AKjOMDfMkuDsHG5K8M9dUiEm3Xf0PL0+ukqwzlaDCzZswbwoYBp+ghXRIHjWA53Z
8i6Oj1Xdd7lNejKqzKzIEaZY2RuplMtxrRZMrkRVpAfJcv1YQfE142h++kTUHqKr
4ON14L6es7HDPnadRq8Ska3aAAmz3XNNiUUdHZ7Qwx32eH9wtu+ihpLF5R2E2v6o
H4tegJa9lIUF1aLhelYwIqU9racEuqUcsm/s8qFN0HafbRl8LfPq75neMPfXgVOV
Wc8O9yNiD53fMYW7hASOOMl2uTEcid1aLSZ4vfbXAJWS3dQoEoWb8XdLbSz3uIC7
oECRikN/DfhkxmQanMaTt5oRO/PkLQvJk6AtksbJ4kxYhWApQgrJM6YLHE34GNev
le6lyOq9xMIye/tRcywzjCP5FtIg+i28QEK6RVEOUpr2m8SpPhOPqk48q5ZqZXkj
hv9l6y0d1alUSykK/2gNon3b0R3savYPagm1kFkoJbumAVnx7gZjNvtPNhW8UdWv
GiMX3uydpxPm3ar/fNhcjbts/bPBSKHD2zJUCYslIB15cPe105on5a5HmAyV6o2i
ELxWzVsrjmwHY2QLRKfLFkLabaclMn1Qy0VlkNYo/3Z7EjcD9QHjIb+Wdx3gGAvu
45Vr/cJypjM=
=Tl9j
-----END PGP SIGNATURE-----
Merge tag 'erofs-for-6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs updates from Gao Xiang:
"We now support metadata compression. It can be useful for embedded use
cases or archiving a large number of small files.
Additionally, readdir performance has been improved by enabling
readahead (note that it was already common practice for ext3/4 non-dx
and f2fs directories). We may consider further improvements later to
align with ext4's s_inode_readahead_blks behavior for slow devices
too.
The remaining commits are minor.
Summary:
- Add support for metadata compression
- Enable readahead for directories to improve readdir performance
- Minor fixes and cleanups"
* tag 'erofs-for-6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: support to readahead dirent blocks in erofs_readdir()
erofs: implement metadata compression
erofs: add on-disk definition for metadata compression
erofs: fix build error with CONFIG_EROFS_FS_ZIP_ACCEL=y
erofs: remove ENOATTR definition
erofs: refine erofs_iomap_begin()
erofs: unify meta buffers in z_erofs_fill_inode()
erofs: remove need_kmap in erofs_read_metabuf()
erofs: do sanity check on m->type in z_erofs_load_compact_lcluster()
erofs: get rid of {get,put}_page() for ztailpacking data
Added:
sanity check for file name;
mark live inode as bad and avoid any operations.
Fixed:
handling of symlinks created in windows;
creation of symlinks for relative path.
Changed:
cancel setting inode as bad after removing name fails;
revert "replace inode_trylock with inode_lock".
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEh0DEKNP0I9IjwfWEqbAzH4MkB7YFAmiDiMwACgkQqbAzH4Mk
B7aJFg//Q/L5xNxtXf3bUDCGhEYfex2qvvSnMg8MZEcLXJI9DW/Z701Xs7ay8BBL
Gp982rSX87hnnA/+1YutzAEiyTCnwRWPKNocrfPEqzXDFl4t/4Rua1Wf+Bw2/p5C
YjGd8zFbFel3NDkZa94KSCV6ATsPMnF3UN3Zj0+izyoaxQPnpoAaaOsxS2PlK+97
ePZ9PWeAe6vjW3myKQzSKz8vjeb2bNK3R67Z27igCi78jjsuI6BomBhKjGN3iOk5
eZjCrWL+DesTN43EyV8yL72Mha+BvT5XQx83gzBKghhKCBXa2BMZo3l+KDI4pGr6
Mmynw1i+o1Z+P+xvCAPEiil6x6DNauUCUtgKJEEoRatGjGOPRM4U/NWZiXITZuFD
zkSmToA3ZtJA0XBHRnoPSDg0wEQ965sy6lsGs/O1sUfURJ51lc9EwmnWmGmq4Fs+
HZ3aNbaizi1TEx89R+BEaE7bZqSK0cv4nHHtcepTgVimgSzz02uvCpz9SLLs3JsB
g9a3b6M68Rtex2+unos5Cw9Afq1Y34wcPFn9rAK8+LashAgKKhRuUqwTaZ1k7nMU
hZIFsXCxvgnQgmbF5Dod8uA/2eO+U7BXixZWvGqhAliPuLSrubhb7dZrSANRkFlt
lv9qLmrcD/cpa/JvnrqOf6NyEc9vFoowvio3JQxkIhf7wQviv44=
=OGQG
-----END PGP SIGNATURE-----
Merge tag 'ntfs3_for_6.17' of https://github.com/Paragon-Software-Group/linux-ntfs3
Pull ntfs3 updates from Konstantin Komarov:
"Added:
- sanity check for file name
- mark live inode as bad and avoid any operations
Fixed:
- handling of symlinks created in windows
- creation of symlinks for relative path
Changed:
- cancel setting inode as bad after removing name fails
- revert 'replace inode_trylock with inode_lock'"
* tag 'ntfs3_for_6.17' of https://github.com/Paragon-Software-Group/linux-ntfs3:
Revert "fs/ntfs3: Replace inode_trylock with inode_lock"
fs/ntfs3: Exclude call make_bad_inode for live nodes.
fs/ntfs3: cancle set bad inode after removing name fails
fs/ntfs3: Add sanity check for file name
fs/ntfs3: correctly create symlink for relative path
fs/ntfs3: fix symlinks cannot be handled correctly
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmiAG9MACgkQxWXV+ddt
WDvClg/8DZPAi4MPAtyDTx5Fbm6t6znopWloDORL1+Kg4fMF7KY+ZKMN8hHv57Ov
j+TvHIZ2YdOMYncA8BUYPMY88QUSY+V9fVZoXufFdEXUEWT5v70UmOCK0Z7Ml5L/
FDxypDZVS+hN7zyoLmzRz9DTgXjmiAgRUsw/77MCcTCS1Y6mQJBWktpTUXnEiUXH
9VZ//AUCrNOvXlqyxOLwvGDdkVjvHzgahXzvRYu+rzlGI3kuY+WWEGijCwivd1AW
y2LA1PyqabRC0STlGC9KK2SMymvDdvf2e3iapTHMaj/HqBmlY0eCSh9jQxIqbUf7
oIq92xIjNbTi3rm4rK+FvxSPxPsfUhS7TPBANHhpcYitJuUXBiLcsT8tvnNN+rjp
vIxfkhVuUtSjMDiHgfD7KrGhg9vfYYig5Amip5BAJ0pi/MOP3hMzgT03rFq6iiOi
qP76K2qVNEqwkk07dapL7IVMcaXxldSLHg4Jo2XD79rGVlXiHoFok9K7cAMwHjnD
PGaFYDqQuRmpltp7oigcPIhcSHJ1gPnU+vmIWZ3lcwqZJ7b5kAoEDdELFcdLkMqC
kmS6RUwoWhNY6gsI4e9Roigr4shxB6zv/+mOzV4OSXl757Iuy8Kirl4AMPPElE91
y8+MqS/42Etxt4cmh5w51Ynb5gSpiI8DsPRdasIJanFFrXC7ZlM=
=b9RC
-----END PGP SIGNATURE-----
Merge tag 'for-6.17-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"A number of usability and feature updates, scattered performance
improvements and fixes. Highlight of the core changes is getting
closer to enabling large folios (now behind a config option).
User visible changes:
- update defrag ioctl, add new flag to request no compression on
existing extents
- restrict writes to block devices after mount
- in experimental config, enable large folios for data, almost
complete but not widely tested
- add stats tracking duration of critical section in transaction
commit to /sys/fs/btrfs/FSID/commit_stats
Performance improvements:
- caching of lookup results of free space bitmap (20% runtime
improvement on an empty file creation benchmark)
- accessors to metadata (b-tree items) simplified and optimized,
minor improvement in metadata-heavy workloads
- readahead on compressed data improves sequential read
- the xarray for extent buffers is indexed by denser keys, leading to
better packing of the nodes (50-70% reduction of leaf nodes)
Notable fixes:
- stricter compression mount option parsing
- send properly emits fallocate command for file holes when protocol
v2 is used
- fix overallocation of chunks with mount option 'ssd_spread', due to
interaction with size classes not finding the right chunk
(workaround: manual reclaim by 'usage' balance filter)
- various quota enable/disable races with rescan, more verbose
notifications about inconsistent state
- populate otime in tree-log during log replay
- handle ENOSPC when NOCOW file is used with mmap()
Core:
- large data folios enabled in experimental config
- improved error handling, transaction abort call sites
- in zoned mode, allocate reloc block group on mount to make sure
there's always one available for zone reclaim under heavy load
- rework device opening, they're always open as read-only and delayed
until the super block is created, allowing the restricted writes
after mount
- preparatory work for adding blk_holder_ops, allowing device
freeze/thaw in the future
Cleanups, refactoring:
- type and naming unifications (int/bool, return variables)
- rb-tree helper refactoring and simplifications
- reorder memory allocations to less critical places
- RCU string (used for device name) refactoring and API removal
- replace all remaining use of strcpy()"
* tag 'for-6.17-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (209 commits)
btrfs: send: use fallocate for hole punching with send stream v2
btrfs: unfold transaction aborts when writing dirty block groups
btrfs: use saner variable type and name to indicate extrefs at add_inode_ref()
btrfs: don't skip remaining extrefs if dir not found during log replay
btrfs: don't ignore inode missing when replaying log tree
btrfs: enable large data folios for data reloc inode
btrfs: output more info when btrfs_subpage_assert() failed
btrfs: reloc: unconditionally invalidate the page cache for each cluster
btrfs: defrag: add flag to force no-compression
btrfs: fix ssd_spread overallocation
btrfs: zoned: requeue to unused block group list if zone finish failed
btrfs: zoned: do not remove unwritten non-data block group
btrfs: remove btrfs_clear_extent_bits()
btrfs: use cached state when falling back from NOCoW write to CoW write
btrfs: set EXTENT_NORESERVE before range unlock in btrfs_truncate_block()
btrfs: don't print relocation messages from auto reclaim
btrfs: remove redundant auto reclaim log message
btrfs: make btrfs_check_nocow_lock() check more than one extent
btrfs: assert we can NOCOW the range in btrfs_truncate_block()
btrfs: update function comment for btrfs_check_nocow_lock()
...
Steal string reference from @param->string rather than duplicating it.
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Steal string reference from @param->string rather than duplicating it.
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Steal string reference from @param->string rather than duplicating it.
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Steal string reference from @param->string rather than duplicating it.
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Steal string reference from @param->string rather than duplicating it.
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
SMB1 already supports querying reparse points and detecting types of
symlink, fifo, socket, block and char.
This change implements the missing part - ability to create a new reparse
points over SMB1. This includes everything which SMB2+ already supports:
- native SMB symlinks and sockets
- NFS style of special files (symlinks, fifos, sockets, char/block devs)
- WSL style of special files (symlinks, fifos, sockets, char/block devs)
Attaching a reparse point to an existing file or directory is done via
SMB1 SMB_COM_NT_TRANSACT/NT_TRANSACT_IOCTL/FSCTL_SET_REPARSE_POINT command
and implemented in a new cifs_create_reparse_inode() function.
This change introduce a new callback ->create_reparse_inode() which creates
a new reperse point file or directory and returns inode. For SMB1 it is
provided via that new cifs_create_reparse_inode() function.
Existing reparse.c code was only slightly updated to call new protocol
callback ->create_reparse_inode() instead of hardcoded SMB2+ function.
This make the whole reparse.c code to work with every SMB dialect.
The original callback ->create_reparse_symlink() is not needed anymore as
the implementation of new create_reparse_symlink() function is dialect
agnostic too. So the link.c code was updated to call that function directly
(and not via callback).
Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
WSL EAs are not required for native SMB symlinks, so do not query them from server.
Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
When not searching for child entries with msearch wildcard pattern then ask
server just for one output entry. There is no need to ask for more entries
as we are interested only for one search result, as we are doing query on
path.
CIFSFindFirst() with msearch=false is called by the cifs_query_path_info()
function.
Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
To query root path (without msearch wildcard) it is needed to
send pattern '\' instead of '' (empty string).
This allows to use CIFSFindFirst() to query information about root path
which is being used in followup changes.
This change fixes the stat() syscall called on the root path on the mount.
It is because stat() syscall uses the cifs_query_path_info() function and
it can fallback to the CIFSFindFirst() usage with msearch=false.
Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Some servers might enforce the SPN to be set in the target info
blob (AV pairs) when sending NTLMSSP_AUTH message. In Windows Server,
this could be enforced with SmbServerNameHardeningLevel set to 2.
Fix this by always appending SPN (cifs/<hostname>) to the existing
list of target infos when setting up NTLMv2 response blob.
Cc: linux-cifs@vger.kernel.org
Cc: David Howells <dhowells@redhat.com>
Reported-by: Pierguido Lambri <plambri@redhat.com>
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Zero-length AV pairs should be considered as valid target infos.
Don't skip the next AV pairs that follow them.
Cc: linux-cifs@vger.kernel.org
Cc: David Howells <dhowells@redhat.com>
Fixes: 0e8ae9b953 ("smb: client: parse av pair type 4 in CHALLENGE_MESSAGE")
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
The handlecache code today tracks the time at which dir lease was
acquired and the laundromat thread uses that to check for old
entries to cleanup.
However, if a directory is actively accessed, it should not
be chosen to expire first.
This change adds a new last_access_time field to cfid and
uses that to decide expiry of the cfid.
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
cached_dir_lease_break() has return type as int but only
returning true or false. change return type of this function
to bool for clarity.
Signed-off-by: Bharath SM <bharathsm@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
We now do a weighted selection of server interfaces when allocating
new channels. The weights are decided based on the speed advertised.
The fulfilled weight for an interface is a counter that is used to
track the interface selection. It should be reset back to zero once
all interfaces fulfilling their weight.
In cifs_chan_update_iface, this reset logic was missing. As a result
when the server interface list changes, the client may not be able
to find a new candidate for other channels after all interfaces have
been fulfilled.
Fixes: a6d8fb54a5 ("cifs: distribute channels across interfaces based on speed")
Cc: <stable@vger.kernel.org>
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
After commit 5c70eb5c59 ("net: better track kernel sockets lifetime"),
kernel sockets now use net_passive reference counting. However, commit
95d2b9f693 ("Revert "smb: client: fix TCP timers deadlock after rmmod"")
restored the manual socket refcount manipulation without adapting to this
new mechanism, causing a memory leak.
The issue can be reproduced by[1]:
1. Creating a network namespace
2. Mounting and Unmounting CIFS within the namespace
3. Deleting the namespace
Some memory leaks may appear after a period of time following step 3.
unreferenced object 0xffff9951419f6b00 (size 256):
comm "ip", pid 447, jiffies 4294692389 (age 14.730s)
hex dump (first 32 bytes):
1b 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 80 77 c2 44 51 99 ff ff .........w.DQ...
backtrace:
__kmem_cache_alloc_node+0x30e/0x3d0
__kmalloc+0x52/0x120
net_alloc_generic+0x1d/0x30
copy_net_ns+0x86/0x200
create_new_namespaces+0x117/0x300
unshare_nsproxy_namespaces+0x60/0xa0
ksys_unshare+0x148/0x360
__x64_sys_unshare+0x12/0x20
do_syscall_64+0x59/0x110
entry_SYSCALL_64_after_hwframe+0x78/0xe2
...
unreferenced object 0xffff9951442e7500 (size 32):
comm "mount.cifs", pid 475, jiffies 4294693782 (age 13.343s)
hex dump (first 32 bytes):
40 c5 38 46 51 99 ff ff 18 01 96 42 51 99 ff ff @.8FQ......BQ...
01 00 00 00 6f 00 c5 07 6f 00 d8 07 00 00 00 00 ....o...o.......
backtrace:
__kmem_cache_alloc_node+0x30e/0x3d0
kmalloc_trace+0x2a/0x90
ref_tracker_alloc+0x8e/0x1d0
sk_alloc+0x18c/0x1c0
inet_create+0xf1/0x370
__sock_create+0xd7/0x1e0
generic_ip_connect+0x1d4/0x5a0 [cifs]
cifs_get_tcp_session+0x5d0/0x8a0 [cifs]
cifs_mount_get_session+0x47/0x1b0 [cifs]
dfs_mount_share+0xfa/0xa10 [cifs]
cifs_mount+0x68/0x2b0 [cifs]
cifs_smb3_do_mount+0x10b/0x760 [cifs]
smb3_get_tree+0x112/0x2e0 [cifs]
vfs_get_tree+0x29/0xf0
path_mount+0x2d4/0xa00
__se_sys_mount+0x165/0x1d0
Root cause:
When creating kernel sockets, sk_alloc() calls net_passive_inc() for
sockets with sk_net_refcnt=0. The CIFS code manually converts kernel
sockets to user sockets by setting sk_net_refcnt=1, but doesn't call
the corresponding net_passive_dec(). This creates an imbalance in the
net_passive counter, which prevents the network namespace from being
destroyed when its last user reference is dropped. As a result, the
entire namespace and all its associated resources remain allocated.
Timeline of patches leading to this issue:
- commit ef7134c7fc ("smb: client: Fix use-after-free of network
namespace.") in v6.12 fixed the original netns UAF by manually
managing socket refcounts
- commit e9f2517a3e ("smb: client: fix TCP timers deadlock after
rmmod") in v6.13 attempted to use kernel sockets but introduced
TCP timer issues
- commit 5c70eb5c59 ("net: better track kernel sockets lifetime")
in v6.14-rc5 introduced the net_passive mechanism with
sk_net_refcnt_upgrade() for proper socket conversion
- commit 95d2b9f693 ("Revert "smb: client: fix TCP timers deadlock
after rmmod"") in v6.15-rc3 reverted to manual refcount management
without adapting to the new net_passive changes
Fix this by using sk_net_refcnt_upgrade() which properly handles the
net_passive counter when converting kernel sockets to user sockets.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=220343 [1]
Fixes: 95d2b9f693 ("Revert "smb: client: fix TCP timers deadlock after rmmod"")
Cc: stable@vger.kernel.org
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Enzo Matsumiya <ematsumiya@suse.de>
Signed-off-by: Wang Zhaolong <wangzhaolong@huaweicloud.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
This patch introduces is_bnode_offset_valid() method that checks
the requested offset value. Also, it introduces
check_and_correct_requested_length() method that checks and
correct the requested length (if it is necessary). These methods
are used in hfs_bnode_read(), hfs_bnode_write(), hfs_bnode_clear(),
hfs_bnode_copy(), and hfs_bnode_move() with the goal to prevent
the access out of allocated memory and triggering the crash.
Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com>
Link: https://lore.kernel.org/r/20250703214912.244138-1-slava@dubeyko.com
Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com>
When the volume header contains erroneous values that do not reflect
the actual state of the filesystem, hfsplus_fill_super() assumes that
the attributes file is not yet created, which later results in hitting
BUG_ON() when hfsplus_create_attributes_file() is called. Replace this
BUG_ON() with -EIO error with a message to suggest running fsck tool.
Reported-by: syzbot <syzbot+1107451c16b9eb9d29e6@syzkaller.appspotmail.com>
Closes: https://syzkaller.appspot.com/bug?extid=1107451c16b9eb9d29e6
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Viacheslav Dubeyko <slava@dubeyko.com>
Link: https://lore.kernel.org/r/7b587d24-c8a1-4413-9b9a-00a33fbd849f@I-love.SAKURA.ne.jp
Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com>
hfsplus_submit_bio() called by hfsplus_sync_fs() uses bdev_virt_rw() which
in turn uses submit_bio_wait() to submit the BIO.
But submit_bio_wait() already sets the REQ_SYNC flag on the BIO so there
is no need for setting the flag in hfsplus_sync_fs() when calling
hfsplus_submit_bio().
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Yangtao Li <frank.li@vivo.com>
Reviewed-by: Viacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com>
Link: https://lore.kernel.org/r/20250710063553.4805-1-johannes.thumshirn@wdc.com
Link: https://lore.kernel.org/r/20250710063553.4805-1-johannes.thumshirn@wdc.com
Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaIM+0AAKCRCRxhvAZXjc
oopdAP9SviubsueENDGRNvyvjCajdaUcZ481UeWCsvshc1ykmgEAo+Cwu3QxRzF7
BVjSIJSV9r4ae/fNw/MASbJVwyfCJQc=
=hGG6
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.16-rc8.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
"Two last-minute fixes for this cycle:
- Set afs vllist to NULL if addr parsing fails
- Add a missing check for reaching the end of the string in afs"
* tag 'vfs-6.16-rc8.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
afs: Set vllist to NULL if addr parsing fails
afs: Fix check for NULL terminator
User reported fixes:
- Fix btree node scan on encrypted filesystems by not using btree node
header fields encrypted
- Fix a race in btree write buffer flush; this caused EROs primarily
during fsck for some people
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmiC8nIACgkQE6szbY3K
bnb2Mw//c9gFPpIqCQnMGwG1lXCCDjyKMkaP9OFq/Y2Ul5ERqWr2x3zzj36AYAkn
Ujo4/xNVBTsdvz8Kj4kG926F87UPnSJPJ64Y3L2Ag8WtOImMDWCu2GBHQ6ILdZYp
sPaf3Kd5P8GS221s/Rg45DihHlkoaZiXe7nE9/i/SdrvWyeMG7WQBuwgyTkjbVSh
c95sot66FlKURHDilFxZ1JU7UIf+QLyZRb5d6gQlmCce4c3w7RaRjK/dtnA9wZPQ
GkLim+Ja5oxUxT+pn4g2ImfLDL1nIGN7xtZR8vg0awekIbQxCsFzv6l7t5vvLEQ/
l9Q8bCUFhJ5pQ4uMGsde47G3smmVC8OFDJTNeUog9grcbcsyUh1+bN9Ix886oVnd
14Yji6++WmwekDISw9wH8tKNkGp7PZFMDxRFET8LQf+WReIDr/n1F4j1fhNBL0jr
UVy0F8u3prGwE3TigUX35cBvDmdbhIpkSZIaXskUewkiOByzcLXh5fF5XlfEtAJ0
OhMQQPAOKv5vcmLF3+JYMGj0JDch3s6HJy6AfxQ2EfZs4iozL47aOEiMlQ3PdZfw
CJfMAXwXs/PgznSDBdAdV0iMRM5f43QtHGXfBIsHk+F3Ax05ERvvghtnCYht3ltd
5Fy80Tm1iGxH4nsItR3bXxAnkFQmhJFDPAZ7RD8TkvgkSrDYJhA=
=qe+b
-----END PGP SIGNATURE-----
Merge tag 'bcachefs-2025-07-24' of git://evilpiepirate.org/bcachefs
Pull bcachefs fixes from Kent Overstreet:
"User reported fixes:
- Fix btree node scan on encrypted filesystems by not using btree
node header fields encrypted
- Fix a race in btree write buffer flush; this caused EROs primarily
during fsck for some people"
* tag 'bcachefs-2025-07-24' of git://evilpiepirate.org/bcachefs:
bcachefs: Add missing snapshots_seen_add_inorder()
bcachefs: Fix write buffer flushing from open journal entry
bcachefs: btree_node_scan: don't re-read before initializing found_btree_node
A syzbot fuzzed image triggered a BUG_ON in ext4_update_inline_data()
when an inode had the INLINE_DATA_FL flag set but was missing the
system.data extended attribute.
Since this can happen due to a maiciouly fuzzed file system, we
shouldn't BUG, but rather, report it as a corrupted file system.
Add similar replacements of BUG_ON with EXT4_ERROR_INODE() ii
ext4_create_inline_data() and ext4_inline_data_truncate().
Reported-by: syzbot+544248a761451c0df72f@syzkaller.appspotmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Although we now perform ordered traversal within an xarray, this is
currently limited to a single xarray. However, we have multiple such
xarrays, which prevents us from guaranteeing a linear-like traversal
where all groups on the right are visited before all groups on the left.
For example, suppose we have 128 block groups, with a target group of 64,
a target length corresponding to an order of 1, and available free groups
of 16 (order 1) and group 65 (order 8):
For linear traversal, when no suitable free block is found in group 64, it
will search in the next block group until group 127, then start searching
from 0 up to block group 63. It ensures continuous forward traversal, which
is consistent with the unidirectional rotation behavior of HDD platters.
Additionally, the block group lock contention during freeing block is
unavoidable. The goal increasing from 0 to 64 indicates that previously
scanned groups (which had no suitable free space and are likely to free
blocks later) and skipped groups (which are currently in use) have newly
freed some used blocks. If we allocate blocks in these groups, the
probability of competing with other processes increases.
For non-linear traversal, we first traverse all groups in order_1. If only
group 16 has free space in this list, we first traverse [63, 128), then
traverse [0, 64) to find the available group 16, and then allocate blocks
in group 16. Therefore, it cannot guarantee continuous traversal in one
direction, thus increasing the probability of contention.
So refactor ext4_mb_scan_groups_xarray() to ext4_mb_scan_groups_xa_range()
to only traverse a fixed range of groups, and move the logic for handling
wrap around to the caller. The caller first iterates through all xarrays
in the range [start, ngroups) and then through the range [0, start). This
approach simulates a linear scan, which reduces contention between freeing
blocks and allocating blocks.
Assume we have the following groups, where "|" denotes the xarray traversal
start position:
order_1_groups: AB | CD
order_2_groups: EF | GH
Traversal order:
Before: C > D > A > B > G > H > E > F
After: C > D > G > H > A > B > E > F
Performance test data follows:
|CPU: Kunpeng 920 | P80 | P1 |
|Memory: 512GB |------------------------|-------------------------|
|960GB SSD (0.5GB/s)| base | patched | base | patched |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 19555 | 20049 (+2.5%) | 315636 | 316724 (-0.3%) |
|mb_optimize_scan=1 | 15496 | 19342 (+24.8%) | 323569 | 328324 (+1.4%) |
|CPU: AMD 9654 * 2 | P96 | P1 |
|Memory: 1536GB |------------------------|-------------------------|
|960GB SSD (1GB/s) | base | patched | base | patched |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 53192 | 52125 (-2.0%) | 212678 | 215136 (+1.1%) |
|mb_optimize_scan=1 | 37636 | 50331 (+33.7%) | 214189 | 209431 (-2.2%) |
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-18-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
This commit converts the `choose group` logic to `scan group` using
previously prepared helper functions. This allows us to leverage xarrays
for ordered non-linear traversal, thereby mitigating the "bouncing" issue
inherent in the `choose group` mechanism.
This also decouples linear and non-linear traversals, leading to cleaner
and more readable code.
Key changes:
* ext4_mb_choose_next_group() is refactored to ext4_mb_scan_groups().
* Replaced ext4_mb_good_group() with ext4_mb_scan_group() in non-linear
traversals, and related functions now return error codes instead of
group info.
* Added ext4_mb_scan_groups_linear() for performing linear scans starting
from a specific group for a set number of times.
* Linear scans now execute up to sbi->s_mb_max_linear_groups times,
so ac_groups_linear_remaining is removed as it's no longer used.
* ac->ac_criteria is now used directly instead of passing cr around.
Also, ac->ac_criteria is incremented directly after groups scan fails
for the corresponding criteria.
* Since we're now directly scanning groups instead of finding a good group
then scanning, the following variables and flags are no longer needed,
s_bal_cX_groups_considered is sufficient.
s_bal_p2_aligned_bad_suggestions
s_bal_goal_fast_bad_suggestions
s_bal_best_avail_bad_suggestions
EXT4_MB_CR_POWER2_ALIGNED_OPTIMIZED
EXT4_MB_CR_GOAL_LEN_FAST_OPTIMIZED
EXT4_MB_CR_BEST_AVAIL_LEN_OPTIMIZED
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-17-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
While traversing the list, holding a spin_lock prevents load_buddy, making
direct use of ext4_try_lock_group impossible. This can lead to a bouncing
scenario where spin_is_locked(grp_A) succeeds, but ext4_try_lock_group()
fails, forcing the list traversal to repeatedly restart from grp_A.
In contrast, linear traversal directly uses ext4_try_lock_group(),
avoiding this bouncing. Therefore, we need a lockless, ordered traversal
to achieve linear-like efficiency.
Therefore, this commit converts both average fragment size lists and
largest free order lists into ordered xarrays.
In an xarray, the index represents the block group number and the value
holds the block group information; a non-empty value indicates the block
group's presence.
While insertion and deletion complexity remain O(1), lookup complexity
changes from O(1) to O(nlogn), which may slightly reduce single-threaded
performance.
Additionally, xarray insertions might fail, potentially due to memory
allocation issues. However, since we have linear traversal as a fallback,
this isn't a major problem. Therefore, we've only added a warning message
for insertion failures here.
A helper function ext4_mb_find_good_group_xarray() is added to find good
groups in the specified xarray starting at the specified position start,
and when it reaches ngroups-1, it wraps around to 0 and then to start-1.
This ensures an ordered traversal within the xarray.
Performance test results are as follows: Single-process operations
on an empty disk show negligible impact, while multi-process workloads
demonstrate a noticeable performance gain.
|CPU: Kunpeng 920 | P80 | P1 |
|Memory: 512GB |------------------------|-------------------------|
|960GB SSD (0.5GB/s)| base | patched | base | patched |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 20097 | 19555 (-2.6%) | 316141 | 315636 (-0.2%) |
|mb_optimize_scan=1 | 13318 | 15496 (+16.3%) | 325273 | 323569 (-0.5%) |
|CPU: AMD 9654 * 2 | P96 | P1 |
|Memory: 1536GB |------------------------|-------------------------|
|960GB SSD (1GB/s) | base | patched | base | patched |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 53603 | 53192 (-0.7%) | 214243 | 212678 (-0.7%) |
|mb_optimize_scan=1 | 20887 | 37636 (+80.1%) | 213632 | 214189 (+0.2%) |
[ Applied spelling fixes per discussion on the ext4-list see thread
referened in the Link tag. --tytso]
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-16-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Extract ext4_mb_scan_group() to make the code clearer and to
prepare for the later conversion of 'choose group' to 'scan groups'.
No functional changes.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-15-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Extract ext4_mb_might_prefetch() to make the code clearer and to
prepare for the later conversion of 'choose group' to 'scan groups'.
No functional changes.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-14-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Extract __ext4_mb_scan_group() to make the code clearer and to
prepare for the later conversion of 'choose group' to 'scan groups'.
No functional changes.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-13-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The grp->bb_largest_free_order is updated regardless of whether
mb_optimize_scan is enabled. This can lead to inconsistencies between
grp->bb_largest_free_order and the actual s_mb_largest_free_orders list
index when mb_optimize_scan is repeatedly enabled and disabled via remount.
For example, if mb_optimize_scan is initially enabled, largest free
order is 3, and the group is in s_mb_largest_free_orders[3]. Then,
mb_optimize_scan is disabled via remount, block allocations occur,
updating largest free order to 2. Finally, mb_optimize_scan is re-enabled
via remount, more block allocations update largest free order to 1.
At this point, the group would be removed from s_mb_largest_free_orders[3]
under the protection of s_mb_largest_free_orders_locks[2]. This lock
mismatch can lead to list corruption.
To fix this, whenever grp->bb_largest_free_order changes, we now always
attempt to remove the group from its old order list. However, we only
insert the group into the new order list if `mb_optimize_scan` is enabled.
This approach helps prevent lock inconsistencies and ensures the data in
the order lists remains reliable.
Fixes: 196e402adf ("ext4: improve cr 0 / cr 1 group scanning")
CC: stable@vger.kernel.org
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-12-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Groups with no free blocks shouldn't be in any average fragment size list.
However, when all blocks in a group are allocated(i.e., bb_fragments or
bb_free is 0), we currently skip updating the average fragment size, which
means the group isn't removed from its previous s_mb_avg_fragment_size[old]
list.
This created "zombie" groups that were always skipped during traversal as
they couldn't satisfy any block allocation requests, negatively impacting
traversal efficiency.
Therefore, when a group becomes completely full, bb_avg_fragment_size_order
is now set to -1. If the old order was not -1, a removal operation is
performed; if the new order is not -1, an insertion is performed.
Fixes: 196e402adf ("ext4: improve cr 0 / cr 1 group scanning")
CC: stable@vger.kernel.org
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-11-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Attempt to merge ext4_free_data with already inserted free extents prior
to adding new ones. This strategy drastically cuts down the number of
times locks are held.
For example, if prev, new, and next extents are all mergeable, the existing
code (before this patch) requires acquiring the s_md_lock three times:
prev merge into new and free prev // hold lock
next merge into new and free next // hold lock
insert new // hold lock
After the patch, it only needs to be acquired once:
new merge into next and free new // no lock
next merge into prev and free next // hold lock
Performance test data follows:
Test: Running will-it-scale/fallocate2 on CPU-bound containers.
Observation: Average fallocate operations per container per second.
|CPU: Kunpeng 920 | P80 | P1 |
|Memory: 512GB |------------------------|-------------------------|
|960GB SSD (0.5GB/s)| base | patched | base | patched |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 20043 | 20097 (+0.2%) | 314331 | 316141 (+0.5%) |
|mb_optimize_scan=1 | 7290 | 13318 (+87.4%) | 324226 | 325273 (+0.3%) |
|CPU: AMD 9654 * 2 | P96 | P1 |
|Memory: 1536GB |------------------------|-------------------------|
|960GB SSD (1GB/s) | base | patched | base | patched |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 54999 | 53603 (-2.5%) | 214380 | 214243 (-0.06%)|
|mb_optimize_scan=1 | 13497 | 20887 (+54.6%) | 216276 | 213632 (-1.2%) |
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-10-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Previously, s_md_lock was used to protect s_mb_free_pending during
modifications, while smp_mb() ensured fresh reads, so s_md_lock just
guarantees the atomicity of s_mb_free_pending. Thus we optimized it by
converting s_mb_free_pending into an atomic variable, thereby eliminating
s_md_lock and minimizing lock contention. This also prepares for future
lockless merging of free extents.
Following this modification, s_md_lock is exclusively responsible for
managing insertions and deletions within s_freed_data_list, along with
operations involving list_splice.
Performance test data follows:
Test: Running will-it-scale/fallocate2 on CPU-bound containers.
Observation: Average fallocate operations per container per second.
|CPU: Kunpeng 920 | P80 | P1 |
|Memory: 512GB |------------------------|-------------------------|
|960GB SSD (0.5GB/s)| base | patched | base | patched |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 19628 | 20043 (+2.1%) | 320885 | 314331 (-2.0%) |
|mb_optimize_scan=1 | 7129 | 7290 (+2.2%) | 321275 | 324226 (+0.9%) |
|CPU: AMD 9654 * 2 | P96 | P1 |
|Memory: 1536GB |------------------------|-------------------------|
|960GB SSD (1GB/s) | base | patched | base | patched |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 53760 | 54999 (+2.3%) | 213145 | 214380 (+0.5%) |
|mb_optimize_scan=1 | 12716 | 13497 (+6.1%) | 215262 | 216276 (+0.4%) |
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-9-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Since nobody has used these EXT4_MB_HINT flags for ages,
let's remove them.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-7-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
When allocating data blocks, if the first try (goal allocation) fails and
stream allocation is on, it tries a global goal starting from the last
group we used (s_mb_last_group). This helps cluster large files together
to reduce free space fragmentation, and the data block contiguity also
accelerates write-back to disk.
However, when multiple processes allocate blocks, having just one global
goal means they all fight over the same group. This drastically lowers
the chances of extents merging and leads to much worse file fragmentation.
To mitigate this multi-process contention, we now employ multiple global
goals, with the number of goals being the minimum between the number of
possible CPUs and one-quarter of the filesystem's total block group count.
To ensure a consistent goal for each inode, we select the corresponding
goal by taking the inode number modulo the total number of goals.
Performance test data follows:
Test: Running will-it-scale/fallocate2 on CPU-bound containers.
Observation: Average fallocate operations per container per second.
|CPU: Kunpeng 920 | P80 | P1 |
|Memory: 512GB |------------------------|-------------------------|
|960GB SSD (0.5GB/s)| base | patched | base | patched |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 9636 | 19628 (+103%) | 337597 | 320885 (-4.9%) |
|mb_optimize_scan=1 | 4834 | 7129 (+47.4%) | 341440 | 321275 (-5.9%) |
|CPU: AMD 9654 * 2 | P96 | P1 |
|Memory: 1536GB |------------------------|-------------------------|
|960GB SSD (1GB/s) | base | patched | base | patched |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 22341 | 53760 (+140%) | 219707 | 213145 (-2.9%) |
|mb_optimize_scan=1 | 9177 | 12716 (+38.5%) | 215732 | 215262 (+0.2%) |
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-6-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
After we optimized the block group lock, we found another lock
contention issue when running will-it-scale/fallocate2 with multiple
processes. The fallocate's block allocation and the truncate's block
release were fighting over the s_md_lock. The problem is, this lock
protects totally different things in those two processes: the list of
freed data blocks (s_freed_data_list) when releasing, and where to start
looking for new blocks (mb_last_group) when allocating.
Now we only need to track s_mb_last_group and no longer need to track
s_mb_last_start, so we don't need the s_md_lock lock to ensure that the
two are consistent. Since s_mb_last_group is merely a hint and doesn't
require strong synchronization, READ_ONCE/WRITE_ONCE is sufficient.
Besides, the s_mb_last_group data type only requires ext4_group_t
(i.e., unsigned int), rendering unsigned long superfluous.
Performance test data follows:
Test: Running will-it-scale/fallocate2 on CPU-bound containers.
Observation: Average fallocate operations per container per second.
|CPU: Kunpeng 920 | P80 | P1 |
|Memory: 512GB |------------------------|-------------------------|
|960GB SSD (0.5GB/s)| base | patched | base | patched |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 4821 | 9636 (+99.8%) | 314065 | 337597 (+7.4%) |
|mb_optimize_scan=1 | 4784 | 4834 (+1.04%) | 316344 | 341440 (+7.9%) |
|CPU: AMD 9654 * 2 | P96 | P1 |
|Memory: 1536GB |------------------------|-------------------------|
|960GB SSD (1GB/s) | base | patched | base | patched |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 15371 | 22341 (+45.3%) | 205851 | 219707 (+6.7%) |
|mb_optimize_scan=1 | 6101 | 9177 (+50.4%) | 207373 | 215732 (+4.0%) |
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-5-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Since stream allocation does not use ac->ac_f_ex.fe_start, it is set to -1
by default, so the no longer needed sbi->s_mb_last_start is removed.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-4-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
When ext4 allocates blocks, we used to just go through the block groups
one by one to find a good one. But when there are tons of block groups
(like hundreds of thousands or even millions) and not many have free space
(meaning they're mostly full), it takes a really long time to check them
all, and performance gets bad. So, we added the "mb_optimize_scan" mount
option (which is on by default now). It keeps track of some group lists,
so when we need a free block, we can just grab a likely group from the
right list. This saves time and makes block allocation much faster.
But when multiple processes or containers are doing similar things, like
constantly allocating 8k blocks, they all try to use the same block group
in the same list. Even just two processes doing this can cut the IOPS in
half. For example, one container might do 300,000 IOPS, but if you run two
at the same time, the total is only 150,000.
Since we can already look at block groups in a non-linear way, the first
and last groups in the same list are basically the same for finding a block
right now. Therefore, add an ext4_try_lock_group() helper function to skip
the current group when it is locked by another process, thereby avoiding
contention with other processes. This helps ext4 make better use of having
multiple block groups.
Also, to make sure we don't skip all the groups that have free space
when allocating blocks, we won't try to skip busy groups anymore when
ac_criteria is CR_ANY_FREE.
Performance test data follows:
Test: Running will-it-scale/fallocate2 on CPU-bound containers.
Observation: Average fallocate operations per container per second.
|CPU: Kunpeng 920 | P80 |
|Memory: 512GB |-------------------------|
|960GB SSD (0.5GB/s)| base | patched |
|-------------------|-------|-----------------|
|mb_optimize_scan=0 | 2667 | 4821 (+80.7%) |
|mb_optimize_scan=1 | 2643 | 4784 (+81.0%) |
|CPU: AMD 9654 * 2 | P96 |
|Memory: 1536GB |-------------------------|
|960GB SSD (1GB/s) | base | patched |
|-------------------|-------|-----------------|
|mb_optimize_scan=0 | 3450 | 15371 (+345%) |
|mb_optimize_scan=1 | 3209 | 6101 (+90.0%) |
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-2-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
In case of ovl_lookup_temp() failure, we currently print `err`
which is actually not initialized at all.
Instead, properly print PTR_ERR(whiteout) which is where the
actual error really is.
Address-Coverity-ID: 1647983 ("Uninitialized variables (UNINIT)")
Fixes: 8afa0a7367 ("ovl: narrow locking in ovl_whiteout()")
Signed-off-by: Antonio Quartulli <antonio@mandelbit.com>
Link: https://lore.kernel.org/20250721203821.7812-1-antonio@mandelbit.com
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Christian Brauner <brauner@kernel.org>
If STATX_BASIC_STATS flags are not given as an argument to vfs_getattr,
It can not get ctime and mtime in kstat.
This causes a problem showing mtime and ctime outdated from cifs.ko.
File: /xfstest.test/foo
Size: 4096 Blocks: 8 IO Block: 1048576 regular file
Device: 0,65 Inode: 2033391 Links: 1
Access: (0755/-rwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
Context: system_u:object_r:cifs_t:s0
Access: 2025-07-23 22:15:30.136051900 +0100
Modify: 1970-01-01 01:00:00.000000000 +0100
Change: 1970-01-01 01:00:00.000000000 +0100
Birth: 2025-07-23 22:15:30.136051900 +0100
Cc: stable@vger.kernel.org
Reported-by: David Howells <dhowells@redhat.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
If client send multiple session setup requests to ksmbd,
Preauh_HashValue race condition could happen.
There is no need to free sess->Preauh_HashValue at session setup phase.
It can be freed together with session at connection termination phase.
Cc: stable@vger.kernel.org
Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-27661
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
xa_store() may fail so check its return value and return error code if
error occurred.
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
If client send two session setups with krb5 authenticate to ksmbd,
null pointer dereference error in generate_encryptionkey could happen.
sess->Preauth_HashValue is set to NULL if session is valid.
So this patch skip generate encryption key if session is valid.
Cc: stable@vger.kernel.org
Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-27654
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
This fixes an infinite loop when repairing "extent past end of inode",
when the extent is an older snapshot than the inode that needs repair.
Without the snaphsots_seen_add_inorder() we keep trying to delete the
same extent, even though it's no longer visible in the inode's snapshot.
Fixes: 63d6e93119 ("bcachefs: bch2_fpunch_snapshot()")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When flushing the btree write buffer, we pull write buffer keys directly
from the journal instead of letting the journal write path copy them to
the write buffer.
When flushing from the currently open journal buffer, we have to block
new reservations and wait for outstanding reservations to complete.
Recheck the reservation state after blocking new reservations:
previously, we were checking the reservation count from before calling
__journal_block().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
or aren't considered necessary for -stable kernels.
7 are for MM.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaILYBgAKCRDdBJ7gKXxA
jo0uAQDvTlAjH6TcgRW/cbqHRIeiRoZ9Bwh/RUlJXM9neDR2LgEA41B+ohTsxUmZ
OhM3Ce94tiGrHnVlW3SsmVaO+1TjGAU=
=KUR9
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2025-07-24-18-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"11 hotfixes. 9 are cc:stable and the remainder address post-6.15
issues or aren't considered necessary for -stable kernels.
7 are for MM"
* tag 'mm-hotfixes-stable-2025-07-24-18-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
sprintf.h requires stdarg.h
resource: fix false warning in __request_region()
mm/damon/core: commit damos_quota_goal->nid
kasan: use vmalloc_dump_obj() for vmalloc error reports
mm/ksm: fix -Wsometimes-uninitialized from clang-21 in advisor_mode_show()
mm: update MAINTAINERS entry for HMM
nilfs2: reject invalid file types when reading inodes
selftests/mm: fix split_huge_page_test for folio_split() tests
mailmap: add entry for Senozhatsky
mm/zsmalloc: do not pass __GFP_MOVABLE if CONFIG_COMPACTION=n
mm/vmscan: fix hwpoisoned large folio handling in shrink_folio_list
Enable HUGETLBFS only when platform subscrbes via ARCH_SUPPORTS_HUGETLBFS.
Hence select ARCH_SUPPORTS_HUGETLBFS on existing x86 and sparc for their
continuing HUGETLBFS support. While here also just drop existing 'BROKEN'
dependency.
Link: https://lkml.kernel.org/r/20250711102934.2399533-1-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Christian Brauner <brauner@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
With maple_tree supporting vma tree traversal under RCU and per-vma locks,
/proc/pid/maps can be read while holding individual vma locks instead of
locking the entire address space.
A completely lockless approach (walking vma tree under RCU) would be quite
complex with the main issue being get_vma_name() using callbacks which
might not work correctly with a stable vma copy, requiring original
(unstable) vma - see special_mapping_name() for example.
When per-vma lock acquisition fails, we take the mmap_lock for reading,
lock the vma, release the mmap_lock and continue. This fallback to mmap
read lock guarantees the reader to make forward progress even during lock
contention. This will interfere with the writer but for a very short time
while we are acquiring the per-vma lock and only when there was contention
on the vma reader is interested in.
We shouldn't see a repeated fallback to mmap read locks in practice, as
this require a very unlikely series of lock contentions (for instance due
to repeated vma split operations). However even if this did somehow
happen, we would still progress.
One case requiring special handling is when a vma changes between the time
it was found and the time it got locked. A problematic case would be if a
vma got shrunk so that its vm_start moved higher in the address space and
a new vma was installed at the beginning:
reader found: |--------VMA A--------|
VMA is modified: |-VMA B-|----VMA A----|
reader locks modified VMA A
reader reports VMA A: | gap |----VMA A----|
This would result in reporting a gap in the address space that does not
exist. To prevent this we retry the lookup after locking the vma, however
we do that only when we identify a gap and detect that the address space
was changed after we found the vma.
This change is designed to reduce mmap_lock contention and prevent a
process reading /proc/pid/maps files (often a low priority task, such as
monitoring/data collection services) from blocking address space updates.
Note that this change has a userspace visible disadvantage: it allows for
sub-page data tearing as opposed to the previous mechanism where data
tearing could happen only between pages of generated output data. Since
current userspace considers data tearing between pages to be acceptable,
we assume is will be able to handle sub-page data tearing as well.
Link: https://lkml.kernel.org/r/20250719182854.3166724-7-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jeongjun Park <aha310510@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: "Paul E . McKenney" <paulmck@kernel.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Weißschuh <linux@weissschuh.net>
Cc: T.J. Mercier <tjmercier@google.com>
Cc: Ye Bin <yebin10@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Back in 2.6 era, last_addr used to be stored in seq_file->version
variable, which was unsigned long. As a result, sentinels to represent
gate vma and end of all vmas used unsigned values. In more recent kernels
we don't used seq_file->version anymore and therefore conversion from
loff_t into unsigned type is not needed. Similarly, sentinel values don't
need to be unsigned. Remove type conversion for set_file position and
change sentinel values to signed. While at it, change the hardcoded
sentinel values with named definitions for better documentation.
Link: https://lkml.kernel.org/r/20250719182854.3166724-6-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Jann Horn <jannh@google.com>
Cc: Jeongjun Park <aha310510@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: "Paul E . McKenney" <paulmck@kernel.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Weißschuh <linux@weissschuh.net>
Cc: T.J. Mercier <tjmercier@google.com>
Cc: Ye Bin <yebin10@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
A race condition is possible in stable_page_flags() where user-space is
reading /proc/kpageflags concurrently to a folio split. This may lead to
oopses or BUG_ON()s being triggered.
To fix this, this commit uses snapshot_page() in stable_page_flags() so
that stable_page_flags() works with a stable page and folio snapshots
instead.
Note that stable_page_flags() makes use of some functions that require the
original page or folio pointer to work properly (eg. is_free_budy_page()
and folio_test_idle()). Since those functions can't be used on the page
snapshot, we replace their usage with flags that were set by
snapshot_page() for this purpose.
Link: https://lkml.kernel.org/r/52c16c0f00995a812a55980c2f26848a999a34ab.1752499009.git.luizcap@redhat.com
Signed-off-by: Luiz Capitulino <luizcap@redhat.com>
Reviewed-by: Shivank Garg <shivankg@amd.com>
Tested-by: Harry Yoo <harry.yoo@oracle.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Right now it appears that the code is relying upon the returned
destination address having bits outside PAGE_MASK to indicate whether an
error value is specified, and decrementing the increased refcount on the
uffd ctx if so.
This is not a safe means of determining an error value, so instead, be
specific. It makes far more sense to do so in a dedicated error path, so
add mremap_userfaultfd_fail() for this purpose and use this when an error
arises.
A vm_userfaultfd_ctx is not established until we are at the point where
mremap_userfaultfd_prep() is invoked in copy_vma_and_data(), so this is a
no-op until this happens.
That is - uffd remap notification only occurs if the VMA is actually moved
- at which point a UFFD_EVENT_REMAP event is raised.
No errors can occur after this point currently, though it's certainly not
guaranteed this will always remain the case, and we mustn't rely on this.
However, the reason for needing to handle this case is that, when an error
arises on a VMA move at the point of adjusting page tables, we revert this
operation, and propagate the error.
At this point, it is not correct to raise a uffd remap event, and we must
handle it.
This refactoring makes it abundantly clear what we are doing.
We assume vrm->new_addr is always valid, which a prior change made the
case even for mremap() invocations which don't move the VMA, however given
no uffd context would be set up in this case it's immaterial to this
change anyway.
No functional change intended.
Link: https://lkml.kernel.org/r/a70e8a1f7bce9f43d1431065b414e0f212297297.1752770784.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Stop using the obsolete write_cache_pages and use writeback_iter directly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
read for the pinfile using Direct I/O do not wait for dio write.
Signed-off-by: yohan.joung <yohan.joung@sk.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Otherwise F2FS will not do GC in background in low free section.
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
There is no extra work before trace_f2fs_[dataread|datawrite]_end(),
so there is no need to check trace_<tracepoint>_enabled().
Signed-off-by: Sheng Yong <shengyong1@xiaomi.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The function ksmbd_vfs_kern_path_locked() seems to serve two functions
and as a result has an odd interface.
On success it returns with the parent directory locked and with write
access on that filesystem requested, but it may have crossed over a
mount point to return the path, which makes the lock and the write
access irrelevant.
This patches separates the functionality into two functions:
- ksmbd_vfs_kern_path() does not lock the parent, does not request
write access, but does cross mount points
- ksmbd_vfs_kern_path_locked() does not cross mount points but
does lock the parent and request write access.
The parent_path parameter is no longer needed. For the _locked case
the final path is sufficient to drop write access and to unlock the
parent (using path->dentry->d_parent which is safe while the lock is
held).
There were 3 caller of ksmbd_vfs_kern_path_locked().
- smb2_create_link() needs to remove the target if it existed and
needs the lock and the write-access, so it continues to use
ksmbd_vfs_kern_path_locked(). It would not make sense to
cross mount points in this case.
- smb2_open() is the only user that needs to cross mount points
and it has no need for the lock or write access, so it now uses
ksmbd_vfs_kern_path()
- smb2_creat() does not need to cross mountpoints as it is accessing
a file that it has just created on *this* filesystem. But also it
does not need the lock or write access because by the time
ksmbd_vfs_kern_path_locked() was called it has already created the
file. So it could use either interface. It is simplest to use
ksmbd_vfs_kern_path().
ksmbd_vfs_kern_path_unlock() is still needed after
ksmbd_vfs_kern_path_locked() but it doesn't require the parent_path any
more. After a successful call to ksmbd_vfs_kern_path(), only path_put()
is needed to release the path.
Signed-off-by: NeilBrown <neil@brown.name>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
ri_buf just holds a pointer/len pair and is not a log iovec used for
writing to the log. Switch to use a kvec instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
These buffers are not directly logged, just use a kvec and remove the
xlog_copy_from_iovec helper only used here.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
The lv_size member counts the size of the entire allocation, rename it to
lv_alloc_size to make that clear.
The lv_buf_len member tracks how much of lv_buf has been used up
to format the log item, rename it to lv_buf_used to make that more clear.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Split out handling of ordered items into a single branch in
xlog_cil_insert_format_items so that the rest of the code becomes more
clear.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
By the time xfs_cil_prepare_item is called, the old lv is still pointed
to by the log item. Take it from there instead of spreading the old lv
logic over xlog_cil_insert_format_items and xfs_cil_prepare_item.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
The call to the event xfs_reflink_cow_enospc was removed when the COW
handling was merged into xfs_file_iomap_begin_delay, but the trace event
itself was not. Remove it.
Fixes: db46e604ad ("xfs: merge COW handling into xfs_file_iomap_begin_delay")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
The trace event xfs_discard_rtrelax was added but never used. Remove it.
Fixes: a330cae8a7 ("xfs: Remove header files which are included more than once")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
The trace event xfs_log_cil_return was added but never used. Remove it.
Fixes: c1220522ef ("xfs: grant heads track byte counts, not LSNs")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
The tracepoint trace_xfs_dqreclaim_dirty was removed with other code
removed from xfs_qm_dquot_isolate() but the defined tracepoint was not.
Fixes: d62016b1a2 ("xfs: avoid dquot buffer pin deadlock")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Replace the deprecated strncpy() with memtostr_pad(). This also avoids
the need for separate zeroing using memset(). Mark sb_fname buffer with
__nonstring as its size is XFSLABEL_MAX and so no terminating NULL for
sb_fname.
Signed-off-by: Pranav Tyagi <pranav.tyagi03@gmail.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Fixes: e967dc40d501 ("xfs: return the allocated transaction from xfs_trans_alloc_empty")
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
The top of the function comment is outdated, and the parts still correct
duplicate information in comment inside the function. Remove the top of
the function comment and instead improve a comment inside the function.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Describe the rationale for the decisions a bit better.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
None of them actually needs the inode, the mount is enough.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
This member just tracks how much space we handed out for sequential
write required zones. Only for conventional space it actually is the
pointer where thing are written at, otherwise zone append manages
that.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
i_used_blocks is a uint32_t, so use the same value for the local variable
caching it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Split up the XFS_IS_CORRUPT statement so that it immediately shows
if the reference counter overflowed or underflowed.
I ran into this quite a bit when developing the zoned allocator, and had
to reapply the patch for some work recently. We might as well just apply
it upstream given that freeing group is far removed from performance
critical code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Almost no users of the typedef left, kill it and switch the remaining
users to use the underlying struct.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
XFS stopped using current->journal_info in commit f2e812c152 ("xfs:
don't use current->journal_info"), so there is no point in saving and
restoring it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
xchk_trans_alloc_empty can't return errors, so return the allocated
transaction directly instead of an output double pointer argument.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
xfs_trans_alloc_empty can't return errors, so return the allocated
transaction directly instead of an output double pointer argument.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
xfs_trans_roll uses xfs_trans_reserve to basically just call into
xfs_log_regrant while bypassing the reset of xfs_trans_reserve.
Open code the call to xfs_log_regrant in xfs_trans_roll and simplify
xfs_trans_reserve now that it never regrants and always asks for a log
reservation.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
xfs_trans_alloc_empty only shares the very basic transaction structure
allocation and initialization with xfs_trans_alloc.
Split out a new __xfs_trans_alloc helper for that and otherwise decouple
xfs_trans_alloc_empty from xfs_trans_alloc.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
xfs_trans_reserve_more just tries to allocate additional blocks and/or
rtextents and is otherwise unrelated to the transaction reservation
logic. Open code the block and rtextent reservation in
xfs_trans_reserve_more to prepare for simplifying xfs_trans_reserve.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Instead of duplicating the empty transacaction reservation
definition.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Use cmp_int() to yield the result of a three-way-comparison instead of
performing subtractions with extra casts. Thus also rename the function
to make its name clearer in purpose.
Found by Linux Verification Center (linuxtesting.org).
Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Perhaps that's just my silly imagination but 'diff' doesn't look good for
the name of a variable to hold a result of a three-way-comparison
(-1, 0, 1) which is what ->cmp_key_with_cur() does. It implies to contain
an actual difference between the two integer variables but that's not true
anymore after recent refactoring.
Declaring it as int64_t is also misleading now. Plain integer type is
more than enough.
Found by Linux Verification Center (linuxtesting.org).
Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
The net value of these functions is to determine the result of a
three-way-comparison between operands of the same type.
Simplify the code using cmp_int() to eliminate potential errors with
opencoded casts and subtractions. This also means we can change the return
value type of cmp_key_with_cur routines from int64_t to int and make the
interface a bit clearer.
Found by Linux Verification Center (linuxtesting.org).
Suggested-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
The net value of these functions is to determine the result of a
three-way-comparison between operands of the same type.
Simplify the code using cmp_int() to eliminate potential errors with
opencoded casts and subtractions. This also means we can change the return
value type of cmp_two_keys routines from int64_t to int and make the
interface a bit clearer.
Found by Linux Verification Center (linuxtesting.org).
Suggested-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
key_diff routines compare a key value with a cursor value. Make the naming
to be a bit more self-descriptive.
Found by Linux Verification Center (linuxtesting.org).
Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
One may think that diff_two_keys routines are used to compute the actual
difference between the arguments but they return a result of a
three-way-comparison of the passed operands. So it looks more appropriate
to denote them as cmp_two_keys.
Found by Linux Verification Center (linuxtesting.org).
Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
xfs_xattr_class was accidentally created as a TRACE_EVENT() instead of a
class with DECLARE_EVENT_CLASS().
Note, TRACE_EVENT() is just defined as:
#define TRACE_EVENT(name, proto, args, tstruct, assign, print) \
DECLARE_EVENT_CLASS(name, \
PARAMS(proto), \
PARAMS(args), \
PARAMS(tstruct), \
PARAMS(assign), \
PARAMS(print)); \
DEFINE_EVENT(name, name, PARAMS(proto), PARAMS(args));
The difference between TRACE_EVENT() and DECLARE_EVENT_CLASS() is that
TRACE_EVENT() also creates an event with the class name.
Switch xfs_xattr_class over to being a class and not an event as it is not
called directly, and that event with the class name takes up unnecessary
memory.
Fixes: e47dcf113a ("xfs: repair extended attributes")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
The trace event xfs_file_compat_ioctl is only used when CONFIG_COMPAT is
configured in the build. As trace events can take up to 5K in memory for
text and meta data regardless if they are used, they should not be created
when unused. Add #ifdef CONFIG_COMPAT around the event so that it is only
created when that is configured.
Fixes: cca28fb83d ("xfs: split xfs_itrace_entry")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
When the use of iomap_dio_rw was added, the calls to the trace events
xfs_end_io_direct_unwritten and xfs_end_io_direct_append were removed but
those trace events were not. As trace events can take up to 5K in memory
for text and meta data regardless if they are used or not, they should not
be created when not used. Remove the unused events.
Fixes: acdda3aae1 ("xfs: use iomap_dio_rw")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
When the function xfs_flushinval_pages() was removed, it removed the only
caller to the trace event xfs_pagecache_inval. As trace events can take up
to 5K of memory in text and meta data each regardless if they are used or
not, they should not be created when unused. Remove the unused event.
Fixes: fb59581404 ("xfs: remove xfs_flushinval_pages")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
When the function xfs_alloc_space_available() was restructured, it removed
the only calls to the trace event xfs_alloc_near_nominleft. As trace
events take up to 5K of memory for text and meta data for each event, they
should not be created when not used. Remove this unused event.
Fixes: 54fee133ad ("xfs: adjust allocation length in xfs_alloc_space_available")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Trace events take up to 5K of memory in text and meta data regardless if
they are used or not. The call to the event xfs_alloc_near_error was
removed when the cursor data structure allocation was introduced. Remove
it as it is no longer used and is just wasting memory.
Fixes: f5e7dbea1e ("xfs: introduce allocation cursor data structure")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
When xfs_attri_remove_iter() was removed, so was the call to the trace
event xfs_attr_node_removename. As trace events can take up to 5K in
memory for text and meta data regardless if they are used or not, they
should not be created when unused. Remove the unused event.
Fixes: 59782a236b ("xfs: remove xfs_attri_remove_iter")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Trace events can take up to 5K in memory for text and meta data per event
regardless if they are used or not, so they should not be defined when not
used. The events xfs_attr_fillstate and xfs_attr_refillstate are only
called in code that is #ifdef out and exists only for future reference.
Remove these unused events. If the code is needed again, then git history
can recover what the events were.
Suggested-by: Christoph Hellwig <hch@lst.de>
Fixes: 59782a236b ("xfs: remove xfs_attri_remove_iter")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
When the function xfs_attr_rmtval_set() was removed, the call to the
corresponding trace event was also removed but the trace event itself was
not. As trace events can take up to 5K of memory in text and meta data
regardless if they are used or not they should not be created when not
used. Remove the unused trace event.
Fixes: 0e6acf29db ("xfs: Remove xfs_attr_rmtval_set")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
When the clone/dedupe_file_rang common functions were refactored, it
removed the calls to the xfs_reflink_compare_extents and
xfs_reflink_compare_extents_error events. As each event can take up to 5K
in memory for text and meta data regardless if they are used or not, they
should not be created if they are not used. Remove these unused events.
Fixes: 876bec6f9b ("vfs: refactor clone/dedupe_file_range common functions")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
The trace event xfs_ioctl_clone was added but never used. As trace events
can take up to 5K of memory in text and meta data regardless if they are
used or not, remove the unused trace event.
Fixes: 53aa1c34f4 ("xfs: define tracepoints for reflink activities")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
The trace event xlog_iclog_want_sync was added but never used. As trace
events can take up around 5K of memory in text and meta data regardless if
they are used or not, remove this unused event.
Fixes: 956f6daa84 ("xfs: add iclog state trace events")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
When the function xfs_attri_remove_iter was removed, it did not remove the
trace event that it called. As a trace event can take up to 5K of memory for
text and meta data regardless of if it is used or not, remove this unused trace
event.
Fixes: 59782a236b ("xfs: remove xfs_attri_remove_iter")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
This patch supports to readahead more blocks in erofs_readdir(), it can
enhance readdir performance in large direcotry.
readdir test in a large directory which contains 12000 sub-files.
files_per_second
Before: 926385.54
After: 2380435.562
Meanwhile, let's introduces a new sysfs entry to control readahead
bytes to provide more flexible policy for readahead of readdir().
- location: /sys/fs/erofs/<disk>/dir_ra_bytes
- default value: 16384
- disable readahead: set the value to 0
Signed-off-by: Chao Yu <chao@kernel.org>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20250721021352.2495371-1-chao@kernel.org
[ Gao Xiang: minor styling adjustment. ]
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Thanks to the meta buffer infrastructure, metadata-compressed inodes are
just read from the metabox inode instead of the blockdevice (or backing
file) inode.
The same is true for shared extended attributes.
When metadata compression is enabled, inode numbers are divided from
on-disk NIDs because of non-LTS 32-bit application compatibility.
Co-developed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Signed-off-by: Bo Liu (OpenAnolis) <liubo03@inspur.com>
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20250722003229.2121752-1-hsiangkao@linux.alibaba.com
Filesystem metadata has a high degree of redundancy, so it should
compress well in the general case.
Although metadata compression can increase overall I/O latency, many
users care more about minimized image sizes than extreme runtime
performance. Let's implement metadata compression in response to user
requests [1].
Actually, it's quite simple to implement metadata compression: since
EROFS already supports per-inode compression, we can simply treat a
special inode (called `the metabox inode`) as a container for compressed
inode metadata. Since EROFS supports multiple algorithms, users can
even specify LZ4 for metadata and LZMA for data.
To better support incremental builds, the MSB of NIDs indicates where
the inode metadata is located: if bit 63 is set, the inode itself should
be read from `the metabox inode`.
Optionally, shared xattrs can also be kept in `the metabox inode` if
COMPAT_SHARED_EA_IN_METABOX is set.
[1] https://issues.redhat.com/browse/RHEL-75783
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Acked-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20250717070804.1446345-2-hsiangkao@linux.alibaba.com
There is no need to keep additional local metabufs since we already
have one in `struct erofs_map_blocks`.
This was actually a leftover when applying meta buffers to zmap
operations, see commit 09c543798c ("erofs: use meta buffers for
zmap operations").
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20250716064152.3537457-1-hsiangkao@linux.alibaba.com
- need_kmap is always true except for a ztailpacking case; thus, just
open-code that one;
- The upcoming metadata compression will add a new boolean, so simplify
this first.
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20250714090907.4095645-1-hsiangkao@linux.alibaba.com
The compressed data for the ztailpacking feature is fetched from
the metadata inode (e.g., bd_inode), which is folio-based.
Therefore, the folio interface should be used instead.
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20250626085459.339830-1-hsiangkao@linux.alibaba.com
A really dumb braino on rebasing and a dumber fuckup with managing #for-next
Fixes: b70cb45989 ("ufs: convert ufs to the new mount API")
Fucked-up-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
syzbot reported a bug in in afs_put_vlserverlist.
kAFS: bad VL server IP address
BUG: unable to handle page fault for address: fffffffffffffffa
...
Oops: Oops: 0002 [#1] SMP KASAN PTI
...
RIP: 0010:refcount_dec_and_test include/linux/refcount.h:450 [inline]
RIP: 0010:afs_put_vlserverlist+0x3a/0x220 fs/afs/vl_list.c:67
...
Call Trace:
<TASK>
afs_alloc_cell fs/afs/cell.c:218 [inline]
afs_lookup_cell+0x12a5/0x1680 fs/afs/cell.c:264
afs_cell_init+0x17a/0x380 fs/afs/cell.c:386
afs_proc_rootcell_write+0x21f/0x290 fs/afs/proc.c:247
proc_simple_write+0x114/0x1b0 fs/proc/generic.c:825
pde_write fs/proc/inode.c:330 [inline]
proc_reg_write+0x23d/0x330 fs/proc/inode.c:342
vfs_write+0x25c/0x1180 fs/read_write.c:682
ksys_write+0x12a/0x240 fs/read_write.c:736
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xcd/0x260 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Because afs_parse_text_addrs() parses incorrectly, its return value -EINVAL
is assigned to vllist, which results in -EINVAL being used as the vllist
address when afs_put_vlserverlist() is executed.
Set the vllist value to NULL when a parsing error occurs to avoid this
issue.
Fixes: e2c2cb8ef0 ("afs: Simplify cell record handling")
Reported-by: syzbot+5c042fbab0b292c98fc6@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=5c042fbab0b292c98fc6
Tested-by: syzbot+5c042fbab0b292c98fc6@syzkaller.appspotmail.com
Signed-off-by: Edward Adam Davis <eadavis@qq.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/4119365.1753108011@warthog.procyon.org.uk
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
The do_k_string() and do_c_string() functions do essentially the same
thing which is they add a string and a comma onto the end of an existing
string. At the end, the caller will overwrite the last comma with a
newline. Later, in orangefs_kernel_debug_init(), we add a newline to
the string.
The change to do_k_string() is just cosmetic. I moved the "- 1" to
the other side of the comparison and made it "+ 1". This has no
effect on runtime, I just wanted the functions to match each other
and the rest of the file.
However in do_c_string(), I removed the "- 2" which allows us to print
two extra characters. I noticed this issue while reviewing the code
and I doubt affects anything in real life. My guess is that this was
double counting the comma and the newline. The "+ 1" accounts for
the newline, and the caller will delete the final comma which ensures
there is enough space for the newline.
Removing the "- 2" lets us print 2 more characters, but mainly it makes
the code more consistent and understandable for reviewers.
Fixes: 44f4641073 ("orangefs: clean up debugfs globals")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
As Jiaming Zhang reported:
<TASK>
__dump_stack lib/dump_stack.c:94 [inline]
dump_stack_lvl+0x1c1/0x2a0 lib/dump_stack.c:120
print_address_description mm/kasan/report.c:378 [inline]
print_report+0x17e/0x800 mm/kasan/report.c:480
kasan_report+0x147/0x180 mm/kasan/report.c:593
data_blkaddr fs/f2fs/f2fs.h:3053 [inline]
f2fs_data_blkaddr fs/f2fs/f2fs.h:3058 [inline]
f2fs_get_dnode_of_data+0x1a09/0x1c40 fs/f2fs/node.c:855
f2fs_reserve_block+0x53/0x310 fs/f2fs/data.c:1195
prepare_write_begin fs/f2fs/data.c:3395 [inline]
f2fs_write_begin+0xf39/0x2190 fs/f2fs/data.c:3594
generic_perform_write+0x2c7/0x910 mm/filemap.c:4112
f2fs_buffered_write_iter fs/f2fs/file.c:4988 [inline]
f2fs_file_write_iter+0x1ec8/0x2410 fs/f2fs/file.c:5216
new_sync_write fs/read_write.c:593 [inline]
vfs_write+0x546/0xa90 fs/read_write.c:686
ksys_write+0x149/0x250 fs/read_write.c:738
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xf3/0x3d0 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
The root cause is in the corrupted image, there is a dnode has the same
node id w/ its inode, so during f2fs_get_dnode_of_data(), it tries to
access block address in dnode at offset 934, however it parses the dnode
as inode node, so that get_dnode_addr() returns 360, then it tries to
access page address from 360 + 934 * 4 = 4096 w/ 4 bytes.
To fix this issue, let's add sanity check for node id of all direct nodes
during f2fs_get_dnode_of_data().
Cc: stable@kernel.org
Reported-by: Jiaming Zhang <r772577952@gmail.com>
Closes: https://groups.google.com/g/syzkaller/c/-ZnaaOOfO3M
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The new mount api will execute .parse_param, .init_fs_context, .get_tree
and will call .remount if remount happened. So we add the necessary
functions for the fs_context_operations. If .init_fs_context is added,
the old .mount should remove.
See Documentation/filesystems/mount_api.rst for more information.
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
[sandeen: forward port]
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
[hongbo: context modified]
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The handle_mount_opt() helper is used to parse mount parameters,
and so we can rename this function to f2fs_parse_param() and set
it as .param_param in fs_context_operations.
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
[sandeen: forward port]
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The new mount api separates option parsing and super block setup
into two distinct steps and so we need to separate the options
parsing out of the parse_options().
In order to achieve this, here we handle the mount options with
three steps:
- Firstly, we move sb/sbi out of handle_mount_opt.
As the former patch introduced f2fs_fs_context, so we record
the changed mount options in this context. In handle_mount_opt,
sb/sbi is null, so we should move all relative code out of
handle_mount_opt (thus, some check case which use sb/sbi should
move out).
- Secondly, we introduce the some check helpers to keep the option
consistent.
During filling superblock period, sb/sbi are ready. So we check
the f2fs_fs_context which holds the mount options base on sb/sbi.
- Thirdly, we apply the new mount options to sb/sbi.
After checking the f2fs_fs_context, all changed on mount options
are valid. So we can apply them to sb/sbi directly.
After do these, option parsing and super block setting have been
decoupled. Also it should have retained the original execution
flow.
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
[sandeen: forward port, minor fixes and updates]
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
[hongbo: minor fixes]
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
At the parsing phase of mouont in the new mount api, options
value will be recorded with the context, and then it will be
used in fill_super and other helpers.
Note that, this is a temporary status, we want remove the sb
and sbi usages in handle_mount_opt. So here the f2fs_fs_context
only records the mount options, it will be copied in sb/sbi in
later process. (At this point in the series, mount options are
temporarily not set during mount.)
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
[sandeen: forward port, minor fixes and updates]
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
[hongbo: minor cleanup]
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
At the parsing phase of the new mount api, sbi will not be
available. So here allows sbi to be NULL in f2fs log helpers
and use that in handle_mount_opt().
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
[sandeen: forward port]
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In handle_mount_opt, we use fs_parameter to parse each option.
However we're still using the old API to get the options string.
Using fsparams parse_options allows us to remove many of the Opt_
enums, so remove them.
The checkpoint disable cap (or percent) involves rather complex
parsing; we retain the old match_table mechanism for this, which
handles it well.
There are some changes about parsing options:
1. For `active_logs`, `inline_xattr_size` and `fault_injection`,
we use s32 type according the internal structure to record the
option's value.
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
[sandeen: forward port, minor fixes and updates]
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
[hongbo: minor cleanup]
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Use an array of `fs_parameter_spec` called f2fs_param_specs to
hold the mount option specifications for the new mount api.
Add constant_table structures for several options to facilitate
parsing.
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
[sandeen: forward port, minor fixes and updates, more fsparam_enum]
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
- touch /mnt/f2fs/012345678901234567890123456789012345678901234567890123
- truncate -s $((1024*1024*1024)) \
/mnt/f2fs/012345678901234567890123456789012345678901234567890123
- touch /mnt/f2fs/file
- truncate -s $((1024*1024*1024)) /mnt/f2fs/file
- mkfs.f2fs /mnt/f2fs/012345678901234567890123456789012345678901234567890123 \
-c /mnt/f2fs/file
- mount /mnt/f2fs/012345678901234567890123456789012345678901234567890123 \
/mnt/f2fs/loop
[16937.192225] F2FS-fs (loop0): Mount Device [ 0]: /mnt/f2fs/012345678901234567890123456789012345678901234567890123\xff\x01, 511, 0 - 3ffff
[16937.192268] F2FS-fs (loop0): Failed to find devices
If device path length equals to MAX_PATH_LEN, sbi->devs.path[] may
not end up w/ null character due to path array is fully filled, So
accidently, fields locate after path[] may be treated as part of
device path, result in parsing wrong device path.
struct f2fs_dev_info {
...
char path[MAX_PATH_LEN];
...
};
Let's add one byte space for sbi->devs.path[] to store null
character of device path string.
Fixes: 3c62be17d4 ("f2fs: support multiple devices")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
All callers have been converted to F2FS_F_SB() so delete this wrapper.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
All three callers have a folio so pass it in.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>