Commit Graph

5280 Commits

Author SHA1 Message Date
Linus Torvalds
57fcb7d930 vfs-6.17-rc1.fileattr
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaINCpgAKCRCRxhvAZXjc
 oqfFAQDcy3rROUF3W34KcSi7rDmaKVSX53d1tUoqH+1zDRpSlwEAriKDNC1ybudp
 YAnxVzkRHjHs1296WIuwKq5lfhJ60Q4=
 =geAl
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.17-rc1.fileattr' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull fileattr updates from Christian Brauner:
 "This introduces the new file_getattr() and file_setattr() system calls
  after lengthy discussions.

  Both system calls serve as successors and extensible companions to
  the FS_IOC_FSGETXATTR and FS_IOC_FSSETXATTR system calls which have
  started to show their age in addition to being named in a way that
  makes it easy to conflate them with extended attribute related
  operations.

  These syscalls allow userspace to set filesystem inode attributes on
  special files. One of the usage examples is the XFS quota projects.

  XFS has project quotas which could be attached to a directory. All new
  inodes in these directories inherit project ID set on parent
  directory.

  The project is created from userspace by opening and calling
  FS_IOC_FSSETXATTR on each inode. This is not possible for special
  files such as FIFO, SOCK, BLK etc. Therefore, some inodes are left
  with empty project ID. Those inodes then are not shown in the quota
  accounting but still exist in the directory. This is not critical but
  in the case when special files are created in the directory with
  already existing project quota, these new inodes inherit extended
  attributes. This creates a mix of special files with and without
  attributes. Moreover, special files with attributes don't have a
  possibility to become clear or change the attributes. This, in turn,
  prevents userspace from re-creating quota project on these existing
  files.

  In addition, these new system calls allow the implementation of
  additional attributes that we couldn't or didn't want to fit into the
  legacy ioctls anymore"

* tag 'vfs-6.17-rc1.fileattr' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  fs: tighten a sanity check in file_attr_to_fileattr()
  tree-wide: s/struct fileattr/struct file_kattr/g
  fs: introduce file_getattr and file_setattr syscalls
  fs: prepare for extending file_get/setattr()
  fs: make vfs_fileattr_[get|set] return -EOPNOTSUPP
  selinux: implement inode_file_[g|s]etattr hooks
  lsm: introduce new hooks for setting/getting inode fsxattr
  fs: split fileattr related helpers into separate file
2025-07-28 15:24:14 -07:00
Linus Torvalds
7031769e10 vfs-6.17-rc1.mmap_prepare
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaINCgQAKCRCRxhvAZXjc
 os+nAP9LFHUwWO6EBzHJJGEVjJvvzsbzqeYrRFamYiMc5ulPJwD+KW4RIgJa/MWO
 pcYE40CacaekD8rFWwYUyszpgmv6ewc=
 =wCwp
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.17-rc1.mmap_prepare' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull mmap_prepare updates from Christian Brauner:
 "Last cycle we introduce f_op->mmap_prepare() in c84bf6dd2b ("mm:
  introduce new .mmap_prepare() file callback").

  This is preferred to the existing f_op->mmap() hook as it does require
  a VMA to be established yet, thus allowing the mmap logic to invoke
  this hook far, far earlier, prior to inserting a VMA into the virtual
  address space, or performing any other heavy handed operations.

  This allows for much simpler unwinding on error, and for there to be a
  single attempt at merging a VMA rather than having to possibly
  reattempt a merge based on potentially altered VMA state.

  Far more importantly, it prevents inappropriate manipulation of
  incompletely initialised VMA state, which is something that has been
  the cause of bugs and complexity in the past.

  The intent is to gradually deprecate f_op->mmap, and in that vein this
  series coverts the majority of file systems to using f_op->mmap_prepare.

  Prerequisite steps are taken - firstly ensuring all checks for mmap
  capabilities use the file_has_valid_mmap_hooks() helper rather than
  directly checking for f_op->mmap (which is now not a valid check) and
  secondly updating daxdev_mapping_supported() to not require a VMA
  parameter to allow ext4 and xfs to be converted.

  Commit bb666b7c27 ("mm: add mmap_prepare() compatibility layer for
  nested file systems") handles the nasty edge-case of nested file
  systems like overlayfs, which introduces a compatibility shim to allow
  f_op->mmap_prepare() to be invoked from an f_op->mmap() callback.

  This allows for nested filesystems to continue to function correctly
  with all file systems regardless of which callback is used. Once we
  finally convert all file systems, this shim can be removed.

  As a result, ecryptfs, fuse, and overlayfs remain unaltered so they
  can nest all other file systems.

  We additionally do not update resctl - as this requires an update to
  remap_pfn_range() (or an alternative to it) which we defer to a later
  series, equally we do not update cramfs which needs a mixed mapping
  insertion with the same issue, nor do we update procfs, hugetlbfs,
  syfs or kernfs all of which require VMAs for internal state and hooks.
  We shall return to all of these later"

* tag 'vfs-6.17-rc1.mmap_prepare' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  doc: update porting, vfs documentation to describe mmap_prepare()
  fs: replace mmap hook with .mmap_prepare for simple mappings
  fs: convert most other generic_file_*mmap() users to .mmap_prepare()
  fs: convert simple use of generic_file_*_mmap() to .mmap_prepare()
  mm/filemap: introduce generic_file_*_mmap_prepare() helpers
  fs/xfs: transition from deprecated .mmap hook to .mmap_prepare
  fs/ext4: transition from deprecated .mmap hook to .mmap_prepare
  fs/dax: make it possible to check dev dax support without a VMA
  fs: consistently use can_mmap_file() helper
  mm/nommu: use file_has_valid_mmap_hooks() helper
  mm: rename call_mmap/mmap_prepare to vfs_mmap/mmap_prepare
2025-07-28 13:43:25 -07:00
Linus Torvalds
7879d7aff0 vfs-6.17-rc1.misc
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaIM/KwAKCRCRxhvAZXjc
 opT+AP407JwhRSBjUEmHg5JzUyDoivkOySdnthunRjaBKD8rlgEApM6SOIZYucU7
 cPC3ZY6ORFM6Mwaw+iDW9lasM5ucHQ8=
 =CHha
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.17-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull misc VFS updates from Christian Brauner:
 "This contains the usual selections of misc updates for this cycle.

  Features:

   - Add ext4 IOCB_DONTCACHE support

     This refactors the address_space_operations write_begin() and
     write_end() callbacks to take const struct kiocb * as their first
     argument, allowing IOCB flags such as IOCB_DONTCACHE to propagate
     to the filesystem's buffered I/O path.

     Ext4 is updated to implement handling of the IOCB_DONTCACHE flag
     and advertises support via the FOP_DONTCACHE file operation flag.

     Additionally, the i915 driver's shmem write paths are updated to
     bypass the legacy write_begin/write_end interface in favor of
     directly calling write_iter() with a constructed synchronous kiocb.
     Another i915 change replaces a manual write loop with
     kernel_write() during GEM shmem object creation.

  Cleanups:

   - don't duplicate vfs_open() in kernel_file_open()

   - proc_fd_getattr(): don't bother with S_ISDIR() check

   - fs/ecryptfs: replace snprintf with sysfs_emit in show function

   - vfs: Remove unnecessary list_for_each_entry_safe() from
     evict_inodes()

   - filelock: add new locks_wake_up_waiter() helper

   - fs: Remove three arguments from block_write_end()

   - VFS: change old_dir and new_dir in struct renamedata to dentrys

   - netfs: Remove unused declaration netfs_queue_write_request()

  Fixes:

   - eventpoll: Fix semi-unbounded recursion

   - eventpoll: fix sphinx documentation build warning

   - fs/read_write: Fix spelling typo

   - fs: annotate data race between poll_schedule_timeout() and
     pollwake()

   - fs/pipe: set FMODE_NOWAIT in create_pipe_files()

   - docs/vfs: update references to i_mutex to i_rwsem

   - fs/buffer: remove comment about hard sectorsize

   - fs/buffer: remove the min and max limit checks in __getblk_slow()

   - fs/libfs: don't assume blocksize <= PAGE_SIZE in
     generic_check_addressable

   - fs_context: fix parameter name in infofc() macro

   - fs: Prevent file descriptor table allocations exceeding INT_MAX"

* tag 'vfs-6.17-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (24 commits)
  netfs: Remove unused declaration netfs_queue_write_request()
  eventpoll: fix sphinx documentation build warning
  ext4: support uncached buffered I/O
  mm/pagemap: add write_begin_get_folio() helper function
  fs: change write_begin/write_end interface to take struct kiocb *
  drm/i915: Refactor shmem_pwrite() to use kiocb and write_iter
  drm/i915: Use kernel_write() in shmem object create
  eventpoll: Fix semi-unbounded recursion
  vfs: Remove unnecessary list_for_each_entry_safe() from evict_inodes()
  fs/libfs: don't assume blocksize <= PAGE_SIZE in generic_check_addressable
  fs/buffer: remove the min and max limit checks in __getblk_slow()
  fs: Prevent file descriptor table allocations exceeding INT_MAX
  fs: Remove three arguments from block_write_end()
  fs/ecryptfs: replace snprintf with sysfs_emit in show function
  fs: annotate suspected data race between poll_schedule_timeout() and pollwake()
  docs/vfs: update references to i_mutex to i_rwsem
  fs/buffer: remove comment about hard sectorsize
  fs_context: fix parameter name in infofc() macro
  VFS: change old_dir and new_dir in struct renamedata to dentrys
  proc_fd_getattr(): don't bother with S_ISDIR() check
  ...
2025-07-28 11:22:56 -07:00
Kent Overstreet
c37495fe35 bcachefs: Add missing snapshots_seen_add_inorder()
This fixes an infinite loop when repairing "extent past end of inode",
when the extent is an older snapshot than the inode that needs repair.

Without the snaphsots_seen_add_inorder() we keep trying to delete the
same extent, even though it's no longer visible in the inode's snapshot.

Fixes: 63d6e93119 ("bcachefs: bch2_fpunch_snapshot()")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-24 22:56:37 -04:00
Kent Overstreet
1831840c2b bcachefs: Fix write buffer flushing from open journal entry
When flushing the btree write buffer, we pull write buffer keys directly
from the journal instead of letting the journal write path copy them to
the write buffer.

When flushing from the currently open journal buffer, we have to block
new reservations and wait for outstanding reservations to complete.

Recheck the reservation state after blocking new reservations:
previously, we were checking the reservation count from before calling
__journal_block().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-24 22:56:37 -04:00
Kent Overstreet
eef9170700 bcachefs: btree_node_scan: don't re-read before initializing found_btree_node
If the btree node is encrypted, this caused us to initialize
found_btree_node from the encrypted header.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-20 13:26:08 -04:00
Kent Overstreet
89edfcf710 bcachefs: Fix bch2_maybe_casefold() when CONFIG_UTF8=n
maybe_casefold() shouldn't have been nooped, just bch2_casefold().

Fixes: 94426e4201 ("bcachefs: opts.casefold_disabled")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-16 17:32:33 -04:00
Kent Overstreet
40c35a0b47 bcachefs: Fix build when CONFIG_UNICODE=n
94426e4201, which added the killswitch for casefolding, accidentally
removed some of the ifdefs we need to avoid build errors.

It appears we need better build testing for different configurations, it
took two weeks for the robots to catch this one.

Fixes: 94426e4201 ("bcachefs: opts.casefold_disabled")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-16 17:32:33 -04:00
Kent Overstreet
c02b943f7d bcachefs: Fix reference to invalid bucket in copygc
Use bch2_dev_bucket_tryget() instead of bch2_dev_tryget() before
checking the bucket bitmap.

Reported-by: syzbot+3168625f36f4a539237e@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-16 17:32:33 -04:00
Kent Overstreet
9fe8ec8664 bcachefs: Don't build aux search tree when still repairing node
bch2_btree_node_drop_keys_outside_node() will (re)build aux search
trees, because it's also called by topology repair.

bch2_btree_node_read_done() was calling it before validating individual
keys; invalid ones have to be dropped.

If we call drop_keys_outside_node() first, then
bch2_bset_build_aux_tree() doesn't run because the node already has an
aux search tree - which was invalidated by the repair.

Reported-by: syzbot+c5e7a66b3b23ae65d44f@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-16 17:32:33 -04:00
Kent Overstreet
6a1c4323de bcachefs: Tweak threshold for allocator triggering discards
The allocator path has a "if we're really low on free buckets, check if
we should issue discards" - tweak this to also trigger discards if more
than 1/128th of the device is in need_discard state.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-16 17:32:33 -04:00
Kent Overstreet
b4d6e204f8 bcachefs: Fix triggering of discard by the journal path
It becomes possible to do discards after a journal flush, which
naturally the journal code is reponsible for.

A prior refactoring seems to have broken this - which went unnoticed
because the foreground allocator path can also trigger discards.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-16 17:32:33 -04:00
Taotao Chen
e9d8e2bf23
fs: change write_begin/write_end interface to take struct kiocb *
Change the address_space_operations callbacks write_begin() and
write_end() to take struct kiocb * as the first argument instead of
struct file *.

Update all affected function prototypes, implementations, call sites,
and related documentation across VFS, filesystems, and block layer.

Part of a series refactoring address_space_operations write_begin and
write_end callbacks to use struct kiocb for passing write context and
flags.

Signed-off-by: Taotao Chen <chentaotao@didiglobal.com>
Link: https://lore.kernel.org/20250716093559.217344-4-chentaotao@didiglobal.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-16 14:48:18 +02:00
Kent Overstreet
30792947c6 bcachefs: io_read: remove from async obj list in rbio_done()
Previously, only split rbios allocated in io_read.c would be removed
from the async obj list.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-13 17:45:13 -04:00
Kent Overstreet
fec5e6f97d bcachefs: Don't set BCH_FS_error on transaction restart
This started showing up more when we started logging the error being
corrected in the journal - but __bch2_fsck_err() could return
transaction restarts before that.

Setting BCH_FS_error incorrectly causes recovery passes to not be
cleared, among other issues.

Fixes: b43f724927 ("bcachefs: Log fsck errors in the journal")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-08 15:24:15 -04:00
Kent Overstreet
74f3931a1b bcachefs: Fix additional misalignment in journal space calculations
Additional fix on top of

f54b2a80d0 bcachefs: Fix misaligned bucket check in journal space calculations

Make sure that when we calculate space for the next entry it's not
misaligned: we need to round_down() to filesystem block size in multiple
places (next entry size calculation as well as total space available).

Reported-by: Ondřej Kraus <neverberlerfellerer@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-07 18:19:30 -04:00
Kent Overstreet
7de3c8b407 bcachefs: Don't schedule non persistent passes persistently
if (!(in_recovery && (flags & RUN_RECOVERY_PASS_nopersistent)))

should have been

  if (!in_recovery && !(flags & RUN_RECOVERY_PASS_nopersistent)))

But the !in_recovery part was also wrong: the assumption is that if
we're in recovery we'll just rewind and run the recovery pass
immediately, but we're not able to do so if we've already gone RW and
the pass must be run before we go RW. In that case, we need to schedule
it in the superblock so it can be run on the next mount attempt.

Scheduling it persistently is fine, because it'll be cleared in the
superblock immediately when the pass completes successfully.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-07 14:10:47 -04:00
Kent Overstreet
63a83463d2 bcachefs: Fix bch2_btree_transactions_read() synchronization
Since we're accessing btree_trans objects owned by another thread, we
need to guard against using pointers to freed key cache entries: we need
our own srcu read lock, and we should skip a btree_trans if it didn't
hold the srcu lock (and thus it might have pointers to freed key cache
entries).

00693 Mem abort info:
00693   ESR = 0x0000000096000005
00693   EC = 0x25: DABT (current EL), IL = 32 bits
00693   SET = 0, FnV = 0
00693   EA = 0, S1PTW = 0
00693   FSC = 0x05: level 1 translation fault
00693 Data abort info:
00693   ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000
00693   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
00693   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
00693 user pgtable: 4k pages, 39-bit VAs, pgdp=000000012e650000
00693 [000000008fb96218] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000
00693 Internal error: Oops: 0000000096000005 [#1]  SMP
00693 Modules linked in:
00693 CPU: 0 UID: 0 PID: 4307 Comm: cat Not tainted 6.16.0-rc2-ktest-g9e15af94fd86 #27578 NONE
00693 Hardware name: linux,dummy-virt (DT)
00693 pstate: 60001005 (nZCv daif -PAN -UAO -TCO -DIT +SSBS BTYPE=--)
00693 pc : six_lock_counts+0x20/0xe8
00693 lr : bch2_btree_bkey_cached_common_to_text+0x38/0x130
00693 sp : ffffff80ca98bb60
00693 x29: ffffff80ca98bb60 x28: 000000008fb96200 x27: 0000000000000007
00693 x26: ffffff80eafd06b8 x25: 0000000000000000 x24: ffffffc080d75a60
00693 x23: ffffff80eafd0000 x22: ffffffc080bdfcc0 x21: ffffff80eafd0210
00693 x20: ffffff80c192ff08 x19: 000000008fb96200 x18: 00000000ffffffff
00693 x17: 0000000000000000 x16: 0000000000000000 x15: 00000000ffffffff
00693 x14: 0000000000000000 x13: ffffff80ceb5a29a x12: 20796220646c6568
00693 x11: 72205d3e303c5b20 x10: 0000000000000020 x9 : ffffffc0805fb6b0
00693 x8 : 0000000000000020 x7 : 0000000000000000 x6 : 0000000000000020
00693 x5 : ffffff80ceb5a29c x4 : 0000000000000001 x3 : 000000000000029c
00693 x2 : 0000000000000000 x1 : ffffff80ef66c000 x0 : 000000008fb96200
00693 Call trace:
00693  six_lock_counts+0x20/0xe8 (P)
00693  bch2_btree_bkey_cached_common_to_text+0x38/0x130
00693  bch2_btree_trans_to_text+0x260/0x2a8
00693  bch2_btree_transactions_read+0xac/0x1e8
00693  full_proxy_read+0x74/0xd8
00693  vfs_read+0x90/0x300
00693  ksys_read+0x6c/0x108
00693  __arm64_sys_read+0x20/0x30
00693  invoke_syscall.constprop.0+0x54/0xe8
00693  do_el0_svc+0x44/0xc8
00693  el0_svc+0x18/0x58
00693  el0t_64_sync_handler+0x104/0x130
00693  el0t_64_sync+0x154/0x158
00693 Code: 910003fd f9423c22 f90017e2 d2800002 (f9400c01)
00693 ---[ end trace 0000000000000000 ]---

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-05 12:42:41 -04:00
Kent Overstreet
14dd95647e bcachefs: btree read retry fixes
Fix btree node read retries after validate errors:

__btree_err() is the wrong place to flag a topology error: that is done
by btree_lost_data().

Additionally, some calls to bch2_bkey_pick_read_device() were not
updated in the 6.16 rework for improved log messages; we were failing to
signal that we still had a retry.

Cc: Nikita Ofitserov <himikof@gmail.com>
Cc: Alan Huang <mmpgouride@gmail.com>
Reported-and-tested-by: Edoardo Codeglia <bcachefs@404.blue>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-05 12:42:41 -04:00
Kent Overstreet
a77ffbe34d bcachefs: btree node scan no longer uses btree cache
Previously, btree node scan used the btree node cache to check if btree
nodes were readable, but this is subject to interference from threads
scanning different devices trying to read the same node - and more
critically, nodes that we already attempted and failed to read before
kicking off scan.

Instead, we now allocate a 'struct btree' that does not live in the
btree node cache, and call bch2_btree_node_read_done() directly.

Cc: Nikita Ofitserov <himikof@gmail.com>
Reviewed-by: Nikita Ofitserov <himikof@gmail.com>
Reported-and-tested-by: Edoardo Codeglia <bcachefs@404.blue>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-05 12:42:41 -04:00
Kent Overstreet
c2b2c7d1da bcachefs: Tweak btree cache helpers for use by btree node scan
btree node scan needs to not use the btree node cache: that causes
interference from prior failed reads and parallel workers.

Instead we need to allocate btree nodes that don't live in the btree
cache, so that we can call bch2_btree_node_read_done() directly.

This patch tweaks the low level helpers so they don't touch the btree
cache lists.

Cc: Nikita Ofitserov <himikof@gmail.com>
Reviewed-by: Nikita Ofitserov <himikof@gmail.com>
Reported-and-tested-by: Edoardo Codeglia <bcachefs@404.blue>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-04 23:17:07 -04:00
Kent Overstreet
c72d628469 bcachefs: Fix btree for nonexistent tree depth
The fix for when we should increase tree depth in journal replay was
entirely bogus.

We should only increase the tree depth in journal replay when recovery
from btree node scan, and then only for keys found by btree node scan.

This needs additional work - we should be shooting down existing
interior node pointers when recovery from scan, they shouldn't be
showing up here.

Fixes: b47a82ff47 ("bcachefs: Only run 'increase_depth' for keys from btree node csan")
Cc: Alan Huang <mmpgouride@gmail.com>
Reported-by: syzbot+8deb6ff4415db67a9f18@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-04 15:47:13 -04:00
Kent Overstreet
ddb9680a72 bcachefs: Fix bch2_io_failures_to_text()
This wasn't updated when we added tracking for btree validate errors.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-04 15:47:13 -04:00
Kent Overstreet
63d6e93119 bcachefs: bch2_fpunch_snapshot()
Add a new version of fpunch for operating on a snapshot ID, not a
subvolume - and use it for "extent past end of inode" repair.

Previously, repair would try to delete everything at once, but deleting
too many extents at once can overflow the btree_trans bump allocator, as
well as causing other problems - the new helper properly uses
bch2_extent_trim_atomic().

Reported-and-tested-by: Edoardo Codeglia <bcachefs@404.blue>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-04 15:45:22 -04:00
Christian Brauner
ca115d7e75
tree-wide: s/struct fileattr/struct file_kattr/g
Now that we expose struct file_attr as our uapi struct rename all the
internal struct to struct file_kattr to clearly communicate that it is a
kernel internal struct. This is similar to struct mount_{k}attr and
others.

Link: https://lore.kernel.org/20250703-restlaufzeit-baurecht-9ed44552b481@brauner
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-04 16:14:39 +02:00
Kent Overstreet
94426e4201 bcachefs: opts.casefold_disabled
Add an option for completely disabling casefolding on a filesystem, as a
workaround for overlayfs.

This should only be needed as a temporary workaround, until the
overlayfs fix arrives.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-01 19:33:46 -04:00
Kent Overstreet
c6e8d51b37 bcachefs: Work around deadlock to btree node rewrites in journal replay
Don't mark btree nodes for rewrites, if they are or would be degraded,
if journal replay hasn't finished, to avoid a deadlock.

This is because btree node rewrites generate more updates for the
interior updates (alloc, backpointers), and if those updates touch
new nodes and generate more rewrites - we can only have so many interior
btree updates in flight before we deadlock on open_buckets.

The biggest cause is that we don't use the btree write buffer (for
the backpointer updates - this needs some real thought on locking in
order to fix.

The problem with this workaround (not doing the rewrite for degraded
nodes in journal replay) is that those degraded nodes persist, and we
don't want that (this is a real bug when a btree node write completes
with fewer replicas than we wanted and leaves a degraded node due to
device _removal_, i.e. the device went away mid write).

It's less of a bug here, but still a problem because we don't yet
have a way of tracking degraded data - we another index (all
extents/btree nodes, by replicas entry) in order to fix properly
(re-replicate degraded data at the earliest possible time).

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-01 19:33:46 -04:00
Alan Huang
fbf913cb72 bcachefs: Fix incorrect transaction restart handling
Reported-by: syzbot+cc7567f096079cb4146f@syzkaller.appspotmail.com
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-30 17:28:55 -04:00
Kent Overstreet
14da58521e bcachefs: fix btree_trans_peek_prev_journal()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-29 00:47:52 -04:00
Bharadwaj Raju
96de8f8520 bcachefs: mark invalid_btree_id autofix
Checking for invalid IDs was introduced in 9e7cfb35e2 ("bcachefs: Check for invalid btree IDs")
to prevent an invalid shift later, but since 1415265480 ("bcachefs: Bad btree roots are now autofix")
which made btree_root_bkey_invalid autofix, the fsck_err_on call didn't
do anything.

We can mark this err type (invalid_btree_id) autofix as well, so it gets
handled.

Reported-by: syzbot+029d1989099aa5ae3e89@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=029d1989099aa5ae3e89
Fixes: 1415265480 ("bcachefs: Bad btree roots are now autofix")

Signed-off-by: Bharadwaj Raju <bharadwaj.raju777@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-27 12:47:07 -04:00
Kent Overstreet
ef6fac0f9e bcachefs: Plumb correct ip to trans_relock_fail tracepoint
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-26 00:01:16 -04:00
Kent Overstreet
64b6a788bd bcachefs: Ensure we rewind to run recovery passes
Fix a 6.16 regression from the recovery pass rework, which introduced a
bug where calling bch2_run_explicit_recovery_pass() would only return
the error code to rewind recovery for the first call that scheduled that
recovery pass.

If the error code from the first call was swallowed (because it was
called by an asynchronous codepath), subsequent calls would go "ok, this
pass is already marked as needing to run" and return 0.

Fixing this ensures that check_topology bails out to run btree_node_scan
before doing any repair.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-26 00:01:16 -04:00
Kent Overstreet
3e72acb78b bcachefs: Ensure btree node scan runs before checking for scanned nodes
Previously, calling bch2_btree_has_scanned_nodes() when btree node
scan hadn't actually run would erroniously return false - causing us to
think a btree was entirely gone.

This fixes a 6.16 regression from moving the scheduling of btree node
scan out of bch2_btree_lost_data() (fixing the bug where we'd schedule
it persistently in the superblock) and only scheduling it when
check_toploogy() is asking for scanned btree nodes.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-26 00:01:16 -04:00
Kent Overstreet
1dcea07810 bcachefs: btree_root_unreadable_and_scan_found_nothing should not be autofix
Autofix is specified in btree_gc.c if it's not an important btree.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-26 00:01:16 -04:00
Kent Overstreet
1f8aede70d bcachefs: fix bch2_journal_keys_peek_prev_min() underflow
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-24 18:58:18 -04:00
Kent Overstreet
f5109c201c bcachefs: Use wait_on_allocator() when allocating journal
wait_on_allocator() emits debug info when we hang trying to allocate.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-24 18:16:01 -04:00
Kent Overstreet
865ad1dbf1 bcachefs: Check for bad write buffer key when moving from journal
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-24 15:48:00 -04:00
Alan Huang
5c4acbc8ce bcachefs: Don't unlock the trans if ret doesn't match BCH_ERR_operation_blocked
Reported-by: syzbot+d540192e763531d307ff@syzkaller.appspotmail.com
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-24 15:46:59 -04:00
Kent Overstreet
72c0d9cb0f bcachefs: Fix range in bch2_lookup_indirect_extent() error path
Before calling bch2_indirect_extent_missing_error(), we have to
calculate the missing range, which is the intersection of the reflink
pointer and the non-indirect-extent we found.

The calculation didn't take into account that the returned extent may
span the iter position, leading to an infinite loop when we
(unnecessarily) resized the extent we were returning to one that didn't
extend past the offset we were looking up.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-22 00:29:03 -04:00
Kent Overstreet
abcb6bd4be bcachefs: fix spurious error_throw
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-22 00:29:03 -04:00
Kent Overstreet
bb378314ce bcachefs: Add missing bch2_err_class() to fileattr_set()
Make sure we return a standard error code.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-22 00:29:03 -04:00
Kent Overstreet
b2e2bed119 bcachefs: Add missing key type checks to check_snapshot_exists()
For now we only have one key type in these btrees, but forward
compatibility means we do have to check.

Reported-by: syzbot+b4cb4a6988aced0cec4b@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-19 14:37:04 -04:00
Kent Overstreet
32a01cd433 bcachefs: Don't log fsck err in the journal if doing repair elsewhere
This fixes exceeding the bump allocator limit when the allocator finds
many buckets that need repair - they're repaired asynchronously, which
means that every error logged a message in the bump allocator, without
committing.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-19 13:08:07 -04:00
Kent Overstreet
b2348fe6c8 bcachefs: Fix *__bch2_trans_subbuf_alloc() error path
Don't change buf->size on error - this would usually be a transaction
restart, but it could also be -ENOMEM - when we've exceeded the bump
allocator max).

Fixes: 247abee6ae ("bcachefs: btree_trans_subbuf")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-19 13:08:06 -04:00
Lorenzo Stoakes
2e3b37a7e4
fs: replace mmap hook with .mmap_prepare for simple mappings
Since commit c84bf6dd2b ("mm: introduce new .mmap_prepare() file
callback"), the f_op->mmap() hook has been deprecated in favour of
f_op->mmap_prepare().

This callback is invoked in the mmap() logic far earlier, so error handling
can be performed more safely without complicated and bug-prone state
unwinding required should an error arise.

This hook also avoids passing a pointer to a not-yet-correctly-established
VMA avoiding any issues with referencing this data structure.

It rather provides a pointer to the new struct vm_area_desc descriptor type
which contains all required state and allows easy setting of required
parameters without any consideration needing to be paid to locking or
reference counts.

Note that nested filesystems like overlayfs are compatible with an
.mmap_prepare() callback since commit bb666b7c27 ("mm: add mmap_prepare()
compatibility layer for nested file systems").

In this patch we apply this change to file systems with relatively simple
mmap() hook logic - exfat, ceph, f2fs, bcachefs, zonefs, btrfs, ocfs2,
orangefs, nilfs2, romfs, ramfs and aio.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Link: https://lore.kernel.org/f528ac4f35b9378931bd800920fee53fc0c5c74d.1750099179.git.lorenzo.stoakes@oracle.com
Acked-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-19 13:56:59 +02:00
Kent Overstreet
434635987f bcachefs: Fix missing newlines before ero
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-17 20:45:27 -04:00
Kent Overstreet
88bd771191 bcachefs: fix spurious error in read_btree_roots()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-17 20:45:26 -04:00
Kent Overstreet
1df310860a bcachefs: fsck: Fix oops in key_visible_in_snapshot()
The normal fsck code doesn't call key_visible_in_snapshot() with an
empty list of snapshot IDs seen (the current snapshot ID will always be
on the list), but str_hash_repair_key() ->
bch2_get_snapshot_overwrites() can, and that's totally fine as long as
we check for it.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-17 20:45:26 -04:00
Kent Overstreet
3f890768da bcachefs: fsck: fix unhandled restart in topology repair
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-17 20:45:26 -04:00
Kent Overstreet
bbc3a0b17a bcachefs: fsck: Fix check_directory_structure when no check_dirents
check_directory_structure runs after check_dirents, so it expects that
it won't see any inodes with missing backpointers - normally.

But online fsck can't run check_dirents yet, or the user might only be
running a specific pass, so we need to be careful that this isn't an
error. If an inode is unreachable, that's handled by a separate pass.

Also, add a new 'bch2_inode_has_backpointer()' helper, since we were
doing this inconsistently.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-17 13:35:19 -04:00