Commit Graph

14440 Commits

Author SHA1 Message Date
David Sterba
64b8c3851f btrfs: rename error to ret in btrfs_mksubvol()
Unify naming of return value to the preferred way.

Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:58:03 +02:00
David Sterba
bfa13b82cc btrfs: rename error to ret in btrfs_may_delete()
Unify naming of return value to the preferred way.

Reviewed-by: Daniel Vacek <neelx@suse.com>yy
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:58:03 +02:00
Filipe Manana
2abd9e1c58 btrfs: cache if we are using free space bitmaps for a block group
Every time we add free space to the free space tree or we remove free
space from the free space tree, we do a lookup for the block group's free
space info item in the free space tree. This takes time, navigating the
btree and we may block either on IO when reading extent buffers from disk
or on extent buffer lock contention due to concurrency.

Instead of doing this lookup every time, cache the result in the block
structure and use it after the first lookup. This adds two boolean members
to the block group structure but doesn't increase the structure's size.

The following script that runs fs_mark was used to measure the time spent
on run_delayed_tree_ref(), since down that call chain we have calls to
add and remove free space to/from the free space tree (calls to
btrfs_add_to_free_space_tree() and btrfs_remove_from_free_space_tree()):

  $ cat test.sh
  #!/bin/bash

  DEV=/dev/nullb0
  MNT=/mnt
  FILES=100000
  THREADS=$(nproc --all)

  echo "performance" | \
      tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

  umount $DEV &> /dev/null
  mkfs.btrfs -f $DEV
  mount -o ssd $DEV $MNT

  OPTS="-S 0 -L 5 -n $FILES -s 0 -t $THREADS -k"
  for ((i = 1; i <= $THREADS; i++)); do
      OPTS="$OPTS -d $MNT/d$i"
  done

  fs_mark $OPTS

  umount $MNT

This is a heavy metadata test as it's exercising only file creation, so a
lot of allocations of metadata extents, creating delayed refs for adding
new metadata extents and dropping existing ones due to COW. The results
of the times it took to execute run_delayed_tree_ref(), in nanoseconds,
are the following.

Before this change:

  Range: 1868.000 - 6482857.000; Mean: 10231.430; Median: 7005.000; Stddev: 27993.173
  Percentiles:  90th: 13342.000; 95th: 23279.000; 99th: 82448.000
     1868.000 -    4222.038:  270696 ############
     4222.038 -    9541.029: 1201327 #####################################################
     9541.029 -   21559.383:  385436 #################
    21559.383 -   48715.063:   64942 ###
    48715.063 -  110073.800:   31454 #
   110073.800 -  248714.944:    8218 |
   248714.944 -  561977.042:    1030 |
   561977.042 - 1269798.254:     295 |
  1269798.254 - 2869132.711:     116 |
  2869132.711 - 6482857.000:      28 |

After this change:

  Range: 1554.000 - 4557014.000; Mean: 9168.164; Median: 6391.000; Stddev: 21467.060
  Percentiles:  90th: 12478.000; 95th: 20964.000; 99th: 72234.000
     1554.000 -    3453.820: 219004 ############
     3453.820 -    7674.743: 980645 #####################################################
     7674.743 -   17052.574: 552486 ##############################
    17052.574 -   37887.762:  68558 ####
    37887.762 -   84178.322:  31557 ##
    84178.322 -  187024.331:  12102 #
   187024.331 -  415522.355:   1364 |
   415522.355 -  923187.626:    256 |
   923187.626 - 2051092.468:    125 |
  2051092.468 - 4557014.000:     21 |

Approximate improvement in the first four buckets is about 20%.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:58:03 +02:00
Filipe Manana
fdeffeb4f5 btrfs: add and use helper to determine if using bitmaps in free space tree
When adding and removing free space to the free space tree, we need to
lookup the respective block group's free info item in the free space tree,
check its flags for the BTRFS_FREE_SPACE_USING_BITMAPS bit and then
release the path.

Move these steps into a helper function and use it in both sites.
This will also help avoiding duplicate code in a subsequent change.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:58:03 +02:00
Filipe Manana
d1ac35ae2a btrfs: use fs_info from local variable in btrfs_convert_free_space_to_extents()
There's no need to dereference the block group to extract fs_info as we
have already stored fs_info in a local variable. So use the local variable
instead.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:58:03 +02:00
Filipe Manana
497c726ff8 btrfs: avoid double slot decrement at btrfs_convert_free_space_to_extents()
There's no need to subtract 1 from path->slots[0] and then decrement the
slot, we can reduce the generated assembly code by decrementing the slot
and then use it.

Module size before:

  $ size fs/btrfs/btrfs.ko
     text	   data	    bss	    dec	    hex	filename
  1846220	 162998	  16136	2025354	 1ee78a	fs/btrfs/btrfs.ko

Module size after:

  $ size fs/btrfs/btrfs.ko
     text	   data	    bss	    dec	    hex	filename
  1846204	 162998	  16136	2025338	 1ee77a	fs/btrfs/btrfs.ko

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:58:02 +02:00
Filipe Manana
a8da443c9b btrfs: turn remove argument of modify_free_space_bitmap() to boolean
The argument is used as a boolean, so switch its type from int to bool.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:58:02 +02:00
Filipe Manana
3887067f55 btrfs: rename free_space_set_bits() and make it less confusing
The free_space_set_bits() is used both to set a range of bits or to clear
range of bits, depending on the 'bit' argument value. So the name is very
misleading since it's not used only to set bits. Furthermore the 'bit'
argument is an integer when a boolean is all that is needed plus its name
is singular, which gives the idea the function operates on a single bit
and not on a range of bits.

So rename the function to free_space_modify_bits(), rename the 'bit'
argument to 'set_bits' and turn the argument to bool instead of int.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:58:02 +02:00
Filipe Manana
6fc5ef7829 btrfs: add btrfs prefix to free space tree exported functions
A few of the free space tree exported functions have a 'btrfs_' prefix in
their name, but most don't. Not only is this inconsistent, the preferred
style is to have such a prefix, to avoid potential collisions in the
future with other kernel code and offer a somewhat better readibility by
making it obvious in calls sites that we are calling btrfs specific code.

So add the 'btrfs_' prefix to all free space tree functions that are
missing it.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:58:02 +02:00
Filipe Manana
8bfa3727ea btrfs: remove pointless out label from load_free_space_extents()
All we do under the label is to return, so there's no point in having it,
just return directly whenever we get an error.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:58:02 +02:00
Filipe Manana
b7db594bc2 btrfs: remove pointless out label from load_free_space_bitmaps()
All we do under the label is to return, so there's no point in having it,
just return directly whenever we get an error.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:58:02 +02:00
Filipe Manana
5801a749a9 btrfs: remove pointless out label from add_free_space_extent()
All we do under the label is to return, so there's no point in having it,
just return directly whenever we get an error.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:58:02 +02:00
Filipe Manana
e3ecf6f164 btrfs: remove pointless out label from remove_free_space_extent()
All we do under the label is to return, so there's no point in having it,
just return directly whenever we get an error.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:58:02 +02:00
Filipe Manana
ffb7068f16 btrfs: remove pointless out label from modify_free_space_bitmap()
All we do under the label is to return, so there's no point in having it,
just return directly whenever we get an error.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:58:02 +02:00
Filipe Manana
22b609768c btrfs: make free_space_test_bit() return a boolean instead
The function returns the result of another function that returns a boolean
(extent_buffer_test_bit()), and all the callers need is a boolean an not
an integer. So change its return type from int to bool, and modify the
callers to store results in booleans instead of integers, which also makes
them simpler.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:58:02 +02:00
Filipe Manana
790b88c4dd btrfs: make extent_buffer_test_bit() return a boolean instead
All the callers want is to determine if a bit is set and all of them call
the function and do a double negation (!!) on its result to get a boolean.
So change it to return a boolean and simplify callers.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:58:02 +02:00
Filipe Manana
e4e5fcbc62 btrfs: remove pointless out label from update_free_space_extent_count()
Just return directly, we don't need the label since all we do under it is
to return.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:58:01 +02:00
Filipe Manana
61b43a9374 btrfs: remove pointless out label from add_new_free_space_info()
We can just return directly if btrfs_insert_empty_item() fails, there is
no need to release the path before returning, as all callers (or upper
in the call chain) will free the path if they get an error from the call
to add_new_free_space_info(), which is also a common pattern everywhere
in btrfs. Finally there's no need to set 'ret' to 0 if the call to
btrfs_insert_empty_item() didn't fail, since 'ret' is already 0.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:58:01 +02:00
David Sterba
44892c5a3e btrfs: tree-log: add and rename extent bits for dirty_log_pages tree
The dirty_log_pages tree is used for tree logging and marks extents
based on log_transid. The bits could be renamed to resemble the
LOG1/LOG2 naming used for the BTRFS_FS_LOG1_ERR bits.

The DIRTY bit is renamed to LOG1 and NEW to LOG2.

Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:58:01 +02:00
David Sterba
55cd57faa5 btrfs: use folio_end() where appropriate
Simplify folio_pos() + folio_size() and use the new helper.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:58:01 +02:00
David Sterba
89a3cc19e4 btrfs: add helper folio_end()
There are several cases of folio_pos + folio_size, add a convenience
helper for that. This is a local helper and not proposed as folio API
because it does not seem to be heavily used elsewhere:

A quick grep (folio_size + folio_end) in fs/ shows

     24 btrfs
      4 iomap
      4 ext4
      2 xfs
      2 netfs
      1 gfs2
      1 f2fs
      1 bcachefs
      1 buffer.c

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:58:01 +02:00
David Sterba
d549391fc6 btrfs: rename variables for locked range in defrag_prepare_one_folio()
In preparation to use a helper for folio_pos + folio_size, rename the
variables for the locked range so they don't use the 'folio_' prefix. As
the locking ranges take inclusive end of the range (hence the "-1") this
would be confusing as the folio helpers typically use exclusive end of
the range.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:58:01 +02:00
David Sterba
e47c8a4767 btrfs: simplify range end calculations in truncate_block_zero_beyond_eof()
The way zero_end is calculated and used does a -1 and +1 that
effectively cancel out, so this can be simplified. This is also
preparatory patch for using a helper for folio_pos + folio_size with the
semantics of exclusive end of the range.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:58:01 +02:00
Filipe Manana
bdd01fb036 btrfs: check BLOCK_GROUP_FLAG_NEEDS_FREE_SPACE at __add_block_group_free_space()
Every caller of __add_block_group_free_space() is checking if the flag
BLOCK_GROUP_FLAG_NEEDS_FREE_SPACE is set before calling it. This is
duplicate code and it's prone to some mistake in case we add more callers
in the future. So move the check for that flag into the start of
__add_block_group_free_space(), and, as a consequence, the path allocation
from add_block_group_free_space() is moved into
__add_block_group_free_space(), to preserve the behaviour of allocating
a path only if the flag BLOCK_GROUP_FLAG_NEEDS_FREE_SPACE is set.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:58:01 +02:00
Filipe Manana
1f06c942aa btrfs: always abort transaction on failure to add block group to free space tree
Only one of the callers of __add_block_group_free_space() aborts the
transaction if the call fails, while the others don't do it and it's
either never done up the call chain or much higher in the call chain.

So make sure we abort the transaction at __add_block_group_free_space()
if it fails, which brings a couple benefits:

1) If some call chain never aborts the transaction, we avoid having some
   metadata inconsistency because BLOCK_GROUP_FLAG_NEEDS_FREE_SPACE is
   cleared when we enter __add_block_group_free_space() and therefore
   __add_block_group_free_space() is never called again to add the block
   group items to the free space tree, since the function is only called
   when that flag is set in a block group;

2) If the call chain already aborts the transaction, then we get a better
   trace that points to the exact step from __add_block_group_free_space()
   which failed, which is better for analysis.

So abort the transaction at __add_block_group_free_space() if any of its
steps fails.

CC: stable@vger.kernel.org # 6.6+
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:57:56 +02:00
Qu Wenruo
936f0b49dc btrfs: add extra warning when qgroup is marked inconsistent
Unlike qgroup rescan, which always shows whether it cleared the
inconsistent flag, we do not have a proper way to show if qgroup is
marked inconsistent.

This was not a big deal before as there aren't that many locations that
can mark qgroup inconsistent.

But with the introduction of drop_subtree_threshold, qgroup can be
marked inconsistent very frequently, especially when dropping
subvolumes.

Although most user space tools relying on qgroup should do their own
checks and queue a rescan if needed, we have no idea when qgroup is
marked inconsistent, and this would be much harder to debug.

So this patch will add an extra warning (btrfs_warn_rl()) when the
qgroup flag is flipped into inconsistent for the first time.
And add extra reason why qgroup flips inconsistent.

This means we can move the error message immediately before
qgroup_inconsistent_warning() into that function.

For call sites without an obvious reason, or is a shared error handling,
output the function that failed and the error code instead.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:56:39 +02:00
David Sterba
b37532bffd btrfs: merge btrfs_printk_ratelimited() and its only caller
There's only one caller of btrfs_printk_ratelimited(), merge it there.

Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:56:39 +02:00
David Sterba
2f3f1ad7f1 btrfs: simplify debug print helpers without enabled printk
The btrfs_debug() helpers depend on various config options. In case of
"no_printk" we don't need to define a special helper that in the end
does nothing but validates the parameters. As the default build config
is to validate the parameters we can simplify it to let the debug
helpers expand to nothing and remove btrfs_no_printk_in_rcu().

To avoid warnings use fs_info and inline one variable in
extent_from_logical().

Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:56:39 +02:00
David Sterba
f9095103f2 btrfs: remove remaining unused message helpers
Remove the critical level message helpers as they're not used, the RCU
protection is provided by the plain helpers.

Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:56:39 +02:00
David Sterba
80f4fab544 btrfs: switch RCU helper versions to btrfs_debug()
The RCU protection is now done in the plain helpers, we can remove the
"_in_rcu" and "_rl_in_rcu".

Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:56:38 +02:00
David Sterba
2eac2ae8b2 btrfs: switch RCU helper versions to btrfs_info()
The RCU protection is now done in the plain helpers, we can remove the
"_in_rcu" and "_rl_in_rcu".

Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:56:38 +02:00
David Sterba
0fe04bf132 btrfs: switch RCU helper versions to btrfs_warn()
The RCU protection is now done in the plain helpers, we can remove the
"_in_rcu" and "_rl_in_rcu".

Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:56:38 +02:00
David Sterba
9db18fe3ac btrfs: switch RCU helper versions to btrfs_err()
The RCU protection is now done in the plain helpers, we can remove the
"_in_rcu" and "_rl_in_rcu".

Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:56:38 +02:00
David Sterba
0e26727a73 btrfs: switch all message helpers to be RCU safe
We have two versions of message helpers, one that provides RCU
protection around the call in case we need to dereference device name.
As messages are not performance critical we can set up the RCU
protection for all of them and drop the distinction for those where
device name is needed. This will lead to further simplifications.

Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:56:38 +02:00
David Sterba
4d4b489ef1 btrfs: remove unused levels of message helpers
We're using the following levels: crit, err, warn, info, debug. This
covers our needs and further specializations is not needed, so let's
remove emerg, alert and notice.

Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:56:38 +02:00
David Sterba
ee3af49a05 btrfs: remove unused rcu-string printk helpers
The RCU-string API has never taken off and we don't use the printk
helpers provided as we do the protection in our helpers. Remove the "in
RCU" wrappers.

Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:56:38 +02:00
David Sterba
d1d1c85427 btrfs: open code rcu_string_free() and remove it
The helper is trivial and we can simply use kfree_rcu() if needed. In
our case it's just one place where we rename a device in
device_list_add() and the old name can still be used until the end of
the RCU grace period. The other case is freeing a device and there
nothing should reach the device, so we can use plain kfree().

Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:56:38 +02:00
Johannes Thumshirn
694ce5e143 btrfs: zoned: reserve data_reloc block group on mount
Create a block group dedicated for data relocation on mount of a zoned
filesystem.

If there is already more than one empty DATA block group on mount, this
one is picked for the data relocation block group, instead of a newly
created one.

This is done to ensure, there is always space for performing garbage
collection and the filesystem is not hitting ENOSPC under heavy overwrite
workloads.

CC: stable@vger.kernel.org # 6.6+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:56:31 +02:00
David Sterba
f1f22dfbea btrfs: use btrfs_root_id() where not done yet
A few more remaining cases where we can use the helper.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:31 +02:00
David Sterba
918fb77073 btrfs: use btrfs_is_data_reloc_root() where not done yet
Two remaining cases where we can use the helper.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:31 +02:00
David Sterba
c6aeae86b9 btrfs: use on-stack variable for block reserve in btrfs_replace_file_extents()
We can avoid potential memory allocation failure in
btrfs_replace_file_extents() as the block reserve lifetime is limited to
the scope of the function. This requires +48 bytes on stack.

Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:31 +02:00
David Sterba
7ce22f62b2 btrfs: use on-stack variable for block reserve in btrfs_truncate()
We can avoid potential memory allocation failure in btrfs_truncate() as
the block reserve lifetime is limited to the scope of the function. This
requires +48 bytes on stack.

Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:31 +02:00
David Sterba
ec41c34547 btrfs: use on-stack variable for block reserve in btrfs_evict_inode()
We can avoid potential memory allocation failure in btrfs_evict_inode()
as the block reserve lifetime is limited to the scope of the function.
This requires +48 bytes on stack.

Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:31 +02:00
Sun YangKai
8811ace439 btrfs: update comment for xarray fields in struct btrfs_root
The inode_lock field of struct btrfs_root was removed in commit
e2844cce75c9e61("btrfs: remove inode_lock from struct btrfs_root and use
xarray locks") but the related comment haven't been updated.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Sun YangKai <sunk67188@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:31 +02:00
Qu Wenruo
cc38d178ff btrfs: enable large data folio support under CONFIG_BTRFS_EXPERIMENTAL
With all the preparation patches already merged, it's pretty easy to
enable large data folios:

- Remove the ASSERT() on folio size in btrfs_end_repair_bio()

- Add a helper to properly set the max folio order
  Currently due to several call sites that are fetching the bitmap
  content directly into an unsigned long, we can only support
  BITS_PER_LONG blocks for each bitmap.

- Call the helper when reading/creating an inode

The support has the following limitations:

- No large folios for data reloc inode
  The relocation code still requires page sized folio.
  But it's not that hot nor common compared to regular buffered ios.

  Will be improved in the future.

- Requires CONFIG_BTRFS_EXPERIMENTAL

- Will require all folio related operations to check if it needs the
  extra btrfs_subpage structure
  Now any folio larger than block size will need btrfs_subpage structure
  handling.

Unfortunately I do not have a physical machine for performance test, but
if everything goes like XFS/EXT4, it should mostly bring single digits
percentage performance improvement in the real world.

Although I believe there are still quite some optimizations to be done,
let's focus on testing the current large data folio support first.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:30 +02:00
Filipe Manana
b769777d92 btrfs: use refcount_t type for the extent buffer reference counter
Instead of using a bare atomic, use the refcount_t type, which despite
being a structure that contains only an atomic, has an API that checks
for underflows and other hazards. This doesn't change the size of the
extent_buffer structure.

This removes the need to do things like this:

    WARN_ON(atomic_read(&eb->refs) == 0);
    if (atomic_dec_and_test(&eb->refs)) {
        (...)
    }

And do just:

    if (refcount_dec_and_test(&eb->refs)) {
        (...)
    }

Since refcount_dec_and_test() already triggers a warning when we decrement
a ref count that has a value of 0 (or below zero).

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:30 +02:00
Filipe Manana
2697b61597 btrfs: add comment for optimization in free_extent_buffer()
There's this special atomic compare and exchange logic which serves to
avoid locking the extent buffers refs_lock spinlock and therefore reduce
lock contention, so add a comment to make it more obvious.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:30 +02:00
Filipe Manana
71c086b30d btrfs: reorganize logic at free_extent_buffer() for better readability
It's hard to read the logic to break out of the while loop since it's a
very long expression consisting of a logical or of two composite
expressions, each one composed by a logical and. Further each one is also
testing for the EXTENT_BUFFER_UNMAPPED bit, making it more verbose than
necessary.

So change from this:

    if ((!test_bit(EXTENT_BUFFER_UNMAPPED, &eb->bflags) && refs <= 3)
        || (test_bit(EXTENT_BUFFER_UNMAPPED, &eb->bflags) &&
            refs == 1))
       break;

To this:

    if (test_bit(EXTENT_BUFFER_UNMAPPED, &eb->bflags)) {
        if (refs == 1)
            break;
    } else if (refs <= 3) {
            break;
    }

At least on x86_64 using gcc 9.3.0, this doesn't change the object size.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:30 +02:00
Filipe Manana
41e4ea0bf5 btrfs: make btrfs_readdir_delayed_dir_index() return a bool instead
There's no need to return errors, all we do is return 1 or 0 depending
on whether we should or should not stop iterating over delayed dir
indexes. So change the function to return bool instead of an int.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:30 +02:00
Filipe Manana
4106eb9bda btrfs: make btrfs_should_delete_dir_index() return a bool instead
There's no need to return errors and we currently return 1 in case the
index should be deleted and 0 otherwise, so change the return type from
int to bool as this is a boolean function and it's more clear this way.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:30 +02:00
Filipe Manana
adc1ef55dc btrfs: add details to error messages at btrfs_delete_delayed_dir_index()
Update the error messages with:

1) Fix typo in the first one, deltiona -> deletion;

2) Remove redundant part of the first message, the part following the
   comma, and including all the useful information: root, inode, index
   and error value;

3) Update the second message to use more formal language (example 'error'
   instead of 'err'), , remove redundant part about "deletion tree of
   delayed node..." and print the relevant information in the same
   format and order as the first message, without the ugly opening
   parenthesis without a space separating from the previous word.
   This also makes the message similar in format to the one we have at
   btrfs_insert_delayed_dir_index().

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:30 +02:00
Filipe Manana
0187acef35 btrfs: make btrfs_delete_delayed_insertion_item() return a boolean
There's no need to return an integer as all we need to do is return true
or false to tell whether we deleted a delayed item or not. Also the logic
is inverted since we return 1 (true) if we didn't delete and 0 (false) if
we did, which is somewhat counter intuitive. Change the return type to a
boolean and make it return true if we deleted and false otherwise.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:30 +02:00
Filipe Manana
7077d7b872 btrfs: switch del_all argument of replay_dir_deletes() from int to bool
The argument has boolean semantics, so change its type from int to bool,
making it more clear.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:30 +02:00
Filipe Manana
5f8882c854 btrfs: pass NULL index to btrfs_del_inode_ref() where not needed
There are two callers of btrfs_del_inode_ref() that declare a local index
variable and then pass a pointer for it to btrfs_del_inode_ref(), but then
don't use that index at all. Since btrfs_del_inode_ref() accepts a NULL
index pointer, pass NULL instead and stop declaring those useless index
variables.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:30 +02:00
Filipe Manana
93612a92ba btrfs: allocate scratch eb earlier at btrfs_log_new_name()
Instead of allocating the scratch eb after joining the log transaction,
allocate it before so that we're not delaying log commits for longer
than necessary, as allocating the scratch eb means allocating an
extent_buffer structure, which comes from a dedicated kmem_cache, plus
pages/folios to attach to the eb. Both of these allocations may take time
when we're under memory pressure.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:30 +02:00
Filipe Manana
841324a8e6 btrfs: allocate path earlier at btrfs_log_new_name()
Instead of allocating the path after joining the log transaction, allocate
it before so that we're not delaying log commits for the rare cases where
the allocation takes a significant time (under memory pressure and all
slabs are full, there's the need to allocate a new page, etc).

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:29 +02:00
Filipe Manana
b32efae7b8 btrfs: allocate path earlier at btrfs_del_dir_entries_in_log()
Instead of allocating the path after joining the log transaction, allocate
it before so that we're not delaying log commits for the rare cases where
the allocation takes a significant time (under memory pressure and all
slabs are full, there's the need to allocate a new page, etc).

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:29 +02:00
Filipe Manana
181436a85b btrfs: assert we join log transaction at btrfs_del_dir_entries_in_log()
We are supposed to be able to join a log transaction at that point, since
we have determined that the inode was logged in the current transaction
with the call to inode_logged(). So ASSERT() we joined a log transaction
and also warn if we didn't in case assertions are disabled (the kernel
config doesn't have CONFIG_BTRFS_ASSERT=y), so that the issue gets noticed
and reported if it ever happens.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:29 +02:00
Filipe Manana
1ed0cfc89e btrfs: use btrfs_del_item() at del_logged_dentry()
There's no need to use btrfs_delete_one_dir_name() at del_logged_dentry()
because we are processing a dir index key which can contain only a single
name, unlike dir item keys which can encode multiple names in case of name
hash collisions. We have explicitly looked up for a dir index key by
calling btrfs_lookup_dir_index_item() and we don't log dir item keys
anymore (since commit 339d035424 ("btrfs: only copy dir index keys when
logging a directory")). So simplify and use btrfs_del_item() directly
instead of btrfs_delete_one_dir_name().

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:29 +02:00
Filipe Manana
0ef4c6120e btrfs: free path sooner at __btrfs_unlink_inode()
After calling btrfs_delete_one_dir_name() there's no need for the path
anymore so we can free it immediately after. After that point we do
some btree operations that take time and in those call chains we end up
allocating paths for these operations, so we're unnecessarily holding on
to the path we allocated early at __btrfs_unlink_inode().

So free the path as soon as we don't need it and add a comment. This
also allows to simplify the error path, removing two exit labels and
returning directly when an error happens.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:29 +02:00
Filipe Manana
d94edb0d7e btrfs: assert we join log transaction at btrfs_del_inode_ref_in_log()
We are supposed to be able to join a log transaction at that point, since
we have determined that the inode was logged in the current transaction
with the call to inode_logged(). So ASSERT() we joined a log transaction
and also warn if we didn't in case assertions are disabled (the kernel
config doesn't have CONFIG_BTRFS_ASSERT=y), so that the issue gets noticed
and reported if it ever happens.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:29 +02:00
Al Viro
75764b41bf btrfs: open code fc_mount() to avoid releasing s_umount rw_sempahore
[CURRENT BEHAVIOR]
Currently inside btrfs_get_tree_subvol(), we call fc_mount() to grab a
tree, then re-lock s_umount inside btrfs_reconfigure_for_mount() to
avoid race with remount.

However fc_mount() itself is just doing two things:

1. Call vfs_get_tree()
2. Release s_umount then call vfs_create_mount()

[ENHANCEMENT]
Instead of calling fc_mount(), we can open-code it with vfs_get_tree()
first.
This provides a benefit that, since we have the full control of
s_umount, we do not need to re-lock that rw_sempahore when calling
btrfs_reconfigure_for_mount(), meaning less race between RO/RW remount.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Qu Wenruo <wqu@suse.com>
[ Rework the subject and commit message, refactor the error handling ]
Signed-off-by: Qu Wenruo <wqu@suse.com>
Tested-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:29 +02:00
David Sterba
4013cde56e btrfs: rename err to ret in scrub_submit_extent_sector_read()
Unify naming of return value to the preferred way.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:29 +02:00
David Sterba
56ccdd9af2 btrfs: rename err to ret in btrfs_create_common()
Unify naming of return value to the preferred way.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:29 +02:00
David Sterba
7d13ea864e btrfs: rename err to ret in btrfs_wait_tree_log_extents()
Unify naming of return value to the preferred way.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:29 +02:00
David Sterba
0b2cd9e2c7 btrfs: rename err to ret in btrfs_wait_extents()
Unify naming of return value to the preferred way.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:28 +02:00
David Sterba
69c5c6130d btrfs: rename err to ret in quota_override_store()
Unify naming of return value to the preferred way.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:28 +02:00
David Sterba
148961dac3 btrfs: rename err to ret in btrfs_fill_super()
Unify naming of return value to the preferred way.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:28 +02:00
David Sterba
60a8bab08c btrfs: rename err to ret in calc_pct_ratio()
Unify naming of return value to the preferred way.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:28 +02:00
David Sterba
3b5742f379 btrfs: rename err to ret in btrfs_symlink()
Unify naming of return value to the preferred way.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:28 +02:00
David Sterba
af6f6c3af7 btrfs: rename err to ret in btrfs_link()
Unify naming of return value to the preferred way.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:28 +02:00
David Sterba
9cf280e2bd btrfs: rename err to ret in btrfs_setattr()
Unify naming of return value to the preferred way.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:28 +02:00
David Sterba
b71a348513 btrfs: rename err to ret in btrfs_init_inode_security()
Unify naming of return value to the preferred way.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:28 +02:00
David Sterba
d64ef1d23f btrfs: rename err to ret in btrfs_alloc_from_bitmap()
Unify naming of return value to the preferred way.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:28 +02:00
David Sterba
8d9e877919 btrfs: rename err to ret in btrfs_lock_extent_bits()
Unify naming of return value to the preferred way.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:28 +02:00
David Sterba
886240cbcd btrfs: rename err to ret in btrfs_try_lock_extent_bits()
Unify naming of return value to the preferred way.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:27 +02:00
David Sterba
986b6aa185 btrfs: rename err to ret2 in btrfs_truncate_inode_items()
Unify naming of return value to the preferred way.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:27 +02:00
David Sterba
a579ddca43 btrfs: rename err to ret2 in btrfs_add_link()
Unify naming of return value to the preferred way.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:27 +02:00
David Sterba
8f38507068 btrfs: rename err to ret2 in btrfs_setsize()
Unify naming of return value to the preferred way.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:27 +02:00
David Sterba
df20be9f02 btrfs: rename err to ret2 in btrfs_search_old_slot()
Unify naming of return value to the preferred way, move the variable to
the closest scope.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:27 +02:00
David Sterba
644dcb4316 btrfs: rename err to ret2 in btrfs_search_slot()
Unify naming of return value to the preferred way, move the variable to
the closest scope.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:27 +02:00
David Sterba
56fc5b18c9 btrfs: rename err to ret2 in search_leaf()
Unify naming of return value to the preferred way.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:27 +02:00
David Sterba
58019c1dd4 btrfs: rename err to ret2 in read_block_for_search()
Unify naming of return value to the preferred way.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:27 +02:00
David Sterba
66ca7ea650 btrfs: rename err to ret2 in resolve_indirect_refs()
Unify naming of return value to the preferred way, move the variable to
the closest scope.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:27 +02:00
Qu Wenruo
582cd4bad4 btrfs: rename btrfs_subpage structure
With the incoming large data folios support, the structure name
btrfs_subpage is no longer correct, as for we can have multiple blocks
inside a large folio, and the block size is still page size.

So to follow the schema of iomap, rename btrfs_subpage to
btrfs_folio_state, along with involved enums.

There are still exported functions with "btrfs_subpage_" prefix, and I
believe for metadata the name "subpage" will stay forever as we will
never allocate a folio larger than nodesize anyway.

The full cleanup of the word "subpage" will happen in much smaller steps
in the future.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:27 +02:00
Qu Wenruo
1e17738d6b btrfs: add comments on the extra btrfs specific subpage bitmaps
Unlike the iomap_folio_state structure, the btrfs_subpage structure has a
lot of extra sub-bitmaps, namely:

- writeback sub-bitmap
- locked sub-bitmap
  iomap_folio_state uses an atomic for writeback tracking, while it has
  no per-block locked tracking.

  This is because iomap always locks a single folio, and submits dirty
  blocks with that folio locked.

  But btrfs has async delalloc ranges (for compression), which are queued
  with their range locked, until the compression is done, then marks the
  involved range writeback and unlocked.

  This means a range can be unlocked and marked writeback at seemingly
  random timing, thus it needs the extra tracking.

  This needs a huge rework on the lifespan of async delalloc range
  before we can remove/simplify these two sub-bitmaps.

- ordered sub-bitmap
- checked sub-bitmap
  These are for COW-fixup, but as I mentioned in the past, the COW-fixup
  is not really needed anymore and these two flags are already marked
  deprecated, and will be removed in the near future after comprehensive
  tests.

Add related comments to indicate we're actively trying to align the
sub-bitmaps to the iomap ones.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:27 +02:00
Daniel Vacek
3f093ccb95 btrfs: harden parsing of compression mount options
Btrfs happily but incorrectly accepts the `-o compress=zlib+foo` and similar
options with any random suffix.

Fix that by explicitly checking the end of the strings.

Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:26 +02:00
Daniel Vacek
3f0e865ae6 btrfs: factor out compression mount options parsing
There are many options making the parsing a bit lengthy.  Factor the
compress options out into a helper function.  The next patch is going to
harden this function.

Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:26 +02:00
David Sterba
ccb42a6eed btrfs: constify more pointer parameters
Another batch of pointer parameter constifications. This is for clarity
and minor addition to type safety. There are no observable effects in the
assembly code and .ko measured on release config.

Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:26 +02:00
Boris Burkov
c7f04fbc98 btrfs: sysfs: track current commit duration in commit_stats
When debugging/detecting outlier commit stalls, having an indicator that
we are currently in a long commit critical section can be very useful.
Extend the commit_stats sysfs file to also include the current commit
critical section duration.

Since this requires storing the last commit start time, use that rather
than a separate stack variable for storing the finished commit durations
as well.

This also requires slightly moving up the timing of the stats updating
to *inside* the critical section to avoid the transaction T+1 setting
the critical_section_start_time to 0 before transaction T can update its
stats, which would trigger the new ASSERT. This is an improvement in and
of itself, as it makes the stats more accurately represent the true
critical section time. It may be yet better to pull the stats up to where
start_transaction gets unblocked, rather than the next commit, but this
seems like a good enough place as well.

Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:26 +02:00
Pan Chuang
46d549928c btrfs: use rb_find_add() in rb_simple_insert()
Use the rb-tree helper so we don't open code the search and insert
code.

Signed-off-by: Pan Chuang <panchuang@vivo.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:26 +02:00
Pan Chuang
c52ea14d05 btrfs: pass struct rb_simple_node pointer directly in rb_simple_insert()
Replace struct embedding with union to enable safe type conversion in
btrfs_backref_node, tree_block and mapping_node.

Adjust function calls to use the new unified API, eliminating redundant
parameters.

Signed-off-by: Pan Chuang <panchuang@vivo.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:26 +02:00
Yangtao Li
fbec9a5d3e btrfs: use rb_find_add() in btrfs_qgroup_add_swapped_blocks()
Use the rb-tree helper so we don't open code the search and insert
code.

Signed-off-by: Yangtao Li <frank.li@vivo.com>
Signed-off-by: Pan Chuang <panchuang@vivo.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:26 +02:00
Yangtao Li
844e5f902d btrfs: use rb_find() in btrfs_qgroup_trace_subtree_after_cow()
Use the rb-tree helper so we don't open code the search code.

Signed-off-by: Yangtao Li <frank.li@vivo.com>
Signed-off-by: Pan Chuang <panchuang@vivo.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:26 +02:00
Yangtao Li
e3def6ce67 btrfs: use rb_find_add() in add_qgroup_rb()
Use the rb-tree helper so we don't open code the search and insert
code.

Signed-off-by: Yangtao Li <frank.li@vivo.com>
Signed-off-by: Pan Chuang <panchuang@vivo.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:26 +02:00
Yangtao Li
1e0f0239a3 btrfs: use rb_find() in find_qgroup_rb()
Use the rb-tree helper so we don't open code the search code.

Signed-off-by: Yangtao Li <frank.li@vivo.com>
Signed-off-by: Pan Chuang <panchuang@vivo.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:26 +02:00
Yangtao Li
287480e269 btrfs: use rb_find_add() in insert_ref_entry()
Use the rb-tree helper so we don't open code the search and insert
code.

Signed-off-by: Yangtao Li <frank.li@vivo.com>
Signed-off-by: Pan Chuang <panchuang@vivo.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:25 +02:00
Yangtao Li
c6e3ae8ac3 btrfs: use rb_find_add() in insert_root_entry()
Use the rb-tree helper so we don't open code the search and insert
code.

Signed-off-by: Yangtao Li <frank.li@vivo.com>
Signed-off-by: Pan Chuang <panchuang@vivo.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:25 +02:00
Yangtao Li
afaa9f8235 btrfs: use rb_find() in lookup_root_entry()
Use the rb-tree helper so we don't open code the search code.

Signed-off-by: Yangtao Li <frank.li@vivo.com>
Signed-off-by: Pan Chuang <panchuang@vivo.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:25 +02:00
Yangtao Li
3f60f4374a btrfs: use rb_find_add() in insert_block_entry()
Use the rb-tree helper so we don't open code the search and insert
code.

Signed-off-by: Yangtao Li <frank.li@vivo.com>
Signed-off-by: Pan Chuang <panchuang@vivo.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:25 +02:00
Yangtao Li
4044a7ed3b btrfs: use rb_find() in lookup_block_entry()
Use the rb-tree helper so we don't open code the search code.

Signed-off-by: Yangtao Li <frank.li@vivo.com>
Signed-off-by: Pan Chuang <panchuang@vivo.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:25 +02:00
Yangtao Li
b017a92bd9 btrfs: use rb_find_add() in ulist_rbtree_insert()
Use the rb-tree helper so we don't open code the search and insert
code.

Signed-off-by: Yangtao Li <frank.li@vivo.com>
Signed-off-by: Pan Chuang <panchuang@vivo.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:25 +02:00
Yangtao Li
c4f38e7ca5 btrfs: use rb_find() in ulist_rbtree_search()
Use the rb-tree helper so we don't open code the search code.

Signed-off-by: Yangtao Li <frank.li@vivo.com>
Signed-off-by: Pan Chuang <panchuang@vivo.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:25 +02:00
Yangtao Li
9734685854 btrfs: use rb_find() in __btrfs_lookup_delayed_item()
Use the rb-tree helper so we don't open code the search code.

Signed-off-by: Yangtao Li <frank.li@vivo.com>
Signed-off-by: Pan Chuang <panchuang@vivo.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:25 +02:00
Yangtao Li
7a91e01875 btrfs: use rb_find_add() in btrfs_insert_inode_defrag()
Use the rb-tree helper so we don't open code the search and insert
code.

Signed-off-by: Yangtao Li <frank.li@vivo.com>
Signed-off-by: Pan Chuang <panchuang@vivo.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:25 +02:00
Dan Johnson
06c3437f74 btrfs: fix comment in reserved space warning
mkfs.btrfs up to v4.14 actually can leave a chunk inside the reserved
space when invoked with `-m single`, fixed by 997f9977c24397eb6980bb9
("mkfs: Prevent temporary system chunk to use space in reserved 1M
range") released with v4.15.

Signed-off-by: Dan Johnson <ComputerDruid@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:25 +02:00
Daniel Vacek
d8f6cb2b28 btrfs: relocation: simplify unused logic related to LINK_LOWER
btrfs_backref_link_edge() is always called with the LINK_LOWER argument.
We can simplify it and remove the LINK_LOWER and LINK_UPPER macros
completely.

The last call with LINK_UPPER was removed with commit 0097422c0d
("btrfs: remove clone_backref_node() from relocation").

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:25 +02:00
Filipe Manana
593062f67b btrfs: unfold transaction abort at btrfs_insert_one_raid_extent()
We have a common error path where we abort the transaction, but like this
in case we get a transaction abort stack trace we don't know exactly which
previous function call failed. Instead abort the transaction after any
function call that returns an error, so that we can easily identify which
function failed.

Reviewed-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:24 +02:00
Filipe Manana
35bb03e57a btrfs: unfold transaction abort at __btrfs_update_delayed_inode()
We have a common error path where we abort the transaction, but like this
in case we get a transaction abort stack trace we don't know exactly which
previous function call failed. Instead abort the transaction after any
function call that returns an error, so that we can easily identify which
function failed.

Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:24 +02:00
Filipe Manana
33e8f24b52 btrfs: abort transaction on unexpected eb generation at btrfs_copy_root()
If we find an unexpected generation for the extent buffer we are cloning
at btrfs_copy_root(), we just WARN_ON() and don't error out and abort the
transaction, meaning we allow to persist metadata with an unexpected
generation. Instead of warning only, abort the transaction and return
-EUCLEAN.

CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:53:16 +02:00
Filipe Manana
273bbb5b48 btrfs: unfold transaction abort at btrfs_copy_root()
Instead of having a common btrfs_abort_transaction() call for when any of
the two btrfs_inc_ref() calls fail, move the btrfs_abort_transaction() to
happen immediately after each one of the calls, so that when analyzing a
stack trace with a transaction abort we know which call failed.

Reviewed-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:50:35 +02:00
David Sterba
b63c8c1ede btrfs: move transaction aborts to the error site in add_block_group_free_space()
Transaction aborts should be done next to the place the error happens,
which was not done in add_block_group_free_space().

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:50:34 +02:00
David Sterba
0b10f3dd13 btrfs: move transaction aborts to the error site in remove_block_group_free_space()
Transaction aborts should be done next to the place the error happens,
which was not done in remove_block_group_free_space().

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:50:34 +02:00
Filipe Manana
81bfd9d547 btrfs: simplify error detection flow during log replay
We have this fuzzy logic at btrfs_recover_log_trees() where we don't
abort the transaction and exit immediately after each function call that
returned an error, and instead have if-then-else logic or check if the
previous function call returned success before calling the next function.

Make the flow more straightforward by immediately aborting the transaction
and exiting after each function call failure. This also allows to avoid
two consecutive if statements that test the same conditions:

   if (!ret && wc.stage == LOG_WALK_REPLAY_ALL) {
        (...)
   }

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:50:34 +02:00
Filipe Manana
6466084df6 btrfs: remove redundant path release when replaying a log tree
There's no need to call btrfs_release_path() before calling
btrfs_init_root_free_objectid() as we have released the path already at
the top of the loop and the previous call to fixup_inode_link_counts()
also releases the path. So remove it to simplify the code.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:50:34 +02:00
Filipe Manana
2a5898c4aa btrfs: abort transaction during log replay if walk_log_tree() failed
If we failed walking a log tree during replay, we have a missing
transaction abort to prevent committing a transaction where we didn't
fully replay all the changes from a log tree and therefore can leave the
respective subvolume tree in some inconsistent state. So add the missing
transaction abort.

CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:50:28 +02:00
Filipe Manana
8f1e1b263d btrfs: unfold transaction aborts when replaying log trees
We have a single line doing a transaction abort in case either we got an
error from btrfs_get_fs_root() different from -ENOENT or we got an error
from btrfs_pin_extent_for_log_replay(), making it hard to figure out which
function call failed when looking at a transaction abort massages and
stack trace in dmesg. Change this to have an explicit transaction abort
for each one of the two cases.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:44:11 +02:00
Johannes Thumshirn
2a946bf6d6 btrfs: make btrfs_should_periodic_reclaim() static
btrfs_should_periodic_reclaim() is not used outside of space-info.c so
make it static and remove the prototype from space-info.h.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:44:11 +02:00
Johannes Thumshirn
55f7c65b2f btrfs: zoned: use filesystem size not disk size for reclaim decision
When deciding if a zoned filesystem is reaching the threshold to reclaim
data block groups, look at the size of the filesystem not to potentially
total available size of all drives in the filesystem.

Especially if a filesystem was created with mkfs' -b option, constraining
it to only a portion of the block device, the numbers won't match and
potentially garbage collection is kicking in too late.

Fixes: 3687fcb075 ("btrfs: zoned: make auto-reclaim less aggressive")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Tested-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:42:24 +02:00
Filipe Manana
f2de2b9ffd btrfs: unfold transaction abort at clone_copy_inline_extent()
We have a common error path where we abort the transaction, but like this
in case we get a transaction abort stack trace we don't know exactly which
previous function call failed. Instead abort the transaction after any
function call that returns an error, so that we can easily identify which
function failed.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 15:31:06 +02:00
Filipe Manana
5ff6050fcd btrfs: remove pointless 'out' label from clone_finish_inode_update()
The label is only used once and we can instead return directly where it's
used, besides the fact that all we do under the label is to return the
value of 'ret'. So get rid of the label and return directly.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 15:31:06 +02:00
Filipe Manana
5cf0e668ea btrfs: unfold transaction abort at walk_up_proc()
Instead of having a common btrfs_abort_transaction() call for when any of
the two btrfs_dec_ref() calls fail, move the btrfs_abort_transaction() to
happen immediately after each one of the calls, so that when analysing a
stack trace with a transaction abort we know which call failed.

Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 15:31:05 +02:00
Filipe Manana
227aa55fa2 btrfs: unfold transaction abort at __btrfs_inc_extent_ref()
Instead of having a common btrfs_abort_transaction() call for when either
insert_tree_block_ref() failed or when insert_extent_data_ref() failed,
move the btrfs_abort_transaction() to happen immediately after each one of
those calls, so that when analysing a stack trace with a transaction abort
we know which call failed.

Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 15:31:05 +02:00
Filipe Manana
3f757b56f1 btrfs: unfold transaction aborts at btrfs_create_new_inode()
Instead of having a common btrfs_abort_transaction() call for when either
btrfs_orphan_add() failed or when btrfs_add_link() failed, move the
btrfs_abort_transaction() to happen immediately after each one of those
calls, so that when analysing a stack trace with a transaction abort we
know which call failed.

Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 15:31:05 +02:00
Caleb Sander Mateos
9aad72b4e3 btrfs/ioctl: store btrfs_uring_encoded_data in io_btrfs_cmd
btrfs is the only user of struct io_uring_cmd_data and its op_data
field. Switch its ->uring_cmd() implementations to store the struct
btrfs_uring_encoded_data * in the struct io_btrfs_cmd, overlayed with
io_uring_cmd's pdu field. This avoids having to touch another cache line
to access the struct btrfs_uring_encoded_data *, and allows op_data and
struct io_uring_cmd_data to be removed.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Acked-by: David Sterba <dsterba@suse.com>
Link: https://lore.kernel.org/r/20250708202212.2851548-4-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-18 12:34:56 -06:00
Christian Brauner
ca115d7e75
tree-wide: s/struct fileattr/struct file_kattr/g
Now that we expose struct file_attr as our uapi struct rename all the
internal struct to struct file_kattr to clearly communicate that it is a
kernel internal struct. This is similar to struct mount_{k}attr and
others.

Link: https://lore.kernel.org/20250703-restlaufzeit-baurecht-9ed44552b481@brauner
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-04 16:14:39 +02:00
Linus Torvalds
4c06e63b92 for-6.16-rc4-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmhmwFQACgkQxWXV+ddt
 WDsVMA/+NuSth71V0AfiDnFyqjgDMqIlZL2+dqBiTYHXQQHKbqiUlKvYkWICCT6T
 1YgDV+95XJYy4TDBoA49Ndd/l+CiDcMLbOYeneIfbJy13ts84jVANPkl4n03gPkF
 ktibCw15h0MENVctTCPc71dX2X0cV9WPf4iDmoxUZiukDA376akGTArZKwH4tVVg
 4qVpzUtDdNOf848D+8DZKGd+ot/RWgEdLkFCZES27BMg/OFemxBK1MU6K8VjxiKF
 VoaSVJRDXuug8oVBAGNl86XpiSgd4gHyoNNA5b4mhdSWMSBMxUAaILsONT9pNQZA
 CFyHA1Jp2gLOIzQIzeXwWgXaAOQDtco8YWYaXhf0v0mySs89tweXjOibfj2mU9pS
 wPaJyeD+nyRDMwPa4VWEws64D3vXX6aKwiThUENuDmxBvrRXjrkGYH9tf0LNzDDe
 OKv/vOCfeyutxbjKhP+qElMhdh73BZnJ4UCxxYRRDq2v1Mg+k06swl+6uL6xenme
 a2KLJlwEoG6LAlkpZzV66ZEaIHDyGBZNdVYtuA/G3dDtmlt0aLXDdp1eq7NivS1j
 aV7cd0JMX89lAUtqKT932ZOw8RoDrUPPjsnXzCaZJ69mMVyEkxyCV+iYHTTJPDga
 W5Vg8Tq3d1gwxMebZHvyI6wwUhmGA0wUFG2eohYY/tcSrrUlrHQ=
 =Ke0p
 -----END PGP SIGNATURE-----

Merge tag 'for-6.16-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - tree-log fixes:
    - fixes of log tracking of directories and subvolumes
    - fix iteration and error handling of inode references
      during log replay

 - fix free space tree rebuild (reported by syzbot)

* tag 'for-6.16-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: use btrfs_record_snapshot_destroy() during rmdir
  btrfs: propagate last_unlink_trans earlier when doing a rmdir
  btrfs: record new subvolume in parent dir earlier to avoid dir logging races
  btrfs: fix inode lookup error handling during log replay
  btrfs: fix iteration of extrefs during log replay
  btrfs: fix missing error handling when searching for inode refs during log replay
  btrfs: fix failure to rebuild free space tree using multiple transactions
2025-07-03 13:29:56 -07:00
Eric Biggers
2c7528d36e btrfs: stop parsing crc32c driver name
To determine whether the crc32c implementation is "fast", use
crc32_optimizations() instead of parsing the crypto_shash driver name.
This keeps the code working as intended after the driver name is changed
by the next commit.

Acked-by: David Sterba <dsterba@suse.com>
Link: https://lore.kernel.org/r/20250613183753.31864-2-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2025-06-30 09:31:56 -07:00
Filipe Manana
157501b046 btrfs: use btrfs_record_snapshot_destroy() during rmdir
We are setting the parent directory's last_unlink_trans directly which
may result in a concurrent task starting to log the directory not see the
update and therefore can log the directory after we removed a child
directory which had a snapshot within instead of falling back to a
transaction commit. Replaying such a log tree would result in a mount
failure since we can't currently delete snapshots (and subvolumes) during
log replay. This is the type of failure described in commit 1ec9a1ae1e
("Btrfs: fix unreplayable log after snapshot delete + parent dir fsync").

Fix this by using btrfs_record_snapshot_destroy() which updates the
last_unlink_trans field while holding the inode's log_mutex lock.

Fixes: 44f714dae5 ("Btrfs: improve performance on fsync against new inode after rename/unlink")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-06-27 19:58:12 +02:00
Filipe Manana
c466e33e72 btrfs: propagate last_unlink_trans earlier when doing a rmdir
In case the removed directory had a snapshot that was deleted, we are
propagating its inode's last_unlink_trans to the parent directory after
we removed the entry from the parent directory. This leaves a small race
window where someone can log the parent directory after we removed the
entry and before we updated last_unlink_trans, and as a result if we ever
try to replay such a log tree, we will fail since we will attempt to
remove a snapshot during log replay, which is currently not possible and
results in the log replay (and mount) to fail. This is the type of failure
described in commit 1ec9a1ae1e ("Btrfs: fix unreplayable log after
snapshot delete + parent dir fsync").

So fix this by propagating the last_unlink_trans to the parent directory
before we remove the entry from it.

Fixes: 44f714dae5 ("Btrfs: improve performance on fsync against new inode after rename/unlink")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-06-27 19:57:47 +02:00
Filipe Manana
bf5bcf9a6f btrfs: record new subvolume in parent dir earlier to avoid dir logging races
Instead of recording that a new subvolume was created in a directory after
we add the entry do the directory, record it before adding the entry. This
is to avoid races where after creating the entry and before recording the
new subvolume in the directory (the call to btrfs_record_new_subvolume()),
another task logs the directory, so we end up with a log tree where we
logged a directory that has an entry pointing to a root that was not yet
committed, resulting in an invalid entry if the log is persisted and
replayed later due to a power failure or crash.

Also state this requirement in the function comment for
btrfs_record_new_subvolume(), similar to what we do for the
btrfs_record_unlink_dir() and btrfs_record_snapshot_destroy().

Fixes: 45c4102f0d ("btrfs: avoid transaction commit on any fsync after subvolume creation")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-06-27 19:57:24 +02:00
Filipe Manana
5f61b96159 btrfs: fix inode lookup error handling during log replay
When replaying log trees we use read_one_inode() to get an inode, which is
just a wrapper around btrfs_iget_logging(), which in turn is a wrapper for
btrfs_iget(). But read_one_inode() always returns NULL for any error
that btrfs_iget_logging() / btrfs_iget() may return and this is a problem
because:

1) In many callers of read_one_inode() we convert the NULL into -EIO,
   which is not accurate since btrfs_iget() may return -ENOMEM and -ENOENT
   for example, besides -EIO and other errors. So during log replay we
   may end up reporting a false -EIO, which is confusing since we may
   not have had any IO error at all;

2) When replaying directory deletes, at replay_dir_deletes(), we assume
   the NULL returned from read_one_inode() means that the inode doesn't
   exist and then proceed as if no error had happened. This is wrong
   because unless btrfs_iget() returned ERR_PTR(-ENOENT), we had an
   actual error and the target inode may exist in the target subvolume
   root - this may later result in the log replay code failing at a
   later stage (if we are "lucky") or succeed but leaving some
   inconsistency in the filesystem.

So fix this by not ignoring errors from btrfs_iget_logging() and as
a consequence remove the read_one_inode() wrapper and just use
btrfs_iget_logging() directly. Also since btrfs_iget_logging() is
supposed to be called only against subvolume roots, just like
read_one_inode() which had a comment about it, add an assertion to
btrfs_iget_logging() to check that the target root corresponds to a
subvolume root.

Fixes: 5d4f98a28c ("Btrfs: Mixed back reference  (FORWARD ROLLING FORMAT CHANGE)")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-06-27 19:57:06 +02:00
Filipe Manana
54a7081ed1 btrfs: fix iteration of extrefs during log replay
At __inode_add_ref() when processing extrefs, if we jump into the next
label we have an undefined value of victim_name.len, since we haven't
initialized it before we did the goto. This results in an invalid memory
access in the next iteration of the loop since victim_name.len was not
initialized to the length of the name of the current extref.

Fix this by initializing victim_name.len with the current extref's name
length.

Fixes: e43eec81c5 ("btrfs: use struct qstr instead of name and namelen pairs")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-06-27 19:56:55 +02:00
Filipe Manana
6561a40cec btrfs: fix missing error handling when searching for inode refs during log replay
During log replay, at __add_inode_ref(), when we are searching for inode
ref keys we totally ignore if btrfs_search_slot() returns an error. This
may make a log replay succeed when there was an actual error and leave
some metadata inconsistency in a subvolume tree. Fix this by checking if
an error was returned from btrfs_search_slot() and if so, return it to
the caller.

Fixes: e02119d5a7 ("Btrfs: Add a write ahead tree log to optimize synchronous operations")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-06-27 19:56:35 +02:00
Filipe Manana
1e6ed33cab btrfs: fix failure to rebuild free space tree using multiple transactions
If we are rebuilding a free space tree, while modifying the free space
tree we may need to allocate a new metadata block group.
If we end up using multiple transactions for the rebuild, when we call
btrfs_end_transaction() we enter btrfs_create_pending_block_groups()
which calls add_block_group_free_space() to add items to the free space
tree for the block group.

Then later during the free space tree rebuild, at
btrfs_rebuild_free_space_tree(), we may find such new block groups
and call populate_free_space_tree() for them, which fails with -EEXIST
because there are already items in the free space tree. Then we abort the
transaction with -EEXIST at btrfs_rebuild_free_space_tree().
Notice that we say "may find" the new block groups because a new block
group may be inserted in the block groups rbtree, which is being iterated
by the rebuild process, before or after the current node where the rebuild
process is currently at.

Syzbot recently reported such case which produces a trace like the
following:

  ------------[ cut here ]------------
  BTRFS: Transaction aborted (error -17)
  WARNING: CPU: 1 PID: 7626 at fs/btrfs/free-space-tree.c:1341 btrfs_rebuild_free_space_tree+0x470/0x54c fs/btrfs/free-space-tree.c:1341
  Modules linked in:
  CPU: 1 UID: 0 PID: 7626 Comm: syz.2.25 Not tainted 6.15.0-rc7-syzkaller-00085-gd7fa1af5b33e-dirty #0 PREEMPT
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/07/2025
  pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
  pc : btrfs_rebuild_free_space_tree+0x470/0x54c fs/btrfs/free-space-tree.c:1341
  lr : btrfs_rebuild_free_space_tree+0x470/0x54c fs/btrfs/free-space-tree.c:1341
  sp : ffff80009c4f7740
  x29: ffff80009c4f77b0 x28: ffff0000d4c3f400 x27: 0000000000000000
  x26: dfff800000000000 x25: ffff70001389eee8 x24: 0000000000000003
  x23: 1fffe000182b6e7b x22: 0000000000000000 x21: ffff0000c15b73d8
  x20: 00000000ffffffef x19: ffff0000c15b7378 x18: 1fffe0003386f276
  x17: ffff80008f31e000 x16: ffff80008adbe98c x15: 0000000000000001
  x14: 1fffe0001b281550 x13: 0000000000000000 x12: 0000000000000000
  x11: ffff60001b281551 x10: 0000000000000003 x9 : 1c8922000a902c00
  x8 : 1c8922000a902c00 x7 : ffff800080485878 x6 : 0000000000000000
  x5 : 0000000000000001 x4 : 0000000000000001 x3 : ffff80008047843c
  x2 : 0000000000000001 x1 : ffff80008b3ebc40 x0 : 0000000000000001
  Call trace:
   btrfs_rebuild_free_space_tree+0x470/0x54c fs/btrfs/free-space-tree.c:1341 (P)
   btrfs_start_pre_rw_mount+0xa78/0xe10 fs/btrfs/disk-io.c:3074
   btrfs_remount_rw fs/btrfs/super.c:1319 [inline]
   btrfs_reconfigure+0x828/0x2418 fs/btrfs/super.c:1543
   reconfigure_super+0x1d4/0x6f0 fs/super.c:1083
   do_remount fs/namespace.c:3365 [inline]
   path_mount+0xb34/0xde0 fs/namespace.c:4200
   do_mount fs/namespace.c:4221 [inline]
   __do_sys_mount fs/namespace.c:4432 [inline]
   __se_sys_mount fs/namespace.c:4409 [inline]
   __arm64_sys_mount+0x3e8/0x468 fs/namespace.c:4409
   __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline]
   invoke_syscall+0x98/0x2b8 arch/arm64/kernel/syscall.c:49
   el0_svc_common+0x130/0x23c arch/arm64/kernel/syscall.c:132
   do_el0_svc+0x48/0x58 arch/arm64/kernel/syscall.c:151
   el0_svc+0x58/0x17c arch/arm64/kernel/entry-common.c:767
   el0t_64_sync_handler+0x78/0x108 arch/arm64/kernel/entry-common.c:786
   el0t_64_sync+0x198/0x19c arch/arm64/kernel/entry.S:600
  irq event stamp: 330
  hardirqs last  enabled at (329): [<ffff80008048590c>] raw_spin_rq_unlock_irq kernel/sched/sched.h:1525 [inline]
  hardirqs last  enabled at (329): [<ffff80008048590c>] finish_lock_switch+0xb0/0x1c0 kernel/sched/core.c:5130
  hardirqs last disabled at (330): [<ffff80008adb9e60>] el1_dbg+0x24/0x80 arch/arm64/kernel/entry-common.c:511
  softirqs last  enabled at (10): [<ffff8000801fbf10>] local_bh_enable+0x10/0x34 include/linux/bottom_half.h:32
  softirqs last disabled at (8): [<ffff8000801fbedc>] local_bh_disable+0x10/0x34 include/linux/bottom_half.h:19
  ---[ end trace 0000000000000000 ]---

Fix this by flagging new block groups which had their free space tree
entries already added and then skip them in the rebuild process. Also,
since the rebuild may be triggered when doing a remount, make sure that
when we clear an existing free space tree that we clear such flag from
every existing block group, otherwise we would skip those block groups
during the rebuild.

Reported-by: syzbot+d0014fb0fc39c5487ae5@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/68460a54.050a0220.daf97.0af5.GAE@google.com/
Fixes: 882af9f13e ("btrfs: handle free space tree rebuild in multiple transactions")
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-06-27 19:56:15 +02:00
Linus Torvalds
5ca7fe213b for-6.16-rc3-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmhZQ9MACgkQxWXV+ddt
 WDsyvw/+K5N4zbig9D5QL5SdsQwMe/ZUk1KF0LLu6H3hFetdICeM/Z4K46EBh40X
 c9Sxb13gLnIAm8DR/IFTTlOZVrrbJ3CTazZuJbncCpaZchH863aYb/1KboxjJnpW
 KqOen20KdUh8HdevrJFhkFc7rOjp7KupfIHsbWqIxaWYPf8ORvUyK55lKxQz0HES
 E5tFXLNr6z/8Ws5pc2HnRLgnRcCHuRUNJUb1PEaTfPKxoFvTwjda6cDsYnXOJEO9
 NOnh6lluurqja+3FUEFig2f292/CbKGtByYUDgfhHO21P//IHSDhlouvwipzI/kh
 6WUoH1K+DWCxxNbIVFFbUYLxrDGu7R7/aWFHH2q0dNjqQeiQBbUnbn4WIjAAwDWf
 k9cmE+WgVqwQI+vpfG3eENUafG5MpcQQo2wKrxG0whWaC2fiA6QtI+3DfKyMj4XJ
 JI1jUhfCwHrqzoGQ4XBE3UYENqQw9RICNC+Z3UfZx+5sQMWcb+ac5qIGygvCfU8N
 Gtfx4ladZshpQUSuRneiLozxdxLyXX3LzCt2Ls1s5fPPikZft/+2QRu5rzSbb/Cp
 50TDSn/pE1N/TEMVZaP5M2PxquBVDOZ4TFSsSm3IvceqFInm0UerAGaJ7+T2eZhM
 3XHhIp6xTecHfwukvGqs+XSxB9PMLfF5M0gc+9PR+3oxzFRpowI=
 =XLWR
 -----END PGP SIGNATURE-----

Merge tag 'for-6.16-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "Fixes:

   - fix invalid inode pointer dereferences during log replay

   - fix a race between renames and directory logging

   - fix shutting down delayed iput worker

   - fix device byte accounting when dropping chunk

   - in zoned mode, fix offset calculations for DUP profile when
     conventional and sequential zones are used together

  Regression fixes:

   - fix possible double unlock of extent buffer tree (xarray
     conversion)

   - in zoned mode, fix extent buffer refcount when writing out extents
     (xarray conversion)

  Error handling fixes and updates:

   - handle unexpected extent type when replaying log

   - check and warn if there are remaining delayed inodes when putting a
     root

   - fix assertion when building free space tree

   - handle csum tree error with mount option 'rescue=ibadroot'

  Other:

   - error message updates: add prefix to all scrub related messages,
     include other information in messages"

* tag 'for-6.16-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: zoned: fix alloc_offset calculation for partly conventional block groups
  btrfs: handle csum tree error with rescue=ibadroots correctly
  btrfs: fix race between async reclaim worker and close_ctree()
  btrfs: fix assertion when building free space tree
  btrfs: don't silently ignore unexpected extent type when replaying log
  btrfs: fix invalid inode pointer dereferences during log replay
  btrfs: fix double unlock of buffer_tree xarray when releasing subpage eb
  btrfs: update superblock's device bytes_used when dropping chunk
  btrfs: fix a race between renames and directory logging
  btrfs: scrub: add prefix for the error messages
  btrfs: warn if leaking delayed_nodes in btrfs_put_root()
  btrfs: fix delayed ref refcount leak in debug assertion
  btrfs: include root in error message when unlinking inode
  btrfs: don't drop a reference if btrfs_check_write_meta_pointer() fails
2025-06-23 11:16:38 -07:00
Johannes Thumshirn
c0d90a79e8 btrfs: zoned: fix alloc_offset calculation for partly conventional block groups
When one of two zones composing a DUP block group is a conventional zone,
we have the zone_info[i]->alloc_offset = WP_CONVENTIONAL. That will, of
course, not match the write pointer of the other zone, and fails that
block group.

This commit solves that issue by properly recovering the emulated write
pointer from the last allocated extent. The offset for the SINGLE, DUP,
and RAID1 are straight-forward: it is same as the end of last allocated
extent. The RAID0 and RAID10 are a bit tricky that we need to do the math
of striping.

This is the kernel equivalent of Naohiro's user-space commit:
"btrfs-progs: zoned: fix alloc_offset calculation for partly
conventional block groups".

Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-06-19 15:21:15 +02:00
Qu Wenruo
547e836661 btrfs: handle csum tree error with rescue=ibadroots correctly
[BUG]
There is syzbot based reproducer that can crash the kernel, with the
following call trace: (With some debug output added)

 DEBUG: rescue=ibadroots parsed
 BTRFS: device fsid 14d642db-7b15-43e4-81e6-4b8fac6a25f8 devid 1 transid 8 /dev/loop0 (7:0) scanned by repro (1010)
 BTRFS info (device loop0): first mount of filesystem 14d642db-7b15-43e4-81e6-4b8fac6a25f8
 BTRFS info (device loop0): using blake2b (blake2b-256-generic) checksum algorithm
 BTRFS info (device loop0): using free-space-tree
 BTRFS warning (device loop0): checksum verify failed on logical 5312512 mirror 1 wanted 0xb043382657aede36608fd3386d6b001692ff406164733d94e2d9a180412c6003 found 0x810ceb2bacb7f0f9eb2bf3b2b15c02af867cb35ad450898169f3b1f0bd818651 level 0
 DEBUG: read tree root path failed for tree csum, ret=-5
 BTRFS warning (device loop0): checksum verify failed on logical 5328896 mirror 1 wanted 0x51be4e8b303da58e6340226815b70e3a93592dac3f30dd510c7517454de8567a found 0x51be4e8b303da58e634022a315b70e3a93592dac3f30dd510c7517454de8567a level 0
 BTRFS warning (device loop0): checksum verify failed on logical 5292032 mirror 1 wanted 0x1924ccd683be9efc2fa98582ef58760e3848e9043db8649ee382681e220cdee4 found 0x0cb6184f6e8799d9f8cb335dccd1d1832da1071d12290dab3b85b587ecacca6e level 0
 process 'repro' launched './file2' with NULL argv: empty string added
 DEBUG: no csum root, idatacsums=0 ibadroots=134217728
 Oops: general protection fault, probably for non-canonical address 0xdffffc0000000041: 0000 [#1] SMP KASAN NOPTI
 KASAN: null-ptr-deref in range [0x0000000000000208-0x000000000000020f]
 CPU: 5 UID: 0 PID: 1010 Comm: repro Tainted: G           OE       6.15.0-custom+ #249 PREEMPT(full)
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022
 RIP: 0010:btrfs_lookup_csum+0x93/0x3d0 [btrfs]
 Call Trace:
  <TASK>
  btrfs_lookup_bio_sums+0x47a/0xdf0 [btrfs]
  btrfs_submit_bbio+0x43e/0x1a80 [btrfs]
  submit_one_bio+0xde/0x160 [btrfs]
  btrfs_readahead+0x498/0x6a0 [btrfs]
  read_pages+0x1c3/0xb20
  page_cache_ra_order+0x4b5/0xc20
  filemap_get_pages+0x2d3/0x19e0
  filemap_read+0x314/0xde0
  __kernel_read+0x35b/0x900
  bprm_execve+0x62e/0x1140
  do_execveat_common.isra.0+0x3fc/0x520
  __x64_sys_execveat+0xdc/0x130
  do_syscall_64+0x54/0x1d0
  entry_SYSCALL_64_after_hwframe+0x76/0x7e
 ---[ end trace 0000000000000000 ]---

[CAUSE]
Firstly the fs has a corrupted csum tree root, thus to mount the fs we
have to go "ro,rescue=ibadroots" mount option.

Normally with that mount option, a bad csum tree root should set
BTRFS_FS_STATE_NO_DATA_CSUMS flag, so that any future data read will
ignore csum search.

But in this particular case, we have the following call trace that
caused NULL csum root, but not setting BTRFS_FS_STATE_NO_DATA_CSUMS:

load_global_roots_objectid():

		ret = btrfs_search_slot();
		/* Succeeded */
		btrfs_item_key_to_cpu()
		found = true;
		/* We found the root item for csum tree. */
		root = read_tree_root_path();
		if (IS_ERR(root)) {
			if (!btrfs_test_opt(fs_info, IGNOREBADROOTS))
			/*
			 * Since we have rescue=ibadroots mount option,
			 * @ret is still 0.
			 */
			break;
	if (!found || ret) {
		/* @found is true, @ret is 0, error handling for csum
		 * tree is skipped.
		 */
	}

This means we completely skipped to set BTRFS_FS_STATE_NO_DATA_CSUMS if
the csum tree is corrupted, which results unexpected later csum lookup.

[FIX]
If read_tree_root_path() failed, always populate @ret to the error
number.

As at the end of the function, we need @ret to determine if we need to
do the extra error handling for csum tree.

Fixes: abed4aaae4 ("btrfs: track the csum, extent, and free space trees in a rb tree")
Reported-by: Zhiyu Zhang <zhiyuzhang999@gmail.com>
Reported-by: Longxing Li <coregee2000@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-06-19 15:21:06 +02:00
Filipe Manana
a26bf338cd btrfs: fix race between async reclaim worker and close_ctree()
Syzbot reported an assertion failure due to an attempt to add a delayed
iput after we have set BTRFS_FS_STATE_NO_DELAYED_IPUT in the fs_info
state:

  WARNING: CPU: 0 PID: 65 at fs/btrfs/inode.c:3420 btrfs_add_delayed_iput+0x2f8/0x370 fs/btrfs/inode.c:3420
  Modules linked in:
  CPU: 0 UID: 0 PID: 65 Comm: kworker/u8:4 Not tainted 6.15.0-next-20250530-syzkaller #0 PREEMPT(full)
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/07/2025
  Workqueue: btrfs-endio-write btrfs_work_helper
  RIP: 0010:btrfs_add_delayed_iput+0x2f8/0x370 fs/btrfs/inode.c:3420
  Code: 4e ad 5d (...)
  RSP: 0018:ffffc9000213f780 EFLAGS: 00010293
  RAX: ffffffff83c635b7 RBX: ffff888058920000 RCX: ffff88801c769e00
  RDX: 0000000000000000 RSI: 0000000000000100 RDI: 0000000000000000
  RBP: 0000000000000001 R08: ffff888058921b67 R09: 1ffff1100b12436c
  R10: dffffc0000000000 R11: ffffed100b12436d R12: 0000000000000001
  R13: dffffc0000000000 R14: ffff88807d748000 R15: 0000000000000100
  FS:  0000000000000000(0000) GS:ffff888125c53000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00002000000bd038 CR3: 000000006a142000 CR4: 00000000003526f0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   <TASK>
   btrfs_put_ordered_extent+0x19f/0x470 fs/btrfs/ordered-data.c:635
   btrfs_finish_one_ordered+0x11d8/0x1b10 fs/btrfs/inode.c:3312
   btrfs_work_helper+0x399/0xc20 fs/btrfs/async-thread.c:312
   process_one_work kernel/workqueue.c:3238 [inline]
   process_scheduled_works+0xae1/0x17b0 kernel/workqueue.c:3321
   worker_thread+0x8a0/0xda0 kernel/workqueue.c:3402
   kthread+0x70e/0x8a0 kernel/kthread.c:464
   ret_from_fork+0x3fc/0x770 arch/x86/kernel/process.c:148
   ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
   </TASK>

This can happen due to a race with the async reclaim worker like this:

1) The async metadata reclaim worker enters shrink_delalloc(), which calls
   btrfs_start_delalloc_roots() with an nr_pages argument that has a value
   less than LONG_MAX, and that in turn enters start_delalloc_inodes(),
   which sets the local variable 'full_flush' to false because
   wbc->nr_to_write is less than LONG_MAX;

2) There it finds inode X in a root's delalloc list, grabs a reference for
   inode X (with igrab()), and triggers writeback for it with
   filemap_fdatawrite_wbc(), which creates an ordered extent for inode X;

3) The unmount sequence starts from another task, we enter close_ctree()
   and we flush the workqueue fs_info->endio_write_workers, which waits
   for the ordered extent for inode X to complete and when dropping the
   last reference of the ordered extent, with btrfs_put_ordered_extent(),
   when we call btrfs_add_delayed_iput() we don't add the inode to the
   list of delayed iputs because it has a refcount of 2, so we decrement
   it to 1 and return;

4) Shortly after at close_ctree() we call btrfs_run_delayed_iputs() which
   runs all delayed iputs, and then we set BTRFS_FS_STATE_NO_DELAYED_IPUT
   in the fs_info state;

5) The async reclaim worker, after calling filemap_fdatawrite_wbc(), now
   calls btrfs_add_delayed_iput() for inode X and there we trigger an
   assertion failure since the fs_info state has the flag
   BTRFS_FS_STATE_NO_DELAYED_IPUT set.

Fix this by setting BTRFS_FS_STATE_NO_DELAYED_IPUT only after we wait for
the async reclaim workers to finish, after we call cancel_work_sync() for
them at close_ctree(), and by running delayed iputs after wait for the
reclaim workers to finish and before setting the bit.

This race was recently introduced by commit 19e60b2a95 ("btrfs: add
extra warning if delayed iput is added when it's not allowed"). Without
the new validation at btrfs_add_delayed_iput(), this described scenario
was safe because close_ctree() later calls btrfs_commit_super(). That
will run any final delayed iputs added by reclaim workers in the window
between the btrfs_run_delayed_iputs() and the the reclaim workers being
shut down.

Reported-by: syzbot+0ed30ad435bf6f5b7a42@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/6840481c.a00a0220.d4325.000c.GAE@google.com/T/#u
Fixes: 19e60b2a95 ("btrfs: add extra warning if delayed iput is added when it's not allowed")
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-06-19 15:20:57 +02:00
Filipe Manana
1961d20f6f btrfs: fix assertion when building free space tree
When building the free space tree with the block group tree feature
enabled, we can hit an assertion failure like this:

  BTRFS info (device loop0 state M): rebuilding free space tree
  assertion failed: ret == 0, in fs/btrfs/free-space-tree.c:1102
  ------------[ cut here ]------------
  kernel BUG at fs/btrfs/free-space-tree.c:1102!
  Internal error: Oops - BUG: 00000000f2000800 [#1]  SMP
  Modules linked in:
  CPU: 1 UID: 0 PID: 6592 Comm: syz-executor322 Not tainted 6.15.0-rc7-syzkaller-gd7fa1af5b33e #0 PREEMPT
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/07/2025
  pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
  pc : populate_free_space_tree+0x514/0x518 fs/btrfs/free-space-tree.c:1102
  lr : populate_free_space_tree+0x514/0x518 fs/btrfs/free-space-tree.c:1102
  sp : ffff8000a4ce7600
  x29: ffff8000a4ce76e0 x28: ffff0000c9bc6000 x27: ffff0000ddfff3d8
  x26: ffff0000ddfff378 x25: dfff800000000000 x24: 0000000000000001
  x23: ffff8000a4ce7660 x22: ffff70001499cecc x21: ffff0000e1d8c160
  x20: ffff0000e1cb7800 x19: ffff0000e1d8c0b0 x18: 00000000ffffffff
  x17: ffff800092f39000 x16: ffff80008ad27e48 x15: ffff700011e740c0
  x14: 1ffff00011e740c0 x13: 0000000000000004 x12: ffffffffffffffff
  x11: ffff700011e740c0 x10: 0000000000ff0100 x9 : 94ef24f55d2dbc00
  x8 : 94ef24f55d2dbc00 x7 : 0000000000000001 x6 : 0000000000000001
  x5 : ffff8000a4ce6f98 x4 : ffff80008f415ba0 x3 : ffff800080548ef0
  x2 : 0000000000000000 x1 : 0000000100000000 x0 : 000000000000003e
  Call trace:
   populate_free_space_tree+0x514/0x518 fs/btrfs/free-space-tree.c:1102 (P)
   btrfs_rebuild_free_space_tree+0x14c/0x54c fs/btrfs/free-space-tree.c:1337
   btrfs_start_pre_rw_mount+0xa78/0xe10 fs/btrfs/disk-io.c:3074
   btrfs_remount_rw fs/btrfs/super.c:1319 [inline]
   btrfs_reconfigure+0x828/0x2418 fs/btrfs/super.c:1543
   reconfigure_super+0x1d4/0x6f0 fs/super.c:1083
   do_remount fs/namespace.c:3365 [inline]
   path_mount+0xb34/0xde0 fs/namespace.c:4200
   do_mount fs/namespace.c:4221 [inline]
   __do_sys_mount fs/namespace.c:4432 [inline]
   __se_sys_mount fs/namespace.c:4409 [inline]
   __arm64_sys_mount+0x3e8/0x468 fs/namespace.c:4409
   __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline]
   invoke_syscall+0x98/0x2b8 arch/arm64/kernel/syscall.c:49
   el0_svc_common+0x130/0x23c arch/arm64/kernel/syscall.c:132
   do_el0_svc+0x48/0x58 arch/arm64/kernel/syscall.c:151
   el0_svc+0x58/0x17c arch/arm64/kernel/entry-common.c:767
   el0t_64_sync_handler+0x78/0x108 arch/arm64/kernel/entry-common.c:786
   el0t_64_sync+0x198/0x19c arch/arm64/kernel/entry.S:600
  Code: f0047182 91178042 528089c3 9771d47b (d4210000)
  ---[ end trace 0000000000000000 ]---

This happens because we are processing an empty block group, which has
no extents allocated from it, there are no items for this block group,
including the block group item since block group items are stored in a
dedicated tree when using the block group tree feature. It also means
this is the block group with the highest start offset, so there are no
higher keys in the extent root, hence btrfs_search_slot_for_read()
returns 1 (no higher key found).

Fix this by asserting 'ret' is 0 only if the block group tree feature
is not enabled, in which case we should find a block group item for
the block group since it's stored in the extent root and block group
item keys are greater than extent item keys (the value for
BTRFS_BLOCK_GROUP_ITEM_KEY is 192 and for BTRFS_EXTENT_ITEM_KEY and
BTRFS_METADATA_ITEM_KEY the values are 168 and 169 respectively).
In case 'ret' is 1, we just need to add a record to the free space
tree which spans the whole block group, and we can achieve this by
making 'ret == 0' as the while loop's condition.

Reported-by: syzbot+36fae25c35159a763a2a@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/6841dca8.a00a0220.d4325.0020.GAE@google.com/
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-06-19 15:20:54 +02:00
Filipe Manana
16edae52f6 btrfs: don't silently ignore unexpected extent type when replaying log
If there's an unexpected (invalid) extent type, we just silently ignore
it. This means a corruption or some bug somewhere, so instead return
-EUCLEAN to the caller, making log replay fail, and print an error message
with relevant information.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-06-19 15:20:47 +02:00
Filipe Manana
2dcf838cf5 btrfs: fix invalid inode pointer dereferences during log replay
In a few places where we call read_one_inode(), if we get a NULL pointer
we end up jumping into an error path, or fallthrough in case of
__add_inode_ref(), where we then do something like this:

   iput(&inode->vfs_inode);

which results in an invalid inode pointer that triggers an invalid memory
access, resulting in a crash.

Fix this by making sure we don't do such dereferences.

Fixes: b4c50cbb01 ("btrfs: return a btrfs_inode from read_one_inode()")
CC: stable@vger.kernel.org # 6.15+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-06-19 15:20:42 +02:00
Filipe Manana
e5b5596011 btrfs: fix double unlock of buffer_tree xarray when releasing subpage eb
If we break out of the loop because an extent buffer doesn't have the bit
EXTENT_BUFFER_TREE_REF set, we end up unlocking the xarray twice, once
before we tested for the bit and break out of the loop, and once again
after the loop.

Fix this by testing the bit and exiting before unlocking the xarray.
The time spent testing the bit is negligible and it's not worth trying
to do that outside the critical section delimited by the xarray lock due
to the code complexity required to avoid it (like using a local boolean
variable to track whether the xarray is locked or not). The xarray unlock
only needs to be done before calling release_extent_buffer(), as that
needs to lock the xarray (through xa_cmpxchg_irq()) and does a more
significant amount of work.

Fixes: 19d7f65f03 ("btrfs: convert the buffer_radix to an xarray")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Link: https://lore.kernel.org/linux-btrfs/aDRNDU0GM1_D4Xnw@stanley.mountain/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-06-19 15:20:33 +02:00
Mark Harmstone
ae4477f937 btrfs: update superblock's device bytes_used when dropping chunk
Each superblock contains a copy of the device item for that device. In a
transaction which drops a chunk but doesn't create any new ones, we were
correctly updating the device item in the chunk tree but not copying
over the new bytes_used value to the superblock.

This can be seen by doing the following:

  # dd if=/dev/zero of=test bs=4096 count=2621440
  # mkfs.btrfs test
  # mount test /root/temp

  # cd /root/temp
  # for i in {00..10}; do dd if=/dev/zero of=$i bs=4096 count=32768; done
  # sync
  # rm *
  # sync
  # btrfs balance start -dusage=0 .
  # sync

  # cd
  # umount /root/temp
  # btrfs check test

For btrfs-check to detect this, you will also need my patch at
https://github.com/kdave/btrfs-progs/pull/991.

Change btrfs_remove_dev_extents() so that it adds the devices to the
fs_info->post_commit_list if they're not there already. This causes
btrfs_commit_device_sizes() to be called, which updates the bytes_used
value in the superblock.

Fixes: bbbf7243d6 ("btrfs: combine device update operations during transaction commit")
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Mark Harmstone <maharmstone@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-06-19 15:20:22 +02:00
Filipe Manana
3ca864de85 btrfs: fix a race between renames and directory logging
We have a race between a rename and directory inode logging that if it
happens and we crash/power fail before the rename completes, the next time
the filesystem is mounted, the log replay code will end up deleting the
file that was being renamed.

This is best explained following a step by step analysis of an interleaving
of steps that lead into this situation.

Consider the initial conditions:

1) We are at transaction N;

2) We have directories A and B created in a past transaction (< N);

3) We have inode X corresponding to a file that has 2 hardlinks, one in
   directory A and the other in directory B, so we'll name them as
   "A/foo_link1" and "B/foo_link2". Both hard links were persisted in a
   past transaction (< N);

4) We have inode Y corresponding to a file that as a single hard link and
   is located in directory A, we'll name it as "A/bar". This file was also
   persisted in a past transaction (< N).

The steps leading to a file loss are the following and for all of them we
are under transaction N:

 1) Link "A/foo_link1" is removed, so inode's X last_unlink_trans field
    is updated to N, through btrfs_unlink() -> btrfs_record_unlink_dir();

 2) Task A starts a rename for inode Y, with the goal of renaming from
    "A/bar" to "A/baz", so we enter btrfs_rename();

 3) Task A inserts the new BTRFS_INODE_REF_KEY for inode Y by calling
    btrfs_insert_inode_ref();

 4) Because the rename happens in the same directory, we don't set the
    last_unlink_trans field of directoty A's inode to the current
    transaction id, that is, we don't cal btrfs_record_unlink_dir();

 5) Task A then removes the entries from directory A (BTRFS_DIR_ITEM_KEY
    and BTRFS_DIR_INDEX_KEY items) when calling __btrfs_unlink_inode()
    (actually the dir index item is added as a delayed item, but the
    effect is the same);

 6) Now before task A adds the new entry "A/baz" to directory A by
    calling btrfs_add_link(), another task, task B is logging inode X;

 7) Task B starts a fsync of inode X and after logging inode X, at
    btrfs_log_inode_parent() it calls btrfs_log_all_parents(), since
    inode X has a last_unlink_trans value of N, set at in step 1;

 8) At btrfs_log_all_parents() we search for all parent directories of
    inode X using the commit root, so we find directories A and B and log
    them. Bu when logging direct A, we don't have a dir index item for
    inode Y anymore, neither the old name "A/bar" nor for the new name
    "A/baz" since the rename has deleted the old name but has not yet
    inserted the new name - task A hasn't called yet btrfs_add_link() to
    do that.

    Note that logging directory A doesn't fallback to a transaction
    commit because its last_unlink_trans has a lower value than the
    current transaction's id (see step 4);

 9) Task B finishes logging directories A and B and gets back to
    btrfs_sync_file() where it calls btrfs_sync_log() to persist the log
    tree;

10) Task B successfully persisted the log tree, btrfs_sync_log() completed
    with success, and a power failure happened.

    We have a log tree without any directory entry for inode Y, so the
    log replay code deletes the entry for inode Y, name "A/bar", from the
    subvolume tree since it doesn't exist in the log tree and the log
    tree is authorative for its index (we logged a BTRFS_DIR_LOG_INDEX_KEY
    item that covers the index range for the dentry that corresponds to
    "A/bar").

    Since there's no other hard link for inode Y and the log replay code
    deletes the name "A/bar", the file is lost.

The issue wouldn't happen if task B synced the log only after task A
called btrfs_log_new_name(), which would update the log with the new name
for inode Y ("A/bar").

Fix this by pinning the log root during renames before removing the old
directory entry, and unpinning after btrfs_log_new_name() is called.

Fixes: 259c4b96d7 ("btrfs: stop doing unnecessary log updates during a rename")
CC: stable@vger.kernel.org # 5.18+
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-06-19 15:19:58 +02:00
Anand Jain
65d5112b4d btrfs: scrub: add prefix for the error messages
Add a "scrub: " prefix to all messages logged by scrub so that it's
easy to filter them from dmesg for analysis.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-06-19 15:19:06 +02:00
Leo Martins
186b9dc3c3 btrfs: warn if leaking delayed_nodes in btrfs_put_root()
Add a warning for leaked delayed_nodes when putting a root. We currently
do this for inodes, but not delayed_nodes.

Signed-off-by: Leo Martins <loemra.dev@gmail.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
[ Remove the changelog from the commit message. ]
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-06-19 15:18:39 +02:00
Leo Martins
dd276214e4 btrfs: fix delayed ref refcount leak in debug assertion
If the delayed_root is not empty we are increasing the number of
references to a delayed_node without decreasing it, causing a leak.  Fix
by decrementing the delayed_node reference count.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Leo Martins <loemra.dev@gmail.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
[ Remove the changelog from the commit message. ]
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-06-19 15:18:35 +02:00
Filipe Manana
c769be2d3d btrfs: include root in error message when unlinking inode
To help debugging include the root number in the error message, and since
this is a critical error that implies a metadata inconsistency and results
in a transaction abort change the log message level from "info" to
"critical", which is a much better fit.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-06-19 15:18:30 +02:00
Lorenzo Stoakes
2e3b37a7e4
fs: replace mmap hook with .mmap_prepare for simple mappings
Since commit c84bf6dd2b ("mm: introduce new .mmap_prepare() file
callback"), the f_op->mmap() hook has been deprecated in favour of
f_op->mmap_prepare().

This callback is invoked in the mmap() logic far earlier, so error handling
can be performed more safely without complicated and bug-prone state
unwinding required should an error arise.

This hook also avoids passing a pointer to a not-yet-correctly-established
VMA avoiding any issues with referencing this data structure.

It rather provides a pointer to the new struct vm_area_desc descriptor type
which contains all required state and allows easy setting of required
parameters without any consideration needing to be paid to locking or
reference counts.

Note that nested filesystems like overlayfs are compatible with an
.mmap_prepare() callback since commit bb666b7c27 ("mm: add mmap_prepare()
compatibility layer for nested file systems").

In this patch we apply this change to file systems with relatively simple
mmap() hook logic - exfat, ceph, f2fs, bcachefs, zonefs, btrfs, ocfs2,
orangefs, nilfs2, romfs, ramfs and aio.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Link: https://lore.kernel.org/f528ac4f35b9378931bd800920fee53fc0c5c74d.1750099179.git.lorenzo.stoakes@oracle.com
Acked-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-19 13:56:59 +02:00