linux-loongson

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson synced 2025-08-28 09:22:08 +00:00

Author	SHA1	Message	Date
David Sterba	64b8c3851f	btrfs: rename error to ret in btrfs_mksubvol() Unify naming of return value to the preferred way. Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:03 +02:00
David Sterba	bfa13b82cc	btrfs: rename error to ret in btrfs_may_delete() Unify naming of return value to the preferred way. Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:03 +02:00
Filipe Manana	2abd9e1c58	btrfs: cache if we are using free space bitmaps for a block group Every time we add free space to the free space tree or we remove free space from the free space tree, we do a lookup for the block group's free space info item in the free space tree. This takes time, navigating the btree and we may block either on IO when reading extent buffers from disk or on extent buffer lock contention due to concurrency. Instead of doing this lookup every time, cache the result in the block group structure and use it after the first lookup. This adds two boolean members to the block group structure but doesn't increase the structure's size. The following script that runs fs_mark was used to measure the time spent on run_delayed_tree_ref(), since down that call chain we have calls to add and remove free space to/from the free space tree (calls to btrfs_add_to_free_space_tree() and btrfs_remove_from_free_space_tree()):
  $ cat test.sh
  #!/bin/bash
  DEV=/dev/nullb0
  MNT=/mnt
  FILES=100000
  THREADS=$(nproc --all)
  echo "performance" | \
      tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
  umount $DEV &> /dev/null
  mkfs.btrfs -f $DEV
  mount -o ssd $DEV $MNT
  OPTS="-S 0 -L 5 -n $FILES -s 0 -t $THREADS -k"
  for ((i = 1; i <= $THREADS; i++)); do
      OPTS="$OPTS -d $MNT/d$i"
  done
  fs_mark $OPTS
  umount $MNT
This is a heavy metadata test as it's exercising only file creation, so a lot of allocations of metadata extents, creating delayed refs for adding new metadata extents and dropping existing ones due to COW. The results of the times it took to execute run_delayed_tree_ref(), in nanoseconds, are the following.
Before this change:
  Range: 1868.000 - 6482857.000; Mean: 10231.430; Median: 7005.000; Stddev: 27993.173
  Percentiles: 90th: 13342.000; 95th: 23279.000; 99th: 82448.000
     1868.000 -    4222.038:  270696 ############
     4222.038 -    9541.029: 1201327 #####################################################
     9541.029 -   21559.383:  385436 #################
    21559.383 -   48715.063:   64942 ###
    48715.063 -  110073.800:   31454 #
   110073.800 -  248714.944:    8218 |
   248714.944 -  561977.042:    1030 |
   561977.042 - 1269798.254:     295 |
  1269798.254 - 2869132.711:     116 |
  2869132.711 - 6482857.000:      28 |
After this change:
  Range: 1554.000 - 4557014.000; Mean: 9168.164; Median: 6391.000; Stddev: 21467.060
  Percentiles: 90th: 12478.000; 95th: 20964.000; 99th: 72234.000
     1554.000 -    3453.820:  219004 ############
     3453.820 -    7674.743:  980645 #####################################################
     7674.743 -   17052.574:  552486 ##############################
    17052.574 -   37887.762:   68558 ####
    37887.762 -   84178.322:   31557 ##
    84178.322 -  187024.331:   12102 #
   187024.331 -  415522.355:    1364 |
   415522.355 -  923187.626:     256 |
   923187.626 - 2051092.468:     125 |
  2051092.468 - 4557014.000:      21 |
Approximate improvement in the first four buckets is about 20%. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:03 +02:00
Filipe Manana	fdeffeb4f5	btrfs: add and use helper to determine if using bitmaps in free space tree When adding and removing free space to the free space tree, we need to lookup the respective block group's free info item in the free space tree, check its flags for the BTRFS_FREE_SPACE_USING_BITMAPS bit and then release the path. Move these steps into a helper function and use it in both sites. This will also help avoiding duplicate code in a subsequent change. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:03 +02:00
Filipe Manana	d1ac35ae2a	btrfs: use fs_info from local variable in btrfs_convert_free_space_to_extents() There's no need to dereference the block group to extract fs_info as we have already stored fs_info in a local variable. So use the local variable instead. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:03 +02:00
Filipe Manana	497c726ff8	btrfs: avoid double slot decrement at btrfs_convert_free_space_to_extents() There's no need to subtract 1 from path->slots[0] and then decrement the slot, we can reduce the generated assembly code by decrementing the slot and then use it. Module size before: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1846220 162998 16136 2025354 1ee78a fs/btrfs/btrfs.ko Module size after: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1846204 162998 16136 2025338 1ee77a fs/btrfs/btrfs.ko Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:02 +02:00
Filipe Manana	a8da443c9b	btrfs: turn remove argument of modify_free_space_bitmap() to boolean The argument is used as a boolean, so switch its type from int to bool. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:02 +02:00
Filipe Manana	3887067f55	btrfs: rename free_space_set_bits() and make it less confusing The free_space_set_bits() is used both to set a range of bits or to clear range of bits, depending on the 'bit' argument value. So the name is very misleading since it's not used only to set bits. Furthermore the 'bit' argument is an integer when a boolean is all that is needed plus its name is singular, which gives the idea the function operates on a single bit and not on a range of bits. So rename the function to free_space_modify_bits(), rename the 'bit' argument to 'set_bits' and turn the argument to bool instead of int. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:02 +02:00
Filipe Manana	6fc5ef7829	btrfs: add btrfs prefix to free space tree exported functions A few of the free space tree exported functions have a 'btrfs_' prefix in their name, but most don't. Not only is this inconsistent, the preferred style is to have such a prefix, to avoid potential collisions in the future with other kernel code and offer a somewhat better readability by making it obvious in call sites that we are calling btrfs specific code. So add the 'btrfs_' prefix to all free space tree functions that are missing it. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:02 +02:00
Filipe Manana	8bfa3727ea	btrfs: remove pointless out label from load_free_space_extents() All we do under the label is to return, so there's no point in having it, just return directly whenever we get an error. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:02 +02:00
Filipe Manana	b7db594bc2	btrfs: remove pointless out label from load_free_space_bitmaps() All we do under the label is to return, so there's no point in having it, just return directly whenever we get an error. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:02 +02:00
Filipe Manana	5801a749a9	btrfs: remove pointless out label from add_free_space_extent() All we do under the label is to return, so there's no point in having it, just return directly whenever we get an error. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:02 +02:00
Filipe Manana	e3ecf6f164	btrfs: remove pointless out label from remove_free_space_extent() All we do under the label is to return, so there's no point in having it, just return directly whenever we get an error. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:02 +02:00
Filipe Manana	ffb7068f16	btrfs: remove pointless out label from modify_free_space_bitmap() All we do under the label is to return, so there's no point in having it, just return directly whenever we get an error. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:02 +02:00
Filipe Manana	22b609768c	btrfs: make free_space_test_bit() return a boolean instead The function returns the result of another function that returns a boolean (extent_buffer_test_bit()), and all the callers need is a boolean and not an integer. So change its return type from int to bool, and modify the callers to store results in booleans instead of integers, which also makes them simpler. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:02 +02:00
Filipe Manana	790b88c4dd	btrfs: make extent_buffer_test_bit() return a boolean instead All the callers want is to determine if a bit is set and all of them call the function and do a double negation (!!) on its result to get a boolean. So change it to return a boolean and simplify callers. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:02 +02:00
Filipe Manana	e4e5fcbc62	btrfs: remove pointless out label from update_free_space_extent_count() Just return directly, we don't need the label since all we do under it is to return. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:01 +02:00
Filipe Manana	61b43a9374	btrfs: remove pointless out label from add_new_free_space_info() We can just return directly if btrfs_insert_empty_item() fails, there is no need to release the path before returning, as all callers (or upper in the call chain) will free the path if they get an error from the call to add_new_free_space_info(), which is also a common pattern everywhere in btrfs. Finally there's no need to set 'ret' to 0 if the call to btrfs_insert_empty_item() didn't fail, since 'ret' is already 0. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:01 +02:00
David Sterba	44892c5a3e	btrfs: tree-log: add and rename extent bits for dirty_log_pages tree The dirty_log_pages tree is used for tree logging and marks extents based on log_transid. The bits could be renamed to resemble the LOG1/LOG2 naming used for the BTRFS_FS_LOG1_ERR bits. The DIRTY bit is renamed to LOG1 and NEW to LOG2. Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:01 +02:00
David Sterba	55cd57faa5	btrfs: use folio_end() where appropriate Simplify folio_pos() + folio_size() and use the new helper. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:01 +02:00
David Sterba	89a3cc19e4	btrfs: add helper folio_end() There are several cases of folio_pos + folio_size, add a convenience helper for that. This is a local helper and not proposed as folio API because it does not seem to be heavily used elsewhere: A quick grep (folio_size + folio_end) in fs/ shows 24 btrfs 4 iomap 4 ext4 2 xfs 2 netfs 1 gfs2 1 f2fs 1 bcachefs 1 buffer.c Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:01 +02:00
David Sterba	d549391fc6	btrfs: rename variables for locked range in defrag_prepare_one_folio() In preparation to use a helper for folio_pos + folio_size, rename the variables for the locked range so they don't use the 'folio_' prefix. As the locking ranges take inclusive end of the range (hence the "-1") this would be confusing as the folio helpers typically use exclusive end of the range. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:01 +02:00
David Sterba	e47c8a4767	btrfs: simplify range end calculations in truncate_block_zero_beyond_eof() The way zero_end is calculated and used does a -1 and +1 that effectively cancel out, so this can be simplified. This is also preparatory patch for using a helper for folio_pos + folio_size with the semantics of exclusive end of the range. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:01 +02:00
Filipe Manana	bdd01fb036	btrfs: check BLOCK_GROUP_FLAG_NEEDS_FREE_SPACE at __add_block_group_free_space() Every caller of __add_block_group_free_space() is checking if the flag BLOCK_GROUP_FLAG_NEEDS_FREE_SPACE is set before calling it. This is duplicate code and it's prone to some mistake in case we add more callers in the future. So move the check for that flag into the start of __add_block_group_free_space(), and, as a consequence, the path allocation from add_block_group_free_space() is moved into __add_block_group_free_space(), to preserve the behaviour of allocating a path only if the flag BLOCK_GROUP_FLAG_NEEDS_FREE_SPACE is set. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:01 +02:00
Filipe Manana	1f06c942aa	btrfs: always abort transaction on failure to add block group to free space tree Only one of the callers of __add_block_group_free_space() aborts the transaction if the call fails, while the others don't do it and it's either never done up the call chain or much higher in the call chain. So make sure we abort the transaction at __add_block_group_free_space() if it fails, which brings a couple benefits: 1) If some call chain never aborts the transaction, we avoid having some metadata inconsistency because BLOCK_GROUP_FLAG_NEEDS_FREE_SPACE is cleared when we enter __add_block_group_free_space() and therefore __add_block_group_free_space() is never called again to add the block group items to the free space tree, since the function is only called when that flag is set in a block group; 2) If the call chain already aborts the transaction, then we get a better trace that points to the exact step from __add_block_group_free_space() which failed, which is better for analysis. So abort the transaction at __add_block_group_free_space() if any of its steps fails. CC: stable@vger.kernel.org # 6.6+ Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:57:56 +02:00
Qu Wenruo	936f0b49dc	btrfs: add extra warning when qgroup is marked inconsistent Unlike qgroup rescan, which always shows whether it cleared the inconsistent flag, we do not have a proper way to show if qgroup is marked inconsistent. This was not a big deal before as there aren't that many locations that can mark qgroup inconsistent. But with the introduction of drop_subtree_threshold, qgroup can be marked inconsistent very frequently, especially when dropping subvolumes. Although most user space tools relying on qgroup should do their own checks and queue a rescan if needed, we have no idea when qgroup is marked inconsistent, and this would be much harder to debug. So this patch will add an extra warning (btrfs_warn_rl()) when the qgroup flag is flipped into inconsistent for the first time. It also adds the reason why qgroup flips to inconsistent. This means we can move the error message immediately before qgroup_inconsistent_warning() into that function. For call sites without an obvious reason, or that share error handling, output the function that failed and the error code instead. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:56:39 +02:00
David Sterba	b37532bffd	btrfs: merge btrfs_printk_ratelimited() and its only caller There's only one caller of btrfs_printk_ratelimited(), merge it there. Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:56:39 +02:00
David Sterba	2f3f1ad7f1	btrfs: simplify debug print helpers without enabled printk The btrfs_debug() helpers depend on various config options. In case of "no_printk" we don't need to define a special helper that in the end does nothing but validates the parameters. As the default build config is to validate the parameters we can simplify it to let the debug helpers expand to nothing and remove btrfs_no_printk_in_rcu(). To avoid warnings use fs_info and inline one variable in extent_from_logical(). Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:56:39 +02:00
David Sterba	f9095103f2	btrfs: remove remaining unused message helpers Remove the critical level message helpers as they're not used, the RCU protection is provided by the plain helpers. Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:56:39 +02:00
David Sterba	80f4fab544	btrfs: switch RCU helper versions to btrfs_debug() The RCU protection is now done in the plain helpers, we can remove the "_in_rcu" and "_rl_in_rcu". Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:56:38 +02:00
David Sterba	2eac2ae8b2	btrfs: switch RCU helper versions to btrfs_info() The RCU protection is now done in the plain helpers, we can remove the "_in_rcu" and "_rl_in_rcu". Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:56:38 +02:00
David Sterba	0fe04bf132	btrfs: switch RCU helper versions to btrfs_warn() The RCU protection is now done in the plain helpers, we can remove the "_in_rcu" and "_rl_in_rcu". Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:56:38 +02:00
David Sterba	9db18fe3ac	btrfs: switch RCU helper versions to btrfs_err() The RCU protection is now done in the plain helpers, we can remove the "_in_rcu" and "_rl_in_rcu". Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:56:38 +02:00
David Sterba	0e26727a73	btrfs: switch all message helpers to be RCU safe We have two versions of message helpers, one that provides RCU protection around the call in case we need to dereference device name. As messages are not performance critical we can set up the RCU protection for all of them and drop the distinction for those where device name is needed. This will lead to further simplifications. Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:56:38 +02:00
David Sterba	4d4b489ef1	btrfs: remove unused levels of message helpers We're using the following levels: crit, err, warn, info, debug. This covers our needs and further specialization is not needed, so let's remove emerg, alert and notice. Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:56:38 +02:00
David Sterba	ee3af49a05	btrfs: remove unused rcu-string printk helpers The RCU-string API has never taken off and we don't use the printk helpers provided as we do the protection in our helpers. Remove the "in RCU" wrappers. Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:56:38 +02:00
David Sterba	d1d1c85427	btrfs: open code rcu_string_free() and remove it The helper is trivial and we can simply use kfree_rcu() if needed. In our case it's just one place where we rename a device in device_list_add() and the old name can still be used until the end of the RCU grace period. The other case is freeing a device and there nothing should reach the device, so we can use plain kfree(). Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:56:38 +02:00
Johannes Thumshirn	694ce5e143	btrfs: zoned: reserve data_reloc block group on mount Create a block group dedicated for data relocation on mount of a zoned filesystem. If there is already more than one empty DATA block group on mount, one of them is picked as the data relocation block group, instead of a newly created one. This is done to ensure there is always space for performing garbage collection and the filesystem is not hitting ENOSPC under heavy overwrite workloads. CC: stable@vger.kernel.org # 6.6+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:56:31 +02:00
David Sterba	f1f22dfbea	btrfs: use btrfs_root_id() where not done yet A few more remaining cases where we can use the helper. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:31 +02:00
David Sterba	918fb77073	btrfs: use btrfs_is_data_reloc_root() where not done yet Two remaining cases where we can use the helper. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:31 +02:00
David Sterba	c6aeae86b9	btrfs: use on-stack variable for block reserve in btrfs_replace_file_extents() We can avoid potential memory allocation failure in btrfs_replace_file_extents() as the block reserve lifetime is limited to the scope of the function. This requires +48 bytes on stack. Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:31 +02:00
David Sterba	7ce22f62b2	btrfs: use on-stack variable for block reserve in btrfs_truncate() We can avoid potential memory allocation failure in btrfs_truncate() as the block reserve lifetime is limited to the scope of the function. This requires +48 bytes on stack. Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:31 +02:00
David Sterba	ec41c34547	btrfs: use on-stack variable for block reserve in btrfs_evict_inode() We can avoid potential memory allocation failure in btrfs_evict_inode() as the block reserve lifetime is limited to the scope of the function. This requires +48 bytes on stack. Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:31 +02:00
Sun YangKai	8811ace439	btrfs: update comment for xarray fields in struct btrfs_root The inode_lock field of struct btrfs_root was removed in commit e2844cce75c9e61("btrfs: remove inode_lock from struct btrfs_root and use xarray locks") but the related comment hasn't been updated. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Sun YangKai <sunk67188@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:31 +02:00
Qu Wenruo	cc38d178ff	btrfs: enable large data folio support under CONFIG_BTRFS_EXPERIMENTAL With all the preparation patches already merged, it's pretty easy to enable large data folios:
- Remove the ASSERT() on folio size in btrfs_end_repair_bio()
- Add a helper to properly set the max folio order
  Currently due to several call sites that are fetching the bitmap content directly into an unsigned long, we can only support BITS_PER_LONG blocks for each bitmap.
- Call the helper when reading/creating an inode
The support has the following limitations:
- No large folios for data reloc inode
  The relocation code still requires page sized folio. But it's not that hot nor common compared to regular buffered ios. Will be improved in the future.
- Requires CONFIG_BTRFS_EXPERIMENTAL
- Will require all folio related operations to check if it needs the extra btrfs_subpage structure
  Now any folio larger than block size will need btrfs_subpage structure handling.
Unfortunately I do not have a physical machine for performance test, but if everything goes like XFS/EXT4, it should mostly bring single digits percentage performance improvement in the real world. Although I believe there are still quite some optimizations to be done, let's focus on testing the current large data folio support first. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:30 +02:00
Filipe Manana	b769777d92	btrfs: use refcount_t type for the extent buffer reference counter Instead of using a bare atomic, use the refcount_t type, which despite being a structure that contains only an atomic, has an API that checks for underflows and other hazards. This doesn't change the size of the extent_buffer structure. This removes the need to do things like this:
  WARN_ON(atomic_read(&eb->refs) == 0);
  if (atomic_dec_and_test(&eb->refs)) {
          (...)
  }
And do just:
  if (refcount_dec_and_test(&eb->refs)) {
          (...)
  }
Since refcount_dec_and_test() already triggers a warning when we decrement a ref count that has a value of 0 (or below zero). Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:30 +02:00
Filipe Manana	2697b61597	btrfs: add comment for optimization in free_extent_buffer() There's this special atomic compare and exchange logic which serves to avoid locking the extent buffers refs_lock spinlock and therefore reduce lock contention, so add a comment to make it more obvious. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:30 +02:00
Filipe Manana	71c086b30d	btrfs: reorganize logic at free_extent_buffer() for better readability It's hard to read the logic to break out of the while loop since it's a very long expression consisting of a logical or of two composite expressions, each one composed by a logical and. Further each one is also testing for the EXTENT_BUFFER_UNMAPPED bit, making it more verbose than necessary. So change from this:
  if ((!test_bit(EXTENT_BUFFER_UNMAPPED, &eb->bflags) && refs <= 3)
      || (test_bit(EXTENT_BUFFER_UNMAPPED, &eb->bflags) && refs == 1))
          break;
To this:
  if (test_bit(EXTENT_BUFFER_UNMAPPED, &eb->bflags)) {
          if (refs == 1)
                  break;
  } else if (refs <= 3) {
          break;
  }
At least on x86_64 using gcc 9.3.0, this doesn't change the object size. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:30 +02:00
Filipe Manana	41e4ea0bf5	btrfs: make btrfs_readdir_delayed_dir_index() return a bool instead There's no need to return errors, all we do is return 1 or 0 depending on whether we should or should not stop iterating over delayed dir indexes. So change the function to return bool instead of an int. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:30 +02:00
Filipe Manana	4106eb9bda	btrfs: make btrfs_should_delete_dir_index() return a bool instead There's no need to return errors and we currently return 1 in case the index should be deleted and 0 otherwise, so change the return type from int to bool as this is a boolean function and it's more clear this way. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:30 +02:00
Filipe Manana	adc1ef55dc	btrfs: add details to error messages at btrfs_delete_delayed_dir_index() Update the error messages with: 1) Fix typo in the first one, deltiona -> deletion; 2) Remove redundant part of the first message, the part following the comma, and including all the useful information: root, inode, index and error value; 3) Update the second message to use more formal language (example 'error' instead of 'err'), remove redundant part about "deletion tree of delayed node..." and print the relevant information in the same format and order as the first message, without the ugly opening parenthesis without a space separating from the previous word. This also makes the message similar in format to the one we have at btrfs_insert_delayed_dir_index(). Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:30 +02:00
Filipe Manana	0187acef35	btrfs: make btrfs_delete_delayed_insertion_item() return a boolean There's no need to return an integer as all we need to do is return true or false to tell whether we deleted a delayed item or not. Also the logic is inverted since we return 1 (true) if we didn't delete and 0 (false) if we did, which is somewhat counter intuitive. Change the return type to a boolean and make it return true if we deleted and false otherwise. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:30 +02:00
Filipe Manana	7077d7b872	btrfs: switch del_all argument of replay_dir_deletes() from int to bool The argument has boolean semantics, so change its type from int to bool, making it more clear. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:30 +02:00
Filipe Manana	5f8882c854	btrfs: pass NULL index to btrfs_del_inode_ref() where not needed There are two callers of btrfs_del_inode_ref() that declare a local index variable and then pass a pointer for it to btrfs_del_inode_ref(), but then don't use that index at all. Since btrfs_del_inode_ref() accepts a NULL index pointer, pass NULL instead and stop declaring those useless index variables. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:30 +02:00
Filipe Manana	93612a92ba	btrfs: allocate scratch eb earlier at btrfs_log_new_name() Instead of allocating the scratch eb after joining the log transaction, allocate it before so that we're not delaying log commits for longer than necessary, as allocating the scratch eb means allocating an extent_buffer structure, which comes from a dedicated kmem_cache, plus pages/folios to attach to the eb. Both of these allocations may take time when we're under memory pressure. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:30 +02:00
Filipe Manana	841324a8e6	btrfs: allocate path earlier at btrfs_log_new_name() Instead of allocating the path after joining the log transaction, allocate it before so that we're not delaying log commits for the rare cases where the allocation takes a significant time (under memory pressure and all slabs are full, there's the need to allocate a new page, etc). Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:29 +02:00
Filipe Manana	b32efae7b8	btrfs: allocate path earlier at btrfs_del_dir_entries_in_log() Instead of allocating the path after joining the log transaction, allocate it before so that we're not delaying log commits for the rare cases where the allocation takes a significant time (under memory pressure and all slabs are full, there's the need to allocate a new page, etc). Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:29 +02:00
Filipe Manana	181436a85b	btrfs: assert we join log transaction at btrfs_del_dir_entries_in_log() We are supposed to be able to join a log transaction at that point, since we have determined that the inode was logged in the current transaction with the call to inode_logged(). So ASSERT() we joined a log transaction and also warn if we didn't in case assertions are disabled (the kernel config doesn't have CONFIG_BTRFS_ASSERT=y), so that the issue gets noticed and reported if it ever happens. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:29 +02:00
Filipe Manana	1ed0cfc89e	btrfs: use btrfs_del_item() at del_logged_dentry() There's no need to use btrfs_delete_one_dir_name() at del_logged_dentry() because we are processing a dir index key which can contain only a single name, unlike dir item keys which can encode multiple names in case of name hash collisions. We have explicitly looked up a dir index key by calling btrfs_lookup_dir_index_item() and we don't log dir item keys anymore (since commit `339d035424` ("btrfs: only copy dir index keys when logging a directory")). So simplify and use btrfs_del_item() directly instead of btrfs_delete_one_dir_name(). Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:29 +02:00
Filipe Manana	0ef4c6120e	btrfs: free path sooner at __btrfs_unlink_inode() After calling btrfs_delete_one_dir_name() there's no need for the path anymore so we can free it immediately after. After that point we do some btree operations that take time and in those call chains we end up allocating paths for these operations, so we're unnecessarily holding on to the path we allocated early at __btrfs_unlink_inode(). So free the path as soon as we don't need it and add a comment. This also allows simplifying the error path, removing two exit labels and returning directly when an error happens. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:29 +02:00
Filipe Manana	d94edb0d7e	btrfs: assert we join log transaction at btrfs_del_inode_ref_in_log() We are supposed to be able to join a log transaction at that point, since we have determined that the inode was logged in the current transaction with the call to inode_logged(). So ASSERT() we joined a log transaction and also warn if we didn't in case assertions are disabled (the kernel config doesn't have CONFIG_BTRFS_ASSERT=y), so that the issue gets noticed and reported if it ever happens. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:29 +02:00
Al Viro	75764b41bf	btrfs: open code fc_mount() to avoid releasing s_umount rw_semaphore [CURRENT BEHAVIOR] Currently inside btrfs_get_tree_subvol(), we call fc_mount() to grab a tree, then re-lock s_umount inside btrfs_reconfigure_for_mount() to avoid race with remount. However fc_mount() itself is just doing two things: 1. Call vfs_get_tree() 2. Release s_umount then call vfs_create_mount() [ENHANCEMENT] Instead of calling fc_mount(), we can open-code it with vfs_get_tree() first. This provides a benefit that, since we have the full control of s_umount, we do not need to re-lock that rw_semaphore when calling btrfs_reconfigure_for_mount(), meaning fewer races between RO/RW remounts. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Reviewed-by: Qu Wenruo <wqu@suse.com> [ Rework the subject and commit message, refactor the error handling ] Signed-off-by: Qu Wenruo <wqu@suse.com> Tested-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:29 +02:00
David Sterba	4013cde56e	btrfs: rename err to ret in scrub_submit_extent_sector_read() Unify naming of return value to the preferred way. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:29 +02:00
David Sterba	56ccdd9af2	btrfs: rename err to ret in btrfs_create_common() Unify naming of return value to the preferred way. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:29 +02:00
David Sterba	7d13ea864e	btrfs: rename err to ret in btrfs_wait_tree_log_extents() Unify naming of return value to the preferred way. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:29 +02:00
David Sterba	0b2cd9e2c7	btrfs: rename err to ret in btrfs_wait_extents() Unify naming of return value to the preferred way. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:28 +02:00
David Sterba	69c5c6130d	btrfs: rename err to ret in quota_override_store() Unify naming of return value to the preferred way. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:28 +02:00
David Sterba	148961dac3	btrfs: rename err to ret in btrfs_fill_super() Unify naming of return value to the preferred way. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:28 +02:00
David Sterba	60a8bab08c	btrfs: rename err to ret in calc_pct_ratio() Unify naming of return value to the preferred way. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:28 +02:00
David Sterba	3b5742f379	btrfs: rename err to ret in btrfs_symlink() Unify naming of return value to the preferred way. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:28 +02:00
David Sterba	af6f6c3af7	btrfs: rename err to ret in btrfs_link() Unify naming of return value to the preferred way. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:28 +02:00
David Sterba	9cf280e2bd	btrfs: rename err to ret in btrfs_setattr() Unify naming of return value to the preferred way. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:28 +02:00
David Sterba	b71a348513	btrfs: rename err to ret in btrfs_init_inode_security() Unify naming of return value to the preferred way. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:28 +02:00
David Sterba	d64ef1d23f	btrfs: rename err to ret in btrfs_alloc_from_bitmap() Unify naming of return value to the preferred way. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:28 +02:00
David Sterba	8d9e877919	btrfs: rename err to ret in btrfs_lock_extent_bits() Unify naming of return value to the preferred way. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:28 +02:00
David Sterba	886240cbcd	btrfs: rename err to ret in btrfs_try_lock_extent_bits() Unify naming of return value to the preferred way. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:27 +02:00
David Sterba	986b6aa185	btrfs: rename err to ret2 in btrfs_truncate_inode_items() Unify naming of return value to the preferred way. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:27 +02:00
David Sterba	a579ddca43	btrfs: rename err to ret2 in btrfs_add_link() Unify naming of return value to the preferred way. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:27 +02:00
David Sterba	8f38507068	btrfs: rename err to ret2 in btrfs_setsize() Unify naming of return value to the preferred way. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:27 +02:00
David Sterba	df20be9f02	btrfs: rename err to ret2 in btrfs_search_old_slot() Unify naming of return value to the preferred way, move the variable to the closest scope. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:27 +02:00
David Sterba	644dcb4316	btrfs: rename err to ret2 in btrfs_search_slot() Unify naming of return value to the preferred way, move the variable to the closest scope. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:27 +02:00
David Sterba	56fc5b18c9	btrfs: rename err to ret2 in search_leaf() Unify naming of return value to the preferred way. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:27 +02:00
David Sterba	58019c1dd4	btrfs: rename err to ret2 in read_block_for_search() Unify naming of return value to the preferred way. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:27 +02:00
David Sterba	66ca7ea650	btrfs: rename err to ret2 in resolve_indirect_refs() Unify naming of return value to the preferred way, move the variable to the closest scope. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:27 +02:00
Qu Wenruo	582cd4bad4	btrfs: rename btrfs_subpage structure With the incoming large data folios support, the structure name btrfs_subpage is no longer correct, as we can have multiple blocks inside a large folio, and the block size is still page size. So to follow the schema of iomap, rename btrfs_subpage to btrfs_folio_state, along with involved enums. There are still exported functions with "btrfs_subpage_" prefix, and I believe for metadata the name "subpage" will stay forever as we will never allocate a folio larger than nodesize anyway. The full cleanup of the word "subpage" will happen in much smaller steps in the future. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:27 +02:00
Qu Wenruo	1e17738d6b	btrfs: add comments on the extra btrfs specific subpage bitmaps Unlike the iomap_folio_state structure, the btrfs_subpage structure has a lot of extra sub-bitmaps, namely: - writeback sub-bitmap - locked sub-bitmap iomap_folio_state uses an atomic for writeback tracking, while it has no per-block locked tracking. This is because iomap always locks a single folio, and submits dirty blocks with that folio locked. But btrfs has async delalloc ranges (for compression), which are queued with their range locked, until the compression is done, then marks the involved range writeback and unlocked. This means a range can be unlocked and marked writeback at seemingly random timing, thus it needs the extra tracking. This needs a huge rework on the lifespan of async delalloc range before we can remove/simplify these two sub-bitmaps. - ordered sub-bitmap - checked sub-bitmap These are for COW-fixup, but as I mentioned in the past, the COW-fixup is not really needed anymore and these two flags are already marked deprecated, and will be removed in the near future after comprehensive tests. Add related comments to indicate we're actively trying to align the sub-bitmaps to the iomap ones. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:27 +02:00
Daniel Vacek	3f093ccb95	btrfs: harden parsing of compression mount options Btrfs happily but incorrectly accepts the `-o compress=zlib+foo` and similar options with any random suffix. Fix that by explicitly checking the end of the strings. Signed-off-by: Daniel Vacek <neelx@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:26 +02:00
Daniel Vacek	3f0e865ae6	btrfs: factor out compression mount options parsing There are many options making the parsing a bit lengthy. Factor the compress options out into a helper function. The next patch is going to harden this function. Signed-off-by: Daniel Vacek <neelx@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:26 +02:00
David Sterba	ccb42a6eed	btrfs: constify more pointer parameters Another batch of pointer parameter constifications. This is for clarity and minor addition to type safety. There are no observable effects in the assembly code and .ko measured on release config. Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:26 +02:00
Boris Burkov	c7f04fbc98	btrfs: sysfs: track current commit duration in commit_stats When debugging/detecting outlier commit stalls, having an indicator that we are currently in a long commit critical section can be very useful. Extend the commit_stats sysfs file to also include the current commit critical section duration. Since this requires storing the last commit start time, use that rather than a separate stack variable for storing the finished commit durations as well. This also requires slightly moving up the timing of the stats updating to inside the critical section to avoid the transaction T+1 setting the critical_section_start_time to 0 before transaction T can update its stats, which would trigger the new ASSERT. This is an improvement in and of itself, as it makes the stats more accurately represent the true critical section time. It may be yet better to pull the stats up to where start_transaction gets unblocked, rather than the next commit, but this seems like a good enough place as well. Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:26 +02:00
Pan Chuang	46d549928c	btrfs: use rb_find_add() in rb_simple_insert() Use the rb-tree helper so we don't open code the search and insert code. Signed-off-by: Pan Chuang <panchuang@vivo.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:26 +02:00
Pan Chuang	c52ea14d05	btrfs: pass struct rb_simple_node pointer directly in rb_simple_insert() Replace struct embedding with union to enable safe type conversion in btrfs_backref_node, tree_block and mapping_node. Adjust function calls to use the new unified API, eliminating redundant parameters. Signed-off-by: Pan Chuang <panchuang@vivo.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:26 +02:00
Yangtao Li	fbec9a5d3e	btrfs: use rb_find_add() in btrfs_qgroup_add_swapped_blocks() Use the rb-tree helper so we don't open code the search and insert code. Signed-off-by: Yangtao Li <frank.li@vivo.com> Signed-off-by: Pan Chuang <panchuang@vivo.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:26 +02:00
Yangtao Li	844e5f902d	btrfs: use rb_find() in btrfs_qgroup_trace_subtree_after_cow() Use the rb-tree helper so we don't open code the search code. Signed-off-by: Yangtao Li <frank.li@vivo.com> Signed-off-by: Pan Chuang <panchuang@vivo.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:26 +02:00
Yangtao Li	e3def6ce67	btrfs: use rb_find_add() in add_qgroup_rb() Use the rb-tree helper so we don't open code the search and insert code. Signed-off-by: Yangtao Li <frank.li@vivo.com> Signed-off-by: Pan Chuang <panchuang@vivo.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:26 +02:00
Yangtao Li	1e0f0239a3	btrfs: use rb_find() in find_qgroup_rb() Use the rb-tree helper so we don't open code the search code. Signed-off-by: Yangtao Li <frank.li@vivo.com> Signed-off-by: Pan Chuang <panchuang@vivo.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:26 +02:00
Yangtao Li	287480e269	btrfs: use rb_find_add() in insert_ref_entry() Use the rb-tree helper so we don't open code the search and insert code. Signed-off-by: Yangtao Li <frank.li@vivo.com> Signed-off-by: Pan Chuang <panchuang@vivo.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:25 +02:00
Yangtao Li	c6e3ae8ac3	btrfs: use rb_find_add() in insert_root_entry() Use the rb-tree helper so we don't open code the search and insert code. Signed-off-by: Yangtao Li <frank.li@vivo.com> Signed-off-by: Pan Chuang <panchuang@vivo.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:25 +02:00
Yangtao Li	afaa9f8235	btrfs: use rb_find() in lookup_root_entry() Use the rb-tree helper so we don't open code the search code. Signed-off-by: Yangtao Li <frank.li@vivo.com> Signed-off-by: Pan Chuang <panchuang@vivo.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:25 +02:00
Yangtao Li	3f60f4374a	btrfs: use rb_find_add() in insert_block_entry() Use the rb-tree helper so we don't open code the search and insert code. Signed-off-by: Yangtao Li <frank.li@vivo.com> Signed-off-by: Pan Chuang <panchuang@vivo.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:25 +02:00
Yangtao Li	4044a7ed3b	btrfs: use rb_find() in lookup_block_entry() Use the rb-tree helper so we don't open code the search code. Signed-off-by: Yangtao Li <frank.li@vivo.com> Signed-off-by: Pan Chuang <panchuang@vivo.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:25 +02:00
Yangtao Li	b017a92bd9	btrfs: use rb_find_add() in ulist_rbtree_insert() Use the rb-tree helper so we don't open code the search and insert code. Signed-off-by: Yangtao Li <frank.li@vivo.com> Signed-off-by: Pan Chuang <panchuang@vivo.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:25 +02:00
Yangtao Li	c4f38e7ca5	btrfs: use rb_find() in ulist_rbtree_search() Use the rb-tree helper so we don't open code the search code. Signed-off-by: Yangtao Li <frank.li@vivo.com> Signed-off-by: Pan Chuang <panchuang@vivo.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:25 +02:00
Yangtao Li	9734685854	btrfs: use rb_find() in __btrfs_lookup_delayed_item() Use the rb-tree helper so we don't open code the search code. Signed-off-by: Yangtao Li <frank.li@vivo.com> Signed-off-by: Pan Chuang <panchuang@vivo.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:25 +02:00
Yangtao Li	7a91e01875	btrfs: use rb_find_add() in btrfs_insert_inode_defrag() Use the rb-tree helper so we don't open code the search and insert code. Signed-off-by: Yangtao Li <frank.li@vivo.com> Signed-off-by: Pan Chuang <panchuang@vivo.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:25 +02:00
Dan Johnson	06c3437f74	btrfs: fix comment in reserved space warning mkfs.btrfs up to v4.14 actually can leave a chunk inside the reserved space when invoked with `-m single`, fixed by 997f9977c24397eb6980bb9 ("mkfs: Prevent temporary system chunk to use space in reserved 1M range") released with v4.15. Signed-off-by: Dan Johnson <ComputerDruid@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:25 +02:00
Daniel Vacek	d8f6cb2b28	btrfs: relocation: simplify unused logic related to LINK_LOWER btrfs_backref_link_edge() is always called with the LINK_LOWER argument. We can simplify it and remove the LINK_LOWER and LINK_UPPER macros completely. The last call with LINK_UPPER was removed with commit `0097422c0d` ("btrfs: remove clone_backref_node() from relocation"). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Daniel Vacek <neelx@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:25 +02:00
Filipe Manana	593062f67b	btrfs: unfold transaction abort at btrfs_insert_one_raid_extent() We have a common error path where we abort the transaction, but with that arrangement a transaction abort stack trace doesn't tell us exactly which previous function call failed. Instead abort the transaction after any function call that returns an error, so that we can easily identify which function failed. Reviewed-by: Daniel Vacek <neelx@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:24 +02:00
Filipe Manana	35bb03e57a	btrfs: unfold transaction abort at __btrfs_update_delayed_inode() We have a common error path where we abort the transaction, but with that arrangement a transaction abort stack trace doesn't tell us exactly which previous function call failed. Instead abort the transaction after any function call that returns an error, so that we can easily identify which function failed. Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:24 +02:00
Filipe Manana	33e8f24b52	btrfs: abort transaction on unexpected eb generation at btrfs_copy_root() If we find an unexpected generation for the extent buffer we are cloning at btrfs_copy_root(), we just WARN_ON() and don't error out and abort the transaction, meaning we allow metadata with an unexpected generation to be persisted. Instead of warning only, abort the transaction and return -EUCLEAN. CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Daniel Vacek <neelx@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:16 +02:00
Filipe Manana	273bbb5b48	btrfs: unfold transaction abort at btrfs_copy_root() Instead of having a common btrfs_abort_transaction() call for when any of the two btrfs_inc_ref() calls fail, move the btrfs_abort_transaction() to happen immediately after each one of the calls, so that when analyzing a stack trace with a transaction abort we know which call failed. Reviewed-by: Daniel Vacek <neelx@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:50:35 +02:00
David Sterba	b63c8c1ede	btrfs: move transaction aborts to the error site in add_block_group_free_space() Transaction aborts should be done next to the place the error happens, which was not done in add_block_group_free_space(). Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:50:34 +02:00
David Sterba	0b10f3dd13	btrfs: move transaction aborts to the error site in remove_block_group_free_space() Transaction aborts should be done next to the place the error happens, which was not done in remove_block_group_free_space(). Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:50:34 +02:00
Filipe Manana	81bfd9d547	btrfs: simplify error detection flow during log replay We have this fuzzy logic at btrfs_recover_log_trees() where we don't abort the transaction and exit immediately after each function call that returned an error, and instead have if-then-else logic or check if the previous function call returned success before calling the next function. Make the flow more straightforward by immediately aborting the transaction and exiting after each function call failure. This also allows us to avoid two consecutive if statements that test the same conditions: if (!ret && wc.stage == LOG_WALK_REPLAY_ALL) { (...) } Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:50:34 +02:00
Filipe Manana	6466084df6	btrfs: remove redundant path release when replaying a log tree There's no need to call btrfs_release_path() before calling btrfs_init_root_free_objectid() as we have released the path already at the top of the loop and the previous call to fixup_inode_link_counts() also releases the path. So remove it to simplify the code. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:50:34 +02:00
Filipe Manana	2a5898c4aa	btrfs: abort transaction during log replay if walk_log_tree() failed If we failed walking a log tree during replay, we have a missing transaction abort to prevent committing a transaction where we didn't fully replay all the changes from a log tree and therefore can leave the respective subvolume tree in some inconsistent state. So add the missing transaction abort. CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:50:28 +02:00
Filipe Manana	8f1e1b263d	btrfs: unfold transaction aborts when replaying log trees We have a single line doing a transaction abort in case either we got an error from btrfs_get_fs_root() different from -ENOENT or we got an error from btrfs_pin_extent_for_log_replay(), making it hard to figure out which function call failed when looking at a transaction abort message and stack trace in dmesg. Change this to have an explicit transaction abort for each one of the two cases. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:44:11 +02:00
Johannes Thumshirn	2a946bf6d6	btrfs: make btrfs_should_periodic_reclaim() static btrfs_should_periodic_reclaim() is not used outside of space-info.c so make it static and remove the prototype from space-info.h. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:44:11 +02:00
Johannes Thumshirn	55f7c65b2f	btrfs: zoned: use filesystem size not disk size for reclaim decision When deciding if a zoned filesystem is reaching the threshold to reclaim data block groups, look at the size of the filesystem, not at the potential total available size of all drives in the filesystem. Especially if a filesystem was created with mkfs' -b option, constraining it to only a portion of the block device, the numbers won't match and garbage collection potentially kicks in too late. Fixes: `3687fcb075` ("btrfs: zoned: make auto-reclaim less aggressive") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Tested-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:42:24 +02:00
Filipe Manana	f2de2b9ffd	btrfs: unfold transaction abort at clone_copy_inline_extent() We have a common error path where we abort the transaction, but with that arrangement a transaction abort stack trace doesn't tell us exactly which previous function call failed. Instead abort the transaction after any function call that returns an error, so that we can easily identify which function failed. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 15:31:06 +02:00
Filipe Manana	5ff6050fcd	btrfs: remove pointless 'out' label from clone_finish_inode_update() The label is only used once and we can instead return directly where it's used, besides the fact that all we do under the label is to return the value of 'ret'. So get rid of the label and return directly. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 15:31:06 +02:00
Filipe Manana	5cf0e668ea	btrfs: unfold transaction abort at walk_up_proc() Instead of having a common btrfs_abort_transaction() call for when any of the two btrfs_dec_ref() calls fail, move the btrfs_abort_transaction() to happen immediately after each one of the calls, so that when analysing a stack trace with a transaction abort we know which call failed. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 15:31:05 +02:00
Filipe Manana	227aa55fa2	btrfs: unfold transaction abort at __btrfs_inc_extent_ref() Instead of having a common btrfs_abort_transaction() call for when either insert_tree_block_ref() failed or when insert_extent_data_ref() failed, move the btrfs_abort_transaction() to happen immediately after each one of those calls, so that when analysing a stack trace with a transaction abort we know which call failed. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 15:31:05 +02:00
Filipe Manana	3f757b56f1	btrfs: unfold transaction aborts at btrfs_create_new_inode() Instead of having a common btrfs_abort_transaction() call for when either btrfs_orphan_add() failed or when btrfs_add_link() failed, move the btrfs_abort_transaction() to happen immediately after each one of those calls, so that when analysing a stack trace with a transaction abort we know which call failed. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 15:31:05 +02:00
Caleb Sander Mateos	9aad72b4e3	btrfs/ioctl: store btrfs_uring_encoded_data in io_btrfs_cmd btrfs is the only user of struct io_uring_cmd_data and its op_data field. Switch its ->uring_cmd() implementations to store the struct btrfs_uring_encoded_data * in the struct io_btrfs_cmd, overlaid with io_uring_cmd's pdu field. This avoids having to touch another cache line to access the struct btrfs_uring_encoded_data *, and allows op_data and struct io_uring_cmd_data to be removed. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Acked-by: David Sterba <dsterba@suse.com> Link: https://lore.kernel.org/r/20250708202212.2851548-4-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-07-18 12:34:56 -06:00
Christian Brauner	ca115d7e75	tree-wide: s/struct fileattr/struct file_kattr/g Now that we expose struct file_attr as our uapi struct rename all the internal struct to struct file_kattr to clearly communicate that it is a kernel internal struct. This is similar to struct mount_{k}attr and others. Link: https://lore.kernel.org/20250703-restlaufzeit-baurecht-9ed44552b481@brauner Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-07-04 16:14:39 +02:00
Linus Torvalds	4c06e63b92	for-6.16-rc4-tag Merge tag 'for-6.16-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - tree-log fixes: - fixes of log tracking of directories and subvolumes - fix iteration and error handling of inode references during log replay - fix free space tree rebuild (reported by syzbot) * tag 'for-6.16-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: use btrfs_record_snapshot_destroy() during rmdir btrfs: propagate last_unlink_trans earlier when doing a rmdir btrfs: record new subvolume in parent dir earlier to avoid dir logging races btrfs: fix inode lookup error handling during log replay btrfs: fix iteration of extrefs during log replay btrfs: fix missing error handling when searching for inode refs during log replay btrfs: fix failure to rebuild free space tree using multiple transactions	2025-07-03 13:29:56 -07:00
Eric Biggers	2c7528d36e	btrfs: stop parsing crc32c driver name To determine whether the crc32c implementation is "fast", use crc32_optimizations() instead of parsing the crypto_shash driver name. This keeps the code working as intended after the driver name is changed by the next commit. Acked-by: David Sterba <dsterba@suse.com> Link: https://lore.kernel.org/r/20250613183753.31864-2-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@kernel.org>	2025-06-30 09:31:56 -07:00
Filipe Manana	157501b046	btrfs: use btrfs_record_snapshot_destroy() during rmdir We are setting the parent directory's last_unlink_trans directly which may result in a concurrent task starting to log the directory not see the update and therefore can log the directory after we removed a child directory which had a snapshot within instead of falling back to a transaction commit. Replaying such a log tree would result in a mount failure since we can't currently delete snapshots (and subvolumes) during log replay. This is the type of failure described in commit `1ec9a1ae1e` ("Btrfs: fix unreplayable log after snapshot delete + parent dir fsync"). Fix this by using btrfs_record_snapshot_destroy() which updates the last_unlink_trans field while holding the inode's log_mutex lock. Fixes: `44f714dae5` ("Btrfs: improve performance on fsync against new inode after rename/unlink") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-06-27 19:58:12 +02:00
Filipe Manana	c466e33e72	btrfs: propagate last_unlink_trans earlier when doing a rmdir In case the removed directory had a snapshot that was deleted, we are propagating its inode's last_unlink_trans to the parent directory after we removed the entry from the parent directory. This leaves a small race window where someone can log the parent directory after we removed the entry and before we updated last_unlink_trans, and as a result if we ever try to replay such a log tree, we will fail since we will attempt to remove a snapshot during log replay, which is currently not possible and results in the log replay (and mount) to fail. This is the type of failure described in commit `1ec9a1ae1e` ("Btrfs: fix unreplayable log after snapshot delete + parent dir fsync"). So fix this by propagating the last_unlink_trans to the parent directory before we remove the entry from it. Fixes: `44f714dae5` ("Btrfs: improve performance on fsync against new inode after rename/unlink") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-06-27 19:57:47 +02:00
Filipe Manana	bf5bcf9a6f	btrfs: record new subvolume in parent dir earlier to avoid dir logging races Instead of recording that a new subvolume was created in a directory after we add the entry to the directory, record it before adding the entry. This is to avoid races where after creating the entry and before recording the new subvolume in the directory (the call to btrfs_record_new_subvolume()), another task logs the directory, so we end up with a log tree where we logged a directory that has an entry pointing to a root that was not yet committed, resulting in an invalid entry if the log is persisted and replayed later due to a power failure or crash. Also state this requirement in the function comment for btrfs_record_new_subvolume(), similar to what we do for the btrfs_record_unlink_dir() and btrfs_record_snapshot_destroy(). Fixes: `45c4102f0d` ("btrfs: avoid transaction commit on any fsync after subvolume creation") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-06-27 19:57:24 +02:00
Filipe Manana	5f61b96159	btrfs: fix inode lookup error handling during log replay When replaying log trees we use read_one_inode() to get an inode, which is just a wrapper around btrfs_iget_logging(), which in turn is a wrapper for btrfs_iget(). But read_one_inode() always returns NULL for any error that btrfs_iget_logging() / btrfs_iget() may return and this is a problem because: 1) In many callers of read_one_inode() we convert the NULL into -EIO, which is not accurate since btrfs_iget() may return -ENOMEM and -ENOENT for example, besides -EIO and other errors. So during log replay we may end up reporting a false -EIO, which is confusing since we may not have had any IO error at all; 2) When replaying directory deletes, at replay_dir_deletes(), we assume the NULL returned from read_one_inode() means that the inode doesn't exist and then proceed as if no error had happened. This is wrong because unless btrfs_iget() returned ERR_PTR(-ENOENT), we had an actual error and the target inode may exist in the target subvolume root - this may later result in the log replay code failing at a later stage (if we are "lucky") or succeed but leaving some inconsistency in the filesystem. So fix this by not ignoring errors from btrfs_iget_logging() and as a consequence remove the read_one_inode() wrapper and just use btrfs_iget_logging() directly. Also since btrfs_iget_logging() is supposed to be called only against subvolume roots, just like read_one_inode() which had a comment about it, add an assertion to btrfs_iget_logging() to check that the target root corresponds to a subvolume root. Fixes: `5d4f98a28c` ("Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-06-27 19:57:06 +02:00
Filipe Manana	54a7081ed1	btrfs: fix iteration of extrefs during log replay At __inode_add_ref() when processing extrefs, if we jump into the next label we have an undefined value of victim_name.len, since we haven't initialized it before we did the goto. This results in an invalid memory access in the next iteration of the loop since victim_name.len was not initialized to the length of the name of the current extref. Fix this by initializing victim_name.len with the current extref's name length. Fixes: `e43eec81c5` ("btrfs: use struct qstr instead of name and namelen pairs") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-06-27 19:56:55 +02:00
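The class of bug fixed by the extref iteration commit above (a length that the `next` label relies on, left unset when the loop jumps there early) can be shown with a simplified model. This is a sketch under assumptions, not the btrfs code: `struct ref_hdr`, `append_ref()` and `count_refs()` are hypothetical, and the record layout is invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical packed record: this header followed by name_len name bytes. */
struct ref_hdr {
	uint64_t parent;
	uint16_t name_len;
};

/* Helper for building a buffer: append one record at off, return the new offset. */
static size_t append_ref(unsigned char *buf, size_t off, uint64_t parent,
			 const char *name)
{
	struct ref_hdr hdr;
	size_t len = strlen(name);

	assert(len <= UINT16_MAX);
	memset(&hdr, 0, sizeof(hdr));
	hdr.parent = parent;
	hdr.name_len = (uint16_t)len;
	memcpy(buf + off, &hdr, sizeof(hdr));
	memcpy(buf + off + sizeof(hdr), name, len);
	return off + sizeof(hdr) + len;
}

/* Count the records in buf whose parent is not skip_parent. */
static int count_refs(const unsigned char *buf, size_t size,
		      uint64_t skip_parent)
{
	size_t cur = 0;
	int count = 0;

	while (cur + sizeof(struct ref_hdr) <= size) {
		struct ref_hdr hdr;
		size_t name_len;

		memcpy(&hdr, buf + cur, sizeof(hdr));
		/* Set the length before any goto: the 'next' label depends on it. */
		name_len = hdr.name_len;

		if (hdr.parent == skip_parent)
			goto next;
		count++;
next:
		cur += sizeof(hdr) + name_len;
	}
	return count;
}
```

If `name_len` were assigned only after the `goto`, the cursor would advance by an indeterminate amount for skipped records, which is the kind of invalid access the commit message describes.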
Filipe Manana	6561a40cec	btrfs: fix missing error handling when searching for inode refs during log replay During log replay, at __add_inode_ref(), when we are searching for inode ref keys we totally ignore if btrfs_search_slot() returns an error. This may make a log replay succeed when there was an actual error and leave some metadata inconsistency in a subvolume tree. Fix this by checking if an error was returned from btrfs_search_slot() and if so, return it to the caller. Fixes: `e02119d5a7` ("Btrfs: Add a write ahead tree log to optimize synchronous operations") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-06-27 19:56:35 +02:00
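A minimal sketch of the error check described in the commit above, assuming a search callback that follows the return convention of btrfs_search_slot() (negative errno on error, 0 when the key is found, 1 when it is not). `fake_search()` and `lookup_ref()` are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Search callback: negative errno on error, 0 if found, 1 if not found. */
typedef int (*search_fn)(int key);

/* Hypothetical search: negative keys fail, even keys exist, odd keys do not. */
static int fake_search(int key)
{
	if (key < 0)
		return -EIO;
	return (key % 2 == 0) ? 0 : 1;
}

/* Propagate search errors instead of treating them as "not found". */
static int lookup_ref(search_fn search, int key, bool *found)
{
	int ret;

	assert(search != NULL && found != NULL);
	ret = search(key);
	if (ret < 0)
		return ret;

	*found = (ret == 0);
	return 0;
}
```

The point of the separate `ret < 0` branch is that "key not found" and "search failed" stay distinguishable for the caller.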
Filipe Manana	1e6ed33cab	btrfs: fix failure to rebuild free space tree using multiple transactions If we are rebuilding a free space tree, while modifying the free space tree we may need to allocate a new metadata block group. If we end up using multiple transactions for the rebuild, when we call btrfs_end_transaction() we enter btrfs_create_pending_block_groups() which calls add_block_group_free_space() to add items to the free space tree for the block group. Then later during the free space tree rebuild, at btrfs_rebuild_free_space_tree(), we may find such new block groups and call populate_free_space_tree() for them, which fails with -EEXIST because there are already items in the free space tree. Then we abort the transaction with -EEXIST at btrfs_rebuild_free_space_tree(). Notice that we say "may find" the new block groups because a new block group may be inserted in the block groups rbtree, which is being iterated by the rebuild process, before or after the current node where the rebuild process is currently at. 
Syzbot recently reported such case which produces a trace like the following: ------------[ cut here ]------------ BTRFS: Transaction aborted (error -17) WARNING: CPU: 1 PID: 7626 at fs/btrfs/free-space-tree.c:1341 btrfs_rebuild_free_space_tree+0x470/0x54c fs/btrfs/free-space-tree.c:1341 Modules linked in: CPU: 1 UID: 0 PID: 7626 Comm: syz.2.25 Not tainted 6.15.0-rc7-syzkaller-00085-gd7fa1af5b33e-dirty #0 PREEMPT Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/07/2025 pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : btrfs_rebuild_free_space_tree+0x470/0x54c fs/btrfs/free-space-tree.c:1341 lr : btrfs_rebuild_free_space_tree+0x470/0x54c fs/btrfs/free-space-tree.c:1341 sp : ffff80009c4f7740 x29: ffff80009c4f77b0 x28: ffff0000d4c3f400 x27: 0000000000000000 x26: dfff800000000000 x25: ffff70001389eee8 x24: 0000000000000003 x23: 1fffe000182b6e7b x22: 0000000000000000 x21: ffff0000c15b73d8 x20: 00000000ffffffef x19: ffff0000c15b7378 x18: 1fffe0003386f276 x17: ffff80008f31e000 x16: ffff80008adbe98c x15: 0000000000000001 x14: 1fffe0001b281550 x13: 0000000000000000 x12: 0000000000000000 x11: ffff60001b281551 x10: 0000000000000003 x9 : 1c8922000a902c00 x8 : 1c8922000a902c00 x7 : ffff800080485878 x6 : 0000000000000000 x5 : 0000000000000001 x4 : 0000000000000001 x3 : ffff80008047843c x2 : 0000000000000001 x1 : ffff80008b3ebc40 x0 : 0000000000000001 Call trace: btrfs_rebuild_free_space_tree+0x470/0x54c fs/btrfs/free-space-tree.c:1341 (P) btrfs_start_pre_rw_mount+0xa78/0xe10 fs/btrfs/disk-io.c:3074 btrfs_remount_rw fs/btrfs/super.c:1319 [inline] btrfs_reconfigure+0x828/0x2418 fs/btrfs/super.c:1543 reconfigure_super+0x1d4/0x6f0 fs/super.c:1083 do_remount fs/namespace.c:3365 [inline] path_mount+0xb34/0xde0 fs/namespace.c:4200 do_mount fs/namespace.c:4221 [inline] __do_sys_mount fs/namespace.c:4432 [inline] __se_sys_mount fs/namespace.c:4409 [inline] __arm64_sys_mount+0x3e8/0x468 fs/namespace.c:4409 __invoke_syscall 
arch/arm64/kernel/syscall.c:35 [inline] invoke_syscall+0x98/0x2b8 arch/arm64/kernel/syscall.c:49 el0_svc_common+0x130/0x23c arch/arm64/kernel/syscall.c:132 do_el0_svc+0x48/0x58 arch/arm64/kernel/syscall.c:151 el0_svc+0x58/0x17c arch/arm64/kernel/entry-common.c:767 el0t_64_sync_handler+0x78/0x108 arch/arm64/kernel/entry-common.c:786 el0t_64_sync+0x198/0x19c arch/arm64/kernel/entry.S:600 irq event stamp: 330 hardirqs last enabled at (329): [<ffff80008048590c>] raw_spin_rq_unlock_irq kernel/sched/sched.h:1525 [inline] hardirqs last enabled at (329): [<ffff80008048590c>] finish_lock_switch+0xb0/0x1c0 kernel/sched/core.c:5130 hardirqs last disabled at (330): [<ffff80008adb9e60>] el1_dbg+0x24/0x80 arch/arm64/kernel/entry-common.c:511 softirqs last enabled at (10): [<ffff8000801fbf10>] local_bh_enable+0x10/0x34 include/linux/bottom_half.h:32 softirqs last disabled at (8): [<ffff8000801fbedc>] local_bh_disable+0x10/0x34 include/linux/bottom_half.h:19 ---[ end trace 0000000000000000 ]--- Fix this by flagging new block groups which had their free space tree entries already added and then skip them in the rebuild process. Also, since the rebuild may be triggered when doing a remount, make sure that when we clear an existing free space tree that we clear such flag from every existing block group, otherwise we would skip those block groups during the rebuild. Reported-by: syzbot+d0014fb0fc39c5487ae5@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/68460a54.050a0220.daf97.0af5.GAE@google.com/ Fixes: `882af9f13e` ("btrfs: handle free space tree rebuild in multiple transactions") Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-06-27 19:56:15 +02:00
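The fix described in the commit above (flag block groups whose free space tree entries already exist, skip them during the rebuild, and reset the flag when the tree is cleared) can be modelled roughly as follows. All names here (`struct block_group`, `fst_added`, `populate()`, `add_new_block_group()`, `clear_tree()`, `rebuild()`) are hypothetical simplifications, not the kernel's structures or functions.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct block_group {
	bool fst_added; /* hypothetical flag: entries already added */
	int fst_items;  /* number of free space tree items for this group */
};

/* Add the items for one block group; fails if they already exist. */
static int populate(struct block_group *bg)
{
	assert(bg != NULL);
	if (bg->fst_items > 0)
		return -EEXIST;
	bg->fst_items = 1;
	return 0;
}

/* A block group created while the rebuild is in progress. */
static int add_new_block_group(struct block_group *bg)
{
	int ret = populate(bg);

	if (ret == 0)
		bg->fst_added = true;
	return ret;
}

/* Clearing the tree must also clear the flag, or a later rebuild skips too much. */
static void clear_tree(struct block_group *bgs, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		bgs[i].fst_items = 0;
		bgs[i].fst_added = false;
	}
}

/* Rebuild: skip block groups that already had their entries added. */
static int rebuild(struct block_group *bgs, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		int ret;

		if (bgs[i].fst_added)
			continue;
		ret = populate(&bgs[i]);
		if (ret)
			return ret;
	}
	return 0;
}
```

Without the flag, the rebuild would call `populate()` on the freshly created block group and get -EEXIST, which corresponds to the error -17 in the trace above.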
Linus Torvalds	5ca7fe213b	for-6.16-rc3-tag Merge tag 'for-6.16-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "Fixes: - fix invalid inode pointer dereferences during log replay - fix a race between renames and directory logging - fix shutting down delayed iput worker - fix device byte accounting when dropping chunk - in zoned mode, fix offset calculations for DUP profile when conventional and sequential zones are used together Regression fixes: - fix possible double unlock of extent buffer tree (xarray conversion) - in zoned mode, fix extent buffer refcount when writing out extents (xarray conversion) Error handling fixes and updates: - handle unexpected extent type when replaying log - check and warn if there are remaining delayed inodes when putting a root - fix assertion when building free space tree - handle csum tree error with mount option 'rescue=ibadroot' Other: - error message updates: add prefix to all scrub related messages, include other information in messages" * tag 'for-6.16-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: zoned: fix alloc_offset
calculation for partly conventional block groups btrfs: handle csum tree error with rescue=ibadroots correctly btrfs: fix race between async reclaim worker and close_ctree() btrfs: fix assertion when building free space tree btrfs: don't silently ignore unexpected extent type when replaying log btrfs: fix invalid inode pointer dereferences during log replay btrfs: fix double unlock of buffer_tree xarray when releasing subpage eb btrfs: update superblock's device bytes_used when dropping chunk btrfs: fix a race between renames and directory logging btrfs: scrub: add prefix for the error messages btrfs: warn if leaking delayed_nodes in btrfs_put_root() btrfs: fix delayed ref refcount leak in debug assertion btrfs: include root in error message when unlinking inode btrfs: don't drop a reference if btrfs_check_write_meta_pointer() fails	2025-06-23 11:16:38 -07:00
Johannes Thumshirn	c0d90a79e8	btrfs: zoned: fix alloc_offset calculation for partly conventional block groups When one of two zones composing a DUP block group is a conventional zone, we have the zone_info[i]->alloc_offset = WP_CONVENTIONAL. That will, of course, not match the write pointer of the other zone, and fails that block group. This commit solves that issue by properly recovering the emulated write pointer from the last allocated extent. The offsets for SINGLE, DUP, and RAID1 are straightforward: each is the same as the end of the last allocated extent. RAID0 and RAID10 are a bit trickier, as we need to do the striping math. This is the kernel equivalent of Naohiro's user-space commit: "btrfs-progs: zoned: fix alloc_offset calculation for partly conventional block groups". Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-06-19 15:21:15 +02:00
Qu Wenruo	547e836661	btrfs: handle csum tree error with rescue=ibadroots correctly [BUG] There is syzbot based reproducer that can crash the kernel, with the following call trace: (With some debug output added) DEBUG: rescue=ibadroots parsed BTRFS: device fsid 14d642db-7b15-43e4-81e6-4b8fac6a25f8 devid 1 transid 8 /dev/loop0 (7:0) scanned by repro (1010) BTRFS info (device loop0): first mount of filesystem 14d642db-7b15-43e4-81e6-4b8fac6a25f8 BTRFS info (device loop0): using blake2b (blake2b-256-generic) checksum algorithm BTRFS info (device loop0): using free-space-tree BTRFS warning (device loop0): checksum verify failed on logical 5312512 mirror 1 wanted 0xb043382657aede36608fd3386d6b001692ff406164733d94e2d9a180412c6003 found 0x810ceb2bacb7f0f9eb2bf3b2b15c02af867cb35ad450898169f3b1f0bd818651 level 0 DEBUG: read tree root path failed for tree csum, ret=-5 BTRFS warning (device loop0): checksum verify failed on logical 5328896 mirror 1 wanted 0x51be4e8b303da58e6340226815b70e3a93592dac3f30dd510c7517454de8567a found 0x51be4e8b303da58e634022a315b70e3a93592dac3f30dd510c7517454de8567a level 0 BTRFS warning (device loop0): checksum verify failed on logical 5292032 mirror 1 wanted 0x1924ccd683be9efc2fa98582ef58760e3848e9043db8649ee382681e220cdee4 found 0x0cb6184f6e8799d9f8cb335dccd1d1832da1071d12290dab3b85b587ecacca6e level 0 process 'repro' launched './file2' with NULL argv: empty string added DEBUG: no csum root, idatacsums=0 ibadroots=134217728 Oops: general protection fault, probably for non-canonical address 0xdffffc0000000041: 0000 [#1] SMP KASAN NOPTI KASAN: null-ptr-deref in range [0x0000000000000208-0x000000000000020f] CPU: 5 UID: 0 PID: 1010 Comm: repro Tainted: G OE 6.15.0-custom+ #249 PREEMPT(full) Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022 RIP: 0010:btrfs_lookup_csum+0x93/0x3d0 [btrfs] Call Trace: <TASK> btrfs_lookup_bio_sums+0x47a/0xdf0 [btrfs] btrfs_submit_bbio+0x43e/0x1a80 [btrfs] submit_one_bio+0xde/0x160 [btrfs] 
btrfs_readahead+0x498/0x6a0 [btrfs] read_pages+0x1c3/0xb20 page_cache_ra_order+0x4b5/0xc20 filemap_get_pages+0x2d3/0x19e0 filemap_read+0x314/0xde0 __kernel_read+0x35b/0x900 bprm_execve+0x62e/0x1140 do_execveat_common.isra.0+0x3fc/0x520 __x64_sys_execveat+0xdc/0x130 do_syscall_64+0x54/0x1d0 entry_SYSCALL_64_after_hwframe+0x76/0x7e ---[ end trace 0000000000000000 ]--- [CAUSE] Firstly the fs has a corrupted csum tree root, thus to mount the fs we have to use the "ro,rescue=ibadroots" mount option. Normally with that mount option, a bad csum tree root should set the BTRFS_FS_STATE_NO_DATA_CSUMS flag, so that any future data read will ignore csum search. But in this particular case, we have the following call trace that caused a NULL csum root, but without setting BTRFS_FS_STATE_NO_DATA_CSUMS: load_global_roots_objectid(): ret = btrfs_search_slot(); /* Succeeded */ btrfs_item_key_to_cpu() found = true; /* We found the root item for csum tree. */ root = read_tree_root_path(); if (IS_ERR(root)) { if (!btrfs_test_opt(fs_info, IGNOREBADROOTS)) /* Since we have rescue=ibadroots mount option, @ret is still 0. */ break; if (!found || ret) { /* @found is true, @ret is 0, error handling for csum tree is skipped. */ } This means we completely skipped setting BTRFS_FS_STATE_NO_DATA_CSUMS if the csum tree is corrupted, which results in an unexpected csum lookup later. [FIX] If read_tree_root_path() failed, always set @ret to the error number, since at the end of the function we need @ret to determine whether we need to do the extra error handling for the csum tree. Fixes: `abed4aaae4` ("btrfs: track the csum, extent, and free space trees in a rb tree") Reported-by: Zhiyu Zhang <zhiyuzhang999@gmail.com> Reported-by: Longxing Li <coregee2000@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-06-19 15:21:06 +02:00
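The [FIX] in the commit above boils down to always recording the read failure in the return variable so that the later check can react to it. The following is a simplified model under assumptions: `struct fs_state` and `load_csum_root()` are hypothetical, and the two booleans stand in for the IGNOREBADROOTS mount option and the BTRFS_FS_STATE_NO_DATA_CSUMS state bit.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct fs_state {
	bool ignore_bad_roots; /* stands in for rescue=ibadroots */
	bool no_data_csums;    /* stands in for the NO_DATA_CSUMS state bit */
};

/*
 * found:    whether the root item for the csum tree was found.
 * read_err: hypothetical result of reading that root (0 or a negative errno).
 */
static int load_csum_root(struct fs_state *fs, bool found, int read_err)
{
	int ret = 0;

	assert(read_err <= 0);
	if (found && read_err) {
		/* Always record the failure, even when bad roots are tolerated. */
		ret = read_err;
	}

	if (!found || ret) {
		/* No usable csum root: data reads must skip csum lookups. */
		fs->no_data_csums = true;
		if (fs->ignore_bad_roots)
			ret = 0;
		else if (!ret)
			ret = -ENOENT;
	}
	return ret;
}
```

In the buggy flow, `ret` stayed 0 when the option was set, so the second `if` was skipped and the flag was never set even though no csum root was available.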
Filipe Manana	a26bf338cd	btrfs: fix race between async reclaim worker and close_ctree() Syzbot reported an assertion failure due to an attempt to add a delayed iput after we have set BTRFS_FS_STATE_NO_DELAYED_IPUT in the fs_info state: WARNING: CPU: 0 PID: 65 at fs/btrfs/inode.c:3420 btrfs_add_delayed_iput+0x2f8/0x370 fs/btrfs/inode.c:3420 Modules linked in: CPU: 0 UID: 0 PID: 65 Comm: kworker/u8:4 Not tainted 6.15.0-next-20250530-syzkaller #0 PREEMPT(full) Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/07/2025 Workqueue: btrfs-endio-write btrfs_work_helper RIP: 0010:btrfs_add_delayed_iput+0x2f8/0x370 fs/btrfs/inode.c:3420 Code: 4e ad 5d (...) RSP: 0018:ffffc9000213f780 EFLAGS: 00010293 RAX: ffffffff83c635b7 RBX: ffff888058920000 RCX: ffff88801c769e00 RDX: 0000000000000000 RSI: 0000000000000100 RDI: 0000000000000000 RBP: 0000000000000001 R08: ffff888058921b67 R09: 1ffff1100b12436c R10: dffffc0000000000 R11: ffffed100b12436d R12: 0000000000000001 R13: dffffc0000000000 R14: ffff88807d748000 R15: 0000000000000100 FS: 0000000000000000(0000) GS:ffff888125c53000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00002000000bd038 CR3: 000000006a142000 CR4: 00000000003526f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> btrfs_put_ordered_extent+0x19f/0x470 fs/btrfs/ordered-data.c:635 btrfs_finish_one_ordered+0x11d8/0x1b10 fs/btrfs/inode.c:3312 btrfs_work_helper+0x399/0xc20 fs/btrfs/async-thread.c:312 process_one_work kernel/workqueue.c:3238 [inline] process_scheduled_works+0xae1/0x17b0 kernel/workqueue.c:3321 worker_thread+0x8a0/0xda0 kernel/workqueue.c:3402 kthread+0x70e/0x8a0 kernel/kthread.c:464 ret_from_fork+0x3fc/0x770 arch/x86/kernel/process.c:148 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245 </TASK> This can happen due to a race with the async reclaim worker like this: 1) The async 
metadata reclaim worker enters shrink_delalloc(), which calls btrfs_start_delalloc_roots() with an nr_pages argument that has a value less than LONG_MAX, and that in turn enters start_delalloc_inodes(), which sets the local variable 'full_flush' to false because wbc->nr_to_write is less than LONG_MAX; 2) There it finds inode X in a root's delalloc list, grabs a reference for inode X (with igrab()), and triggers writeback for it with filemap_fdatawrite_wbc(), which creates an ordered extent for inode X; 3) The unmount sequence starts from another task, we enter close_ctree() and we flush the workqueue fs_info->endio_write_workers, which waits for the ordered extent for inode X to complete and when dropping the last reference of the ordered extent, with btrfs_put_ordered_extent(), when we call btrfs_add_delayed_iput() we don't add the inode to the list of delayed iputs because it has a refcount of 2, so we decrement it to 1 and return; 4) Shortly after at close_ctree() we call btrfs_run_delayed_iputs() which runs all delayed iputs, and then we set BTRFS_FS_STATE_NO_DELAYED_IPUT in the fs_info state; 5) The async reclaim worker, after calling filemap_fdatawrite_wbc(), now calls btrfs_add_delayed_iput() for inode X and there we trigger an assertion failure since the fs_info state has the flag BTRFS_FS_STATE_NO_DELAYED_IPUT set. Fix this by setting BTRFS_FS_STATE_NO_DELAYED_IPUT only after we wait for the async reclaim workers to finish, after we call cancel_work_sync() for them at close_ctree(), and by running delayed iputs after wait for the reclaim workers to finish and before setting the bit. This race was recently introduced by commit `19e60b2a95` ("btrfs: add extra warning if delayed iput is added when it's not allowed"). Without the new validation at btrfs_add_delayed_iput(), this described scenario was safe because close_ctree() later calls btrfs_commit_super(). 
That will run any final delayed iputs added by reclaim workers in the window between the btrfs_run_delayed_iputs() and the reclaim workers being shut down. Reported-by: syzbot+0ed30ad435bf6f5b7a42@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/6840481c.a00a0220.d4325.000c.GAE@google.com/T/#u Fixes: `19e60b2a95` ("btrfs: add extra warning if delayed iput is added when it's not allowed") Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-06-19 15:20:57 +02:00
Filipe Manana	1961d20f6f	btrfs: fix assertion when building free space tree When building the free space tree with the block group tree feature enabled, we can hit an assertion failure like this: BTRFS info (device loop0 state M): rebuilding free space tree assertion failed: ret == 0, in fs/btrfs/free-space-tree.c:1102 ------------[ cut here ]------------ kernel BUG at fs/btrfs/free-space-tree.c:1102! Internal error: Oops - BUG: 00000000f2000800 [#1] SMP Modules linked in: CPU: 1 UID: 0 PID: 6592 Comm: syz-executor322 Not tainted 6.15.0-rc7-syzkaller-gd7fa1af5b33e #0 PREEMPT Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/07/2025 pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : populate_free_space_tree+0x514/0x518 fs/btrfs/free-space-tree.c:1102 lr : populate_free_space_tree+0x514/0x518 fs/btrfs/free-space-tree.c:1102 sp : ffff8000a4ce7600 x29: ffff8000a4ce76e0 x28: ffff0000c9bc6000 x27: ffff0000ddfff3d8 x26: ffff0000ddfff378 x25: dfff800000000000 x24: 0000000000000001 x23: ffff8000a4ce7660 x22: ffff70001499cecc x21: ffff0000e1d8c160 x20: ffff0000e1cb7800 x19: ffff0000e1d8c0b0 x18: 00000000ffffffff x17: ffff800092f39000 x16: ffff80008ad27e48 x15: ffff700011e740c0 x14: 1ffff00011e740c0 x13: 0000000000000004 x12: ffffffffffffffff x11: ffff700011e740c0 x10: 0000000000ff0100 x9 : 94ef24f55d2dbc00 x8 : 94ef24f55d2dbc00 x7 : 0000000000000001 x6 : 0000000000000001 x5 : ffff8000a4ce6f98 x4 : ffff80008f415ba0 x3 : ffff800080548ef0 x2 : 0000000000000000 x1 : 0000000100000000 x0 : 000000000000003e Call trace: populate_free_space_tree+0x514/0x518 fs/btrfs/free-space-tree.c:1102 (P) btrfs_rebuild_free_space_tree+0x14c/0x54c fs/btrfs/free-space-tree.c:1337 btrfs_start_pre_rw_mount+0xa78/0xe10 fs/btrfs/disk-io.c:3074 btrfs_remount_rw fs/btrfs/super.c:1319 [inline] btrfs_reconfigure+0x828/0x2418 fs/btrfs/super.c:1543 reconfigure_super+0x1d4/0x6f0 fs/super.c:1083 do_remount fs/namespace.c:3365 [inline] path_mount+0xb34/0xde0 
fs/namespace.c:4200 do_mount fs/namespace.c:4221 [inline] __do_sys_mount fs/namespace.c:4432 [inline] __se_sys_mount fs/namespace.c:4409 [inline] __arm64_sys_mount+0x3e8/0x468 fs/namespace.c:4409 __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline] invoke_syscall+0x98/0x2b8 arch/arm64/kernel/syscall.c:49 el0_svc_common+0x130/0x23c arch/arm64/kernel/syscall.c:132 do_el0_svc+0x48/0x58 arch/arm64/kernel/syscall.c:151 el0_svc+0x58/0x17c arch/arm64/kernel/entry-common.c:767 el0t_64_sync_handler+0x78/0x108 arch/arm64/kernel/entry-common.c:786 el0t_64_sync+0x198/0x19c arch/arm64/kernel/entry.S:600 Code: f0047182 91178042 528089c3 9771d47b (d4210000) ---[ end trace 0000000000000000 ]--- This happens because we are processing an empty block group, which has no extents allocated from it, there are no items for this block group, including the block group item since block group items are stored in a dedicated tree when using the block group tree feature. It also means this is the block group with the highest start offset, so there are no higher keys in the extent root, hence btrfs_search_slot_for_read() returns 1 (no higher key found). Fix this by asserting 'ret' is 0 only if the block group tree feature is not enabled, in which case we should find a block group item for the block group since it's stored in the extent root and block group item keys are greater than extent item keys (the value for BTRFS_BLOCK_GROUP_ITEM_KEY is 192 and for BTRFS_EXTENT_ITEM_KEY and BTRFS_METADATA_ITEM_KEY the values are 168 and 169 respectively). In case 'ret' is 1, we just need to add a record to the free space tree which spans the whole block group, and we can achieve this by making 'ret == 0' as the while loop's condition. 
Reported-by: syzbot+36fae25c35159a763a2a@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/6841dca8.a00a0220.d4325.0020.GAE@google.com/ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-06-19 15:20:54 +02:00
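The loop shape described in the commit above, where `ret == 0` is the loop condition and an empty block group yields a single free record spanning the whole group, can be sketched as follows. This is an illustration under assumptions: `struct range`, `struct cursor`, `next_extent()` and `populate_free()` are hypothetical and operate on an in-memory array rather than on a btree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct range {
	uint64_t start;
	uint64_t len;
};

/* Hypothetical cursor over the allocated extents of one block group, sorted by start. */
struct cursor {
	const struct range *extents;
	size_t count;
	size_t pos;
};

/* Returns 0 and fills *out if there is another extent, 1 if there is none. */
static int next_extent(struct cursor *c, struct range *out)
{
	if (c->pos >= c->count)
		return 1;
	*out = c->extents[c->pos++];
	return 0;
}

/* Write the free ranges of [bg_start, bg_start + bg_len) to free_out, return how many. */
static size_t populate_free(struct cursor *c, uint64_t bg_start, uint64_t bg_len,
			    struct range *free_out, size_t cap)
{
	uint64_t pos = bg_start;
	uint64_t end = bg_start + bg_len;
	struct range ext;
	size_t n = 0;
	int ret = next_extent(c, &ext);

	/* With no extents at all, ret is 1 and the loop body never runs. */
	while (ret == 0) {
		assert(ext.start >= pos && ext.start + ext.len <= end);
		if (ext.start > pos && n < cap) {
			free_out[n].start = pos;
			free_out[n].len = ext.start - pos;
			n++;
		}
		pos = ext.start + ext.len;
		ret = next_extent(c, &ext);
	}

	/* Remaining tail; for an empty block group this spans the whole group. */
	if (pos < end && n < cap) {
		free_out[n].start = pos;
		free_out[n].len = end - pos;
		n++;
	}
	return n;
}
```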
Filipe Manana	16edae52f6	btrfs: don't silently ignore unexpected extent type when replaying log If there's an unexpected (invalid) extent type, we just silently ignore it. This means a corruption or some bug somewhere, so instead return -EUCLEAN to the caller, making log replay fail, and print an error message with relevant information. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-06-19 15:20:47 +02:00
Filipe Manana	2dcf838cf5	btrfs: fix invalid inode pointer dereferences during log replay In a few places where we call read_one_inode(), if we get a NULL pointer we end up jumping into an error path, or fallthrough in case of __add_inode_ref(), where we then do something like this: iput(&inode->vfs_inode); which results in an invalid inode pointer that triggers an invalid memory access, resulting in a crash. Fix this by making sure we don't do such dereferences. Fixes: `b4c50cbb01` ("btrfs: return a btrfs_inode from read_one_inode()") CC: stable@vger.kernel.org # 6.15+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-06-19 15:20:42 +02:00
Filipe Manana	e5b5596011	btrfs: fix double unlock of buffer_tree xarray when releasing subpage eb If we break out of the loop because an extent buffer doesn't have the bit EXTENT_BUFFER_TREE_REF set, we end up unlocking the xarray twice, once before we tested for the bit and break out of the loop, and once again after the loop. Fix this by testing the bit and exiting before unlocking the xarray. The time spent testing the bit is negligible and it's not worth trying to do that outside the critical section delimited by the xarray lock due to the code complexity required to avoid it (like using a local boolean variable to track whether the xarray is locked or not). The xarray unlock only needs to be done before calling release_extent_buffer(), as that needs to lock the xarray (through xa_cmpxchg_irq()) and does a more significant amount of work. Fixes: `19d7f65f03` ("btrfs: convert the buffer_radix to an xarray") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://lore.kernel.org/linux-btrfs/aDRNDU0GM1_D4Xnw@stanley.mountain/ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-06-19 15:20:33 +02:00
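The locking rule from the commit above (test the bit and break out while the lock is still held, so that the unlock after the loop is the only one on that path) can be modelled with a counting fake lock. The names `struct fake_lock`, `struct buffer`, `lock_acquire()`, `lock_release()`, `release_buffer()` and `release_all()` are hypothetical; this is a single-threaded sketch, not the xarray code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Fake non-recursive lock that counts unlocks done while not held. */
struct fake_lock {
	int held;
	int bad_unlocks;
};

struct buffer {
	bool tree_ref; /* stands in for the EXTENT_BUFFER_TREE_REF bit */
	bool released;
};

static void lock_acquire(struct fake_lock *l)
{
	assert(l->held == 0);
	l->held = 1;
}

static void lock_release(struct fake_lock *l)
{
	if (l->held == 0)
		l->bad_unlocks++;
	else
		l->held = 0;
}

/* The release path takes the lock itself, so callers must not hold it. */
static void release_buffer(struct fake_lock *l, struct buffer *b)
{
	lock_acquire(l);
	b->released = true;
	lock_release(l);
}

static size_t release_all(struct fake_lock *l, struct buffer *bufs, size_t n)
{
	size_t released = 0;

	lock_acquire(l);
	for (size_t i = 0; i < n; i++) {
		/* Test while still locked, so 'break' leaves the lock held. */
		if (!bufs[i].tree_ref)
			break;
		lock_release(l);
		release_buffer(l, &bufs[i]);
		released++;
		lock_acquire(l);
	}
	lock_release(l); /* pairs with the lock held on every exit from the loop */
	return released;
}
```

Moving the `tree_ref` test after the first `lock_release()` would reproduce the double unlock: the `break` path would reach the final `lock_release()` with the lock already dropped.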
Mark Harmstone	ae4477f937	btrfs: update superblock's device bytes_used when dropping chunk Each superblock contains a copy of the device item for that device. In a transaction which drops a chunk but doesn't create any new ones, we were correctly updating the device item in the chunk tree but not copying over the new bytes_used value to the superblock. This can be seen by doing the following: # dd if=/dev/zero of=test bs=4096 count=2621440 # mkfs.btrfs test # mount test /root/temp # cd /root/temp # for i in {00..10}; do dd if=/dev/zero of=$i bs=4096 count=32768; done # sync # rm * # sync # btrfs balance start -dusage=0 . # sync # cd # umount /root/temp # btrfs check test For btrfs-check to detect this, you will also need my patch at https://github.com/kdave/btrfs-progs/pull/991. Change btrfs_remove_dev_extents() so that it adds the devices to the fs_info->post_commit_list if they're not there already. This causes btrfs_commit_device_sizes() to be called, which updates the bytes_used value in the superblock. Fixes: `bbbf7243d6` ("btrfs: combine device update operations during transaction commit") CC: stable@vger.kernel.org # 5.10+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Mark Harmstone <maharmstone@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-06-19 15:20:22 +02:00
Filipe Manana	3ca864de85	btrfs: fix a race between renames and directory logging We have a race between a rename and directory inode logging: if it happens and we crash or lose power before the rename completes, then the next time the filesystem is mounted, the log replay code will end up deleting the file that was being renamed. This is best explained by a step by step analysis of an interleaving of steps that leads to this situation. Consider the initial conditions: 1) We are at transaction N; 2) We have directories A and B created in a past transaction (< N); 3) We have inode X corresponding to a file that has 2 hard links, one in directory A and the other in directory B, so we'll name them as "A/foo_link1" and "B/foo_link2". Both hard links were persisted in a past transaction (< N); 4) We have inode Y corresponding to a file that has a single hard link and is located in directory A, we'll name it as "A/bar". This file was also persisted in a past transaction (< N). The steps leading to a file loss are the following and for all of them we are under transaction N: 1) Link "A/foo_link1" is removed, so inode X's last_unlink_trans field is updated to N, through btrfs_unlink() -> btrfs_record_unlink_dir(); 2) Task A starts a rename for inode Y, with the goal of renaming from "A/bar" to "A/baz", so we enter btrfs_rename(); 3) Task A inserts the new BTRFS_INODE_REF_KEY for inode Y by calling btrfs_insert_inode_ref(); 4) Because the rename happens in the same directory, we don't set the last_unlink_trans field of directory A's inode to the current transaction id, that is, we don't call btrfs_record_unlink_dir(); 5) Task A then removes the entries from directory A (BTRFS_DIR_ITEM_KEY and BTRFS_DIR_INDEX_KEY items) when calling __btrfs_unlink_inode() (actually the dir index item is added as a delayed item, but the effect is the same); 6) Now before task A adds the new entry "A/baz" to directory A by calling btrfs_add_link(), another task, task B is logging inode X; 7) Task B starts a fsync of inode X and after logging inode X, at btrfs_log_inode_parent() it calls btrfs_log_all_parents(), since inode X has a last_unlink_trans value of N, set in step 1; 8) At btrfs_log_all_parents() we search for all parent directories of inode X using the commit root, so we find directories A and B and log them. But when logging directory A, we don't have a dir index item for inode Y anymore, neither for the old name "A/bar" nor for the new name "A/baz" since the rename has deleted the old name but has not yet inserted the new name - task A hasn't yet called btrfs_add_link() to do that. Note that logging directory A doesn't fall back to a transaction commit because its last_unlink_trans has a lower value than the current transaction's id (see step 4); 9) Task B finishes logging directories A and B and gets back to btrfs_sync_file() where it calls btrfs_sync_log() to persist the log tree; 10) Task B successfully persisted the log tree, btrfs_sync_log() completed with success, and a power failure happened. We have a log tree without any directory entry for inode Y, so the log replay code deletes the entry for inode Y, name "A/bar", from the subvolume tree since it doesn't exist in the log tree and the log tree is authoritative for its index (we logged a BTRFS_DIR_LOG_INDEX_KEY item that covers the index range for the dentry that corresponds to "A/bar"). Since there's no other hard link for inode Y and the log replay code deletes the name "A/bar", the file is lost. The issue wouldn't happen if task B synced the log only after task A called btrfs_log_new_name(), which would update the log with the new name for inode Y ("A/baz"). Fix this by pinning the log root during renames before removing the old directory entry, and unpinning after btrfs_log_new_name() is called. Fixes: `259c4b96d7` ("btrfs: stop doing unnecessary log updates during a rename") CC: stable@vger.kernel.org # 5.18+ Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-06-19 15:19:58 +02:00
Anand Jain	65d5112b4d	btrfs: scrub: add prefix for the error messages Add a "scrub: " prefix to all messages logged by scrub so that it's easy to filter them from dmesg for analysis. Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-06-19 15:19:06 +02:00
Leo Martins	186b9dc3c3	btrfs: warn if leaking delayed_nodes in btrfs_put_root() Add a warning for leaked delayed_nodes when putting a root. We currently do this for inodes, but not delayed_nodes. Signed-off-by: Leo Martins <loemra.dev@gmail.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> [ Remove the changelog from the commit message. ] Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-06-19 15:18:39 +02:00
Leo Martins	dd276214e4	btrfs: fix delayed ref refcount leak in debug assertion If the delayed_root is not empty we are increasing the number of references to a delayed_node without decreasing it, causing a leak. Fix by decrementing the delayed_node reference count. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Leo Martins <loemra.dev@gmail.com> Reviewed-by: Qu Wenruo <wqu@suse.com> [ Remove the changelog from the commit message. ] Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-06-19 15:18:35 +02:00
Filipe Manana	c769be2d3d	btrfs: include root in error message when unlinking inode To help debugging include the root number in the error message, and since this is a critical error that implies a metadata inconsistency and results in a transaction abort change the log message level from "info" to "critical", which is a much better fit. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-06-19 15:18:30 +02:00
Lorenzo Stoakes	2e3b37a7e4	fs: replace mmap hook with .mmap_prepare for simple mappings Since commit `c84bf6dd2b` ("mm: introduce new .mmap_prepare() file callback"), the f_op->mmap() hook has been deprecated in favour of f_op->mmap_prepare(). This callback is invoked in the mmap() logic far earlier, so error handling can be performed more safely without complicated and bug-prone state unwinding required should an error arise. This hook also avoids passing a pointer to a not-yet-correctly-established VMA avoiding any issues with referencing this data structure. It rather provides a pointer to the new struct vm_area_desc descriptor type which contains all required state and allows easy setting of required parameters without any consideration needing to be paid to locking or reference counts. Note that nested filesystems like overlayfs are compatible with an .mmap_prepare() callback since commit `bb666b7c27` ("mm: add mmap_prepare() compatibility layer for nested file systems"). In this patch we apply this change to file systems with relatively simple mmap() hook logic - exfat, ceph, f2fs, bcachefs, zonefs, btrfs, ocfs2, orangefs, nilfs2, romfs, ramfs and aio. Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Link: https://lore.kernel.org/f528ac4f35b9378931bd800920fee53fc0c5c74d.1750099179.git.lorenzo.stoakes@oracle.com Acked-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-06-19 13:56:59 +02:00

14440 Commits