linux-loongson

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson synced 2025-08-28 09:22:08 +00:00

Author	SHA1	Message	Date
Yangtao Li	4044a7ed3b	btrfs: use rb_find() in lookup_block_entry() Use the rb-tree helper so we don't open code the search code. Signed-off-by: Yangtao Li <frank.li@vivo.com> Signed-off-by: Pan Chuang <panchuang@vivo.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:25 +02:00
Yangtao Li	b017a92bd9	btrfs: use rb_find_add() in ulist_rbtree_insert() Use the rb-tree helper so we don't open code the search and insert code. Signed-off-by: Yangtao Li <frank.li@vivo.com> Signed-off-by: Pan Chuang <panchuang@vivo.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:25 +02:00
Yangtao Li	c4f38e7ca5	btrfs: use rb_find() in ulist_rbtree_search() Use the rb-tree helper so we don't open code the search code. Signed-off-by: Yangtao Li <frank.li@vivo.com> Signed-off-by: Pan Chuang <panchuang@vivo.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:25 +02:00
Yangtao Li	9734685854	btrfs: use rb_find() in __btrfs_lookup_delayed_item() Use the rb-tree helper so we don't open code the search code. Signed-off-by: Yangtao Li <frank.li@vivo.com> Signed-off-by: Pan Chuang <panchuang@vivo.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:25 +02:00
Yangtao Li	7a91e01875	btrfs: use rb_find_add() in btrfs_insert_inode_defrag() Use the rb-tree helper so we don't open code the search and insert code. Signed-off-by: Yangtao Li <frank.li@vivo.com> Signed-off-by: Pan Chuang <panchuang@vivo.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:25 +02:00
Dan Johnson	06c3437f74	btrfs: fix comment in reserved space warning mkfs.btrfs up to v4.14 actually can leave a chunk inside the reserved space when invoked with `-m single`, fixed by 997f9977c24397eb6980bb9 ("mkfs: Prevent temporary system chunk to use space in reserved 1M range") released with v4.15. Signed-off-by: Dan Johnson <ComputerDruid@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:25 +02:00
Daniel Vacek	d8f6cb2b28	btrfs: relocation: simplify unused logic related to LINK_LOWER btrfs_backref_link_edge() is always called with the LINK_LOWER argument. We can simplify it and remove the LINK_LOWER and LINK_UPPER macros completely. The last call with LINK_UPPER was removed with commit `0097422c0d` ("btrfs: remove clone_backref_node() from relocation"). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Daniel Vacek <neelx@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:25 +02:00
Filipe Manana	593062f67b	btrfs: unfold transaction abort at btrfs_insert_one_raid_extent() We have a common error path where we abort the transaction, but like this in case we get a transaction abort stack trace we don't know exactly which previous function call failed. Instead abort the transaction after any function call that returns an error, so that we can easily identify which function failed. Reviewed-by: Daniel Vacek <neelx@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:24 +02:00
Filipe Manana	35bb03e57a	btrfs: unfold transaction abort at __btrfs_update_delayed_inode() We have a common error path where we abort the transaction, but like this in case we get a transaction abort stack trace we don't know exactly which previous function call failed. Instead abort the transaction after any function call that returns an error, so that we can easily identify which function failed. Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:24 +02:00
Filipe Manana	33e8f24b52	btrfs: abort transaction on unexpected eb generation at btrfs_copy_root() If we find an unexpected generation for the extent buffer we are cloning at btrfs_copy_root(), we just WARN_ON() and don't error out and abort the transaction, meaning we allow to persist metadata with an unexpected generation. Instead of warning only, abort the transaction and return -EUCLEAN. CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Daniel Vacek <neelx@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:53:16 +02:00
Filipe Manana	273bbb5b48	btrfs: unfold transaction abort at btrfs_copy_root() Instead of having a common btrfs_abort_transaction() call for when any of the two btrfs_inc_ref() calls fail, move the btrfs_abort_transaction() to happen immediately after each one of the calls, so that when analyzing a stack trace with a transaction abort we know which call failed. Reviewed-by: Daniel Vacek <neelx@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:50:35 +02:00
David Sterba	b63c8c1ede	btrfs: move transaction aborts to the error site in add_block_group_free_space() Transaction aborts should be done next to the place the error happens, which was not done in add_block_group_free_space(). Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:50:34 +02:00
David Sterba	0b10f3dd13	btrfs: move transaction aborts to the error site in remove_block_group_free_space() Transaction aborts should be done next to the place the error happens, which was not done in remove_block_group_free_space(). Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:50:34 +02:00
Filipe Manana	81bfd9d547	btrfs: simplify error detection flow during log replay We have this fuzzy logic at btrfs_recover_log_trees() where we don't abort the transaction and exit immediately after each function call that returned an error, and instead have if-then-else logic or check if the previous function call returned success before calling the next function. Make the flow more straightforward by immediately aborting the transaction and exiting after each function call failure. This also allows to avoid two consecutive if statements that test the same conditions: if (!ret && wc.stage == LOG_WALK_REPLAY_ALL) { (...) } Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:50:34 +02:00
Filipe Manana	6466084df6	btrfs: remove redundant path release when replaying a log tree There's no need to call btrfs_release_path() before calling btrfs_init_root_free_objectid() as we have released the path already at the top of the loop and the previous call to fixup_inode_link_counts() also releases the path. So remove it to simplify the code. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:50:34 +02:00
Filipe Manana	2a5898c4aa	btrfs: abort transaction during log replay if walk_log_tree() failed If we failed walking a log tree during replay, we have a missing transaction abort to prevent committing a transaction where we didn't fully replay all the changes from a log tree and therefore can leave the respective subvolume tree in some inconsistent state. So add the missing transaction abort. CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:50:28 +02:00
Filipe Manana	8f1e1b263d	btrfs: unfold transaction aborts when replaying log trees We have a single line doing a transaction abort in case either we got an error from btrfs_get_fs_root() different from -ENOENT or we got an error from btrfs_pin_extent_for_log_replay(), making it hard to figure out which function call failed when looking at a transaction abort massages and stack trace in dmesg. Change this to have an explicit transaction abort for each one of the two cases. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:44:11 +02:00
Johannes Thumshirn	2a946bf6d6	btrfs: make btrfs_should_periodic_reclaim() static btrfs_should_periodic_reclaim() is not used outside of space-info.c so make it static and remove the prototype from space-info.h. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:44:11 +02:00
Johannes Thumshirn	55f7c65b2f	btrfs: zoned: use filesystem size not disk size for reclaim decision When deciding if a zoned filesystem is reaching the threshold to reclaim data block groups, look at the size of the filesystem not to potentially total available size of all drives in the filesystem. Especially if a filesystem was created with mkfs' -b option, constraining it to only a portion of the block device, the numbers won't match and potentially garbage collection is kicking in too late. Fixes: `3687fcb075` ("btrfs: zoned: make auto-reclaim less aggressive") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Tested-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:42:24 +02:00
Filipe Manana	f2de2b9ffd	btrfs: unfold transaction abort at clone_copy_inline_extent() We have a common error path where we abort the transaction, but like this in case we get a transaction abort stack trace we don't know exactly which previous function call failed. Instead abort the transaction after any function call that returns an error, so that we can easily identify which function failed. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 15:31:06 +02:00
Filipe Manana	5ff6050fcd	btrfs: remove pointless 'out' label from clone_finish_inode_update() The label is only used once and we can instead return directly where it's used, besides the fact that all we do under the label is to return the value of 'ret'. So get rid of the label and return directly. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 15:31:06 +02:00
Filipe Manana	5cf0e668ea	btrfs: unfold transaction abort at walk_up_proc() Instead of having a common btrfs_abort_transaction() call for when any of the two btrfs_dec_ref() calls fail, move the btrfs_abort_transaction() to happen immediately after each one of the calls, so that when analysing a stack trace with a transaction abort we know which call failed. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 15:31:05 +02:00
Filipe Manana	227aa55fa2	btrfs: unfold transaction abort at __btrfs_inc_extent_ref() Instead of having a common btrfs_abort_transaction() call for when either insert_tree_block_ref() failed or when insert_extent_data_ref() failed, move the btrfs_abort_transaction() to happen immediately after each one of those calls, so that when analysing a stack trace with a transaction abort we know which call failed. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 15:31:05 +02:00
Filipe Manana	3f757b56f1	btrfs: unfold transaction aborts at btrfs_create_new_inode() Instead of having a common btrfs_abort_transaction() call for when either btrfs_orphan_add() failed or when btrfs_add_link() failed, move the btrfs_abort_transaction() to happen immediately after each one of those calls, so that when analysing a stack trace with a transaction abort we know which call failed. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 15:31:05 +02:00
Caleb Sander Mateos	9aad72b4e3	btrfs/ioctl: store btrfs_uring_encoded_data in io_btrfs_cmd btrfs is the only user of struct io_uring_cmd_data and its op_data field. Switch its ->uring_cmd() implementations to store the struct btrfs_uring_encoded_data * in the struct io_btrfs_cmd, overlayed with io_uring_cmd's pdu field. This avoids having to touch another cache line to access the struct btrfs_uring_encoded_data *, and allows op_data and struct io_uring_cmd_data to be removed. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Acked-by: David Sterba <dsterba@suse.com> Link: https://lore.kernel.org/r/20250708202212.2851548-4-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-07-18 12:34:56 -06:00
Christian Brauner	ca115d7e75	tree-wide: s/struct fileattr/struct file_kattr/g Now that we expose struct file_attr as our uapi struct rename all the internal struct to struct file_kattr to clearly communicate that it is a kernel internal struct. This is similar to struct mount_{k}attr and others. Link: https://lore.kernel.org/20250703-restlaufzeit-baurecht-9ed44552b481@brauner Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-07-04 16:14:39 +02:00
Linus Torvalds	4c06e63b92	for-6.16-rc4-tag -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmhmwFQACgkQxWXV+ddt WDsVMA/+NuSth71V0AfiDnFyqjgDMqIlZL2+dqBiTYHXQQHKbqiUlKvYkWICCT6T 1YgDV+95XJYy4TDBoA49Ndd/l+CiDcMLbOYeneIfbJy13ts84jVANPkl4n03gPkF ktibCw15h0MENVctTCPc71dX2X0cV9WPf4iDmoxUZiukDA376akGTArZKwH4tVVg 4qVpzUtDdNOf848D+8DZKGd+ot/RWgEdLkFCZES27BMg/OFemxBK1MU6K8VjxiKF VoaSVJRDXuug8oVBAGNl86XpiSgd4gHyoNNA5b4mhdSWMSBMxUAaILsONT9pNQZA CFyHA1Jp2gLOIzQIzeXwWgXaAOQDtco8YWYaXhf0v0mySs89tweXjOibfj2mU9pS wPaJyeD+nyRDMwPa4VWEws64D3vXX6aKwiThUENuDmxBvrRXjrkGYH9tf0LNzDDe OKv/vOCfeyutxbjKhP+qElMhdh73BZnJ4UCxxYRRDq2v1Mg+k06swl+6uL6xenme a2KLJlwEoG6LAlkpZzV66ZEaIHDyGBZNdVYtuA/G3dDtmlt0aLXDdp1eq7NivS1j aV7cd0JMX89lAUtqKT932ZOw8RoDrUPPjsnXzCaZJ69mMVyEkxyCV+iYHTTJPDga W5Vg8Tq3d1gwxMebZHvyI6wwUhmGA0wUFG2eohYY/tcSrrUlrHQ= =Ke0p -----END PGP SIGNATURE----- Merge tag 'for-6.16-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - tree-log fixes: - fixes of log tracking of directories and subvolumes - fix iteration and error handling of inode references during log replay - fix free space tree rebuild (reported by syzbot) * tag 'for-6.16-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: use btrfs_record_snapshot_destroy() during rmdir btrfs: propagate last_unlink_trans earlier when doing a rmdir btrfs: record new subvolume in parent dir earlier to avoid dir logging races btrfs: fix inode lookup error handling during log replay btrfs: fix iteration of extrefs during log replay btrfs: fix missing error handling when searching for inode refs during log replay btrfs: fix failure to rebuild free space tree using multiple transactions	2025-07-03 13:29:56 -07:00
Eric Biggers	2c7528d36e	btrfs: stop parsing crc32c driver name To determine whether the crc32c implementation is "fast", use crc32_optimizations() instead of parsing the crypto_shash driver name. This keeps the code working as intended after the driver name is changed by the next commit. Acked-by: David Sterba <dsterba@suse.com> Link: https://lore.kernel.org/r/20250613183753.31864-2-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@kernel.org>	2025-06-30 09:31:56 -07:00
Filipe Manana	157501b046	btrfs: use btrfs_record_snapshot_destroy() during rmdir We are setting the parent directory's last_unlink_trans directly which may result in a concurrent task starting to log the directory not see the update and therefore can log the directory after we removed a child directory which had a snapshot within instead of falling back to a transaction commit. Replaying such a log tree would result in a mount failure since we can't currently delete snapshots (and subvolumes) during log replay. This is the type of failure described in commit `1ec9a1ae1e` ("Btrfs: fix unreplayable log after snapshot delete + parent dir fsync"). Fix this by using btrfs_record_snapshot_destroy() which updates the last_unlink_trans field while holding the inode's log_mutex lock. Fixes: `44f714dae5` ("Btrfs: improve performance on fsync against new inode after rename/unlink") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-06-27 19:58:12 +02:00
Filipe Manana	c466e33e72	btrfs: propagate last_unlink_trans earlier when doing a rmdir In case the removed directory had a snapshot that was deleted, we are propagating its inode's last_unlink_trans to the parent directory after we removed the entry from the parent directory. This leaves a small race window where someone can log the parent directory after we removed the entry and before we updated last_unlink_trans, and as a result if we ever try to replay such a log tree, we will fail since we will attempt to remove a snapshot during log replay, which is currently not possible and results in the log replay (and mount) to fail. This is the type of failure described in commit `1ec9a1ae1e` ("Btrfs: fix unreplayable log after snapshot delete + parent dir fsync"). So fix this by propagating the last_unlink_trans to the parent directory before we remove the entry from it. Fixes: `44f714dae5` ("Btrfs: improve performance on fsync against new inode after rename/unlink") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-06-27 19:57:47 +02:00
Filipe Manana	bf5bcf9a6f	btrfs: record new subvolume in parent dir earlier to avoid dir logging races Instead of recording that a new subvolume was created in a directory after we add the entry do the directory, record it before adding the entry. This is to avoid races where after creating the entry and before recording the new subvolume in the directory (the call to btrfs_record_new_subvolume()), another task logs the directory, so we end up with a log tree where we logged a directory that has an entry pointing to a root that was not yet committed, resulting in an invalid entry if the log is persisted and replayed later due to a power failure or crash. Also state this requirement in the function comment for btrfs_record_new_subvolume(), similar to what we do for the btrfs_record_unlink_dir() and btrfs_record_snapshot_destroy(). Fixes: `45c4102f0d` ("btrfs: avoid transaction commit on any fsync after subvolume creation") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-06-27 19:57:24 +02:00
Filipe Manana	5f61b96159	btrfs: fix inode lookup error handling during log replay When replaying log trees we use read_one_inode() to get an inode, which is just a wrapper around btrfs_iget_logging(), which in turn is a wrapper for btrfs_iget(). But read_one_inode() always returns NULL for any error that btrfs_iget_logging() / btrfs_iget() may return and this is a problem because: 1) In many callers of read_one_inode() we convert the NULL into -EIO, which is not accurate since btrfs_iget() may return -ENOMEM and -ENOENT for example, besides -EIO and other errors. So during log replay we may end up reporting a false -EIO, which is confusing since we may not have had any IO error at all; 2) When replaying directory deletes, at replay_dir_deletes(), we assume the NULL returned from read_one_inode() means that the inode doesn't exist and then proceed as if no error had happened. This is wrong because unless btrfs_iget() returned ERR_PTR(-ENOENT), we had an actual error and the target inode may exist in the target subvolume root - this may later result in the log replay code failing at a later stage (if we are "lucky") or succeed but leaving some inconsistency in the filesystem. So fix this by not ignoring errors from btrfs_iget_logging() and as a consequence remove the read_one_inode() wrapper and just use btrfs_iget_logging() directly. Also since btrfs_iget_logging() is supposed to be called only against subvolume roots, just like read_one_inode() which had a comment about it, add an assertion to btrfs_iget_logging() to check that the target root corresponds to a subvolume root. Fixes: `5d4f98a28c` ("Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-06-27 19:57:06 +02:00
Filipe Manana	54a7081ed1	btrfs: fix iteration of extrefs during log replay At __inode_add_ref() when processing extrefs, if we jump into the next label we have an undefined value of victim_name.len, since we haven't initialized it before we did the goto. This results in an invalid memory access in the next iteration of the loop since victim_name.len was not initialized to the length of the name of the current extref. Fix this by initializing victim_name.len with the current extref's name length. Fixes: `e43eec81c5` ("btrfs: use struct qstr instead of name and namelen pairs") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-06-27 19:56:55 +02:00
Filipe Manana	6561a40cec	btrfs: fix missing error handling when searching for inode refs during log replay During log replay, at __add_inode_ref(), when we are searching for inode ref keys we totally ignore if btrfs_search_slot() returns an error. This may make a log replay succeed when there was an actual error and leave some metadata inconsistency in a subvolume tree. Fix this by checking if an error was returned from btrfs_search_slot() and if so, return it to the caller. Fixes: `e02119d5a7` ("Btrfs: Add a write ahead tree log to optimize synchronous operations") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-06-27 19:56:35 +02:00
Filipe Manana	1e6ed33cab	btrfs: fix failure to rebuild free space tree using multiple transactions If we are rebuilding a free space tree, while modifying the free space tree we may need to allocate a new metadata block group. If we end up using multiple transactions for the rebuild, when we call btrfs_end_transaction() we enter btrfs_create_pending_block_groups() which calls add_block_group_free_space() to add items to the free space tree for the block group. Then later during the free space tree rebuild, at btrfs_rebuild_free_space_tree(), we may find such new block groups and call populate_free_space_tree() for them, which fails with -EEXIST because there are already items in the free space tree. Then we abort the transaction with -EEXIST at btrfs_rebuild_free_space_tree(). Notice that we say "may find" the new block groups because a new block group may be inserted in the block groups rbtree, which is being iterated by the rebuild process, before or after the current node where the rebuild process is currently at. Syzbot recently reported such case which produces a trace like the following: ------------[ cut here ]------------ BTRFS: Transaction aborted (error -17) WARNING: CPU: 1 PID: 7626 at fs/btrfs/free-space-tree.c:1341 btrfs_rebuild_free_space_tree+0x470/0x54c fs/btrfs/free-space-tree.c:1341 Modules linked in: CPU: 1 UID: 0 PID: 7626 Comm: syz.2.25 Not tainted 6.15.0-rc7-syzkaller-00085-gd7fa1af5b33e-dirty #0 PREEMPT Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/07/2025 pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : btrfs_rebuild_free_space_tree+0x470/0x54c fs/btrfs/free-space-tree.c:1341 lr : btrfs_rebuild_free_space_tree+0x470/0x54c fs/btrfs/free-space-tree.c:1341 sp : ffff80009c4f7740 x29: ffff80009c4f77b0 x28: ffff0000d4c3f400 x27: 0000000000000000 x26: dfff800000000000 x25: ffff70001389eee8 x24: 0000000000000003 x23: 1fffe000182b6e7b x22: 0000000000000000 x21: ffff0000c15b73d8 x20: 00000000ffffffef x19: ffff0000c15b7378 x18: 1fffe0003386f276 x17: ffff80008f31e000 x16: ffff80008adbe98c x15: 0000000000000001 x14: 1fffe0001b281550 x13: 0000000000000000 x12: 0000000000000000 x11: ffff60001b281551 x10: 0000000000000003 x9 : 1c8922000a902c00 x8 : 1c8922000a902c00 x7 : ffff800080485878 x6 : 0000000000000000 x5 : 0000000000000001 x4 : 0000000000000001 x3 : ffff80008047843c x2 : 0000000000000001 x1 : ffff80008b3ebc40 x0 : 0000000000000001 Call trace: btrfs_rebuild_free_space_tree+0x470/0x54c fs/btrfs/free-space-tree.c:1341 (P) btrfs_start_pre_rw_mount+0xa78/0xe10 fs/btrfs/disk-io.c:3074 btrfs_remount_rw fs/btrfs/super.c:1319 [inline] btrfs_reconfigure+0x828/0x2418 fs/btrfs/super.c:1543 reconfigure_super+0x1d4/0x6f0 fs/super.c:1083 do_remount fs/namespace.c:3365 [inline] path_mount+0xb34/0xde0 fs/namespace.c:4200 do_mount fs/namespace.c:4221 [inline] __do_sys_mount fs/namespace.c:4432 [inline] __se_sys_mount fs/namespace.c:4409 [inline] __arm64_sys_mount+0x3e8/0x468 fs/namespace.c:4409 __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline] invoke_syscall+0x98/0x2b8 arch/arm64/kernel/syscall.c:49 el0_svc_common+0x130/0x23c arch/arm64/kernel/syscall.c:132 do_el0_svc+0x48/0x58 arch/arm64/kernel/syscall.c:151 el0_svc+0x58/0x17c arch/arm64/kernel/entry-common.c:767 el0t_64_sync_handler+0x78/0x108 arch/arm64/kernel/entry-common.c:786 el0t_64_sync+0x198/0x19c arch/arm64/kernel/entry.S:600 irq event stamp: 330 hardirqs last enabled at (329): [<ffff80008048590c>] raw_spin_rq_unlock_irq kernel/sched/sched.h:1525 [inline] hardirqs last enabled at (329): [<ffff80008048590c>] finish_lock_switch+0xb0/0x1c0 kernel/sched/core.c:5130 hardirqs last disabled at (330): [<ffff80008adb9e60>] el1_dbg+0x24/0x80 arch/arm64/kernel/entry-common.c:511 softirqs last enabled at (10): [<ffff8000801fbf10>] local_bh_enable+0x10/0x34 include/linux/bottom_half.h:32 softirqs last disabled at (8): [<ffff8000801fbedc>] local_bh_disable+0x10/0x34 include/linux/bottom_half.h:19 ---[ end trace 0000000000000000 ]--- Fix this by flagging new block groups which had their free space tree entries already added and then skip them in the rebuild process. Also, since the rebuild may be triggered when doing a remount, make sure that when we clear an existing free space tree that we clear such flag from every existing block group, otherwise we would skip those block groups during the rebuild. Reported-by: syzbot+d0014fb0fc39c5487ae5@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/68460a54.050a0220.daf97.0af5.GAE@google.com/ Fixes: `882af9f13e` ("btrfs: handle free space tree rebuild in multiple transactions") Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-06-27 19:56:15 +02:00
Linus Torvalds	5ca7fe213b	for-6.16-rc3-tag -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmhZQ9MACgkQxWXV+ddt WDsyvw/+K5N4zbig9D5QL5SdsQwMe/ZUk1KF0LLu6H3hFetdICeM/Z4K46EBh40X c9Sxb13gLnIAm8DR/IFTTlOZVrrbJ3CTazZuJbncCpaZchH863aYb/1KboxjJnpW KqOen20KdUh8HdevrJFhkFc7rOjp7KupfIHsbWqIxaWYPf8ORvUyK55lKxQz0HES E5tFXLNr6z/8Ws5pc2HnRLgnRcCHuRUNJUb1PEaTfPKxoFvTwjda6cDsYnXOJEO9 NOnh6lluurqja+3FUEFig2f292/CbKGtByYUDgfhHO21P//IHSDhlouvwipzI/kh 6WUoH1K+DWCxxNbIVFFbUYLxrDGu7R7/aWFHH2q0dNjqQeiQBbUnbn4WIjAAwDWf k9cmE+WgVqwQI+vpfG3eENUafG5MpcQQo2wKrxG0whWaC2fiA6QtI+3DfKyMj4XJ JI1jUhfCwHrqzoGQ4XBE3UYENqQw9RICNC+Z3UfZx+5sQMWcb+ac5qIGygvCfU8N Gtfx4ladZshpQUSuRneiLozxdxLyXX3LzCt2Ls1s5fPPikZft/+2QRu5rzSbb/Cp 50TDSn/pE1N/TEMVZaP5M2PxquBVDOZ4TFSsSm3IvceqFInm0UerAGaJ7+T2eZhM 3XHhIp6xTecHfwukvGqs+XSxB9PMLfF5M0gc+9PR+3oxzFRpowI= =XLWR -----END PGP SIGNATURE----- Merge tag 'for-6.16-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "Fixes: - fix invalid inode pointer dereferences during log replay - fix a race between renames and directory logging - fix shutting down delayed iput worker - fix device byte accounting when dropping chunk - in zoned mode, fix offset calculations for DUP profile when conventional and sequential zones are used together Regression fixes: - fix possible double unlock of extent buffer tree (xarray conversion) - in zoned mode, fix extent buffer refcount when writing out extents (xarray conversion) Error handling fixes and updates: - handle unexpected extent type when replaying log - check and warn if there are remaining delayed inodes when putting a root - fix assertion when building free space tree - handle csum tree error with mount option 'rescue=ibadroot' Other: - error message updates: add prefix to all scrub related messages, include other information in messages" * tag 'for-6.16-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: zoned: fix alloc_offset calculation for partly conventional block groups btrfs: handle csum tree error with rescue=ibadroots correctly btrfs: fix race between async reclaim worker and close_ctree() btrfs: fix assertion when building free space tree btrfs: don't silently ignore unexpected extent type when replaying log btrfs: fix invalid inode pointer dereferences during log replay btrfs: fix double unlock of buffer_tree xarray when releasing subpage eb btrfs: update superblock's device bytes_used when dropping chunk btrfs: fix a race between renames and directory logging btrfs: scrub: add prefix for the error messages btrfs: warn if leaking delayed_nodes in btrfs_put_root() btrfs: fix delayed ref refcount leak in debug assertion btrfs: include root in error message when unlinking inode btrfs: don't drop a reference if btrfs_check_write_meta_pointer() fails	2025-06-23 11:16:38 -07:00
Johannes Thumshirn	c0d90a79e8	btrfs: zoned: fix alloc_offset calculation for partly conventional block groups When one of two zones composing a DUP block group is a conventional zone, we have the zone_info[i]->alloc_offset = WP_CONVENTIONAL. That will, of course, not match the write pointer of the other zone, and fails that block group. This commit solves that issue by properly recovering the emulated write pointer from the last allocated extent. The offset for the SINGLE, DUP, and RAID1 are straight-forward: it is same as the end of last allocated extent. The RAID0 and RAID10 are a bit tricky that we need to do the math of striping. This is the kernel equivalent of Naohiro's user-space commit: "btrfs-progs: zoned: fix alloc_offset calculation for partly conventional block groups". Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-06-19 15:21:15 +02:00
Qu Wenruo	547e836661	btrfs: handle csum tree error with rescue=ibadroots correctly [BUG] There is syzbot based reproducer that can crash the kernel, with the following call trace: (With some debug output added) DEBUG: rescue=ibadroots parsed BTRFS: device fsid 14d642db-7b15-43e4-81e6-4b8fac6a25f8 devid 1 transid 8 /dev/loop0 (7:0) scanned by repro (1010) BTRFS info (device loop0): first mount of filesystem 14d642db-7b15-43e4-81e6-4b8fac6a25f8 BTRFS info (device loop0): using blake2b (blake2b-256-generic) checksum algorithm BTRFS info (device loop0): using free-space-tree BTRFS warning (device loop0): checksum verify failed on logical 5312512 mirror 1 wanted 0xb043382657aede36608fd3386d6b001692ff406164733d94e2d9a180412c6003 found 0x810ceb2bacb7f0f9eb2bf3b2b15c02af867cb35ad450898169f3b1f0bd818651 level 0 DEBUG: read tree root path failed for tree csum, ret=-5 BTRFS warning (device loop0): checksum verify failed on logical 5328896 mirror 1 wanted 0x51be4e8b303da58e6340226815b70e3a93592dac3f30dd510c7517454de8567a found 0x51be4e8b303da58e634022a315b70e3a93592dac3f30dd510c7517454de8567a level 0 BTRFS warning (device loop0): checksum verify failed on logical 5292032 mirror 1 wanted 0x1924ccd683be9efc2fa98582ef58760e3848e9043db8649ee382681e220cdee4 found 0x0cb6184f6e8799d9f8cb335dccd1d1832da1071d12290dab3b85b587ecacca6e level 0 process 'repro' launched './file2' with NULL argv: empty string added DEBUG: no csum root, idatacsums=0 ibadroots=134217728 Oops: general protection fault, probably for non-canonical address 0xdffffc0000000041: 0000 [#1] SMP KASAN NOPTI KASAN: null-ptr-deref in range [0x0000000000000208-0x000000000000020f] CPU: 5 UID: 0 PID: 1010 Comm: repro Tainted: G OE 6.15.0-custom+ #249 PREEMPT(full) Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022 RIP: 0010:btrfs_lookup_csum+0x93/0x3d0 [btrfs] Call Trace: <TASK> btrfs_lookup_bio_sums+0x47a/0xdf0 [btrfs] btrfs_submit_bbio+0x43e/0x1a80 [btrfs] submit_one_bio+0xde/0x160 [btrfs] btrfs_readahead+0x498/0x6a0 [btrfs] read_pages+0x1c3/0xb20 page_cache_ra_order+0x4b5/0xc20 filemap_get_pages+0x2d3/0x19e0 filemap_read+0x314/0xde0 __kernel_read+0x35b/0x900 bprm_execve+0x62e/0x1140 do_execveat_common.isra.0+0x3fc/0x520 __x64_sys_execveat+0xdc/0x130 do_syscall_64+0x54/0x1d0 entry_SYSCALL_64_after_hwframe+0x76/0x7e ---[ end trace 0000000000000000 ]--- [CAUSE] Firstly the fs has a corrupted csum tree root, thus to mount the fs we have to go "ro,rescue=ibadroots" mount option. Normally with that mount option, a bad csum tree root should set BTRFS_FS_STATE_NO_DATA_CSUMS flag, so that any future data read will ignore csum search. But in this particular case, we have the following call trace that caused NULL csum root, but not setting BTRFS_FS_STATE_NO_DATA_CSUMS: load_global_roots_objectid(): ret = btrfs_search_slot(); /* Succeeded / btrfs_item_key_to_cpu() found = true; / We found the root item for csum tree. / root = read_tree_root_path(); if (IS_ERR(root)) { if (!btrfs_test_opt(fs_info, IGNOREBADROOTS)) / * Since we have rescue=ibadroots mount option, * @ret is still 0. / break; if (!found \|\| ret) { / @found is true, @ret is 0, error handling for csum * tree is skipped. */ } This means we completely skipped to set BTRFS_FS_STATE_NO_DATA_CSUMS if the csum tree is corrupted, which results unexpected later csum lookup. [FIX] If read_tree_root_path() failed, always populate @ret to the error number. As at the end of the function, we need @ret to determine if we need to do the extra error handling for csum tree. Fixes: `abed4aaae4` ("btrfs: track the csum, extent, and free space trees in a rb tree") Reported-by: Zhiyu Zhang <zhiyuzhang999@gmail.com> Reported-by: Longxing Li <coregee2000@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-06-19 15:21:06 +02:00
Filipe Manana	a26bf338cd	btrfs: fix race between async reclaim worker and close_ctree() Syzbot reported an assertion failure due to an attempt to add a delayed iput after we have set BTRFS_FS_STATE_NO_DELAYED_IPUT in the fs_info state: WARNING: CPU: 0 PID: 65 at fs/btrfs/inode.c:3420 btrfs_add_delayed_iput+0x2f8/0x370 fs/btrfs/inode.c:3420 Modules linked in: CPU: 0 UID: 0 PID: 65 Comm: kworker/u8:4 Not tainted 6.15.0-next-20250530-syzkaller #0 PREEMPT(full) Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/07/2025 Workqueue: btrfs-endio-write btrfs_work_helper RIP: 0010:btrfs_add_delayed_iput+0x2f8/0x370 fs/btrfs/inode.c:3420 Code: 4e ad 5d (...) RSP: 0018:ffffc9000213f780 EFLAGS: 00010293 RAX: ffffffff83c635b7 RBX: ffff888058920000 RCX: ffff88801c769e00 RDX: 0000000000000000 RSI: 0000000000000100 RDI: 0000000000000000 RBP: 0000000000000001 R08: ffff888058921b67 R09: 1ffff1100b12436c R10: dffffc0000000000 R11: ffffed100b12436d R12: 0000000000000001 R13: dffffc0000000000 R14: ffff88807d748000 R15: 0000000000000100 FS: 0000000000000000(0000) GS:ffff888125c53000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00002000000bd038 CR3: 000000006a142000 CR4: 00000000003526f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> btrfs_put_ordered_extent+0x19f/0x470 fs/btrfs/ordered-data.c:635 btrfs_finish_one_ordered+0x11d8/0x1b10 fs/btrfs/inode.c:3312 btrfs_work_helper+0x399/0xc20 fs/btrfs/async-thread.c:312 process_one_work kernel/workqueue.c:3238 [inline] process_scheduled_works+0xae1/0x17b0 kernel/workqueue.c:3321 worker_thread+0x8a0/0xda0 kernel/workqueue.c:3402 kthread+0x70e/0x8a0 kernel/kthread.c:464 ret_from_fork+0x3fc/0x770 arch/x86/kernel/process.c:148 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245 </TASK> This can happen due to a race with the async reclaim worker like this: 1) The async metadata reclaim worker enters shrink_delalloc(), which calls btrfs_start_delalloc_roots() with an nr_pages argument that has a value less than LONG_MAX, and that in turn enters start_delalloc_inodes(), which sets the local variable 'full_flush' to false because wbc->nr_to_write is less than LONG_MAX; 2) There it finds inode X in a root's delalloc list, grabs a reference for inode X (with igrab()), and triggers writeback for it with filemap_fdatawrite_wbc(), which creates an ordered extent for inode X; 3) The unmount sequence starts from another task, we enter close_ctree() and we flush the workqueue fs_info->endio_write_workers, which waits for the ordered extent for inode X to complete and when dropping the last reference of the ordered extent, with btrfs_put_ordered_extent(), when we call btrfs_add_delayed_iput() we don't add the inode to the list of delayed iputs because it has a refcount of 2, so we decrement it to 1 and return; 4) Shortly after at close_ctree() we call btrfs_run_delayed_iputs() which runs all delayed iputs, and then we set BTRFS_FS_STATE_NO_DELAYED_IPUT in the fs_info state; 5) The async reclaim worker, after calling filemap_fdatawrite_wbc(), now calls btrfs_add_delayed_iput() for inode X and there we trigger an assertion failure since the fs_info state has the flag BTRFS_FS_STATE_NO_DELAYED_IPUT set. Fix this by setting BTRFS_FS_STATE_NO_DELAYED_IPUT only after we wait for the async reclaim workers to finish, after we call cancel_work_sync() for them at close_ctree(), and by running delayed iputs after wait for the reclaim workers to finish and before setting the bit. This race was recently introduced by commit `19e60b2a95` ("btrfs: add extra warning if delayed iput is added when it's not allowed"). Without the new validation at btrfs_add_delayed_iput(), this described scenario was safe because close_ctree() later calls btrfs_commit_super(). That will run any final delayed iputs added by reclaim workers in the window between the btrfs_run_delayed_iputs() and the the reclaim workers being shut down. Reported-by: syzbot+0ed30ad435bf6f5b7a42@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/6840481c.a00a0220.d4325.000c.GAE@google.com/T/#u Fixes: `19e60b2a95` ("btrfs: add extra warning if delayed iput is added when it's not allowed") Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-06-19 15:20:57 +02:00
Filipe Manana	1961d20f6f	btrfs: fix assertion when building free space tree When building the free space tree with the block group tree feature enabled, we can hit an assertion failure like this: BTRFS info (device loop0 state M): rebuilding free space tree assertion failed: ret == 0, in fs/btrfs/free-space-tree.c:1102 ------------[ cut here ]------------ kernel BUG at fs/btrfs/free-space-tree.c:1102! Internal error: Oops - BUG: 00000000f2000800 [#1] SMP Modules linked in: CPU: 1 UID: 0 PID: 6592 Comm: syz-executor322 Not tainted 6.15.0-rc7-syzkaller-gd7fa1af5b33e #0 PREEMPT Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/07/2025 pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : populate_free_space_tree+0x514/0x518 fs/btrfs/free-space-tree.c:1102 lr : populate_free_space_tree+0x514/0x518 fs/btrfs/free-space-tree.c:1102 sp : ffff8000a4ce7600 x29: ffff8000a4ce76e0 x28: ffff0000c9bc6000 x27: ffff0000ddfff3d8 x26: ffff0000ddfff378 x25: dfff800000000000 x24: 0000000000000001 x23: ffff8000a4ce7660 x22: ffff70001499cecc x21: ffff0000e1d8c160 x20: ffff0000e1cb7800 x19: ffff0000e1d8c0b0 x18: 00000000ffffffff x17: ffff800092f39000 x16: ffff80008ad27e48 x15: ffff700011e740c0 x14: 1ffff00011e740c0 x13: 0000000000000004 x12: ffffffffffffffff x11: ffff700011e740c0 x10: 0000000000ff0100 x9 : 94ef24f55d2dbc00 x8 : 94ef24f55d2dbc00 x7 : 0000000000000001 x6 : 0000000000000001 x5 : ffff8000a4ce6f98 x4 : ffff80008f415ba0 x3 : ffff800080548ef0 x2 : 0000000000000000 x1 : 0000000100000000 x0 : 000000000000003e Call trace: populate_free_space_tree+0x514/0x518 fs/btrfs/free-space-tree.c:1102 (P) btrfs_rebuild_free_space_tree+0x14c/0x54c fs/btrfs/free-space-tree.c:1337 btrfs_start_pre_rw_mount+0xa78/0xe10 fs/btrfs/disk-io.c:3074 btrfs_remount_rw fs/btrfs/super.c:1319 [inline] btrfs_reconfigure+0x828/0x2418 fs/btrfs/super.c:1543 reconfigure_super+0x1d4/0x6f0 fs/super.c:1083 do_remount fs/namespace.c:3365 [inline] path_mount+0xb34/0xde0 fs/namespace.c:4200 do_mount fs/namespace.c:4221 [inline] __do_sys_mount fs/namespace.c:4432 [inline] __se_sys_mount fs/namespace.c:4409 [inline] __arm64_sys_mount+0x3e8/0x468 fs/namespace.c:4409 __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline] invoke_syscall+0x98/0x2b8 arch/arm64/kernel/syscall.c:49 el0_svc_common+0x130/0x23c arch/arm64/kernel/syscall.c:132 do_el0_svc+0x48/0x58 arch/arm64/kernel/syscall.c:151 el0_svc+0x58/0x17c arch/arm64/kernel/entry-common.c:767 el0t_64_sync_handler+0x78/0x108 arch/arm64/kernel/entry-common.c:786 el0t_64_sync+0x198/0x19c arch/arm64/kernel/entry.S:600 Code: f0047182 91178042 528089c3 9771d47b (d4210000) ---[ end trace 0000000000000000 ]--- This happens because we are processing an empty block group, which has no extents allocated from it, there are no items for this block group, including the block group item since block group items are stored in a dedicated tree when using the block group tree feature. It also means this is the block group with the highest start offset, so there are no higher keys in the extent root, hence btrfs_search_slot_for_read() returns 1 (no higher key found). Fix this by asserting 'ret' is 0 only if the block group tree feature is not enabled, in which case we should find a block group item for the block group since it's stored in the extent root and block group item keys are greater than extent item keys (the value for BTRFS_BLOCK_GROUP_ITEM_KEY is 192 and for BTRFS_EXTENT_ITEM_KEY and BTRFS_METADATA_ITEM_KEY the values are 168 and 169 respectively). In case 'ret' is 1, we just need to add a record to the free space tree which spans the whole block group, and we can achieve this by making 'ret == 0' as the while loop's condition. Reported-by: syzbot+36fae25c35159a763a2a@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/6841dca8.a00a0220.d4325.0020.GAE@google.com/ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-06-19 15:20:54 +02:00
Filipe Manana	16edae52f6	btrfs: don't silently ignore unexpected extent type when replaying log If there's an unexpected (invalid) extent type, we just silently ignore it. This means a corruption or some bug somewhere, so instead return -EUCLEAN to the caller, making log replay fail, and print an error message with relevant information. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-06-19 15:20:47 +02:00
Filipe Manana	2dcf838cf5	btrfs: fix invalid inode pointer dereferences during log replay In a few places where we call read_one_inode(), if we get a NULL pointer we end up jumping into an error path, or fallthrough in case of __add_inode_ref(), where we then do something like this: iput(&inode->vfs_inode); which results in an invalid inode pointer that triggers an invalid memory access, resulting in a crash. Fix this by making sure we don't do such dereferences. Fixes: `b4c50cbb01` ("btrfs: return a btrfs_inode from read_one_inode()") CC: stable@vger.kernel.org # 6.15+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-06-19 15:20:42 +02:00
Filipe Manana	e5b5596011	btrfs: fix double unlock of buffer_tree xarray when releasing subpage eb If we break out of the loop because an extent buffer doesn't have the bit EXTENT_BUFFER_TREE_REF set, we end up unlocking the xarray twice, once before we tested for the bit and break out of the loop, and once again after the loop. Fix this by testing the bit and exiting before unlocking the xarray. The time spent testing the bit is negligible and it's not worth trying to do that outside the critical section delimited by the xarray lock due to the code complexity required to avoid it (like using a local boolean variable to track whether the xarray is locked or not). The xarray unlock only needs to be done before calling release_extent_buffer(), as that needs to lock the xarray (through xa_cmpxchg_irq()) and does a more significant amount of work. Fixes: `19d7f65f03` ("btrfs: convert the buffer_radix to an xarray") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://lore.kernel.org/linux-btrfs/aDRNDU0GM1_D4Xnw@stanley.mountain/ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-06-19 15:20:33 +02:00
Mark Harmstone	ae4477f937	btrfs: update superblock's device bytes_used when dropping chunk Each superblock contains a copy of the device item for that device. In a transaction which drops a chunk but doesn't create any new ones, we were correctly updating the device item in the chunk tree but not copying over the new bytes_used value to the superblock. This can be seen by doing the following: # dd if=/dev/zero of=test bs=4096 count=2621440 # mkfs.btrfs test # mount test /root/temp # cd /root/temp # for i in {00..10}; do dd if=/dev/zero of=$i bs=4096 count=32768; done # sync # rm * # sync # btrfs balance start -dusage=0 . # sync # cd # umount /root/temp # btrfs check test For btrfs-check to detect this, you will also need my patch at https://github.com/kdave/btrfs-progs/pull/991. Change btrfs_remove_dev_extents() so that it adds the devices to the fs_info->post_commit_list if they're not there already. This causes btrfs_commit_device_sizes() to be called, which updates the bytes_used value in the superblock. Fixes: `bbbf7243d6` ("btrfs: combine device update operations during transaction commit") CC: stable@vger.kernel.org # 5.10+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Mark Harmstone <maharmstone@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-06-19 15:20:22 +02:00
Filipe Manana	3ca864de85	btrfs: fix a race between renames and directory logging We have a race between a rename and directory inode logging that if it happens and we crash/power fail before the rename completes, the next time the filesystem is mounted, the log replay code will end up deleting the file that was being renamed. This is best explained following a step by step analysis of an interleaving of steps that lead into this situation. Consider the initial conditions: 1) We are at transaction N; 2) We have directories A and B created in a past transaction (< N); 3) We have inode X corresponding to a file that has 2 hardlinks, one in directory A and the other in directory B, so we'll name them as "A/foo_link1" and "B/foo_link2". Both hard links were persisted in a past transaction (< N); 4) We have inode Y corresponding to a file that as a single hard link and is located in directory A, we'll name it as "A/bar". This file was also persisted in a past transaction (< N). The steps leading to a file loss are the following and for all of them we are under transaction N: 1) Link "A/foo_link1" is removed, so inode's X last_unlink_trans field is updated to N, through btrfs_unlink() -> btrfs_record_unlink_dir(); 2) Task A starts a rename for inode Y, with the goal of renaming from "A/bar" to "A/baz", so we enter btrfs_rename(); 3) Task A inserts the new BTRFS_INODE_REF_KEY for inode Y by calling btrfs_insert_inode_ref(); 4) Because the rename happens in the same directory, we don't set the last_unlink_trans field of directoty A's inode to the current transaction id, that is, we don't cal btrfs_record_unlink_dir(); 5) Task A then removes the entries from directory A (BTRFS_DIR_ITEM_KEY and BTRFS_DIR_INDEX_KEY items) when calling __btrfs_unlink_inode() (actually the dir index item is added as a delayed item, but the effect is the same); 6) Now before task A adds the new entry "A/baz" to directory A by calling btrfs_add_link(), another task, task B is logging inode X; 7) Task B starts a fsync of inode X and after logging inode X, at btrfs_log_inode_parent() it calls btrfs_log_all_parents(), since inode X has a last_unlink_trans value of N, set at in step 1; 8) At btrfs_log_all_parents() we search for all parent directories of inode X using the commit root, so we find directories A and B and log them. Bu when logging direct A, we don't have a dir index item for inode Y anymore, neither the old name "A/bar" nor for the new name "A/baz" since the rename has deleted the old name but has not yet inserted the new name - task A hasn't called yet btrfs_add_link() to do that. Note that logging directory A doesn't fallback to a transaction commit because its last_unlink_trans has a lower value than the current transaction's id (see step 4); 9) Task B finishes logging directories A and B and gets back to btrfs_sync_file() where it calls btrfs_sync_log() to persist the log tree; 10) Task B successfully persisted the log tree, btrfs_sync_log() completed with success, and a power failure happened. We have a log tree without any directory entry for inode Y, so the log replay code deletes the entry for inode Y, name "A/bar", from the subvolume tree since it doesn't exist in the log tree and the log tree is authorative for its index (we logged a BTRFS_DIR_LOG_INDEX_KEY item that covers the index range for the dentry that corresponds to "A/bar"). Since there's no other hard link for inode Y and the log replay code deletes the name "A/bar", the file is lost. The issue wouldn't happen if task B synced the log only after task A called btrfs_log_new_name(), which would update the log with the new name for inode Y ("A/bar"). Fix this by pinning the log root during renames before removing the old directory entry, and unpinning after btrfs_log_new_name() is called. Fixes: `259c4b96d7` ("btrfs: stop doing unnecessary log updates during a rename") CC: stable@vger.kernel.org # 5.18+ Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-06-19 15:19:58 +02:00
Anand Jain	65d5112b4d	btrfs: scrub: add prefix for the error messages Add a "scrub: " prefix to all messages logged by scrub so that it's easy to filter them from dmesg for analysis. Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-06-19 15:19:06 +02:00
Leo Martins	186b9dc3c3	btrfs: warn if leaking delayed_nodes in btrfs_put_root() Add a warning for leaked delayed_nodes when putting a root. We currently do this for inodes, but not delayed_nodes. Signed-off-by: Leo Martins <loemra.dev@gmail.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> [ Remove the changelog from the commit message. ] Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-06-19 15:18:39 +02:00
Leo Martins	dd276214e4	btrfs: fix delayed ref refcount leak in debug assertion If the delayed_root is not empty we are increasing the number of references to a delayed_node without decreasing it, causing a leak. Fix by decrementing the delayed_node reference count. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Leo Martins <loemra.dev@gmail.com> Reviewed-by: Qu Wenruo <wqu@suse.com> [ Remove the changelog from the commit message. ] Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-06-19 15:18:35 +02:00
Filipe Manana	c769be2d3d	btrfs: include root in error message when unlinking inode To help debugging include the root number in the error message, and since this is a critical error that implies a metadata inconsistency and results in a transaction abort change the log message level from "info" to "critical", which is a much better fit. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-06-19 15:18:30 +02:00
Lorenzo Stoakes	2e3b37a7e4	fs: replace mmap hook with .mmap_prepare for simple mappings Since commit `c84bf6dd2b` ("mm: introduce new .mmap_prepare() file callback"), the f_op->mmap() hook has been deprecated in favour of f_op->mmap_prepare(). This callback is invoked in the mmap() logic far earlier, so error handling can be performed more safely without complicated and bug-prone state unwinding required should an error arise. This hook also avoids passing a pointer to a not-yet-correctly-established VMA avoiding any issues with referencing this data structure. It rather provides a pointer to the new struct vm_area_desc descriptor type which contains all required state and allows easy setting of required parameters without any consideration needing to be paid to locking or reference counts. Note that nested filesystems like overlayfs are compatible with an .mmap_prepare() callback since commit `bb666b7c27` ("mm: add mmap_prepare() compatibility layer for nested file systems"). In this patch we apply this change to file systems with relatively simple mmap() hook logic - exfat, ceph, f2fs, bcachefs, zonefs, btrfs, ocfs2, orangefs, nilfs2, romfs, ramfs and aio. Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Link: https://lore.kernel.org/f528ac4f35b9378931bd800920fee53fc0c5c74d.1750099179.git.lorenzo.stoakes@oracle.com Acked-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-06-19 13:56:59 +02:00
Al Viro	05fb0e6664	new helper: set_default_d_op() ... to be used instead of manually assigning to ->s_d_op. All in-tree filesystem converted (and field itself is renamed, so any out-of-tree ones in need of conversion will be caught by compiler). Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-06-10 22:21:16 -04:00
Linus Torvalds	a56baa2253	for-6.16-tag -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmg1pKUACgkQxWXV+ddt WDt5JA/9ExCLAICZX9qXoq0GN9O921BpH86XtV7tBfmniFVFqII2xzk0K5KxMFGC E/CQVE5rh8n3i6abvm/nuSKcDl9532Lnqb98WPDpLx5o3ClaSArx7smgJKAGcmQ2 bE88UfK1TQFMUxyTxCKFrk/Q6iVWD+GgWatIyBLuiLn3E2aPkewwXMU64BdFUXyq 9ZLkyy75Lw41aQhAuGSVZGJTL81hg3mM5zO945+0vYFTS1o5ST7+mqlqPzOnWaxO 5v1ecsJy27uVqdIfGZoOt1kzJ1nWaWT+4frfHCVotusGtFEezc++tCghdEjqokCO ThjRxiwNJmuwN3yrSKadjbFLSiHWDhHCIv9rExJfqSQZqUV7URxKFTFDBBC/yCZW Y70TOzqdji1BYsNt7VYhGeuNsKuZHC8+zPiOLMxI6XprRoSvjGuknuMPTVcmy567 hC8Mbyk8RlQmTsk5rziuXiesGNkqxpPoNIV8smj1Qur+GpvWzU0kmWa20AFM8iOQ e58amI90LmM/C2Bl8I+I3Hh7pRgYFEhsbvGB6HwdhaEXEn+bEdDN0G4DGsOUtOtZ 4Se0ckmho5esOVlmw/TtDa81TTg+jMzrzKCs9/vl3iY7QuzK8VPc438wSy95D4z0 gkHuRO5tni9E5bl2rg/rE7JeNgBef1sgzjjv+hfc2ZxsO3d1IcY= =zeWb -----END PGP SIGNATURE----- Merge tag 'for-6.16-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fix from David Sterba: "A fixup to the xarray conversion sent in the main 6.16 batch. It was not included because it would cause rebase/refresh of like 80 patches, right before sending the early pull request last week. It's fixing a bug when zoned mode is enabled on btrfs so it's not affecting most people" * tag 'for-6.16-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: don't drop a reference if btrfs_check_write_meta_pointer() fails	2025-05-28 11:59:25 -07:00
Josef Bacik	b83825a8f5	btrfs: don't drop a reference if btrfs_check_write_meta_pointer() fails In the zoned mode there's a bug in the extent buffer tree conversion to xarray. The reference for eb is dropped and code continues but the references get dropped by releasing the batch. Reported-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reported-by: kernel test robot <oliver.sang@intel.com> Link: https://lore.kernel.org/linux-btrfs/202505191521.435b97ac-lkp@intel.com/ Fixes: `19d7f65f03` ("btrfs: convert the buffer_radix to an xarray") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-27 13:39:15 +02:00
Josef Bacik	4db7384ce5	btrfs: don't drop a reference if btrfs_check_write_meta_pointer() fails In the zoned mode there's a bug in the extent buffer tree conversion to xarray. The reference for eb is dropped and code continues but the references get dropped by releasing the batch. Reported-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Fixes: `19d7f65f03` ("btrfs: convert the buffer_radix to an xarray") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-27 13:26:28 +02:00
Linus Torvalds	5e82ed5ca4	for-6.16-tag -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmgtuJgACgkQxWXV+ddt WDt79g//YndozUasOP0raqNVvod4wYvmG/CX1yHOkFQpfRQSVG4av0KlTWnupXKG oEQvFbZ639tmXbBYlKlK8Ts8fy1dpj+2iG4ValukA4L7xkY8ML5DrGQfKYbPEm2i Ab9lp4qnZZutYVH2/5UGQqkEUA3/YIiOZ0hsZWir//zbkTCL9cuHwl2FUYbmFlHi Hxkd30QC0kZuxINdMxXGauF4JkFJFyiNnmI5dMjj07xMMWk1cv8vunoZ3LVjAlbW gX16+4rUmtJl33HbYqofee4Dcovvcuvt/fEM1LX0rGbKXOnKA2dQPoMQsjMAV82B mjhma5T709MgVHQiDdJduh86seaul4Cuv/E/OqoDj7Kfkoew/YquHEfU4TB4bvCX KmONEyJFd9QDq5CUyvfow7HENja6QbU31Fw6akrbfpsVcla0MKAUWPi+Vqpqf+pe qIWNcovorD2g/EVJV6y+w0K+kXTarPtXXmVnJnJPYtOkBWpARI3Y8wVxDCKX8Nfo 7Kpi/h9K87+d9opjjEajydNONDL9GQa4AY4u/oeiwcSuJHvCt/rsKKwHZRyycRiI q+nGwsNcmY/ih/EVUzLgYomGG08H9nOcKvZOQkfHpOTI1EgvILAeV9SpGMex7du1 PiPqVtv9Z60dKy6OValh7ttMpt7LszAK4Dk7XiyHrN1Q3sYDyrs= =bDOD -----END PGP SIGNATURE----- Merge tag 'for-6.16-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "Apart from numerous cleanups, there are some performance improvements and one minor mount option update. There's one more radix-tree conversion (one remaining), and continued work towards enabling large folios (almost finished). Performance: - extent buffer conversion to xarray gains throughput and runtime improvements on metadata heavy operations doing writeback (sample test shows +50% throughput, -33% runtime) - extent io tree cleanups lead to performance improvements by avoiding unnecessary searches or repeated searches - more efficient extent unpinning when committing transaction (estimated run time improvement 3-5%) User visible changes: - remove standalone mount option 'nologreplay', deprecated in 5.9, replacement is 'rescue=nologreplay' - in scrub, update reporting, add back device stats message after detected errors (accidentally removed during recent refactoring) Core: - convert extent buffer radix tree to xarray - in subpage mode, move block perfect compression out of experimental build - in zoned mode, introduce sub block groups to allow managing special block groups, like the one for relocation or tree-log, to handle some corner cases of ENOSPC - in scrub, simplify bitmaps for block tracking status - continued preparations for large folios: - remove assertions for folio order 0 - add support where missing: compression, buffered write, defrag, hole punching, subpage, send - fix fsync of files with no hard links not persisting deletion - reject tree blocks which are not nodesize aligned, a precaution from 4.9 times - move transaction abort calls closer to the error sites - remove usage of some struct bio_vec internals - simplifications in extent map - extent IO cleanups and optimizations - error handling improvements - enhanced ASSERT() macro with optional format strings - cleanups: - remove unused code - naming unifications, dropped __, added prefix - merge similar functions - use common helpers for various data structures" * tag 'for-6.16-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (198 commits) btrfs: move misplaced comment of btrfs_path::keep_locks btrfs: remove standalone "nologreplay" mount option btrfs: use a single variable to track return value at btrfs_page_mkwrite() btrfs: don't return VM_FAULT_SIGBUS on failure to set delalloc for mmap write btrfs: simplify early error checking in btrfs_page_mkwrite() btrfs: pass true to btrfs_delalloc_release_space() at btrfs_page_mkwrite() btrfs: fix wrong start offset for delalloc space release during mmap write btrfs: fix harmless race getting delayed ref head count when running delayed refs btrfs: log error codes during failures when writing super blocks btrfs: simplify error return logic when getting folio at prepare_one_folio() btrfs: return real error from __filemap_get_folio() calls btrfs: remove superfluous return value check at btrfs_dio_iomap_begin() btrfs: fix invalid data space release when truncating block in NOCOW mode btrfs: update Kconfig option descriptions btrfs: update list of features built under experimental config btrfs: send: remove btrfs_debug() calls btrfs: use boolean for delalloc argument to btrfs_free_reserved_extent() btrfs: use boolean for delalloc argument to btrfs_free_reserved_bytes() btrfs: fold error checks when allocating ordered extent and update comments btrfs: check we grabbed inode reference when allocating an ordered extent ...	2025-05-26 12:24:43 -07:00
Linus Torvalds	6f59de9bc0	for-6.16/block-20250523 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmgwnGYQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpq9aD/4iqOts77xhWWLrOJWkkhOcV5rREeyppq8X MKYul9S4cc4Uin9Xou9a+nab31QBQEk3nsN3kX9o3yAXvkh6yUm36HD8qYNW/46q IUkwRQQJ0COyTnexMZQNTbZPQDIYcenXmQxOcrEJ5jC1Jcz0sOKHsgekL+ab3kCy fLnuz2ozvjGDMala/NmE8fN5qSlj4qQABHgbamwlwfo4aWu07cwfqn5G/FCYJgDO xUvsnTVclom2g4G+7eSSvGQI1QyAxl5QpviPnj/TEgfFBFnhbCSoBTEY6ecqhlfW 6u59MF/Uw8E+weiuGY4L87kDtBhjQs3UMSLxCuwH7MxXb25ff7qB4AIkcFD0kKFH 3V5NtwqlU7aQT0xOjGxaHhfPwjLD+FVss4ARmuHS09/Kn8egOW9yROPyetnuH84R Oz0Ctnt1IPLFjvGeg3+rt9fjjS9jWOXLITb9Q6nX9gnCt7orCwIYke8YCpmnJyhn i+fV4CWYIQBBRKxIT0E/GhJxZOmL0JKpomnbpP2dH8npemnsTCuvtfdrK9gfhH2X chBVqCPY8MNU5zKfzdEiavPqcm9392lMzOoOXW2pSC1eAKqnAQ86ZT3r7rLntqE8 75LxHcvaQIsnpyG+YuJVHvoiJ83TbqZNpyHwNaQTYhDmdYpp2d/wTtTQywX4DuXb Y6NDJw5+kQ== =1PNK -----END PGP SIGNATURE----- Merge tag 'for-6.16/block-20250523' of git://git.kernel.dk/linux Pull block updates from Jens Axboe: - ublk updates: - Add support for updating the size of a ublk instance - Zero-copy improvements - Auto-registering of buffers for zero-copy - Series simplifying and improving GET_DATA and request lookup - Series adding quiesce support - Lots of selftests additions - Various cleanups - NVMe updates via Christoph: - add per-node DMA pools and use them for PRP/SGL allocations (Caleb Sander Mateos, Keith Busch) - nvme-fcloop refcounting fixes (Daniel Wagner) - support delayed removal of the multipath node and optionally support the multipath node for private namespaces (Nilay Shroff) - support shared CQs in the PCI endpoint target code (Wilfred Mallawa) - support admin-queue only authentication (Hannes Reinecke) - use the crc32c library instead of the crypto API (Eric Biggers) - misc cleanups (Christoph Hellwig, Marcelo Moreira, Hannes Reinecke, Leon Romanovsky, Gustavo A. R. Silva) - MD updates via Yu: - Fix that normal IO can be starved by sync IO, found by mkfs on newly created large raid5, with some clean up patches for bdev inflight counters - Clean up brd, getting rid of atomic kmaps and bvec poking - Add loop driver specifically for zoned IO testing - Eliminate blk-rq-qos calls with a static key, if not enabled - Improve hctx locking for when a plug has IO for multiple queues pending - Remove block layer bouncing support, which in turn means we can remove the per-node bounce stat as well - Improve blk-throttle support - Improve delay support for blk-throttle - Improve brd discard support - Unify IO scheduler switching. This should also fix a bunch of lockdep warnings we've been seeing, after enabling lockdep support for queue freezing/unfreezeing - Add support for block write streams via FDP (flexible data placement) on NVMe - Add a bunch of block helpers, facilitating the removal of a bunch of duplicated boilerplate code - Remove obsolete BLK_MQ pci and virtio Kconfig options - Add atomic/untorn write support to blktrace - Various little cleanups and fixes * tag 'for-6.16/block-20250523' of git://git.kernel.dk/linux: (186 commits) selftests: ublk: add test for UBLK_F_QUIESCE ublk: add feature UBLK_F_QUIESCE selftests: ublk: add test case for UBLK_U_CMD_UPDATE_SIZE traceevent/block: Add REQ_ATOMIC flag to block trace events ublk: run auto buf unregisgering in same io_ring_ctx with registering io_uring: add helper io_uring_cmd_ctx_handle() ublk: remove io argument from ublk_auto_buf_reg_fallback() ublk: handle ublk_set_auto_buf_reg() failure correctly in ublk_fetch() selftests: ublk: add test for covering UBLK_AUTO_BUF_REG_FALLBACK selftests: ublk: support UBLK_F_AUTO_BUF_REG ublk: support UBLK_AUTO_BUF_REG_FALLBACK ublk: register buffer to local io_uring with provided buf index via UBLK_F_AUTO_BUF_REG ublk: prepare for supporting to register request buffer automatically ublk: convert to refcount_t selftests: ublk: make IO & device removal test more stressful nvme: rename nvme_mpath_shutdown_disk to nvme_mpath_remove_disk nvme: introduce multipath_always_on module param nvme-multipath: introduce delayed removal of the multipath head node nvme-pci: derive and better document max segments limits nvme-pci: use struct_size for allocation struct nvme_dev ...	2025-05-26 11:39:36 -07:00
Linus Torvalds	6d5b940e1e	vfs-6.16-rc1.async.dir -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaDBN6wAKCRCRxhvAZXjc ok32AQD9DTiSCAoVg+7s+gSBuLTi8drPTN++mCaxdTqRh5WpRAD9GVyrGQT0s6LH eo9bm8d1TAYjilEWM0c0K0TxyQ7KcAA= =IW7H -----END PGP SIGNATURE----- Merge tag 'vfs-6.16-rc1.async.dir' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs directory lookup updates from Christian Brauner: "This contains cleanups for the lookup_one() family of helpers. We expose a set of functions with names containing "lookup_one_len" and others without the "_len". This difference has nothing to do with "len". It's rater a historical accident that can be confusing. The functions without "_len" take a "mnt_idmap" pointer. This is found in the "vfsmount" and that is an important question when choosing which to use: do you have a vfsmount, or are you "inside" the filesystem. A related question is "is permission checking relevant here?". nfsd and cachefiles do* have a vfsmount but don't use the non-_len functions. They pass nop_mnt_idmap and refuse to work on filesystems which have any other idmap. This work changes nfsd and cachefile to use the lookup_one family of functions and to explictily pass &nop_mnt_idmap which is consistent with all other vfs interfaces used where &nop_mnt_idmap is explicitly passed. The remaining uses of the "_one" functions do not require permission checks so these are renamed to be "_noperm" and the permission checking is removed. This series also changes these lookup function to take a qstr instead of separate name and len. In many cases this simplifies the call" * tag 'vfs-6.16-rc1.async.dir' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: VFS: change lookup_one_common and lookup_noperm_common to take a qstr Use try_lookup_noperm() instead of d_hash_and_lookup() outside of VFS VFS: rename lookup_one_len family to lookup_noperm and remove permission check cachefiles: Use lookup_one() rather than lookup_one_len() nfsd: Use lookup_one() rather than lookup_one_len() VFS: improve interface for lookup_one functions	2025-05-26 08:02:43 -07:00
Sun YangKai	eeb133a634	btrfs: move misplaced comment of btrfs_path::keep_locks Commit `925baeddc5` ("Btrfs: Start btree concurrency work.") added the comment for the field keep_locks. This got moved later but without the comment, so move it to the right place and fix the comment style. Signed-off-by: Sun YangKai <sunk67188@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-17 21:15:08 +02:00
Qu Wenruo	8af94e772e	btrfs: remove standalone "nologreplay" mount option Standalone "nologreplay" mount option has been marked deprecated since commit `74ef00185e` ("btrfs: introduce "rescue=" mount option"), which dates back to v5.9 (2020). Furthermore there is no other filesystem with the same named mount option, so this one is btrfs specific and we will not hit the same problem when removing "norecovery" mount option. So let's remove the standalone "nologreplay" mount option. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-16 19:16:22 +02:00
Filipe Manana	1ce06d45d9	btrfs: use a single variable to track return value at btrfs_page_mkwrite() We have two variables to track return values, ret and ret2, with types vm_fault_t (an unsigned int type) and int, which makes it a bit confusing and harder to keep track. So use a single variable, of type int, and under the 'out' label return vmf_error(ret) in case ret contains an error, otherwise return VM_FAULT_NOPAGE. This is equivalent to what we had before and it's simpler. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 18:24:44 +02:00
Filipe Manana	d8cddf2a1d	btrfs: don't return VM_FAULT_SIGBUS on failure to set delalloc for mmap write If the call to btrfs_set_extent_delalloc() fails we are always returning VM_FAULT_SIGBUS, which is odd since the error means "bad access" and the most likely cause for btrfs_set_extent_delalloc() is -ENOMEM, which should be translated to VM_FAULT_OOM. Instead of returning VM_FAULT_SIGBUS return vmf_error(ret2), which gives us a more appropriate return value, and we use that everywhere else too. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 18:24:37 +02:00
Filipe Manana	a08625f825	btrfs: simplify early error checking in btrfs_page_mkwrite() We have this entangled error checks early at btrfs_page_mkwrite(): 1) Try to reserve delalloc space by calling btrfs_delalloc_reserve_space() and storing the return value in the ret2 variable; 2) If the reservation succeed, call file_update_time() and store the return value in ret2 and also set the local variable 'reserved' to true (1); 3) Then do an error check on ret2 to see if any of the previous calls failed and if so, jump either to the 'out' label or to the 'out_noreserve' label, depending on whether 'reserved' is true or not. This is unnecessarily complex. Instead change this to a simpler and more straightforward approach: 1) Call btrfs_delalloc_reserve_space(), if that returns an error jump to the 'out_noreserve' label; 2) The call file_update_time() and if that returns an error jump to the 'out' label. Like this there's less nested if statements, no need to use a local variable to track if space was reserved and if statements are used only to check errors. Also move the call to extent_changeset_free() out of the 'out_noreserve' label and under the 'out' label since the changeset is allocated only if the call to reserve delalloc space succeeded. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 18:24:23 +02:00
Filipe Manana	bf1c74ccba	btrfs: pass true to btrfs_delalloc_release_space() at btrfs_page_mkwrite() In the last call to btrfs_delalloc_release_space() where the value of the variable 'ret' is never zero, we pass the expression 'ret != 0' as the value for the argument 'qgroup_free', which always evaluates to true. Make this less confusing and more clear by explicitly passing true instead. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 18:24:15 +02:00
Filipe Manana	17a85f5204	btrfs: fix wrong start offset for delalloc space release during mmap write If we're doing a mmap write against a folio that has i_size somewhere in the middle and we have multiple sectors in the folio, we may have to release excess space previously reserved, for the range going from the rounded up (to sector size) i_size to the folio's end offset. We are calculating the right amount to release and passing it to btrfs_delalloc_release_space(), but we are passing the wrong start offset of that range - we're passing the folio's start offset instead of the end offset, plus 1, of the range for which we keep the reservation. This may result in releasing more space then we should and eventually trigger an underflow of the data space_info's bytes_may_use counter. So fix this by passing the start offset as 'end + 1' instead of 'page_start' to btrfs_delalloc_release_space(). Fixes: `d0b7da88f6` ("Btrfs: btrfs_page_mkwrite: Reserve space in sectorsized units") Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 18:24:07 +02:00
Filipe Manana	7dbfa4266c	btrfs: fix harmless race getting delayed ref head count when running delayed refs When running delayed references we are reading the number of ready delayed ref heads without taking any lock which can make KCSAN report a race since we can have concurrent tasks updating that number, such as for example when freeing a tree block which will end up decrementing that counter or when adding a new delayed ref while COWing a tree block which will increment that counter. This is a harmless race since running one more or one less delayed ref head doesn't result in any problem, in the critical section of a transaction commit we always run any remaining delayed refs and at that point no one can create more. So fix this harmless race by annotating the read with data_race(). Reported-by: cen zhang <zzzccc427@gmail.com> Link: https://lore.kernel.org/linux-btrfs/CAFRLqsUCLMz0hY-GaPj1Z=fhkgRHjxVXHZ8kz0PvkFN0b=8L2Q@mail.gmail.com/ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:57 +02:00
Filipe Manana	4469e95fe5	btrfs: log error codes during failures when writing super blocks When writing super blocks, at write_dev_supers(), we log an error message when we get some error but we don't show which error we got and we have that information. So enhance the error messages with the error codes. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:57 +02:00
Filipe Manana	0f2bc22150	btrfs: simplify error return logic when getting folio at prepare_one_folio() There's no need to have special logic to return -EAGAIN in case the call to __filemap_get_folio() fails, because when FGP_NOWAIT is passed to __filemap_get_folio() it returns ERR_PTR(-EAGAIN) if it needs to do something that would imply blocking. The reason we have this logic is from the days before we migrated to the folio interface, when we called pagecache_get_page() which would return NULL instead of an error pointer. So remove this special casing and always return the error that the call to __filemap_get_folio() returned. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:57 +02:00
Filipe Manana	443e4d0e1c	btrfs: return real error from __filemap_get_folio() calls We have a few places that always assume a -ENOMEM error happened in case a call to __filemap_get_folio() returns an error, which is just too much of an assumption and even if it would be the case at some point in time, it's not future proof and there's nothing in the documentation that guarantees that only ERR_PTR(-ENOMEM) can be returned with the flags we are passing to it. So use the exact error returned by __filemap_get_folio() instead. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:57 +02:00
Filipe Manana	ca84913d49	btrfs: remove superfluous return value check at btrfs_dio_iomap_begin() In the if statement that checks the return value from btrfs_check_data_free_space(), there's no point to check if 'ret' is not zero in the else branch, since the main if branch checked that it's zero, so in the else branch it necessarily has a non-zero value. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:57 +02:00
Filipe Manana	d3914d6030	btrfs: fix invalid data space release when truncating block in NOCOW mode If when truncating a block we fail to reserve data space and then we proceed anyway because we can do a NOCOW write, if we later get an error when trying to get the folio from the inode's mapping, we end up releasing data space that we haven't reserved, screwing up the bytes_may_use counter from the data space_info, eventually resulting in an underflow when all other reservations done by other tasks are released, if any, or right away if there are no other reservations at the moment. This is because when we get an error when trying to grab the block's folio we call btrfs_delalloc_release_space(), which releases metadata (which we have reserved) and data (which we haven't reserved). Fix this by calling btrfs_delalloc_release_space() only if we did reserve data space, that is, if we aren't falling back to NOCOW, meaning the local variable @only_release_metadata has a false value, otherwise release only metadata by calling btrfs_delalloc_release_metadata(). Fixes: `6d4572a9d7` ("btrfs: allow btrfs_truncate_block() to fallback to nocow for data space reservation") Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:56 +02:00
David Sterba	c16b984cdb	btrfs: update Kconfig option descriptions Expand what the options do and if they are OK to be enabled. Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:56 +02:00
David Sterba	5f9b394e32	btrfs: update list of features built under experimental config The list is out of date, the extent shrinker got fixed in 6.13. Add new entries: the COW fixup warning in 6.15, rund robin policies in 6.14. Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:56 +02:00
David Sterba	585e944a31	btrfs: send: remove btrfs_debug() calls There are debugging prints for each emitted send command and other related actions. This does not seem right as the number of commands can be high and dumping that to the system log will likely hit some rate limiting. This should be done by trace points that are more lightweight and can keep up with high frequency. Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:56 +02:00
Filipe Manana	9f6fa5b344	btrfs: use boolean for delalloc argument to btrfs_free_reserved_extent() We are using an integer for the 'delalloc' argument but all we need is a boolean, so switch the type to 'bool' and rename the parameter to 'is_delalloc' to better match the fact that it's a boolean. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:56 +02:00
Filipe Manana	ba4ec9a5a0	btrfs: use boolean for delalloc argument to btrfs_free_reserved_bytes() We are using an integer for the 'delalloc' argument but all we need is a boolean, so switch the type to 'bool' and rename the parameter to 'is_delalloc' to better match the fact that it's a boolean. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:56 +02:00
Filipe Manana	87417e0cbb	btrfs: fold error checks when allocating ordered extent and update comments Instead of having an error check and return on each branch of the if statement, move the error check to happen after that if branch, reducing source code and object code sizes. Before this change: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1840174 163742 16136 2020052 1ed2d4 fs/btrfs/btrfs.ko After this change: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1840138 163742 16136 2020016 1ed2b0 fs/btrfs/btrfs.ko While at it and moving the comments, update the comments to be more clear about how qgroup reserved space is released and the intricacies of how it's managed for COW writes. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:56 +02:00
Filipe Manana	08c649a563	btrfs: check we grabbed inode reference when allocating an ordered extent When allocating an ordered extent we call igrab() to get a reference on the inode and attach it to the ordered extent. For an ordered extent we always must have an inode reference since we during its life cycle we need to access the inode for several things like for example: * Inserting the ordered extent right after allocating it, when calling insert_ordered_extent() - we need to lock the inode's ordered_tree_lock; * In the bio submission path we need to add checksums to the ordered extent and we end up at btrfs_add_ordered_sum(), where again we need to grab the inode from the ordered extent to lock the inode's ordered_tree_lock; * When finishing an ordered extent, at btrfs_finish_ordered_extent(), we need again to access its inode in order to lock the inode's ordered_tree_lock; * Etc etc etc. Everywhere we deal with an ordered extent we always expect its inode to be not NULL, the only exception being btrfs_put_ordered_extent() where we check if it's NULL before calling btrfs_add_delayed_iput(), even though we have already assumed it's not NULL when calling the tracepoint trace_btrfs_ordered_extent_put() since the tracepoint dereferences the inode to extract its number and root without ever checking it's NULL. The igrab() call can return NULL if the inode is about to be freed or is being freed (its state has I_FREEING or I_WILL_FREE set), and that's why there's such check at btrfs_put_ordered_extent(). The igrab() and NULL check were introduced in commit `5fd0204355` ("Btrfs: finish ordered extents in their own thread") but even back then we always needed and assumed igrab() returned a non-NULL pointer, since for example when removing an ordered extent, at btrfs_remove_ordered_extent(), we assumed the inode pointer was not NULL in order to access the inode's ordered extent tree. In fact whenever we allocate an ordered extent we are holding an inode reference and the inode is not being freed or going to be freed (which happens in the final iput), and since we depend on the inode for the life cycle of the ordered extent, just make ordered extent allocation to fail in case igrab() returns NULL and trigger a warning, to make it clear it's not expected. This allows to remove the confusing NULL inode check at btrfs_put_ordered_extent(). Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:56 +02:00
Filipe Manana	1f2889f559	btrfs: fix qgroup reservation leak on failure to allocate ordered extent If we fail to allocate an ordered extent for a COW write we end up leaking a qgroup data reservation since we called btrfs_qgroup_release_data() but we didn't call btrfs_qgroup_free_refroot() (which would happen when running the respective data delayed ref created by ordered extent completion or when finishing the ordered extent in case an error happened). So make sure we call btrfs_qgroup_free_refroot() if we fail to allocate an ordered extent for a COW write. Fixes: `7dbeaad0af` ("btrfs: change timing for qgroup reserved space for ordered extents to fix reserved space leak") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:56 +02:00
Qu Wenruo	4ad57e1e22	btrfs: scrub: reduce memory usage of struct scrub_sector_verification That structure records needed info for block verification (either data checksum pointer, or expected tree block generation). But there is also a boolean to tell if this block belongs to a metadata or not, as the data checksum pointer and expected tree block generation is already a union, we need a dedicated bit to tell if this block is a metadata or not. However such layout means we're wasting 63 bits for x86_64, which is a huge memory waste. Thanks to the recent bitmap aggregation, we can easily move this single-bit-per-block member to a new sub-bitmap. And since we already have six 16 bits long bitmaps, adding another bitmap won't even increase any memory usage for x86_64, as we need two 64 bits long anyway. This will reduce the following memory usages: - sizeof(struct scrub_sector_verification) From 16 bytes to 8 bytes on x86_64. - scrub_stripe::sectors From 16 * 16 to 16 * 8 bytes. - Per-device scrub_ctx memory usage From 128 * (16 * 16) to 128 * (16 * 8), which saves 16KiB memory. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:56 +02:00
Qu Wenruo	4e2945f73b	btrfs: handle aligned EOF truncation correctly for subpage cases [BUG] For the following fsx -e 1 run, the btrfs still fails the run on 64K page size with 4K fs block size: READ BAD DATA: offset = 0x26b3a, size = 0xfafa, fname = /mnt/btrfs/junk OFFSET GOOD BAD RANGE 0x26b3a 0x0000 0x15b4 0x0 operation# (mod 256) for the bad data may be 21 [...] LOG DUMP (28 total operations): 1( 1 mod 256): SKIPPED (no operation) 2( 2 mod 256): SKIPPED (no operation) 3( 3 mod 256): SKIPPED (no operation) 4( 4 mod 256): SKIPPED (no operation) 5( 5 mod 256): WRITE 0x1ea90 thru 0x285e0 (0x9b51 bytes) HOLE 6( 6 mod 256): ZERO 0x1b1a8 thru 0x20bd4 (0x5a2d bytes) 7( 7 mod 256): FALLOC 0x22b1a thru 0x272fa (0x47e0 bytes) INTERIOR 8( 8 mod 256): WRITE 0x741d thru 0x13522 (0xc106 bytes) 9( 9 mod 256): MAPWRITE 0x73ee thru 0xdeeb (0x6afe bytes) 10( 10 mod 256): FALLOC 0xb719 thru 0xb994 (0x27b bytes) INTERIOR 11( 11 mod 256): COPY 0x15ed8 thru 0x18be1 (0x2d0a bytes) to 0x25f6e thru 0x28c77 12( 12 mod 256): ZERO 0x1615e thru 0x1770e (0x15b1 bytes) 13( 13 mod 256): SKIPPED (no operation) 14( 14 mod 256): DEDUPE 0x20000 thru 0x27fff (0x8000 bytes) to 0x1000 thru 0x8fff 15( 15 mod 256): SKIPPED (no operation) 16( 16 mod 256): CLONE 0xa000 thru 0xffff (0x6000 bytes) to 0x36000 thru 0x3bfff 17( 17 mod 256): ZERO 0x14adc thru 0x1b78a (0x6caf bytes) 18( 18 mod 256): TRUNCATE DOWN from 0x3c000 to 0x1e2e3 ****WWWW 19( 19 mod 256): CLONE 0x4000 thru 0x11fff (0xe000 bytes) to 0x16000 thru 0x23fff 20( 20 mod 256): FALLOC 0x311e1 thru 0x3681b (0x563a bytes) PAST_EOF 21( 21 mod 256): FALLOC 0x351c5 thru 0x40000 (0xae3b bytes) EXTENDING 22( 22 mod 256): WRITE 0x920 thru 0x7e51 (0x7532 bytes) 23( 23 mod 256): COPY 0x2b58 thru 0xc508 (0x99b1 bytes) to 0x117b1 thru 0x1b161 24( 24 mod 256): TRUNCATE DOWN from 0x40000 to 0x3c9a5 25( 25 mod 256): SKIPPED (no operation) 26( 26 mod 256): MAPWRITE 0x25020 thru 0x26b06 (0x1ae7 bytes) 27( 27 mod 256): SKIPPED (no operation) 28( 28 mod 256): READ 0x26b3a thru 0x36633 (0xfafa bytes) RRRR** [CAUSE] The involved operations are: fallocating to largest ever: 0x40000 21 pollute_eof 0x24000 thru 0x2ffff (0xc000 bytes) 21 falloc from 0x351c5 to 0x40000 (0xae3b bytes) 28 read 0x26b3a thru 0x36633 (0xfafa bytes) At operation #21 a pollute_eof is done, by memory mapped write into range [0x24000, 0x2ffff). At this stage, the inode size is 0x24000, which is block aligned. Then fallocate happens, and since it's expanding the inode, it will call btrfs_truncate_block() to truncate any unaligned range. But since the inode size is already block aligned, btrfs_truncate_block() does nothing and exits. However remember the folio at 0x20000 has some range polluted already, although it will not be written back to disk, it still affects the page cache, resulting the later operation #28 to read out the polluted value. [FIX] Instead of early exit from btrfs_truncate_block() if the range is already block aligned, do extra filio zeroing if the fs block size is smaller than the page size and we're truncating beyond EOF. This is to address exactly the above case where memory mapped write can still leave some garbage beyond EOF. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:55 +02:00
Qu Wenruo	8e4f21f2b1	btrfs: handle unaligned EOF truncation correctly for subpage cases [BUG] The following fsx sequence will fail on btrfs with 64K page size and 4K fs block size: #fsx -d -e 1 -N 4 $mnt/junk -S 36386 READ BAD DATA: offset = 0xe9ba, size = 0x6dd5, fname = /mnt/btrfs/junk OFFSET GOOD BAD RANGE 0xe9ba 0x0000 0x03ac 0x0 operation# (mod 256) for the bad data may be 3 ... LOG DUMP (4 total operations): 1( 1 mod 256): WRITE 0x6c62 thru 0x1147d (0xa81c bytes) HOLE *WWWW 2( 2 mod 256): TRUNCATE DOWN from 0x1147e to 0x5448 **WWWW 3( 3 mod 256): ZERO 0x1c7aa thru 0x28fe2 (0xc839 bytes) 4( 4 mod 256): MAPREAD 0xe9ba thru 0x1578e (0x6dd5 bytes) RRRR** [CAUSE] Only 2 operations are really involved in this case: 3 pollute_eof 0x5448 thru 0xffff (0xabb8 bytes) 3 zero from 0x1c7aa to 0x28fe3, (0xc839 bytes) 4 mapread 0xe9ba thru 0x1578e (0x6dd5 bytes) At operation 3, fsx pollutes beyond EOF, that is done by mmap() and write into that mmap() range beyond EOF. Such write will fill the range beyond EOF, but it will never reach disk as ranges beyond EOF will not be marked dirty nor uptodate. Then we zero_range for [0x1c7aa, 0x28fe3], and since the range is beyond our isize (which was 0x5448), we should zero out any range beyond EOF (0x5448). During btrfs_zero_range(), we call btrfs_truncate_block() to dirty the unaligned head block. But that function only really zeroes out the block at [0x5000, 0x5fff], it doesn't bother any range other that that block, since those ranges will not be marked dirty nor written back. So the range [0x6000, 0xffff] is still polluted, and later mapread() will return the poisoned value. [FIX] Enhance btrfs_truncate_block() by: - Pass a @start/@end pair to indicate the full truncation range This is to handle the following truncation case: Page size is 64K, fs block size is 4K, truncate range is [6K, 60K] 0 32K 64K \| \|///////////////////////////////////\| \| 6K 60K The range is not aligned for its head block, so we need to call btrfs_truncate_block() with @from = 6K, @front = 0, @len = 0. But with that information we only know to zero the range [6K, 8K), if we zero out the range [6K, 64K), the last block will also be zeroed, causing data loss. So here we need the full range we're truncating, so that we can avoid over-truncation. - Rename @from to @offset As now the parameter is only utilized to locate a block, it's not really carrying the old @from meaning well. - Remove @front parameter With the full truncate range passed in, we can determine if the @offset is at the head or tail block. - Skip truncation if @offset is not in the head nor tail blocks The call site in hole punch unconditionally call btrfs_truncate_block() without even checking the range is aligned or not. If the @offset is neither in the head nor in tail block, it means we can safely ignore it. - Skip truncate if the range inside the target block is already aligned - Make btrfs_truncate_block() zero all blocks beyond EOF Since we have the original range, we know exactly if we're doing truncation beyond EOF (the @end will be (u64)-1). If we're doing truncation beyond EOF, then enlarge the truncation range to the folio end, to address the possibly polluted ranges. Otherwise still keep the zero range inside the block, as we can have large data folios soon, always truncating every blocks inside the same folio can be costly for large folios. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:55 +02:00
Boris Burkov	3649833a58	btrfs: fix broken drop_caches on extent buffer folios The (correct) commit `e41c81d0d3` ("mm/truncate: Replace page_mapped() call in invalidate_inode_page()") replaced the page_mapped(page) check with a refcount check. However, this refcount check does not work as expected with drop_caches for btrfs's metadata pages. Btrfs has a per-sb metadata inode with cached pages, and when not in active use by btrfs, they have a refcount of 3. One from the initial call to alloc_pages(), one (nr_pages == 1) from filemap_add_folio(), and one from folio_attach_private(). We would expect such pages to get dropped by drop_caches. However, drop_caches calls into mapping_evict_folio() via mapping_try_invalidate() which gets a reference on the folio with find_lock_entries(). As a result, these pages have a refcount of 4, and fail this check. For what it's worth, such pages do get reclaimed under memory pressure, so I would say that while this behavior is surprising, it is not really dangerously broken. When I asked the mm folks about the expected refcount in this case, I was told that the correct thing to do is to donate the refcount from the original allocation to the page cache after inserting it. Therefore, attempt to fix this by adding a put_folio() to the critical spot in alloc_extent_buffer() where we are sure that we have really allocated and attached new pages. We must also adjust folio_detach_private() to properly handle being the last reference to the folio and not do a use-after-free after folio_detach_private(). extent_buffers allocated by clone_extent_buffer() and alloc_dummy_extent_buffer() are unmapped, so this transfer of ownership from allocation to insertion in the mapping does not apply to them. However, we can still folio_put() them safely once they are finished being allocated and have called folio_attach_private(). Finally, removing the generic put_folio() for the allocation from btrfs_detach_extent_buffer_folios() means we need to be careful to do the appropriate put_folio() in allocation failure paths in alloc_extent_buffer(), clone_extent_buffer() and alloc_dummy_extent_buffer(). Link: https://lore.kernel.org/linux-mm/ZrwhTXKzgDnCK76Z@casper.infradead.org/ Tested-by: Klara Modin <klarasmodin@gmail.com> Reviewed-by: Daniel Vacek <neelx@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:55 +02:00
Filipe Manana	1886b77f5b	btrfs: use verbose assert at peek_discard_list() We now have a verbose variant of ASSERT() so that we can print the value of the block group's discard_index. So use it for better problem analysis in case the assertion is triggered. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:55 +02:00
Qu Wenruo	1b660424a6	btrfs: scrub: aggregate small bitmaps into a larger one Currently we have several small bitmaps inside scrub_stripe: - extent_sector_bitmap - error_bitmap - io_error_bitmap - csum_error_bitmap - meta_error_bitmap - meta_gen_error_bitmap All those bitmaps are at most 16 bits long, but unsigned long is either 32 or 64 (more common) bits. This means we're wasting 1/2 or 3/4 space for each bitmap. And we can have 128 scrub_stripe for each device, such wasted space adds up quickly. Instead of using a single unsigned long for each bitmap, aggregate them into a larger bitmap, just like what we're doing for subpage support. This reduces 24 bytes from each scrub_stripe structure on x86_64 systems. This will need a lot of macros converting direct bitmap/bit operations into our scrub_stripe specific helpers, but all those helpers are very small and can be inlined. So overall the overhead shouldn't be that huge, and we save quite some memory space. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:55 +02:00
Qu Wenruo	f2c19541e4	btrfs: scrub: fix a wrong error type when metadata bytenr mismatches When the bytenr doesn't match for a metadata tree block, we will report it as an csum error, which is incorrect and should be reported as a metadata error instead. Fixes: `a3ddbaebc7` ("btrfs: scrub: introduce a helper to verify one metadata block") Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:55 +02:00
Filipe Manana	5bc3b7e2b5	btrfs: defrag: use list_last_entry() at defrag_collect_targets() Instead of using list_entry() against the list's prev entry, use list_last_entry(), which removes the need to know the last member is accessed through the prev list pointer and the naming makes it easier to reason about what we are doing. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:55 +02:00
Filipe Manana	58fe389a2a	btrfs: simplify csum list release at btrfs_put_ordered_extent() Instead of extracting each element by grabbing the list's first member in a local list_head variable, then extracting the csum with list_entry() and iterating with a while loop checking for list emptyness, use the iteration helper list_for_each_entry_safe(). This also removes the need to delete elements from the list with list_del() since the ordered extent is freed immediately after. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:55 +02:00
Filipe Manana	4cde0724c9	btrfs: simplify extracting delayed node at btrfs_first_prepared_delayed_node() Instead of grabbing the next pointer from the list and then doing a list_entry() call, we can simply use list_first_entry(), removing the need for list_head variable. Also there's no need to check if the list is empty before attempting to extract the first element, we can use list_first_entry_or_null(), removing the need for a special if statement and the 'out' label. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:55 +02:00
Filipe Manana	32bc875cbc	btrfs: simplify extracting delayed node at btrfs_first_delayed_node() Instead of grabbing the next pointer from the list and then doing a list_entry() call, we can simply use list_first_entry(), removing the need for list_head variable. Also there's no need to check if the list is empty before attempting to extract the first element, we can use list_first_entry_or_null(), removing the need for a special if statement and the 'out' label. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:55 +02:00
Filipe Manana	c5d12d5b62	btrfs: raid56: use list_last_entry() at cache_rbio() Instead of using list_entry() against the list's prev entry, use list_last_entry(), which removes the need to know the last member is accessed through the prev list pointer and the naming makes it easier to reason about what we are doing. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:54 +02:00
Filipe Manana	d26d16a432	btrfs: simplify cow only root list extraction during transaction commit There's no need to keep a local variable to extract the first member of the list and then do a list_entry() call, we can use list_first_entry() instead, removing the need for the temporary variable and extracting the first element in a single step. Also, there's no need to do a list_del_init() followed by list_add_tail(), instead we can use list_move_tail(). We are in transaction commit critical section where we don't need to worry about concurrency and that's why we don't take any locks and can use list_move_tail() (we do assert early at commit_cowonly_roots() that we are in the critical section, that the transaction's state is TRANS_STATE_COMMIT_DOING). Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:54 +02:00
Filipe Manana	a20f732822	btrfs: simplify getting and extracting previous transaction at clean_pinned_extents() Instead of detecting if there is a previous transaction by comparing the current transaction's list prev member to the head of the transaction list (fs_info->trans_list), use the list_is_first() helper which contains that logic and the naming makes sense since a new transaction is always added to the end of the list fs_info->trans_list with list_add_tail(). We are also extracting the previous transaction with list_last_entry() against the transaction, which is correct but confusing because that function is usually meant to be used against a pointer to the start of a list and not a member of a list. It is easier to reason by either calling list_first_entry() against the list fs_info->trans_list, since we can never have more than two transactions in the list, or by calling list_prev_entry() against the transaction. So change that to use the later method. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:54 +02:00
Filipe Manana	d887f03fdb	btrfs: simplify getting and extracting previous transaction during commit Instead of detecting if there is a previous transaction by comparing the current transaction's list prev member to the head of the transaction list (fs_info->trans_list), use the list_is_first() helper which contains that logic and the naming makes sense since a new transaction is always added to the end of the list fs_info->trans_list with list_add_tail(). And instead of extracting the previous transaction with the more generic list_entry() helper against the current transaction's list prev member, use the more specific list_prev_entry() helper, which makes it clear what we are doing and is shorter. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:54 +02:00
David Sterba	8c4cfa99c2	btrfs: move transaction aborts to the error site in add_to_free_space_tree() Transaction aborts should be done next to the place the error happens, which was not done in add_to_free_space_tree(). Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:54 +02:00
David Sterba	11f25fad92	btrfs: move transaction aborts to the error site in remove_from_free_space_tree() Transaction aborts should be done next to the place the error happens, which was not done in remove_from_free_space_tree(). Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:54 +02:00
David Sterba	addaa67e33	btrfs: move transaction aborts to the error site in convert_free_space_to_extents() Transaction aborts should be done next to the place the error happens, which was not done in convert_free_space_to_extents(). The DEBUG_WARN() is removed because we get the abort message. Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:54 +02:00
David Sterba	14338d4661	btrfs: move transaction aborts to the error site in convert_free_space_to_bitmaps() Transaction aborts should be done next to the place the error happens, which was not done in convert_free_space_to_bitmaps(). The DEBUG_WARN() is removed because we get the abort message. Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:54 +02:00
Qu Wenruo	ce6920dba8	btrfs: scrub: move error reporting members to stack Currently the following members of scrub_stripe are only utilized for error reporting: - init_error_bitmap - init_nr_io_errors - init_nr_csum_errors - init_nr_meta_errors - init_nr_meta_gen_errors There is no need to put all those members into scrub_stripe, which take 24 bytes for each stripe, and we have 128 stripes for each device. Instead introduce a structure, scrub_error_records, and move all above members into that structure. And allocate such structure from stack inside scrub_stripe_read_repair_worker(). Since that function is called from a workqueue context, we have more than enough stack space for just 24 bytes. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:54 +02:00
Qu Wenruo	ec1f3a207c	btrfs: scrub: update device stats when an error is detected [BUG] Since the migration to the new scrub_stripe interface, scrub no longer updates the device stats when hitting an error, no matter if it's a read or checksum mismatch error. E.g: BTRFS info (device dm-2): scrub: started on devid 1 BTRFS error (device dm-2): unable to fixup (regular) error at logical 13631488 on dev /dev/mapper/test-scratch1 physical 13631488 BTRFS warning (device dm-2): checksum error at logical 13631488 on dev /dev/mapper/test-scratch1, physical 13631488, root 5, inode 257, offset 0, length 4096, links 1 (path: file) BTRFS error (device dm-2): unable to fixup (regular) error at logical 13631488 on dev /dev/mapper/test-scratch1 physical 13631488 BTRFS warning (device dm-2): checksum error at logical 13631488 on dev /dev/mapper/test-scratch1, physical 13631488, root 5, inode 257, offset 0, length 4096, links 1 (path: file) BTRFS info (device dm-2): scrub: finished on devid 1 with status: 0 Note there is no line showing the device stats error update. [CAUSE] In the migration to the new scrub_stripe interface, we no longer call btrfs_dev_stat_inc_and_print(). [FIX] - Introduce a new bitmap for metadata generation errors * A new bitmap @meta_gen_error_bitmap is introduced to record which blocks have metadata generation mismatch errors. * A new counter for that bitmap @init_nr_meta_gen_errors, is also introduced to store the number of generation mismatch errors that are found during the initial read. This is for the error reporting at scrub_stripe_report_errors(). * New dedicated error message for unrepaired generation mismatches * Update @meta_gen_error_bitmap if a transid mismatch is hit - Add btrfs_dev_stat_inc_and_print() calls to the following call sites * scrub_stripe_report_errors() * scrub_write_endio() This is only for the write errors. This means there is a minor behavior change: - The timing of device stats error message Since we concentrate the error messages at scrub_stripe_report_errors(), the device stats error messages will all show up in one go, after the detailed scrub error messages: BTRFS error (device dm-2): unable to fixup (regular) error at logical 13631488 on dev /dev/mapper/test-scratch1 physical 13631488 BTRFS warning (device dm-2): checksum error at logical 13631488 on dev /dev/mapper/test-scratch1, physical 13631488, root 5, inode 257, offset 0, length 4096, links 1 (path: file) BTRFS error (device dm-2): unable to fixup (regular) error at logical 13631488 on dev /dev/mapper/test-scratch1 physical 13631488 BTRFS warning (device dm-2): checksum error at logical 13631488 on dev /dev/mapper/test-scratch1, physical 13631488, root 5, inode 257, offset 0, length 4096, links 1 (path: file) BTRFS error (device dm-2): bdev /dev/mapper/test-scratch1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0 BTRFS error (device dm-2): bdev /dev/mapper/test-scratch1 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0 Fixes: `e02ee89baa` ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure") Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:54 +02:00
Naohiro Aota	45a59513b4	btrfs: add support for reclaiming from sub-space space_info Modify btrfs_async_{data,metadata}_reclaim() to run the reclaim process on the sub-spaces as well. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:54 +02:00
Naohiro Aota	635da7ea9a	btrfs: add block reserve for treelog We need to add a dedicated block_rsv for tree-log, because the block_rsv serves for a tree node allocation in btrfs_alloc_tree_block(). Currently, tree-log tree uses fs_info->empty_block_rsv, which is shared across trees and points to the normal metadata space_info. Instead, we add a dedicated block_rsv and that block_rsv can use the dedicated sub-space_info. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:53 +02:00
Naohiro Aota	df07664074	btrfs: use proper data space_info for zoned mode Now that, we have data sub-space for the zoned mode. Tweak some space_info functions to use proper space_info for a file. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:53 +02:00
Naohiro Aota	cc0517fe77	btrfs: tweak extent/chunk allocation for space_info sub-space Make the extent allocator and the chunk allocator aware of the sub-space. It now uses BTRFS_SUB_GROUP_DATA_RELOC sub-space for data relocation block group, and uses BTRFS_SUB_GROUP_TREELOG for metadata tree-log block group. And, it needs to check the space_info is the right one when a block group candidate is given. Also, new block group should now belong to the specified one. Now that, block_group->space_info is always set before btrfs_add_bg_to_space_info(), we no longer need to "find" the space_info. So, rename the variable name to address that as well. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:53 +02:00
Naohiro Aota	9a3023b828	btrfs: introduce tree-log sub-space_info Introduce the tree-log sub-space_info, which is sub-space of metadata space_info and dedicated for tree-log node allocation. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:53 +02:00
Naohiro Aota	f92ee31e03	btrfs: introduce btrfs_space_info sub-group Current code assumes we have only one space_info for each block group type (DATA, METADATA, and SYSTEM). We sometime need multiple space infos to manage special block groups. One example is handling the data relocation block group for the zoned mode. That block group is dedicated for writing relocated data and we cannot allocate any regular extent from that block group, which is implemented in the zoned extent allocator. This block group still belongs to the normal data space_info. So, when all the normal data block groups are full and there is some free space in the dedicated block group, the space_info looks to have some free space, while it cannot allocate normal extent anymore. That results in a strange ENOSPC error. We need to have a space_info for the relocation data block group to represent the situation properly. Adds a basic infrastructure for having a "sub-group" of a space_info: creation and removing. A sub-group space_info belongs to one of the primary space_infos and has the same flags as its parent. This commit first introduces the relocation data sub-space_info, and the next commit will introduce tree-log sub-space_info. In the future, it could be useful to implement tiered storage for btrfs e.g. by implementing a sub-group space_info for block groups resides on a fast storage. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:53 +02:00
Naohiro Aota	4d5a047e07	btrfs: add space_info parameter for block group creation Add struct btrfs_space_info parameter to btrfs_make_block_group(), its related functions and related struct. Passed space_info will have a new block group. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:53 +02:00
Naohiro Aota	098a442d5b	btrfs: add space_info argument to btrfs_chunk_alloc() Take a btrfs_space_info argument in btrfs_chunk_alloc(). New block group will belong to that space_info. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:53 +02:00
Naohiro Aota	1cfdbe0d53	btrfs: factor out check_removing_space_info() from btrfs_free_block_groups() Factor out check_removing_space_info() from btrfs_free_block_groups(). It sanity checks a to-be-removed space_info. There is no functional change. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:53 +02:00
Naohiro Aota	4ec06a9ddb	btrfs: factor out do_async_reclaim_{data,metadata}_space() Factor out the main part of btrfs_async_reclaim_data_space() to do_async_reclaim_data_space(), so it can take data space_info parameter it is working on. Do the same for metadata. There is no functional change. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:53 +02:00
Naohiro Aota	ac5578fef3	btrfs: factor out init_space_info() from create_space_info() Factor out initialization of the space_info struct, which is used in a later patch. There is no functional change. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:53 +02:00
Naohiro Aota	1c34e71966	btrfs: pass struct btrfs_inode to btrfs_free_reserved_data_space_noquota() As well as the last patch, pass struct btrfs_inode to the function and let it distinguish which data space it is working on in a later patch. There is no functional change with this commit. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:52 +02:00
Naohiro Aota	5d39fda880	btrfs: pass btrfs_space_info to btrfs_reserve_data_bytes() Pass struct btrfs_space_info to btrfs_reserve_data_bytes() to allow reserving the data from multiple data space_info candidates. This is a preparation for the following commits and there is no functional change. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:52 +02:00
Filipe Manana	66864101d1	btrfs: make extent unpinning more efficient when committing transaction At btrfs_finish_extent_commit() we have this loop that keeps finding an extent range to unpin in the transaction's pinned_extents io tree, caches the extent state and then passes that cached extent state to btrfs_clear_extent_dirty(), which will free that extent state since we clear the only bit it can have set. So on each loop iteration we do a full io tree search and the cached state is used only to avoid having a tree search done by btrfs_clear_extent_dirty(). During the lifetime of a transaction we can pin many thousands of extents, resulting in a large and deep rb tree that backs the io tree. For example, for the following fs_mark run on a 12 cores boxes: $ cat test.sh #!/bin/bash DEV=/dev/nullb0 MNT=/mnt/nullb0 FILES=100000 THREADS=$(nproc --all) echo "performance" \| \ tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor mkfs.btrfs -f $DEV mount $DEV $MNT OPTS="-S 0 -L 8 -n $FILES -s 0 -t $THREADS -k" for ((i = 1; i <= $THREADS; i++)); do OPTS="$OPTS -d $MNT/d$i" done fs_mark $OPTS umount $MNT an histogram for the number of ranges (elements) in the pinned extents io tree of a transaction was the following: Count: 76 Range: 5440.000 - 51088.000; Mean: 27354.368; Median: 28312.000; Stddev: 9800.767 Percentiles: 90th: 40486.000; 95th: 43322.000; 99th: 51088.000 5440.000 - 6805.809: 1 ### 6805.809 - 10652.034: 1 ### 10652.034 - 13326.178: 3 ######## 13326.178 - 16671.590: 8 ###################### 16671.590 - 20856.773: 7 #################### 20856.773 - 26092.528: 13 #################################### 26092.528 - 32642.571: 19 ##################################################### 32642.571 - 40836.818: 17 ############################################### 40836.818 - 51088.000: 7 #################### We can improve on this by grabbing the next state before calling btrfs_clear_extent_dirty(), avoiding a full tree search on the next iteration which always has an O(log n) complexity while grabbing the next element (rb_next() rbtree operation) is in the worst case O(log n) too, but very often much less than that, making it more efficient. Here follow histograms for the execution times, in nanoseconds, of btrfs_finish_extent_commit() before and after applying this patch and all the other patches in the same patchset. Before patchset: Count: 32 Range: 3925691.000 - 269990635.000; Mean: 133814526.906; Median: 122758052.000; Stddev: 65776550.375 Percentiles: 90th: 228672087.000; 95th: 265187000.000; 99th: 269990635.000 3925691.000 - 5993208.660: 1 #### 5993208.660 - 75878537.656: 4 ################## 75878537.656 - 115840974.514: 12 ##################################################### 115840974.514 - 176850157.761: 6 ########################### 176850157.761 - 269990635.000: 9 ######################################## After patchset: Count: 32 Range: 1849393.000 - 231491064.000; Mean: 126978584.625; Median: 123732897.000; Stddev: 58007821.806 Percentiles: 90th: 203055491.000; 95th: 219952699.000; 99th: 231491064.000 1849393.000 - 2997642.092: 1 #### 2997642.092 - 88111637.071: 9 ##################################### 88111637.071 - 142818264.414: 9 ##################################### 142818264.414 - 231491064.000: 13 ##################################################### Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:52 +02:00
Filipe Manana	93ef6c232a	btrfs: remove variable to track trimmed bytes at btrfs_finish_extent_commit() We don't need to keep track of discarded (trimmed) bytes at btrfs_finish_extent_commit() but we are declaring a local variable for that and passing a reference to the btrfs_discard_extent() calls when we are processing delete block groups. So instead pass NULL to btrfs_discard_extent() and remove that variable. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:52 +02:00
Filipe Manana	2598371392	btrfs: don't BUG_ON() when unpinning extents during transaction commit In the final phase of a transaction commit, when unpinning extents at btrfs_finish_extent_commit(), there's no need to BUG_ON() if we fail to unpin an extent range. All that can happen is that we fail to return the extent range to the in-memory free space cache, meaning no future space allocations can reuse that extent range while the fs is mounted. So instead return the error to the caller and make it abort the transaction, so that the error is noticed and prevent misteriously leaking space. We keep track of the first error we get while unpinning an extent range and keep trying to unpin all the following extent ranges, so that we attempt to do all discards. The transaction abort will deal with all resource cleanups. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:52 +02:00
Filipe Manana	b2460c2aee	btrfs: remove unnecessary NULL checks before freeing extent state When manipulating extent bits for an io tree range, we have this pattern: if (prealloc) btrfs_free_extent_state(prealloc); but this is not needed nowadays since btrfs_free_extent_state() ignores a NULL pointer argument, following the common pattern of kernel and btrfs freeing functions, as well as libc and other user space libraries. So remove the NULL checks, reducing source code and object size. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:52 +02:00
Filipe Manana	aa2c80a9ae	btrfs: avoid re-searching tree when setting bits in an extent range When setting bits for an extent range (set_extent_bit()), if the current extent state record starts after the target range, we always do a jump to the 'search_again' label, which will cause us to do a full tree search for the next state if the current state ends before the target range. Unless we need to reschedule, we can just grab the next state and process it, avoiding a full tree search, even if that next state is not contiguous, as we'll allocate and insert a new prealloc state if needed. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:52 +02:00
Filipe Manana	b61dd9b1cb	btrfs: avoid repeated extent state processing when setting extent bits When setting bits for an extent range, if we find an extent state with its start offset greater than current start offset, we insert a new extent state to cover the gap, with its end offset computed and stored in the @this_end local variable, and after the insertion we update the current start offset to @this_end + 1. However if the insert_state() call resulted in an extent state merge then the end offset of the merged extent may be greater than @this_end and if that's the case, since we jump to the 'search_again' label, we'll do a full tree search that will leave us in the same extent state - this is harmless but wastes time by doing a pointless tree search and extent state processing. So improve on this by updating the current start offset to the end offset of the inserted state plus 1. This also removes the use of the @this_end variable and directly set the value in the prealloc extent state to avoid any confusion and misuse in the future. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:52 +02:00
Filipe Manana	8faab454c5	btrfs: simplify last record detection at set_extent_bit() There's no need to compare the current extent state's end offset to (u64)-1 to check if we have the last possible record and to check as as well if after updating the start offset to the end offset of the current record plus one we are still inside the target range. Instead we can simplify and exit if the current extent state ends at or after the target range and then remove the check for the (u64)-1 as well as the check to see if the updated start offset (to last_end + 1) is still inside the target range. Besides the simplification, this also avoids seaching for the next extent state record (through next_state()) when the current extent state record ends at the same offset as our target range, which is pointless and only wastes times iterating through the rb tree. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:52 +02:00
Filipe Manana	41d69d4d78	btrfs: exit after state split error at set_extent_bit() If split_state() returned an error we call extent_io_tree_panic() which will trigger a BUG() call. However if CONFIG_BUG is disabled, which is an uncommon and exotic scenario, then we fallthrough and hit a use after free when calling set_state_bits() since the extent state record which the local variable 'prealloc' points to was freed by split_state(). So jump to the label 'out' after calling extent_io_tree_panic() and set the 'prealloc' pointer to NULL since split_state() has already freed it when it hit an error. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:52 +02:00
Filipe Manana	67f10a1018	btrfs: exit after state insertion failure at set_extent_bit() If insert_state() state failed it returns an error pointer and we call extent_io_tree_panic() which will trigger a BUG() call. However if CONFIG_BUG is disabled, which is an uncommon and exotic scenario, then we fallthrough and call cache_state() which will dereference the error pointer, resulting in an invalid memory access. So jump to the 'out' label after calling extent_io_tree_panic(), it also makes the code more clear besides dealing with the exotic scenario where CONFIG_BUG is disabled. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:51 +02:00
Filipe Manana	0edc1a5c54	btrfs: simplify last record detection at btrfs_convert_extent_bit() There's no need to compare the current extent state's end offset to (u64)-1 to check if we have the last possible record and to check as as well if after updating the start offset to the end offset of the current record plus one we are still inside the target range. Instead we can simplify and exit if the current extent state ends at or after the target range and then remove the check for the (u64)-1 as well as the check to see if the updated start offset (to last_end + 1) is still inside the target range. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:51 +02:00
Filipe Manana	be2270262f	btrfs: avoid re-searching tree when converting bits in an extent range When converting bits for an extent range (btrfs_convert_extent_bit()), if the current extent state record starts after the target range, we always do a jump to the 'search_again' label, which will cause us to do a full tree search for the next state if the current state ends before the target range. Unless we need to reschedule, we can just grab the next state and process it, avoiding a full tree search, even if that next state is not contiguous, as we'll allocate and insert a new prealloc state if needed. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:51 +02:00
Filipe Manana	eeb808422f	btrfs: avoid repeated extent state processing when converting extent bits When converting bits for an extent range, if we find an extent state with its start offset greater than current start offset, we insert a new extent state to cover the gap, with its end offset computed and stored in the @this_end local variable, and after the insertion we update the current start offset to @this_end + 1. However if the insert_state() call resulted in an extent state merge then the end offset of the merged extent may be greater than @this_end and if that's the case, since we jump to the 'search_again' label, we'll do a full tree search that will leave us in the same extent state - this is harmless but wastes time by doing a pointless tree search and extent state processing. So improve on this by updating the current start offset to the end offset of the inserted state plus 1. This also removes the use of the @this_end variable and directly set the value in the prealloc extent state to avoid any confusion and misuse in the future. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:51 +02:00
Filipe Manana	240dd0e1bb	btrfs: avoid unnecessary next node searches when clearing bits from extent range When clearing bits for a range in an io tree, at clear_state_bit(), we always go search for the next node (through next_state() -> rb_next()) and return it. However if the current extent state record ends at or after the target range passed to btrfs_clear_extent_bit_changeset() or btrfs_convert_extent_bit(), we are just wasting time finding that next node since we won't use it in those functions. Improve on this by skipping the next node search if the current node ends at or after the target range. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:51 +02:00
Filipe Manana	3bf179e36d	btrfs: exit after state insertion failure at btrfs_convert_extent_bit() If insert_state() state failed it returns an error pointer and we call extent_io_tree_panic() which will trigger a BUG() call. However if CONFIG_BUG is disabled, which is an uncommon and exotic scenario, then we fallthrough and call cache_state() which will dereference the error pointer, resulting in an invalid memory access. So jump to the 'out' label after calling extent_io_tree_panic(), it also makes the code more clear besides dealing with the exotic scenario where CONFIG_BUG is disabled. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:51 +02:00
Filipe Manana	2a72dd9996	btrfs: exit after state split error at btrfs_convert_extent_bit() If split_state() returned an error we call extent_io_tree_panic() which will trigger a BUG() call. However if CONFIG_BUG is disabled, which is an uncommon and exotic scenario, then we fallthrough and hit a use after free when calling set_state_bits() since the extent state record which the local variable 'prealloc' points to was freed by split_state(). So jump to the label 'out' after calling extent_io_tree_panic() and set the 'prealloc' pointer to NULL since split_state() has already freed it when it hit an error. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:51 +02:00
Filipe Manana	5f9c554a6c	btrfs: remove duplicate error check at btrfs_convert_extent_bit() There's no need to check if split_state() returned an error twice, instead unify into a single if statement after setting 'prealloc' to NULL, because on error split_state() frees the 'prealloc' extent state record. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:51 +02:00
Filipe Manana	f2a24bef55	btrfs: simplify last record detection at btrfs_clear_extent_bit_changeset() Instead of checking for an end offset of (u64)-1 (U64_MAX) for the current extent state's end, and then checking after updating the current start offset if it's now beyond the range's end offset, we can simply stop if the current extent state's end is greater than or equals to our range's end offset. This helps remove one comparison under the 'next' label and allows to remove the if statement that checks if the start offset is greater than the end offset under the 'search_again' label. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:51 +02:00
Filipe Manana	6c28102f9a	btrfs: avoid extra tree search at btrfs_clear_extent_bit_changeset() When we find an extent state that starts before our range's start we split it and jump into the 'search_again' label with our start offset remaining the same, making us then go to the 'again' label and search again for an extent state that contains the 'start' offset, and this time it finds the same extent state but with its start offset set to our range's start offset (due to the split). This is because we have consumed the preallocated extent state record and we may need to split again, and by jumping to 'again' we release the spinlock, allocate a new prealloc state and restart the search. However we may not need to restart and allocate a new extent state in case we don't find extent states that cross our end offset, therefore no need for further extent state splits, or we may be able to do an atomic allocation (which is quick even if it fails). In these cases it's a waste to restart the search. So change the behaviour to do the restart only if we need to reschedule, otherwise fall through - if we need to allocate an extent state for split operations, we will try an atomic allocation and if that fails we will do the restart as before. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:51 +02:00
Filipe Manana	c832378622	btrfs: use bools for local variables at btrfs_clear_extent_bit_changeset() Several variables are defined as integers but used as booleans, and the 'delete' variable can be made const since it's not changed after being declared. So change them to proper booleans and simplify setting their value. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:50 +02:00
Filipe Manana	5af1eae78d	btrfs: add missing error return to btrfs_clear_extent_bit_changeset() We have a couple error branches where we have an error stored in the 'err' variable and then jump to the 'out' label, however we don't return that error, we just return 0. Normally this is not a problem since those error branches call extent_io_tree_panic() which triggers a BUG() call, however it's possible to have rather exotic kernel config with CONFIG_BUG disabled in which case the BUG() call does nothing and we fallthrough. So make sure to return the error, not just to fix that exotic case but also to make the code less confusing. While at it also rename the 'err' variable to 'ret' since this is the style we prefer and use more widely. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:50 +02:00
Filipe Manana	2187540b6f	btrfs: exit after state split error at btrfs_clear_extent_bit_changeset() If split_state() returned an error we call extent_io_tree_panic() which will trigger a BUG() call. However if CONFIG_BUG is disabled, which is an uncommon and exotic scenario, then we fallthrough and hit a use after free when calling clear_state_bit() since the extent state record which the local variable 'prealloc' points to was freed by split_state(). So jump to the label 'out' after calling extent_io_tree_panic() and set the 'prealloc' pointer to NULL since split_state() has already freed it when it hit an error. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:50 +02:00
Filipe Manana	f389e7b982	btrfs: remove duplicate error check at btrfs_clear_extent_bit_changeset() There's no need to check if split_state() returned an error twice, instead unify into a single if statement after setting 'prealloc' to NULL, because on error split_state() frees the 'prealloc' extent state record. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:50 +02:00
Qu Wenruo	007fa63225	btrfs: get rid of btrfs_read_dev_super() The function is introduced by commit `a512bbf855` ("Btrfs: superblock duplication") at the beginning of btrfs. It leaved a comment saying we'd need a special mount option to read all super blocks, but it's never been implemented and there was not need/request for it. The check/rescue tools are able to start from a specific copy and use it as primary eventually. This means btrfs_read_dev_super() is always reading the first super block, making all the code finding the latest super block unnecessary. Just remove that function and replace all call sites with btrfs_read_disk_super(bdev, 0, false). Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:50 +02:00
Qu Wenruo	63f32b7b5d	btrfs: merge btrfs_read_dev_one_super() into btrfs_read_disk_super() We have two functions to read a super block from a block device: - btrfs_read_dev_one_super() Exported from disk-io.c - btrfs_read_disk_super() Local to volumes.c And they have some minor differences: - btrfs_read_dev_one_super() uses @copy_num Meanwhile btrfs_read_disk_super() relies on the physical and expected bytenr passed from the caller. The parameter list of btrfs_read_dev_one_super() is more user friendly. - btrfs_read_disk_super() makes sure the label is NUL terminated We do not need two different functions doing the same job, so merge the behavior into btrfs_read_disk_super() by: - Remove btrfs_read_dev_one_super() - Export btrfs_read_disk_super() The name pairs with btrfs_release_disk_super() perfectly. - Change the parameter list of btrfs_read_disk_super() to mimic btrfs_read_dev_one_super() All existing callers are calculating the physical address and expect bytenr before calling btrfs_read_disk_super() already. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:50 +02:00
Daniel Vacek	13ae88706a	btrfs: get rid of goto in alloc_test_extent_buffer() The `free_eb` label is used only once. Simplify by moving the code inplace. Signed-off-by: Daniel Vacek <neelx@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:50 +02:00
Josef Bacik	5e121ae687	btrfs: use buffer xarray for extent buffer writeback operations Currently we have this ugly back and forth with the btree writeback where we find the folio, find the eb associated with that folio, and then attempt to writeback. This results in two different paths for subpage ebs and >= page size ebs. Clean this up by adding our own infrastructure around looking up tagged ebs and writing the ebs out directly. This allows us to unify the subpage and >= pagesize IO paths, resulting in a much cleaner writeback path for extent buffers. I ran this through fsperf on a VM with 8 CPUs and 16GiB of RAM. I used smallfiles100k, but reduced the files to 1k to make it run faster, the results are as follows, with the statistically significant improvements marked with , there were no regressions. fsperf was run with -n 10 for both runs, so the baseline is the average 10 runs and the test is the average of 10 runs. smallfiles100k results metric baseline current stdev diff ================================================================================ avg_commit_ms 68.58 58.44 3.35 -14.79% commits 270.60 254.70 16.24 -5.88% dev_read_iops 48 48 0 0.00% dev_read_kbytes 1044 1044 0 0.00% dev_write_iops 866117.90 850028.10 14292.20 -1.86% dev_write_kbytes 10939976.40 10605701.20 351330.32 -3.06% elapsed 49.30 33 1.64 -33.06% * end_state_mount_ns 41251498.80 35773220.70 2531205.32 -13.28% * end_state_umount_ns 1.90e+09 1.50e+09 14186226.85 -21.38% * max_commit_ms 139 111.60 9.72 -19.71% * sys_cpu 4.90 3.86 0.88 -21.29% write_bw_bytes 42935768.20 64318451.10 1609415.05 49.80% * write_clat_ns_mean 366431.69 243202.60 14161.98 -33.63% * write_clat_ns_p50 49203.20 20992 264.40 -57.34% * write_clat_ns_p99 827392 653721.60 65904.74 -20.99% * write_io_kbytes 2035940 2035940 0 0.00% write_iops 10482.37 15702.75 392.92 49.80% * write_lat_ns_max 1.01e+08 90516129 3910102.06 -10.29% * write_lat_ns_mean 366556.19 243308.48 14154.51 -33.62% * As you can see we get about a 33% decrease runtime, with a 50% throughput increase, which is pretty significant. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:50 +02:00
Josef Bacik	4bc0a3cb75	btrfs: set DIRTY and WRITEBACK tags on the buffer_tree In preparation for changing how we do writeout of extent buffers, start tagging the extent buffer xarray with DIRTY and WRITEBACK to make it easier to find extent buffers that are in either state. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:50 +02:00
Josef Bacik	19d7f65f03	btrfs: convert the buffer_radix to an xarray In order to fully utilize xarray tagging to improve writeback we need to convert the buffer_radix to a proper xarray. This conversion is relatively straightforward as the radix code uses the xarray underneath. Using xarray directly allows for quite a lot less code. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:50 +02:00
David Sterba	656e9f51de	btrfs: rename btrfs_discard workqueue to btrfs-discard We use the "btrfs-" prefix for our workqueues, the discard has underscore instead of dash, so unify it. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:49 +02:00
David Sterba	13d6d866e8	btrfs: on unknown chunk allocation policy fallback to regular We have only two chunk allocation policies right now and the switch/cases don't handle an unknown one properly. The error is in the impossible category (the policy is stored only in memory), we don't have to BUG(), falling back to regular policy should be safe. Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:49 +02:00
David Sterba	3329d3d833	btrfs: reformat comments in acls_after_inode_item() Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:49 +02:00
David Sterba	f24d25544f	btrfs: switch int dev_replace_is_ongoing variables/parameters to bool Both the variable and the parameter are used as logical indicators so convert them to bool. Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:49 +02:00
David Sterba	f963e0128b	btrfs: trivial conversion to return bool instead of int Old code has a lot of int for bool return values, bool is recommended and done in new code. Convert the trivial cases that do simple 0/false and 1/true. Functions comment are updated if needed. Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:49 +02:00
Qu Wenruo	73d6bcf41b	btrfs: subpage: reject tree blocks which are not nodesize aligned When btrfs subpage support (fs block < page size) was introduced, a subpage filesystem will only reject tree blocks which cross page boundaries. This used to be a compromise to simplify the tree block handling and still allowing subpage cases to read some old converted filesystems which did not have proper chunk alignment. But in practice, suppose we have the following unaligned tree block on a 64K page sized system: 0 32K 44K 60K 64K \| \|///////////////\| \| Although btrfs has no problem reading the tree block at [44K, 60K), if extent allocator is allocating another tree block, it may choose the range [60K, 74K), as extent allocator has no awareness if it's a subpage metadata request or not. Then we'd get -EINVAL from the following sequence: btrfs_alloc_tree_block() \|- btrfs_reserve_extent() \| Which returned range [60K, 74K) \|- btrfs_init_new_buffer() \|- btrfs_find_create_tree_block() \|- alloc_extent_buffer() \|- check_eb_alignment() Which returned -EINVAL, because the range crosses page boundary. This situation will not fix itself and should mostly mark the fs read-only. Thankfully we didn't really get such reports in the real world because: - The original unaligned tree block is only caused by older btrfs-convert It's before the btrfs-convert rework was done in v4.6, where converted btrfs filesystem can have metadata block groups which are not aligned to nodesize nor stripe size (64K). But after btrfs-progs v4.6, all chunks allocated will be stripe (64K) aligned, thus no more such problem. Considering how old the fix is (v4.6 was released almost 10 years ago), subpage support for btrfs was introduced in v5.15, it should be safe to reject those unaligned tree blocks. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:49 +02:00
Daniel Vacek	406698623a	btrfs: move folio initialization to one place in attach_eb_folio_to_filemap() This is just a trivial change. The code looks a bit more readable this way, IMO. Move initialization of existing_folio to the beginning of the retry loop so it's set to NULL at one place. Signed-off-by: Daniel Vacek <neelx@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:49 +02:00
David Sterba	c779b7980c	btrfs: raid56: rename parameter err to status in endio helpers Trivial renames to unify the naming of blk_status_t variables/parameters. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:49 +02:00
David Sterba	853b5727c9	btrfs: change return type of btrfs_alloc_dummy_sum() to int The type blk_status_t is from block layer and not related to checksums in our context. Use int internally and do the conversions to blk_status_t as needed in btrfs_submit_chunk(). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:49 +02:00
David Sterba	d2080c7a00	btrfs: rename ret2 to ret in btrfs_submit_compressed_read() We can now rename 'ret2' to 'ret' and use it for generic errors. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:49 +02:00
David Sterba	a83134b48a	btrfs: rename ret to status in btrfs_submit_compressed_read() We're using 'status' for the blk_status_t variables, rename 'ret' so we can use it for generic errors. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:48 +02:00
David Sterba	79cbc151f9	btrfs: simplify reading bio status in end_compressed_writeback() We don't need to have a separate variable to read the bio status, 'ret' works for that just fine so remove 'error'. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:48 +02:00
David Sterba	9c0b0807ec	btrfs: rename error to ret in btrfs_submit_chunk() We can now rename 'error' to 'ret' and use it for generic errors. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:48 +02:00
David Sterba	beaa7cdb6a	btrfs: rename ret to status in btrfs_submit_chunk() We're using 'status' for the blk_status_t variables, rename 'ret' so we can use it for proper return type. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:48 +02:00
David Sterba	64c13195dd	btrfs: change return type of btrfs_bio_csum() to int The type blk_status_t is from block layer and not related to checksums in our context. Use int internally and do the conversions to blk_status_t as needed. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:48 +02:00
David Sterba	a24d185c36	btrfs: change return type of btree_csum_one_bio() to int The type blk_status_t is from block layer and not related to checksums in our context. Use int internally and do the conversions to blk_status_t as needed in btrfs_bio_csum(). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:48 +02:00
David Sterba	9b20d242af	btrfs: change return type of btrfs_csum_one_bio() to int The type blk_status_t is from block layer and not related to checksums in our context. Use int internally and do the conversions to blk_status_t as needed in btrfs_bio_csum(). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:48 +02:00
David Sterba	6f6e7e98b0	btrfs: change return type of btrfs_lookup_bio_sums() to int The type blk_status_t is from block layer and not related to checksums in our context. Use int internally and do the conversions to blk_status_t as needed in btrfs_submit_chunk(). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:48 +02:00
David Sterba	ae8ce87165	btrfs: drop redundant local variable in raid_wait_write_end_io() The bio status is read only once, no variable needed for that. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:48 +02:00
David Sterba	c0ee55f796	btrfs: merge __setup_root() to btrfs_alloc_root() There's only one caller of __setup_root() so merge it there. Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:48 +02:00
David Sterba	05a6ec865d	btrfs: use unsigned types for constants defined as bit shifts The unsigned type is a recommended practice (CWE-190, CWE-194) for bit shifts to avoid problems with potential unwanted sign extensions. Although there are no such cases in btrfs codebase, follow the recommendation. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:48 +02:00
David Sterba	d4d788a776	btrfs: remove unused btrfs_io_stripe::length First added (but not effectively used) in `02c372e1f0` ("btrfs: add support for inserting raid stripe extents"). The structure is initialized to zeros so the only use in btrfs_insert_one_raid_extent() u64 length = bioc->stripes[i].length; struct btrfs_raid_stride *raid_stride = &stripe_extent->strides[i]; if (length == 0) length = bioc->size; the 'if' always happens. Last use in `4016358e85` ("btrfs: remove unused variable length in btrfs_insert_one_raid_extent()") was an obvious cleanup. It seems to be safe to remove, raid-stripe-tree works without using it since 6.6. This was found by tool https://github.com/jirislaby/clang-struct . Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:47 +02:00
David Sterba	2d44a15afd	btrfs: use list_first_entry() everywhere Using the helper makes it a bit more clear that we're accessing the first list entry. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:47 +02:00
David Sterba	9e0a739a9e	btrfs: convert ASSERT(0) with handled errors to DEBUG_WARN() The use of ASSERT(0) is maybe useful for some cases but more like a notice for developers. Assertions can be compiled in independently so convert it to a debugging helper. The difference is that it's just a warning and will not end up in BUG(). The converted cases are in connection with proper error handling. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:47 +02:00
David Sterba	ed50ab0fec	btrfs: convert WARN_ON(IS_ENABLED(CONFIG_BTRFS_DEBUG)) to DEBUG_WARN Use the conditional warning instead of typing the whole condition. Optional message is printed where it seems clear what could be the problem. Conversion is left out in btree_csum_one_bio() because of the additional condition. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:47 +02:00
David Sterba	3db15c6ca6	btrfs: add debug build only WARN Add conditional WARN() wrapper that's enabled only in debug build. It should be used for unexpected conditions that should be noisy. Use it instead of ASSERT(0). As it will not lead to BUG() make sure that continuing is still possible, e.g. the error is handled anyway. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:47 +02:00
David Sterba	94cb8d7144	btrfs: use verbose ASSERT() in volumes.c The file volumes.c has about 40 assertions and half of them are suitable for ASSERT() with additional data. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:47 +02:00
David Sterba	19468a623a	btrfs: enhance ASSERT() to take optional format string Currently ASSERT() prints the stringified condition and without macro expansions so simple constants like BTRFS_MAX_METADATA_BLOCKSIZE remain readable in the output. There are expressions where we'd like to see the exact values but all we get is something like: assertion failed: em->start <= start && start < extent_map_end(em), in fs/btrfs/extent_map.c:613 It would be nice to be able to print any additional information to help understand the problem. With some preprocessor magic and compile-time optimizations we can enhance ASSERT to work like that as well: ASSERT(value > limit, "value=%llu limit=%llu", value, limit); with free-form printk arguments that will be part of the assertion message. Pros: - helps debugging and understanding reported problems - the optional format is verified at compile-time Cons: - increases the .ko size - writing the assertion code is repetitive (condition, format, values) - format and variable type must match (extra lookup) - needs gcc 8.x and newer, otherwise it's the short format Recommended use is for non-trivial expressions, so basic ASSERT(value) can be used for pointers or sometimes integers. The format has been slightly updated to also print the result of the evaluation of the condition, appended to the stringified condition as "condition :: <value>". Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:47 +02:00
Yangtao Li	023beaeca6	btrfs: remove BTRFS_REF_LAST from enum btrfs_ref_type Commit `b28b1f0ce4` ("btrfs: delayed-ref: Introduce better documented delayed ref structures") introduced BTRFS_REF_LAST, which can be used for sanity checking, e.g. in switch/case or for loops. In btrfs_ref_type() there is an assertion ASSERT(ref->type == BTRFS_REF_DATA \|\| ref->type == BTRFS_REF_METADATA); to validate the values so we don't need the ending enum. Signed-off-by: Yangtao Li <frank.li@vivo.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:47 +02:00
Christoph Hellwig	8d243aa9a8	btrfs: use bvec_kmap_local() in btrfs_decompress_buf2page() This removes the last direct poke into bvec internals in btrfs. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:47 +02:00
Christoph Hellwig	adbfd189c4	btrfs: scrub: use virtual addresses directly Instead of the old @page and @page_offset pair inside scrub, here we can directly use the virtual address for a sector. This has the following benefit: - Simplified parameters A single @kaddr will repair @page and @page_offset. - No more unnecessary kmap/kunmap calls Since all pages utilized by scrub is allocated by scrub, and no highmem is allowed, we do not need to do any kmap/kunmap. And add an ASSERT() inside the new scrub_stripe_get_kaddr() to catch any unexpected highmem page. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:47 +02:00
Qu Wenruo	cd678925e9	btrfs: raid56: store a physical address in structure sector_ptr Instead of using a @page + @pg_offset pair inside sector_ptr structure, use a single physical address instead. This allows us to grab both the page and offset from a single u64 value. Although we still need an extra bool value, @has_paddr, to distinguish if the sector is properly mapped (as the 0 physical address is totally valid). This change doesn't change the size of structure sector_ptr, but reduces the parameters of several functions. Note: the original idea and patch is from Christoph Hellwig (https://lore.kernel.org/linux-btrfs/20250409111055.3640328-7-hch@lst.de/) but the final implementation is different. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Christoph Hellwig <hch@lst.de> [ Use physical addresses instead to handle highmem. ] Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:46 +02:00
Christoph Hellwig	6f3f722df7	btrfs: simplify bvec iteration in index_one_bio() Flatten the two loops by open coding bio_for_each_segment() and advancing the iterator one sector at a time. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Qu Wenruo <wqu@suse.com> [ Fix a bug that @offset is not increased. ] Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:46 +02:00
Christoph Hellwig	959ddf2839	btrfs: move kmapping out of btrfs_check_sector_csum() Move kmapping the page out of btrfs_check_sector_csum(). This allows using bvec_kmap_local() where suitable and reduces the number of kmap*() calls in the raid56 code. This also means btrfs_check_sector_csum() will only accept a properly kmapped address. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:46 +02:00
Christoph Hellwig	3240b2c97b	btrfs: pass a physical address to btrfs_repair_io_failure() Using physical address has the following advantages: - All involved callers only need a single pointer Instead of the old @folio + @offset pair. - No complex poking into the bio_vec structure As a bio_vec can be single or multiple paged, grabbing the real page can be quite complex if the bio_vec is a multi-page one. Instead bvec_phys() will always give a single physical address, and it cab be easily converted to a page. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:46 +02:00
Christoph Hellwig	f6b2d8b134	btrfs: track the next file offset in struct btrfs_bio_ctrl The bio implementation is not something we should really mess around, and we shouldn't recalculate the pos from the folio over and over. Instead just track then end of the current bio in logical file offsets in the btrfs_bio_ctrl, which is much simpler and easier to read. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:46 +02:00
Christoph Hellwig	8cad6fed82	btrfs: remove the alignment checks in end_bbio_data_read() end_bbio_data_read() checks that each iterated folio fragment is aligned and justifies that with block drivers advancing the bio. But block driver only advance bi_iter, while end_bbio_data_read() uses bio_for_each_folio_all() to iterate the immutable bi_io_vec array that can't be changed by drivers at all. Furthermore btrfs has already did the alignment check of the file offset inside submit_one_sector(), and the size is fixed to fs block size, there is no need to re-do the alignment check again inside the endio function. So just remove the unnecessary alignment check along with the incorrect comment. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:46 +02:00
Charles Han	ecf5b757c7	btrfs: update and correct description of btrfs_get_or_create_delayed_node() The comment mistakenly says the function is returning PTR_ERR instead of ERR_PTR. Fix it and update it so it's more descriptive. Signed-off-by: Charles Han <hanchunchao@inspur.com> Reviewed-by: David Sterba <dsterba@suse.com> [ Enhance the function comment. ] Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:46 +02:00
Yangtao Li	ea2a8bacb1	btrfs: simplify return logic from btrfs_delayed_ref_init() Make this simpler by returning directly when there's no other cleanup needed. Signed-off-by: Yangtao Li <frank.li@vivo.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:46 +02:00
Yangtao Li	c900f415be	btrfs: reuse exit helper for cleanup in btrfs_bioset_init() Do not duplicate the cleanup after failed initialization in btrfs_bioset_init() and reuse the exit function btrfs_bioset_exit(). Signed-off-by: Yangtao Li <frank.li@vivo.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:46 +02:00
David Sterba	af4fc2818d	btrfs: rename iov_iter iterator parameter in btrfs_buffered_write() Using 'i' for a parameter is confusing and conforming to current preferences, so rename it to 'iter'. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:46 +02:00
Qu Wenruo	9b23008268	btrfs: enable large data folios support for defrag Currently we reject large folios for defrag gracefully, but the implementation itself is already mostly large folios compatible. There are several parts of defrag in btrfs: - Extent map checking Aka, defrag_collect_targets(), which prepares a list of target ranges that should be defragged. This part is completely folio unrelated, thus it doesn't care about the folio size. - Target folio preparation Aka, defrag_prepare_one_folio(), which lock and read (if needed) the target folio. Since folio read and lock are already supporting large folios, this part needs only minor changes. - Redirty the target range of the folio This is already done in a way supporting large folios. So it's pretty straightforward to enable large folios for defrag: - Do not reject large folios for experimental builds This affects the large folio check inside defrag_prepare_one_folio(). - Wait for ordered extents of the whole folio in defrag_prepare_one_folio() - Lock the whole extent range for all involved folios in defrag_one_range() - Allow the folios[] array to be partially empty Since we can have large folios, folios[] will not always be full. This affects: * How to allocate folios in defrag_one_range() Now we cannot use page index, but use the end position of the folio as an iterator. * How to free the folios[] array If we hit an empty slot, it means we have large folios and already hit the end of the array. * How to mark the range dirty Instead of use page index directly, we have to go through each folio, and check if the folio covers the defrag target inside defrag_one_locked_target(). Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:45 +02:00
Qu Wenruo	7bf9bfa946	btrfs: prepare compression paths for large data folios All compression algorithms inside btrfs are not supporting large folios due to the following points: - btrfs_calc_input_length() is assuming page sized folio - kmap_local_folio() usages are using offset_in_page() Prepare them to support large data folios by: - Add a folio parameter to btrfs_calc_input_length() And use that folio parameter to calculate the correct length. Since we're here, also add extra ASSERT()s to make sure the parameter @cur is inside the folio range. This affects only zlib and zstd. Lzo compresses at most one block at a time, thus not affected. - Use offset_in_folio() to calculate the kmap_local_folio() offset This affects all 3 algorithms. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:45 +02:00
Filipe Manana	9a36bad6c3	btrfs: rename __tree_search() to remove double underscore prefix There's no need to have a double underscore prefix as there's no variant of the function without it. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:45 +02:00
Filipe Manana	7e88669032	btrfs: rename __lookup_extent_mapping() to remove double underscore prefix There's no need to have a double underscore prefix as there's no variant of the function without it anymore. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:45 +02:00
Filipe Manana	d846a6d3b0	btrfs: rename remaining exported extent map functions Rename all the exported functions from extent_map.h that don't have a 'btrfs_' prefix in their names, so that they are consistent with all the other functions, to make it clear they are btrfs specific functions and to avoid potential name collisions in the future with functions defined elsewhere in the kernel. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:45 +02:00
Filipe Manana	ae98ae2a50	btrfs: rename functions to allocate and free extent maps These functions are exported and don't have a 'btrfs_' prefix in their names, which goes against coding style conventions. Rename them to have such prefix, making it clear they are from btrfs and avoiding potential collisions in the future with functions defined elsewhere outside btrfs. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:45 +02:00
Filipe Manana	2e871330ce	btrfs: rename extent map functions to get block start, end and check if in tree These functions are exported and don't have a 'btrfs_' prefix in their names, which goes against coding style conventions. Rename them to have such prefix, making it clear they are from btrfs and avoiding potential collisions in the future with functions defined elsewhere outside btrfs. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:45 +02:00
Filipe Manana	962162ffa6	btrfs: rename exported extent map compression functions These functions are exported and don't have a 'btrfs_' prefix in their names, which goes against coding style conventions. Rename them to have such prefix, making it clear they are from btrfs and avoiding potential collisions in the future with functions defined elsewhere outside btrfs. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:45 +02:00
Filipe Manana	81eb6ce8b5	btrfs: tracepoints: add btrfs prefix to names where it's missing Most of our tracepoints have the 'btrfs_' prefix in their names but a few of them are missing, making it inconsistent. So add the prefix to the ones that are missing it, creating consistency, making it clear for users these are btrfs tracepoints and eventually avoid name collisions with other tracepoints defined by other kernel subsystems. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:45 +02:00
Filipe Manana	9d072bfab5	btrfs: make btrfs_find_contiguous_extent_bit() return bool instead of int The function needs only to return true or false, so there's no need to return an integer. Currently it returns 0 when a range with the given bits is set and 1 when not found, which is a bit counter intuitive too. So change the function to return a bool instead, returning true when a range is found and false otherwise. Update the function's documentation to mention the return value too. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:44 +02:00
Filipe Manana	00ba32e5be	btrfs: remove double underscore prefix from __set_extent_bit() Now that set_extent_bit() was renamed to btrfs_set_extent_bit(), there's no need to have a __set_extent_bit() function, we can just remove the double underscore prefix, which we try to avoid according to the coding style conventions. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:44 +02:00
Filipe Manana	94bd699a08	btrfs: rename remaining exported functions from extent-io-tree.h Rename the remaning exported functions that don't have a 'btrfs_' prefix. By convention exported functions should have such prefix to make it clear they are btrfs specific and to avoid collisions with functions from elsewhere in the kernel. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:44 +02:00
Filipe Manana	b351161f4f	btrfs: rename free_extent_state() to include a btrfs prefix This is an exported function so it should have a 'btrfs_' prefix by convention, to make it clear it's btrfs specific and to avoid collisions with functions from elsewhere in the kernel. Rename the function to add 'btrfs_' prefix to it. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:44 +02:00
Filipe Manana	f81c2aea71	btrfs: rename the functions to count, test and get bit ranges in io trees These functions are exported so they should have a 'btrfs_' prefix by convention, to make it clear they are btrfs specific and to avoid collisions with functions from elsewhere in the kernel. So add a 'btrfs_' prefix to their names to make it clear they are from btrfs. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:44 +02:00
Filipe Manana	e965835c98	btrfs: rename the functions to init and release an extent io tree These functions are exported so they should have a 'btrfs_' prefix by convention, to make it clear they are btrfs specific and to avoid collisions with functions from elsewhere in the kernel. So add a 'btrfs_' prefix to their name to make it clear they are from btrfs. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:44 +02:00
Filipe Manana	20612db462	btrfs: directly grab inode at __btrfs_debug_check_extent_io_range() We've tested that we are dealing with io tree that is associated to an inode (its owner is IO_TREE_INODE_IO), so there's no need to call btrfs_extent_io_tree_to_inode() in a separate line and we just assign tree->inode to the local inode variable when we declare it. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:44 +02:00
Filipe Manana	02c340c278	btrfs: rename the functions to get inode and fs_info from an extent io tree These functions are exported so they should have a 'btrfs_' prefix by convention, to make it clear they are btrfs specific and to avoid collisions with functions from elsewhere in the kernel. So add a 'btrfs_' prefix to their name to make it clear they are from btrfs. Also remove the 'const' suffix from extent_io_tree_to_inode_const() since there's no non-const variant anymore and makes the naming consistent with extent_io_tree_to_fs_info() (no 'const' suffix and returns a const pointer). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:44 +02:00
Filipe Manana	66da9c1bed	btrfs: rename the functions to search for bits in extent ranges These functions are exported so they should have a 'btrfs_' prefix by convention, to make it clear they are btrfs specific and to avoid collisions with functions from elsewhere in the kernel. So add a 'btrfs_' prefix to their name to make it clear they are from btrfs. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:44 +02:00
Filipe Manana	791b3455ac	btrfs: rename set_extent_bit() to include a btrfs prefix This is an exported function so it should have a 'btrfs_' prefix by convention, to make it clear it's btrfs specific and to avoid collisions with functions from elsewhere in the kernel. So rename it to btrfs_set_extent_bit(). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:44 +02:00
Filipe Manana	9d222562b4	btrfs: rename the functions to clear bits for an extent range These functions are exported so they should have a 'btrfs_' prefix by convention, to make it clear they are btrfs specific and to avoid collisions with functions from elsewhere in the kernel. One of them has a double underscore prefix which is also discouraged. So remove double underscore prefix where applicable and add a 'btrfs_' prefix to their name to make it clear they are from btrfs. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:43 +02:00
Filipe Manana	2cb9ac3faa	btrfs: rename __lock_extent() and __try_lock_extent() These functions are exported so they should have a 'btrfs_' prefix by convention, to make it clear they are btrfs specific and to avoid collisions with functions from elsewhere in the kernel. Their double underscore prefix is also discouraged. So remove their double underscore prefix, add a 'btrfs_' prefix to their name to make it clear they are from btrfs and a '_bits' suffix to avoid collision with btrfs_lock_extent() and btrfs_try_lock_extent(). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:43 +02:00
Filipe Manana	b696440e5e	btrfs: add btrfs prefix to dio lock and unlock extent functions These functions are exported so they should have a 'btrfs_' prefix by convention, to make it clear they are btrfs specific and to avoid collisions with functions from elsewhere in the kernel. So add a prefix to their name. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:43 +02:00
Filipe Manana	242570e80b	btrfs: add btrfs prefix to main lock, try lock and unlock extent functions These functions are exported so they should have a 'btrfs_' prefix by convention, to make it clear they are btrfs specific and to avoid collisions with functions from elsewhere in the kernel. So add a prefix to their name. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:43 +02:00
Filipe Manana	41708a4c23	btrfs: add btrfs prefix to trace events for extent state alloc and free These trace events don't have the 'btrfs_' prefix in their name, unlike the other trace events from extent-io-tree.c. So add the prefix to make them consistent and follow coding style conventions too. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:43 +02:00
Filipe Manana	024b3bc190	btrfs: remove extent_io_tree_to_inode() and is_inode_io_tree() These functions aren't used outside extent-io-tree.c, but yet one of them (extent_io_tree_to_inode()) is unnecessarily exported in the header. Furthermore their single use is in a pattern like this: if (is_inode_io_tree(tree)) foo(extent_io_tree_to_inode(tree), ...); So we're effectively unnecessarily adding more indirection, checking twice if tree->owner == IO_TREE_INODE_IO before getting the inode and doing a non-inline function call to get tree->inode. Simplify this by removing these helper functions and instead doing thing like this: if (tree->owner == IO_TREE_INODE_IO) foo(tree->inode, ...); Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:43 +02:00
David Sterba	9633f48190	btrfs: tree-checker: more unlikely annotations Add more unlikely annotations to branches that lead to EUCLEAN, overall in the tree checker this helps to reorder instructions for the no-error case. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:43 +02:00
Qu Wenruo	2b14b74b99	btrfs: use folio_contains() for EOF detection Currently we use the following pattern to detect if the folio contains the end of a file: if (folio->index == end_index) folio_zero_range(); But that only works if the folio is page sized. For the following case, it will not work and leave the range beyond EOF uninitialized: The page size is 4K, and the fs block size is also 4K. 16K 20K 24K \| \| \| \| \| EOF at 22K And we have a large folio sized 8K at file offset 16K. In that case, the old "folio->index == end_index" will not work, thus the range [22K, 24K) will not be zeroed out. Fix the following call sites which use the above pattern: - add_ra_bio_pages() - extent_writepage() Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:43 +02:00
Qu Wenruo	e1fcad644b	btrfs: remove unnecessary early exits in delalloc folio lock and unlock Inside functions unlock_delalloc_folio() and lock_delalloc_folios(), we have the following early exits: if (index == locked_folio->index && end_index == index) return; This allows us to exit early if the range is inside the same locked folio. However the current check relies on page sized folios, if we have a large folio that contains @index but not at @index, then the early exit will no longer trigger. Furthermore without the above early check, the existing code can handle it well, as both __process_folios_contig() and lock_delalloc_folios() will skip any folio page lock/unlock if it's on the locked folio. Here we remove the early exits and let the existing code handle the same index case, to make the code a little simpler. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:43 +02:00
Qu Wenruo	05efe3eb3b	btrfs: zlib: prepare copy_data_into_buffer() for large data folios The function itself is already taking large folios into consideration, just remove the ASSERT(!folio_test_large()) line. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:42 +02:00
Qu Wenruo	3a8f948633	btrfs: subpage: prepare for large data folios The subpage handling code has two locations not supporting large folios: - btrfs_attach_subpage() Which is doing a metadata specific ASSERT() check. But for the future large data folios support, that check is too generic. Since it's metadata specific, only check the ASSERT() for metadata. - btrfs_subpage_assert() Just remove the "ASSERT(folio_order(folio) == 0)" check. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:42 +02:00
Qu Wenruo	c08d45de63	btrfs: prepare end_bbio_data_write() for large data folios The function is doing an ASSERT() checking the folio order, but all later functions are handling large folios properly, thus we can safely remove that ASSERT(). Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:42 +02:00
Qu Wenruo	b4e9aaad09	btrfs: prepare prepare_one_folio() for large data folios The only blockage is the ASSERT() rejecting large folios, just remove it. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:42 +02:00
Qu Wenruo	f45e538b00	btrfs: prepare btrfs_page_mkwrite() for large data folios The function btrfs_page_mkwrite() has an explicit ASSERT() checking the folio order. To make it support large data folios, we need to: - Remove the ASSERT(folio_order(folio) == 0) - Use folio_contains() to check if the folio covers the last page Otherwise the code is already supporting large folios well. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:42 +02:00
Qu Wenruo	a4a636a437	btrfs: send: prepare put_file_data() for large data folios Currently put_file_data() can only accept a page sized folio. However the function itself is not that complex, it's just copying data from filemap folio into the send buffer. Make it support large data folios: - Change the loop to use file offset instead of page index - Calculate @pg_offset and @cur_len after getting the folio - Remove the "WARN_ON(folio_order(folio));" line Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:42 +02:00
Qu Wenruo	70a376475d	btrfs: send: remove the again label inside put_file_data() The again label is here to retry to get the folio for the current index. When triggering that label, there is no advance of the iterator. So it can be replaced by a simple "continue" and remove the again label. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:42 +02:00
David Sterba	dcb5bcccb7	btrfs: use BTRFS_PATH_AUTO_FREE in btrfs_insert_inode_extref() This is the trivial pattern for path auto free, initialize at the beginning and free at the end with simple goto -> return conversions. Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:42 +02:00
David Sterba	f6a359e307	btrfs: use BTRFS_PATH_AUTO_FREE in btrfs_del_inode_extref() This is the trivial pattern for path auto free, initialize at the beginning and free at the end with simple goto -> return conversions. Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:42 +02:00
David Sterba	c7341d0337	btrfs: use BTRFS_PATH_AUTO_FREE in btrfs_encoded_read_inline() This is the trivial pattern for path auto free, initialize at the beginning and free at the end with simple goto -> return conversions. Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:42 +02:00
David Sterba	5e8632035a	btrfs: use BTRFS_PATH_AUTO_FREE in can_nocow_extent() This is the trivial pattern for path auto free, initialize at the beginning and free at the end with simple goto -> return conversions. Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:41 +02:00
David Sterba	2c5563a394	btrfs: use BTRFS_PATH_AUTO_FREE in btrfs_set_inode_index_count() This is the trivial pattern for path auto free, initialize at the beginning and free at the end with simple goto -> return conversions. Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:41 +02:00
David Sterba	516748f584	btrfs: use BTRFS_PATH_AUTO_FREE in may_destroy_subvol() This is the trivial pattern for path auto free, initialize at the beginning and free at the end with simple goto -> return conversions. Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:41 +02:00
David Sterba	e235418118	btrfs: do more trivial BTRFS_PATH_AUTO_FREE conversions The most trivial pattern for the auto freeing when the variable is declared with the macro and the final btrfs_free_path() is removed. There are almost none goto -> return conversions and there's no other function cleanup. Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:41 +02:00
Filipe Manana	c6a43322a3	btrfs: remove redundant record start offset check at test_range_bit() It's pointless to check if the current record's start offset is greater than the end offset, as before we just tested if it was greater than the start offset - and if it's not it means it's less than or equal to the start offset, so it can not be greater than the end offset, as our start offset is always smaller than the end offset. So remove that check and also add an assertion to verify the start offset is smaller then the end offset. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:41 +02:00
Filipe Manana	53828c759a	btrfs: simplify last record detection at test_range_bit() The overflow detection for the start offset of the next record is not really necessary, we can just stop iterating if the current record ends at or after out end offset. This removes the need to test if the current record end offset is (u64)-1 and to check if adding 1 to the current end offset results in 0. By testing only if the current record ends at or after the end offset, we also don't need anymore to test the new start offset at the head of the while loop. This makes both the source code and assembly code simpler, more efficient and shorter (reducing the object text size). Also remove the pointless initialization to NULL of the state variable, as we don't use it before the first assignment to it. This may help avoid some warnings with clang tools such as the one reported/fixed by commit `966de47ff0` ("btrfs: remove redundant initialization of variables in log_new_ancestors"). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:41 +02:00
Filipe Manana	c54c245f80	btrfs: remove redundant check at find_first_extent_bit_state() The tree_search() function always returns an entry that either contains the search offset or the first entry in the tree that starts after the offset. So checking at find_first_extent_bit_state() if the returned entry ends at or after the search offset is pointless. Remove the check. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:41 +02:00
Filipe Manana	56ec21a6dd	btrfs: fix documentation for tree_search_for_insert() There are several things wrong with the documentation: 1) At the top it's only mentioned that we search for an entry containing the given offset, but when such entry does not exists we search for the first entry that starts and ends after that offset; 2) It mentions that @node_ret and @parent_ret aren't changed if the returned entry contains the given offset - that is true only if the returned entry starts exactly at @offset, otherwise those arguments are changed; 3) It mentions that if no entry containing offset is found then we return the first entry ending before the offset - that is not true, we return the first entry that starts and ends after that offset; 4) It also mentions that NULL is never returned. This is false as in case there's no entry containing offset or any entry that starts and ends after offset, NULL is returned. So fix the documentation. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:41 +02:00
Filipe Manana	131a4be1c0	btrfs: simplify last record detection at test_range_bit_exists() Instead of keeping track of the minimum start offset of the next record and detecting overflow every time we update that offset to be the sum of current record's end offset plus one, we can simply exit when the current record ends at or beyond our end offset and forget about updating the start offset on every iteration and testing for it at the top of the loop. This makes both the source code and assembly code simpler, more efficient and shorter (reducing the object text size). Also remove the pointless initialization to NULL of the state variable, as we don't use it before the first assignment to it. This may help avoid some warnings with clang tools such as the one reported/fixed by commit `966de47ff0` ("btrfs: remove redundant initialization of variables in log_new_ancestors"). Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:41 +02:00
Filipe Manana	c4e33a8431	btrfs: use clear_extent_bits() instead of clear_extent_bit() where possible Several places are using clear_extent_bit() and passing a NULL value for the 'cached' argument, which is pointless as they can use instead clear_extent_bits(). Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:41 +02:00
Filipe Manana	39c5714cb4	btrfs: use clear_extent_bits() at chunk_map_device_clear_bits() Instead of using __clear_extent_bit() we can use clear_extent_bits() since we pass a NULL value for the cached and changeset arguments. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:40 +02:00
Filipe Manana	c757c024fc	btrfs: use clear_extent_bit() at try_release_extent_state() Instead of using __clear_extent_bit() we can use clear_extent_bit() since we pass a NULL value for the changeset argument. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:40 +02:00
Qu Wenruo	af566bdaff	btrfs: fix the file offset calculation inside btrfs_decompress_buf2page() [BUG WITH EXPERIMENTAL LARGE FOLIOS] When testing the experimental large data folio support with compression, there are several ASSERT()s triggered from btrfs_decompress_buf2page() when running fsstress with compress=zstd mount option: - ASSERT(copy_len) from btrfs_decompress_buf2page() - VM_BUG_ON(offset + len > PAGE_SIZE) from memcpy_to_page() [CAUSE] Inside btrfs_decompress_buf2page(), we need to grab the file offset from the current bvec.bv_page, to check if we even need to copy data into the bio. And since we're using single page bvec, and no large folio, every page inside the folio should have its index properly setup. But when large folios are involved, only the first page (aka, the head page) of a large folio has its index properly initialized. The other pages inside the large folio will not have their indexes properly initialized. Thus the page_offset() call inside btrfs_decompress_buf2page() will result garbage, and completely screw up the @copy_len calculation. [FIX] Instead of using page->index directly, go with page_pgoff(), which can handle non-head pages correctly. So introduce a helper, file_offset_from_bvec(), to get the file offset from a single page bio_vec, so the copy_len calculation can be done correctly. Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:40 +02:00
David Sterba	6aa79c4f25	btrfs: use rb_entry_safe() where possible to simplify code Simplify conditionally reading an rb_entry(), there's the rb_entry_safe() helper that checks the node pointer for NULL so we don't have to write it explicitly. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:40 +02:00
Filipe Manana	c4669e4a8b	btrfs: pass a pointer to get_range_bits() to cache first search result Allow get_range_bits() to take an extent state pointer to pointer argument so that we can cache the first extent state record in the target range, so that a caller can use it for subsequent operations without doing a full tree search. Currently the only user is try_release_extent_state(), which then does a call to __clear_extent_bit() which can use such a cached state record. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:40 +02:00
Filipe Manana	32c523c578	btrfs: allow folios to be released while ordered extent is finishing When the release_folio callback (from struct address_space_operations) is invoked we don't allow the folio to be released if its range is currently locked in the inode's io_tree, as it may indicate the folio may be needed by the task that locked the range. However if the range is locked because an ordered extent is finishing, then we can safely allow the folio to be released because ordered extent completion doesn't need to use the folio at all. When we are under memory pressure, the kernel starts writeback of dirty pages (folios) with the goal of releasing the pages from the page cache after writeback completes, however this often is not possible on btrfs because: * Once the writeback completes we queue the ordered extent completion; * Once the ordered extent completion starts, we lock the range in the inode's io_tree (at btrfs_finish_one_ordered()); * If the release_folio callback is called while the folio's range is locked in the inode's io_tree, we don't allow the folio to be released, so the kernel has to try to release memory elsewhere, which may result in triggering more writeback or releasing other pages from the page cache which may be more useful to have around for applications. In contrast, when the release_folio callback is invoked after writeback finishes and before ordered extent completion starts or locks the range, we allow the folio to be released, as well as when the release_folio callback is invoked after ordered extent completion unlocks the range. Improve on this by detecting if the range is locked for ordered extent completion and if it is, allow the folio to be released. This detection is achieved by adding a new extent flag in the io_tree that is set when the range is locked during ordered extent completion. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:40 +02:00
Filipe Manana	cbfb4cbf45	btrfs: update comment for try_release_extent_state() Drop reference to pages from the comment since the function is fully folio aware and works regardless of how many pages are in the folio. Also while at it, capitalize the first word and make it more explicit that release_folio is a callback from struct address_space_operations. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:40 +02:00
Qu Wenruo	1e5773e0ba	btrfs: prepare btrfs_punch_hole_lock_range() for large data folios The function btrfs_punch_hole_lock_range() needs to make sure there is no other folio in the range, thus it goes with filemap_range_has_page(), which works pretty fine. But if we have large folios, under the following case filemap_range_has_page() will always return true, forcing btrfs_punch_hole_lock_range() to do a very time consuming busy loop: start end \| \| \|//\|//\|//\|//\| \| \| \| \| \| \| \| \|//\|//\| \ / \ / Folio A Folio B In the above case, folio A and B contain our start/end indexes, and there are no other folios in the range. Thus we do not need to retry inside btrfs_punch_hole_lock_range(). To prepare for large data folios, introduce a helper, check_range_has_page(), which will: - Shrink the search range towards page boundaries If the rounded down end (exclusive, otherwise it can underflow when @end is inside the folio at file offset 0) is no larger than the rounded up start, it means the range contains no other pages other than the ones covering @start and @end. Can return false directly in that case. - Grab all the folios inside the range - Skip any large folios that cover the start and end indexes - If any other folios are found return true - Otherwise return false This new helper is going to handle both large folios and regular ones. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:40 +02:00
Qu Wenruo	be8ef7990c	btrfs: prepare btrfs_buffered_write() for large data folios This involves the following modifications: - Set the order flags for __filemap_get_folio() inside prepare_one_folio() This will allow __filemap_get_folio() to create a large folio if the address space supports it. - Limit the initial @write_bytes inside copy_one_range() If the largest folio boundary splits the initial write range, there is no way we can write beyond the largest folio boundary. This is done by a simple helper calc_write_bytes(). - Release exceeding reserved space if the folio is smaller than expected Which is doing the same handling when short copy happens. All the preparations should not change the behavior when the largest folio order is 0. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:40 +02:00
Qu Wenruo	581bb9e761	btrfs: refactor how we handle reserved space inside copy_one_range() There are several things not ideal in copy_one_range(): - Unnecessary temporary variables * block_offset * reserve_bytes * dirty_blocks * num_blocks * release_bytes These are utilized to handle short-copy cases. - Inconsistent handling of btrfs_delalloc_release_extents() There is a hidden behavior that, after reserving metadata for X bytes of data write, we have to call btrfs_delalloc_release_extents() with X once and only once. Calling btrfs_delalloc_release_extents(X - 4K) and btrfs_delalloc_release_extents(4K) will cause outstanding extents accounting to go wrong. This is because the outstanding extents mechanism is not designed to handle shrinking of reserved space. Improve above situations by: - Use a single @reserved_start and @reserved_len pair Now we reserve space for the initial range, and if a short copy happened and we need to shrink the reserved space, we can easily calculate the new length, and update @reserved_len. - Introduce helpers to shrink reserved data and metadata space This is done by two new helpers, shrink_reserved_space() and btrfs_delalloc_shrink_extents(). The later will do a better calculation if we need to modify the outstanding extents, and the first one will be utilized inside copy_one_range(). - Manually unlock, release reserved space and return if no byte is copied Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:40 +02:00
Filipe Manana	5c41f6010e	btrfs: remove EXTENT_UPTODATE io tree flag The EXTENT_UPTODATE io tree flag is now used only to mark ranges in the fs_info->excluded_extents as used by super blocks and not available for extent allocation (to prevent adding those ranges as free space in the in memory space caches). As we can use any flag for that purpose, and we are using EXTENT_DIRTY for the pinned extents io tree for example, remove the EXTENT_UPTODATE flag and use instead EXTENT_DIRTY for the excluded extents io tree. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:39 +02:00
Filipe Manana	db3f796c7c	btrfs: stop searching for EXTENT_DIRTY bit in the excluded extents io tree At btrfs_add_new_free_space() we keep searching for ranges in the excluded extents io tree that have the EXTENT_DIRTY bit set, however we never ever set that bit for ranges in that tree. That is a leftover from when that function used the global freed extents trees (fs_info->freed_extents[2]), where we used both the EXTENT_DIRTY and EXTENT_UPTODATE bits, but those trees are gone with commit `fe119a6eeb` ("btrfs: switch to per-transaction pinned extents"), which introduced the fs_info->excluded_extents io tree, where only EXTENT_UPTODATE is set. So remove the EXTENT_DIRTY bit search at btrfs_add_new_free_space(). Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:39 +02:00
Filipe Manana	d2c41835fd	btrfs: remove leftover EXTENT_UPTODATE clear from an inode's io_tree After commit `52b029f427` ("btrfs: remove unnecessary EXTENT_UPTODATE state in buffered I/O path") we never set EXTENT_UPTODATE in an inode's io_tree anymore, but we still have some code attempting to clear that bit from an inode's io_tree. Remove that code as it doesn't do anything anymore. The sole use of the EXTENT_UPTODATE bit is for the excluded extents io_tree (fs_info->excluded_extents), which is used to track the locations of super blocks, so that their ranges are never marked as free, making them unavailable for extent allocation. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:39 +02:00
Filipe Manana	5e85262e54	btrfs: fix fsync of files with no hard links not persisting deletion If we fsync a file (or directory) that has no more hard links, because while a process had a file descriptor open on it, the file's last hard link was removed and then the process did an fsync against the file descriptor, after a power failure or crash the file still exists after replaying the log. This behaviour is incorrect since once an inode has no more hard links it's not accessible anymore and we insert an orphan item into its subvolume's tree so that the deletion of all its items is not missed in case of a power failure or crash. So after log replay the file shouldn't exist anymore, which is also the behaviour on ext4, xfs, f2fs and other filesystems. Fix this by not ignoring inodes with zero hard links at btrfs_log_inode_parent() and by committing an inode's delayed inode when we are not doing a fast fsync (either BTRFS_INODE_COPY_EVERYTHING or BTRFS_INODE_NEEDS_FULL_SYNC is set in the inode's runtime flags). This last step is necessary because when removing the last hard link we don't delete the corresponding ref (or extref) item, instead we record the change in the inode's delayed inode with the BTRFS_DELAYED_NODE_DEL_IREF flag, so that when the delayed inode is committed we delete the ref/extref item from the inode's subvolume tree - otherwise the logging code will log the last hard link and therefore upon log replay the inode is not deleted. The base code for a fstests test case that reproduces this bug is the following: . ./common/dmflakey _require_scratch _require_dm_target flakey _require_mknod _scratch_mkfs >>$seqres.full 2>&1 \|\| _fail "mkfs failed" _require_metadata_journaling $SCRATCH_DEV _init_flakey _mount_flakey touch $SCRATCH_MNT/foo # Commit the current transaction and persist the file. _scratch_sync # A fifo to communicate with a background xfs_io process that will # fsync the file after we deleted its hard link while it's open by # xfs_io. mkfifo $SCRATCH_MNT/fifo tail -f $SCRATCH_MNT/fifo \| \ $XFS_IO_PROG $SCRATCH_MNT/foo >>$seqres.full & XFS_IO_PID=$! # Give some time for the xfs_io process to open a file descriptor for # the file. sleep 1 # Now while the file is open by the xfs_io process, delete its only # hard link. rm -f $SCRATCH_MNT/foo # Now that it has no more hard links, make the xfs_io process fsync it. echo "fsync" > $SCRATCH_MNT/fifo # Terminate the xfs_io process so that we can unmount. echo "quit" > $SCRATCH_MNT/fifo wait $XFS_IO_PID unset XFS_IO_PID # Simulate a power failure and then mount again the filesystem to # replay the journal/log. _flakey_drop_and_remount # We don't expect the file to exist anymore, since it was fsynced when # it had no more hard links. [ -f $SCRATCH_MNT/foo ] && echo "file foo still exists" _unmount_flakey # success, all done echo "Silence is golden" status=0 exit A test case for fstests will be submitted soon. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:39 +02:00
Mark Harmstone	846b534075	btrfs: fix typo in space info explanation There's an explanation of how space info works at the top of fs/btrfs/space-info.c, which makes reference to a variable called bytes_may_reserve. There's nothing called that in the code, and wasn't at time the comment was written; as far I can tell this is a typo, and it should actually be bytes_may_use. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Mark Harmstone <maharmstone@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:39 +02:00
Daniel Vacek	062f3d02a2	btrfs: remove unused flag EXTENT_BUFFER_IN_TREE This flag is set after inserting the eb to the buffer tree and cleared on it's removal. It was added in commit `34b41acec1` ("Btrfs: use a bit to track if we're in the radix tree") and wanted to make use of it, `faa2dbf004` ("Btrfs: add sanity tests for new qgroup accounting code"). Both are 10+ years old, we can remove the flag. Signed-off-by: Daniel Vacek <neelx@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:39 +02:00
Daniel Vacek	c61660ec34	btrfs: remove unused flag EXTENT_BUFFER_CORRUPT This flag is no longer being used. It was added by commit `a826d6dcb3` ("Btrfs: check items for correctness as we search") but it's no longer being used after commit `f26c923860` ("btrfs: remove reada infrastructure"). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Daniel Vacek <neelx@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:39 +02:00
Daniel Vacek	350362e95f	btrfs: remove unused flag EXTENT_BUFFER_READAHEAD This flag is no longer being used. It was added by commit `ab0fff0305` ("btrfs: add READAHEAD extent buffer flag") and used in commits: `79fb65a1f6` ("Btrfs: don't call readahead hook until we have read the entire eb") `78e62c02ab` ("btrfs: Remove extent_io_ops::readpage_io_failed_hook") `371cdc0700` ("btrfs: introduce subpage metadata validation check") Finally all the code using it was removed by commit `f26c923860` ("btrfs: remove reada infrastructure"). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Daniel Vacek <neelx@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:39 +02:00
Daniel Vacek	40f47f6d72	btrfs: remove unused flag EXTENT_BUFFER_READ_ERR This flag was added by commit `656f30dba7` ("Btrfs: be aware of btree inode write errors to avoid fs corruption") but it stopped being used after commit `046b562b20` ("btrfs: use a separate end_io handler for read_extent_buffer"). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Daniel Vacek <neelx@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:39 +02:00
Qu Wenruo	ced47a4db4	btrfs: factor out the main loop of btrfs_buffered_write() into a helper Inside the main loop of btrfs_buffered_write() we are doing a lot of heavy lifting inside a while() loop. This makes it pretty hard to read, factor out the content into a helper, copy_one_range() to do the work. This has no functional change, but with some minor variable renames, e.g. rename all "sector" into "block". Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:39 +02:00
Qu Wenruo	af821cba72	btrfs: factor out space reservation code from btrfs_buffered_write() Inside the main loop of btrfs_buffered_write(), we have a complex data and metadata space reservation code, which tries to reserve space for a COW write, if failed then fallback to check if we can do a NOCOW write. Factor out that part of code into a dedicated helper, reserve_space(), to make the main loop a little easier to read. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:39 +02:00

... 3 4 5 6 7 ...

14440 Commits