The function btrfs_punch_hole_lock_range() needs to make sure there is
no other folio in the range, so it uses filemap_range_has_page(),
which works fine for regular folios.
But with large folios, in the following case
filemap_range_has_page() will always return true, forcing
btrfs_punch_hole_lock_range() into a very time-consuming busy loop:
        start                            end
        |                                |
  |//|//|//|//|  |  |  |  |  |  |  |  |//|//|
   \         /                         \   /
    Folio A                           Folio B
In the above case, folio A and B contain our start/end indexes, and there
are no other folios in the range. Thus we do not need to retry inside
btrfs_punch_hole_lock_range().
To prepare for large data folios, introduce a helper,
check_range_has_page(), which will:
- Shrink the search range towards page boundaries
If the rounded down end (exclusive, otherwise it can underflow when @end
is inside the folio at file offset 0) is no larger than the rounded up
start, it means the range contains no other pages other than the ones
covering @start and @end.
Can return false directly in that case.
- Grab all the folios inside the range
- Skip any large folios that cover the start and end indexes
- If any other folios are found return true
- Otherwise return false
This new helper is going to handle both large folios and regular ones.
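The steps above can be modeled in userspace. The following is a minimal
sketch, not the kernel helper itself: struct model_folio,
MODEL_PAGE_SIZE and model_range_has_page() are hypothetical names, the
folios are passed in as a plain array instead of being looked up in the
page cache, and @start/@end are treated as inclusive byte offsets.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096ULL /* assumed page size for this model */

/* Hypothetical stand-in for a page cache folio: [start, start + len). */
struct model_folio {
	uint64_t start; /* file offset in bytes, page aligned */
	uint64_t len;   /* size in bytes, a multiple of the page size */
};

/*
 * Return true if any folio lies entirely inside the page-aligned part of
 * the inclusive byte range [start, end].
 */
static bool model_range_has_page(const struct model_folio *folios, size_t nr,
				 uint64_t start, uint64_t end)
{
	/* Shrink the search range towards page boundaries. */
	const uint64_t search_start =
		(start + MODEL_PAGE_SIZE - 1) / MODEL_PAGE_SIZE * MODEL_PAGE_SIZE;
	/* Exclusive end, so it cannot underflow when @end is in the first page. */
	const uint64_t search_end = (end + 1) / MODEL_PAGE_SIZE * MODEL_PAGE_SIZE;

	assert(start <= end);

	/* Only the pages covering @start and @end are involved. */
	if (search_end <= search_start)
		return false;

	for (size_t i = 0; i < nr; i++) {
		const uint64_t folio_end = folios[i].start + folios[i].len;

		/* No overlap with the shrunk range at all. */
		if (folio_end <= search_start || folios[i].start >= search_end)
			continue;
		/* A large folio that begins before the range: it covers @start. */
		if (folios[i].start < search_start)
			continue;
		/* A large folio that extends past the range: it covers @end. */
		if (folio_end > search_end)
			continue;
		return true;
	}
	return false;
}
```

With folio A at [0, 16K) and folio B at [48K, 64K), a range that starts
inside A and ends inside B reports no other folio, so the caller in this
model would not need to retry.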
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This involves the following modifications:
- Set the order flags for __filemap_get_folio() inside
prepare_one_folio()
This will allow __filemap_get_folio() to create a large folio if the
address space supports it.
- Limit the initial @write_bytes inside copy_one_range()
If the largest folio boundary splits the initial write range, there is
no way we can write beyond the largest folio boundary.
This is done by a simple helper calc_write_bytes().
- Release excess reserved space if the folio is smaller than expected
This uses the same handling as when a short copy happens.
All the preparations should not change the behavior when the largest
folio order is 0.
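The write length limit described above comes down to clamping against
the next largest-folio boundary. A minimal sketch of that arithmetic,
assuming the largest folio size is a power of two; the name
model_calc_write_bytes() and its parameters are illustrative and not
taken from the patch:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Clamp a write of @len bytes starting at file offset @start so that it does
 * not cross the next @max_folio_size boundary. @max_folio_size is assumed to
 * be a power of two (page size << largest folio order).
 */
static uint64_t model_calc_write_bytes(uint64_t start, uint64_t len,
				       uint64_t max_folio_size)
{
	uint64_t boundary;
	uint64_t room;

	assert(max_folio_size != 0 &&
	       (max_folio_size & (max_folio_size - 1)) == 0);

	/* First boundary strictly after @start. */
	boundary = (start | (max_folio_size - 1)) + 1;
	room = boundary - start;
	return len < room ? len : room;
}
```

When the largest folio order is 0 the boundary is simply the next page
boundary, which is consistent with the statement that the behavior
should not change in that case.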
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are several things not ideal in copy_one_range():
- Unnecessary temporary variables
* block_offset
* reserve_bytes
* dirty_blocks
* num_blocks
* release_bytes
These are utilized to handle short-copy cases.
- Inconsistent handling of btrfs_delalloc_release_extents()
There is a hidden behavior that, after reserving metadata for X bytes
of data write, we have to call btrfs_delalloc_release_extents() with X
once and only once.
Calling btrfs_delalloc_release_extents(X - 4K) and
btrfs_delalloc_release_extents(4K) will cause outstanding extents
accounting to go wrong.
This is because the outstanding extents mechanism is not designed to
handle shrinking of reserved space.
Improve the above situations by:
- Use a single @reserved_start and @reserved_len pair
Now we reserve space for the initial range, and if a short copy
happened and we need to shrink the reserved space, we can easily
calculate the new length, and update @reserved_len.
- Introduce helpers to shrink reserved data and metadata space
This is done by two new helpers, shrink_reserved_space() and
btrfs_delalloc_shrink_extents().
The latter will do a better calculation if we need to modify the
outstanding extents, and the former will be utilized inside
copy_one_range().
- Manually unlock, release reserved space and return if no byte is
copied
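The new length after a short copy can be sketched as below. This is a
userspace model under stated assumptions, not the code from the patch:
model_shrunk_reserved_len() is a hypothetical name, and the reservation
is assumed to start and end on block boundaries.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Given a reservation [reserved_start, reserved_start + reserved_len) and a
 * short copy that stopped at byte offset @copied_end (exclusive), return the
 * reserved length that is still needed, rounded up to @blocksize.
 */
static uint64_t model_shrunk_reserved_len(uint64_t reserved_start,
					  uint64_t reserved_len,
					  uint64_t copied_end,
					  uint64_t blocksize)
{
	uint64_t last_block;

	assert(blocksize != 0);
	assert(copied_end >= reserved_start);
	assert(copied_end <= reserved_start + reserved_len);

	/* Keep every block that received at least one copied byte. */
	last_block = (copied_end + blocksize - 1) / blocksize * blocksize;
	return last_block - reserved_start;
}
```

The difference between the old and the new length is what a shrink
helper would hand back; a result of 0 corresponds to the case where no
byte was copied and everything is released.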
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The EXTENT_UPTODATE io tree flag is now used only to mark ranges in the
fs_info->excluded_extents as used by super blocks and not available for
extent allocation (to prevent adding those ranges as free space in the
in memory space caches). As we can use any flag for that purpose, and
we are using EXTENT_DIRTY for the pinned extents io tree for example,
remove the EXTENT_UPTODATE flag and use instead EXTENT_DIRTY for the
excluded extents io tree.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At btrfs_add_new_free_space() we keep searching for ranges in the excluded
extents io tree that have the EXTENT_DIRTY bit set, however we never ever
set that bit for ranges in that tree. That is a leftover from when that
function used the global freed extents trees (fs_info->freed_extents[2]),
where we used both the EXTENT_DIRTY and EXTENT_UPTODATE bits, but those
trees are gone with commit fe119a6eeb ("btrfs: switch to per-transaction
pinned extents"), which introduced the fs_info->excluded_extents io tree,
where only EXTENT_UPTODATE is set.
So remove the EXTENT_DIRTY bit search at btrfs_add_new_free_space().
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
After commit 52b029f427 ("btrfs: remove unnecessary EXTENT_UPTODATE
state in buffered I/O path") we never set EXTENT_UPTODATE in an inode's
io_tree anymore, but we still have some code attempting to clear that
bit from an inode's io_tree. Remove that code as it doesn't do anything
anymore. The sole use of the EXTENT_UPTODATE bit is for the excluded
extents io_tree (fs_info->excluded_extents), which is used to track the
locations of super blocks, so that their ranges are never marked as free,
making them unavailable for extent allocation.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we fsync a file (or directory) that has no more hard links, because
while a process had a file descriptor open on it, the file's last hard
link was removed and then the process did an fsync against the file
descriptor, after a power failure or crash the file still exists after
replaying the log.
This behaviour is incorrect since once an inode has no more hard links
it's not accessible anymore and we insert an orphan item into its
subvolume's tree so that the deletion of all its items is not missed in
case of a power failure or crash.
So after log replay the file shouldn't exist anymore, which is also the
behaviour on ext4, xfs, f2fs and other filesystems.
Fix this by not ignoring inodes with zero hard links at
btrfs_log_inode_parent() and by committing an inode's delayed inode when
we are not doing a fast fsync (either BTRFS_INODE_COPY_EVERYTHING or
BTRFS_INODE_NEEDS_FULL_SYNC is set in the inode's runtime flags). This
last step is necessary because when removing the last hard link we don't
delete the corresponding ref (or extref) item, instead we record the
change in the inode's delayed inode with the BTRFS_DELAYED_NODE_DEL_IREF
flag, so that when the delayed inode is committed we delete the ref/extref
item from the inode's subvolume tree - otherwise the logging code will log
the last hard link and therefore upon log replay the inode is not deleted.
The base code for a fstests test case that reproduces this bug is the
following:
. ./common/dmflakey
_require_scratch
_require_dm_target flakey
_require_mknod
_scratch_mkfs >>$seqres.full 2>&1 || _fail "mkfs failed"
_require_metadata_journaling $SCRATCH_DEV
_init_flakey
_mount_flakey
touch $SCRATCH_MNT/foo
# Commit the current transaction and persist the file.
_scratch_sync
# A fifo to communicate with a background xfs_io process that will
# fsync the file after we deleted its hard link while it's open by
# xfs_io.
mkfifo $SCRATCH_MNT/fifo
tail -f $SCRATCH_MNT/fifo | \
$XFS_IO_PROG $SCRATCH_MNT/foo >>$seqres.full &
XFS_IO_PID=$!
# Give some time for the xfs_io process to open a file descriptor for
# the file.
sleep 1
# Now while the file is open by the xfs_io process, delete its only
# hard link.
rm -f $SCRATCH_MNT/foo
# Now that it has no more hard links, make the xfs_io process fsync it.
echo "fsync" > $SCRATCH_MNT/fifo
# Terminate the xfs_io process so that we can unmount.
echo "quit" > $SCRATCH_MNT/fifo
wait $XFS_IO_PID
unset XFS_IO_PID
# Simulate a power failure and then mount again the filesystem to
# replay the journal/log.
_flakey_drop_and_remount
# We don't expect the file to exist anymore, since it was fsynced when
# it had no more hard links.
[ -f $SCRATCH_MNT/foo ] && echo "file foo still exists"
_unmount_flakey
# success, all done
echo "Silence is golden"
status=0
exit
A test case for fstests will be submitted soon.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's an explanation of how space info works at the top of
fs/btrfs/space-info.c, which makes reference to a variable called
bytes_may_reserve. There's nothing called that in the code, and there
wasn't at the time the comment was written; as far as I can tell this is
a typo, and it should actually be bytes_may_use.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Mark Harmstone <maharmstone@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This flag is set after inserting the eb to the buffer tree and cleared
on its removal. It was added in commit 34b41acec1 ("Btrfs: use a
bit to track if we're in the radix tree"), and commit
faa2dbf004 ("Btrfs: add sanity tests for new qgroup accounting
code") wanted to make use of it. Both are 10+ years old, so we can
remove the flag.
Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This flag is no longer being used. It was added by commit a826d6dcb3
("Btrfs: check items for correctness as we search") but it's no longer
being used after commit f26c923860 ("btrfs: remove reada
infrastructure").
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This flag is no longer being used. It was added by commit ab0fff0305
("btrfs: add READAHEAD extent buffer flag") and used in commits:
79fb65a1f6 ("Btrfs: don't call readahead hook until we have read the entire eb")
78e62c02ab ("btrfs: Remove extent_io_ops::readpage_io_failed_hook")
371cdc0700 ("btrfs: introduce subpage metadata validation check")
Finally all the code using it was removed by commit f26c923860 ("btrfs: remove
reada infrastructure").
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This flag was added by commit 656f30dba7 ("Btrfs: be aware of btree
inode write errors to avoid fs corruption") but it stopped being used
after commit 046b562b20 ("btrfs: use a separate end_io handler for
read_extent_buffer").
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Inside the main loop of btrfs_buffered_write() we are doing a lot of
heavy lifting inside a while() loop.
This makes it pretty hard to read, so factor out the content into a
helper, copy_one_range(), to do the work.
This has no functional change, but with some minor variable renames,
e.g. rename all "sector" into "block".
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Inside the main loop of btrfs_buffered_write(), we have complex data
and metadata space reservation code, which tries to reserve space for
a COW write and, if that fails, falls back to checking if we can do a
NOCOW write.
Factor out that part of code into a dedicated helper, reserve_space(),
to make the main loop a little easier to read.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Inside the main loop of btrfs_buffered_write(), if something goes
wrong, there is an out-of-loop cleanup path to release the reserved
space.
This behavior saves some lines of code, but makes it much harder to
read, as we need to check release_bytes to know when we need to do the
cleanup.
Factor out the cleanup part into a helper, release_reserved_space(), to
do the cleanup inside the main loop, so that we can move @release_bytes
inside the loop.
This will make later refactoring of the main loop much easier.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit c87c299776 ("btrfs: make buffered write to copy one page a
time") changed how the variable @force_page_uptodate was updated.
Before that commit the variable was only initialized to false at the
beginning of the function, and after hitting a short copy, the next
retry on the same folio would force the folio to be read from the disk.
But after the commit, the variable is always initialized to false at the
beginning of the loop's scope, causing prepare_one_folio() never to get a
true value passed in.
The change in behavior is not a huge deal, it only makes a difference
on how we handle short copies:
Old: Allow the buffer to be split
The first short copy will be rejected, that's the same for both
cases.
But for the next retry, we require the folio to be read from disk.
Then even if we hit a short copy again, since the folio is already
uptodate, we do not need to handle a partially uptodate range, and can
continue by marking the short copied range as dirty.
This will split the buffered write into the folio into two buffered
writes.
New: Do not allow the buffer to be split
The first short copy will be rejected, that's the same for both
cases.
For the next retry, we do nothing special, thus if the short copy
happened again, we reject it again, until either the short copy is
gone, or we failed to fault in the buffer.
This will mean the buffer write into the folio will either fail or
succeed, no splitting will happen.
To me, either solution is fine, but the new one makes it simpler and
requires no special handling, so I prefer that solution.
And since @force_page_uptodate is always false when passed into
prepare_one_folio(), we can just remove the variable.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 1d2fbb7f1f ("btrfs: allow compression even if the range is not
page aligned") introduced the block perfect compression for block size <
page size cases.
Before that commit, if the fs block size is smaller than page size (aka
subpage cases), compressed write is only enabled if the dirty range is
fully page aligned.
This block perfect compression support was introduced in v6.13, and has
been tested for two kernel releases.
I believe it's time to move it out of experimental features so that we
can get more tests in the real world.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Merge tag 'for-6.15-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix potential endless loop when discarding a block group when
disabling discard
- reinstate message when setting a large value of mount option 'commit'
- fix a folio leak when async extent submission fails
* tag 'for-6.15-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: add back warning for mount option commit values exceeding 300
btrfs: fix folio leak in submit_one_async_extent()
btrfs: fix discard worker infinite loop after disabling discard
The Btrfs documentation states that if the commit value is greater than
300 a warning should be issued. The warning was accidentally lost in the
new mount API update.
Fixes: 6941823cc8 ("btrfs: remove old mount API code")
CC: stable@vger.kernel.org # 6.12+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Kyoji Ogasawara <sawara04.o@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If btrfs_reserve_extent() fails while submitting an async_extent for a
compressed write, then we fail to call free_async_extent_pages() on the
async_extent and leak its folios. A likely cause for such a failure
would be btrfs_reserve_extent() failing to find a large enough
contiguous free extent for the compressed extent.
I was able to reproduce this by:
1. mount with compress-force=zstd:3
2. fallocating most of a filesystem to a big file
3. fragmenting the remaining free space
4. trying to copy in a file which zstd would generate large compressed
extents for (vmlinux worked well for this)
Step 4. hits the memory leak and can be repeated ad nauseam to
eventually exhaust the system memory.
Fix this by detecting the case where we fall back to uncompressed
submission for a compressed async_extent and ensuring that we call
free_async_extent_pages().
Fixes: 131a821a24 ("btrfs: fallback if compressed IO fails for ENOSPC")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Co-developed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If the discard worker is running and there's currently only one block
group, that block group is a data block group, it's in the unused block
groups discard list and is being used (it got an extent allocated from it
after becoming unused), the worker can end up in an infinite loop if a
transaction abort happens or the async discard is disabled (during remount
or unmount for example).
This happens like this:
1) Task A, the discard worker, is at peek_discard_list() and
find_next_block_group() returns block group X;
2) Block group X is in the unused block groups discard list (its discard
index is BTRFS_DISCARD_INDEX_UNUSED) since at some point in the past
it became an unused block group and was added to that list, but then
later it got an extent allocated from it, so its ->used counter is not
zero anymore;
3) The current transaction is aborted by task B and we end up at
__btrfs_handle_fs_error() in the transaction abort path, where we call
btrfs_discard_stop(), which clears BTRFS_FS_DISCARD_RUNNING from
fs_info, and then at __btrfs_handle_fs_error() we set the fs to RO mode
(setting SB_RDONLY in the super block's s_flags field);
4) Task A calls __add_to_discard_list() with the goal of moving the block
group from the unused block groups discard list into another discard
list, but at __add_to_discard_list() we end up doing nothing because
btrfs_run_discard_work() returns false, since the super block has
SB_RDONLY set in its flags and BTRFS_FS_DISCARD_RUNNING is not set
anymore in fs_info->flags. So block group X remains in the unused block
groups discard list;
5) Task A then does a goto into the 'again' label, calls
find_next_block_group() again and gets block group X again. Then it
repeats the previous steps over and over since there are no other
block groups in the discard lists and block group X is never moved
out of the unused block groups discard list since
btrfs_run_discard_work() keeps returning false and therefore
__add_to_discard_list() doesn't move block group X out of that discard
list.
When this happens we can get a soft lockup report like this:
[71.957] watchdog: BUG: soft lockup - CPU#0 stuck for 27s! [kworker/u4:3:97]
[71.957] Modules linked in: xfs af_packet rfkill (...)
[71.957] CPU: 0 UID: 0 PID: 97 Comm: kworker/u4:3 Tainted: G W 6.14.2-1-default #1 openSUSE Tumbleweed 968795ef2b1407352128b466fe887416c33af6fa
[71.957] Tainted: [W]=WARN
[71.957] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014
[71.957] Workqueue: btrfs_discard btrfs_discard_workfn [btrfs]
[71.957] RIP: 0010:btrfs_discard_workfn+0xc4/0x400 [btrfs]
[71.957] Code: c1 01 48 83 (...)
[71.957] RSP: 0018:ffffafaec03efe08 EFLAGS: 00000246
[71.957] RAX: ffff897045500000 RBX: ffff8970413ed8d0 RCX: 0000000000000000
[71.957] RDX: 0000000000000001 RSI: ffff8970413ed8d0 RDI: 0000000a8f1272ad
[71.957] RBP: 0000000a9d61c60e R08: ffff897045500140 R09: 8080808080808080
[71.957] R10: ffff897040276800 R11: fefefefefefefeff R12: ffff8970413ed860
[71.957] R13: ffff897045500000 R14: ffff8970413ed868 R15: 0000000000000000
[71.957] FS: 0000000000000000(0000) GS:ffff89707bc00000(0000) knlGS:0000000000000000
[71.957] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[71.957] CR2: 00005605bcc8d2f0 CR3: 000000010376a001 CR4: 0000000000770ef0
[71.957] PKRU: 55555554
[71.957] Call Trace:
[71.957] <TASK>
[71.957] process_one_work+0x17e/0x330
[71.957] worker_thread+0x2ce/0x3f0
[71.957] ? __pfx_worker_thread+0x10/0x10
[71.957] kthread+0xef/0x220
[71.957] ? __pfx_kthread+0x10/0x10
[71.957] ret_from_fork+0x34/0x50
[71.957] ? __pfx_kthread+0x10/0x10
[71.957] ret_from_fork_asm+0x1a/0x30
[71.957] </TASK>
[71.957] Kernel panic - not syncing: softlockup: hung tasks
[71.987] CPU: 0 UID: 0 PID: 97 Comm: kworker/u4:3 Tainted: G W L 6.14.2-1-default #1 openSUSE Tumbleweed 968795ef2b1407352128b466fe887416c33af6fa
[71.989] Tainted: [W]=WARN, [L]=SOFTLOCKUP
[71.989] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014
[71.991] Workqueue: btrfs_discard btrfs_discard_workfn [btrfs]
[71.992] Call Trace:
[71.993] <IRQ>
[71.994] dump_stack_lvl+0x5a/0x80
[71.994] panic+0x10b/0x2da
[71.995] watchdog_timer_fn.cold+0x9a/0xa1
[71.996] ? __pfx_watchdog_timer_fn+0x10/0x10
[71.997] __hrtimer_run_queues+0x132/0x2a0
[71.997] hrtimer_interrupt+0xff/0x230
[71.998] __sysvec_apic_timer_interrupt+0x55/0x100
[71.999] sysvec_apic_timer_interrupt+0x6c/0x90
[72.000] </IRQ>
[72.000] <TASK>
[72.001] asm_sysvec_apic_timer_interrupt+0x1a/0x20
[72.002] RIP: 0010:btrfs_discard_workfn+0xc4/0x400 [btrfs]
[72.002] Code: c1 01 48 83 (...)
[72.005] RSP: 0018:ffffafaec03efe08 EFLAGS: 00000246
[72.006] RAX: ffff897045500000 RBX: ffff8970413ed8d0 RCX: 0000000000000000
[72.006] RDX: 0000000000000001 RSI: ffff8970413ed8d0 RDI: 0000000a8f1272ad
[72.007] RBP: 0000000a9d61c60e R08: ffff897045500140 R09: 8080808080808080
[72.008] R10: ffff897040276800 R11: fefefefefefefeff R12: ffff8970413ed860
[72.009] R13: ffff897045500000 R14: ffff8970413ed868 R15: 0000000000000000
[72.010] ? btrfs_discard_workfn+0x51/0x400 [btrfs 23b01089228eb964071fb7ca156eee8cd3bf996f]
[72.011] process_one_work+0x17e/0x330
[72.012] worker_thread+0x2ce/0x3f0
[72.013] ? __pfx_worker_thread+0x10/0x10
[72.014] kthread+0xef/0x220
[72.014] ? __pfx_kthread+0x10/0x10
[72.015] ret_from_fork+0x34/0x50
[72.015] ? __pfx_kthread+0x10/0x10
[72.016] ret_from_fork_asm+0x1a/0x30
[72.017] </TASK>
[72.017] Kernel Offset: 0x15000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[72.019] Rebooting in 90 seconds..
So fix this by making sure we move a block group out of the unused block
groups discard list when calling __add_to_discard_list().
Fixes: 2bee7eb8bb ("btrfs: discard one region at a time in async discard")
Link: https://bugzilla.suse.com/show_bug.cgi?id=1242012
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Replace the code building a bio from a kernel direct map address and
submitting it synchronously with the bdev_rw_virt helper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: David Sterba <dsterba@suse.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Link: https://lore.kernel.org/r/20250507120451.4000627-19-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge tag 'for-6.15-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- revert device path canonicalization, this does not work as intended
with namespaces and is not reliable in all setups
- fix crash in scrub when checksum tree is not valid, e.g. when mounted
with rescue=ignoredatacsums
- fix crash when tracepoint btrfs_prelim_ref_insert is enabled
- other minor fixups:
- open code folio_index(), meant to be used in MM code
- use matching type for sizeof in compression allocation
* tag 'for-6.15-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: open code folio_index() in btree_clear_folio_dirty_tag()
Revert "btrfs: canonicalize the device path before adding it"
btrfs: avoid NULL pointer dereference if no valid csum tree
btrfs: handle empty eb->folios in num_extent_folios()
btrfs: correct the order of prelim_ref arguments in btrfs__prelim_ref
btrfs: compression: adjust cb->compressed_folios allocation type
The folio_index() helper is only needed for mixed usage of page cache
and swap cache, for pure page cache usage, the caller can just use
folio->index instead.
It can't be a swap cache folio here. Swap mapping may only call into fs
through 'swap_rw' but btrfs does not use that method for swap.
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This reverts commit 7e06de7c83.
Commit 7e06de7c83 ("btrfs: canonicalize the device path before adding
it") tries to make btrfs use the "/dev/mapper/*" name first, then any
filename inside "/dev/" as the device path.
This is mostly fine when there is only the root namespace involved, but
when multiple namespaces are involved, things can easily go wrong for the
d_path() usage.
As d_path() returns a file path that is namespace dependent, the
resulting string may not make any sense in another namespace.
Furthermore, the "/dev/" prefix check itself is not reliable, as one can
still make a valid initramfs without devtmpfs, and fill in all needed
device nodes manually.
Overall, userspace is free to pass whatever device path it wants for
mount, and we are not going to win the war trying to cover every corner
case.
So just revert that commit, and do no extra d_path() based file path
sanity check.
CC: stable@vger.kernel.org # 6.12+
Link: https://lore.kernel.org/linux-fsdevel/20250115185608.GA2223535@zen.localdomain/
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
When running a read-only scrub on a btrfs mounted with the
rescue=idatacsums option, it will crash with the following call trace:
BUG: kernel NULL pointer dereference, address: 0000000000000208
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
CPU: 1 UID: 0 PID: 835 Comm: btrfs Tainted: G O 6.15.0-rc3-custom+ #236 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022
RIP: 0010:btrfs_lookup_csums_bitmap+0x49/0x480 [btrfs]
Call Trace:
<TASK>
scrub_find_fill_first_stripe+0x35b/0x3d0 [btrfs]
scrub_simple_mirror+0x175/0x290 [btrfs]
scrub_stripe+0x5f7/0x6f0 [btrfs]
scrub_chunk+0x9a/0x150 [btrfs]
scrub_enumerate_chunks+0x333/0x660 [btrfs]
btrfs_scrub_dev+0x23e/0x600 [btrfs]
btrfs_ioctl+0x1dcf/0x2f80 [btrfs]
__x64_sys_ioctl+0x97/0xc0
do_syscall_64+0x4f/0x120
entry_SYSCALL_64_after_hwframe+0x76/0x7e
[CAUSE]
The mount option "rescue=idatacsums" completely skips loading the csum
tree, so that any data read will not find any data csum, and thus we
ignore data checksum verification.
Normally call sites utilizing the csum tree check the fs state flag
NO_DATA_CSUMS bit, but unfortunately scrub does not check that bit at all.
This results in scrub calling btrfs_search_slot() on a NULL pointer,
which triggers the above crash.
[FIX]
Check both extent and csum tree root before doing any tree search.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
num_extent_folios() unconditionally calls folio_order() on
eb->folios[0]. If that is NULL this will be a NULL pointer dereference. It is reasonable
for it to return 0 as the number of folios in the eb when the first
entry is NULL, so do that instead.
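A userspace model of the guarded helper is sketched below. Only the
early return on a NULL first entry is taken from the description above;
the struct layouts, the names and the two remaining branches are
assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_FOLIOS 16 /* hypothetical array size for this model */

/* Hypothetical stand-in for a folio; only the order matters here. */
struct model_folio_stub {
	unsigned int order; /* 0 for a single page, >0 for a large folio */
};

/* Hypothetical stand-in for an extent buffer. */
struct model_eb {
	struct model_folio_stub *folios[MODEL_MAX_FOLIOS];
	int nr_pages; /* pages needed to back the buffer */
};

static int model_num_extent_folios(const struct model_eb *eb)
{
	assert(eb != NULL);

	/* Nothing attached yet: report zero instead of dereferencing NULL. */
	if (!eb->folios[0])
		return 0;
	/* Assumed: one large folio backs the whole buffer. */
	if (eb->folios[0]->order > 0)
		return 1;
	/* Assumed: otherwise one order-0 folio per page. */
	return eb->nr_pages;
}
```

Callers that loop from 0 to the returned count then simply do nothing
for a buffer that has no folios attached.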
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In preparation for making the kmalloc() family of allocators type aware,
we need to make sure that the returned type from the allocation matches
the type of the variable being assigned. (Before, the allocator would
always return "void *", which can be implicitly cast to any pointer type.)
The assigned type is "struct folio **" but the returned type will be
"struct page **". These are the same allocation size (pointer size), but
the types don't match. Adjust the allocation type to match the assignment.
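One common way to keep the allocation type and the assigned variable in
agreement is to size the allocation from the variable itself. The
sketch below shows that idiom in userspace with calloc(); it is not the
change made by this patch, and struct model_folio_target and
model_alloc_folio_array() are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>

/* Opaque, hypothetical stand-in for the pointed-to type. */
struct model_folio_target;

/*
 * Allocate a zeroed array of @nr pointers. Sizing by sizeof(*arr) ties the
 * element size to the type of the variable being assigned, so the two
 * cannot drift apart. The caller owns the result and must free() it.
 */
static struct model_folio_target **model_alloc_folio_array(size_t nr)
{
	struct model_folio_target **arr;

	assert(nr > 0);
	arr = calloc(nr, sizeof(*arr));
	return arr;
}
```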
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Kees Cook <kees@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Merge tag 'for-6.15-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix potential inode leak in iget() after memory allocation failure
- in subpage mode, fix extent buffer bitmap iteration when writing out
dirty sectors
- fix range calculation when falling back to COW for a NOCOW file
* tag 'for-6.15-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: adjust subpage bit start based on sectorsize
btrfs: fix the inode leak in btrfs_iget()
btrfs: fix COW handling in run_delalloc_nocow()
When running machines with 64k page size and a 16k nodesize we started
seeing tree log corruption in production. This turned out to be because
we were not writing out dirty blocks sometimes, so this in fact affects
all metadata writes.
When writing out a subpage EB we scan the subpage bitmap for a dirty
range. If the range isn't dirty we do
bit_start++;
to move onto the next bit. The problem is the bitmap is based on the
number of sectors that an EB has. So in this case, we have a 64k
pagesize, 16k nodesize, but a 4k sectorsize. This means our bitmap is 4
bits for every node. With a 64k page size we end up with 4 nodes per
page.
To make this easier this is how everything looks
[0        16k       32k       48k      ] logical address
[0        4         8         12       ] radix tree offset
[              64k page                ] folio
[ 16k eb ][ 16k eb ][ 16k eb ][ 16k eb ] extent buffers
[ | | | | | | | | | | | | | | | | ] bitmap
Now we use all of our addressing based on fs_info->sectorsize_bits, so
as you can see above, our 16k eb->start turns into radix entry 4.
When we find a dirty range for our eb, we correctly do bit_start +=
sectors_per_node, because if we start at bit 0, the next bit for the
next eb is 4, to correspond to eb->start 16k.
However if our range is clean, we will do bit_start++, which will now
put us offset from our radix tree entries.
In our case, assume that the first time we check the bitmap the block is
not dirty, we increment bit_start so now it == 1, and then we loop
around and check again. This time it is dirty, and we go to find that
start using the following equation
start = folio_start + bit_start * fs_info->sectorsize;
so in the case above, eb->start 0 is now dirty, and we calculate start
as
0 + 1 * fs_info->sectorsize = 4096
4096 >> 12 = 1
Now we're looking up the radix tree for 1, and we won't find an eb.
What's worse is now we're using bit_start == 1, so we do bit_start +=
sectors_per_node, which is now 5. If that eb is dirty we will run into
the same thing, we will look at an offset that is not populated in the
radix tree, and now we're skipping the writeout of dirty extent buffers.
The best fix for this is to not use sectorsize_bits to address nodes,
but that's a larger change. Since this is a fs corruption problem fix
it simply by always using sectors_per_node to increment the start bit.
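The walk described above can be modelled in userspace. This is only a
sketch under stated assumptions: the bitmap is a fixed snapshot in which
bit 0 was observed clean and bit 1 dirty, the folio sits at file offset 0,
a radix tree lookup is modelled as "the index is node aligned", and
count_submitted_ebs() is a hypothetical name.

```c
#include <stdbool.h>
#include <stdint.h>

#define SECTORSIZE		4096u
#define SECTORSIZE_BITS		12u
#define SECTORS_PER_NODE	4u	/* 16k nodesize / 4k sectorsize */
#define BITS_PER_PAGE		16u	/* 64k page / 4k sectorsize */

/*
 * Walk a snapshot of the dirty bitmap and return how many extent buffers
 * would be found and submitted. @step_by_node selects the fixed
 * behaviour (true) or the old bit_start++ behaviour (false).
 */
unsigned int count_submitted_ebs(uint16_t dirty, bool step_by_node)
{
	unsigned int bit_start = 0;
	unsigned int found = 0;

	while (bit_start < BITS_PER_PAGE) {
		uint64_t start, index;

		if (!(dirty & (1u << bit_start))) {
			/* Clean bit: the old code advanced by one sector only. */
			bit_start += step_by_node ? SECTORS_PER_NODE : 1;
			continue;
		}
		start = (uint64_t)bit_start * SECTORSIZE;
		index = start >> SECTORSIZE_BITS;

		/* Extent buffers only exist at node aligned indexes. */
		if (index % SECTORS_PER_NODE == 0)
			found++;
		bit_start += SECTORS_PER_NODE;
	}
	return found;
}
```

With the snapshot 0x00fe (bits 1-7 set), the old stepping looks up
indexes 1 and 5 and submits nothing, while the fixed stepping looks up
index 4 and submits the dirty eb at 16k.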
Fixes: c4aec299fa ("btrfs: introduce submit_eb_subpage() to submit a subpage metadata page")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
There is a bug report that a syzbot reproducer can lead to the following
busy inode at unmount time:
BTRFS info (device loop1): last unmount of filesystem 1680000e-3c1e-4c46-84b6-56bd3909af50
VFS: Busy inodes after unmount of loop1 (btrfs)
------------[ cut here ]------------
kernel BUG at fs/super.c:650!
Oops: invalid opcode: 0000 [#1] SMP KASAN NOPTI
CPU: 0 UID: 0 PID: 48168 Comm: syz-executor Not tainted 6.15.0-rc2-00471-g119009db2674 #2 PREEMPT(full)
Hardware name: QEMU Ubuntu 24.04 PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
RIP: 0010:generic_shutdown_super+0x2e9/0x390 fs/super.c:650
Call Trace:
<TASK>
kill_anon_super+0x3a/0x60 fs/super.c:1237
btrfs_kill_super+0x3b/0x50 fs/btrfs/super.c:2099
deactivate_locked_super+0xbe/0x1a0 fs/super.c:473
deactivate_super fs/super.c:506 [inline]
deactivate_super+0xe2/0x100 fs/super.c:502
cleanup_mnt+0x21f/0x440 fs/namespace.c:1435
task_work_run+0x14d/0x240 kernel/task_work.c:227
resume_user_mode_work include/linux/resume_user_mode.h:50 [inline]
exit_to_user_mode_loop kernel/entry/common.c:114 [inline]
exit_to_user_mode_prepare include/linux/entry-common.h:329 [inline]
__syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline]
syscall_exit_to_user_mode+0x269/0x290 kernel/entry/common.c:218
do_syscall_64+0xd4/0x250 arch/x86/entry/syscall_64.c:100
entry_SYSCALL_64_after_hwframe+0x77/0x7f
</TASK>
[CAUSE]
When btrfs_alloc_path() failed, btrfs_iget() directly returned without
releasing the inode already allocated by btrfs_iget_locked().
This results in the above busy inode and triggers the kernel BUG.
[FIX]
Fix it by calling iget_failed() if btrfs_alloc_path() failed.
If we hit error inside btrfs_read_locked_inode(), it will properly call
iget_failed(), so nothing to worry about.
Although the iget_failed() cleanup inside btrfs_read_locked_inode() is a
break of the normal error handling scheme, let's fix the obvious bug
and backport first, then rework the error handling later.
Reported-by: Penglei Jiang <superman.xpt@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/20250421102425.44431-1-superman.xpt@gmail.com/
Fixes: 7c855e16ab ("btrfs: remove conditional path allocation in btrfs_read_locked_inode()")
CC: stable@vger.kernel.org # 6.13+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Penglei Jiang <superman.xpt@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In run_delalloc_nocow(), when the found btrfs_key's offset > cur_offset,
it indicates a gap between the current processing region and
the next file extent. The original code would directly jump to
the "must_cow" label, which increments the slot and forces a fallback
to COW. This behavior might skip an extent item and result in an
overestimated COW fallback range.
This patch modifies the logic so that when a gap is detected:
- If no COW range is already being recorded (cow_start is unset),
cow_start is set to cur_offset.
- cur_offset is then advanced to the beginning of the next extent.
- Instead of jumping to "must_cow", control flows directly to
"next_slot" so that the same extent item can be reexamined properly.
The change ensures that we accurately account for the extent gap and
avoid accidentally extending the range that needs to fallback to COW.
CC: stable@vger.kernel.org # 6.6+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Dave Chen <davechen@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmgHxA4ACgkQxWXV+ddt
WDtK6hAAqoqDYqM/Lt5/CmMJnrtXZBIoxlQUkw3b8946d6LDlmaQb4dOL8+/kuzy
mVhtPf0+WYm4YbchrAHpt2ZLp8s5e9TNbxX88HYJPc2pbjIbuzsnig0Ss7d0OipH
i4RSGxT5Pe0TZRFBQGM1iX+ehFbfOFOPwDBYiHoO9IRakbocZwuPAEAZ/r3v1jVW
YJrbgyF6HQt9/atTMbSO+DERMlCgLmMKQL1f0ciYrTcpAl3ermjV5sSFVFKQZQK7
jSd98NDxwfxAA/30pMFcvDS7SHgB4ZP6YT0CTeTYKQ2OTUgvQRIFCPeAORR4u5IN
n9SCLeFJwmG30zrRaOlSk4/4MHzBzycXr5xJI7TAD7Cko9AYNeWWCFwhbKTu/FxJ
26CGKNXtAOXwiPLwLrUcahok0UDbRmV2/DLrl09ltMvkY/s7hf3zD9WuBaq9DOtk
KlCjgWF/Rk9Qpb4kpLZxJtj9/zaNAyRUQDQH7IzcF4SLHEhf6N6ArhxX0PGhwWwy
B8VBZJz3Y7L8ZxP9R/Y29TO2JCvnIhJCy01Y/zfIXzD7Q4XlcC5fbzt7yoEa4Ogb
HrKG5Rtrq2pn7sUSbXg+Kvpvpqz1tD8Dcx3kQqDqo2LnAI4KSVwyLaBSK66gITv1
TwEqfJDVkt9She2mItc+bssCCm/f3ms7KE7dwdBhf7Y47v+Wjzk=
=+YLw
-----END PGP SIGNATURE-----
Merge tag 'for-6.15-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- subpage mode fixes:
- access correct object (folio) when looking up bit offset
- fix assertion condition for number of blocks per folio
- fix upper boundary of locking range in hole punch
- zoned fixes:
- fix potential deadlock caught by lockdep when zone reporting and
device freeze run in parallel
- fix zone write pointer mismatch and NULL pointer dereference when
metadata are converted from DUP to RAID1
- fix error handling when reloc inode creation fails
- in tree-checker, unify error code for header level check
- block layer: add helpers to read zone capacity
* tag 'for-6.15-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: zoned: skip reporting zone for new block group
block: introduce zone capacity helper
btrfs: tree-checker: adjust error code for header level check
btrfs: fix invalid inode pointer after failure to create reloc inode
btrfs: zoned: return EIO on RAID1 block group write pointer mismatch
btrfs: fix the ASSERT() inside GET_SUBPAGE_BITMAP()
btrfs: avoid page_lockend underflow in btrfs_punch_hole_lock_range()
btrfs: subpage: access correct object when reading bitmap start in subpage_calc_start_bit()
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmgAzO8ACgkQxWXV+ddt
WDve6g//UWZ24/wLOoFC4u2wwuctnWy5FFOrvk0IqdxWzuSjA1Ou1P4WfD2xlnQv
wFqYk2SIuP68WQhd09Oj1WRQ9SbJIgAwITeryw4lFYq8v1q8xFB5kM0UTLXXlaNH
O342UK7HRW7XfXD9VkcQz5wXQvk0i7pmtZTjiD1QBbWS+qlEc5YQiZnMRlUlQKBw
85JM45iOFwHJLVt+A8ydC1yMdP7xktiVEhlPsjvzqUKs8orquuikxSW5d/WlDc9g
OeOf9pvxSNf3zsAzmwUrEOxsn3fLFFjoaPxDpfn42BsN4FcyIv4l9K9HdkcdzrLY
Gu0QaDVGCb6bXYhioyEzv/mzESQzOTWQUzI2fJrPPquwH9g0dss9uQwOwaOWbfHO
MDF7fBVwnChaC0O8NoKk5H8jQAXxPfAuU1JpypKOORuffTVz7uG3xkK56VJ/kfTh
qgqRImNGTuAu0C0xGdUjngpOfRypDQLQTo58AubLFAWjqD4elOFjanc/6xobYAJi
PnPk132yKxAdR9h4+1YUk1lzaauDinNzErt+vpUQ/g2QL9PtUbp1IG7VF9llGDzO
hqlifRBHcNy7cKNirFX0PYCke8fXrsKC1NbNiAQMjuK7agzg3b/+PW05EFLQv3EU
6CNgukLG8XbfK2F7PMwmno4zUXbA5JA2mxnKr4vRIMrGZVBTcTo=
=HZ/U
-----END PGP SIGNATURE-----
Merge tag 'for-6.15-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- handle encoded read ioctl returning EAGAIN so it does not mistakenly
free the work structure
- escape subvolume path in mount option list so it cannot be wrongly
parsed when the path contains ","
- remove folio size assertions when writing super block to device with
enabled large folios
* tag 'for-6.15-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: remove folio order ASSERT()s in super block writeback path
btrfs: correctly escape subvol in btrfs_show_options()
btrfs: ioctl: don't free iov when btrfs_encoded_read() returns -EAGAIN
The whole tree checker returns EUCLEAN, except the one check in
btrfs_verify_level_key(). This was inherited from the function that was
moved from disk-io.c in 2cac5af165 ("btrfs: move
btrfs_verify_level_key into tree-checker.c") but this should be unified
with the rest.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There was a bug report about a NULL pointer dereference in
__btrfs_add_free_space_zoned() that ultimately happens because of a
conversion from the default metadata profile DUP to a RAID1 profile on
two disks.
The stack trace has the following signature:
BTRFS error (device sdc): zoned: write pointer offset mismatch of zones in raid1 profile
BUG: kernel NULL pointer dereference, address: 0000000000000058
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
RIP: 0010:__btrfs_add_free_space_zoned.isra.0+0x61/0x1a0
RSP: 0018:ffffa236b6f3f6d0 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff96c8132f3400 RCX: 0000000000000001
RDX: 0000000010000000 RSI: 0000000000000000 RDI: ffff96c8132f3410
RBP: 0000000010000000 R08: 0000000000000003 R09: 0000000000000000
R10: 0000000000000000 R11: 00000000ffffffff R12: 0000000000000000
R13: ffff96c758f65a40 R14: 0000000000000001 R15: 000011aac0000000
FS: 00007fdab1cb2900(0000) GS:ffff96e60ca00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000058 CR3: 00000001a05ae000 CR4: 0000000000350ef0
Call Trace:
<TASK>
? __die_body.cold+0x19/0x27
? page_fault_oops+0x15c/0x2f0
? exc_page_fault+0x7e/0x180
? asm_exc_page_fault+0x26/0x30
? __btrfs_add_free_space_zoned.isra.0+0x61/0x1a0
btrfs_add_free_space_async_trimmed+0x34/0x40
btrfs_add_new_free_space+0x107/0x120
btrfs_make_block_group+0x104/0x2b0
btrfs_create_chunk+0x977/0xf20
btrfs_chunk_alloc+0x174/0x510
? srso_return_thunk+0x5/0x5f
btrfs_inc_block_group_ro+0x1b1/0x230
btrfs_relocate_block_group+0x9e/0x410
btrfs_relocate_chunk+0x3f/0x130
btrfs_balance+0x8ac/0x12b0
? srso_return_thunk+0x5/0x5f
? srso_return_thunk+0x5/0x5f
? __kmalloc_cache_noprof+0x14c/0x3e0
btrfs_ioctl+0x2686/0x2a80
? srso_return_thunk+0x5/0x5f
? ioctl_has_perm.constprop.0.isra.0+0xd2/0x120
__x64_sys_ioctl+0x97/0xc0
do_syscall_64+0x82/0x160
? srso_return_thunk+0x5/0x5f
? __memcg_slab_free_hook+0x11a/0x170
? srso_return_thunk+0x5/0x5f
? kmem_cache_free+0x3f0/0x450
? srso_return_thunk+0x5/0x5f
? srso_return_thunk+0x5/0x5f
? syscall_exit_to_user_mode+0x10/0x210
? srso_return_thunk+0x5/0x5f
? do_syscall_64+0x8e/0x160
? sysfs_emit+0xaf/0xc0
? srso_return_thunk+0x5/0x5f
? srso_return_thunk+0x5/0x5f
? seq_read_iter+0x207/0x460
? srso_return_thunk+0x5/0x5f
? vfs_read+0x29c/0x370
? srso_return_thunk+0x5/0x5f
? srso_return_thunk+0x5/0x5f
? syscall_exit_to_user_mode+0x10/0x210
? srso_return_thunk+0x5/0x5f
? do_syscall_64+0x8e/0x160
? srso_return_thunk+0x5/0x5f
? exc_page_fault+0x7e/0x180
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7fdab1e0ca6d
RSP: 002b:00007ffeb2b60c80 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007fdab1e0ca6d
RDX: 00007ffeb2b60d80 RSI: 00000000c4009420 RDI: 0000000000000003
RBP: 00007ffeb2b60cd0 R08: 0000000000000000 R09: 0000000000000013
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007ffeb2b6343b R14: 00007ffeb2b60d80 R15: 0000000000000001
</TASK>
CR2: 0000000000000058
---[ end trace 0000000000000000 ]---
The 1st line is the most interesting here:
BTRFS error (device sdc): zoned: write pointer offset mismatch of zones in raid1 profile
When a RAID1 block-group is created and a write pointer mismatch between
the disks in the RAID set is detected, btrfs sets the alloc_offset to the
length of the block group marking it as full. Afterwards the code expects
that a balance operation will evacuate the data in this block-group and
repair the problems.
But before this is possible, the new space of this block-group will be
accounted in the free space cache. In __btrfs_add_free_space_zoned(),
it is checked whether this is an initial creation of a block group; if
not, a reclaim decision will be made. Whether a block-group's free space
accounting counts as an initial creation depends on whether the size of
the added free space is the whole length of the block-group and the
allocation offset is 0.
But as btrfs_load_block_group_zone_info() sets the allocation offset to
the zone capacity (i.e. marking the block-group as full), this
initial-creation condition is not met, and the space_info pointer in the
'struct btrfs_block_group' has not yet been assigned.
Fail creation of the block group and rely on manual user intervention to
re-balance the filesystem.
Afterwards the filesystem can be unmounted, mounted in degraded mode and
the missing device can be removed after a full balance of the filesystem.
Reported-by: 西木野羰基 <yanqiyu01@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CAB_b4sBhDe3tscz=duVyhc9hNE+gu=B8CrgLO152uMyanR8BEA@mail.gmail.com/
Fixes: b1934cd606 ("btrfs: zoned: handle broken write pointer on zones")
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
After enabling large data folios for tests, I hit the ASSERT() inside
GET_SUBPAGE_BITMAP() where blocks_per_folio matches BITS_PER_LONG.
The ASSERT() itself is only based on the original subpage fs block size,
where we have at most 16 blocks per page, thus
"ASSERT(blocks_per_folio < BITS_PER_LONG)".
However the experimental large data folio support will set the max folio
order according to the BITS_PER_LONG, so we can have a case where a large
folio contains exactly BITS_PER_LONG blocks.
So the ASSERT() is too strict, change it to
"ASSERT(blocks_per_folio <= BITS_PER_LONG)" to avoid the false alert.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
When running btrfs/004 with 4K fs block size and 64K page size,
sometimes fsstress workload can take 100% CPU for a while, but not long
enough to trigger a 120s hang warning.
[CAUSE]
When such 100% CPU usage happens, btrfs_punch_hole_lock_range() is
always in the call trace.
In one example of this problem, btrfs_punch_hole_lock_range() got the
following parameters:
lock_start = 4096, lockend = 20469
Then we calculate @page_lockstart by rounding up lock_start to page
boundary, which is 64K (page size is 64K).
For @page_lockend, we round down the value towards page boundary, which
results in 0. Then since we need to pass an inclusive end to
filemap_range_has_page(), we subtract 1 from the rounded down value,
resulting in (u64)-1.
In the above case, the range is inside the same page, and we do not even
need to call filemap_range_has_page(), not to mention to call it with
(u64)-1 at the end.
This behavior will cause btrfs_punch_hole_lock_range() to busy loop
waiting for irrelevant range to have its pages dropped.
[FIX]
Calculate @page_lockend by just rounding down @lockend, without
decreasing the value by one. So @page_lockend will no longer underflow.
Then exit early if @page_lockend is no larger than @page_lockstart, as
it means either the range is inside the same page, or the two pages
are adjacent already.
Finally only decrease @page_lockend when calling filemap_range_has_page().
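The arithmetic can be checked with a small userspace sketch. The helper
names are hypothetical, the page size is fixed at 64K as in the report,
and rounding down "lockend + 1" (exclusive end) is an assumption that
matches the numbers quoted above.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE_SKETCH 65536ULL	/* 64K pages, as in the report */

uint64_t round_up_u64(uint64_t v, uint64_t align)
{
	return (v + align - 1) / align * align;
}

uint64_t round_down_u64(uint64_t v, uint64_t align)
{
	return v / align * align;
}

/* Old calculation: inclusive end computed up front, wraps to (u64)-1. */
uint64_t old_page_lockend(uint64_t lockend)
{
	return round_down_u64(lockend + 1, PAGE_SIZE_SKETCH) - 1;
}

/*
 * New calculation: keep the end exclusive, bail out early when no full
 * page lies inside the range, and only then convert to an inclusive end.
 * Returns true if the page cache needs to be checked for [*start, *end].
 */
bool need_range_check(uint64_t lockstart, uint64_t lockend,
		      uint64_t *start, uint64_t *end_inclusive)
{
	uint64_t page_lockstart = round_up_u64(lockstart, PAGE_SIZE_SKETCH);
	uint64_t page_lockend = round_down_u64(lockend + 1, PAGE_SIZE_SKETCH);

	if (page_lockend <= page_lockstart)
		return false;	/* same page, or two adjacent pages */

	*start = page_lockstart;
	*end_inclusive = page_lockend - 1;
	return true;
}
```

For lock_start = 4096 and lockend = 20469, the old helper yields
(u64)-1, while need_range_check() returns false without touching the
page cache.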
Fixes: 0528476b6a ("btrfs: fix the filemap_range_has_page() call in btrfs_punch_hole_lock_range()")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Inside the macro, subpage_calc_start_bit(), we need to calculate the
offset to the beginning of the folio.
But we're using offset_in_page(). On systems with 4K page size and 4K fs
block size, this means we will always return offset 0 for a large folio,
causing all kinds of errors.
Fix it by using offset_in_folio() instead.
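The difference can be seen with plain mask arithmetic. This is a
userspace sketch, not the kernel macros: the _sketch helpers are
hypothetical, pages and fs blocks are both 4K, and the folio is assumed
to be naturally aligned to its size.

```c
#include <stdint.h>

#define PAGE_SHIFT_SKETCH	12u	/* 4K pages */
#define BLOCK_BITS_SKETCH	12u	/* 4K fs blocks */

/* Offset of @pos inside its page. */
uint64_t offset_in_page_sketch(uint64_t pos)
{
	return pos & ((1ULL << PAGE_SHIFT_SKETCH) - 1);
}

/* Offset of @pos inside a naturally aligned folio of the given order. */
uint64_t offset_in_folio_sketch(uint64_t pos, unsigned int order)
{
	return pos & ((1ULL << (PAGE_SHIFT_SKETCH + order)) - 1);
}

/* Old start bit: always 0 when page size equals block size. */
unsigned int start_bit_old(uint64_t pos)
{
	return (unsigned int)(offset_in_page_sketch(pos) >> BLOCK_BITS_SKETCH);
}

/* Fixed start bit: index of the block inside the folio. */
unsigned int start_bit_new(uint64_t pos, unsigned int order)
{
	return (unsigned int)(offset_in_folio_sketch(pos, order) >>
			      BLOCK_BITS_SKETCH);
}
```

For file offset 20480 inside a 16K folio (order 2) starting at 16K, the
old calculation gives bit 0 and the fixed one gives bit 1.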
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Finish cleaning up the CRC kconfig options by removing the remaining
unnecessary prompts and an unnecessary 'default y', removing
CONFIG_LIBCRC32C, and documenting all the CRC library options.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCZ/P7QhQcZWJpZ2dlcnNA
Z29vZ2xlLmNvbQAKCRDzXCl4vpKOKyoOAQCynFcS1dWuD27S+SdUREmBjMAoZo5M
zdsIvlPv9KLycgD/QX5lXjW3KIYY6jQ8vHUuLVwfDl/JEp4GJS9dLGU+agg=
=0R1T
-----END PGP SIGNATURE-----
Merge tag 'crc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux
Pull CRC cleanups from Eric Biggers:
"Finish cleaning up the CRC kconfig options by removing the remaining
unnecessary prompts and an unnecessary 'default y', removing
CONFIG_LIBCRC32C, and documenting all the CRC library options"
* tag 'crc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux:
lib/crc: remove CONFIG_LIBCRC32C
lib/crc: document all the CRC library kconfig options
lib/crc: remove unnecessary prompt for CONFIG_CRC_ITU_T
lib/crc: remove unnecessary prompt for CONFIG_CRC_T10DIF
lib/crc: remove unnecessary prompt for CONFIG_CRC16
lib/crc: remove unnecessary prompt for CONFIG_CRC_CCITT
lib/crc: remove unnecessary prompt for CONFIG_CRC32 and drop 'default y'
The family of functions:
lookup_one()
lookup_one_unlocked()
lookup_one_positive_unlocked()
appear designed to be used by external clients of the filesystem rather
than by filesystems acting on themselves as the lookup_one_len family
are used.
They are used by:
btrfs/ioctl - which is a user-space interface rather than an internal
activity
exportfs - i.e. from nfsd or the open_by_handle_at interface
overlayfs - to access the underlying filesystems
smb/server - for file service
They should be used by nfsd (more than just the exportfs path) and
cachefs but aren't.
It would help if the documentation didn't claim they should "not be
called by generic code".
Also the path component name is passed as "name" and "len" which are
(confusingly?) separated by the "base". In some cases the len is simply
"strlen" and so passing a qstr using QSTR() would make the calling
clearer.
Other callers do pass separate name and len which are stored in a
struct. Sometimes these are already stored in a qstr, other times it
easily could be.
So this patch changes these three functions to receive a 'struct qstr *',
and improves the documentation.
QSTR_LEN() is added to make it easy to pass a QSTR containing a known
len.
[brauner@kernel.org: take a struct qstr pointer]
Signed-off-by: NeilBrown <neil@brown.name>
Link: https://lore.kernel.org/r/20250319031545.2999807-2-neil@brown.name
Signed-off-by: Christian Brauner <brauner@kernel.org>
timer_delete[_sync]() replaces del_timer[_sync](). Convert the whole tree
over and remove the historical wrapper inlines.
Conversion was done with coccinelle plus manual fixups where necessary.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that LIBCRC32C does nothing besides select CRC32, make every option
that selects LIBCRC32C instead select CRC32 directly. Then remove
LIBCRC32C.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20250401221600.24878-8-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
reservation" from Sourabh Jain changes powerpc's kexec code to use more
of the generic layers.
- The 2 patch series "get_maintainer: report subsystem status
separately" from Vlastimil Babka makes some long-requested improvements
to the get_maintainer output.
- The 4 patch series "ucount: Simplify refcounting with rcuref_t" from
Sebastian Siewior cleans up and optimizes the refcounting in the ucount
code.
- The 12 patch series "reboot: support runtime configuration of
emergency hw_protection action" from Ahmad Fatoum improves the ability
for a driver to perform an emergency system shutdown or reboot.
- The 16 patch series "Converge on using secs_to_jiffies() part two"
from Easwar Hariharan performs further migrations from
msecs_to_jiffies() to secs_to_jiffies().
- The 7 patch series "lib/interval_tree: add some test cases and
cleanup" from Wei Yang permits more userspace testing of kernel library
code, adds some more tests and performs some cleanups.
- The 2 patch series "hung_task: Dump the blocking task stacktrace" from
Masami Hiramatsu arranges for the hung_task detector to dump the stack
of the blocking task and not just that of the blocked task.
- The 4 patch series "resource: Split and use DEFINE_RES*() macros" from
Andy Shevchenko provides some cleanups to the resource definition
macros.
- Plus the usual shower of singleton patches - please see the individual
changelogs for details.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZ+nuqwAKCRDdBJ7gKXxA
jtNqAQDxqJpjWkzn4yN9CNSs1ivVx3fr6SqazlYCrt3u89WQvwEA1oRrGpETzUGq
r6khQUIcQImPPcjFqEFpuiSOU0MBZA0=
=Kii8
-----END PGP SIGNATURE-----
Merge tag 'mm-nonmm-stable-2025-03-30-18-23' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton:
- The series "powerpc/crash: use generic crashkernel reservation" from
Sourabh Jain changes powerpc's kexec code to use more of the generic
layers.
- The series "get_maintainer: report subsystem status separately" from
Vlastimil Babka makes some long-requested improvements to the
get_maintainer output.
- The series "ucount: Simplify refcounting with rcuref_t" from
Sebastian Siewior cleans up and optimizes the refcounting in the
ucount code.
- The series "reboot: support runtime configuration of emergency
hw_protection action" from Ahmad Fatoum improves the ability for a
driver to perform an emergency system shutdown or reboot.
- The series "Converge on using secs_to_jiffies() part two" from Easwar
Hariharan performs further migrations from msecs_to_jiffies() to
secs_to_jiffies().
- The series "lib/interval_tree: add some test cases and cleanup" from
Wei Yang permits more userspace testing of kernel library code, adds
some more tests and performs some cleanups.
- The series "hung_task: Dump the blocking task stacktrace" from Masami
Hiramatsu arranges for the hung_task detector to dump the stack of
the blocking task and not just that of the blocked task.
- The series "resource: Split and use DEFINE_RES*() macros" from Andy
Shevchenko provides some cleanups to the resource definition macros.
- Plus the usual shower of singleton patches - please see the
individual changelogs for details.
* tag 'mm-nonmm-stable-2025-03-30-18-23' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (77 commits)
mailmap: consolidate email addresses of Alexander Sverdlin
fs/procfs: fix the comment above proc_pid_wchan()
relay: use kasprintf() instead of fixed buffer formatting
resource: replace open coded variant of DEFINE_RES()
resource: replace open coded variants of DEFINE_RES_*_NAMED()
resource: replace open coded variant of DEFINE_RES_NAMED_DESC()
resource: split DEFINE_RES_NAMED_DESC() out of DEFINE_RES_NAMED()
samples: add hung_task detector mutex blocking sample
hung_task: show the blocker task if the task is hung on mutex
kexec_core: accept unaccepted kexec segments' destination addresses
watchdog/perf: optimize bytes copied and remove manual NUL-termination
lib/interval_tree: fix the comment of interval_tree_span_iter_next_gap()
lib/interval_tree: skip the check before go to the right subtree
lib/interval_tree: add test case for span iteration
lib/interval_tree: add test case for interval_tree_iter_xxx() helpers
lib/rbtree: add random seed
lib/rbtree: split tests
lib/rbtree: enable userland test suite for rbtree related data structure
checkpatch: describe --min-conf-desc-length
scripts/gdb/symbols: determine KASLR offset on s390
...
[BUG]
There is a syzbot report that the ASSERT() inside write_dev_supers() got
triggered:
assertion failed: folio_order(folio) == 0, in fs/btrfs/disk-io.c:3858
------------[ cut here ]------------
kernel BUG at fs/btrfs/disk-io.c:3858!
Oops: invalid opcode: 0000 [#1] SMP KASAN NOPTI
CPU: 0 UID: 0 PID: 6730 Comm: syz-executor378 Not tainted 6.14.0-syzkaller-03565-gf6e0150b2003 #0 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
RIP: 0010:write_dev_supers fs/btrfs/disk-io.c:3858 [inline]
RIP: 0010:write_all_supers+0x400f/0x4090 fs/btrfs/disk-io.c:4155
Call Trace:
<TASK>
btrfs_commit_transaction+0x1eda/0x3750 fs/btrfs/transaction.c:2528
btrfs_quota_enable+0xfcc/0x21a0 fs/btrfs/qgroup.c:1226
btrfs_ioctl_quota_ctl+0x144/0x1c0 fs/btrfs/ioctl.c:3677
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:906 [inline]
__se_sys_ioctl+0xf1/0x160 fs/ioctl.c:892
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f5ad1f20289
</TASK>
---[ end trace 0000000000000000 ]---
[CAUSE]
Since commit f93ee0df51 ("btrfs: convert super block writes to folio
in write_dev_supers()") and commit c94b7349b8 ("btrfs: convert super
block writes to folio in wait_dev_supers()"), the super block writeback
path is converted to use folio.
Since the original code is using page based interfaces, we have an
"ASSERT(folio_order(folio) == 0);" added to make sure everything is not
changed.
But the folio here is not from any btrfs inode, but from the block
device, and we have no control over the folio order in bdev; the device
can choose whatever folio size it wants/needs.
E.g. the bdev may even have a block size of multiple pages.
So the ASSERT() is triggered.
[FIX]
The super block writeback path has taken larger folios into
consideration, so there is no need for the ASSERT().
And since commit bc00965dbf ("btrfs: count super block write errors in
device instead of tracking folio error state"), the wait path no longer
checks the folio status but only waits for the folio writeback to finish,
so there is nothing requiring the ASSERT() either.
So we can remove both ASSERT()s safely now.
Reported-by: syzbot+34122898a11ab689518a@syzkaller.appspotmail.com
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently, displaying the btrfs subvol mount option doesn't escape ','.
This makes parsing /proc/self/mounts and /proc/self/mountinfo
ambiguous for subvolume names that contain commas. The text after the
comma could be mistaken for another option (think "subvol=foo,ro", where
ro is actually part of the subvolume's name).
Replace the manual escape characters list with a call to
seq_show_option(). Thanks to Calvin Walton for suggesting this approach.
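To illustrate the effect, here is a userspace sketch of octal escaping.
It assumes the escaped set is ",= \t\n\\" and that each such character
is emitted as a backslash followed by three octal digits, which is how
spaces already show up in /proc/mounts; escape_option_value() is a
hypothetical helper, not the kernel function.

```c
#include <stdio.h>
#include <string.h>

/* Assumed set of characters that must not appear raw in an option value. */
#define ESCAPE_SET ",= \t\n\\"

/*
 * Copy @src into @dst, replacing every character from ESCAPE_SET with a
 * 4 byte "\ooo" octal sequence. Returns 0 on success, -1 if @dst is too
 * small.
 */
int escape_option_value(const char *src, char *dst, size_t dst_len)
{
	size_t used = 0;

	if (dst_len == 0)
		return -1;

	for (; *src; src++) {
		size_t need = strchr(ESCAPE_SET, *src) ? 4 : 1;

		/* Keep one byte for the terminating NUL. */
		if (used + need + 1 > dst_len)
			return -1;
		if (need == 4)
			snprintf(dst + used, 5, "\\%03o",
				 (unsigned int)(unsigned char)*src);
		else
			dst[used] = *src;
		used += need;
	}
	dst[used] = '\0';
	return 0;
}
```

A subvolume named "foo,ro" would then be shown as "foo\054ro", which a
parser can no longer mistake for the separate option "ro".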
Fixes: c8d3fe028f ("Btrfs: show subvol= and subvolid= in /proc/mounts")
CC: stable@vger.kernel.org # 5.4+
Suggested-by: Calvin Walton <calvin.walton@kepstin.ca>
Signed-off-by: Johannes Kimmel <kernel@bareminimum.eu>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Fix a bug in encoded read that mistakenly frees the iov in case
btrfs_encoded_read() returns -EAGAIN assuming the structure will be
reused. This can happen when receiving requests concurrently: the
io_uring subsystem does not reset the data, and the last free will
happen in btrfs_uring_read_finished().
Handle the -EAGAIN error and skip freeing iov.
CC: stable@vger.kernel.org # 6.13+
Signed-off-by: Sidong Yang <sidong.yang@furiosa.ai>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmfZy+0ACgkQxWXV+ddt
WDtcRw//bUfqbabUGBZ+t/a7YahSeukKx7jhHEHDvzaK8LSZj4otZtLtlKZbaNQK
gGhMitd+rwkf/KnnRvCmS9Y6v4PHbsH8NX0PaGH4ZFYD4mifAs6HNSQUzQIASAZt
OhX/PaKUdLN6kFOt4Yg8Qtem5LcF9Kmrc43ySkcF1T7KtZey8KZypMf0Af/4KvP/
QcNiYJiUlotz6m5K0+TjsDVJDKbYPYy07u3/9GHJBN8bEf5jswPmfDJrONd+NDFS
rMylVCTkW5Hl93qDM0zINPcyfuFFNUH4fWJVRizJPmOwQWUqkRx4J5nSKZzQSlgg
O3KTEYPJHG388an1Cs/k4oIEpOq2xJ7RKJP8ksPf/IcXOTJ0dLXUQisheRoeGyYR
04TWP1rZ2vyQI/LzlOiRozCkAWWhLMJMvWXRUTK/9z9Jh2dcbPdykJGQZ11D9hNI
W5i0XsHX/P2xD8D2sOHo+QY5o1QzMZpb+IaL/+Kv22s3Vb1brabZgOAq8H13l1/y
oe3RLVSLueth22q4GK/MSi7hxSZwV6Zj5HtxYxfs4RFqWo9sM6mp9xP3Via3MnLA
fK8FIMYUMqgvqonDqUD8Gv+YV15Haq8icO/2F9b9eiycJ1mSsRILVEiVCJGbBYIz
C1tB7j5Lv44ZExKHmxPzHMa8rrrG+jaSxxZpuLuOYX0VvVECKVY=
=t4Jn
-----END PGP SIGNATURE-----
Merge tag 'for-6.15-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"User visible changes:
- fall back to buffered write if direct io is done on a file that
requires checksums
- this avoids a problem with checksum mismatch errors, observed
e.g. on virtual images when writes to pages under writeback
cause the checksum mismatch reports
- this may lead to some performance degradation but currently the
recommended setup for VM images is to use the NOCOW file
attribute that also disables checksums
- fast/realtime zstd levels -15 to -1
- supported by mount options (compress=zstd:-5) and defrag ioctl
- improved speed, reduced compression ratio, check the commit for
sample measurements
- defrag ioctl extended to accept negative compression levels
- subpage mode
- remove warning when subpage mode is used, the feature is now
reasonably complete and tested
- in debug mode allow to create 2K b-tree nodes to allow testing
subpage on x86_64 with 4K pages too
Performance improvements:
- in send, better file path caching improves runtime (on sample load
by -30%)
- on s390x with hardware zlib support prepare the input buffer in a
better way to get the best results from the acceleration
- minor speed improvement in encoded read, avoid memory allocation in
synchronous mode
Core:
- enable stable writes on inodes, replacing manually waiting for
writeback and allowing to skip that on inodes without checksums
- add last checks and warnings for out-of-band dirty writes to pages,
requiring a fixup ("fixup worker"), this should not be necessary
since 5.8 where get_user_page() and pin_user_pages*() prevent this
- long history behind that, we'll be happy to remove the whole
infrastructure in the near future
- more folio API conversions and preparations for large folio support
- subpage cleanups and refactoring, split handling of data and
metadata to allow future support for large folios
- readpage works as block-by-block, no change for normal mode, this
is preparation for future subpage updates
- block group refcount fixes and hardening
- delayed iput fixes
- in zoned mode, fix zone activation on filesystem with missing
devices
Cleanups:
- inode parameter cleanups
- path auto-freeing updates
- code flow simplifications in send
- redundant parameter cleanups"
* tag 'for-6.15-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (164 commits)
btrfs: zoned: fix zone finishing with missing devices
btrfs: zoned: fix zone activation with missing devices
btrfs: remove end_no_trans label from btrfs_log_inode_parent()
btrfs: simplify condition for logging new dentries at btrfs_log_inode_parent()
btrfs: remove redundant else statement from btrfs_log_inode_parent()
btrfs: use memcmp_extent_buffer() at replay_one_extent()
btrfs: update outdated comment for overwrite_item()
btrfs: use variables to store extent buffer and slot at overwrite_item()
btrfs: avoid unnecessary memory allocation and copy at overwrite_item()
btrfs: don't clobber ret in btrfs_validate_super()
btrfs: prepare btrfs_page_mkwrite() for large folios
btrfs: prepare extent_io.c for future large folio support
btrfs: prepare btrfs_launcher_folio() for large folios support
btrfs: replace PAGE_SIZE with folio_size for subpage.[ch]
btrfs: add a size parameter to btrfs_alloc_subpage()
btrfs: subpage: make btrfs_is_subpage() check against a folio
btrfs: add extra warning if delayed iput is added when it's not allowed
btrfs: avoid redundant path slot assignment in btrfs_search_forward()
btrfs: remove unnecessary btrfs_key local variable in btrfs_search_forward()
btrfs: simplify the return value handling in search_ioctl()
...
Merge tag 'vfs-6.15-rc1.async.dir' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs async dir updates from Christian Brauner:
"This contains cleanups that fell out of the work from async directory
handling:
- Change kern_path_locked() and user_path_locked_at() to never return
a negative dentry. This simplifies the usability of these helpers
in various places
- Drop d_exact_alias() from the remaining place in NFS where it is
still used. This also allows us to drop the d_exact_alias() helper
completely
- Drop an unnecessary call to fh_update() from nfsd_create_locked()
- Change i_op->mkdir() to return a struct dentry
Change vfs_mkdir() to return a dentry provided by the filesystems
which is hashed and positive. This allows us to reduce the number
of cases where the resulting dentry is not positive to very few
cases. The code in these places becomes simpler and easier to
understand.
- Repack DENTRY_* and LOOKUP_* flags"
* tag 'vfs-6.15-rc1.async.dir' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
doc: fix inline emphasis warning
VFS: Change vfs_mkdir() to return the dentry.
nfs: change mkdir inode_operation to return alternate dentry if needed.
fuse: return correct dentry for ->mkdir
ceph: return the correct dentry on mkdir
hostfs: store inode in dentry after mkdir if possible.
Change inode_operations.mkdir to return struct dentry *
nfsd: drop fh_update() from S_IFDIR branch of nfsd_create_locked()
nfs/vfs: discard d_exact_alias()
VFS: add common error checks to lookup_one_qstr_excl()
VFS: change kern_path_locked() and user_path_locked_at() to never return negative dentry
VFS: repack LOOKUP_ bit flags.
VFS: repack DENTRY_ flags.
If do_zone_finish() is called with a filesystem that has missing devices
(e.g. a RAID file system mounted in degraded mode) it is accessing the
btrfs_device::zone_info pointer, which will not be set if the device
in question is missing.
Check if the device is present (by checking if it has a valid block device
pointer associated) and if not, skip zone finishing for it.
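The check described above can be sketched in user space as follows. This
is only an illustration: struct device, struct zone_info and
finish_zones() are hypothetical stand-ins, not the btrfs definitions, and
the only property taken from the text is that a missing device has no
block device pointer and no zone_info.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures (assumption). */
struct zone_info {
	int active_zones;
};

struct device {
	void *bdev;                  /* NULL when the device is missing */
	struct zone_info *zone_info; /* only set for present devices */
};

/*
 * Finish one zone on every present stripe device and return how many
 * devices were handled. Missing devices are skipped before their unset
 * zone_info pointer can be dereferenced.
 */
int finish_zones(struct device *stripes, size_t nr_stripes)
{
	int handled = 0;

	for (size_t i = 0; i < nr_stripes; i++) {
		struct device *dev = &stripes[i];

		/* Degraded mount: nothing to finish on a missing device. */
		if (!dev->bdev)
			continue;

		if (dev->zone_info->active_zones > 0)
			dev->zone_info->active_zones--;
		handled++;
	}
	return handled;
}
```

Without the bdev check, the second stripe in a degraded two-device setup
would dereference a NULL zone_info.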
Fixes: 4dcbb8ab31 ("btrfs: zoned: make zone finishing multi stripe capable")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If btrfs_zone_activate() is called with a filesystem that has missing
devices (e.g. a RAID file system mounted in degraded mode) it is accessing
the btrfs_device::zone_info pointer, which will not be set if the device in
question is missing.
Check if the device is present (by checking if it has a valid block
device pointer associated) and if not, skip zone activation for it.
Fixes: f9a912a3c4 ("btrfs: zoned: make zone activation multi stripe capable")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's a pointless label as we don't have to do anything under it other
than return from the function. So remove it and directly return from the
function where we used to goto.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's no point in checking if the inode is a directory as
ctx->log_new_dentries is only set in case we are logging a directory down
the call chain of btrfs_log_inode(). So remove that check, making the logic
simpler, and while at it add a comment about why we use a local variable
to track if we later need to log new dentries.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we don't need to log new directory dentries, there's no point in having
an else branch just to set 'ret' to zero, as it's already zero because
every time it gets a non-zero value we jump into one of the exit labels.
So remove it, which reduces source code size and the module text size.
Before this change:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename
1813855 163737 16920 1994512 1e6f10 fs/btrfs/btrfs.ko
After this change:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename
1813807 163737 16920 1994464 1e6ee0 fs/btrfs/btrfs.ko
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of using memcmp(), which requires copying both file extent items
from each extent buffer into a local buffer, use memcmp_extent_buffer() so
that we only need to copy one of the file extent items and directly use
the extent buffer of the other file extent item for the comparison.
This reduces code size, saves one memory copy and reduces stack usage.
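As a rough user-space model of the difference, assume an extent buffer
whose data is spread over several pages. model_eb, model_read_eb() and
model_memcmp_eb() are hypothetical names for this sketch and only mimic
the idea behind read_extent_buffer() and memcmp_extent_buffer(): one item
is copied into a local buffer, the other is compared in place.

```c
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 8 /* tiny on purpose, so ranges cross pages */

/* Hypothetical extent buffer: data lives in separate pages. */
struct model_eb {
	unsigned char *pages[4];
};

/* Copy [start, start + len) out of the buffer into dst. */
void model_read_eb(const struct model_eb *eb, void *dst,
		   size_t start, size_t len)
{
	unsigned char *out = dst;

	while (len > 0) {
		size_t idx = start / MODEL_PAGE_SIZE;
		size_t off = start % MODEL_PAGE_SIZE;
		size_t cur = MODEL_PAGE_SIZE - off;

		if (cur > len)
			cur = len;
		memcpy(out, eb->pages[idx] + off, cur);
		out += cur;
		start += cur;
		len -= cur;
	}
}

/* Compare ptr against [start, start + len) of the buffer, no copy. */
int model_memcmp_eb(const struct model_eb *eb, const void *ptr,
		    size_t start, size_t len)
{
	const unsigned char *p = ptr;

	while (len > 0) {
		size_t idx = start / MODEL_PAGE_SIZE;
		size_t off = start % MODEL_PAGE_SIZE;
		size_t cur = MODEL_PAGE_SIZE - off;
		int ret;

		if (cur > len)
			cur = len;
		ret = memcmp(p, eb->pages[idx] + off, cur);
		if (ret)
			return ret;
		p += cur;
		start += cur;
		len -= cur;
	}
	return 0;
}
```

A caller reads the item from one buffer with model_read_eb() and hands the
local copy to model_memcmp_eb() for the other buffer, so only one copy is
made.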
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function is exclusively used for log replay since commit
3eb4234424 ("btrfs: remove outdated logic from overwrite_item() and add
assertion"), so update the comment so that it doesn't say it can be used
for logging. Also some minor rewording for clarity and while at it
reformat the affected text so that it fits closer to the 80 characters
limit for comments.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of referring to path->nodes[0] and path->slots[0] multiple times,
which is verbose and confusing since we have 'eb' and 'slot' variables
as well, introduce local variables 'dst_eb' to point to path->nodes[0] and
'dst_slot' to have path->slots[0], reducing verbosity and making it more
obvious about which extent buffer and slot we are referring to.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's no need to allocate memory and copy from both the destination and
source extent buffers to compare if the items are equal, we can instead
use memcmp_extent_buffer() which allows to do only one memory allocation
and copy instead of two.
So use memcmp_extent_buffer() instead of memcmp(), allowing us to avoid
one memory allocation, which can fail or be slow while under heavy memory
pressure, avoid one memory copy and reduce code size.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 2a9bb78cfd ("btrfs: validate system chunk array at
btrfs_validate_super()") introduces a call to validate_sys_chunk_array()
in btrfs_validate_super(), which clobbers the value of ret set earlier.
This has the effect of negating the validity checks done earlier, making
it so btrfs could potentially try to mount invalid filesystems.
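The clobbering pattern can be illustrated with a small sketch. The helper
names and the shape of the fixed variant are assumptions made for this
example, not a copy of the kernel code; the point is only that an
unconditional assignment to 'ret' discards an earlier failure.

```c
#include <errno.h>

/* Hypothetical checks standing in for the real validation steps. */
static int check_basic_fields(int sb_ok)
{
	return sb_ok ? 0 : -EINVAL;
}

static int check_sys_chunk_array(int array_ok)
{
	return array_ok ? 0 : -EINVAL;
}

/* Buggy shape: the last assignment overwrites any earlier error. */
int validate_super_clobbering(int sb_ok, int array_ok)
{
	int ret = 0;

	if (check_basic_fields(sb_ok))
		ret = -EINVAL;

	ret = check_sys_chunk_array(array_ok); /* clobbers 'ret' */
	return ret;
}

/* One possible fixed shape: only ever set 'ret' on failure. */
int validate_super_fixed(int sb_ok, int array_ok)
{
	int ret = 0;

	if (check_basic_fields(sb_ok))
		ret = -EINVAL;

	if (check_sys_chunk_array(array_ok))
		ret = -EINVAL;
	return ret;
}
```

With a bad superblock field but a valid chunk array, the clobbering
variant returns 0 while the fixed one returns -EINVAL.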
Fixes: 2a9bb78cfd ("btrfs: validate system chunk array at btrfs_validate_super()")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Mark Harmstone <maharmstone@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This changes the assumption that the folio is always page sized.
(Although the ASSERT() for folio order is still kept as-is).
Just replace the PAGE_SIZE with folio_size().
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we're handling folios from filemap, we can no longer assume all
folios are page sized.
Thus for call sites assuming the folio is page sized, change the
PAGE_SIZE usage to folio_size() instead.
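To see why the size has to come from the folio, here is a minimal model.
struct model_folio, model_folio_size() and bytes_in_folio() are
hypothetical and assume a 4K page; the kernel's folio_size() likewise
derives the size from the folio order.

```c
#include <stddef.h>

#define MODEL_PAGE_SHIFT 12
#define MODEL_PAGE_SIZE ((size_t)1 << MODEL_PAGE_SHIFT)

/* Hypothetical folio: 2^order pages starting at file offset 'pos'. */
struct model_folio {
	unsigned int order;
	unsigned long long pos;
};

size_t model_folio_size(const struct model_folio *folio)
{
	return MODEL_PAGE_SIZE << folio->order;
}

/* Number of bytes of [start, start + len) that the folio covers. */
size_t bytes_in_folio(const struct model_folio *folio,
		      unsigned long long start, size_t len)
{
	unsigned long long fstart = folio->pos;
	unsigned long long fend = folio->pos + model_folio_size(folio);
	unsigned long long end = start + len;

	if (start < fstart)
		start = fstart;
	if (end > fend)
		end = fend;
	return end > start ? (size_t)(end - start) : 0;
}
```

For an order-2 folio at offset 0, the range [4K, 12K) lies fully inside
the folio, which code assuming a page sized folio would miss.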
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
That function is only calling btrfs_qgroup_free_data(), which doesn't
care about the size of the folio.
Just replace the fixed PAGE_SIZE with folio_size().
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since we can no longer assume all data filemap folios are page sized,
use proper folio_size() calls to determine the folio size, as a
preparation for future large data filemap folios.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since we can no longer assume page sized folio for data filemap folios,
allow btrfs_alloc_subpage() to accept a new parameter, @fsize,
indicating the folio size.
This doesn't follow the regular behavior of passing a folio directly,
because this function is shared by both data and metadata folios, and
for metadata folios we have extra allocation policy to ensure no large
folios whose sizes are larger than nodesize (unless it's page sized).
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
To support large data folios, we can no longer assume every filemap
folio is page sized.
So btrfs_is_subpage() check must be done against a folio.
Thankfully for metadata folios, we have the full control and ensure a
large folio will not be larger than nodesize, so
btrfs_meta_is_subpage() doesn't need this change.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since I have triggered the ASSERT() on the delayed iput too many times,
now is the time to add some extra debug warnings for delayed iput.
All delayed iputs should be queued after all ordered extents finish
their IO and all involved workqueues are flushed.
Thus after the btrfs_run_delayed_iputs() inside close_ctree(), there
should be no more delayed iputs added.
So introduce a new BTRFS_FS_STATE_NO_DELAYED_IPUT, set after the above
mentioned timing. And all btrfs_add_delayed_iput() will check that flag
and give a WARN_ON_ONCE().
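A user-space sketch of the idea, with all names hypothetical: a boolean
stands in for BTRFS_FS_STATE_NO_DELAYED_IPUT and a counter plus a "warned"
latch stands in for WARN_ON_ONCE().

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical model of the relevant fs_info state. */
struct model_fs_info {
	bool no_delayed_iput; /* models BTRFS_FS_STATE_NO_DELAYED_IPUT */
	bool warned;          /* latch, so the warning fires only once */
	int nr_warnings;
	int nr_delayed_iputs;
};

void model_add_delayed_iput(struct model_fs_info *fs)
{
	/* Warn once if an iput is queued after the final run. */
	if (fs->no_delayed_iput && !fs->warned) {
		fs->warned = true;
		fs->nr_warnings++;
		fprintf(stderr, "WARNING: delayed iput added too late\n");
	}
	fs->nr_delayed_iputs++;
}

void model_run_delayed_iputs(struct model_fs_info *fs)
{
	fs->nr_delayed_iputs = 0;
}

/* Final run at unmount time, after which no new iputs are expected. */
void model_close_ctree(struct model_fs_info *fs)
{
	model_run_delayed_iputs(fs);
	fs->no_delayed_iput = true;
}
```

The late add still queues the iput; the flag only makes the unexpected
caller visible.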
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Move path slot assignment before the condition check to prevent
duplicate assignment. Previously, the slot was set both inside and after
the 'slot >= nritems' block with no change in its value, which is
unnecessary.
Signed-off-by: Sun YangKai <sunk67188@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The 'found_key' variable was only used to temporarily store the found key
before copying it to 'min_key' at the end of the function when returning
success.
Eliminate the 'found_key' variable, and directly store the key into
'min_key' at the exact loop exit points where ret=0 is set, maintaining
identical functionality.
Signed-off-by: Sun YangKai <sunk67188@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Move the assignment of -EFAULT to within the error condition check
in fault_in_subpage_writeable(). The previous placement outside the
condition could lead to the error value being overwritten by subsequent
assignments, causing unnecessary assignments.
Simplify loop exit logic by removing redundant goto.
The original code used 'goto err' to bypass post-loop processing after
handling errors from btrfs_search_forward(). However, the loop's
termination naturally falls through to the post-loop section, which
already handles 'ret' values. Replacing 'goto err' with 'break'
eliminates redundant control flow, consolidates error handling, and
makes the loop's exit conditions explicit.
Signed-off-by: Sun YangKai <sunk67188@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we fail to add the chunk map to the fs mapping tree we exit
test_rmap_block() without freeing the chunk map. Fix this by adding a
call to btrfs_free_chunk_map() before exiting the test function if the
call to btrfs_add_chunk_map() failed.
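The leak and its fix follow a common error path pattern, sketched below
with hypothetical names (model_chunk_map, tree_slot, live_maps). The
single-slot "tree" is an assumption that makes the insertion failure easy
to provoke.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical chunk map and a one-slot stand-in for the mapping tree. */
struct model_chunk_map {
	unsigned long long start;
};

struct model_chunk_map *tree_slot;
int live_maps; /* allocations not yet freed, for leak checking */

struct model_chunk_map *model_alloc_chunk_map(void)
{
	struct model_chunk_map *map = calloc(1, sizeof(*map));

	if (map)
		live_maps++;
	return map;
}

void model_free_chunk_map(struct model_chunk_map *map)
{
	free(map);
	live_maps--;
}

/* Fails with -EEXIST when the slot is already taken. */
int model_add_chunk_map(struct model_chunk_map *map)
{
	if (tree_slot)
		return -EEXIST;
	tree_slot = map;
	return 0;
}

int model_test_rmap_block(void)
{
	struct model_chunk_map *map = model_alloc_chunk_map();
	int ret;

	if (!map)
		return -ENOMEM;

	ret = model_add_chunk_map(map);
	if (ret) {
		/* The fix: the map was never inserted, so free it here. */
		model_free_chunk_map(map);
		return ret;
	}

	/* ... the test body would use the map here ... */

	tree_slot = NULL;
	model_free_chunk_map(map);
	return 0;
}
```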
Fixes: 7dc66abb5a ("btrfs: use a dedicated data structure for chunk maps")
CC: stable@vger.kernel.org # 6.12+
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Similar to mark_bg_unused() and mark_bg_to_reclaim(), we have a few
places that use bg_list with refcounting, mostly for retrying failures
to reclaim/delete unused.
These have custom logic for handling locking and refcounting the bg_list
properly, but they actually all want to do the same thing, so pull that
logic out into a helper. Unfortunately, mark_bg_unused() does still need
the NEW flag to avoid prematurely marking stuff unused (even if refcount
is fine, we don't want to mess with bg creation), so it cannot use the
new helper.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
All other users of the bg_list list_head increment the refcount when
adding to a list and decrement it when deleting from the list. Just for
the sake of uniformity and to try to avoid refcounting bugs, do it for
this list as well.
This does not fix any known ref-counting bug, as the reference belongs
to a single task (trans_handle is not shared and this represents
trans_handle->new_bgs linkage) and will not lose its original refcount
while that thread is running. And BLOCK_GROUP_FLAG_NEW protects against
ref-counting errors "moving" the block group to the unused list without
taking a ref.
With that said, I still believe it is simpler to just hold the extra ref
count for this list user as well.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently, the async discard machinery owns a ref to the block_group
when the block_group is queued on a discard list. However, to handle
races with discard cancellation and the discard workfn, we have a
specific logic to detect that the block_group is *currently* running in
the workfn, to protect the workfn's usage amidst cancellation.
As far as I can tell, this doesn't have any overt bugs (though
finish_discard_pass() and remove_from_discard_list() racing can have a
surprising outcome for the caller of remove_from_discard_list() in that
it is again added at the end).
But it is needlessly complicated to rely on locking and the nullity of
discard_ctl->block_group. Simplify this significantly by just taking a
refcount while we are in the workfn and unconditionally drop it in both
the remove and workfn paths, regardless of if they race.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
As far as I can tell, these calls of list_del_init() on bg_list cannot
run concurrently with btrfs_mark_bg_unused() or btrfs_mark_bg_to_reclaim(),
as they are in transaction error paths and situations where the block
group is readonly.
However, if there is any chance at all of racing with mark_bg_unused(),
or a different future user of bg_list, better to be safe than sorry.
Otherwise we risk the following interleaving (bg_list refcount in parens)
  T1 (some random op)                     T2 (btrfs_mark_bg_unused)
                                          !list_empty(&bg->bg_list); (1)
  list_del_init(&bg->bg_list); (1)
                                          list_move_tail (1)
  btrfs_put_block_group (0)
                                          btrfs_delete_unused_bgs
                                            bg = list_first_entry
                                            list_del_init(&bg->bg_list);
                                            btrfs_put_block_group(bg); (-1)
Ultimately, this results in a broken ref count that hits zero one deref
early and the real final deref underflows the refcount, resulting in a WARNING.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Block group creation is done in two phases, which results in a slightly
unintuitive property: a block group can be allocated/deallocated from
after btrfs_make_block_group() adds it to the space_info with
btrfs_add_bg_to_space_info(), but before creation is fully completed
in btrfs_create_pending_block_groups(). As a result, it is possible for a
block group to go unused and have 'btrfs_mark_bg_unused' called on it
concurrently with 'btrfs_create_pending_block_groups'. This causes a
number of issues, which were fixed with the block group flag
'BLOCK_GROUP_FLAG_NEW'.
However, this fix is not quite complete. Since it does not use the
unused_bg_lock, it is possible for the following race to occur:
  btrfs_create_pending_block_groups            btrfs_mark_bg_unused
                                               if list_empty // false
  list_del_init
  clear_bit
                                               else if (test_bit) // true
                                                    list_move_tail
And we get into the exact same broken ref count and invalid new_bgs
state for transaction cleanup that BLOCK_GROUP_FLAG_NEW was designed to
prevent.
The broken refcount aspect will result in a warning like:
[1272.943527] refcount_t: underflow; use-after-free.
[1272.943967] WARNING: CPU: 1 PID: 61 at lib/refcount.c:28 refcount_warn_saturate+0xba/0x110
[1272.944731] Modules linked in: btrfs virtio_net xor zstd_compress raid6_pq null_blk [last unloaded: btrfs]
[1272.945550] CPU: 1 UID: 0 PID: 61 Comm: kworker/u32:1 Kdump: loaded Tainted: G W 6.14.0-rc5+ #108
[1272.946368] Tainted: [W]=WARN
[1272.946585] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
[1272.947273] Workqueue: btrfs_discard btrfs_discard_workfn [btrfs]
[1272.947788] RIP: 0010:refcount_warn_saturate+0xba/0x110
[1272.949532] RSP: 0018:ffffbf1200247df0 EFLAGS: 00010282
[1272.949901] RAX: 0000000000000000 RBX: ffffa14b00e3f800 RCX: 0000000000000000
[1272.950437] RDX: 0000000000000000 RSI: ffffbf1200247c78 RDI: 00000000ffffdfff
[1272.950986] RBP: ffffa14b00dc2860 R08: 00000000ffffdfff R09: ffffffff90526268
[1272.951512] R10: ffffffff904762c0 R11: 0000000063666572 R12: ffffa14b00dc28c0
[1272.952024] R13: 0000000000000000 R14: ffffa14b00dc2868 R15: 000001285dcd12c0
[1272.952850] FS: 0000000000000000(0000) GS:ffffa14d33c40000(0000) knlGS:0000000000000000
[1272.953458] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1272.953931] CR2: 00007f838cbda000 CR3: 000000010104e000 CR4: 00000000000006f0
[1272.954474] Call Trace:
[1272.954655] <TASK>
[1272.954812] ? refcount_warn_saturate+0xba/0x110
[1272.955173] ? __warn.cold+0x93/0xd7
[1272.955487] ? refcount_warn_saturate+0xba/0x110
[1272.955816] ? report_bug+0xe7/0x120
[1272.956103] ? handle_bug+0x53/0x90
[1272.956424] ? exc_invalid_op+0x13/0x60
[1272.956700] ? asm_exc_invalid_op+0x16/0x20
[1272.957011] ? refcount_warn_saturate+0xba/0x110
[1272.957399] btrfs_discard_cancel_work.cold+0x26/0x2b [btrfs]
[1272.957853] btrfs_put_block_group.cold+0x5d/0x8e [btrfs]
[1272.958289] btrfs_discard_workfn+0x194/0x380 [btrfs]
[1272.958729] process_one_work+0x130/0x290
[1272.959026] worker_thread+0x2ea/0x420
[1272.959335] ? __pfx_worker_thread+0x10/0x10
[1272.959644] kthread+0xd7/0x1c0
[1272.959872] ? __pfx_kthread+0x10/0x10
[1272.960172] ret_from_fork+0x30/0x50
[1272.960474] ? __pfx_kthread+0x10/0x10
[1272.960745] ret_from_fork_asm+0x1a/0x30
[1272.961035] </TASK>
[1272.961238] ---[ end trace 0000000000000000 ]---
Though we have seen them in the async discard workfn as well. It is
most likely to happen after a relocation finishes which cancels discard,
tears down the block group, etc.
Fix this fully by taking the lock around the list_del_init + clear_bit
so that the two are done atomically.
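The race and the effect of the lock can be replayed deterministically in a
single thread. Everything below is a hypothetical model (model_bg, the
replay functions, the meaning of 'refs'); "locked" simply means that each
side runs its steps as one unit, which is what holding the lock around
list_del_init + clear_bit is meant to guarantee.

```c
#include <stdbool.h>

/* Hypothetical block group state relevant to the race. */
struct model_bg {
	int refs;      /* base reference plus any list reference */
	bool on_list;  /* linked via bg_list on some list */
	bool flag_new; /* models BLOCK_GROUP_FLAG_NEW */
};

/* btrfs_mark_bg_unused() logic, run as one unit. */
static void mark_unused_locked(struct model_bg *bg)
{
	if (!bg->on_list) {
		bg->refs++;         /* take a ref for the unused list */
		bg->on_list = true;
	} else if (!bg->flag_new) {
		bg->on_list = true; /* list_move_tail(), no new ref */
	}
}

/* btrfs_create_pending_block_groups() tail, run as one unit. */
static void create_pending_locked(struct model_bg *bg)
{
	bg->on_list = false;  /* list_del_init() */
	bg->flag_new = false; /* clear_bit() */
}

/* On a list again, but nobody took a reference for that list. */
static bool inconsistent(const struct model_bg *bg, int base_refs)
{
	return bg->on_list && bg->refs == base_refs;
}

/* Replays the interleaving from the commit message. */
bool replay_racy(void)
{
	struct model_bg bg = { .refs = 1, .on_list = true, .flag_new = true };
	bool was_empty = !bg.on_list;   /* T2: if list_empty -> false */

	bg.on_list = false;             /* T1: list_del_init */
	bg.flag_new = false;            /* T1: clear_bit */
	if (!was_empty && !bg.flag_new) /* T2: else-if branch is taken */
		bg.on_list = true;      /* T2: list_move_tail */
	return inconsistent(&bg, 1);
}

/* Same two sides, but each runs atomically in either order. */
bool replay_locked(bool mark_unused_first)
{
	struct model_bg bg = { .refs = 1, .on_list = true, .flag_new = true };

	if (mark_unused_first) {
		mark_unused_locked(&bg);
		create_pending_locked(&bg);
	} else {
		create_pending_locked(&bg);
		mark_unused_locked(&bg);
	}
	return inconsistent(&bg, 1);
}
```

Only the interleaved replay ends with the block group on a list without a
matching reference; both serialized orders stay consistent.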
Fixes: 0657b20c5a ("btrfs: fix use-after-free of new block group that became unused")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
The fs_info can be taken from the given block group, so there is no need
to pass it as an argument. Also rename the local variable from 'info' to
'fs_info' which is more widely used, more clear and to be more consistent.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The fs_info can be taken from the given block group, so there is no need
to pass it as an argument.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The fs_info can be taken from the given block group, so there is no need
to pass it as an argument.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's an internal function and btrfs_iget() is now returning a btrfs inode,
so change btrfs_iget_path() to also return a btrfs inode instead of a VFS
inode.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's an internal function and most of the time the callers are doing a lot
of BTRFS_I() calls on the returned VFS inode to get the btrfs inode, so
change the return type to struct btrfs_inode instead.
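For readers unfamiliar with the pattern, the sketch below models why
returning the btrfs inode saves conversions. The model_* types and
MODEL_BTRFS_I() are hypothetical; they only assume, as in btrfs, that the
VFS inode is embedded in the filesystem specific inode and recovered with
a container_of() style cast.

```c
#include <stddef.h>

/* Hypothetical VFS inode embedded in a filesystem specific inode. */
struct model_vfs_inode {
	unsigned long i_ino;
};

struct model_btrfs_inode {
	unsigned long long generation;
	struct model_vfs_inode vfs_inode;
};

#define MODEL_CONTAINER_OF(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Models BTRFS_I(): go from the embedded inode to its container. */
static inline struct model_btrfs_inode *
MODEL_BTRFS_I(struct model_vfs_inode *inode)
{
	return MODEL_CONTAINER_OF(inode, struct model_btrfs_inode, vfs_inode);
}

/* Old style: callers get the VFS inode and convert it back each time. */
struct model_vfs_inode *model_iget_old(struct model_btrfs_inode *cached)
{
	return &cached->vfs_inode;
}

/* New style: callers get the btrfs inode directly. */
struct model_btrfs_inode *model_iget_new(struct model_btrfs_inode *cached)
{
	return cached;
}
```

With the new return type a caller that still needs the VFS inode uses
&inode->vfs_inode, which is cheaper to read than repeated conversions in
the other direction.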
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
fixup_inode_link_count() mostly wants to use a btrfs_inode, plus it's an
internal function so it should take btrfs_inode instead of a VFS inode.
Change the argument type to btrfs_inode, avoiding several BTRFS_I() calls
too.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All callers of read_one_inode() are mostly interested in the btrfs_inode
structure rather than the VFS inode, so make read_one_inode() return
the btrfs_inode instead, avoiding lots of BTRFS_I() calls.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All callers of btrfs_iget_logging() are interested in the btrfs_inode
structure rather than the VFS inode, so make btrfs_iget_logging() return
the btrfs_inode instead, avoiding lots of BTRFS_I() calls.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The inline function btrfs_is_testing() is hardcoded to return 0 if
CONFIG_BTRFS_FS_RUN_SANITY_TESTS is not set. Currently we're relying on
the compiler optimizing out the call to alloc_test_extent_buffer() in
btrfs_find_create_tree_block(), as it's not been defined (it's behind an
#ifdef).
Add a stub version of alloc_test_extent_buffer() to avoid linker errors
on non-standard optimization levels. This problem was seen on GCC 14
with -O0, a level that helps to see symbols that would otherwise be
optimized out.
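The stub pattern looks roughly like this. MODEL_RUN_SANITY_TESTS and the
model_* functions are hypothetical stand-ins; the sketch assumes the
config option is off, which is the case the commit is about.

```c
#include <stddef.h>

struct model_eb {
	int dummy;
};

#ifdef MODEL_RUN_SANITY_TESTS
static struct model_eb test_eb;

static inline int model_is_testing(void)
{
	return 1;
}

struct model_eb *model_alloc_test_eb(void)
{
	return &test_eb;
}
#else
/* Hardcoded to 0 when the sanity tests are not built. */
static inline int model_is_testing(void)
{
	return 0;
}

/*
 * Stub: never called at runtime, but it gives the compiler a definition
 * so the call below still links when dead code is not eliminated.
 */
static inline struct model_eb *model_alloc_test_eb(void)
{
	return NULL;
}
#endif

static struct model_eb real_eb;

struct model_eb *model_find_create_tree_block(void)
{
	if (model_is_testing())
		return model_alloc_test_eb();
	return &real_eb;
}
```

Without the stub, the call inside the if branch only disappears when the
optimizer removes it, which is not guaranteed at -O0.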
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Mark Harmstone <maharmstone@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
Even after all the error fixes related to the
"ASSERT(list_empty(&fs_info->delayed_iputs));" in close_ctree(), I can
still hit it reliably with my experimental 2K block size.
[CAUSE]
In my case, all the errors are triggered after the fs is already in error
status.
I find the following call trace to be the cause of race:
                 Main thread                 |  endio_write_workers
---------------------------------------------+---------------------------
close_ctree()                                |
|- btrfs_error_commit_super()                |
|  |- btrfs_cleanup_transaction()            |
|  |  |- btrfs_destroy_all_ordered_extents() |
|  |     |- btrfs_wait_ordered_roots()       |
|  |- btrfs_run_delayed_iputs()              |
|                                            | btrfs_finish_ordered_io()
|                                            | |- btrfs_put_ordered_extent()
|                                            |    |- btrfs_add_delayed_iput()
|- ASSERT(list_empty(delayed_iputs))         |
   !!! Triggered !!!
The root cause is that btrfs_wait_ordered_roots() only waits for
ordered extents to finish their IOs; it does not wait for them to be
finished and removed.
[FIX]
Since btrfs_error_commit_super() will flush and wait for all ordered
extents, it should be executed early, before we start flushing the
workqueues.
And since btrfs_error_commit_super() now runs early, there is no need to
run btrfs_run_delayed_iputs() inside it, so just remove the
btrfs_run_delayed_iputs() call from btrfs_error_commit_super().
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The zstd and zlib compression types support setting compression levels.
Enhance the defrag interface to specify the levels as well. For zstd the
negative (realtime) levels are also accepted.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The 'out' label is pointless as we don't have anything to cleanup anymore
(we used to have an inode to iput), so remove it and make error paths
directly return an error.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We are doing a lookup of the inode but we don't use it at all. So just
remove this pointless lookup.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have dereferenced the async_submit_bio structure and extracted the bio
pointer into a local variable, so there's no need to dereference it again
when calling btrfs_bio_end_io(). Just use "bio->bi_status" instead of the
longer expression "async->bbio->bio.bi_status".
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At close_ctree() after we have run delayed iputs either explicitly through
calling btrfs_run_delayed_iputs() or later during the call to
btrfs_commit_super() or btrfs_error_commit_super(), we assert that the
delayed iputs list is empty.
We have (another) race where this assertion might fail because we have
queued an async write into the fs_info->workers workqueue. Here's how it
happens:
1) We are submitting a data bio for an inode that is not the data
relocation inode, so we call btrfs_wq_submit_bio();
2) btrfs_wq_submit_bio() submits a work for the fs_info->workers queue
that will run run_one_async_done();
3) We enter close_ctree(), flush several work queues except
fs_info->workers, explicitly run delayed iputs with a call to
btrfs_run_delayed_iputs() and then again shortly after by calling
btrfs_commit_super() or btrfs_error_commit_super(), which also run
delayed iputs;
4) run_one_async_done() is executed in the work queue, and because there
was an IO error (bio->bi_status is not 0) it calls btrfs_bio_end_io(),
which drops the final reference on the associated ordered extent by
calling btrfs_put_ordered_extent() - and that adds a delayed iput for
the inode;
5) At close_ctree() we find that after stopping the cleaner and
transaction kthreads the delayed iputs list is not empty, failing the
following assertion:
ASSERT(list_empty(&fs_info->delayed_iputs));
Fix this by flushing the fs_info->workers workqueue before running delayed
iputs at close_ctree().
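The ordering problem can be reduced to a toy model. All names are
hypothetical, and "flushing" here simply runs every queued work item, each
of which is assumed to hit the IO error path and add one delayed iput.

```c
/* Hypothetical model of the state involved at unmount time. */
struct model_fs_info {
	int queued_async_work; /* items still in the workers queue */
	int delayed_iputs;
};

/* Run all queued work; each item adds one delayed iput. */
static void model_flush_workers(struct model_fs_info *fs)
{
	while (fs->queued_async_work > 0) {
		fs->queued_async_work--;
		fs->delayed_iputs++;
	}
}

static void model_run_delayed_iputs(struct model_fs_info *fs)
{
	fs->delayed_iputs = 0;
}

/* Old order: iputs are run while work is still queued. */
int model_close_old_order(struct model_fs_info *fs)
{
	model_run_delayed_iputs(fs);
	model_flush_workers(fs);  /* stands in for the work running late */
	return fs->delayed_iputs; /* non-zero would trip the ASSERT() */
}

/* Fixed order: flush the workers first, then run the iputs. */
int model_close_fixed_order(struct model_fs_info *fs)
{
	model_flush_workers(fs);
	model_run_delayed_iputs(fs);
	return fs->delayed_iputs;
}
```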
David reported this when running generic/648, which exercises IO error
paths by using the DM error table.
Reported-by: David Sterba <dsterba@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[OUT-OF-BAND DIRTY FOLIOS]
An out-of-band folio means the folio is marked dirty but without
notifying the filesystem.
This can lead to various problems, not limited to:
- No folio::private to track per block status
- No proper space reserved for such a dirty folio
[HISTORY IN BTRFS]
This used to be a problem related to get_user_page(), but with the
introduction of pin_user_pages*(), we should no longer hit such a
case.
In btrfs, we have a long history of catching such out-of-band dirty
folios by:
- Mark the folio ordered during delayed allocation
- Check the folio ordered flag during writeback
If the folio has no ordered flag, it means it doesn't go through
delayed allocation, thus it's definitely an out-of-band
one.
If we got one, we go through COW fixup, which will re-dirty the folio
with proper handling in another workqueue.
[PROBLEMS OF COW-FIXUP]
Such a workaround blocks our migration to iomap (it requires
extra flags to track whether a folio was dirtied by the fs or not) and I'd
argue it's not data checksum safe, since if a folio can be marked dirty
without informing the fs, the content can also change at any time.
But with the introduction of pin_user_pages*() during the v5.8 merge
window, such out-of-band dirty folios should be treated as a bug.
Ext4 has treated such case by warning and erroring out even before
pin_user_pages*().
Furthermore, there is already evidence that such folio ordered flag
tracking can be broken by incorrect error handling; see the commit
messages of the following commits:
06f3642847 ("btrfs: do proper folio cleanup when cow_file_range() failed")
c2b47df81c ("btrfs: do proper folio cleanup when run_delalloc_nocow() failed")
[FIXES]
Unlike btrfs, ext4 and xfs (iomap) never bother handling such
out-of-band dirty folios.
- Ext4 just warns and errors out
- Iomap always follows the folio/block dirty flags
And there is nothing really COW specific, as xfs supports COW too.
Here we take one step towards ext4 by warning and erroring out.
But since the COW fixup has been there from the beginning, we keep
the old behavior for non-experimental builds, and only do the new warning
for experimental builds until we're 100% sure and can remove the COW fixup.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is just a small clean up, it doesn't change how the code works.
Originally this code had a goto so we needed to set "ret = 0;" but now
it returns directly and so we can simplify it a bit by doing a
"return 0;" and removing the assignment.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The btrfs_cleanup_bio() helper is trivial and has a single caller, there's
no point in having a dedicated helper function. So get rid of it and move
its code into the caller (btrfs_bio_end_io()).
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The __btrfs_bio_end_io() helper is trivial and has a single caller, so
there's no point in having a dedicated helper function. Further the double
underscore prefix in the name is discouraged. So get rid of it and move
its code into the caller (btrfs_bio_end_io()).
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At close_ctree() after we have run delayed iputs either through explicitly
calling btrfs_run_delayed_iputs() or later during the call to
btrfs_commit_super() or btrfs_error_commit_super(), we assert that the
delayed iputs list is empty.
When we have compressed writes this assertion may fail because delayed
iputs may have been added to the list after we last ran delayed iputs.
This happens like this:
1) We have a compressed write bio executing;
2) We enter close_ctree() and flush the fs_info->endio_write_workers
queue which is the queue used for running ordered extent completion;
3) The compressed write bio finishes and enters
btrfs_finish_compressed_write_work(), where it calls
btrfs_finish_ordered_extent() which in turn calls
btrfs_queue_ordered_fn(), which queues a work item in the
fs_info->endio_write_workers queue that we have flushed before;
4) At close_ctree() we proceed, run all existing delayed iputs and
call btrfs_commit_super() (which also runs delayed iputs), but before
we run the following assertion below:
ASSERT(list_empty(&fs_info->delayed_iputs))
A delayed iput is added by the step below...
5) The ordered extent completion job queued in step 3 runs and results in
creating a delayed iput when dropping the last reference of the ordered
extent (a call to btrfs_put_ordered_extent() made from
btrfs_finish_one_ordered());
6) At this point the delayed iputs list is not empty, so the assertion at
close_ctree() fails.
Fix this by flushing the fs_info->compressed_write_workers queue at
close_ctree() before flushing the fs_info->endio_write_workers queue,
respecting the queue dependency as the latter is responsible for the
execution of ordered extent completion.
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pass a struct btrfs_inode to btrfs_ioctl_subvol_getflags() as it's an
internal interface, allowing us to remove some uses of BTRFS_I().
Signed-off-by: David Sterba <dsterba@suse.com>
The search tree ioctls use btrfs_root, so pass that instead of btrfs_inode
pointers so we don't have to do the conversion.
Signed-off-by: David Sterba <dsterba@suse.com>
The ioctl switch btrfs_ioctl() provides several parameter types for
convenience so we don't have to do the conversion in the callbacks.
Pass root pointers to the send related functions.
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we only support two block sizes, 4K and PAGE_SIZE.
This means on the most common architecture x86_64, we have no way to
test subpage block size. And that's exactly why I have an aarch64 machine
dedicated for subpage tests.
But this is still a hurdle for a lot of btrfs developers, and to improve
the test coverage mostly on x86_64, here we enable debug builds to
accept 2K block size.
This involves:
- Introduce a dedicated minimal block size macro
BTRFS_MIN_BLOCKSIZE, which depends on if CONFIG_BTRFS_DEBUG is set.
If so it's 2K, otherwise it's 4K as usual.
- Allow 4K, PAGE_SIZE and BTRFS_MIN_BLOCKSIZE as block size
- Update subpage block size checks to be based on BTRFS_MIN_BLOCKSIZE
- Export the new supported blocksize through sysfs interfaces
As most of the subpage support is already pretty mature, there is no
extra work needed to support the extra 2K block size.
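The block size rule listed above can be sketched in userspace C roughly as
follows. This is only an illustration under stated assumptions: DEMO_DEBUG
stands in for CONFIG_BTRFS_DEBUG, DEMO_PAGE_SIZE is an assumed 64K page size,
and demo_blocksize_supported() is a hypothetical name, not the kernel helper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for CONFIG_BTRFS_DEBUG; assumed enabled for this sketch. */
#define DEMO_DEBUG 1
/* Assumed page size, for the example only. */
#define DEMO_PAGE_SIZE 65536u

#if DEMO_DEBUG
#define DEMO_MIN_BLOCKSIZE 2048u	/* debug builds accept 2K */
#else
#define DEMO_MIN_BLOCKSIZE 4096u	/* 4K as usual */
#endif

static_assert(DEMO_MIN_BLOCKSIZE <= 4096u, "minimal block size is at most 4K");

/* Hypothetical check: accept 4K, the page size and the minimal block size. */
static bool demo_blocksize_supported(uint32_t blocksize)
{
	return blocksize == 4096u ||
	       blocksize == DEMO_PAGE_SIZE ||
	       blocksize == DEMO_MIN_BLOCKSIZE;
}
```

With DEMO_DEBUG set to 0 the same check would reject 2K, matching the
"otherwise it's 4K as usual" case.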
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Btrfs utilizes inline data extent for the following cases:
- Regular small files
- Symlinks
And "btrfs check" detects any file extents that are too large as an
error.
It's not a problem for 4K block size, but for the incoming smaller
block sizes (2K), it can cause problems due to bad limits:
- Non-compressed inline data extents
We do not allow a non-compressed inline data extent to be as large as
block size.
- Symlinks
Currently the only real limit on symlinks is 4K, which can be larger
than the 2K block size.
These will cause btrfs-check to report too large file extents.
Fix it by adding proper size checks for the above cases.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since the initial enablement of block size < page size support for
btrfs in v5.15, we have hit several milestones for block size < page
size (subpage) support:
- RAID56 subpage support
In v5.19
- Refactored scrub support to support subpage better
In v6.4
- Block perfect (previously requires page aligned ranges) compressed write
In v6.13
- Various error handling fixes involving subpage
In v6.14
Finally the only missing feature is the pretty simple and harmless
inlined data extent creation, just added in previous patches.
Now btrfs has all of its features ready for both regular and subpage
cases, there is no reason to output a warning about the experimental
subpage support, and we can finally remove it now.
Acked-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Previously inline data extents creation was disabled if the block size
(previously called sector size) is smaller than the page size, for the
following reasons:
- Possible mixed inline and regular data extents
However this is also the same if the block size matches the page size,
thus we do not treat mixed inline and regular extents as an error.
And the chance of causing mixed inline and regular data extents is not
even increased, as it has the same requirement (compressed inline data
extent covering the whole first block, followed by regular extents).
- Inability to handle async/inline delalloc range for block size < page
size cases
This is already fixed since commit 1d2fbb7f1f ("btrfs: allow
compression even if the range is not page aligned").
This was the major technical obstacle, but it's not anymore.
With that removed, we can enable inline data extents creation regardless
of the block size or the page size, allowing btrfs to have the same
capacity for all block sizes.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
Since the support of block size (sector size) < page size for btrfs,
test case generic/563 fails with 4K block size and 64K page size:
--- tests/generic/563.out 2024-04-25 18:13:45.178550333 +0930
+++ /home/adam/xfstests-dev/results//generic/563.out.bad 2024-09-30 09:09:16.155312379 +0930
@@ -3,7 +3,8 @@
read is in range
write is in range
write -> read/write
-read is in range
+read has value of 8388608
+read is NOT in range -33792 .. 33792
write is in range
...
[CAUSE]
The test case creates an 8MiB file, then does buffered write into the 8MiB
using 4K block size, to overwrite the whole file.
On 4K page sized systems, since the write range covers the full block and
page, btrfs will not bother reading the page, just like what XFS and EXT4
do.
But on 64K page sized systems, although the 4K sized write is still block
aligned, it's not page aligned anymore, thus btrfs will read the full
page, which will be accounted by cgroup and fail the test.
The test case itself expects that such a 4K block aligned write does not
trigger any read.
Such expected behavior is an optimization to reduce folio reads when
possible, and unfortunately btrfs does not implement such optimization.
[FIX]
To skip the full page read, we need to do the following modification:
- Do not trigger full page read as long as the buffered write is block
aligned
This is pretty simple by modifying the check inside
prepare_uptodate_page().
- Skip already uptodate blocks during full page read
Otherwise it can lead to the following data corruption:
    0       32K     64K
    |///////|       |
Where the file range [0, 32K) is dirtied by buffered write, the
remaining range [32K, 64K) is not.
When reading the full page, since [0,32K) is only dirtied but not
written back, there is no data extent map for it, but a hole covering
[0, 64k).
If we continue reading the full page range [0, 64K), the dirtied range
will be filled with 0 (since there is only a hole covering the whole
range).
This causes the dirtied range to get lost.
With this optimization, btrfs can pass generic/563 even if the page size
is larger than fs block size.
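The second point of the fix can be pictured with a small userspace model.
This is a sketch only: struct demo_folio, the tiny block size and
demo_read_full_folio() are made up for illustration, and the "read" simply
fills blocks with zeros as a read over a hole would.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define DEMO_BLOCK_SIZE 4u	/* tiny block size, for illustration only */
#define DEMO_NR_BLOCKS  4u	/* blocks per folio in this model */

struct demo_folio {
	unsigned char data[DEMO_BLOCK_SIZE * DEMO_NR_BLOCKS];
	bool uptodate[DEMO_NR_BLOCKS];
};

/*
 * Hypothetical full-folio read over a hole: zero every block, except the
 * blocks that are already uptodate (e.g. dirtied by a buffered write).
 */
static void demo_read_full_folio(struct demo_folio *folio)
{
	assert(folio != NULL);
	for (size_t i = 0; i < DEMO_NR_BLOCKS; i++) {
		if (folio->uptodate[i])
			continue;	/* keep the newer in-memory content */
		memset(folio->data + i * DEMO_BLOCK_SIZE, 0, DEMO_BLOCK_SIZE);
		folio->uptodate[i] = true;
	}
}
```

Without the uptodate check, the dirtied first block in this model would be
overwritten with zeros, which is the data loss described above.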
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently if btrfs has its block size (the older sector size) smaller
than the page size, btrfs_do_readpage() will handle the range extent by
extent, this is good for performance as it doesn't need to re-lookup the
same extent map again and again.
(Although get_extent_map() already does extra cached em check, thus
the optimization is not that obvious.)
This is totally fine and is a valid optimization, but it has an
assumption that there is no partial uptodate range in the page.
Meanwhile there is an incoming feature, requiring btrfs to skip the full
page read if a buffered write range covers a full block but not a full
page.
In that case, we can have a page that is partially uptodate, and the
current per-extent lookup cannot handle such case.
So here we change btrfs_do_readpage() to do block-by-block read, this
simplifies the following things:
- Remove the need for @iosize variable
Because we just use sectorsize as our increment.
- Remove @pg_offset, and calculate it inside the loop when needed
It's just offset_in_folio().
- Use a for() loop instead of a while() loop
This will slightly reduce the read performance for subpage cases, but for
the future where we need to skip already uptodate blocks, it should still
be worth it.
For block size == page size, this brings no performance change.
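The shape of such a block-by-block loop can be sketched as below. This is a
userspace approximation, not the code in btrfs_do_readpage(): the function
name and the output array are hypothetical, and the in-folio offset is
computed with a mask under the assumption that both sizes are powers of two.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Walk a folio one block at a time and record the in-folio offset of each
 * block. Returns the number of blocks visited.
 */
static unsigned int demo_block_offsets(uint64_t folio_start, uint32_t folio_size,
				       uint32_t blocksize, uint32_t *offsets,
				       unsigned int max_offsets)
{
	const uint64_t end = folio_start + folio_size;
	unsigned int nr = 0;

	/* Assumption of this sketch: power-of-two sizes, blocksize <= folio_size. */
	assert(blocksize != 0 && (blocksize & (blocksize - 1)) == 0);
	assert(folio_size >= blocksize && (folio_size & (folio_size - 1)) == 0);

	/* The block size is the only increment, so no @iosize is needed. */
	for (uint64_t cur = folio_start; cur < end && nr < max_offsets;
	     cur += blocksize) {
		/* Calculated inside the loop instead of tracking @pg_offset. */
		offsets[nr++] = (uint32_t)(cur & (folio_size - 1));
	}
	return nr;
}
```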
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we're using btrfs_lock_and_flush_ordered_range() for both
btrfs_read_folio() and btrfs_readahead(), but it has one critical
problem for future subpage optimizations:
- It will call btrfs_start_ordered_extent() to writeback the involved
folios
But remember we're calling btrfs_lock_and_flush_ordered_range() at
read paths, meaning the folio is already locked by read path.
If we really trigger writeback for those already locked folios, this
will lead to a deadlock, as writeback cannot get the folio lock.
Such a deadlock is prevented by the fact that btrfs always keeps a
dirty folio also uptodate, by either dirtying all blocks of the folio,
or by reading the whole folio before dirtying.
To prepare for the incoming patch which allows btrfs to skip full folio
read if the buffered write is block aligned, we have to start by solving
the possible deadlock first.
Instead of blindly calling btrfs_start_ordered_extent(), introduce a
new helper, which is smarter in the following ways:
- Only wait and flush the ordered extent if
* The folio doesn't even have private bit set
* Part of the blocks of the ordered extent are not uptodate
This can happen by:
* The folio writeback finished, then got invalidated.
There are a lot of reasons that a folio can get invalidated,
from memory pressure to direct IO (which invalidates all folios
of the range).
But OE not yet finished.
We have to wait for the ordered extent, as the OE may contain
to-be-inserted data checksum.
Without waiting, our read can fail due to the missing checksum.
But either way, the OE should not need any extra flush inside the
locked folio range.
- Skip the ordered extent completely if
* All the blocks are dirty
This happens when OE creation is caused by a folio writeback whose
file offset is before our folio.
E.g. 16K page size and 4K block size
    0       8K      16K     24K     32K
    |//////////////||///////|       |
The writeback of folio 0 created an OE for range [0, 24K), but since
folio 16K is not fully uptodate, a read is triggered for folio 16K.
The writeback will never happen (we're holding the folio lock for
read), nor will the OE finish.
Thus we must skip the range.
* All the blocks are uptodate
This happens when the writeback finished, but OE not yet finished.
Since the blocks are already uptodate, we can skip the OE range.
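One way to read the rules above is as a predicate over per-block state. The
following is a simplified userspace sketch under that reading; the function
name, the boolean arrays and the has_private flag are illustrative
assumptions and do not mirror the kernel's data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Decide whether an ordered extent overlapping the locked folio can be
 * skipped. dirty[] and uptodate[] describe the blocks of the folio that the
 * ordered extent covers.
 */
static bool demo_can_skip_ordered_extent(bool has_private, const bool *dirty,
					 const bool *uptodate, size_t nr_blocks)
{
	bool all_dirty = true;
	bool all_uptodate = true;

	assert(nr_blocks > 0);

	/* No private bit means no per-block state to trust: wait for the OE. */
	if (!has_private)
		return false;

	for (size_t i = 0; i < nr_blocks; i++) {
		if (!dirty[i])
			all_dirty = false;
		if (!uptodate[i])
			all_uptodate = false;
	}

	/* Anything else (a partially uptodate range) requires waiting. */
	return all_dirty || all_uptodate;
}
```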
The new helper lock_extents_for_read() will do a loop for the target
range by:
1) Lock the full range
2) If there is no ordered extent in the remaining range, exit
3) If there is an ordered extent that we can skip
Skip to the end of the OE, and continue checking
We do not trigger writeback nor wait for the OE.
4) If there is an ordered extent that we cannot skip
Unlock the whole extent range and start the ordered extent.
And also update btrfs_start_ordered_extent() to add two more parameters:
@nowriteback_start and @nowriteback_len, to prevent triggering flush for
a certain range.
This will allow us to handle the following case properly in the future:
16K page size, 4K btrfs block size:
    0       4K      8K      12K     16K     20K     24K     28K     32K
    |/////////////////////////////||////////////////|       |       |
    |<-------------------- OE 2 ------------------->|       |< OE 1 >|
The folio has been written back before, thus we have an OE at
[28K, 32K).
Although the OE 1 finished its IO, the OE is not yet removed from IO
tree.
The folio got invalidated after writeback completed and before the
ordered extent finished.
And [16K, 24K) range is dirty and uptodate, caused by a block aligned
buffered write (and future enhancements allowing btrfs to skip full
folio read for such case).
But writeback for folio 0 has begun, thus it generated OE 2, covering
range [0, 24K).
Since the full folio 16K is not uptodate, if we want to read the folio,
the existing btrfs_lock_and_flush_ordered_range() will deadlock, by:
btrfs_read_folio()
| Folio 16K is already locked
|- btrfs_lock_and_flush_ordered_range()
|- btrfs_start_ordered_extent() for range [16K, 24K)
|- filemap_fdatawrite_range() for range [16K, 24K)
|- extent_write_cache_pages()
folio_lock() on folio 16K, deadlock.
But now we will have the following sequence:
btrfs_read_folio()
| Folio 16K is already locked
|- lock_extents_for_read()
|- can_skip_ordered_extent() for range [16K, 24K)
| Returned true, the range [16K, 24K) will be skipped.
|- can_skip_ordered_extent() for range [28K, 32K)
| Returned false.
|- btrfs_start_ordered_extent() for range [28K, 32K) with
[16K, 32K) as no writeback range
No writeback for folio 16K will be triggered.
And there will be no more possible deadlock on the same folio.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Inside function __cow_file_range_inline() since the inlined data no
longer take any data space, we need to free up the reserved space.
However the code is still using the old page size == sector size
assumption, and will not handle subpage case well.
Thankfully it is not going to cause any problems because we have two extra
safety nets:
- Inline data extents creation is disabled for sector size < page size
cases for now
But it won't stay that way for long.
- btrfs_qgroup_free_data() will only clear ranges which have been already
reserved
So even if we pass a range larger than what we need, it should still
be fine, especially since there is only reserved space for a single block
at file offset 0 of an inline data extent.
But just for the sake of consistency, fix the call site to use
sectorsize instead of page size.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently reading an inline data extent will zero out the remaining
range in the page.
This is not yet causing problems even for block size < page size
(subpage) cases because:
1) An inline data extent always starts at file offset 0
Meaning at page read, we always read the inline extent first, before
any other blocks in the page. Then later blocks are properly read out
and re-fill the zeroed out ranges.
2) Currently btrfs will read out the whole page if a buffered write is
not page aligned
So a page is either fully uptodate at buffered write time (covers the
whole page), or we will read out the whole page first.
Meaning there is nothing to lose for such an inline extent read.
But it's still not ideal:
- We're zeroing out the page twice
Once done by read_inline_extent()/uncompress_inline(), once done by
btrfs_do_readpage() for ranges beyond i_size.
- We're touching blocks that don't belong to the inline extent
In the incoming patches, we can have a partial uptodate folio, of
which some dirty blocks can exist while the page is not fully uptodate:
The page size is 16K and block size is 4K:
    0         4K        8K        12K       16K
    |         |         |/////////|         |
And range [8K, 12K) is dirtied by a buffered write, the remaining
blocks are not uptodate.
If range [0, 4K) contains an inline data extent, and we try to read
the whole page, the current behavior will overwrite range [8K, 12K)
with zero and cause data loss.
So to make the behavior more consistent and in preparation for future
changes, limit the inline data extents read to only zero out the range
inside the first block, not the whole page.
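The intended behavior can be modeled in a few lines of userspace C. This is
a sketch only: demo_read_inline() and its parameters are hypothetical, and
the folio is just a byte buffer here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy inline data to the start of the folio and zero only the remainder of
 * the first block. Blocks beyond the first one are left untouched.
 */
static void demo_read_inline(unsigned char *folio_data, size_t blocksize,
			     const unsigned char *inline_data, size_t inline_len)
{
	assert(inline_len <= blocksize);

	memcpy(folio_data, inline_data, inline_len);
	memset(folio_data + inline_len, 0, blocksize - inline_len);
}
```

In this model a dirty block at a later offset keeps its content, which is
the property the partially uptodate folio case needs.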
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We now parse human-friendly size values (e.g. '1G', '2M') when setting
read policies.
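As an illustration of what accepting such values means, here is a minimal
userspace parser. It is a sketch and not the parsing code used by the
patch: demo_parse_size() is a hypothetical name, and only the K, M and G
suffixes (binary multiples) are handled.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Parse "4096", "1G", "2M", ... into a byte count. Returns 0 or -errno. */
static int demo_parse_size(const char *str, uint64_t *result)
{
	char *end;
	unsigned long long value;
	unsigned int shift = 0;

	assert(str != NULL && result != NULL);

	if (!isdigit((unsigned char)str[0]))
		return -EINVAL;

	errno = 0;
	value = strtoull(str, &end, 10);
	if (errno == ERANGE)
		return -ERANGE;

	switch (*end) {
	case 'K': case 'k': shift = 10; end++; break;
	case 'M': case 'm': shift = 20; end++; break;
	case 'G': case 'g': shift = 30; end++; break;
	case '\0': break;
	default: return -EINVAL;
	}
	if (*end != '\0')
		return -EINVAL;		/* trailing garbage after the suffix */
	if (value > (UINT64_MAX >> shift))
		return -ERANGE;		/* the multiplication would overflow */

	*result = (uint64_t)value << shift;
	return 0;
}
```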
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is the trivial pattern for path auto free, initialize at the
beginning and free at the end with simple goto -> return conversions.
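For readers unfamiliar with the pattern, the idea can be approximated in
userspace with the GCC/Clang cleanup attribute. Everything named demo_* and
DEMO_PATH_AUTO_FREE below is hypothetical and only mimics the shape of the
conversion; it is not the kernel macro.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

struct demo_path { int slots[8]; };

static int demo_live_paths;	/* paths allocated and not yet freed */

static struct demo_path *demo_alloc_path(void)
{
	struct demo_path *path = calloc(1, sizeof(*path));

	if (path)
		demo_live_paths++;
	return path;
}

/* Cleanup handlers receive a pointer to the annotated variable. */
static void demo_free_pathp(struct demo_path **pathp)
{
	assert(pathp != NULL);
	if (*pathp) {
		free(*pathp);
		demo_live_paths--;
	}
}

/* Initialize at the beginning; freed automatically when leaving the scope. */
#define DEMO_PATH_AUTO_FREE(name) \
	struct demo_path *name __attribute__((cleanup(demo_free_pathp))) = NULL

static int demo_lookup(int key)
{
	DEMO_PATH_AUTO_FREE(path);

	path = demo_alloc_path();
	if (!path)
		return -ENOMEM;
	if (key < 0)
		return -EINVAL;	/* plain return, no "goto out" needed */
	path->slots[0] = key;
	return 0;
}
```

Every return in demo_lookup() releases the path, which is why the goto
labels can be turned into direct returns.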
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is the trivial pattern for path auto free, initialize at the
beginning and free at the end with simple goto -> return conversions.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is the trivial pattern for path auto free, initialize at the
beginning and free at the end with simple goto -> return conversions.
This applies to both path and path2.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is the trivial pattern for path auto free, initialize at the
beginning and free at the end with simple goto -> return conversions.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is the trivial pattern for path auto free, initialize at the
beginning and free at the end.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is the trivial pattern for path auto free, initialize at the
beginning and free at the end with simple goto -> return conversions.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is the trivial pattern for path auto free, initialize at the
beginning and free at the end with simple goto -> return conversions.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is the trivial pattern for path auto free, initialize at the
beginning and free at the end with some return simplifications.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is the trivial pattern for path auto free, initialize at the
beginning and free at the end with simple goto -> return conversions.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is the trivial pattern for path auto free, initialize at the
beginning and free at the end with simple goto -> return conversions.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is the trivial pattern for path auto free, initialize at the
beginning and free at the end with simple goto -> return conversions.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is the trivial pattern for path auto free, initialize at the
beginning and free at the end with simple goto -> return conversions.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is the trivial pattern for path auto free, initialize at the
beginning and free at the end with simple goto -> return conversions.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The most trivial pattern for the auto freeing when the variable is
declared with the macro and the final btrfs_free_path() is removed.
There are almost no goto -> return conversions and there's no other
function cleanup.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
As the helper num_extent_folios() is now __pure, we can use it in for
loop without storing its value in a variable explicitly, the compiler
will do this for us.
The effect on btrfs.ko is -200 bytes and there are stack space savings
too:
btrfs_clone_extent_buffer -8 (32 -> 24)
btrfs_clear_buffer_dirty -8 (48 -> 40)
clear_extent_buffer_uptodate -8 (40 -> 32)
set_extent_buffer_dirty -8 (32 -> 24)
write_one_eb -8 (88 -> 80)
set_extent_buffer_uptodate -8 (40 -> 32)
read_extent_buffer_pages_nowait -16 (64 -> 48)
find_extent_buffer -8 (32 -> 24)
Signed-off-by: David Sterba <dsterba@suse.com>
The functions qualify for the pure attribute as they always return the
same value for the same argument (in the given scope). This allows the
compiler to optimize the calls and cache the value.
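A small userspace illustration of the attribute, using the GCC/Clang
spelling. struct demo_eb and both functions are made-up stand-ins, not the
extent buffer helpers themselves.

```c
#include <assert.h>

struct demo_eb {
	unsigned int len;
	unsigned int folio_size;
};

/*
 * No side effects, and the result depends only on the argument and the
 * memory it points to, so the compiler may reuse an earlier result.
 */
static __attribute__((pure)) unsigned int demo_num_folios(const struct demo_eb *eb)
{
	return (eb->len + eb->folio_size - 1) / eb->folio_size;
}

static unsigned int demo_count_folios(const struct demo_eb *eb)
{
	unsigned int visited = 0;

	assert(eb->folio_size != 0);

	/* The call can sit in the loop condition without a local variable. */
	for (unsigned int i = 0; i < demo_num_folios(eb); i++)
		visited++;
	return visited;
}
```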
Signed-off-by: David Sterba <dsterba@suse.com>
Unlike folio helpers for data the ones for metadata always take the
extent buffer start and length, so they can be simplified to take the
eb only. The fs_info can be obtained from eb too so it can be dropped
as parameter.
Added in patch "btrfs: use metadata specific helpers to simplify extent
buffer helpers".
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We are considering the used bytes counter of a block group as the amount
to update the space info's reclaim bytes counter after relocating the
block group, but this value alone is often not enough. This is because we
may have a reserved extent (or more) and in that case its size is
reflected in the reserved counter of the block group - the size of the
extent is only transferred from the reserved counter to the used counter
of the block group when the delayed ref for the extent is run - typically
when committing the transaction (or when flushing delayed refs due to
ENOSPC on space reservation). Such call chain for data extents is:
btrfs_run_delayed_refs_for_head()
run_one_delayed_ref()
run_delayed_data_ref()
alloc_reserved_file_extent()
alloc_reserved_extent()
btrfs_update_block_group()
-> transfers the extent size from the reserved
counter to the used counter
For metadata extents:
btrfs_run_delayed_refs_for_head()
run_one_delayed_ref()
run_delayed_tree_ref()
alloc_reserved_tree_block()
alloc_reserved_extent()
btrfs_update_block_group()
-> transfers the extent size from the reserved
counter to the used counter
Since relocation flushes delalloc, waits for ordered extent completion
and commits the current transaction before doing the actual relocation
work, the correct amount of reclaimed space is therefore the sum of the
"used" and "reserved" counters of the block group before we call
btrfs_relocate_chunk() at btrfs_reclaim_bgs_work().
So fix this by taking the "reserved" counter into consideration.
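The arithmetic is simple enough to state as code. This is a sketch with
hypothetical names; in the kernel the two counters would be read while
holding the block group's spinlock, which this userspace model omits.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for the two block group counters discussed above. */
struct demo_block_group {
	uint64_t used;
	uint64_t reserved;
};

/* Bytes expected to be reclaimed by relocating the block group. */
static uint64_t demo_reclaimed_bytes(const struct demo_block_group *bg)
{
	/* Guard against wrap-around in this model. */
	assert(bg->used <= UINT64_MAX - bg->reserved);
	return bg->used + bg->reserved;
}
```

For example, a block group with 1M used and 256K still reserved would
account for 1280K once the reserved extent is turned into a used one.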
Fixes: 243192b676 ("btrfs: report reclaim stats in sysfs")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At btrfs_reclaim_bgs_work(), we are grabbing twice the used bytes counter
of the block group while not holding the block group's spinlock. This can
result in races, reported by KCSAN and similar tools, since a concurrent
task can be updating that counter while at btrfs_update_block_group().
So avoid these races by grabbing the counter in a critical section
delimited by the block group's spinlock after setting the block group to
RO mode. This also avoids using two different values of the counter in
case it changes in between each read. This silences KCSAN and is required
for the next patch in the series too.
Fixes: 243192b676 ("btrfs: report reclaim stats in sysfs")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At btrfs_reclaim_bgs_work(), we are grabbing a block group's zone unusable
bytes while not under the protection of the block group's spinlock, so
this can trigger race reports from KCSAN (or similar tools) since that
field is typically updated while holding the lock, such as at
__btrfs_add_free_space_zoned() for example.
Fix this by grabbing the zone unusable bytes while we are still in the
critical section holding the block group's spinlock, which is right above
where we are currently grabbing it.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
After previous patch removing nodesize from parameters,
__alloc_dummy_extent_buffer() and alloc_dummy_extent_buffer() are
identical so we can drop one.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All callers pass a valid fs_info so we can read the nodesize from that
instead of passing it as parameter.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's no longer any need for the 'out' label as there are no resources
to clean up anymore in case of an error and we can directly return if
begin_cmd() fails.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Whenever we issue a command we allocate a path and then compute it. For
the current inode this is not necessary since we have one preallocated
and computed in the send context structure, so we can use it instead
and avoid allocating and freeing a path.
For example if we have 100 extents to send (100 write commands) for a
file, we are allocating and freeing paths 100 times.
So improve on this by avoiding path allocation and freeing whenever a
command is for the current inode by using the current inode's path
stored in the send context structure.
A test was run before applying this patch and the previous one in the
series:
"btrfs: send: keep the current inode's path cached"
The test script is the following:
$ cat test.sh
#!/bin/bash
DEV=/dev/nullb0
MNT=/mnt/nullb0
mkfs.btrfs -f $DEV > /dev/null
mount $DEV $MNT
DIR="$MNT/one/two/three/four"
FILE="$DIR/foobar"
mkdir -p $DIR
# Create some empty files to get a deeper btree and therefore make
# path computations slower.
for ((i = 1; i <= 30000; i++)); do
echo -n > "$DIR/filler_$i"
done
for ((i = 0; i < 10000; i += 2)); do
offset=$(( i * 4096 ))
xfs_io -f -c "pwrite -S 0xab $offset 4K" $FILE > /dev/null
done
btrfs subvolume snapshot -r $MNT $MNT/snap
start=$(date +%s%N)
btrfs send -f /dev/null $MNT/snap
end=$(date +%s%N)
echo -e "\nsend took $(( (end - start) / 1000000 )) milliseconds"
umount $MNT
Result before applying the 2 patches: 1121 milliseconds
Result after applying the 2 patches: 815 milliseconds (-31.6%)
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Whenever we need to send a command for the current inode, like sending
writes, xattr updates, truncates, utimes, etc, we compute the inode's
path each time, which implies doing some memory allocations and traversing
the inode hierarchy to extract the name of the inode and each ancestor
directory, and that implies doing lookups in the subvolume tree amongst
other operations.
Most of the time, by far, the current inode's path doesn't change while
we are processing it (like if we need to issue 100 write commands, the
path remains the same and it's pointless to compute it 100 times).
To avoid this keep the current inode's path cached in the send context
and invalidate it or update it whenever it's needed (after unlinks or
renames).
A performance test, and its results, is mentioned in the next patch in
the series (subject: "btrfs: send: avoid path allocation for the current
inode when issuing commands").
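The caching idea can be sketched with a small userspace model. The struct
and function names are hypothetical and the path computation is a
placeholder for the real walk over the inode hierarchy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

struct demo_send_ctx {
	unsigned long cur_ino;
	char cur_path[64];
	bool path_valid;
	unsigned int nr_computed;	/* how many times the path was rebuilt */
};

/* Stand-in for the expensive path computation. */
static void demo_compute_path(struct demo_send_ctx *sctx)
{
	snprintf(sctx->cur_path, sizeof(sctx->cur_path), "dir/inode_%lu",
		 sctx->cur_ino);
	sctx->path_valid = true;
	sctx->nr_computed++;
}

/* Return the cached path, computing it only when the cache is invalid. */
static const char *demo_get_cur_path(struct demo_send_ctx *sctx)
{
	assert(sctx != NULL);
	if (!sctx->path_valid)
		demo_compute_path(sctx);
	return sctx->cur_path;
}

/* To be called after anything that may change the path (rename, unlink). */
static void demo_invalidate_cur_path(struct demo_send_ctx *sctx)
{
	sctx->path_valid = false;
}
```

Issuing 100 commands for the same inode in this model computes the path
once, and again only after an invalidation.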
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is no need to have an 'out' label and jump into it since there are
no resource cleanups to perform (release locks, free memory, etc), so
make this simpler by removing the label and goto and instead return
directly.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is no need to have an 'out' label and jump into it since there are
no resource cleanups to perform (release locks, free memory, etc), so
make this simpler by removing the label and goto and instead return
directly.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is no need to have an 'out' label and jump into it since there are
no resource cleanups to perform (release locks, free memory, etc), so
make this simpler by removing the label and goto and instead return
directly.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is no need to have an 'out' label and jump into it since there are
no resource cleanups to perform (release locks, free memory, etc), so
make this simpler by removing the label and goto and instead return
directly.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's no need for the 'out' label as there are no resources to cleanup
in case of an error and we can directly return if begin_cmd() fails.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is no need to have an 'out' label and jump into it since there are
no resource cleanups to perform (release locks, free memory, etc), so
make this simpler by removing the label and goto and instead return
directly.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's no need for the 'ret' variable, we can just return directly the
result of the call to iterate_dir_item().
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is no need to have an 'out' label and jump into it since there are
no resource cleanups to perform (release locks, free memory, etc), so
make this simpler by removing the label and goto and instead return
directly.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is no need to have an 'out' label and jump into it since there are
no resource cleanups to perform (release locks, free memory, etc), so
make this simpler by removing the label and goto and instead return
directly.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is no need to have an 'out' label and jump into it since there are
no resource cleanups to perform (release locks, free memory, etc), so
make this simpler by removing the label and goto and instead return
directly.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is no need to have an 'out' label and jump into it since there are
no resource cleanups to perform (release locks, free memory, etc), so
make this simpler by removing the label and goto and instead return
directly.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is no need to have an 'out' label and jump into it since there are
no resource cleanups to perform (release locks, free memory, etc), so
make this simpler by removing the label and goto and instead return
directly.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's no need for the 'out' label as there are no resources to cleanup
in case of an error and we can directly return if begin_cmd() fails.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Extract the logic to rename the current inode at process_recorded_refs()
into a helper function and use it, therefore removing duplicated logic
and making it easier for an upcoming patch by avoiding yet more duplicated
logic.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have several local variables at process_recorded_refs() that are used
as booleans, with some of them having a 'bool' type while two of them
having an 'int' type. Change this to make them all use the 'bool' type
which is clearer and makes everything more consistent.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We always send xattrs for the current inode only and both callers of
send_set_xattr() pass a path for the current inode. So move the path
allocation and computation to send_set_xattr(), reducing duplicated
code. This also facilitates an upcoming patch.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is no need to have an 'out' label and jump into it since there are
no resource cleanups to perform (release locks, free memory, etc), so
make this simpler by removing the label and goto and instead return
directly.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is no need to have an 'out' label and jump into it since there are
no resource cleanups to perform (release locks, free memory, etc), so
make this simpler by removing the label and goto and instead return
directly.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is no need to have an 'out' label and jump into it since there are
no resource cleanups to perform (release locks, free memory, etc), so
make this simpler by removing the label and goto and instead return
directly.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When attempting to build a too long path we are currently returning
-ENOMEM, which is very odd and misleading. So update fs_path_ensure_buf()
to return -ENAMETOOLONG instead. Also, while at it, move the WARN_ON()
into the if statement's expression, as it makes it clear what is being
tested and also has the effect of adding 'unlikely' to the statement,
which allows the compiler to generate better code as this condition is
never expected to happen.
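A userspace sketch of the resulting pattern. SKETCH_WARN_ON() is a
simplified stand-in for the kernel's WARN_ON(), it assumes the GCC/Clang
__builtin_expect() builtin, and the limit and function name are made up for
illustration:

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

#define SKETCH_PATH_MAX 4096 /* assumed limit for this sketch */

/*
 * Stand-in for WARN_ON(): prints a warning when the condition is true and
 * evaluates to the condition, hinting the compiler that it is unlikely.
 */
#define SKETCH_WARN_ON(cond) \
	(__builtin_expect(!!(cond), 0) ? \
	 (fprintf(stderr, "warning: %s\n", #cond), 1) : 0)

static int sketch_path_ensure_buf(size_t len)
{
	/* The tested condition is visible right in the if statement. */
	if (SKETCH_WARN_ON(len > SKETCH_PATH_MAX))
		return -ENAMETOOLONG;
	/* ... grow the buffer if needed ... */
	return 0;
}
```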
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is no need to have an 'out' label and jump into it since there are
no resource cleanups to perform (release locks, free memory, etc), so
make this simpler by removing the label and goto and instead return
directly.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The helper fs_path_add_path() is basically a copy of fs_path_add() and it
can be made a wrapper around fs_path_add(). So do that and also make it
inline and constify its second argument.
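A simplified model of the wrapper relationship. The struct and the
sketch_path_*() helpers below are hypothetical, the real struct fs_path and
its helpers are more involved:

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Simplified, hypothetical path buffer. */
struct sketch_path {
	char buf[64];
	int len;
};

static inline int sketch_path_len(const struct sketch_path *p)
{
	return p->len;
}

/* Append a name of a given length, inserting '/' between components. */
static int sketch_path_add(struct sketch_path *p, const char *name,
			   int name_len)
{
	int sep = p->len ? 1 : 0;

	if (p->len + sep + name_len >= (int)sizeof(p->buf))
		return -ENAMETOOLONG;
	if (sep)
		p->buf[p->len++] = '/';
	memcpy(p->buf + p->len, name, name_len);
	p->len += name_len;
	p->buf[p->len] = '\0';
	return 0;
}

/* The wrapper: appending a whole path is just appending its bytes. */
static inline int sketch_path_add_path(struct sketch_path *p,
				       const struct sketch_path *p2)
{
	return sketch_path_add(p, p2->buf, sketch_path_len(p2));
}
```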
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is no need to have an 'out' label and jump into it since there are
no resource cleanups to perform (release locks, free memory, etc), so
make this simpler by removing the label and goto and instead return
directly.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is no need to have an 'out' label and jump into it since there are
no resource cleanups to perform (release locks, free memory, etc), so
make this simpler by removing the label and goto and instead return
directly.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Several places are hardcoding the path length calculation instead of using
the helper fs_path_len() for that. Update all those places to instead use
fs_path_len().
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The helper function fs_path_len() is trivial and doesn't need to change
its path argument, so make it inline and constify the argument.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's duplicated logic in both branches of the if statement, so move it
outside the branches.
This also reduces the object code size.
Before this change:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename
1746279 163600 16920 1926799 1d668f fs/btrfs/btrfs.ko
After this change:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename
1746047 163592 16920 1926559 1d659f fs/btrfs/btrfs.ko
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use a struct btrfs_inode in btrfs_get_name() as it's an internal
helper, allowing to remove some use of BTRFS_I.
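For illustration, a userspace model of why taking the btrfs inode directly
saves conversions. All sketch_* names are hypothetical, and SKETCH_BTRFS_I()
only imitates the container_of() style conversion that BTRFS_I() performs:

```c
#include <assert.h>
#include <stddef.h>

struct sketch_vfs_inode {
	unsigned long i_ino;
};

/* The filesystem specific inode embeds the generic one. */
struct sketch_btrfs_inode {
	unsigned long long generation;
	struct sketch_vfs_inode vfs_inode;
};

/* Convert from the embedded generic inode back to its container. */
static inline struct sketch_btrfs_inode *
SKETCH_BTRFS_I(struct sketch_vfs_inode *inode)
{
	return (struct sketch_btrfs_inode *)((char *)inode -
		offsetof(struct sketch_btrfs_inode, vfs_inode));
}

/* Internal helper taking the btrfs inode: no conversion needed inside. */
static unsigned long long
sketch_get_generation(const struct sketch_btrfs_inode *inode)
{
	return inode->generation;
}
```

Callers that already hold the btrfs inode can pass it as is, and the
conversion happens at most once at the outer interface.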
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use a struct btrfs_inode in btrfs_get_parent() as it's an internal
helper, allowing to remove some use of BTRFS_I.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use a struct btrfs_inode in btrfs_remap_file_range_prep() as it's an
internal helper, allowing to remove some use of BTRFS_I.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use a struct btrfs_inode in btrfs_remap_file_range() as it's an internal
helper, allowing to remove some use of BTRFS_I.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pass a struct btrfs_inode to btrfs_extent_same_range() as it's an
internal interface, allowing to remove some use of BTRFS_I.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pass a struct btrfs_inode to btrfs_double_mmap_unlock() as it's an
internal interface, allowing to remove some use of BTRFS_I.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pass a struct btrfs_inode to btrfs_double_mmap_lock() as it's an
internal interface, allowing to remove some use of BTRFS_I.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pass a struct btrfs_inode to clone_copy_inline_extent() as it's an
internal interface, allowing to remove some use of BTRFS_I.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pass a struct btrfs_inode to the extract() callback as it's an internal
interface, allowing to remove some use of BTRFS_I.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pass a struct btrfs_inode to the apply() callback as it's an internal
interface, allowing to remove some use of BTRFS_I.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pass a struct btrfs_inode to btrfs_inherit_props() as it's an internal
interface, allowing to remove some use of BTRFS_I.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pass a struct btrfs_inode to btrfs_load_inode_props() as it's an
internal interface, allowing to remove some use of BTRFS_I.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pass a struct btrfs_inode to btrfs_fill_inode() as it's an internal
interface, allowing to remove some use of BTRFS_I.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pass a struct btrfs_inode to fill_stack_inode_item() as it's an internal
interface, allowing to remove some use of BTRFS_I.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use a struct btrfs_inode in create_pending_snapshot() as it's an
internal helper, allowing to remove some use of BTRFS_I.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pass a struct btrfs_inode to btrfs_defrag_file() as it's an internal
interface, allowing to remove some use of BTRFS_I.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pass a struct btrfs_inode to btrfs_inode() as it's an internal
interface, allowing to remove some use of BTRFS_I.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pass a struct btrfs_inode to new_simple_dir() as it's an internal
interface, allowing to remove some use of BTRFS_I.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pass a struct btrfs_inode to btrfs_inode() as it's an internal
interface, allowing to remove some use of BTRFS_I.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pass a struct btrfs_inode to btrfs_read_locked_inode() as it's an
internal interface, allowing to remove some use of BTRFS_I.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pass a struct btrfs_inode to extent_range_clear_dirty_for_io() as it's
an internal interface, allowing to remove some use of BTRFS_I.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pass a struct btrfs_inode to can_nocow_extent() as it's an internal
interface, allowing to remove some use of BTRFS_I.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's a label that does nothing else than return, so remove it and
also change other gotos to immediate returns as the function is short
enough for this pattern.
Signed-off-by: David Sterba <dsterba@suse.com>
The btrfs_key is defined as objectid/type/offset and the keys are also
printed like that. For better readability, update all key
initializations to match this order.
Signed-off-by: David Sterba <dsterba@suse.com>
When using offset_in_page() it's clear what it means, we don't need to
store it in the local variable just to use it right away. There's no
change in the generated code, but keeps the declarations smaller.
Signed-off-by: David Sterba <dsterba@suse.com>
Reduce stack consumption of zstd_compress_folios() by 40 bytes
(10*sizeof(int)) as we can store struct zstd_parameters in the workspace
that is reused for each call.
typedef struct {
ZSTD_compressionParameters cParams;
ZSTD_frameParameters fParams;
} ZSTD_parameters;
typedef struct {
unsigned windowLog;
unsigned chainLog;
unsigned hashLog;
unsigned searchLog;
unsigned minMatch;
unsigned targetLength;
ZSTD_strategy strategy;
} ZSTD_compressionParameters;
typedef struct {
int contentSizeFlag;
int checksumFlag;
int noDictIDFlag;
} ZSTD_frameParameters;
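The 40 byte figure can be checked with local copies of the layouts above.
This is a sketch only: the sketch_* names and the placeholder enum are made
up, and the result assumes an ABI where an enum has the size of an int.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder enum, the real strategy enum has more values. */
typedef enum { SKETCH_STRATEGY_PLACEHOLDER = 1 } sketch_strategy;

typedef struct {
	unsigned windowLog;
	unsigned chainLog;
	unsigned hashLog;
	unsigned searchLog;
	unsigned minMatch;
	unsigned targetLength;
	sketch_strategy strategy;
} sketch_compression_params;

typedef struct {
	int contentSizeFlag;
	int checksumFlag;
	int noDictIDFlag;
} sketch_frame_params;

typedef struct {
	sketch_compression_params cParams;
	sketch_frame_params fParams;
} sketch_params;

/* Hypothetical workspace owning the parameters instead of the stack. */
struct sketch_workspace {
	sketch_params params;
};

/* Bytes no longer needed on the stack once the workspace holds them. */
static size_t sketch_stack_bytes_saved(void)
{
	return sizeof(sketch_params);
}
```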
Signed-off-by: David Sterba <dsterba@suse.com>
If BTRFS_FS_NEED_ZONE_FINISH is already set for the whole filesystem, exit
early in btrfs_can_activate_zone(). There's no need to check if
BTRFS_FS_NEED_ZONE_FINISH needs to be set if it is already set.
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since we have btrfs_meta_is_subpage(), we should make btrfs_is_subpage()
data inode specific.
This change involves:
- Simplify btrfs_is_subpage()
Now we only need to do a very simple sectorsize check against
PAGE_SIZE.
And since the function is pretty simple now, just make it an inline
function.
- Add an extra ASSERT() to make sure btrfs_is_subpage() is only called
on data inode mapping
- Migrate btree_csum_one_bio() to use btrfs_meta_folio_*() helpers
- Migrate alloc_extent_buffer() to use btrfs_meta_folio_*() helpers
- Migrate end_bbio_meta_write() to use btrfs_meta_folio_*() helpers
Or we will trigger the ASSERT() due to calling btrfs_folio_*() on
metadata folios.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use a shared bio_add_folio_nofail() call with calculated
range_start/range_len, so that no explicit subpage routine is needed.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently inside write_one_eb() we have two different ways of handling
subpage and regular metadata.
The differences are:
- Extra offset/length calculation when adding the folio range to bio for
subpage cases
- Only decrease wbc->nr_to_write if the whole page is no longer dirty
for subpage cases
- Use subpage helper for subpage cases
Merge the two ways into a shared one:
- Always calculate the to-be-queued range
So that bio_add_folio() can use the same calculated resulted length
and offset for both cases.
- Use btrfs_meta_folio_clear_dirty() and
btrfs_meta_folio_set_writeback() helpers
This will cover both cases.
- Only decrease wbc->nr_to_write if the folio is no longer dirty
Since we have the folio locked, no one else can modify the folio dirty
flags (set_extent_buffer_dirty() will also lock the folio for subpage
cases).
Thus after our btrfs_meta_folio_clear_dirty() call, if the whole folio
is no longer dirty, we're submitting the last dirty eb of the folio,
and can decrease wbc->nr_to_write properly.
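The shared range calculation can be thought of as the intersection of the
extent buffer with the folio. A minimal userspace sketch, with hypothetical
names and plain integers instead of the real structures:

```c
#include <assert.h>
#include <stdint.h>

struct sketch_range {
	uint64_t start;
	uint32_t len;
};

static uint64_t sketch_max_u64(uint64_t a, uint64_t b)
{
	return a > b ? a : b;
}

static uint64_t sketch_min_u64(uint64_t a, uint64_t b)
{
	return a < b ? a : b;
}

/*
 * Part of the extent buffer [eb_start, eb_start + eb_len) that lives in
 * the folio [folio_pos, folio_pos + folio_size). The same formula covers
 * a small eb inside a larger folio and an eb spanning several folios.
 */
static struct sketch_range sketch_queue_range(uint64_t eb_start,
					      uint32_t eb_len,
					      uint64_t folio_pos,
					      uint32_t folio_size)
{
	uint64_t start = sketch_max_u64(eb_start, folio_pos);
	uint64_t end = sketch_min_u64(eb_start + eb_len,
				      folio_pos + folio_size);
	struct sketch_range range;

	assert(end > start); /* the eb must overlap the folio */
	range.start = start;
	range.len = (uint32_t)(end - start);
	return range;
}
```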
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function btrfs_clear_buffer_dirty() is called on dirty extent buffer
that will not be written back.
The function will call btree_clear_folio_dirty() to clear the folio
dirty flag and also clear PAGECACHE_TAG_DIRTY flag.
And we split the subpage and regular handling, as for subpage cases we
should only clear PAGECACHE_TAG_DIRTY if the last dirty extent buffer in
the page is cleared.
So here we can simplify the function by:
- Use the newly introduced btrfs_meta_folio_clear_and_test_dirty() helper
The helper will return true if we cleared the folio dirty flag.
With that we can use the same helper for both subpage and regular
cases.
- Rename btree_clear_folio_dirty() to btree_clear_folio_dirty_tag()
As we move the folio dirty clearing into btrfs_clear_buffer_dirty().
- Call btrfs_meta_folio_clear_and_test_dirty() to clear the dirty flags
for both regular and subpage metadata cases
- Only call btree_clear_folio_dirty_tag() when the folio is no longer
dirty
- Update the comment inside set_extent_buffer_dirty()
As there is no separate clear_subpage_extent_buffer_dirty() anymore.
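A toy model of the clear-and-test semantics, using one dirty bit per block
in a folio. The struct and the sketch_*() helpers are hypothetical and only
illustrate when the page cache tag may be cleared:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct sketch_folio {
	uint64_t dirty_bitmap; /* one bit per block, assumed at most 64 */
	bool tag_dirty;        /* models PAGECACHE_TAG_DIRTY */
};

/*
 * Clear the dirty bits of one extent buffer. Return true only if this
 * call made the whole folio clean. Assumes first_bit < 64.
 */
static bool sketch_clear_and_test_dirty(struct sketch_folio *folio,
					unsigned int first_bit,
					unsigned int nbits)
{
	uint64_t mask = (nbits >= 64 ? ~0ULL : ((1ULL << nbits) - 1))
			<< first_bit;
	bool was_dirty = folio->dirty_bitmap != 0;

	folio->dirty_bitmap &= ~mask;
	return was_dirty && folio->dirty_bitmap == 0;
}

/* Only drop the tag when the last dirty extent buffer is cleared. */
static void sketch_clear_buffer_dirty(struct sketch_folio *folio,
				      unsigned int first_bit,
				      unsigned int nbits)
{
	if (sketch_clear_and_test_dirty(folio, first_bit, nbits))
		folio->tag_dirty = false;
}
```

For regular metadata the extent buffer covers the whole folio, so the first
call already returns true and the same code path serves both cases.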
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The following functions are doing metadata specific checks:
- set_extent_buffer_uptodate()
- clear_extent_buffer_uptodate()
The reason why we do not use btrfs_folio_*() helpers for those helpers
is, btrfs_is_subpage() cannot handle dummy extent buffer if nodesize >=
PAGE_SIZE but block size < PAGE_SIZE.
In that case, we do not need to attach extra bitmaps to the extent
buffer folio. But since dummy extent buffer folios are not attached to
btree inode, btrfs_is_subpage() will return true, causing problems.
And the following are using btrfs_folio_*() helpers for metadata, but
in theory we should use metadata specific checks:
- set_extent_buffer_dirty()
This is not causing problems because a dummy extent buffer should never
be marked dirty.
To make the code simpler, introduce btrfs_meta_folio_*() helpers to do
the metadata specific handling, so that we do not need to open-code such
checks in the functions involved above.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently subpage attach/detach is not doing proper dummy extent buffer
subpage check, as btrfs_is_subpage() is not reliable for dummy extent
buffer folios.
Since we have a metadata specific check now, use that for
btrfs_attach_subpage() first.
Then enhance btrfs_detach_subpage() to accept a type parameter, so that
we can do extra checks for dummy extent buffers properly.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we have only one btrfs_is_subpage() to cover both data and
metadata.
But there is a special case for metadata:
- dummy extent buffer, sector size < PAGE_SIZE and node size >= PAGE_SIZE
In such case, btrfs_is_subpage() will return true for extent buffer
folio.
But that is not correct, and that's exactly why we have some open-coded
checks for functions like set_extent_buffer_uptodate() and
clear_extent_buffer_uptodate().
Just extract the metadata specific checks into a helper, and replace
those call sites.
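The difference between the two checks, as a sketch. The function names and
the 64K page size are assumptions for illustration, and the metadata check
is modelled as a plain nodesize comparison:

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_PAGE_SIZE 65536u /* assumed 64K pages for this sketch */

/* Data: subpage when a block is smaller than a page. */
static bool sketch_data_is_subpage(unsigned int sectorsize)
{
	return sectorsize < SKETCH_PAGE_SIZE;
}

/* Metadata: subpage only when a whole tree node is smaller than a page. */
static bool sketch_meta_is_subpage(unsigned int nodesize)
{
	return nodesize < SKETCH_PAGE_SIZE;
}
```

With 4K sectors and 64K nodes on 64K pages, the data check says subpage
while the metadata check does not, which is the special case described
above.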
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For the future large folio support, our filemap can have folios with
different sizes, thus we can no longer rely on a fixed blocks_per_page
value.
To prepare for that future, here we do:
- Remove btrfs_fs_info::sectors_per_page
- Introduce a helper, btrfs_blocks_per_folio()
Which uses the folio size to calculate the number of blocks for each
folio.
- Migrate the existing btrfs_fs_info::sectors_per_page to use that
helper
There are some exceptions:
* Metadata nodesize < page size support
In the future, even if we support large folios, we will only
allocate a folio that matches our nodesize.
Thus we won't have a folio covering multiple metadata unless
nodesize < page size.
* Existing subpage bitmap dump
We use a single unsigned long to store the bitmap.
That means until we change the bitmap dumping code, our upper limit
for folio size will only be 256K (4K block size, 64 bit unsigned
long).
* btrfs_is_subpage() check
This will be migrated into a future patch.
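The arithmetic behind the helper and the 256K limit, as a minimal sketch
(hypothetical function names, plain integers instead of a struct folio):

```c
#include <assert.h>
#include <stddef.h>

/* Number of blocks in a folio: folio size divided by the block size. */
static unsigned int sketch_blocks_per_folio(size_t folio_size,
					    unsigned int sectorsize_bits)
{
	return (unsigned int)(folio_size >> sectorsize_bits);
}

/*
 * Largest folio whose per-block bitmap still fits into a single word of
 * bits_per_long bits.
 */
static size_t sketch_max_folio_size(unsigned int sectorsize_bits,
				    unsigned int bits_per_long)
{
	return (size_t)bits_per_long << sectorsize_bits;
}
```

For a 4K block size (sectorsize_bits = 12) and a 64 bit unsigned long this
gives 64 * 4K = 256K.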
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The ordered extent cleanup is hard to grasp because it doesn't follow
the common cleanup-asap pattern.
E.g. run_delalloc_nocow() and cow_file_range() allocate one or more
ordered extent, but if any error is hit, the cleanup is done later inside
btrfs_run_delalloc_range().
To change the existing delayed cleanup:
- Update the comment on error handling of run_delalloc_nocow()
There are in fact 3 different cases rather than 2 if we are doing
ordered extents cleanup inside run_delalloc_nocow():
1) @cow_start and @cow_end not set
No fallback to COW at all.
Before @cur_offset we need to cleanup the OE and page dirty.
After @cur_offset just clear all involved page and extent flags.
2) @cow_start set but @cow_end not set.
This means we failed before even calling fallback_to_cow().
It's just a variant of case 1), where it's @cow_start splitting
the two parts (and we should just ignore @cur_offset since it's
advanced without any new ordered extent).
3) @cow_start and @cow_end both set
This means fallback_to_cow() failed, meaning [start, cow_start)
needs the regular OE and dirty folio cleanup, and skip range
[cow_start, cow_end) as cow_file_range() has done the cleanup,
and eventually cleanup [cow_end, end) range.
- Only reset @cow_start after fallback_to_cow() succeeded
As above case 2) and 3) are both relying on @cow_start to determine
the cleanup range.
- Move btrfs_cleanup_ordered_extents() into run_delalloc_nocow(),
cow_file_range() and nocow_one_range()
For cow_file_range() it's pretty straightforward and easy.
For run_delalloc_nocow() refer to the above 3 different error cases.
For nocow_one_range() if we hit an error, we need to cleanup the
ordered extents by ourselves.
And then it will fallback to case 1), since @cur_offset is not yet
advanced, the existing cleanup will co-operate with nocow_one_range()
well.
- Remove the btrfs_cleanup_ordered_extents() inside submit_uncompressed_range()
As failed cow_file_range() will do all the proper cleanup now.
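The three error cases of run_delalloc_nocow() can be summarized with the
sketch below. It is a model only: the struct, the function and the
SKETCH_UNSET sentinel are hypothetical, and ranges are plain integers.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_UNSET UINT64_MAX /* hypothetical "not set" marker */

struct sketch_cleanup_plan {
	uint64_t oe_cleanup_end; /* [start, oe_cleanup_end): OE + dirty cleanup */
	uint64_t clear_start;    /* [clear_start, end]: only clear flags */
};

static struct sketch_cleanup_plan
sketch_plan_cleanup(uint64_t cur_offset, uint64_t cow_start, uint64_t cow_end)
{
	struct sketch_cleanup_plan plan;

	if (cow_start == SKETCH_UNSET) {
		/* Case 1: no fallback to COW, split at @cur_offset. */
		plan.oe_cleanup_end = cur_offset;
		plan.clear_start = cur_offset;
	} else if (cow_end == SKETCH_UNSET) {
		/* Case 2: failed before the fallback, split at @cow_start. */
		plan.oe_cleanup_end = cow_start;
		plan.clear_start = cow_start;
	} else {
		/* Case 3: the fallback failed and cleaned [cow_start, cow_end). */
		plan.oe_cleanup_end = cow_start;
		plan.clear_start = cow_end;
	}
	return plan;
}
```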
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we're doing all the ordered extent and extent map generation
inside a while() loop of run_delalloc_nocow(). This makes it pretty
hard to read and to do proper error handling.
So move that part of code into a helper, nocow_one_range().
This should not change anything, but there is a tiny timing change where
btrfs_dec_nocow_writers() is only called after nocow_one_range() helper
exits.
This timing change is small, and makes error handling easier, thus
should be fine.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The address space flag AS_STABLE_WRITES determines whether FGP_STABLE will
wait for the folio to finish its writeback.
For btrfs, due to the default data checksum behavior, if we modify the
folio while it's still under writeback, it will cause data checksum
mismatch. Thus for quite some call sites we manually call
folio_wait_writeback() to prevent such problem from happening.
Currently there is only one call site inside btrfs really utilizing
FGP_STABLE, and in that case we also manually call folio_wait_writeback()
to do the waiting.
But it's better to properly expose the stable writes flag to a per-inode
basis, to allow call sites to fully benefit from FGP_STABLE flag.
E.g. for inodes with NODATASUM, this allows dirtying the page to begin
without waiting for writeback.
This involves:
- Update the mapping's stable write flag when setting/clearing NODATASUM
inode flag using ioctl
This only works for empty files, so it should be fine.
- Update the mapping's stable write flag when reading an inode from disk
- Remove the explicit folio_wait_writeback() for FGP_BEGINWRITE call
site
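The rule being enforced is simple: only inodes with data checksums need
stable folios. A userspace sketch with a made-up inode model (the flag
value, struct and function name are hypothetical):

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_INODE_NODATASUM (1u << 0) /* hypothetical flag value */

struct sketch_inode {
	unsigned int flags;
	bool mapping_stable_writes; /* models AS_STABLE_WRITES on the mapping */
};

/*
 * To be called when reading the inode from disk and whenever the
 * NODATASUM flag is set or cleared.
 */
static void sketch_update_stable_writes(struct sketch_inode *inode)
{
	inode->mapping_stable_writes =
		!(inode->flags & SKETCH_INODE_NODATASUM);
}
```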
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently for s390x HW zlib compression, to get the best performance we
need a buffer size which is larger than a page.
This means we need to copy multiple pages into workspace->buf, then use
that buffer as zlib compression input.
Currently the copy is hardcoded to use page sized folios, and all the
handling is deep inside a loop.
Refactor the code by:
- Introduce a dedicated helper to do the buffer copy
The new helper will be called copy_data_into_buffer().
- Add extra ASSERT()s
* Make sure we only go into the function for hardware acceleration
* Make sure we still get page sized folio
- Prepare for future large folios
This means we will rely on the folio size, rather than PAGE_SIZE, to do
the copy.
- Handle the folio mapping and unmapping inside the helper function
The s390x hardware acceleration case never utilizes the @data_in
pointer, thus we can do folio mapping/unmapping all inside the function.
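A userspace sketch of the copy loop, stepping by each folio's own size
instead of a fixed page size. The struct and function below are simplified
stand-ins and not the helper added by this patch:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified folio: a pointer to already mapped data plus its size. */
struct sketch_folio {
	const unsigned char *data;
	size_t size;
};

/*
 * Copy up to @length bytes from consecutive folios into @buf and return
 * the number of bytes copied.
 */
static size_t sketch_copy_data_into_buffer(const struct sketch_folio *folios,
					   size_t nr_folios,
					   unsigned char *buf, size_t length)
{
	size_t copied = 0;

	for (size_t i = 0; i < nr_folios && copied < length; i++) {
		size_t n = folios[i].size;

		if (n > length - copied)
			n = length - copied;
		memcpy(buf + copied, folios[i].data, n);
		copied += n;
	}
	return copied;
}
```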
Acked-by: Mikhail Zaslonko <zaslonko@linux.ibm.com>
Tested-by: Mikhail Zaslonko <zaslonko@linux.ibm.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At btrfs_do_readpage() if we get an extent map for a prealloc extent we
end up assigning twice to the 'block_start' variable, first the value
returned by extent_map_block_start() and then EXTENT_MAP_HOLE. This is
pointless so make it more clear by using an if-else statement and doing
only one assignment. Also, while at it, move the declaration of
'block_start' into the while loop's scope, since it's not used outside of
it and the related 'disk_bytenr' is also declared in this scope.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
It is a long known bug that a VM image on btrfs can hit data csum
mismatches if qemu is using direct IO for the image (this is commonly
known as cache mode 'none').
[CAUSE]
Inside the VM, if the fs is EXT4 or XFS, or even NTFS from Windows, the
fs is allowed to dirty/modify the folio even if the folio is under
writeback (as long as the address space doesn't have AS_STABLE_WRITES
flag inherited from the block device).
This is a valid optimization to improve the concurrency, and since these
filesystems have no extra checksum on data, the content change is not a
problem at all.
But the final write into the image file is handled by btrfs, which needs
the content not to be modified during writeback, or the checksum will
not match the data (checksum is calculated before submitting the bio).
So EXT4/XFS/NTFS assume they can modify the folio under writeback, but
btrfs requires no modification, and this leads to the false csum mismatch.
This is only a controlled example. There are even cases where
multi-threaded programs submit a direct IO write and then another thread
modifies the direct IO buffer for whatever reason.
Btrfs has no sane way to detect such cases, and they lead to
false data csum mismatches.
[FIX]
I have considered the following ideas to solve the problem:
- Make direct IO to always skip data checksum
This not only requires a new incompatible flag, as it breaks the
current per-inode NODATASUM flag, but also requires extra handling
for the cases where no csum is found.
And this also reduces our checksum protection.
- Let hardware handle all the checksum
AKA, just nodatasum mount option.
That requires trusting the hardware (which is not that trustworthy in a
lot of cases), and it's not generic at all.
- Always fallback to buffered write if the inode requires checksum
This was suggested by Christoph, and is the solution utilized by this
patch.
The cost is obvious, the extra buffer copying into page cache, thus it
reduces the performance.
But at least it's still user configurable, if the end user still wants
the zero-copy performance, just set NODATASUM flag for the inode
(which is a common practice for VM images on btrfs).
Since we cannot trust user space programs to keep the buffer
consistent during direct IO, we have no choice but always falling back
to buffered IO. At least by this, we avoid the more deadly false data
checksum mismatch error.
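The resulting decision, reduced to a sketch (the flag value and function
name are hypothetical, the real check sits in the direct IO write path):

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_INODE_NODATASUM (1u << 0) /* hypothetical flag value */

/*
 * A direct IO write stays zero-copy only when the inode has no data
 * checksums, otherwise it is redirected to the buffered path.
 */
static bool sketch_must_fallback_to_buffered(unsigned int inode_flags)
{
	return !(inode_flags & SKETCH_INODE_NODATASUM);
}
```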
Suggested-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In that function we call set_extent_buffer_uptodate() or
clear_extent_buffer_uptodate(), which will already update the uptodate
flag for all the involved extent buffer folios.
Thus there is no need to update the folio uptodate flags again.
Just remove the open-coded part.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Retrieve folios instead of pages and work on them throughout. Removes
a few calls to compound_head() and a reference to page->mapping.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: David Sterba <dsterba@suse.com>
Remove references to the page lock and page->mapping. Also btrfs folios
can never be swizzled into swap (mentioned in extent_write_cache_pages()).
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Only allocate the btrfs_encoded_read_private structure for asynchronous
(io_uring) mode.
There's no need to allocate an object from slab in the synchronous mode. In
such a case stack can be happily used as it used to be before 68d3b27e05
("btrfs: move priv off stack in btrfs_encoded_read_regular_fill_pages()")
which was a preparation for the async mode.
While at it, fix the comment to reflect the atomic => refcount change in
d29662695e ("btrfs: fix use-after-free waiting for encoded read endios").
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit b35108a51c ("jiffies: Define secs_to_jiffies()") introduced
secs_to_jiffies(). As the value here is a multiple of 1000, use
secs_to_jiffies() instead of msecs_to_jiffies() to avoid the
multiplication.
This is converted using scripts/coccinelle/misc/secs_to_jiffies.cocci with
the following Coccinelle rules:
@depends on patch@
expression E;
@@
-msecs_to_jiffies
+secs_to_jiffies
(E
- * \( 1000 \| MSEC_PER_SEC \)
)
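As a rough userspace illustration of why the conversion is safe, the
sketch below compares both forms. The tick rate (SKETCH_HZ) and the two
helper bodies are simplified assumptions, not the kernel's definitions.

```c
/* Assumed tick rate for this sketch; the kernel's HZ is a build-time setting. */
#define SKETCH_HZ		250UL
#define SKETCH_MSEC_PER_SEC	1000UL

/* Simplified stand-in for msecs_to_jiffies(), valid when SKETCH_HZ divides 1000. */
static unsigned long sketch_msecs_to_jiffies(unsigned long ms)
{
	unsigned long ms_per_tick = SKETCH_MSEC_PER_SEC / SKETCH_HZ;

	return (ms + ms_per_tick - 1) / ms_per_tick;	/* round up to whole ticks */
}

/* Simplified stand-in for secs_to_jiffies(): one multiplication by the tick rate. */
static unsigned long sketch_secs_to_jiffies(unsigned long secs)
{
	return secs * SKETCH_HZ;
}

/* Before the conversion: the caller multiplies seconds by 1000 first. */
static unsigned long sketch_timeout_before(unsigned long secs)
{
	return sketch_msecs_to_jiffies(secs * 1000);
}

/* After the conversion: the seconds value is passed directly. */
static unsigned long sketch_timeout_after(unsigned long secs)
{
	return sketch_secs_to_jiffies(secs);
}
```

Under these assumptions both forms yield the same number of ticks for a
whole number of seconds.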
Link: https://lkml.kernel.org/r/20250225-converge-secs-to-jiffies-part-two-v3-5-a43967e36c88@linux.microsoft.com
Signed-off-by: Easwar Hariharan <eahariha@linux.microsoft.com>
Acked-by: David Sterba <dsterba@suse.com>
Cc: Carlos Maiolino <cem@kernel.org>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Chris Mason <clm@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Damien Le Maol <dlemoal@kernel.org>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Dick Kennedy <dick.kennedy@broadcom.com>
Cc: Dongsheng Yang <dongsheng.yang@easystack.cn>
Cc: Fabio Estevam <festevam@gmail.com>
Cc: Frank Li <frank.li@nxp.com>
Cc: Hans de Goede <hdegoede@redhat.com>
Cc: Henrique de Moraes Holschuh <hmh@hmh.eng.br>
Cc: Ilpo Jarvinen <ilpo.jarvinen@linux.intel.com>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: James Bottomley <james.bottomley@HansenPartnership.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Jaroslav Kysela <perex@perex.cz>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Julia Lawall <julia.lawall@inria.fr>
Cc: Kalesh Anakkur Purayil <kalesh-anakkur.purayil@broadcom.com>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Marc Kleine-Budde <mkl@pengutronix.de>
Cc: Mark Brown <broonie@kernel.org>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Nicolas Palix <nicolas.palix@imag.fr>
Cc: Niklas Cassel <cassel@kernel.org>
Cc: Oded Gabbay <ogabbay@kernel.org>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Sascha Hauer <s.hauer@pengutronix.de>
Cc: Sebastian Reichel <sre@kernel.org>
Cc: Selvin Thyparampil Xavier <selvin.xavier@broadcom.com>
Cc: Shawn Guo <shawnguo@kernel.org>
Cc: Shyam-sundar S-k <Shyam-sundar.S-k@amd.com>
Cc: Takashi Iwai <tiwai@suse.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Merge tag 'for-6.14-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix leaked extent map after error when reading chunks
- replace use of deprecated strncpy
- in zoned mode, fix range when unlocking extent range, causing a hang
* tag 'for-6.14-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix a leaked chunk map issue in read_one_chunk()
btrfs: replace deprecated strncpy() with strscpy()
btrfs: zoned: fix extent range end unlock in cow_file_range()
Add btrfs_free_chunk_map() to free the memory allocated
by btrfs_alloc_chunk_map() if btrfs_add_chunk_map() fails.
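The shape of the fix can be sketched in userspace as below. All names
are hypothetical stand-ins for the btrfs helpers mentioned above, and the
"live maps" counter exists only to make the leak observable.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for a chunk map; not the real struct btrfs_chunk_map. */
struct sketch_chunk_map {
	unsigned long long start;
};

static int sketch_live_maps;	/* allocations that have not been freed yet */

static struct sketch_chunk_map *sketch_alloc_chunk_map(void)
{
	struct sketch_chunk_map *map = calloc(1, sizeof(*map));

	if (map)
		sketch_live_maps++;
	return map;
}

static void sketch_free_chunk_map(struct sketch_chunk_map *map)
{
	if (map) {
		free(map);
		sketch_live_maps--;
	}
}

/* Pretend insertion into the mapping tree; fails with -EEXIST when asked to. */
static int sketch_add_chunk_map(struct sketch_chunk_map *map, int simulate_failure)
{
	(void)map;
	return simulate_failure ? -EEXIST : 0;
}

static int sketch_read_one_chunk(int simulate_failure, struct sketch_chunk_map **out)
{
	struct sketch_chunk_map *map = sketch_alloc_chunk_map();
	int ret;

	if (!map)
		return -ENOMEM;

	ret = sketch_add_chunk_map(map, simulate_failure);
	if (ret) {
		/* The fix: release the map on the error path instead of leaking it. */
		sketch_free_chunk_map(map);
		return ret;
	}
	*out = map;
	return 0;
}
```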
Fixes: 7dc66abb5a ("btrfs: use a dedicated data structure for chunk maps")
CC: stable@vger.kernel.org
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Haoxiang Li <haoxiang_li2024@163.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Some filesystems, such as NFS, cifs, ceph, and fuse, do not have
complete control of sequencing on the actual filesystem (e.g. on a
different server) and may find that the inode created for a mkdir
request already exists in the icache and dcache by the time the mkdir
request returns. For example, if the filesystem is mounted twice the
directory could be visible on the other mount before it is on the
original mount, and a pair of name_to_handle_at(), open_by_handle_at()
calls could instantiate the directory inode with an IS_ROOT() dentry
before the first mkdir returns.
This means that the dentry passed to ->mkdir() may not be the one that
is associated with the inode after the ->mkdir() completes. Some
callers need to interact with the inode after the ->mkdir completes and
they currently need to perform a lookup in the (rare) case that the
dentry is no longer hashed.
This lookup-after-mkdir requires that the directory remains locked to
avoid races. Planned future patches to lock the dentry rather than the
directory will mean that this lookup cannot be performed atomically with
the mkdir.
To remove this barrier, this patch changes ->mkdir to return the
resulting dentry if it is different from the one passed in.
Possible returns are:
NULL - the directory was created and no other dentry was used
ERR_PTR() - an error occurred
non-NULL - this other dentry was spliced in
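A caller could interpret the three return cases roughly like this. This
is a userspace sketch: the ERR_PTR-style helpers are simplified
stand-ins for the kernel macros, and sketch_mkdir_result() is a
hypothetical helper, not part of the VFS.

```c
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel's ERR_PTR()/IS_ERR()/PTR_ERR() helpers. */
#define SKETCH_MAX_ERRNO 4095

static void *sketch_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

static int sketch_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

static long sketch_ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

struct sketch_dentry {
	const char *name;
};

/*
 * Pick the dentry a caller should use after ->mkdir() returned 'ret' for
 * the dentry 'passed'. On error, *err is set and NULL is returned.
 */
static struct sketch_dentry *sketch_mkdir_result(struct sketch_dentry *passed,
						 struct sketch_dentry *ret,
						 long *err)
{
	*err = 0;
	if (sketch_is_err(ret)) {	/* ERR_PTR(): an error occurred */
		*err = sketch_ptr_err(ret);
		return NULL;
	}
	if (ret)			/* non-NULL: another dentry was spliced in */
		return ret;
	return passed;			/* NULL: the passed dentry was used */
}
```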
This patch only changes file-systems to return "ERR_PTR(err)" instead of
"err" or equivalent transformations. Subsequent patches will make
further changes to some file-systems to return a correct dentry.
Not all filesystems reliably result in a positive hashed dentry:
- NFS, cifs, hostfs will sometimes need to perform a lookup of
the name to get inode information. Races could result in this
returning something different. Note that this lookup is
non-atomic which is what we are trying to avoid. Placing the
lookup in filesystem code means it only happens when the filesystem
has no other option.
- kernfs and tracefs leave the dentry negative and the ->revalidate
operation ensures that lookup will be called to correctly populate
the dentry. This could be fixed but I don't think it is important
to any of the users of vfs_mkdir() which look at the dentry.
The recommendation to use
d_drop();d_splice_alias()
is ugly but fits with current practice. A planned future patch will
change this.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: NeilBrown <neilb@suse.de>
Link: https://lore.kernel.org/r/20250227013949.536172-2-neilb@suse.de
Signed-off-by: Christian Brauner <brauner@kernel.org>
strncpy() is deprecated for NUL-terminated destination buffers. Use
strscpy() instead and don't zero-initialize the param array.
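For readers unfamiliar with the difference, here is a simplified
userspace stand-in that mimics the documented strscpy() behaviour
(always NUL-terminate, report truncation with -E2BIG). It is not the
kernel implementation.

```c
#include <errno.h>
#include <stddef.h>

/*
 * Simplified stand-in for strscpy(): copies at most size - 1 bytes,
 * always NUL-terminates when size > 0, and returns the number of bytes
 * copied, or -E2BIG if the source had to be truncated.
 */
static long sketch_strscpy(char *dst, const char *src, size_t size)
{
	size_t i;

	if (size == 0)
		return -E2BIG;

	for (i = 0; i < size - 1 && src[i] != '\0'; i++)
		dst[i] = src[i];
	dst[i] = '\0';

	/* Truncated if we stopped because the buffer filled up. */
	return src[i] == '\0' ? (long)i : -E2BIG;
}
```

Because the destination is always terminated, there is no need to
zero-initialize it beforehand, which is why the initializer can go.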
Link: https://github.com/KSPP/linux/issues/90
Cc: linux-hardening@vger.kernel.org
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Running generic/751 on the for-next branch often results in a hang like
the one below. Both tasks are stuck locking an extent, which suggests
someone forgot to unlock an extent.
INFO: task kworker/u128:1:12 blocked for more than 323 seconds.
Not tainted 6.13.0-BTRFS-ZNS+ #503
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u128:1 state:D stack:0 pid:12 tgid:12 ppid:2 flags:0x00004000
Workqueue: btrfs-fixup btrfs_work_helper [btrfs]
Call Trace:
<TASK>
__schedule+0x534/0xdd0
schedule+0x39/0x140
__lock_extent+0x31b/0x380 [btrfs]
? __pfx_autoremove_wake_function+0x10/0x10
btrfs_writepage_fixup_worker+0xf1/0x3a0 [btrfs]
btrfs_work_helper+0xff/0x480 [btrfs]
? lock_release+0x178/0x2c0
process_one_work+0x1ee/0x570
? srso_return_thunk+0x5/0x5f
worker_thread+0x1d1/0x3b0
? __pfx_worker_thread+0x10/0x10
kthread+0x10b/0x230
? __pfx_kthread+0x10/0x10
ret_from_fork+0x30/0x50
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
INFO: task kworker/u134:0:184 blocked for more than 323 seconds.
Not tainted 6.13.0-BTRFS-ZNS+ #503
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u134:0 state:D stack:0 pid:184 tgid:184 ppid:2 flags:0x00004000
Workqueue: writeback wb_workfn (flush-btrfs-4)
Call Trace:
<TASK>
__schedule+0x534/0xdd0
schedule+0x39/0x140
__lock_extent+0x31b/0x380 [btrfs]
? __pfx_autoremove_wake_function+0x10/0x10
find_lock_delalloc_range+0xdb/0x260 [btrfs]
writepage_delalloc+0x12f/0x500 [btrfs]
? srso_return_thunk+0x5/0x5f
extent_write_cache_pages+0x232/0x840 [btrfs]
btrfs_writepages+0x72/0x130 [btrfs]
do_writepages+0xe7/0x260
? srso_return_thunk+0x5/0x5f
? lock_acquire+0xd2/0x300
? srso_return_thunk+0x5/0x5f
? find_held_lock+0x2b/0x80
? wbc_attach_and_unlock_inode.part.0+0x102/0x250
? wbc_attach_and_unlock_inode.part.0+0x102/0x250
__writeback_single_inode+0x5c/0x4b0
writeback_sb_inodes+0x22d/0x550
__writeback_inodes_wb+0x4c/0xe0
wb_writeback+0x2f6/0x3f0
wb_workfn+0x32a/0x510
process_one_work+0x1ee/0x570
? srso_return_thunk+0x5/0x5f
worker_thread+0x1d1/0x3b0
? __pfx_worker_thread+0x10/0x10
kthread+0x10b/0x230
? __pfx_kthread+0x10/0x10
ret_from_fork+0x30/0x50
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
This happens because we have another success path for the zoned mode. When
there is no active zone available, btrfs_reserve_extent() returns
-EAGAIN. In this case, we have two reactions.
(1) If the given range is never allocated, we can only wait for someone
to finish a zone, so wait on BTRFS_FS_NEED_ZONE_FINISH bit and retry
afterward.
(2) Or, if some allocations are already done, we must bail out and let
the caller send IOs for the allocation. This is because these IOs
may be necessary to finish a zone.
Commit 06f3642847 ("btrfs: do proper folio cleanup when
cow_file_range() failed") moved the unlock code from the inside of the
loop to the outside. Previously, the allocated extents were unlocked
right after the allocation, and thus before returning from the function.
Now they are no longer unlocked in case (2) above, which causes the
hang.
Fix the issue by setting 'end' to the end of the allocated range. Then
we can exit the loop and the same unlock code can properly handle the
case.
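The control flow can be sketched as follows. This is a heavily
simplified userspace model: the allocator, its budget and all names are
hypothetical, and the real function does much more.

```c
#include <errno.h>

/* Hypothetical allocator: hands out 'chunk' bytes per call until 'budget' runs out. */
struct sketch_alloc {
	unsigned long long budget;
	unsigned long long chunk;
};

static int sketch_reserve_extent(struct sketch_alloc *a, unsigned long long *len)
{
	if (a->budget < a->chunk)
		return -EAGAIN;		/* models "no active zone available" */
	a->budget -= a->chunk;
	*len = a->chunk;
	return 0;
}

/*
 * Model of the allocation loop over [start, end] (inclusive end). Returns
 * the inclusive end of the range the shared unlock code after the loop
 * must cover, or a negative errno if nothing could be allocated.
 */
static long long sketch_cow_range(struct sketch_alloc *a,
				  unsigned long long start,
				  unsigned long long end)
{
	const unsigned long long orig_start = start;

	while (start < end) {
		unsigned long long len = 0;
		int ret = sketch_reserve_extent(a, &len);

		if (ret == -EAGAIN) {
			if (start == orig_start)
				return ret;	/* case (1): real code waits and retries */
			end = start - 1;	/* case (2): shrink 'end' to what was allocated */
			break;
		}
		start += len;
	}
	return (long long)end;
}
```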
Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Fixes: 06f3642847 ("btrfs: do proper folio cleanup when cow_file_range() failed")
CC: stable@vger.kernel.org
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Merge tag 'for-6.14-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- extent map shrinker fixes:
- fix potential use after free accessing an inode to reach fs_info,
the shrinker could do iput() in the meantime
- skip unnecessary scanning of inodes without extent maps
- do direct iput(), no need for indirection via workqueue
- in block < page mode, fix race when extending i_size in buffered mode
- fix minor memory leak in selftests
- print descriptive error message when seeding device is not found
* tag 'for-6.14-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix data overwriting bug during buffered write when block size < page size
btrfs: output an error message if btrfs failed to find the seed fsid
btrfs: do regular iput instead of delayed iput during extent map shrinking
btrfs: skip inodes without loaded extent maps when shrinking extent maps
btrfs: fix use-after-free on inode when scanning root during em shrinking
btrfs: selftests: fix btrfs_test_delayed_refs() leak of transaction
[BUG]
When running generic/418 with a btrfs whose block size < page size
(subpage cases), it always fails.
And the following minimal reproducer is more than enough to trigger it
reliably:
workload()
{
	mkfs.btrfs -s 4k -f $dev > /dev/null
	dmesg -C
	mount $dev $mnt
	$fsstree_dir/src/dio-invalidate-cache -r -b 4096 -n 3 -i 1 -f $mnt/diotest
	ret=$?
	umount $mnt
	stop_trace
	if [ $ret -ne 0 ]; then
		fail
	fi
}
for (( i = 0; i < 1024; i++)); do
	echo "=== $i/$runtime ==="
	workload
done
[CAUSE]
With extra trace printk added to the following functions:
- btrfs_buffered_write()
* Which folio is touched
* The file offset (start) where the buffered write is at
* How many bytes are copied
* The content of the write (the first 2 bytes)
- submit_one_sector()
* Which folio is touched
* The position inside the folio
* The content of the page cache (the first 2 bytes)
- pagecache_isize_extended()
* The parameters of the function itself
* The parameters of the folio_zero_range()
Which are enough to show the problem:
22.158114: btrfs_buffered_write: folio pos=0 start=0 copied=4096 content=0x0101
22.158161: submit_one_sector: r/i=5/257 folio=0 pos=0 content=0x0101
22.158609: btrfs_buffered_write: folio pos=0 start=4096 copied=4096 content=0x0101
22.158634: btrfs_buffered_write: folio pos=0 start=8192 copied=4096 content=0x0101
22.158650: pagecache_isize_extended: folio=0 from=4096 to=8192 bsize=4096 zero off=4096 len=8192
22.158682: submit_one_sector: r/i=5/257 folio=0 pos=4096 content=0x0000
22.158686: submit_one_sector: r/i=5/257 folio=0 pos=8192 content=0x0101
The tool dio-invalidate-cache will start 3 threads, each doing a buffered
write of 0x01 at offset 0, 4096 and 8192 respectively, then an fsync, then
a direct read, and finally comparing the read buffer with the write buffer.
Note that all 3 btrfs_buffered_write() are writing the correct 0x01 into
the page cache.
But at submit_one_sector(), at file offset 4096, the content is zeroed
out, by pagecache_isize_extended().
The race happens like this:
Thread A is writing into range [4K, 8K).
Thread B is writing into range [8K, 12K).
Thread A                             | Thread B
-------------------------------------+------------------------------------
btrfs_buffered_write()               | btrfs_buffered_write()
|- old_isize = 4K;                   | |- old_isize = 4096;
|- btrfs_inode_lock()                | |
|- write into folio range [4K, 8K)   | |
|- pagecache_isize_extended()        | |
|  extend isize from 4096 to 8192    | |
|  no folio_zero_range() called      | |
|- btrfs_inode_unlock()              | |
                                     | |- btrfs_inode_lock()
                                     | |- write into folio range [8K, 12K)
                                     | |- pagecache_isize_extended()
                                     | |  calling folio_zero_range(4K, 8K)
                                     | |  This is caused by old_isize being
                                     | |  grabbed too early, without the
                                     | |  inode lock.
                                     | |- btrfs_inode_unlock()
The @old_isize is grabbed without the inode lock, causing a race between
two buffered write threads and making pagecache_isize_extended() zero a
range which still contains cached data.
This only affects subpage btrfs, because in the regular block size ==
page size case, pagecache_isize_extended() does nothing (it returns
early if the block size >= page size).
[FIX]
Grab the old i_size while holding the inode lock.
This means each buffered write thread will have a stable view of the
old inode size, thus avoiding the above race.
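A single-threaded model of the fixed ordering is sketched below. The
inode structure, the lock helpers and the write helper are hypothetical
simplifications, meant only to show where the i_size snapshot is taken.

```c
#include <stdbool.h>

/* Hypothetical model of an inode: a lock flag and the current i_size. */
struct sketch_inode {
	bool locked;
	unsigned long long i_size;
};

static void sketch_inode_lock(struct sketch_inode *inode)
{
	inode->locked = true;
}

static void sketch_inode_unlock(struct sketch_inode *inode)
{
	inode->locked = false;
}

/*
 * Model of one buffered write of 'len' bytes at 'pos'. Returns the old
 * i_size this writer would hand to pagecache_isize_extended(). The
 * snapshot is taken after the lock is acquired, so it already reflects
 * any extension done by writers that held the lock earlier.
 */
static unsigned long long sketch_buffered_write(struct sketch_inode *inode,
						unsigned long long pos,
						unsigned long long len)
{
	unsigned long long old_isize;

	sketch_inode_lock(inode);
	old_isize = inode->i_size;	/* the fix: read i_size under the lock */
	if (pos + len > inode->i_size)
		inode->i_size = pos + len;
	sketch_inode_unlock(inode);
	return old_isize;
}
```

In this model, the writer at 8K that runs after the writer at 4K sees an
old size of 8K rather than the stale 4K, so it has no reason to zero the
range [4K, 8K).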
CC: stable@vger.kernel.org # 5.15+
Fixes: 5e8b9ef303 ("btrfs: move pos increment and pagecache extension to btrfs_buffered_write")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
If btrfs fails to locate the seed device for whatever reason, mounting
the sprouted device will fail without any meaningful error message:
# mkfs.btrfs -f /dev/test/scratch1
# btrfstune -S1 /dev/test/scratch1
# mount /dev/test/scratch1 /mnt/btrfs
# btrfs dev add -f /dev/test/scratch2 /mnt/btrfs
# umount /mnt/btrfs
# btrfs dev scan -u
# mount /dev/test/scratch2 /mnt/btrfs
mount: /mnt/btrfs: fsconfig system call failed: No such file or directory.
dmesg(1) may have more information after failed mount system call.
# dmesg -t | tail -n6
BTRFS info (device dm-5): first mount of filesystem 64252ded-5953-4868-b962-cea48f7ac4ea
BTRFS info (device dm-5): using crc32c (crc32c-generic) checksum algorithm
BTRFS info (device dm-5): using free-space-tree
BTRFS error (device dm-5): failed to read chunk tree: -2
BTRFS error (device dm-5): open_ctree failed: -2
[CAUSE]
The failure to mount is pretty straightforward: btrfs is unable to find
the seed device and its fsid, because of `btrfs dev scan -u`.
But the lack of any useful info is a problem.
[FIX]
Just add an extra error message in open_seed_devices() to indicate the
error.
Now the error message would look like this:
BTRFS info (device dm-4): first mount of filesystem 7769223d-4db1-4e4c-ac29-0a96f53576ab
BTRFS info (device dm-4): using crc32c (crc32c-generic) checksum algorithm
BTRFS info (device dm-4): using free-space-tree
BTRFS error (device dm-4): failed to find fsid e87c12e6-584b-4e98-8b88-962c33a619ff when attempting to open seed devices
BTRFS error (device dm-4): failed to read chunk tree: -2
BTRFS error (device dm-4): open_ctree failed: -2
Link: https://github.com/kdave/btrfs-progs/issues/959
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The extent map shrinker now runs in the system unbound workqueue and no
longer in kswapd context so it can directly do an iput() on inodes even
if that blocks or needs to acquire any lock (we aren't holding any locks
when requesting the delayed iput from the shrinker). So we don't need to
add a delayed iput, wake up the cleaner and delegate the iput() to the
cleaner, which also adds extra contention on the spinlock that protects
the delayed iputs list.
Reported-by: Ivan Shapovalov <intelfx@intelfx.name>
Tested-by: Ivan Shapovalov <intelfx@intelfx.name>
Link: https://lore.kernel.org/linux-btrfs/0414d690ac5680d0d77dfc930606cdc36e42e12f.camel@intelfx.name/
CC: stable@vger.kernel.org # 6.12+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If there are inodes that don't have any loaded extent maps, we end up
grabbing a reference on them and later adding a delayed iput, which wakes
up the cleaner and makes it do unnecessary work. This is common when for
example the inodes were open only to run stat(2) or all their extent maps
were already released through the folio release callback
(btrfs_release_folio()) or released by a previous run of the shrinker, or
directories which never have extent maps.
Reported-by: Ivan Shapovalov <intelfx@intelfx.name>
Tested-by: Ivan Shapovalov <intelfx@intelfx.name>
Link: https://lore.kernel.org/linux-btrfs/0414d690ac5680d0d77dfc930606cdc36e42e12f.camel@intelfx.name/
CC: stable@vger.kernel.org # 6.13+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At btrfs_scan_root() we are accessing the inode's root (and fs_info) in a
call to btrfs_fs_closing() after we have scheduled the inode for a delayed
iput, and that can result in a use-after-free on the inode in case the
cleaner kthread does the iput before we dereference the inode in the call
to btrfs_fs_closing().
Fix this by using the fs_info stored already in a local variable instead
of doing inode->root->fs_info.
Fixes: 1020443840 ("btrfs: make the extent map shrinker run asynchronously as a work queue job")
CC: stable@vger.kernel.org # 6.13+
Tested-by: Ivan Shapovalov <intelfx@intelfx.name>
Link: https://lore.kernel.org/linux-btrfs/0414d690ac5680d0d77dfc930606cdc36e42e12f.camel@intelfx.name/
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Merge tag 'for-6.14-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix stale page cache after race between readahead and direct IO write
- fix hole expansion when writing at an offset beyond EOF, the range
will not be zeroed
- use proper way to calculate offsets in folio ranges
* tag 'for-6.14-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix hole expansion when writing at an offset beyond EOF
btrfs: fix stale page cache after race between readahead and direct IO write
btrfs: fix two misuses of folio_shift()
At btrfs_write_check() if our file's i_size is not sector size aligned and
we have a write that starts at an offset larger than the i_size that falls
within the same page of the i_size, then we end up not zeroing the file
range [i_size, write_offset).
The code is this:
start_pos = round_down(pos, fs_info->sectorsize);
oldsize = i_size_read(inode);
if (start_pos > oldsize) {
/* Expand hole size to cover write data, preventing empty gap */
loff_t end_pos = round_up(pos + count, fs_info->sectorsize);
ret = btrfs_cont_expand(BTRFS_I(inode), oldsize, end_pos);
if (ret)
return ret;
}
So if our file's i_size is 90269 bytes and a write at offset 90365 bytes
comes in, we get 'start_pos' set to 90112 bytes, which is less than the
i_size and therefore we don't zero out the range [90269, 90365) by
calling btrfs_cont_expand().
This is an old bug introduced in commit 9036c10208 ("Btrfs: update hole
handling v2"), from 2008, and the buggy code got moved around over the
years.
Fix this by discarding 'start_pos' and comparing against the write offset
('pos') without any alignment.
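The arithmetic from the example above can be checked with a small
userspace sketch. The helper names are hypothetical, and a 4096 byte
sector size is assumed (consistent with 'start_pos' being 90112).

```c
#include <stdbool.h>

/* Round x down to a multiple of 'align', which must be a power of two. */
static unsigned long long sketch_round_down(unsigned long long x,
					    unsigned long long align)
{
	return x & ~(align - 1);
}

/* Old check: compare the sector-aligned start against i_size. */
static bool sketch_needs_expand_old(unsigned long long pos,
				    unsigned long long i_size,
				    unsigned long long sectorsize)
{
	return sketch_round_down(pos, sectorsize) > i_size;
}

/* Fixed check: compare the unaligned write offset against i_size. */
static bool sketch_needs_expand_new(unsigned long long pos,
				    unsigned long long i_size)
{
	return pos > i_size;
}
```

With i_size 90269 and a write at offset 90365, the old check rounds the
offset down to 90112 and skips the expansion, while the fixed check
triggers it.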
This bug was recently exposed by test case generic/363 which tests this
scenario by polluting ranges beyond EOF with an mmap write and then
verifying that after a file increases we get zeroes for the range which
is supposed to be a hole and not what we wrote with the previous mmaped
write.
We're only seeing this exposed now because generic/363 used to run only
on xfs until last Sunday's fstests update.
The test was failing like this:
$ ./check generic/363
FSTYP -- btrfs
PLATFORM -- Linux/x86_64 debian0 6.13.0-rc7-btrfs-next-185+ #17 SMP PREEMPT_DYNAMIC Mon Feb 3 12:28:46 WET 2025
MKFS_OPTIONS -- /dev/sdc
MOUNT_OPTIONS -- /dev/sdc /home/fdmanana/btrfs-tests/scratch_1
generic/363 0s ... [failed, exit status 1]- output mismatch (see /home/fdmanana/git/hub/xfstests/results//generic/363.out.bad)
--- tests/generic/363.out 2025-02-05 15:31:14.013646509 +0000
+++ /home/fdmanana/git/hub/xfstests/results//generic/363.out.bad 2025-02-05 17:25:33.112630781 +0000
@@ -1 +1,46 @@
QA output created by 363
+READ BAD DATA: offset = 0xdcad, size = 0xd921, fname = /home/fdmanana/btrfs-tests/dev/junk
+OFFSET GOOD BAD RANGE
+0x1609d 0x0000 0x3104 0x0
+operation# (mod 256) for the bad data may be 4
+0x1609e 0x0000 0x0472 0x1
+operation# (mod 256) for the bad data may be 4
...
(Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/generic/363.out /home/fdmanana/git/hub/xfstests/results//generic/363.out.bad' to see the entire diff)
Ran: generic/363
Failures: generic/363
Failed 1 of 1 tests
Fixes: 9036c10208 ("Btrfs: update hole handling v2")
CC: stable@vger.kernel.org
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
After commit ac325fc2aa ("btrfs: do not hold the extent lock for entire
read") we can now trigger a race between a task doing a direct IO write
and readahead. When this race is triggered it results in tasks getting
stale data when they attempt to do a buffered read (including the task
that did the direct IO write).
This race can be sporadically triggered with test case generic/418, failing
like this:
$ ./check generic/418
FSTYP -- btrfs
PLATFORM -- Linux/x86_64 debian0 6.13.0-rc7-btrfs-next-185+ #17 SMP PREEMPT_DYNAMIC Mon Feb 3 12:28:46 WET 2025
MKFS_OPTIONS -- /dev/sdc
MOUNT_OPTIONS -- /dev/sdc /home/fdmanana/btrfs-tests/scratch_1
generic/418 14s ... - output mismatch (see /home/fdmanana/git/hub/xfstests/results//generic/418.out.bad)
--- tests/generic/418.out 2020-06-10 19:29:03.850519863 +0100
+++ /home/fdmanana/git/hub/xfstests/results//generic/418.out.bad 2025-02-03 15:42:36.974609476 +0000
@@ -1,2 +1,5 @@
QA output created by 418
+cmpbuf: offset 0: Expected: 0x1, got 0x0
+[6:0] FAIL - comparison failed, offset 24576
+diotest -wp -b 4096 -n 8 -i 4 failed at loop 3
Silence is golden
...
(Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/generic/418.out /home/fdmanana/git/hub/xfstests/results//generic/418.out.bad' to see the entire diff)
Ran: generic/418
Failures: generic/418
Failed 1 of 1 tests
The race happens like this:
1) A file has a prealloc extent for the range [16K, 28K);
2) Task A starts a direct IO write against file range [24K, 28K).
At the start of the direct IO write it invalidates the page cache at
__iomap_dio_rw() with kiocb_invalidate_pages() for the 4K page at file
offset 24K;
3) Task A enters btrfs_dio_iomap_begin() and locks the extent range
[24K, 28K);
4) Task B starts a readahead for file range [16K, 28K), entering
btrfs_readahead().
First it attempts to read the page at offset 16K by entering
btrfs_do_readpage(), where it calls get_extent_map(), locks the range
[16K, 20K) and gets the extent map for the range [16K, 28K), caching
it into the 'em_cached' variable declared in the local stack of
btrfs_readahead(), and then unlocks the range [16K, 20K).
Since the extent map has the prealloc flag, at btrfs_do_readpage() we
zero out the page's content and don't submit any bio to read the page
from the extent.
Then it attempts to read the page at offset 20K entering
btrfs_do_readpage() where we reuse the previously cached extent map
(decided by get_extent_map()) since it spans the page's range and
it's still in the inode's extent map tree.
Just like for the previous page, we zero out the page's content since
the extent map has the prealloc flag set.
Then it attempts to read the page at offset 24K entering
btrfs_do_readpage() where we reuse the previously cached extent map
(decided by get_extent_map()) since it spans the page's range and
it's still in the inode's extent map tree.
Just like for the previous pages, we zero out the page's content since
the extent map has the prealloc flag set. Note that we didn't lock the
extent range [24K, 28K), so we didn't synchronize with the ongoing
direct IO write being performed by task A;
5) Task A enters btrfs_create_dio_extent() and creates an ordered extent
for the range [24K, 28K), with the flags BTRFS_ORDERED_DIRECT and
BTRFS_ORDERED_PREALLOC set;
6) Task A unlocks the range [24K, 28K) at btrfs_dio_iomap_begin();
7) The ordered extent enters btrfs_finish_one_ordered() and locks the
range [24K, 28K);
8) Task A enters fs/iomap/direct-io.c:iomap_dio_complete() and it tries
to invalidate the page at offset 24K by calling
kiocb_invalidate_post_direct_write(), resulting in a call chain that
ends up at btrfs_release_folio().
The btrfs_release_folio() call ends up returning false because the range
for the page at file offset 24K is currently locked by the task doing
the ordered extent completion in the previous step (7), so we have:
btrfs_release_folio() ->
__btrfs_release_folio() ->
try_release_extent_mapping() ->
try_release_extent_state()
This last function sees that the range is locked and returns false,
which is propagated up to btrfs_release_folio().
So this results in a failure to invalidate the page and
kiocb_invalidate_post_direct_write() triggers this message logged in
dmesg:
Page cache invalidation failure on direct I/O. Possible data corruption due to collision with buffered I/O!
After this we leave the page cache with stale data for the file range
[24K, 28K), filled with zeroes instead of the data written by direct IO
write (all bytes with a 0x01 value), so any task attempting to read with
buffered IO, including the task that did the direct IO write, will get
all bytes in the range with a 0x00 value instead of the written data.
Fix this by locking the range, with btrfs_lock_and_flush_ordered_range(),
at the two callers of btrfs_do_readpage() instead of doing it at
get_extent_map(), just like we did before commit ac325fc2aa ("btrfs: do
not hold the extent lock for entire read"), and unlocking the range after
all the calls to btrfs_do_readpage(). This way we never reuse a cached
extent map without flushing any pending ordered extents from a concurrent
direct IO write.
Fixes: ac325fc2aa ("btrfs: do not hold the extent lock for entire read")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It is meaningless to shift a byte count by folio_shift(). The folio index
is in units of PAGE_SIZE, not folio_size(). We can use folio_contains()
to make this work for arbitrary-order folios, so remove the assertion
that the folios are of order 0.
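The index arithmetic can be illustrated with a small userspace model.
The structure and helpers below are hypothetical simplifications (4K
pages assumed); they only mirror the idea of checking a page index
against a folio's page range.

```c
#include <stdbool.h>

#define SKETCH_PAGE_SHIFT 12	/* assumed 4K pages */

/* Hypothetical model of a folio: first page index and order (2^order pages). */
struct sketch_folio {
	unsigned long index;
	unsigned int order;
};

static unsigned long sketch_folio_nr_pages(const struct sketch_folio *folio)
{
	return 1UL << folio->order;
}

/* A byte offset becomes a page index by shifting by the page shift. */
static unsigned long sketch_pos_to_index(unsigned long long pos)
{
	return (unsigned long)(pos >> SKETCH_PAGE_SHIFT);
}

/*
 * A page index belongs to the folio if it falls within
 * [index, index + nr_pages). Unsigned wraparound makes indexes below
 * folio->index fail the comparison as well.
 */
static bool sketch_folio_contains(const struct sketch_folio *folio,
				  unsigned long index)
{
	return index - folio->index < sketch_folio_nr_pages(folio);
}
```

The point of the model is that the shift applied to a byte count is the
page shift regardless of the folio's order; the order only determines
how many consecutive page indexes the folio covers.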
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Merge tag 'for-6.14-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- add lockdep annotation for relocation root to fix a splat warning
while merging roots
- fix assertion failure when splitting ordered extent after transaction
abort
- don't print 'qgroup inconsistent' message when rescan process updates
qgroup data sooner than the subvolume deletion process
- fix use-after-free (accessing the error number) when attempting to
join an aborted transaction
- avoid starting new transaction if not necessary when cleaning qgroup
during subvolume drop
* tag 'for-6.14-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: avoid starting new transaction when cleaning qgroup during subvolume drop
btrfs: fix use-after-free when attempting to join an aborted transaction
btrfs: do not output error message if a qgroup has been already cleaned up
btrfs: fix assertion failure when splitting ordered extent after transaction abort
btrfs: fix lockdep splat while merging a relocation root
Merge tag 'mm-stable-2025-01-26-14-59' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
"The various patchsets are summarized below. Plus of course many
individual patches which are described in their changelogs.
- "Allocate and free frozen pages" from Matthew Wilcox reorganizes
the page allocator so we end up with the ability to allocate and
free zero-refcount pages, so that callers (i.e. slab) can avoid a
refcount inc & dec
- "Support large folios for tmpfs" from Baolin Wang teaches tmpfs to
use large folios other than PMD-sized ones
- "Fix mm/rodata_test" from Petr Tesarik performs some maintenance
and fixes for this small built-in kernel selftest
- "mas_anode_descend() related cleanup" from Wei Yang tidies up part
of the mapletree code
- "mm: fix format issues and param types" from Keren Sun implements a
few minor code cleanups
- "simplify split calculation" from Wei Yang provides a few fixes and
a test for the mapletree code
- "mm/vma: make more mmap logic userland testable" from Lorenzo
Stoakes continues the work of moving vma-related code into the
(relatively) new mm/vma.c
- "mm/page_alloc: gfp flags cleanups for alloc_contig_*()" from David
Hildenbrand cleans up and rationalizes handling of gfp flags in the
page allocator
- "readahead: Reintroduce fix for improper RA window sizing" from Jan
Kara is a second attempt at fixing a readahead window sizing issue.
It should reduce the amount of unnecessary reading
- "synchronously scan and reclaim empty user PTE pages" from Qi Zheng
addresses an issue where "huge" amounts of pte pagetables are
accumulated:
https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/
Qi's series addresses this windup by synchronously freeing PTE
memory within the context of madvise(MADV_DONTNEED)
- "selftest/mm: Remove warnings found by adding compiler flags" from
Muhammad Usama Anjum fixes some build warnings in the selftests
code when optional compiler warnings are enabled
- "mm: don't use __GFP_HARDWALL when migrating remote pages" from
David Hildenbrand tightens the allocator's observance of
__GFP_HARDWALL
- "pkeys kselftests improvements" from Kevin Brodsky implements
various fixes and cleanups in the MM selftests code, mainly
pertaining to the pkeys tests
- "mm/damon: add sample modules" from SeongJae Park enhances DAMON to
estimate application working set size
- "memcg/hugetlb: Rework memcg hugetlb charging" from Joshua Hahn
provides some cleanups to memcg's hugetlb charging logic
- "mm/swap_cgroup: remove global swap cgroup lock" from Kairui Song
removes the global swap cgroup lock. A speedup of 10% for a
tmpfs-based kernel build was demonstrated
- "zram: split page type read/write handling" from Sergey Senozhatsky
has several fixes and cleanups for zram in the area of
zram_write_page(). A watchdog softlockup warning was eliminated
- "move pagetable_*_dtor() to __tlb_remove_table()" from Kevin
Brodsky cleans up the pagetable destructor implementations. A rare
use-after-free race is fixed
- "mm/debug: introduce and use VM_WARN_ON_VMG()" from Lorenzo Stoakes
simplifies and cleans up the debugging code in the VMA merging
logic
- "Account page tables at all levels" from Kevin Brodsky cleans up
and regularizes the pagetable ctor/dtor handling. This results in
improvements in accounting accuracy
- "mm/damon: replace most damon_callback usages in sysfs with new
core functions" from SeongJae Park cleans up and generalizes
DAMON's sysfs file interface logic
- "mm/damon: enable page level properties based monitoring" from
SeongJae Park increases the amount of information which is
presented in response to DAMOS actions
- "mm/damon: remove DAMON debugfs interface" from SeongJae Park
removes DAMON's long-deprecated debugfs interfaces. Thus the
migration to sysfs is completed
- "mm/hugetlb: Refactor hugetlb allocation resv accounting" from
Peter Xu cleans up and generalizes the hugetlb reservation
accounting
- "mm: alloc_pages_bulk: small API refactor" from Luiz Capitulino
removes a never-used feature of the alloc_pages_bulk() interface
- "mm/damon: extend DAMOS filters for inclusion" from SeongJae Park
extends DAMOS filters to support not only exclusion (rejecting),
but also inclusion (allowing) behavior
- "Add zpdesc memory descriptor for zswap.zpool" from Alex Shi
introduces a new memory descriptor for zswap.zpool that currently
overlaps with struct page for now. This is part of the effort to
reduce the size of struct page and to enable dynamic allocation of
memory descriptors
- "mm, swap: rework of swap allocator locks" from Kairui Song redoes
and simplifies the swap allocator locking. A speedup of 400% was
demonstrated for one workload, as was a 35% reduction in kernel
build time with swap-on-zram
- "mm: update mips to use do_mmap(), make mmap_region() internal"
from Lorenzo Stoakes reworks MIPS's use of mmap_region() so that
mmap_region() can be made MM-internal
- "mm/mglru: performance optimizations" from Yu Zhao fixes a few
MGLRU regressions and otherwise improves MGLRU performance
- "Docs/mm/damon: add tuning guide and misc updates" from SeongJae
Park updates DAMON documentation
- "Cleanup for memfd_create()" from Isaac Manjarres does that thing
- "mm: hugetlb+THP folio and migration cleanups" from David
Hildenbrand provides various cleanups in the areas of hugetlb
folios, THP folios and migration
- "Uncached buffered IO" from Jens Axboe implements the new
RWF_DONTCACHE flag which provides synchronous dropbehind for
pagecache reading and writing. This permits userspace to address
issues with massive buildup of useless pagecache when
reading/writing fast devices
- "selftests/mm: virtual_address_range: Reduce memory" from Thomas
Weißschuh fixes and optimizes some of the MM selftests"
* tag 'mm-stable-2025-01-26-14-59' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (321 commits)
mm/compaction: fix UBSAN shift-out-of-bounds warning
s390/mm: add missing ctor/dtor on page table upgrade
kasan: sw_tags: use str_on_off() helper in kasan_init_sw_tags()
tools: add VM_WARN_ON_VMG definition
mm/damon/core: use str_high_low() helper in damos_wmark_wait_us()
seqlock: add missing parameter documentation for raw_seqcount_try_begin()
mm/page-writeback: consolidate wb_thresh bumping logic into __wb_calc_thresh
mm/page_alloc: remove the incorrect and misleading comment
zram: remove zcomp_stream_put() from write_incompressible_page()
mm: separate move/undo parts from migrate_pages_batch()
mm/kfence: use str_write_read() helper in get_access_type()
selftests/mm/mkdirty: fix memory leak in test_uffdio_copy()
kasan: hw_tags: Use str_on_off() helper in kasan_init_hw_tags()
selftests/mm: virtual_address_range: avoid reading from VM_IO mappings
selftests/mm: vm_util: split up /proc/self/smaps parsing
selftests/mm: virtual_address_range: unmap chunks after validation
selftests/mm: virtual_address_range: mmap() without PROT_WRITE
selftests/memfd/memfd_test: fix possible NULL pointer dereference
mm: add FGP_DONTCACHE folio creation flag
mm: call filemap_fdatawrite_range_kick() after IOCB_DONTCACHE issue
...
Remove highest_bit and lowest_bit. After the HDD allocation path has been
removed, the only purpose of these two fields is to determine whether the
device is full or not, which can instead be determined by checking the
inuse_pages.
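A minimal sketch of the replacement check, assuming a reduced model of the swap device: struct swap_dev_model and swap_dev_model_full() are hypothetical names for illustration, not the kernel's struct swap_info_struct or any helper added by this patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, reduced model of a swap device; only two counters matter. */
struct swap_dev_model {
	unsigned int pages;		/* total usable swap slots */
	unsigned int inuse_pages;	/* slots currently allocated */
};

/*
 * The device is full exactly when every usable slot is in use, so no
 * separate lowest/highest free-slot bookkeeping is needed to answer
 * this question.
 */
static inline bool swap_dev_model_full(const struct swap_dev_model *si)
{
	return si->inuse_pages == si->pages;
}
```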
Link: https://lkml.kernel.org/r/20250113175732.48099-6-ryncsn@gmail.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Barry Song <v-songbaohua@oppo.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The previous commit removed the page_list argument from
alloc_pages_bulk_noprof() along with the alloc_pages_bulk_list() function.
Now that only the *_array() flavour of the API remains, we can do the
following renaming (along with the _noprof() ones):
alloc_pages_bulk_array -> alloc_pages_bulk
alloc_pages_bulk_array_mempolicy -> alloc_pages_bulk_mempolicy
alloc_pages_bulk_array_node -> alloc_pages_bulk_node
Link: https://lkml.kernel.org/r/275a3bbc0be20fbe9002297d60045e67ab3d4ada.1734991165.git.luizcap@redhat.com
Signed-off-by: Luiz Capitulino <luizcap@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Merge tag 'fsnotify_hsm_for_v6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull fsnotify pre-content notification support from Jan Kara:
"This introduces a new fsnotify event (FS_PRE_ACCESS) that gets
generated before a file's contents are accessed.
The event is synchronous so if there is a listener for this event, the
kernel waits for a reply. On success the execution continues as usual,
on failure we propagate the error to userspace. This allows userspace
to fill in file content on demand from slow storage. The context in
which the events are generated has been picked so that we don't hold
any locks and thus there's no risk of a deadlock for the userspace
handler.
The new pre-content event is available only for users with global
CAP_SYS_ADMIN capability (similarly to other parts of fanotify
functionality) and it is the administrator's responsibility to make sure
the userspace event handler doesn't do stupid stuff that can DoS the
system.
Based on your feedback from the last submission, fsnotify code has
been improved and now file->f_mode encodes whether pre-content event
needs to be generated for the file so the fast path when nobody wants
pre-content event for the file just grows the additional file->f_mode
check. As a bonus this also removes the checks whether the old
FS_ACCESS event needs to be generated from the fast path. Also the
place where the event is generated during page fault has been moved so
now filemap_fault() generates the event if and only if there is no
uptodate folio in the page cache.
Also we have dropped FS_PRE_MODIFY event as current real-world users
of the pre-content functionality don't really use it so let's start
with the minimal useful feature set"
* tag 'fsnotify_hsm_for_v6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: (21 commits)
fanotify: Fix crash in fanotify_init(2)
fs: don't block write during exec on pre-content watched files
fs: enable pre-content events on supported file systems
ext4: add pre-content fsnotify hook for DAX faults
btrfs: disable defrag on pre-content watched files
xfs: add pre-content fsnotify hook for DAX faults
fsnotify: generate pre-content permission event on page fault
mm: don't allow huge faults for files with pre content watches
fanotify: disable readahead if we have pre-content watches
fanotify: allow to set errno in FAN_DENY permission response
fanotify: report file range info with pre-content events
fanotify: introduce FAN_PRE_ACCESS permission event
fsnotify: generate pre-content permission event on truncate
fsnotify: pass optional file access range in pre-content event
fsnotify: introduce pre-content permission events
fanotify: reserve event bit of deprecated FAN_DIR_MODIFY
fanotify: rename a misnamed constant
fanotify: don't skip extra event info if no info_mode is set
fsnotify: check if file is actually being watched for pre-content events on open
fsnotify: opt-in for permission events at file open time
...
At btrfs_qgroup_cleanup_dropped_subvolume() all we want is to commit the
current transaction in order to have all the qgroup rfer/excl numbers up
to date. However we are using btrfs_start_transaction(), which joins the
current transaction if there is one that is not yet committing, but also
starts a new one if there is none or if the current one is already
committing (its state is >= TRANS_STATE_COMMIT_START). This latter case
results in unnecessary IO, wasting time and a pointless rotation of the
backup roots in the super block.
So instead of using btrfs_start_transaction() followed by a
btrfs_commit_transaction(), use btrfs_commit_current_transaction() which
achieves our purpose and avoids starting and committing new transactions.
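The difference between the two approaches can be illustrated with a small userspace state model. Everything below is a hypothetical simplification for illustration (struct fs_model, the three-state enum and the model_*() functions are not btrfs code), and it assumes that committing the current transaction is a no-op when none exists.

```c
#include <assert.h>

/* Simplified transaction states; the kernel has more of them. */
enum trans_state_model {
	TRANS_NONE,		/* no transaction exists */
	TRANS_RUNNING,		/* a transaction exists and can be joined */
	TRANS_COMMIT_START,	/* the current transaction is already committing */
};

struct fs_model {
	enum trans_state_model state;
	int started;	/* new transactions created by the caller */
	int committed;	/* transactions committed in total */
};

/* Models the described start semantics: join if possible, else create. */
static inline void model_start_transaction(struct fs_model *fs)
{
	if (fs->state == TRANS_RUNNING)
		return;			/* joined the running transaction */
	if (fs->state == TRANS_COMMIT_START)
		fs->committed++;	/* the in-flight commit completes */
	fs->state = TRANS_RUNNING;
	fs->started++;			/* a brand new transaction */
}

static inline void model_commit_transaction(struct fs_model *fs)
{
	if (fs->state == TRANS_NONE)
		return;
	fs->committed++;
	fs->state = TRANS_NONE;
}

/* Old pattern: start (or join) a transaction, then commit it. */
static inline void model_cleanup_old(struct fs_model *fs)
{
	model_start_transaction(fs);
	model_commit_transaction(fs);
}

/* New pattern: only commit what already exists, never create anything. */
static inline void model_cleanup_new(struct fs_model *fs)
{
	model_commit_transaction(fs);
}
```

In this model the old pattern creates and commits an extra transaction whenever the starting state is TRANS_NONE or TRANS_COMMIT_START, while the new pattern never increments started.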
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
There is a bug report that btrfs outputs the following error message:
BTRFS info (device nvme0n1p2): qgroup scan completed (inconsistency flag cleared)
BTRFS warning (device nvme0n1p2): failed to cleanup qgroup 0/1179: -2
[CAUSE]
The error itself is pretty harmless, and the end user should ignore it.
When a subvolume is fully dropped, btrfs will call
btrfs_qgroup_cleanup_dropped_subvolume() to delete the qgroup.
However if a qgroup rescan happened before the subvolume was fully dropped,
the qgroup for that subvolume will not be re-created, as rescan will only
create a new qgroup if there is a BTRFS_ROOT_REF_KEY found.
But before we drop a subvolume, the subvolume is unlinked thus there is no
BTRFS_ROOT_REF_KEY.
In that case, btrfs_remove_qgroup() will fail with -ENOENT and trigger
the above error message.
[FIX]
Just ignore -ENOENT error from btrfs_remove_qgroup() inside
btrfs_qgroup_cleanup_dropped_subvolume().
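The error handling pattern can be sketched as below. This is a userspace model, not the actual patch: model_cleanup_dropped_subvolume() is a hypothetical name, and its argument stands in for the value that btrfs_remove_qgroup() would return.

```c
#include <assert.h>
#include <errno.h>

/*
 * @remove_ret simulates the return value of the qgroup removal step.
 * A missing qgroup (-ENOENT) means there is nothing left to clean up,
 * so it is treated as success; any other error is still reported.
 */
static inline int model_cleanup_dropped_subvolume(int remove_ret)
{
	int ret = remove_ret;

	if (ret == -ENOENT)
		ret = 0;
	return ret;
}
```

Only -ENOENT is filtered here, so a real failure such as -EIO would still reach the caller and be logged.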
Reported-by: John Shand <jshand2013@gmail.com>
Link: https://bugzilla.suse.com/show_bug.cgi?id=1236056
Fixes: 839d6ea4f8 ("btrfs: automatically remove the subvolume qgroup")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When COWing a relocation tree path, at relocation.c:replace_path(), we
can trigger a lockdep splat while we are in the btrfs_search_slot() call
against the relocation root. This happens in that callchain at
ctree.c:read_block_for_search() when we happen to find a child extent
buffer already loaded through the fs tree with a lockdep class set to
the fs tree. So when we attempt to lock that extent buffer through a
relocation tree we have to reset the lockdep class to the class for a
relocation tree, since a relocation tree has extent buffers that used
to belong to a fs tree and may currently be already loaded (we swap
extent buffers between the two trees at the end of replace_path()).
However we are missing calls to btrfs_maybe_reset_lockdep_class() to reset
the lockdep class at ctree.c:read_block_for_search() before we read lock
an extent buffer, just like we did for btrfs_search_slot() in commit
b40130b23c ("btrfs: fix lockdep splat with reloc root extent buffers").
So add the missing btrfs_maybe_reset_lockdep_class() calls before the
attempts to read lock an extent buffer at ctree.c:read_block_for_search().
The lockdep splat was reported by syzbot and it looks like this:
======================================================
WARNING: possible circular locking dependency detected
6.13.0-rc5-syzkaller-00163-gab75170520d4 #0 Not tainted
------------------------------------------------------
syz.0.0/5335 is trying to acquire lock:
ffff8880545dbc38 (btrfs-tree-01){++++}-{4:4}, at: btrfs_tree_read_lock_nested+0x2f/0x250 fs/btrfs/locking.c:146
but task is already holding lock:
ffff8880545dba58 (btrfs-treloc-02/1){+.+.}-{4:4}, at: btrfs_tree_lock_nested+0x2f/0x250 fs/btrfs/locking.c:189
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #2 (btrfs-treloc-02/1){+.+.}-{4:4}:
reacquire_held_locks+0x3eb/0x690 kernel/locking/lockdep.c:5374
__lock_release kernel/locking/lockdep.c:5563 [inline]
lock_release+0x396/0xa30 kernel/locking/lockdep.c:5870
up_write+0x79/0x590 kernel/locking/rwsem.c:1629
btrfs_force_cow_block+0x14b3/0x1fd0 fs/btrfs/ctree.c:660
btrfs_cow_block+0x371/0x830 fs/btrfs/ctree.c:755
btrfs_search_slot+0xc01/0x3180 fs/btrfs/ctree.c:2153
replace_path+0x1243/0x2740 fs/btrfs/relocation.c:1224
merge_reloc_root+0xc46/0x1ad0 fs/btrfs/relocation.c:1692
merge_reloc_roots+0x3b3/0x980 fs/btrfs/relocation.c:1942
relocate_block_group+0xb0a/0xd40 fs/btrfs/relocation.c:3754
btrfs_relocate_block_group+0x77d/0xd90 fs/btrfs/relocation.c:4087
btrfs_relocate_chunk+0x12c/0x3b0 fs/btrfs/volumes.c:3494
__btrfs_balance+0x1b0f/0x26b0 fs/btrfs/volumes.c:4278
btrfs_balance+0xbdc/0x10c0 fs/btrfs/volumes.c:4655
btrfs_ioctl_balance+0x493/0x7c0 fs/btrfs/ioctl.c:3670
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:906 [inline]
__se_sys_ioctl+0xf5/0x170 fs/ioctl.c:892
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #1 (btrfs-tree-01/1){+.+.}-{4:4}:
lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5849
down_write_nested+0xa2/0x220 kernel/locking/rwsem.c:1693
btrfs_tree_lock_nested+0x2f/0x250 fs/btrfs/locking.c:189
btrfs_init_new_buffer fs/btrfs/extent-tree.c:5052 [inline]
btrfs_alloc_tree_block+0x41c/0x1440 fs/btrfs/extent-tree.c:5132
btrfs_force_cow_block+0x526/0x1fd0 fs/btrfs/ctree.c:573
btrfs_cow_block+0x371/0x830 fs/btrfs/ctree.c:755
btrfs_search_slot+0xc01/0x3180 fs/btrfs/ctree.c:2153
btrfs_insert_empty_items+0x9c/0x1a0 fs/btrfs/ctree.c:4351
btrfs_insert_empty_item fs/btrfs/ctree.h:688 [inline]
btrfs_insert_inode_ref+0x2bb/0xf80 fs/btrfs/inode-item.c:330
btrfs_rename_exchange fs/btrfs/inode.c:7990 [inline]
btrfs_rename2+0xcb7/0x2b90 fs/btrfs/inode.c:8374
vfs_rename+0xbdb/0xf00 fs/namei.c:5067
do_renameat2+0xd94/0x13f0 fs/namei.c:5224
__do_sys_renameat2 fs/namei.c:5258 [inline]
__se_sys_renameat2 fs/namei.c:5255 [inline]
__x64_sys_renameat2+0xce/0xe0 fs/namei.c:5255
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #0 (btrfs-tree-01){++++}-{4:4}:
check_prev_add kernel/locking/lockdep.c:3161 [inline]
check_prevs_add kernel/locking/lockdep.c:3280 [inline]
validate_chain+0x18ef/0x5920 kernel/locking/lockdep.c:3904
__lock_acquire+0x1397/0x2100 kernel/locking/lockdep.c:5226
lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5849
down_read_nested+0xb5/0xa50 kernel/locking/rwsem.c:1649
btrfs_tree_read_lock_nested+0x2f/0x250 fs/btrfs/locking.c:146
btrfs_tree_read_lock fs/btrfs/locking.h:188 [inline]
read_block_for_search+0x718/0xbb0 fs/btrfs/ctree.c:1610
btrfs_search_slot+0x1274/0x3180 fs/btrfs/ctree.c:2237
replace_path+0x1243/0x2740 fs/btrfs/relocation.c:1224
merge_reloc_root+0xc46/0x1ad0 fs/btrfs/relocation.c:1692
merge_reloc_roots+0x3b3/0x980 fs/btrfs/relocation.c:1942
relocate_block_group+0xb0a/0xd40 fs/btrfs/relocation.c:3754
btrfs_relocate_block_group+0x77d/0xd90 fs/btrfs/relocation.c:4087
btrfs_relocate_chunk+0x12c/0x3b0 fs/btrfs/volumes.c:3494
__btrfs_balance+0x1b0f/0x26b0 fs/btrfs/volumes.c:4278
btrfs_balance+0xbdc/0x10c0 fs/btrfs/volumes.c:4655
btrfs_ioctl_balance+0x493/0x7c0 fs/btrfs/ioctl.c:3670
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:906 [inline]
__se_sys_ioctl+0xf5/0x170 fs/ioctl.c:892
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
other info that might help us debug this:
Chain exists of:
btrfs-tree-01 --> btrfs-tree-01/1 --> btrfs-treloc-02/1
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(btrfs-treloc-02/1);
lock(btrfs-tree-01/1);
lock(btrfs-treloc-02/1);
rlock(btrfs-tree-01);
*** DEADLOCK ***
8 locks held by syz.0.0/5335:
#0: ffff88801e3ae420 (sb_writers#13){.+.+}-{0:0}, at: mnt_want_write_file+0x5e/0x200 fs/namespace.c:559
#1: ffff888052c760d0 (&fs_info->reclaim_bgs_lock){+.+.}-{4:4}, at: __btrfs_balance+0x4c2/0x26b0 fs/btrfs/volumes.c:4183
#2: ffff888052c74850 (&fs_info->cleaner_mutex){+.+.}-{4:4}, at: btrfs_relocate_block_group+0x775/0xd90 fs/btrfs/relocation.c:4086
#3: ffff88801e3ae610 (sb_internal#2){.+.+}-{0:0}, at: merge_reloc_root+0xf11/0x1ad0 fs/btrfs/relocation.c:1659
#4: ffff888052c76470 (btrfs_trans_num_writers){++++}-{0:0}, at: join_transaction+0x405/0xda0 fs/btrfs/transaction.c:288
#5: ffff888052c76498 (btrfs_trans_num_extwriters){++++}-{0:0}, at: join_transaction+0x405/0xda0 fs/btrfs/transaction.c:288
#6: ffff8880545db878 (btrfs-tree-01/1){+.+.}-{4:4}, at: btrfs_tree_lock_nested+0x2f/0x250 fs/btrfs/locking.c:189
#7: ffff8880545dba58 (btrfs-treloc-02/1){+.+.}-{4:4}, at: btrfs_tree_lock_nested+0x2f/0x250 fs/btrfs/locking.c:189
stack backtrace:
CPU: 0 UID: 0 PID: 5335 Comm: syz.0.0 Not tainted 6.13.0-rc5-syzkaller-00163-gab75170520d4 #0
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:94 [inline]
dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120
print_circular_bug+0x13a/0x1b0 kernel/locking/lockdep.c:2074
check_noncircular+0x36a/0x4a0 kernel/locking/lockdep.c:2206
check_prev_add kernel/locking/lockdep.c:3161 [inline]
check_prevs_add kernel/locking/lockdep.c:3280 [inline]
validate_chain+0x18ef/0x5920 kernel/locking/lockdep.c:3904
__lock_acquire+0x1397/0x2100 kernel/locking/lockdep.c:5226
lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5849
down_read_nested+0xb5/0xa50 kernel/locking/rwsem.c:1649
btrfs_tree_read_lock_nested+0x2f/0x250 fs/btrfs/locking.c:146
btrfs_tree_read_lock fs/btrfs/locking.h:188 [inline]
read_block_for_search+0x718/0xbb0 fs/btrfs/ctree.c:1610
btrfs_search_slot+0x1274/0x3180 fs/btrfs/ctree.c:2237
replace_path+0x1243/0x2740 fs/btrfs/relocation.c:1224
merge_reloc_root+0xc46/0x1ad0 fs/btrfs/relocation.c:1692
merge_reloc_roots+0x3b3/0x980 fs/btrfs/relocation.c:1942
relocate_block_group+0xb0a/0xd40 fs/btrfs/relocation.c:3754
btrfs_relocate_block_group+0x77d/0xd90 fs/btrfs/relocation.c:4087
btrfs_relocate_chunk+0x12c/0x3b0 fs/btrfs/volumes.c:3494
__btrfs_balance+0x1b0f/0x26b0 fs/btrfs/volumes.c:4278
btrfs_balance+0xbdc/0x10c0 fs/btrfs/volumes.c:4655
btrfs_ioctl_balance+0x493/0x7c0 fs/btrfs/ioctl.c:3670
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:906 [inline]
__se_sys_ioctl+0xf5/0x170 fs/ioctl.c:892
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f1ac6985d29
Code: ff ff c3 (...)
RSP: 002b:00007f1ac63fe038 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007f1ac6b76160 RCX: 00007f1ac6985d29
RDX: 0000000020000180 RSI: 00000000c4009420 RDI: 0000000000000007
RBP: 00007f1ac6a01b08 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000001 R14: 00007f1ac6b76160 R15: 00007fffda145a88
</TASK>
Reported-by: syzbot+63913e558c084f7f8fdc@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/677b3014.050a0220.3b53b0.0064.GAE@google.com/
Fixes: 99785998ed ("btrfs: reduce lock contention when eb cache miss for btree search")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Merge tag 'for-6.14-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"User visible changes, features:
- rebuilding of the free space tree at mount time is done in more
transactions, fix potential hangs when the transaction thread is
blocked due to large amount of block groups
- more read IO balancing strategies (experimental config), add two
new ways how to select a device for read if the profiles allow that
(all RAID1*), the current default selects the device by pid which
is good on average but less performant for single reader workloads
- select preferred device for all reads (namely for testing)
- round-robin, balance reads across devices relevant for the
requested IO range
- add encoded write ioctl support to io_uring (read was added in
6.12), basis for writing send stream using that instead of
syscalls, non-blocking mode is not yet implemented
- support FS_IOC_READ_VERITY_METADATA, applications can use the
metadata to do their own verification
- pass inode's i_write_hint to bios, for parity with other
filesystems, ioctls F_GET_RW_HINT/F_SET_RW_HINT
Core:
- in zoned mode: allow to directly reclaim a block group by simply
resetting it, then it can be reused and another block group does
not need to be allocated
- super block validation now also does more comprehensive sys array
validation, adding it to the points where superblock is validated
(post-read, pre-write)
- subpage mode fixes:
- fix double accounting of blocks due to some races
- improved or fixed error handling in a few cases (compression,
delalloc)
- raid stripe tree:
- fix various cases with extent range splitting or deleting
- implement hole punching to extent range
- reduce number of stripe tree lookups during bio submission
- more self-tests
- updated self-tests (delayed refs)
- error handling improvements
- cleanups, refactoring
- remove rest of backref caching infrastructure from relocation,
not needed anymore
- error message updates
- remove unnecessary calls when extent buffer was marked dirty
- unused parameter removal
- code moved to new files
Other code changes: add rb_find_add_cached() to the rb-tree API"
* tag 'for-6.14-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (127 commits)
btrfs: selftests: add a selftest for deleting two out of three extents
btrfs: selftests: add test for punching a hole into 3 RAID stripe-extents
btrfs: selftests: add selftest for punching holes into the RAID stripe extents
btrfs: selftests: test RAID stripe-tree deletion spanning two items
btrfs: selftests: don't split RAID extents in half
btrfs: selftests: check for correct return value of failed lookup
btrfs: don't use btrfs_set_item_key_safe on RAID stripe-extents
btrfs: implement hole punching for RAID stripe extents
btrfs: fix deletion of a range spanning parts two RAID stripe extents
btrfs: fix tail delete of RAID stripe-extents
btrfs: fix front delete range calculation for RAID stripe extents
btrfs: assert RAID stripe-extent length is always greater than 0
btrfs: don't try to delete RAID stripe-extents if we don't need to
btrfs: selftests: correct RAID stripe-tree feature flag setting
btrfs: add io_uring interface for encoded writes
btrfs: remove the unused locked_folio parameter from btrfs_cleanup_ordered_extents()
btrfs: add extra error messages for delalloc range related errors
btrfs: subpage: dump the involved bitmap when ASSERT() failed
btrfs: subpage: fix the bitmap dump of the locked flags
btrfs: do proper folio cleanup when run_delalloc_nocow() failed
...
Add a selftest creating three extents and then deleting two out of the
three extents.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Test creating a range of three RAID stripe-extents and then punch a hole
in the middle, deleting all of the middle extents and partially deleting
the "book ends".
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add a selftest for punching a hole into a RAID stripe extent. The test
creates a 1M extent and punches a 64K long hole at an offset of 32K from
the start of the extent.
Afterwards it verifies the start and length of both resulting new extents
"left" and "right" as well as the absence of the hole.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add a selftest for RAID stripe-tree deletion with a delete range spanning
two items, so that we're punching a hole into two adjacent RAID stripe
extents truncating the first and "moving" the second to the right.
The following diagram illustrates the operation:
|--- RAID Stripe Extent ---||--- RAID Stripe Extent ---|
|----- keep -----|--- drop ---|----- keep ----|
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The selftests for partially deleting the start or tail of RAID
stripe-extents split these extents in half.
This can hide errors in the calculation, so don't split the RAID
stripe-extents in half but delete the first or last 16K of the 64K
extents.
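One way a half split can hide an error, as a minimal sketch: if the
calculation mixed up the deleted length and the remaining length, both are
32K for a 64K extent cut in half, so a check on the result still passes.
The helper names below are hypothetical and not taken from the selftest
code, and the bug shown is only an assumed example.

```c
#include <assert.h>
#include <stdint.h>

#define SZ_16K (16ULL * 1024)
#define SZ_32K (32ULL * 1024)
#define SZ_64K (64ULL * 1024)

/* Correct: length that remains after deleting @del bytes from an extent. */
static uint64_t remaining_len(uint64_t extent_len, uint64_t del)
{
	assert(del <= extent_len);
	return extent_len - del;
}

/* Hypothetical bug: returns the deleted length instead of the remainder. */
static uint64_t remaining_len_buggy(uint64_t extent_len, uint64_t del)
{
	assert(del <= extent_len);
	return del;
}
```

With a 32K delete both helpers agree, while a 16K delete makes the buggy
one return 16K where 48K is expected.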
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 5e72aabc1f ("btrfs: return ENODATA in case RST lookup fails")
changed btrfs_get_raid_extent_offset()'s return value to ENODATA in case
the RAID stripe-tree lookup failed.
Adjust the test cases which check for absence of a given range to check
for ENODATA as return value in this case.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If the stripe extent we want to delete starts before the range we want to
delete and ends after the range we want to delete we're punching a
hole in the stripe extent:
|--- RAID Stripe Extent ---|
| keep |--- drop ---| keep |
This means we need to a) truncate the existing item and b)
create a second item for the remaining range.
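The two steps can be sketched in plain C as follows. This is only an
illustration of the range arithmetic: struct range and punch_hole() are
hypothetical names, not the on-disk item format or the kernel function.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a stripe extent: [start, start + len). */
struct range {
	uint64_t start;
	uint64_t len;
};

/*
 * Punch [hole_start, hole_start + hole_len) out of @ext, which must start
 * before the hole and end after it. @left is the truncated original and
 * @right is the new item for the remaining range.
 */
static void punch_hole(const struct range *ext, uint64_t hole_start,
		       uint64_t hole_len, struct range *left,
		       struct range *right)
{
	const uint64_t ext_end = ext->start + ext->len;
	const uint64_t hole_end = hole_start + hole_len;

	assert(ext->start < hole_start && hole_end < ext_end);

	/* a) truncate the existing item so it ends where the hole begins */
	left->start = ext->start;
	left->len = hole_start - ext->start;

	/* b) create a second item covering what is left after the hole */
	right->start = hole_end;
	right->len = ext_end - hole_end;
}
```

For a 1M extent at offset 0 with a 64K hole at 32K, this gives a left part
of [0, 32K) and a right part starting at 96K.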
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When a user requests the deletion of a range that spans multiple stripe
extents and btrfs_search_slot() returns us the second RAID stripe extent,
we need to pick the previous item and truncate it. If there is still a
range left to delete, move on to the next item.
The following diagram illustrates the operation:
|--- RAID Stripe Extent ---||--- RAID Stripe Extent ---|
|--- keep ---|--- drop ---|
While at it, comment the trivial case of a whole item delete as well.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Fix tail delete of RAID stripe-extents, if there is a range to be deleted
as well after the tail delete of the extent.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When deleting the front of a RAID stripe-extent the delete code
miscalculates how much to pad the front of the remaining extent part.
Fix the calculation so we always get the sizes we expect.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When modifying a RAID stripe-extent, ASSERT() that the length of the new
RAID stripe-extent is always greater than 0.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Even if the RAID stripe-tree is not enabled in the filesystem,
do_free_extent_accounting() still calls into btrfs_delete_raid_extent().
Check if the extent in question is on a block-group that has a profile
which is used by RAID stripe-tree before attempting to delete a stripe
extent. Return early if it doesn't, otherwise we're doing an unnecessary
search.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
RAID stripe-tree is an incompatible feature, not a read-only compatible
one, so set the incompat flag, not a compat_ro one, in the selftest code.
Subsequent changes in btrfs_delete_raid_extent() will start checking for
this flag.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Inside function get_canonical_dev_path(), we call d_path() to get the
final device path.
But d_path() can return an error, and in that case the next strscpy() call
will trigger an invalid memory access.
Add back the missing error handling for d_path().
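The pattern is to check the returned pointer for an encoded error before
using it as a string. A userspace sketch, assuming stand-ins for the
kernel's ERR_PTR()/IS_ERR()/PTR_ERR() helpers (all sketch_* names,
resolve_path() and canonicalize() are hypothetical):

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Userspace stand-ins for the kernel's ERR_PTR()/IS_ERR()/PTR_ERR(). */
#define SKETCH_MAX_ERRNO 4095

static inline void *sketch_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

static inline int sketch_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

static inline long sketch_ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

/* Hypothetical resolver: returns a string or an encoded error pointer. */
static char *resolve_path(const char *in, char *buf, size_t buflen)
{
	size_t len = strlen(in);

	if (len + 1 > buflen)
		return sketch_err_ptr(-ENAMETOOLONG);
	memcpy(buf, in, len + 1);
	return buf;
}

/* Check the result before copying it, instead of copying blindly. */
static int canonicalize(const char *in, char *out, size_t outlen)
{
	char buf[16];
	char *resolved = resolve_path(in, buf, sizeof(buf));

	if (sketch_is_err(resolved))
		return (int)sketch_ptr_err(resolved);
	if (strlen(resolved) + 1 > outlen)
		return -ENAMETOOLONG;
	strcpy(out, resolved);
	return 0;
}
```

Without the sketch_is_err() check, the copy would dereference the encoded
error value as if it were a valid string pointer.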
Reported-by: Boris Burkov <boris@bur.io>
Fixes: 7e06de7c83 ("btrfs: canonicalize the device path before adding it")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add an io_uring interface for encoded writes, with the same parameters
as the BTRFS_IOC_ENCODED_WRITE ioctl.
As with the encoded reads code, there's a test program for this at
https://github.com/maharmstone/io_uring-encoded, and I'll get this
worked into an fstest.
How io_uring works is that it initially calls btrfs_uring_cmd with the
IO_URING_F_NONBLOCK flag set, and if we return -EAGAIN it tries again in
a kthread with the flag cleared.
Ideally we'd honour this and call try_lock etc., but there's still a lot
of work to be done to create non-blocking versions of all the functions
in our write path. Instead, just validate the input in
btrfs_uring_encoded_write() on the first pass and return -EAGAIN, with a
view to properly optimizing the optimistic path later on.
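The two-pass behaviour described above can be modelled in userspace as
follows. This is a sketch only: SKETCH_F_NONBLOCK stands in for
IO_URING_F_NONBLOCK, and struct sketch_args, sketch_encoded_write() and
sketch_submit() are hypothetical names, not the btrfs or io_uring API.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical flag standing in for IO_URING_F_NONBLOCK. */
#define SKETCH_F_NONBLOCK 0x1u

struct sketch_args {
	const void *buf;
	size_t len;
};

/* Hypothetical handler: validate first, defer the real work on pass one. */
static int sketch_encoded_write(const struct sketch_args *args,
				unsigned int issue_flags, int *did_write)
{
	/* Input validation happens on every pass, including the first. */
	if (!args->buf || args->len == 0)
		return -EINVAL;

	/* Non-blocking pass: do not attempt the write, ask for a retry. */
	if (issue_flags & SKETCH_F_NONBLOCK)
		return -EAGAIN;

	/* Blocking retry: the write path may sleep here. */
	*did_write = 1;
	return 0;
}

/* Mimics the caller: retry without the flag if pass one says -EAGAIN. */
static int sketch_submit(const struct sketch_args *args, int *did_write)
{
	int ret = sketch_encoded_write(args, SKETCH_F_NONBLOCK, did_write);

	if (ret == -EAGAIN)
		ret = sketch_encoded_write(args, 0, did_write);
	return ret;
}
```

Invalid input is rejected on the first pass without a retry, while valid
input always takes the second, blocking pass.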
Signed-off-by: Mark Harmstone <maharmstone@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function btrfs_cleanup_ordered_extents() is only called in error
handling path, and the last caller with a @locked_folio parameter was
removed to fix a bug in the btrfs_run_delalloc_range() error handling.
There is no need to pass @locked_folio parameter anymore.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All the error handling bugs I hit so far are -ENOSPC from one of:
- cow_file_range()
- run_delalloc_nocow()
- submit_uncompressed_range()
Previously when those functions failed, there was no error message at
all, making the debugging much harder.
So here we introduce extra error messages for:
- cow_file_range()
- run_delalloc_nocow()
- submit_uncompressed_range()
- writepage_delalloc() when btrfs_run_delalloc_range() failed
- extent_writepage() when extent_writepage_io() failed
One example of the new debug error messages is the following one:
run fstests generic/750 at 2024-12-08 12:41:41
BTRFS: device fsid 461b25f5-e240-4543-8deb-e7c2bd01a6d3 devid 1 transid 8 /dev/mapper/test-scratch1 (253:4) scanned by mount (2436600)
BTRFS info (device dm-4): first mount of filesystem 461b25f5-e240-4543-8deb-e7c2bd01a6d3
BTRFS info (device dm-4): using crc32c (crc32c-arm64) checksum algorithm
BTRFS info (device dm-4): forcing free space tree for sector size 4096 with page size 65536
BTRFS info (device dm-4): using free-space-tree
BTRFS warning (device dm-4): read-write for sector size 4096 with page size 65536 is experimental
BTRFS info (device dm-4): checking UUID tree
BTRFS error (device dm-4): cow_file_range failed, root=363 inode=412 start=503808 len=98304: -28
BTRFS error (device dm-4): run_delalloc_nocow failed, root=363 inode=412 start=503808 len=98304: -28
BTRFS error (device dm-4): failed to run delalloc range, root=363 ino=412 folio=458752 submit_bitmap=11-15 start=503808 len=98304: -28
Which shows an error from cow_file_range() which is called inside a
nocow write attempt, along with the extra bitmap from
writepage_delalloc().
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For btrfs_folio_assert_not_dirty() and btrfs_folio_set_lock(), we call
bitmap_test_range_all_zero() to ensure the involved range has no
dirty/lock bit already set.
However with my recent enhanced delalloc range error handling, I was
hitting the ASSERT() inside btrfs_folio_set_lock(), and it turns out
that some error handling path is not properly updating the folio flags.
So add some extra dumping for the ASSERTs to dump the involved bitmap
to help debug.
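As an illustration of the kind of check and dump involved, here is a
userspace sketch on a single unsigned long. range_all_zero() and
format_bitmap() are hypothetical helpers, not the kernel bitmap API, and
the range-list output format (such as "11-15") is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define BITS_PER_ULONG (8 * (unsigned int)sizeof(unsigned long))

/* Return 1 if bits [start, start + nbits) of @bitmap are all zero. */
static int range_all_zero(unsigned long bitmap, unsigned int start,
			  unsigned int nbits)
{
	unsigned long mask;

	assert(nbits > 0 && start + nbits <= BITS_PER_ULONG);
	/* Avoid an undefined full-width shift when the range is the whole word. */
	mask = (nbits == BITS_PER_ULONG) ? ~0UL : ((1UL << nbits) - 1) << start;
	return (bitmap & mask) == 0;
}

/* Write the set bits of @bitmap as a range list, e.g. "0,3-5", into @buf. */
static void format_bitmap(unsigned long bitmap, unsigned int nbits,
			  char *buf, size_t buflen)
{
	size_t pos = 0;
	unsigned int bit = 0;

	assert(buflen > 0 && nbits <= BITS_PER_ULONG);
	buf[0] = '\0';
	while (bit < nbits && pos < buflen) {
		unsigned int first;
		int n;

		if (!(bitmap & (1UL << bit))) {
			bit++;
			continue;
		}
		/* Extend the run while the next bit is also set. */
		first = bit;
		while (bit + 1 < nbits && (bitmap & (1UL << (bit + 1))))
			bit++;
		if (first == bit)
			n = snprintf(buf + pos, buflen - pos, "%s%u",
				     pos ? "," : "", first);
		else
			n = snprintf(buf + pos, buflen - pos, "%s%u-%u",
				     pos ? "," : "", first, bit);
		if (n < 0)
			break;
		pos += (size_t)n;
		bit++;
	}
}
```

A caller would run range_all_zero() first and, when it fails, print the
string from format_bitmap() next to the assertion message.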
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We're dumping the locked bitmap into the @checked_bitmap variable,
printing incorrect values during debug.
Thankfully even during my development I haven't hit a case where I need
to dump the locked bitmap. But for the sake of consistency, fix it by
dumping the locked bitmap into @locked_bitmap variable for output.
Fixes: 75258f20fb ("btrfs: subpage: dump extra subpage bitmaps for debug")
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
With CONFIG_DEBUG_VM set, test case generic/476 has some chance to crash
with the following VM_BUG_ON_FOLIO():
BTRFS error (device dm-3): cow_file_range failed, start 1146880 end 1253375 len 106496 ret -28
BTRFS error (device dm-3): run_delalloc_nocow failed, start 1146880 end 1253375 len 106496 ret -28
page: refcount:4 mapcount:0 mapping:00000000592787cc index:0x12 pfn:0x10664
aops:btrfs_aops [btrfs] ino:101 dentry name(?):"f1774"
flags: 0x2fffff80004028(uptodate|lru|private|node=0|zone=2|lastcpupid=0xfffff)
page dumped because: VM_BUG_ON_FOLIO(!folio_test_locked(folio))
------------[ cut here ]------------
kernel BUG at mm/page-writeback.c:2992!
Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
CPU: 2 UID: 0 PID: 3943513 Comm: kworker/u24:15 Tainted: G OE 6.12.0-rc7-custom+ #87
Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022
Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs]
pc : folio_clear_dirty_for_io+0x128/0x258
lr : folio_clear_dirty_for_io+0x128/0x258
Call trace:
folio_clear_dirty_for_io+0x128/0x258
btrfs_folio_clamp_clear_dirty+0x80/0xd0 [btrfs]
__process_folios_contig+0x154/0x268 [btrfs]
extent_clear_unlock_delalloc+0x5c/0x80 [btrfs]
run_delalloc_nocow+0x5f8/0x760 [btrfs]
btrfs_run_delalloc_range+0xa8/0x220 [btrfs]
writepage_delalloc+0x230/0x4c8 [btrfs]
extent_writepage+0xb8/0x358 [btrfs]
extent_write_cache_pages+0x21c/0x4e8 [btrfs]
btrfs_writepages+0x94/0x150 [btrfs]
do_writepages+0x74/0x190
filemap_fdatawrite_wbc+0x88/0xc8
start_delalloc_inodes+0x178/0x3a8 [btrfs]
btrfs_start_delalloc_roots+0x174/0x280 [btrfs]
shrink_delalloc+0x114/0x280 [btrfs]
flush_space+0x250/0x2f8 [btrfs]
btrfs_async_reclaim_data_space+0x180/0x228 [btrfs]
process_one_work+0x164/0x408
worker_thread+0x25c/0x388
kthread+0x100/0x118
ret_from_fork+0x10/0x20
Code: 910a8021 a90363f7 a9046bf9 94012379 (d4210000)
---[ end trace 0000000000000000 ]---
[CAUSE]
The first two lines of extra debug messages show the problem is caused
by the error handling of run_delalloc_nocow().
E.g. we have the following dirtied range (4K blocksize 4K page size):
0                 16K                32K
|//////////////////////////////////////|
|  Pre-allocated   |
And the range [0, 16K) has a preallocated extent.
- Enter run_delalloc_nocow() for range [0, 16K)
Which found range [0, 16K) is preallocated, can do the proper NOCOW
write.
- Enter fallback_to_cow() for range [16K, 32K)
Since the range [16K, 32K) is not backed by preallocated extent, we
have to go COW.
- cow_file_range() failed for range [16K, 32K)
So cow_file_range() will do the cleanup by clearing the folio dirty flags
and unlocking the folios.
Now the folios in range [16K, 32K) are unlocked.
- Enter extent_clear_unlock_delalloc() from run_delalloc_nocow()
Which is called with PAGE_START_WRITEBACK to start page writeback.
But folios can only be marked writeback when they are properly locked,
thus this triggered the VM_BUG_ON_FOLIO().
Furthermore there is another hidden but common bug that
run_delalloc_nocow() is not clearing the folio dirty flags in its error
handling path.
This is the common bug shared between run_delalloc_nocow() and
cow_file_range().
[FIX]
- Clear folio dirty for range [@start, @cur_offset)
Introduce a helper, cleanup_dirty_folios(), which
will find and lock the folio in the range, clear the dirty flag and
start/end the writeback, with the extra handling for the
@locked_folio.
- Introduce a helper to clear folio dirty, start and end writeback
- Introduce a helper to record the last failed COW range end
This is to trace which range we should skip, to avoid double
unlocking.
- Skip the failed COW range for the error handling
CC: stable@vger.kernel.org
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
When testing with COW fixup marked as BUG_ON() (this is involved with the
new pin_user_pages*() change, which should not result in new out-of-band
dirty pages), I hit a crash triggered by the BUG_ON() from hitting COW
fixup path.
This BUG_ON() happens just after a failed btrfs_run_delalloc_range():
BTRFS error (device dm-2): failed to run delalloc range, root 348 ino 405 folio 65536 submit_bitmap 6-15 start 90112 len 106496: -28
------------[ cut here ]------------
kernel BUG at fs/btrfs/extent_io.c:1444!
Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
CPU: 0 UID: 0 PID: 434621 Comm: kworker/u24:8 Tainted: G OE 6.12.0-rc7-custom+ #86
Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022
Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs]
pc : extent_writepage_io+0x2d4/0x308 [btrfs]
lr : extent_writepage_io+0x2d4/0x308 [btrfs]
Call trace:
extent_writepage_io+0x2d4/0x308 [btrfs]
extent_writepage+0x218/0x330 [btrfs]
extent_write_cache_pages+0x1d4/0x4b0 [btrfs]
btrfs_writepages+0x94/0x150 [btrfs]
do_writepages+0x74/0x190
filemap_fdatawrite_wbc+0x88/0xc8
start_delalloc_inodes+0x180/0x3b0 [btrfs]
btrfs_start_delalloc_roots+0x174/0x280 [btrfs]
shrink_delalloc+0x114/0x280 [btrfs]
flush_space+0x250/0x2f8 [btrfs]
btrfs_async_reclaim_data_space+0x180/0x228 [btrfs]
process_one_work+0x164/0x408
worker_thread+0x25c/0x388
kthread+0x100/0x118
ret_from_fork+0x10/0x20
Code: aa1403e1 9402f3ef aa1403e0 9402f36f (d4210000)
---[ end trace 0000000000000000 ]---
[CAUSE]
That failure is mostly from cow_file_range(), where we can hit -ENOSPC.
Although the -ENOSPC is already a bug related to our space reservation
code, let's just focus on the error handling.
For example, we have the following dirty range [0, 64K) of an inode,
with 4K sector size and 4K page size:
0       16K        32K       48K      64K
|///////////////////////////////////////|
|#######################################|
Where |///| means pages are still dirty, and |###| means the extent io
tree has EXTENT_DELALLOC flag.
- Enter extent_writepage() for page 0
- Enter btrfs_run_delalloc_range() for range [0, 64K)
- Enter cow_file_range() for range [0, 64K)
- Function btrfs_reserve_extent() only reserved one 16K extent
So we created extent map and ordered extent for range [0, 16K)
0       16K        32K       48K      64K
|////////|//////////////////////////////|
|<- OE ->|##############################|
And range [0, 16K) has its delalloc flag cleared.
But since we haven't yet submitted any bio, the 4 involved pages are still
dirty.
- Function btrfs_reserve_extent() returns with -ENOSPC
Now we have to run error cleanup, which will clear all
EXTENT_DELALLOC* flags and clear the dirty flags for the remaining
ranges:
0       16K        32K       48K      64K
|////////|                              |
|        |                              |
Note that range [0, 16K) still has its pages dirty.
- Some time later, writeback is triggered again for the range [0, 16K)
since the page range still has dirty flags.
- btrfs_run_delalloc_range() will do nothing because there is no
EXTENT_DELALLOC flag.
- extent_writepage_io() finds page 0 has no ordered flag
Which falls into the COW fixup path, triggering the BUG_ON().
Unfortunately this error handling bug dates back to the introduction of
btrfs. Thankfully with the abuse of COW fixup, at least it won't crash
the kernel.
[FIX]
Instead of immediately unlocking the extent and folios, we keep the extent
and folios locked until either erroring out or the whole delalloc range
finished.
When the whole delalloc range finished without error, we just unlock the
whole range with PAGE_SET_ORDERED (and PAGE_UNLOCK for !keep_locked
cases), with EXTENT_DELALLOC and EXTENT_LOCKED cleared.
And the involved folios will be properly submitted, with their dirty
flags cleared during submission.
For the error path, it will be a little more complex:
- The range with ordered extent allocated (range (1))
We only clear the EXTENT_DELALLOC and EXTENT_LOCKED, as the remaining
flags are cleaned up by
btrfs_mark_ordered_io_finished()->btrfs_finish_one_ordered().
For folios we finish the IO (clear dirty, start writeback and
immediately finish the writeback) and unlock the folios.
- The range with reserved extent but no ordered extent (range(2))
- The range we never touched (range(3))
For both range (2) and range(3) the behavior is not changed.
Now even if cow_file_range() failed halfway with some successfully
reserved extents/ordered extents, we will keep all folios clean, so
there will be no future writeback triggered on them.
CC: stable@vger.kernel.org
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
If we failed to compress the range, or cannot reserve a large enough
data extent (e.g. too fragmented free space), we will fall back to
submit_uncompressed_range().
But inside submit_uncompressed_range(), run_delalloc_cow() can also fail
due to -ENOSPC or any other error.
In that case there are 3 bugs in the error handling:
1) Double freeing for the same ordered extent
This can lead to crash due to ordered extent double accounting
2) Start/end writeback without updating the subpage writeback bitmap
3) Unlock the folio without clearing the subpage lock bitmap
Both bugs 2) and 3) will crash the kernel if the btrfs block size is
smaller than folio size, as the next time the folio gets writeback/lock
updates, subpage will find the bitmap already has the range set,
triggering an ASSERT().
[CAUSE]
Bug 1) happens in the following call chain:
submit_uncompressed_range()
|- run_delalloc_cow()
| |- cow_file_range()
| |- btrfs_reserve_extent()
| Failed with -ENOSPC or whatever error
|
|- btrfs_cleanup_ordered_extents()
| |- btrfs_mark_ordered_io_finished()
| Which cleans all the ordered extents in the async_extent range.
|
|- btrfs_mark_ordered_io_finished()
Which cleans the folio range.
The finished ordered extents may not be immediately removed from the
ordered io tree, as they are removed inside a work queue.
So the second btrfs_mark_ordered_io_finished() may find the finished but
not-yet-removed ordered extents, and double free them.
Furthermore, the second btrfs_mark_ordered_io_finished() is not subpage
compatible, as it uses fixed folio_pos() with PAGE_SIZE, which can cover
other ordered extents.
Bugs 2) and 3) are more straightforward: btrfs just calls folio_unlock(),
folio_start_writeback() and folio_end_writeback(), rather than the helpers
which handle subpage cases.
[FIX]
For bug 1) since the first btrfs_cleanup_ordered_extents() call is
handling the whole range, we should not do the second
btrfs_mark_ordered_io_finished() call.
And for the first btrfs_cleanup_ordered_extents(), we no longer need to
pass the @locked_page parameter, as we are already in the async extent
context, thus will never rely on the error handling inside
btrfs_run_delalloc_range().
So just let btrfs_cleanup_ordered_extents() handle every folio
equally.
For bug 2) we should not even call
folio_start_writeback()/folio_end_writeback() anymore.
Per the error handling protocol, cow_file_range() should clear the
dirty flag and start/finish the writeback for the whole range passed in.
For bug 3) just change the folio_unlock() to btrfs_folio_end_lock()
helper.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
If submit_one_sector() failed inside extent_writepage_io() for sector
size < page size cases (e.g. 4K sector size and 64K page size), then
we can hit a double ordered extent accounting error.
This should be very rare, as submit_one_sector() only fails when we
fail to grab the extent map, and such an extent map should exist in
memory and be pinned.
[CAUSE]
For example we have the following folio layout:
0 4K 32K 48K 60K 64K
|//| |//////| |///|
Where |///| is the dirty range we need to writeback. The 3 different
dirty ranges are submitted for regular COW.
Now we hit the following sequence:
- submit_one_sector() returned 0 for [0, 4K)
- submit_one_sector() returned 0 for [32K, 48K)
- submit_one_sector() returned error for [60K, 64K)
- btrfs_mark_ordered_io_finished() called for the whole folio
This will mark the following ranges as finished:
* [0, 4K)
* [32K, 48K)
Both ranges have their IO already submitted, this cleanup will
lead to double accounting.
* [60K, 64K)
That's the correct cleanup.
The only good news is, this error is only theoretical, as the target
extent map is always pinned, thus we should directly grab it from
memory, rather than reading it from the disk.
[FIX]
Instead of calling btrfs_mark_ordered_io_finished() for the whole folio
range, which can touch ranges we should not touch, move the error
handling inside extent_writepage_io().
That way we can clean up exactly the sectors that ought to be submitted
but failed.
This provides much more accurate cleanup, avoiding the double accounting.
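The idea of per-sector cleanup can be sketched in userspace with a 16-bit
dirty bitmap for one 64K folio with 4K sectors. All names here are
hypothetical, and whether the loop keeps going after a failed sector is an
assumption of this sketch, not something the text above states.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define SECTORS_PER_FOLIO 16	/* 64K folio, 4K sectors */

/* Hypothetical submit hook: returns 0 or a negative errno for one sector. */
typedef int (*submit_fn)(unsigned int sector);

/* Example hook for the layout above: only the last sector fails. */
static int fail_last_sector(unsigned int sector)
{
	return sector == SECTORS_PER_FOLIO - 1 ? -EIO : 0;
}

/*
 * Walk the dirty sectors of one folio. A sector that fails to submit is
 * recorded in @cleanup so that only this sector gets manual cleanup;
 * sectors already submitted are left to their own IO completion.
 * Returns the first error seen, or 0.
 */
static int writepage_io_sketch(uint16_t dirty, submit_fn submit,
			       uint16_t *cleanup)
{
	int ret = 0;

	*cleanup = 0;
	for (unsigned int i = 0; i < SECTORS_PER_FOLIO; i++) {
		int err;

		if (!(dirty & (1u << i)))
			continue;
		err = submit(i);
		if (err) {
			*cleanup |= (uint16_t)(1u << i);
			if (!ret)
				ret = err;
		}
	}
	return ret;
}
```

With the dirty ranges [0, 4K), [32K, 48K) and [60K, 64K) encoded as 0x8f01,
only bit 15 ends up in the cleanup mask, so the two submitted ranges are
not accounted twice.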
CC: stable@vger.kernel.org # 5.15+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
When running btrfs with block size (4K) smaller than page size (64K,
aarch64), there is a very high chance to crash the kernel at
generic/750, with the following messages:
(before the call traces, there are 3 extra debug messages added)
BTRFS warning (device dm-3): read-write for sector size 4096 with page size 65536 is experimental
BTRFS info (device dm-3): checking UUID tree
hrtimer: interrupt took 5451385 ns
BTRFS error (device dm-3): cow_file_range failed, root=4957 inode=257 start=1605632 len=69632: -28
BTRFS error (device dm-3): run_delalloc_nocow failed, root=4957 inode=257 start=1605632 len=69632: -28
BTRFS error (device dm-3): failed to run delalloc range, root=4957 ino=257 folio=1572864 submit_bitmap=8-15 start=1605632 len=69632: -28
------------[ cut here ]------------
WARNING: CPU: 2 PID: 3020984 at ordered-data.c:360 can_finish_ordered_extent+0x370/0x3b8 [btrfs]
CPU: 2 UID: 0 PID: 3020984 Comm: kworker/u24:1 Tainted: G OE 6.13.0-rc1-custom+ #89
Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022
Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs]
pc : can_finish_ordered_extent+0x370/0x3b8 [btrfs]
lr : can_finish_ordered_extent+0x1ec/0x3b8 [btrfs]
Call trace:
can_finish_ordered_extent+0x370/0x3b8 [btrfs] (P)
can_finish_ordered_extent+0x1ec/0x3b8 [btrfs] (L)
btrfs_mark_ordered_io_finished+0x130/0x2b8 [btrfs]
extent_writepage+0x10c/0x3b8 [btrfs]
extent_write_cache_pages+0x21c/0x4e8 [btrfs]
btrfs_writepages+0x94/0x160 [btrfs]
do_writepages+0x74/0x190
filemap_fdatawrite_wbc+0x74/0xa0
start_delalloc_inodes+0x17c/0x3b0 [btrfs]
btrfs_start_delalloc_roots+0x17c/0x288 [btrfs]
shrink_delalloc+0x11c/0x280 [btrfs]
flush_space+0x288/0x328 [btrfs]
btrfs_async_reclaim_data_space+0x180/0x228 [btrfs]
process_one_work+0x228/0x680
worker_thread+0x1bc/0x360
kthread+0x100/0x118
ret_from_fork+0x10/0x20
---[ end trace 0000000000000000 ]---
BTRFS critical (device dm-3): bad ordered extent accounting, root=4957 ino=257 OE offset=1605632 OE len=16384 to_dec=16384 left=0
BTRFS critical (device dm-3): bad ordered extent accounting, root=4957 ino=257 OE offset=1622016 OE len=12288 to_dec=12288 left=0
Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008
BTRFS critical (device dm-3): bad ordered extent accounting, root=4957 ino=257 OE offset=1634304 OE len=8192 to_dec=4096 left=0
CPU: 1 UID: 0 PID: 3286940 Comm: kworker/u24:3 Tainted: G W OE 6.13.0-rc1-custom+ #89
Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022
Workqueue: btrfs_work_helper [btrfs] (btrfs-endio-write)
pstate: 404000c5 (nZcv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : process_one_work+0x110/0x680
lr : worker_thread+0x1bc/0x360
Call trace:
process_one_work+0x110/0x680 (P)
worker_thread+0x1bc/0x360 (L)
worker_thread+0x1bc/0x360
kthread+0x100/0x118
ret_from_fork+0x10/0x20
Code: f84086a1 f9000fe1 53041c21 b9003361 (f9400661)
---[ end trace 0000000000000000 ]---
Kernel panic - not syncing: Oops: Fatal exception
SMP: stopping secondary CPUs
SMP: failed to stop secondary CPUs 2-3
Dumping ftrace buffer:
(ftrace buffer empty)
Kernel Offset: 0x275bb9540000 from 0xffff800080000000
PHYS_OFFSET: 0xffff8fbba0000000
CPU features: 0x100,00000070,00801250,8201720b
[CAUSE]
The above warning is triggered immediately after the delalloc range
failure. This happens in the following sequence:
- Range [1568K, 1636K) is dirty
1536K 1568K 1600K 1636K 1664K
| |/////////|////////| |
Where 1536K, 1600K and 1664K are page boundaries (64K page size)
- Enter extent_writepage() for page 1536K
- Enter run_delalloc_nocow() with locked page 1536K and range
[1568K, 1636K)
This is due to the inode having preallocated extents.
- Enter cow_file_range() with locked page 1536K and range
[1568K, 1636K)
- btrfs_reserve_extent() only reserved two extents
The main loop of cow_file_range() only reserved two data extents.
Now we have:
1536K 1568K 1600K 1636K 1664K
| |<-->|<--->|/|///////| |
1584K 1596K
Range [1568K, 1596K) has an ordered extent reserved.
- btrfs_reserve_extent() failed inside cow_file_range() for file offset
1596K
This is already a bug in our space reservation code, but for now let's
focus on the error handling path.
Now cow_file_range() returned -ENOSPC.
- btrfs_run_delalloc_range() does error cleanup <<< ROOT CAUSE
Call btrfs_cleanup_ordered_extents() with locked folio 1536K and range
[1568K, 1636K)
Function btrfs_cleanup_ordered_extents() normally needs to skip the
ranges inside the folio, as it will normally be cleaned up by
extent_writepage().
Such split error handling is already problematic in the first place.
What's worse is the folio range skipping itself, which does not take
subpage cases into consideration at all: it will only skip the range
if the page start >= the range start.
In our case, the page start < the range start, since for subpage cases
we can have delalloc ranges inside the folio but not covering the
folio.
So it doesn't skip the page range at all.
This means all the ordered extents, both [1568K, 1584K) and
[1584K, 1596K) will be marked as IOERR.
And these two ordered extents have no more pending ios, they are marked
finished, and *QUEUED* to be deleted from the io tree.
- extent_writepage() does error cleanup
Call btrfs_mark_ordered_io_finished() for the range [1536K, 1600K).
Although ranges [1568K, 1584K) and [1584K, 1596K) are finished, the
deletion from io tree is async, it may or may not happen at this
time.
If the ranges have not yet been removed, we will do double cleaning on
those ranges, triggering the above ordered extent warnings.
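The numbers from the sequence above can be checked with a small sketch.
old_rule_skips_page() models the skip condition only as this text
describes it (page start >= range start), and overlap_len() is a
hypothetical helper showing which bytes both cleanups touch; neither is
the kernel code.

```c
#include <assert.h>
#include <stdint.h>

#define SZ_1K 1024ULL

/*
 * Old skip rule as described: the locked page is only excluded from the
 * cleanup when it starts at or after the range start.
 * Returns 1 if the page range would be skipped.
 */
static int old_rule_skips_page(uint64_t page_start, uint64_t range_start)
{
	return page_start >= range_start;
}

/* Length of the overlap between [a_start, a_end) and [b_start, b_end). */
static uint64_t overlap_len(uint64_t a_start, uint64_t a_end,
			    uint64_t b_start, uint64_t b_end)
{
	const uint64_t start = a_start > b_start ? a_start : b_start;
	const uint64_t end = a_end < b_end ? a_end : b_end;

	assert(a_start <= a_end && b_start <= b_end);
	return end > start ? end - start : 0;
}
```

For the page at 1536K and the range starting at 1568K the page is not
skipped, so the first cleanup covers [1568K, 1636K) and the second covers
[1536K, 1600K). They overlap on [1568K, 1600K), which contains both
ordered extents [1568K, 1584K) and [1584K, 1596K).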
In theory there are other bugs, like the cleanup in extent_writepage()
can cause double accounting on ranges that are submitted asynchronously
(compression for example).
But that's much harder to trigger because normally we do not mix regular
and compression delalloc ranges.
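The numbers in the walkthrough above can be checked with a small userspace model. This is only a sketch under stated assumptions: old_check_skips_folio() and range_intersect() are hypothetical helpers written for illustration, not kernel functions, and the skip condition is reduced to the single comparison described in the text.

```c
#include <stdbool.h>
#include <stdint.h>

#define SZ_1K 1024ULL

/* Half-open byte range [start, end). */
struct range {
	uint64_t start;
	uint64_t end;
};

/*
 * Simplified model of the old skip check as described in the text: the
 * locked folio's range is skipped only when the folio starts at or after
 * the start of the range being cleaned up. (Hypothetical helper.)
 */
static bool old_check_skips_folio(uint64_t folio_start, uint64_t range_start)
{
	return folio_start >= range_start;
}

/* Return the overlap of two half-open ranges (empty if they do not overlap). */
static struct range range_intersect(struct range a, struct range b)
{
	struct range r;

	r.start = a.start > b.start ? a.start : b.start;
	r.end = a.end < b.end ? a.end : b.end;
	if (r.end < r.start)
		r.end = r.start;
	return r;
}
```

With the folio at [1536K, 1600K) and the cleanup range [1568K, 1636K), old_check_skips_folio() returns false, and the two cleanup paths overlap on [1568K, 1600K), which is where the double cleanup can happen.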
[FIX]
The folio range split is already buggy and not subpage compatible, it
was introduced a long time ago when subpage support was not even considered.
So instead of splitting the ordered extents cleanup into the folio range
and out of folio range, do all the cleanup inside writepage_delalloc().
- Pass @NULL as locked_folio for btrfs_cleanup_ordered_extents() in
btrfs_run_delalloc_range()
- Skip the btrfs_cleanup_ordered_extents() if writepage_delalloc()
failed
So all ordered extents are only cleaned up by
btrfs_run_delalloc_range().
- Handle the ranges that already have ordered extents allocated
If part of the folio already has an ordered extent allocated, and
btrfs_run_delalloc_range() failed, we also need to clean up that range.
Now we have a concentrated error handling for ordered extents during
btrfs_run_delalloc_range().
Fixes: d1051d6ebf ("btrfs: Fix error handling in btrfs_cleanup_ordered_extents")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Rename the macro so it's obvious what it means.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Same pattern in both functions, we really only use index, start_index is
redundant.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are only 2 WAIT_* values left for wait parameter, we can encode
this to the function name if the waiting functionality is split.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Last use was in the readahead code that got removed by f26c923860
("btrfs: remove reada infrastructure").
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Another conversion to folio API, use the folio locking directly instead
of back and forth page <-> folio conversions.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The check function pattern is supposed to return true/false, currently
there's only one error code.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use the folio API, remove page_folio/folio_page conversions.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Continue page to folio updates, sync what the function does with its
name.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Using the kmem cache freeing directly is clear enough, we don't need to
wrap it. All the users are in the same file.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The simple helper can be inlined, no need for the separate function.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
As extent_writepage() is an internal helper we should use our inode type,
so change it from struct inode.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The double underscore naming scheme does not apply here, there's only
get_extent_map(). As the definition is changed also pass the struct
btrfs_inode.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function set_page_extent_mapped() is now a simple wrapper so use the
folio helper.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Drop the leading underscores in '__unlock_for_delalloc()' and rename it
to 'unlock_delalloc_folio()'. This also ensures naming parity with
'lock_delalloc_folios()'.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use the existing define for single sector size.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The subvol_srcu was removed in c75e839414 ("btrfs: kill the
subvol_srcu") years ago.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 449813515d ("block, fs: Restore the per-bio/request data
lifetime fields") restored write-hint support in btrfs. But that is
applicable only for direct IO. This patch supports passing
write-hint for buffered IO from btrfs file system to block layer
by filling bi_write_hint of struct bio in alloc_new_bio().
There's an ongoing discussion which devices can use that,
https://lore.kernel.org/all/20240910150200.6589-6-joshi.k@samsung.com,
in SCSI there's support using sd_group_number() and
sd_setup_rw32_cmnd().
The hint goes from the application directly to the block device so it's
up to the application to set up everything properly to utilize the
different hint classes.
Link: https://lore.kernel.org/all/20240910150200.6589-6-joshi.k@samsung.com
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jing Xia <j.xia@samsung.com>
[ Add more context and use case. ]
Signed-off-by: David Sterba <dsterba@suse.com>
Print the read policy if set as module parameter (with
CONFIG_BTRFS_EXPERIMENTAL).
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For testing purposes allow configuring the read policy via module
parameter from the beginning. Available only with CONFIG_BTRFS_EXPERIMENTAL.
Examples:
- Set the RAID1 balancing method to round-robin with a custom
min_contig_read of 4k:
$ modprobe btrfs read_policy=round-robin:4096
- Set the round-robin balancing method with the default
min_contiguous_read:
$ modprobe btrfs read_policy=round-robin
- Set the "devid" balancing method, defaulting to the latest device:
$ modprobe btrfs read_policy=devid
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit c9c49e8f157e ("btrfs: split out CONFIG_BTRFS_EXPERIMENTAL from
CONFIG_BTRFS_DEBUG") introduced a way to enable or disable experimental
features. Print its status during module load, like:
Btrfs loaded, experimental=on, debug=on, assert=on, zoned=yes, fsverity=yes
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add read policy that will force all reads to go to the given device
(specified by devid) on the RAID1 profiles.
This will be used for testing, e.g. to read from a stale device. Users may
find other use cases.
Can be set in sysfs by writing a value in the format "devid:<devid>" to
the file /sys/fs/btrfs/FSID/read_policy
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add round-robin read policy that balances reads over available devices
(all RAID1 block group profiles). Switching to the next device is done
after a number of blocks is read, which is 256K by default and is
configurable in sysfs.
The format is "round-robin:<min-contig-read>" and can be set in file
/sys/fs/btrfs/FSID/read_policy
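One way to picture the switching rule is the sketch below. It is a simplified userspace model, not the kernel implementation: rr_pick_mirror() is a hypothetical name, the counter is assumed to be kept in bytes, and mirrors are assumed to be addressed by a plain index.

```c
#include <stdint.h>

/*
 * Pick a mirror index for the next read. The total amount read so far is
 * divided into windows of min_contig_read bytes, and each window is served
 * by the next mirror in turn.
 *
 * Callers must pass min_contig_read > 0 and num_mirrors > 0.
 */
static unsigned int rr_pick_mirror(uint64_t total_read_bytes,
				   uint64_t min_contig_read,
				   unsigned int num_mirrors)
{
	uint64_t cycle = total_read_bytes / min_contig_read;

	return (unsigned int)(cycle % num_mirrors);
}
```

With the default 256K window and two mirrors, reads land on mirror 0 until 256K has been read in total, then on mirror 1 for the next 256K, and so on.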
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Track number of read blocks in the whole filesystem. The counter is
initialized when devices are opened. The counter is increased at
btrfs_submit_dev_bio() if the stats tracking is enabled (depends on the
read policy). Stats tracking is disabled by default and is enabled
through fs_devices::collect_fs_stats when required.
The code is not under the EXPERIMENTAL define, as stats can be expanded
to include write counts and other performance counters, with the user
interface independent of its internal use.
This is an in-memory-only feature, not related to the dev error stats.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Enable specifying additional configuration values along the RAID1
balancing read policy in a single input string.
Update btrfs_read_policy_to_enum() to parse and handle a value
associated with the policy in the format "policy:value", the value part,
if present, is converted to a 64-bit integer.
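The "policy:value" parsing can be sketched in userspace C as follows. This is an illustration under assumptions, not the code of btrfs_read_policy_to_enum(): the function name, the enum, the buffer size and the use of -1 for "no value given" are all hypothetical, and only the two policy names mentioned in this series are recognized.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical enum covering only the policies named in this series. */
enum read_policy_model {
	POLICY_ROUND_ROBIN,
	POLICY_DEVID,
};

/*
 * Parse "policy" or "policy:value". On success return 0 and fill @policy
 * and @value (@value is -1 when no value part was given). Return -EINVAL
 * on malformed input.
 */
static int parse_read_policy(const char *input, enum read_policy_model *policy,
			     int64_t *value)
{
	char buf[64];
	char *sep;
	size_t len = strlen(input);

	if (len >= sizeof(buf))
		return -EINVAL;
	/* Work on a local copy so the string can be split in place. */
	memcpy(buf, input, len + 1);

	*value = -1;
	sep = strchr(buf, ':');
	if (sep) {
		char *end;
		long long v;

		*sep++ = '\0';
		if (*sep == '\0')
			return -EINVAL;
		errno = 0;
		v = strtoll(sep, &end, 10);
		if (errno || *end != '\0' || v < 0)
			return -EINVAL;
		*value = v;
	}

	if (strcmp(buf, "round-robin") == 0)
		*policy = POLICY_ROUND_ROBIN;
	else if (strcmp(buf, "devid") == 0)
		*policy = POLICY_DEVID;
	else
		return -EINVAL;
	return 0;
}
```

For example, "round-robin:4096" yields POLICY_ROUND_ROBIN with a value of 4096, while a bare "devid" yields POLICY_DEVID with the value left at -1. Trailing newlines, as they would arrive from a sysfs write, are not handled in this sketch.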
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduce btrfs_read_policy_to_enum() helper to simplify the conversion
of a string read policy to its corresponding enum value. This reduces
duplication and improves code clarity in btrfs_read_policy_store().
The parameter is copied locally to allow modification, enabling the
separation of the method and its value. This prepares for the addition of
more functionality in subsequent patches.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Refactor the logic in btrfs_read_policy_show() for easier extension with
more balancing methods. Streamline the space and bracket handling
around the active policy without altering the functional output. This
is in preparation to add more methods.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently, fs_devices->fs_info is initialized in btrfs_init_devices_late(),
but this occurs too late for find_live_mirror(), which is invoked by
load_super_root() much earlier than btrfs_init_devices_late().
Fix this by moving the initialization to open_ctree(), before load_super_root().
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The call to btrfs_mark_buffer_dirty() at btrfs_setxattr() is not
necessary as we have a path setup for writing with btrfs_search_slot()
having a 'cow' argument set to 1.
This just makes the code more verbose and confusing, adds a little extra
overhead, and increases the module's text size, so remove it.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have several places explicitly calling btrfs_mark_buffer_dirty() but
that is not necessary since the target leaf came from a path that was
obtained from a btree search function that modifies the btree, something
like btrfs_insert_empty_item() or anything else that ends up calling
btrfs_search_slot() with a value of 1 for its 'cow' argument.
These calls just make the code more verbose and confusing, add a little
extra overhead, and increase the module's text size, so remove them.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The call to btrfs_mark_buffer_dirty() at btrfs_uuid_tree_add() is not
necessary as we have a path setup for writing with btrfs_search_slot()
having a 'cow' argument set to 1 (through btrfs_insert_empty_item()).
This just makes the code more verbose and confusing, adds a little extra
overhead, and increases the module's text size, so remove it.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have several places explicitly calling btrfs_mark_buffer_dirty() but
that is not necessary since the target leaf came from a path that was
obtained from a btree search function that modifies the btree, something
like btrfs_insert_empty_item() or anything else that ends up calling
btrfs_search_slot() with a value of 1 for its 'cow' argument.
These calls just make the code more verbose and confusing, add a little
extra overhead, and increase the module's text size, so remove them.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have several places explicitly calling btrfs_mark_buffer_dirty() but
that is not necessary since the target leaf came from a path that was
obtained from a btree search function that modifies the btree, something
like btrfs_insert_empty_item() or anything else that ends up calling
btrfs_search_slot() with a value of 1 for its 'cow' argument.
These calls just make the code more verbose and confusing, add a little
extra overhead, and increase the module's text size, so remove them.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The call to btrfs_mark_buffer_dirty() at update_raid_extent_item() is not
necessary as we have a path setup for writing with btrfs_search_slot()
having a 'cow' argument set to 1.
This just makes the code more verbose and confusing, adds a little extra
overhead, and increases the module's text size, so remove it.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have several places explicitly calling btrfs_mark_buffer_dirty() but
that is not necessary since the target leaf came from a path that was
obtained from a btree search function that modifies the btree, something
like btrfs_insert_empty_item() or anything else that ends up calling
btrfs_search_slot() with a value of 1 for its 'cow' argument.
These calls just make the code more verbose and confusing, add a little
extra overhead, and increase the module's text size, so remove them.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The call to btrfs_mark_buffer_dirty() at btrfs_ioctl_default_subvol() is
not necessary as we have a path setup for writing with btrfs_search_slot()
having a 'cow' argument set to 1.
This just makes the code more verbose and confusing, adds a little extra
overhead, and increases the module's text size, so remove it.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have several places explicitly calling btrfs_mark_buffer_dirty() but
that is not necessary since the target leaf came from a path that was
obtained from a btree search function that modifies the btree, something
like btrfs_insert_empty_item() or anything else that ends up calling
btrfs_search_slot() with a value of 1 for its 'cow' argument.
These calls just make the code more verbose and confusing, add a little
extra overhead, and increase the module's text size, so remove them.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have several places explicitly calling btrfs_mark_buffer_dirty() but
that is not necessary since the target leaf came from a path that was
obtained from a btree search function that modifies the btree, something
like btrfs_insert_empty_item() or anything else that ends up calling
btrfs_search_slot() with a value of 1 for its 'cow' argument.
These calls just make the code more verbose and confusing, add a little
extra overhead, and increase the module's text size, so remove them.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have several places explicitly calling btrfs_mark_buffer_dirty() but
that is not necessary since the target leaf came from a path that was
obtained from a btree search function that modifies the btree, something
like btrfs_insert_empty_item() or anything else that ends up calling
btrfs_search_slot() with a value of 1 for its 'cow' argument.
These calls just make the code more verbose and confusing, add a little
extra overhead, and increase the module's text size, so remove them.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have several places explicitly calling btrfs_mark_buffer_dirty() but
that is not necessary since the target leaf came from a path that was
obtained from a btree search function that modifies the btree, something
like btrfs_insert_empty_item() or anything else that ends up calling
btrfs_search_slot() with a value of 1 for its 'cow' argument.
These calls just make the code more verbose and confusing, add a little
extra overhead, and increase the module's text size, so remove them.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have several places explicitly calling btrfs_mark_buffer_dirty() but
that is not necessary since the target leaf came from a path that was
obtained from a btree search function that modifies the btree, something
like btrfs_insert_empty_item() or anything else that ends up calling
btrfs_search_slot() with a value of 1 for its 'cow' argument.
These calls just make the code more verbose and confusing, add a little
extra overhead, and increase the module's text size, so remove them.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have several places explicitly calling btrfs_mark_buffer_dirty() but
that is not necessary since the target leaf came from a path that was
obtained from a btree search function that modifies the btree, something
like btrfs_insert_empty_item() or anything else that ends up calling
btrfs_search_slot() with a value of 1 for its 'cow' argument.
These calls just make the code more verbose and confusing, add a little
extra overhead, and increase the module's text size, so remove them.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The call to btrfs_mark_buffer_dirty() at btrfs_run_dev_replace() is not
necessary as we have a path setup for writing with btrfs_search_slot()
having a 'cow' argument set to 1.
This just makes the code more verbose and confusing, adds a little extra
overhead, and increases the module's text size, so remove it.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The call to btrfs_mark_buffer_dirty() at __btrfs_update_delayed_inode() is
not necessary as we have a path setup for writing with btrfs_search_slot()
having a 'cow' argument set to 1.
This just makes the code more verbose and confusing, adds a little extra
overhead, and increases the module's text size, so remove it.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have several places explicitly calling btrfs_mark_buffer_dirty() but
that is not necessary since the target leaf came from a path that was
obtained from a btree search function that modifies the btree, something
like btrfs_insert_empty_item() or anything else that ends up calling
btrfs_search_slot() with a value of 1 for its 'cow' argument.
These calls just make the code more verbose and confusing, add a little
extra overhead, and increase the module's text size, so remove them.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have several places explicitly calling btrfs_mark_buffer_dirty() but
that is not necessary since the target leaf came from a path that was
obtained from a btree search function that modifies the btree, something
like btrfs_insert_empty_item() or anything else that ends up calling
btrfs_search_slot() with a value of 1 for its 'cow' argument.
These calls just make the code more verbose and confusing, add a little
extra overhead, and increase the module's text size, so remove them.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have several places explicitly calling btrfs_mark_buffer_dirty() but
that is not necessary since the target leaf came from a path that was
obtained from a btree search function that modifies the btree, something
like btrfs_insert_empty_item() or anything else that ends up calling
btrfs_search_slot() with a value of 1 for its 'cow' argument.
These calls just make the code more verbose and confusing, add a little
extra overhead, and increase the module's text size, so remove them.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have several places explicitly calling btrfs_mark_buffer_dirty() but
that is not necessary since the target leaf came from a path that was
obtained from a btree search function that modifies the btree, something
like btrfs_insert_empty_item() or anything else that ends up calling
btrfs_search_slot() with a value of 1 for its 'cow' argument.
These calls just make the code more verbose and confusing, add a little
extra overhead, and increase the module's text size, so remove them.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
During renames we are grouping transaction aborts that can be due to a
failure of one of several function calls. While this makes the code less
verbose, it makes it harder to debug as we end up not knowing from which
function call we got an error.
So change this to trigger a transaction abort after each function call
failure, so that when we get a transaction abort message we know exactly
which function call failed, helping us to debug issues.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently btrfs_validate_super() only does a very basic check on the
array chunk size (not larger than the available space, but not so
small that it cannot contain a chunk).
The more comprehensive checks (the regular chunk checks and size check
inside the system chunk array) are all done inside btrfs_read_sys_array().
It's not a big deal, but it also means we do not do any validation on
the system chunk array at super block writeback time either.
Do the following modification to centralize the system chunk array
checks into btrfs_validate_super():
- Make chunk_err() helper accept stack chunk pointer
If the @leaf parameter is NULL, then the @chunk pointer will be a pointer
to the chunk item, rather than the offset inside the leaf.
And since @leaf can be NULL, add a new @fs_info parameter for that
case.
- Make btrfs_check_chunk_valid() handle stack chunk pointer
The same as chunk_err(), a new @fs_info parameter, and if @leaf is
NULL, then @chunk will be a pointer to a stack chunk.
If @leaf is NULL, then all needed btrfs_chunk members will be read
using the stack helper instead of the leaf helper.
This means we need to read out all the needed members at the beginning
of the function.
Furthermore, at super block read time, fs_info->sectorsize is not yet
initialized, we need one extra @sectorsize parameter to grab the
correct sectorsize.
- Introduce a helper validate_sys_chunk_array()
* Validate the disk key.
* Validate the size before we access the full chunk items.
* Do the full chunk item validation.
- Call validate_sys_chunk_array() at btrfs_validate_super()
- Simplify the checks inside btrfs_read_sys_array()
Now the checks will be converted to an ASSERT().
- Simplify the checks inside read_one_chunk()
Now that all chunk items inside system chunk array and chunk tree are
verified, there is no need to verify them again inside read_one_chunk().
This change has the following advantages:
- More comprehensive checks at write time
And unlike the sys_chunk_array read routine, this time we do not need
to allocate a dummy extent buffer to do the check.
All the checks done here require no new memory allocation.
- Slightly improved readability when iterating the system chunk array
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Update tree_insert() to use rb_find_add_cached().
Add cmp_refs_node() as the comparison function passed to
rb_find_add_cached().
Since we're here, also make comp_data_refs() and comp_refs() accept
both parameters as const.
Signed-off-by: Roger L. Beckermeyer III <beckerlee3@gmail.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Update btrfs_add_chunk_map() to use rb_find_add_cached().
Signed-off-by: Roger L. Beckermeyer III <beckerlee3@gmail.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Update __btrfs_add_delayed_item() to use rb_find_add_cached().
Signed-off-by: Roger L. Beckermeyer III <beckerlee3@gmail.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Update prelim_ref_insert() to use rb_find_add_cached().
There is a special change that the existing prelim_ref_compare() is
called with the first parameter as the existing ref in the rbtree.
But the newer rb_find_add_cached() expects the cmp() function to have
the first parameter as the to-be-added node, thus the new helper
prelim_ref_rb_add_cmp() needs to adapt to this new order.
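The argument-order adaptation can be shown with a small standalone model. This is a sketch only: struct ref_model and both function names are hypothetical stand-ins, and the real helpers operate on rb_node pointers rather than on the structure directly.

```c
/* Hypothetical stand-in for a ref stored in the tree. */
struct ref_model {
	int key;
};

/*
 * Existing-style comparator: the first parameter is the ref that is
 * already in the tree, the second is the ref being inserted.
 */
static int ref_compare_existing_first(const struct ref_model *existing,
				      const struct ref_model *new_ref)
{
	if (existing->key < new_ref->key)
		return -1;
	if (existing->key > new_ref->key)
		return 1;
	return 0;
}

/*
 * Adapter for an insertion helper that calls cmp(new, existing): swap the
 * arguments back so the original comparator sees them in the order it
 * expects, and return its result unchanged.
 */
static int ref_add_cmp(const struct ref_model *new_ref,
		       const struct ref_model *existing)
{
	return ref_compare_existing_first(existing, new_ref);
}
```

Because the adapter returns the original comparator's result unchanged, the left/right decisions made during insertion stay the same as before the conversion.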
Signed-off-by: Roger L. Beckermeyer III <beckerlee3@gmail.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Update fs/btrfs/block-group.c to use rb_find_add_cached().
Suggested-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Roger L. Beckermeyer III <beckerlee3@gmail.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The header clearly states that it does not want to be included directly,
only via linux/spinlock_types.h. Drop this as we can simply use the
spinlock.h which is already included.
Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
During free space tree rebuild, we're holding a transaction handle for
the whole rebuild process.
This can lead to a blocked task warning, e.g. the btrfs-transaction kthread
(which is already created before btrfs_start_pre_rw_mount()) can be
woken up to join and commit the current transaction.
But the free space tree rebuild process may need to go through thousands
of block groups, which will block the btrfs-transaction kthread for a long
time.
Fix the problem by calling btrfs_should_end_transaction() after each
block group, so that we won't hold the transaction handle too long.
And since the free-space-tree rebuild can be split into
multiple transactions, we need to consider the safety when the rebuild
process is interrupted.
Thankfully since we only set the FREE_SPACE_TREE compat_ro flag without
FREE_SPACE_TREE_VALID flag, even if the rebuild is interrupted, on the
next RW mount, we will still go rebuild the free space tree, by deleting
any items we have and re-starting the rebuild from scratch.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At btrfs_is_empty_uuid() we have our custom code to check if a uuid is
empty, however there is a kernel uuid library that has a function named
uuid_is_null() which does the same and is probably more efficient.
So change btrfs_is_empty_uuid() to use uuid_is_null(), which is almost
a direct replacement, it just wraps the necessary casting since our
uuid types are u8 arrays while the uuid kernel library uses the uuid_t
type, which is just a typedef of a struct wrapping a u8 array of 16
elements as well.
Also since the function is now too trivial, make it a static inline
function in fs.h.
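The shape of the resulting helper can be modeled in userspace as below. This is a sketch under assumptions: uuid_model_t and uuid_model_is_null() are hypothetical stand-ins for the kernel's uuid_t and uuid_is_null(), and is_empty_uuid() only mirrors the cast-and-forward pattern described above.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define UUID_MODEL_SIZE 16

/* Stand-in for the kernel's uuid_t: a struct wrapping 16 bytes. */
typedef struct {
	uint8_t b[UUID_MODEL_SIZE];
} uuid_model_t;

/* Stand-in for uuid_is_null(): true when all 16 bytes are zero. */
static bool uuid_model_is_null(const uuid_model_t *uuid)
{
	static const uuid_model_t null_uuid = { { 0 } };

	return memcmp(uuid, &null_uuid, sizeof(null_uuid)) == 0;
}

/*
 * The wrapper only performs the cast from a raw byte array to the
 * library's uuid type and forwards the call.
 */
static inline bool is_empty_uuid(const uint8_t *uuid)
{
	return uuid_model_is_null((const uuid_model_t *)uuid);
}
```

The cast is safe in this model because the struct contains only a byte array, so it has the same size and alignment requirements as the raw 16-byte buffer.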
Suggested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's pointless to have a comment above the prototype declarations of
btrfs_ctree_init() and btrfs_ctree_exit() mentioning that they are
declared in ctree.c. This is from the old days when ctree.h was used
to place anything that didn't fit in any other file. So remove it.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have 3 functions that have their prototypes declared in ctree.h but
they are defined at extent-tree.c and they are unrelated to the btree
data structure. Move the prototypes out of ctree.h and into extent-tree.h.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently btrfs_alloc_write_mask() is defined in ctree.h but it's not
related at all to the btree data structure, so move it into fs.h.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently BTRFS_BYTES_TO_BLKS() is defined in ctree.h but it's not related
at all to the btree data structure, so move it into fs.h.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The folio ordered helper macros are defined at ctree.h but this is not
the best place since ctree.{h,c} is all about the btree data structure
implementation and not a generic module. So move these macros into the
fs.h header.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's a generic helper not specific to ioctls and used in several places,
so move it out from ioctl.c and into fs.c. While at it change its return
type from int to bool and declare the loop variable in the loop itself.
This also slightly reduces the module's size.
Before this change:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename
1781492 161037 16920 1959449 1de619 fs/btrfs/btrfs.ko
After this change:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename
1781340 161037 16920 1959297 1de581 fs/btrfs/btrfs.ko
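A minimal userspace sketch of the two style changes mentioned above (bool
return type, loop variable declared in the loop itself). The helper name
buf_is_all_zero() is hypothetical and is not the function moved by this
commit:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true when every byte of the buffer is zero. */
static bool buf_is_all_zero(const uint8_t *buf, size_t len)
{
	/* The index only lives for the duration of the loop. */
	for (size_t i = 0; i < len; i++) {
		if (buf[i])
			return false;
	}
	return true;
}
```

Returning bool makes the yes/no contract explicit to callers, and the
loop-scoped index cannot be misused after the loop.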
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The declarations for the exclusive operation functions are located at fs.h
but their definitions are in ioctl.c, which doesn't make much sense since
(most of them) are used in several files other than ioctl.c. Since they
are used in several files and they are generic enough, move them out of
ioctl.c and into fs.c, even the ones that are currently only used at
ioctl.c, for the sake of having them all in the same C file.
This also reduces the module's size.
Before this change:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename
1782094 161045 16920 1960059 1de87b fs/btrfs/btrfs.ko
After this change:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename
1781492 161037 16920 1959449 1de619 fs/btrfs/btrfs.ko
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The ctree module is about the implementation of the btree data structure
and not a place holder for generic filesystem things like the csum
algorithm details. Move the functions related to the csum algorithm
details away from ctree.c and into fs.c, which is a far better place for
them. Also fix missing punctuation in comments and change one multiline
comment to a single line comment since everything fits in under 80
characters.
For some reason this also slightly reduces the module's size.
Before this change:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename
1782126 161045 16920 1960091 1de89b fs/btrfs/btrfs.ko
After this change:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename
1782094 161045 16920 1960059 1de87b fs/btrfs/btrfs.ko
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function abort_should_print_stack() is declared in transaction.h but
its definition is in ctree.c, which doesn't make sense since ctree.c is
the btree implementation and the function is related to the transaction
code. Move its definition into transaction.h as an inline function since
it's a very short and trivial function, and also add the 'btrfs_' prefix
into its name.
This change also reduces the module size.
Before this change:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename
1783148 161137 16920 1961205 1decf5 fs/btrfs/btrfs.ko
After this change:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename
1782126 161045 16920 1960091 1de89b fs/btrfs/btrfs.ko
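A sketch of what such a short inline helper might look like in a header.
Which error codes suppress the stack trace is an assumption made for
illustration (-EIO, -EROFS and -ENOMEM here); this message does not list
them:

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Sketch only: decide whether a transaction abort should print a stack
 * trace. The set of "expected" errors below is assumed, not confirmed.
 */
static inline bool btrfs_abort_should_print_stack(int error)
{
	switch (error) {
	case -EIO:
	case -EROFS:
	case -ENOMEM:
		return false;
	}
	return true;
}
```

Being a static inline in the header, each caller gets the few compare
instructions directly instead of a call into another object file.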
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we have the stripe tree decision saved in struct
btrfs_io_geometry we can pass it into is_single_device_io() and get rid of
another call to btrfs_need_raid_stripe_tree_update().
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Cache the decision if a particular I/O needs to update RAID stripe tree
entries in struct btrfs_io_context.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Cache the return of btrfs_need_stripe_tree_update() in struct
btrfs_io_geometry starting from btrfs_map_block().
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We should always call check_delayed_ref() with a path having a locked leaf
from the extent tree where either the extent item is located or where it
should be located in case it doesn't exist yet (when there's a pending
unflushed delayed ref to do it), as we need to lock any existing delayed
ref head while holding such leaf locked in order to avoid races with
flushing delayed references, which could make us think an extent is not
shared when it really is.
So add some assertions and a comment about such expectations to
btrfs_cross_ref_exist(), which is the only caller of check_delayed_ref().
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are some not immediately obvious details about the operation of
check_committed_ref(), namely that when it returns 0 it must return with
the path having a locked leaf from the extent tree that contains the
extent's extent item, so that we can later check for delayed refs when
calling check_delayed_ref() in a way that doesn't race with a task running
delayed references. For similar reasons, it must also return with a locked
leaf when the extent item is not found, and that leaf is where the extent
item should be located, because we may have delayed references that are
going to create the extent item. Also document that the function can
return false positives in order to not be too slow, and that the most
important thing is to not return false negatives.
So add a function comment to check_committed_ref().
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of passing a root and an objectid which matches an inode number,
pass the inode instead, since the root is always the root associated to
the inode and the objectid is the number of that inode.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of setting the value to return in a local variable 'ret' and then
jumping into a label named 'out' that does nothing but return that value,
simplify everything by getting rid of the label and directly returning a
value.
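The shape of this cleanup can be illustrated with a small userspace
sketch. Both functions are hypothetical and unrelated to the btrfs
function touched by this commit; they only show the before/after pattern:

```c
#include <errno.h>
#include <stddef.h>

/* Before: every exit funnels through a label that does nothing but return. */
static int lookup_with_label(const int *table, int n, int key)
{
	int ret;

	if (!table) {
		ret = -EINVAL;
		goto out;
	}
	for (int i = 0; i < n; i++) {
		if (table[i] == key) {
			ret = i;
			goto out;
		}
	}
	ret = -ENOENT;
out:
	return ret;
}

/* After: no label and no 'ret' variable, each path returns its value. */
static int lookup_direct(const int *table, int n, int key)
{
	if (!table)
		return -EINVAL;
	for (int i = 0; i < n; i++) {
		if (table[i] == key)
			return i;
	}
	return -ENOENT;
}
```

The label form is still worthwhile when there is common cleanup to run,
which is not the case when the label only returns.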
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At check_committed_ref() we are calling btrfs_get_extent_inline_ref_type()
twice, once before we check if we have an inline extent owner ref (for simple
qgroups) and then once again sometime after that check. This second call
is redundant when we have simple quotas disabled or we found an inline ref
that is not of the owner ref type. So avoid this second call unless we
have simple quotas enabled and found an owner ref, saving a function call
that does inline ref validation again.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At check_committed_ref() we have this check to see if the data extent was
created in a generation lower than or equal to the generation where the
last snapshot for the root was created, and if so we return immediately
with 1, since it's very likely the extent is shared, referenced by another
root.
The only call chain for check_committed_ref() is the following:
can_nocow_file_extent()
btrfs_cross_ref_exist()
check_committed_ref()
And we already do that snapshot check at can_nocow_file_extent(), before
we call btrfs_cross_ref_exist(). This makes the check done at
check_committed_ref() redundant, so remove it.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All callers of can_nocow_extent() now pass a value of false for its
'strict' argument, making it redundant. So remove the argument from
can_nocow_extent() as well as can_nocow_file_extent(),
btrfs_cross_ref_exist() and check_committed_ref(), because this
argument was used just to influence the behavior of check_committed_ref().
Also remove the 'strict' field from struct can_nocow_file_extent_args,
which is now always false as well, as its value is taken from the
argument to can_nocow_extent().
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Remove the variable length in btrfs_insert_one_raid_extent() as it is
unused.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is a recent ML report that mounting a large fs backed by hardware
RAID56 controller (with one device missing) took too much time, and
systemd seems to kill the mount attempt.
In that case, the only error message is:
BTRFS error (device sdj): open_ctree failed
The message gives no reason for the failure, making it very hard to
diagnose.
At least output the error number (in the particular case it should be
-EINTR) to provide some clue.
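A userspace sketch of the idea, appending the error number to the message.
The exact format string is an assumption (the message above only says the
error number is output), and format_open_ctree_failure() is a hypothetical
helper:

```c
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical formatter: same text as the old message, plus the
 * numeric error code so that e.g. -EINTR becomes visible in the log.
 */
static int format_open_ctree_failure(char *buf, size_t size,
				     const char *dev, int err)
{
	return snprintf(buf, size,
			"BTRFS error (device %s): open_ctree failed: %d",
			dev, err);
}
```

With the code in the log, an interrupted mount (for example one killed by
a service manager timeout) can be told apart from a corruption error.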
Link: https://lore.kernel.org/linux-btrfs/9b9c4d2810abcca2f9f76e32220ed9a90febb235.camel@scientia.org/
Reported-by: Christoph Anton Mitterer <calestyo@scientia.org>
Cc: stable@vger.kernel.org
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function btrfs_copy_from_user() handles the folio dirtying for
buffered write. The original design is to allow that function to handle
multiple folios, but since commit c87c299776 ("btrfs: make buffered
write to copy one page a time") there is no need to support multiple
folios.
So here open-code btrfs_copy_from_user() to
copy_folio_from_iter_atomic() and flush_dcache_folio() calls.
The short-copy check and revert are still kept as-is.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[WARNING]
There are several reports that the recently introduced qgroup
auto-removal triggers WARN_ON() for non-zero rfer/excl numbers, e.g.:
------------[ cut here ]------------
WARNING: CPU: 67 PID: 2882 at fs/btrfs/qgroup.c:1854 btrfs_remove_qgroup+0x3df/0x450
CPU: 67 UID: 0 PID: 2882 Comm: btrfs-cleaner Kdump: loaded Not tainted 6.11.6-300.fc41.x86_64 #1
RIP: 0010:btrfs_remove_qgroup+0x3df/0x450
Call Trace:
<TASK>
btrfs_qgroup_cleanup_dropped_subvolume+0x97/0xc0
btrfs_drop_snapshot+0x44e/0xa80
btrfs_clean_one_deleted_snapshot+0xc3/0x110
cleaner_kthread+0xd8/0x130
kthread+0xd2/0x100
ret_from_fork+0x34/0x50
ret_from_fork_asm+0x1a/0x30
</TASK>
---[ end trace 0000000000000000 ]---
BTRFS warning (device sda): to be deleted qgroup 0/319 has non-zero numbers, rfer 258478080 rfer_cmpr 258478080 excl 0 excl_cmpr 0
[CAUSE]
The root cause is still unclear: if the qgroup is consistent, a fully
dropped subvolume (with an extra transaction committed) should lead to
all-zero numbers for the qgroup.
My current guess is that the subvolume drop triggered the new subtree drop
threshold and thus marked the qgroup inconsistent, then a rescan cleared
it, but some corner case is not properly handled during subvolume dropping.
But at least for this particular case, since it's only the rfer/excl not
being properly reset to 0, and the qgroup is already marked inconsistent,
there is nothing for end users to worry about.
The user space tool utilizing qgroup would queue a rescan to handle
everything, so the kernel warning is a little overkill.
[ENHANCEMENT]
Enhance the warning inside btrfs_remove_qgroup() by:
- Only do WARN() if CONFIG_BTRFS_DEBUG is enabled
As explained, the kernel can handle inconsistent qgroups by simply doing a
rescan, so there is no need to bother the end users.
- Treat the reserved space leak the same as non-zero numbers
By outputting the values and triggering a WARN() if it's a debug build.
So far I haven't experienced any case related to reserved space so I
hope we will never need to bother with them.
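The two points above can be sketched in userspace roughly as follows.
SKETCH_DEBUG stands in for CONFIG_BTRFS_DEBUG, and the struct and function
names are hypothetical; this is not the kernel code:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-in for CONFIG_BTRFS_DEBUG; build with -DSKETCH_DEBUG=1 to enable. */
#ifndef SKETCH_DEBUG
#define SKETCH_DEBUG 0
#endif

/* Hypothetical subset of the counters checked before removing a qgroup. */
struct qgroup_sketch {
	uint64_t rfer;
	uint64_t excl;
	uint64_t reserved;
};

/*
 * Returns true when leftover numbers were found. The values are always
 * printed; the louder warning is only emitted in debug builds. Leaked
 * reserved space is treated the same way as non-zero rfer/excl.
 */
static bool report_leftover_numbers(const struct qgroup_sketch *qg)
{
	if (!qg->rfer && !qg->excl && !qg->reserved)
		return false;

	fprintf(stderr,
		"to be deleted qgroup has non-zero numbers, rfer %llu excl %llu reserved %llu\n",
		(unsigned long long)qg->rfer,
		(unsigned long long)qg->excl,
		(unsigned long long)qg->reserved);
	if (SKETCH_DEBUG)
		fprintf(stderr, "WARNING: leftover qgroup numbers (debug build)\n");
	return true;
}
```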
Fixes: 839d6ea4f8 ("btrfs: automatically remove the subvolume qgroup")
Link: https://github.com/kdave/btrfs-progs/issues/922
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We don't ever look at this list, remove it.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Before we were keeping all of our nodes on various lists in order to
make sure everything got cleaned up correctly. We used node->lowest to
indicate that node->lower was linked into the cache->leaves list. Now
that we do cleanup based on the rb-tree both the list and the flag are
useless, so delete them both.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We rely on finding all our nodes on the various lists in the backref
cache, when they are all also in the rbtree. Instead just search
through the rbtree and free everything.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we handle relocation for non-shareable roots without using the
backref cache, remove the ->cowonly field from the backref nodes and
update the handling to throw an error.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We already determine the owner for any blocks we find when we're
relocating, and for COW-only blocks (and the data reloc tree) we COW
down to the block and call it good enough. However we still build a
whole backref tree for them, even though we're not going to use it, and
then just don't put these blocks in the cache.
Rework the code to check if the block belongs to a COW-only root or the
data reloc root, and then just cow down to the block, skipping the
backref cache generation.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since we no longer maintain backref cache across transactions, and this
is only called when we're creating the reloc root for a newly created
snapshot in the transaction critical section, we will end up doing a
bunch of work that will just get thrown away when we start the
transaction in the relocation loop. Delete this code as it no longer
does anything for us.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have this setup as a loop, but in reality we will never walk back up
the backref tree, if we do then it's a bug. Get rid of the loop and
handle the case where we have node->new_bytenr set at all. Previous
check was only if node->new_bytenr != root->node->start, but if it did
then we would hit the WARN_ON() and walk back up the tree.
Instead we want to just return error if ->new_bytenr is set, and then do
the normal updating of the node for the reloc root and carry on.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add a comment for this field so we know what it is used for. Previously
we used it to update the backref cache, so people may mistakenly think
it is useless, but in fact it exists to make sure the backref cache makes
sense.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we're not updating the backref cache when we switch transids we
can remove the changed list.
We're going to keep the new_bytenr field because it serves as a good
sanity check for the backref cache and relocation, and can prevent us
from making extent tree corruption worse.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This BUG_ON is meant to catch backref cache problems, but these can
arise from either bugs in the backref cache or corruption in the extent
tree. Fix it to be a proper error.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A data race occurs when the function `insert_ordered_extent_file_extent()`
and the function `btrfs_inode_safe_disk_i_size_write()` are executed
concurrently. The function `insert_ordered_extent_file_extent()` reads
inode->disk_i_size without holding a lock, so it can race with
`btrfs_inode_safe_disk_i_size_write()` writing inode->disk_i_size, thus
affecting the value of `modify_tree`.
The specific call stack that appears during testing is as follows:
============DATA_RACE============
btrfs_drop_extents+0x89a/0xa060 [btrfs]
insert_reserved_file_extent+0xb54/0x2960 [btrfs]
insert_ordered_extent_file_extent+0xff5/0x1760 [btrfs]
btrfs_finish_one_ordered+0x1b85/0x36a0 [btrfs]
btrfs_finish_ordered_io+0x37/0x60 [btrfs]
finish_ordered_fn+0x3e/0x50 [btrfs]
btrfs_work_helper+0x9c9/0x27a0 [btrfs]
process_scheduled_works+0x716/0xf10
worker_thread+0xb6a/0x1190
kthread+0x292/0x330
ret_from_fork+0x4d/0x80
ret_from_fork_asm+0x1a/0x30
============OTHER_INFO============
btrfs_inode_safe_disk_i_size_write+0x4ec/0x600 [btrfs]
btrfs_finish_one_ordered+0x24c7/0x36a0 [btrfs]
btrfs_finish_ordered_io+0x37/0x60 [btrfs]
finish_ordered_fn+0x3e/0x50 [btrfs]
btrfs_work_helper+0x9c9/0x27a0 [btrfs]
process_scheduled_works+0x716/0xf10
worker_thread+0xb6a/0x1190
kthread+0x292/0x330
ret_from_fork+0x4d/0x80
ret_from_fork_asm+0x1a/0x30
=================================
The main purpose of the check of the inode's disk_i_size is to avoid
taking write locks on a btree path when we have a write at or beyond
EOF, since in these cases we don't expect to find extent items in the
root to drop. However if we end up taking write locks due to a data
race on disk_i_size, everything is still correct, we only add extra
lock contention on the tree in case there's concurrency from other tasks.
If the race causes us to not take write locks when we actually need them,
then everything is functionally correct as well, since if we find out we
have extent items to drop and we took read locks (modify_tree set to 0),
we release the path and retry again with write locks.
Since this data race does not affect the correctness of the function,
it is harmless, so use data_race() when checking inode->disk_i_size.
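A userspace sketch of the decision described above. In the kernel,
data_race() tells KCSAN that an unlocked access is intentional; here it is
defined as a plain pass-through so the sketch compiles. The struct, the
function name and the replace_extent parameter are assumptions for
illustration:

```c
#include <stdbool.h>
#include <stdint.h>

/* Userspace stand-in: the kernel macro marks an intentional race for KCSAN. */
#define data_race(expr) (expr)

struct inode_sketch {
	uint64_t disk_i_size;
};

/*
 * Decide whether to start with write locks on the btree path. A write
 * at or beyond the on-disk EOF is not expected to find extent items to
 * drop, so read locks are tried first. A stale disk_i_size only costs
 * extra lock contention or one retry with write locks.
 */
static bool needs_write_locks(const struct inode_sketch *inode,
			      uint64_t start, bool replace_extent)
{
	if (start >= data_race(inode->disk_i_size) && !replace_extent)
		return false;
	return true;
}
```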
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Hao-ran Zheng <zhenghaoran154@gmail.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_drop_extents() calls BUG_ON() in case the count of extents to be
deleted is greater than 0. But all of these code paths can handle errors,
so there's no need to crash the kernel. Instead WARN() that the condition
has been met and gracefully bail out.
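The pattern can be sketched in userspace as below. The function name and
the -EINVAL return value are assumptions; the message only states that the
code warns and bails out:

```c
#include <errno.h>
#include <stdio.h>

/*
 * Hypothetical check: instead of crashing on an unexpected leftover
 * count, report it and hand an error back to the caller.
 */
static int check_pending_deletes(int del_nr)
{
	if (del_nr > 0) {
		fprintf(stderr,
			"WARNING: %d extent items still pending deletion\n",
			del_nr);
		return -EINVAL;
	}
	return 0;
}
```

The caller can then abort the operation and keep the system running,
which is the point of replacing the BUG_ON().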
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In zoned mode, a region that was used and then freed is not reusable right
away. The underlying zone needs to be reset before it can be reused. Btrfs
resets a zone when it removes a block group, and then a new block group is
allocated on the zones to reuse them. But this is sometimes too late to keep
up with the write side.
This commit introduces a new space-info reclaim method ZONE_RESET. That will
pick a block group from the unused list and reset its zone to reuse the
zone_unusable space. It is faster than removing the block group and re-creating
a new block group on the same zones.
For the first implementation, ZONE_RESET is only applied to a block group
whose region is fully zone_unusable. Reclaiming partially zone_unusable block
groups could be implemented later.
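The eligibility rule for this first implementation can be sketched as a
simple predicate. The struct and its field names are hypothetical
stand-ins, not the kernel's block group structure:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical accounting for one block group on a zoned device. */
struct block_group_sketch {
	uint64_t length;	/* total size of the block group */
	uint64_t used;		/* bytes currently in use */
	uint64_t zone_unusable;	/* freed bytes that need a zone reset */
};

/* Only an unused block group whose whole region is zone_unusable qualifies. */
static bool can_zone_reset(const struct block_group_sketch *bg)
{
	return bg->used == 0 && bg->zone_unusable == bg->length;
}
```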
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since commit e1e577aafe41 ("btrfs: store fs_info in space_info"), we have
the fs_info in a space_info. So, we can drop fs_info argument from
btrfs_update_space_info_*. There is no behavior change.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Factor out a part of unpin_extent_range() that returns space back to the
space info, prioritizing global block reserve. Also, move the "len"
variable into the loop to clarify we don't need to carry it beyond an
iteration.
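A userspace sketch of what "prioritizing global block reserve" could mean
for such a helper: top up the global reserve first and pass on only the
remainder. The struct and function names are hypothetical:

```c
#include <stdint.h>

/* Hypothetical reserve: 'reserved' should be refilled up to 'size'. */
struct block_rsv_sketch {
	uint64_t size;
	uint64_t reserved;
};

/*
 * Give 'len' freed bytes back. The global reserve is refilled first;
 * the return value is what is left for the rest of the space info.
 */
static uint64_t return_free_space_sketch(struct block_rsv_sketch *global_rsv,
					 uint64_t len)
{
	uint64_t to_add = 0;

	if (global_rsv->reserved < global_rsv->size) {
		to_add = global_rsv->size - global_rsv->reserved;
		if (to_add > len)
			to_add = len;
		global_rsv->reserved += to_add;
	}
	return len - to_add;
}
```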
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 146054090b ("btrfs: initial fsverity support") introduced
fs-verity support for btrfs, but didn't add support for
FS_IOC_READ_VERITY_METADATA to directly query the Merkle tree,
descriptor and signature blocks for fs-verity enabled files.
Add the (trivial) implementation: we just need to wire it through to the
fs-verity code, the same way as is done in the other two filesystems
which support this ioctl (ext4, f2fs). The fs-verity code already has
access to the required data.
This is also safe to backport to older stable trees (5.15+) if needed.
Signed-off-by: Allison Karlitskaya <allison.karlitskaya@redhat.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The variable ret is being initialized to zero and also later re-assigned
to zero. In both cases the assignment is redundant since the value is
never read after the assignment and hence they can be removed.
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function btrfs_get_extent() will only return an error pointer
(ERR_PTR()) or a valid extent map pointer. It will not return NULL.
Thus the usage of PTR_ERR_OR_ZERO() inside submit_one_sector() is not
needed, use plain PTR_ERR() instead, and that is the only usage of
PTR_ERR_OR_ZERO() after btrfs_get_extent().
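For readers unfamiliar with the error pointer helpers, here is a userspace
re-implementation that mirrors the kernel's <linux/err.h> semantics. The
extent map struct and the two *_sketch functions are hypothetical; they
only show why plain PTR_ERR() is enough once IS_ERR() is known to be true:

```c
#include <errno.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Encode a negative errno in the top page of the address space. */
static inline void *ERR_PTR(long error) { return (void *)error; }
static inline long PTR_ERR(const void *ptr) { return (long)ptr; }
static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}
/* Only useful when the pointer may also be a valid, non-error value. */
static inline int PTR_ERR_OR_ZERO(const void *ptr)
{
	return IS_ERR(ptr) ? (int)PTR_ERR(ptr) : 0;
}

struct extent_map_sketch { int dummy; };

/* Returns either a valid pointer or an error pointer, never NULL. */
static struct extent_map_sketch *get_extent_sketch(int fail)
{
	static struct extent_map_sketch em;

	return fail ? ERR_PTR(-ENOMEM) : &em;
}

static int submit_sketch(int fail)
{
	struct extent_map_sketch *em = get_extent_sketch(fail);

	/* Inside this branch the pointer is an error, so PTR_ERR() suffices. */
	if (IS_ERR(em))
		return (int)PTR_ERR(em);
	return 0;
}
```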
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The recent fix for a stupid mistake I made uncovered the fact that we
don't have adequate testing in the delayed refs code, as it took a
pretty extensive and long running stress test to uncover something that
a unit test would have uncovered right away.
Fix this by adding a delayed refs self test suite. This will validate
that the btrfs_ref transformation does the correct thing, that we do the
correct thing when merging delayed refs, and that we get the delayed
refs in the order that we expect. These are all crucial to how the
delayed refs operate.
I introduced various bugs (including the original bug) into the delayed
refs code to validate that these tests caught all of the shenanigans
that I could think of.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This helper is how we select the delayed ref to run once we've selected
the delayed ref head. I need this exported to add a unit test for
delayed refs, and its more natural home is in delayed-ref.c. Rename it
to btrfs_select_delayed_ref and move it into delayed-ref.c.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmd/7dYACgkQxWXV+ddt
WDuX7Q//UkrNtVh7UEiyNyujLjjvczfMXhpD1fAdVU0zMon6ux3RQ3JSs3xvAGrb
jFFa9c9+Db8/kWzdWp5n1u9Q/+sy4XBaeKGuzPRLPPGT1yXfKEa4mrm1sCrWRJoS
c8b07Kfuepldcim80x8WSa2qhr5gmDmSZBgvjKt63ppp5/jaNKCZg+d3BhwqhHbI
XA9JjIk9j0ZsAYauYflQTwgUpkyvXV1a9YyeKv4U6mYA1r+rXl2aolcndNkS1U/D
dDGuiDpOjKtIUecRi4YbOkt2zvwREDdQCbRV/QLsZajHxqeHV5QH0TBI/URikx2z
1shwYMzLfLtQIW0+PhHCGKiftMIb4NliyMUxxviPdN78nCFmocrR/ZkPx+a5M9Io
d7oqwS/8U3pFGeB4bAey8WvMzQI5BtCCYJY+3HreNTDkiubqcRtTCtJ9dNDTAMFH
FMZ6DA8wTsqSA2e9Q8OwKNjvMCLAKevXn/4wiJi5b75Fiu5ZB/imTfJ+geEMUZCR
3uq9oybFCKti7lestM0z06K19AKtmPWLoq5YJ1Hg69DsafS2aR3CBeYOi7uQ+56D
7uwAFjVrGPrxOgGkCohYpPLCUikJ0y3Nl/k5fnybsnLPWr0cenLroUeP7Rao4fFU
8hLzMSv3ImL+Io0RjH0XBAM8YLN+xO3CLYCv6D8d42AlQTgAIVw=
=QYC1
-----END PGP SIGNATURE-----
Merge tag 'for-6.13-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few more fixes.
Besides the one-liners in Btrfs there's a fix to the io_uring and
encoded read integration (added in this development cycle). The update
to io_uring provides more space for the ongoing command that is then
used in Btrfs to handle some cases.
- io_uring and encoded read:
- provide stable storage for io_uring command data
- make a copy of encoded read ioctl call, reuse that in case the
call would block and will be called again
- properly initialize zlib context for hardware compression on s390
- fix max extent size calculation on filesystems with non-zoned
devices
- fix crash in scrub on crafted image due to invalid extent tree"
* tag 'for-6.13-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: zlib: fix avail_in bytes for s390 zlib HW compression path
btrfs: zoned: calculate max_extent_size properly on non-zoned setup
btrfs: avoid NULL pointer dereference if no valid extent tree
btrfs: don't read from userspace twice in btrfs_uring_encoded_read()
io_uring: add io_uring_cmd_get_async_data helper
io_uring/cmd: add per-op data to struct io_uring_cmd_data
io_uring/cmd: rename struct uring_cache to io_uring_cmd_data
Since commit 559218d43e ("block: pre-calculate max_zone_append_sectors"),
queue_limits's max_zone_append_sectors defaults to 0 and it is only
updated when there is a zoned device. So, we have
lim->max_zone_append_sectors = 0 when there is no zoned device in the
filesystem.
That leads to fs_info->max_zone_append_size and thus
fs_info->max_extent_size to be 0, which is wrong and can for example
lead to a divide by zero in count_max_extents().
Fix this by only capping fs_info->max_extent_size to
fs_info->max_zone_append_size when it is non-zero.
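A userspace sketch of the fix and of the failure it avoids. The 128 MiB
default and the rounded-up division used for counting extents are
assumptions for illustration, as are all the *_sketch names:

```c
#include <stdint.h>

/* Assumed default upper bound for a single extent (128 MiB). */
#define SKETCH_MAX_EXTENT_SIZE (128ULL * 1024 * 1024)

/* Cap by the zone append limit only when that limit is actually set. */
static uint64_t calc_max_extent_size_sketch(uint64_t max_zone_append_size)
{
	uint64_t max_extent_size = SKETCH_MAX_EXTENT_SIZE;

	if (max_zone_append_size && max_zone_append_size < max_extent_size)
		max_extent_size = max_zone_append_size;
	return max_extent_size;
}

/* A zero max_extent_size here would be a division by zero. */
static uint64_t count_max_extents_sketch(uint64_t size,
					 uint64_t max_extent_size)
{
	return (size + max_extent_size - 1) / max_extent_size;
}
```

With an unconditional min(), a limit of 0 on a non-zoned filesystem would
propagate into max_extent_size and then into the divisor.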
Based on a patch from Naohiro Aota <naohiro.aota@wdc.com>, from which
much of this commit message is stolen as well.
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Fixes: 559218d43e ("block: pre-calculate max_zone_append_sectors")
Tested-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
Syzbot reported a crash with the following call trace:
BTRFS info (device loop0): scrub: started on devid 1
BUG: kernel NULL pointer dereference, address: 0000000000000208
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 106e70067 P4D 106e70067 PUD 107143067 PMD 0
Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 1 UID: 0 PID: 689 Comm: repro Kdump: loaded Tainted: G O 6.13.0-rc4-custom+ #206
Tainted: [O]=OOT_MODULE
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022
RIP: 0010:find_first_extent_item+0x26/0x1f0 [btrfs]
Call Trace:
<TASK>
scrub_find_fill_first_stripe+0x13d/0x3b0 [btrfs]
scrub_simple_mirror+0x175/0x260 [btrfs]
scrub_stripe+0x5d4/0x6c0 [btrfs]
scrub_chunk+0xbb/0x170 [btrfs]
scrub_enumerate_chunks+0x2f4/0x5f0 [btrfs]
btrfs_scrub_dev+0x240/0x600 [btrfs]
btrfs_ioctl+0x1dc8/0x2fa0 [btrfs]
? do_sys_openat2+0xa5/0xf0
__x64_sys_ioctl+0x97/0xc0
do_syscall_64+0x4f/0x120
entry_SYSCALL_64_after_hwframe+0x76/0x7e
</TASK>
[CAUSE]
The reproducer is using a corrupted image where the extent tree root is
corrupted, thus forcing the use of the "rescue=all,ro" mount option to mount
the image.
Then it triggered a scrub, but since scrub relies on the extent tree to find
where the data/metadata extents are, scrub_find_fill_first_stripe()
relies on a non-empty extent root.
But unfortunately scrub_find_fill_first_stripe() doesn't really expect
a NULL pointer for the extent root; it uses extent_root to grab fs_info and
triggers a NULL pointer dereference.
[FIX]
Add an extra check for a valid extent root at the beginning of
scrub_find_fill_first_stripe().
The new error path is introduced by 42437a6386 ("btrfs: introduce
mount option rescue=ignorebadroots"), but that's pretty old, and later
commit b979547513 ("btrfs: scrub: introduce helper to find and fill
sector info for a scrub_stripe") changed how we do scrub.
So for kernels older than 6.6, the fix will need manual backport.
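A userspace sketch of the kind of early check described in [FIX]. The
structs and function name are hypothetical, and returning -EUCLEAN is an
assumption; the message does not state which error code is used:

```c
#include <errno.h>
#include <stddef.h>

#ifndef EUCLEAN
#define EUCLEAN 117	/* Linux value, for platforms that lack it */
#endif

struct fs_info_sketch { int id; };
struct root_sketch { struct fs_info_sketch *fs_info; };

/*
 * Validate the extent root before the first dereference. With
 * "rescue=all,ro" the root lookup can legitimately yield NULL.
 */
static int find_first_stripe_sketch(const struct root_sketch *extent_root)
{
	const struct fs_info_sketch *fs_info;

	if (!extent_root)
		return -EUCLEAN;

	fs_info = extent_root->fs_info;	/* safe only after the check */
	(void)fs_info;
	return 0;
}
```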
Reported-by: syzbot+339e9dbe3a2ca419b85d@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/67756935.050a0220.25abdd.0a12.GAE@google.com/
Fixes: 42437a6386 ("btrfs: introduce mount option rescue=ignorebadroots")
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we return -EAGAIN the first time because we need to block,
btrfs_uring_encoded_read() will get called twice. Take a copy of args,
the iovs, and the iter the first time, as by the time we are called the
second time these may have gone out of scope.
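The idea can be sketched in userspace as follows: the first call copies
the caller's arguments into storage that outlives the call, and the retry
uses only that copy. All names are hypothetical and the blocking condition
is simulated with a flag:

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

struct read_args_sketch {
	uint64_t offset;
	uint64_t len;
};

/* Hypothetical per-command storage that stays valid across both calls. */
struct cmd_storage_sketch {
	bool have_copy;
	bool blocked_once;
	struct read_args_sketch args;
};

static int64_t encoded_read_sketch(struct cmd_storage_sketch *st,
				   const struct read_args_sketch *caller_args)
{
	/* Snapshot the arguments on the first call only. */
	if (!st->have_copy) {
		st->args = *caller_args;
		st->have_copy = true;
	}
	/* Simulate "would block": the first attempt asks to be retried. */
	if (!st->blocked_once) {
		st->blocked_once = true;
		return -EAGAIN;
	}
	/* The retry never looks at caller_args again. */
	return (int64_t)st->args.len;
}
```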
Reported-by: Jens Axboe <axboe@kernel.dk>
Fixes: 34310c442e ("btrfs: add io_uring command for encoded reads (ENCODED_READ ioctl)")
Signed-off-by: Mark Harmstone <maharmstone@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmdw8AgACgkQxWXV+ddt
WDsL4w/+Ib5WGmd2Rjn1+1X9U5dzrEb+/072UBAhwwaqOOUTlBofeyRSdYqFB0oZ
aucRMXdXPpVe1xrXsj0WsOZmPsuZT46Eh2ALqqZP5fO1sgBkJ2WmQF0Ei7uypfb+
abQwiEO2IaMMwt2XgDNzbpZS7oVNGEXHzoHF0R/deL4FoBDNMsbCfRnW+L9++tWU
dUSpafLhgMMwivJN07VJYwU4ZVXsBhmKv2qI8WpJ5w9kJb1ssN692CvBOVjhuSYd
A8IMV84dW2KO37fmPqN36QAWotz4mKpv8yrhjJvrix7nAOcXe3TXFUhaFBh1Vmzg
G5bhkqYcNP6UHT7CIcLZE1mdv6ZAKTp0zSNCh2Uu51+MJL2tIQVjTaUQhbkYLnLN
9DS2dXz4ksm9ISrjr2tmPe4kgyNQIrp5TCdwXu3CYs+AaU7yKeEBukZ7mXcp/e/W
TdLKvzPRLMED8mGlFBwg2QbOvcJJ663UW2esyv6DvC61F3tXyiV2RXSC/1qF+RyZ
FBJvvEevensQlASn1NScuQV+iEQpMo2lMURnRjSG8dGhwMmHpW3wifa2TJDyBzWS
AH0MriQA9nsYQTkPGPnqr46/BAhFG2vEfVlX20Sk9S0PTBLu8YRy/o2evcV67J8v
zGaa5pa7fQPbEjRv4Rthdb4R2VIFkZTOtIZSZfjHkPDjtvS7ahU=
=NwGH
-----END PGP SIGNATURE-----
Merge tag 'for-6.13-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few more fixes that accumulated over the last two weeks, fixing some
user reported problems:
- swapfile fixes:
- conditional reschedule in the activation loop
- fix race with memory mapped file when activating
- make activation loop interruptible
- rework and fix extent sharing checks
- folio fixes:
- in send, recheck folio mapping after unlock
- in relocation, recheck folio mapping after unlock
- fix waiting for encoded read io_uring requests
- fix transaction atomicity when enabling simple quotas
- move COW block trace point before the block gets freed
- print various sizes in sysfs with correct endianness"
* tag 'for-6.13-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: sysfs: fix direct super block member reads
btrfs: fix transaction atomicity bug when enabling simple quotas
btrfs: avoid monopolizing a core when activating a swap file
btrfs: allow swap activation to be interruptible
btrfs: fix swap file activation failure due to extents that used to be shared
btrfs: fix race with memory mapped writes when activating swap file
btrfs: check folio mapping after unlock in put_file_data()
btrfs: check folio mapping after unlock in relocate_one_folio()
btrfs: fix use-after-free when COWing tree bock and tracing is enabled
btrfs: fix use-after-free waiting for encoded read endios
The following sysfs entries read super block members directly. Those
members are stored in on-disk (little-endian) byte order, so on
big-endian hosts the reported values are wrong:
- /sys/fs/btrfs/<uuid>/nodesize
- /sys/fs/btrfs/<uuid>/sectorsize
- /sys/fs/btrfs/<uuid>/clone_alignment
Thankfully those values (nodesize and sectorsize) are always aligned
inside the btrfs_super_block, so this won't trigger unaligned read errors,
just endianness problems.
Fix them by using the native cached members instead.
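The endianness problem can be illustrated in user space. This is only a
sketch, not the kernel code: the 16 KiB value and the variable names are
assumptions chosen for the example.

```python
import struct

NODESIZE = 16384  # example value only (16 KiB)

# On-disk (little-endian) encoding of a 32-bit field.
raw = struct.pack("<I", NODESIZE)

# Interpreting the same bytes in big-endian host order gives a wrong value.
misread = struct.unpack(">I", raw)[0]

# Converting explicitly from little-endian is correct on any host.
converted = struct.unpack("<I", raw)[0]

print(misread, converted)
```

Reading a value that was already converted once (the "native cached
member") avoids having to repeat the conversion at every read site.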
Fixes: df93589a17 ("btrfs: export more from FS_INFO to sysfs")
CC: stable@vger.kernel.org
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Set squota incompat bit before committing the transaction that enables
the feature.
With the config CONFIG_BTRFS_ASSERT enabled, an assertion
failure occurs regarding the simple quota feature.
[5.596534] assertion failed: btrfs_fs_incompat(fs_info, SIMPLE_QUOTA), in fs/btrfs/qgroup.c:365
[5.597098] ------------[ cut here ]------------
[5.597371] kernel BUG at fs/btrfs/qgroup.c:365!
[5.597946] CPU: 1 UID: 0 PID: 268 Comm: mount Not tainted 6.13.0-rc2-00031-gf92f4749861b #146
[5.598450] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[5.599008] RIP: 0010:btrfs_read_qgroup_config+0x74d/0x7a0
[5.604303] <TASK>
[5.605230] ? btrfs_read_qgroup_config+0x74d/0x7a0
[5.605538] ? exc_invalid_op+0x56/0x70
[5.605775] ? btrfs_read_qgroup_config+0x74d/0x7a0
[5.606066] ? asm_exc_invalid_op+0x1f/0x30
[5.606441] ? btrfs_read_qgroup_config+0x74d/0x7a0
[5.606741] ? btrfs_read_qgroup_config+0x74d/0x7a0
[5.607038] ? try_to_wake_up+0x317/0x760
[5.607286] open_ctree+0xd9c/0x1710
[5.607509] btrfs_get_tree+0x58a/0x7e0
[5.608002] vfs_get_tree+0x2e/0x100
[5.608224] fc_mount+0x16/0x60
[5.608420] btrfs_get_tree+0x2f8/0x7e0
[5.608897] vfs_get_tree+0x2e/0x100
[5.609121] path_mount+0x4c8/0xbc0
[5.609538] __x64_sys_mount+0x10d/0x150
The issue can be easily reproduced using the following reproducer:
root@q:linux# cat repro.sh
set -e
mkfs.btrfs -q -f /dev/sdb
mount /dev/sdb /mnt/btrfs
btrfs quota enable -s /mnt/btrfs
umount /mnt/btrfs
mount /dev/sdb /mnt/btrfs
The issue is that when enabling quotas, at btrfs_quota_enable(), we set
BTRFS_QGROUP_STATUS_FLAG_SIMPLE_MODE at fs_info->qgroup_flags and persist
it in the quota root in the item with the key BTRFS_QGROUP_STATUS_KEY, but
we only set the incompat bit BTRFS_FEATURE_INCOMPAT_SIMPLE_QUOTA after we
commit the transaction used to enable simple quotas.
This means that if after that transaction commit we unmount the filesystem
without starting and committing any other transaction, or we have a power
failure, the next time we mount the filesystem we will find the flag
BTRFS_QGROUP_STATUS_FLAG_SIMPLE_MODE set in the item with the key
BTRFS_QGROUP_STATUS_KEY but we will not find the incompat bit
BTRFS_FEATURE_INCOMPAT_SIMPLE_QUOTA set in the superblock, triggering an
assertion failure at:
btrfs_read_qgroup_config() -> qgroup_read_enable_gen()
To fix this issue, set the BTRFS_FEATURE_INCOMPAT_SIMPLE_QUOTA flag
immediately after setting the BTRFS_QGROUP_STATUS_FLAG_SIMPLE_MODE.
This ensures that both flags are flushed to disk within the same
transaction.
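The ordering problem can be modelled with a toy in-memory "filesystem".
This is a sketch under loud assumptions: ToyFs, its two boolean fields and
the helper names are hypothetical and only mimic "staged state reaches
disk on commit"; nothing here is the btrfs implementation.

```python
class ToyFs:
    """Changes are staged in memory and only reach 'disk' on commit()."""

    def __init__(self):
        self.staged = {"status_simple_mode": False, "incompat_simple_quota": False}
        self.disk = dict(self.staged)

    def commit(self):
        self.disk = dict(self.staged)


def enable_squota_old(fs):
    fs.staged["status_simple_mode"] = True
    fs.commit()
    # Set only after the commit: lost if nothing else commits before
    # an unmount or a power failure.
    fs.staged["incompat_simple_quota"] = True


def enable_squota_fixed(fs):
    # Both flags are staged before the commit, so they persist together.
    fs.staged["status_simple_mode"] = True
    fs.staged["incompat_simple_quota"] = True
    fs.commit()


def disk_state_is_consistent(disk):
    # Mirrors the failed assertion: simple mode in the status item
    # implies the incompat bit.
    return (not disk["status_simple_mode"]) or disk["incompat_simple_quota"]


old_fs, fixed_fs = ToyFs(), ToyFs()
enable_squota_old(old_fs)
enable_squota_fixed(fixed_fs)
print(disk_state_is_consistent(old_fs.disk), disk_state_is_consistent(fixed_fs.disk))
```

In the old ordering the "disk" copy carries the status flag without the
incompat bit, which is the state the next mount trips over.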
Fixes: 182940f4f4 ("btrfs: qgroup: add new quota mode for simple quotas")
CC: stable@vger.kernel.org # 6.6+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Julian Sun <sunjunchao2870@gmail.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
During swap activation we iterate over the extents of a file and we can
have many thousands of them, so we can end up in a busy loop monopolizing
a core. Avoid this by doing a voluntary reschedule after processing each
extent.
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
During swap activation we iterate over the extents of a file, then do
several checks for each extent, some of which may take a significant
amount of time, such as checking if an extent is shared. Since a file can have
many thousands of extents, this can be a very slow operation and it's
currently not interruptible. I had a bug during development of a previous
patch that resulted in an infinite loop when iterating the extents, so
a core was busy looping and I couldn't cancel the operation, which is very
annoying and requires a reboot. So make the loop interruptible by checking
for fatal signals at the end of each iteration and stopping immediately if
there is one.
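A user-space sketch of the loop shape described above, assuming
hypothetical callbacks check_extent() and fatal_signal_pending() that
stand in for the per-extent checks and the signal test. It only shows
where the check sits in the loop, not the real activation code.

```python
import errno


def activate_swap(extents, check_extent, fatal_signal_pending):
    """Toy loop: returns (ret, number of extents processed)."""
    processed = 0
    for extent in extents:
        ret = check_extent(extent)
        if ret < 0:
            return ret, processed
        processed += 1
        # End of iteration: stop immediately if the task was killed.
        if fatal_signal_pending():
            return -errno.EINTR, processed
    return 0, processed


# Demo: a "signal" becomes pending once the third extent has been checked.
state = {"checked": 0}


def check_extent(extent):
    state["checked"] += 1
    return 0


def fatal_signal_pending():
    return state["checked"] >= 3


ret, processed = activate_swap(range(10_000), check_extent, fatal_signal_pending)
print(ret, processed)
```

Without the check, the demo loop would run through all 10000 entries
before returning.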
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When activating a swap file, to determine if an extent is shared we use
can_nocow_extent(), which ends up at btrfs_cross_ref_exist(). That helper
is meant to be quick because it's used in the NOCOW write path, when
flushing delalloc and when doing a direct IO write, however it does return
some false positives, meaning it may indicate that an extent is shared
even if it's no longer the case. For the write path this is fine, we just
do an unnecessary COW operation instead of doing a more rigorous check
which would be too heavy (calling btrfs_is_data_extent_shared()).
However when activating a swap file, the false positives simply result
in a failure, which is confusing for users/applications. One particular
case where this happens is when a data extent only has 1 reference but
that reference is not inlined in the extent item located in the extent
tree - this happens when we create more than 33 references for an extent
and then delete those 33 references plus every other non-inline reference
except one. The function check_committed_ref() assumes that if the size
of an extent item doesn't match the size of struct btrfs_extent_item
plus the size of an inline reference (plus an owner reference in case
simple quotas are enabled), then the extent is shared. That is not
always the case, however: we can have a single reference that is not
inlined. The reason for this shortcut is to be fast and avoid inspecting
non-inline references, which may be located in another leaf of the extent
tree and would slow down write paths.
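The false positive can be sketched with a toy model. The two size
constants are illustrative values, the function names are hypothetical,
and simple quotas (the owner reference) are ignored; this is not the
kernel logic, only the shape of the size-based shortcut versus a full
reference count.

```python
EXTENT_ITEM_SIZE = 24       # illustrative size of the fixed extent item header
INLINE_DATA_REF_SIZE = 29   # illustrative size of one inline data reference


def quick_check_reports_shared(item_size):
    """Size-only shortcut: anything but 'header + one inline ref' counts as shared."""
    return item_size != EXTENT_ITEM_SIZE + INLINE_DATA_REF_SIZE


def accurate_check_reports_shared(inline_refs, keyed_refs):
    """Count every reference, whether inline or stored as a separate item."""
    return sum(inline_refs) + sum(keyed_refs) > 1


# One remaining reference, stored as a separate (non-inline) item.
inline_refs, keyed_refs = [], [1]
item_size = EXTENT_ITEM_SIZE + INLINE_DATA_REF_SIZE * len(inline_refs)

quick = quick_check_reports_shared(item_size)
accurate = accurate_check_reports_shared(inline_refs, keyed_refs)
print(quick, accurate)
```

The shortcut answers "shared" while the full count answers "not shared",
which is harmless for a NOCOW write but fatal for swap activation.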
The following test script reproduces the bug:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdi
MNT=/mnt/sdi
NUM_CLONES=50
umount $DEV &> /dev/null
run_test()
{
    local sync_after_add_reflinks=$1
    local sync_after_remove_reflinks=$2
    mkfs.btrfs -f $DEV > /dev/null
    #mkfs.xfs -f $DEV > /dev/null
    mount $DEV $MNT
    touch $MNT/foo
    chmod 0600 $MNT/foo
    # On btrfs the file must be NOCOW.
    chattr +C $MNT/foo &> /dev/null
    xfs_io -s -c "pwrite -b 1M 0 1M" $MNT/foo
    mkswap $MNT/foo
    for ((i = 1; i <= $NUM_CLONES; i++)); do
        touch $MNT/foo_clone_$i
        chmod 0600 $MNT/foo_clone_$i
        # On btrfs the file must be NOCOW.
        chattr +C $MNT/foo_clone_$i &> /dev/null
        cp --reflink=always $MNT/foo $MNT/foo_clone_$i
    done
    if [ $sync_after_add_reflinks -ne 0 ]; then
        # Flush delayed refs and commit current transaction.
        sync -f $MNT
    fi
    # Remove the original file and all clones except the last.
    rm -f $MNT/foo
    for ((i = 1; i < $NUM_CLONES; i++)); do
        rm -f $MNT/foo_clone_$i
    done
    if [ $sync_after_remove_reflinks -ne 0 ]; then
        # Flush delayed refs and commit current transaction.
        sync -f $MNT
    fi
    # Now use the last clone as a swap file. It should work since
    # its extents are not shared anymore.
    swapon $MNT/foo_clone_${NUM_CLONES}
    swapoff $MNT/foo_clone_${NUM_CLONES}
    umount $MNT
}
echo -e "\nTest without sync after creating and removing clones"
run_test 0 0
echo -e "\nTest with sync after creating clones"
run_test 1 0
echo -e "\nTest with sync after removing clones"
run_test 0 1
echo -e "\nTest with sync after creating and removing clones"
run_test 1 1
Running the test:
$ ./test.sh
Test without sync after creating and removing clones
wrote 1048576/1048576 bytes at offset 0
1 MiB, 1 ops; 0.0017 sec (556.793 MiB/sec and 556.7929 ops/sec)
Setting up swapspace version 1, size = 1020 KiB (1044480 bytes)
no label, UUID=a6b9c29e-5ef4-4689-a8ac-bc199c750f02
swapon: /mnt/sdi/foo_clone_50: swapon failed: Invalid argument
swapoff: /mnt/sdi/foo_clone_50: swapoff failed: Invalid argument
Test with sync after creating clones
wrote 1048576/1048576 bytes at offset 0
1 MiB, 1 ops; 0.0036 sec (271.739 MiB/sec and 271.7391 ops/sec)
Setting up swapspace version 1, size = 1020 KiB (1044480 bytes)
no label, UUID=5e9008d6-1f7a-4948-a1b4-3f30aba20a33
swapon: /mnt/sdi/foo_clone_50: swapon failed: Invalid argument
swapoff: /mnt/sdi/foo_clone_50: swapoff failed: Invalid argument
Test with sync after removing clones
wrote 1048576/1048576 bytes at offset 0
1 MiB, 1 ops; 0.0103 sec (96.665 MiB/sec and 96.6651 ops/sec)
Setting up swapspace version 1, size = 1020 KiB (1044480 bytes)
no label, UUID=916c2740-fa9f-4385-9f06-29c3f89e4764
Test with sync after creating and removing clones
wrote 1048576/1048576 bytes at offset 0
1 MiB, 1 ops; 0.0031 sec (314.268 MiB/sec and 314.2678 ops/sec)
Setting up swapspace version 1, size = 1020 KiB (1044480 bytes)
no label, UUID=06aab1dd-4d90-49c0-bd9f-3a8db4e2f912
swapon: /mnt/sdi/foo_clone_50: swapon failed: Invalid argument
swapoff: /mnt/sdi/foo_clone_50: swapoff failed: Invalid argument
Fix this by reworking btrfs_swap_activate(): instead of using extent
maps and checking for shared extents with can_nocow_extent(), iterate
over the inode's file extent items and use the accurate
btrfs_is_data_extent_shared().
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When activating the swap file we flush all delalloc and wait for ordered
extent completion, so that we don't miss any delalloc and extents before
we check that the file's extent layout is usable for a swap file and
activate the swap file. We are called with the inode's VFS lock acquired,
so we won't race with buffered and direct IO writes, however we can still
race with memory mapped writes since they don't acquire the inode's VFS
lock. The race window is between flushing all delalloc and locking the
whole file's extent range, since memory mapped writes lock an extent range
with the length of a page.
Fix this by acquiring the inode's mmap lock before we flush delalloc.
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we call btrfs_read_folio() we get an unlocked folio, so it is possible
for a different thread to concurrently modify folio->mapping. We must
check that this hasn't happened once we do have the lock.
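The "recheck after taking the lock" pattern can be sketched in user space.
Everything here is a hypothetical stand-in (the Folio class, find_folio(),
read_folio(), the retry limit); the concurrent invalidation is simulated
inside read_folio() so that the example is deterministic.

```python
import threading


class Folio:
    def __init__(self, mapping):
        self.mapping = mapping
        self.lock = threading.Lock()
        self.uptodate = False


def get_valid_locked_folio(mapping, find_folio, read_folio, max_tries=8):
    """Return a locked folio that still belongs to 'mapping', retrying if not."""
    for _ in range(max_tries):
        folio = find_folio()
        if not folio.uptodate:
            read_folio(folio)  # the folio is unlocked while this runs
        folio.lock.acquire()
        if folio.mapping is mapping:
            return folio       # still ours, safe to use while locked
        folio.lock.release()   # it was invalidated meanwhile, try again
    raise RuntimeError("folio kept getting invalidated")


# Demo: the first folio is invalidated during the read, the second is not.
mapping = object()
stale, good = Folio(mapping), Folio(mapping)
queue = [stale, good]


def find_folio():
    return queue.pop(0)


def read_folio(folio):
    folio.uptodate = True
    if folio is stale:
        folio.mapping = None  # simulated concurrent invalidation


got = get_valid_locked_folio(mapping, find_folio, read_folio)
print(got is good)
```

The essential point is that the mapping comparison happens only after the
lock is held, since any earlier check can be outdated by the time the
lock is taken.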
CC: stable@vger.kernel.org # 6.12+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
When we call btrfs_read_folio() to bring a folio uptodate, we unlock the
folio. The result of that is that a different thread can modify the
mapping (like remove it with invalidate) before we call folio_lock().
This results in an invalid page and we need to try again.
In particular, if we are relocating concurrently with aborting a
transaction, this can result in a crash like the following:
BUG: kernel NULL pointer dereference, address: 0000000000000000
PGD 0 P4D 0
Oops: 0000 [#1] SMP
CPU: 76 PID: 1411631 Comm: kworker/u322:5
Workqueue: events_unbound btrfs_reclaim_bgs_work
RIP: 0010:set_page_extent_mapped+0x20/0xb0
RSP: 0018:ffffc900516a7be8 EFLAGS: 00010246
RAX: ffffea009e851d08 RBX: ffffea009e0b1880 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffc900516a7b90 RDI: ffffea009e0b1880
RBP: 0000000003573000 R08: 0000000000000001 R09: ffff88c07fd2f3f0
R10: 0000000000000000 R11: 0000194754b575be R12: 0000000003572000
R13: 0000000003572fff R14: 0000000000100cca R15: 0000000005582fff
FS: 0000000000000000(0000) GS:ffff88c07fd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000407d00f002 CR4: 00000000007706f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
<TASK>
? __die+0x78/0xc0
? page_fault_oops+0x2a8/0x3a0
? __switch_to+0x133/0x530
? wq_worker_running+0xa/0x40
? exc_page_fault+0x63/0x130
? asm_exc_page_fault+0x22/0x30
? set_page_extent_mapped+0x20/0xb0
relocate_file_extent_cluster+0x1a7/0x940
relocate_data_extent+0xaf/0x120
relocate_block_group+0x20f/0x480
btrfs_relocate_block_group+0x152/0x320
btrfs_relocate_chunk+0x3d/0x120
btrfs_reclaim_bgs_work+0x2ae/0x4e0
process_scheduled_works+0x184/0x370
worker_thread+0xc6/0x3e0
? blk_add_timer+0xb0/0xb0
kthread+0xae/0xe0
? flush_tlb_kernel_range+0x90/0x90
ret_from_fork+0x2f/0x40
? flush_tlb_kernel_range+0x90/0x90
ret_from_fork_asm+0x11/0x20
</TASK>
This occurs because cleanup_one_transaction() calls
destroy_delalloc_inodes() which calls invalidate_inode_pages2() which
takes the folio_lock before setting mapping to NULL. We fail to check
this, and subsequently call set_page_extent_mapped() (the faulting
function in the trace above), which assumes that mapping != NULL (in
fact it asserts that in debug mode).
Note that the "fixes" patch here is not the one that introduced the
race (the very first iteration of this code from 2009) but a more recent
change that made this particular crash happen in practice.
Fixes: e7f1326cc2 ("btrfs: set page extent mapped after read_folio in relocate_one_page")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
When COWing a tree block, at btrfs_cow_block(), and we have the
tracepoint trace_btrfs_cow_block() enabled and preemption is also enabled
(CONFIG_PREEMPT=y), we can trigger a use-after-free in the COWed extent
buffer while inside the tracepoint code. This is because in some paths
that call btrfs_cow_block(), such as btrfs_search_slot(), we are holding
the last reference on the extent buffer @buf so btrfs_force_cow_block()
drops the last reference on the @buf extent buffer when it calls
free_extent_buffer_stale(buf), which schedules the release of the extent
buffer with RCU. This means that if we are on a kernel with preemption,
the current task may be preempted before calling trace_btrfs_cow_block()
and the extent buffer already released by the time trace_btrfs_cow_block()
is called, resulting in a use-after-free.
Fix this by moving the trace_btrfs_cow_block() from btrfs_cow_block() to
btrfs_force_cow_block() before the COWed extent buffer is freed.
This also has a side effect of invoking the tracepoint in the tree defrag
code, at defrag.c:btrfs_realloc_node(), since btrfs_force_cow_block() is
called there, but this is fine and it was actually missing there.
Reported-by: syzbot+8517da8635307182c8a5@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/6759a9b9.050a0220.1ac542.000d.GAE@google.com/
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Fix a use-after-free in the I/O completion path for encoded reads by
using a completion instead of a wait_queue for synchronizing the
destruction of 'struct btrfs_encoded_read_private'.
Fixes: 1881fba89b ("btrfs: add BTRFS_IOC_ENCODED_READ ioctl")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Merge tag 'for-6.13-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- tree-checker catches invalid number of inline extent references
- zoned mode fixes:
- enhance zone append IO command so it also detects emulated writes
- handle bio splitting at sectorsize boundary
- when deleting a snapshot, fix a condition for visiting nodes in reloc
trees
* tag 'for-6.13-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: tree-checker: reject inline extent items with 0 ref count
btrfs: split bios to the fs sector size boundary
btrfs: use bio_is_zone_append() in the completion handler
btrfs: fix improper generation check in snapshot delete
[BUG]
There is a bug report in the mailing list where btrfs_run_delayed_refs()
failed to drop the ref count for logical 25870311358464 num_bytes
2113536.
The involved leaf dump looks like this:
item 166 key (25870311358464 168 2113536) itemoff 10091 itemsize 50
extent refs 1 gen 84178 flags 1
ref#0: shared data backref parent 32399126528000 count 0 <<<
ref#1: shared data backref parent 31808973717504 count 1
Notice the count number is 0.
[CAUSE]
There is no concrete evidence yet, but considering 0 -> 1 is also a
single bit flipped, it's possible that hardware memory bitflip is
involved, causing the on-disk extent tree to be corrupted.
[FIX]
To prevent us from reading such a corrupted extent item, or writing such
a damaged extent item back to disk, enhance the handling of
BTRFS_EXTENT_DATA_REF_KEY and BTRFS_SHARED_DATA_REF_KEY keys for both
inlined and key items, to detect such a 0 ref count and reject it.
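A minimal sketch of the kind of check described, using the two refs from
the leaf dump above as input. The dictionary layout and the function name
are assumptions made for the example, not the tree-checker code.

```python
def find_zero_count_ref(refs):
    """Return the index of the first ref whose count is 0, or None if all are valid."""
    for i, ref in enumerate(refs):
        if ref["count"] == 0:
            return i
    return None


# The two inline refs from the leaf dump above.
refs = [
    {"type": "shared data backref", "parent": 32399126528000, "count": 0},
    {"type": "shared data backref", "parent": 31808973717504, "count": 1},
]

bad = find_zero_count_ref(refs)
print(bad)
```

An item for which such a check finds a zero count would be rejected
instead of being read in or written back.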
CC: stable@vger.kernel.org # 5.4+
Link: https://lore.kernel.org/linux-btrfs/7c69dd49-c346-4806-86e7-e6f863a66f48@app.fastmail.com/
Reported-by: Frankie Fisher <frankie@terrorise.me.uk>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Btrfs, like other file systems, can't really deal with I/O not aligned to
its internal block size (which strangely is called sector size in
btrfs, for historical reasons), but the block layer split helper doesn't
even know about that.
Round down the split boundary so that all I/Os are aligned.
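The rounding itself is plain power-of-two arithmetic. A sketch, assuming
a 4096 byte sector size and a hypothetical split offset of 70000 bytes
proposed by a generic split helper:

```python
def align_down(value, alignment):
    """Round value down to a multiple of alignment (must be a power of two)."""
    assert alignment > 0 and alignment & (alignment - 1) == 0
    return value & ~(alignment - 1)


SECTORSIZE = 4096          # example sector size
proposed_split = 70_000    # hypothetical, not sector aligned
aligned_split = align_down(proposed_split, SECTORSIZE)
print(aligned_split)
```

The aligned boundary is never larger than the proposed one, so the first
part of the split still fits whatever limit produced the proposal.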
Fixes: d5e4377d50 ("btrfs: split zone append bios in btrfs_submit_bio")
CC: stable@vger.kernel.org # 6.12
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use bio_is_zone_append() in the completion handler, otherwise it won't
catch bios turned into regular writes by the block level zone write
plugging. The additional test it adds is for emulated zone append.
Fixes: 9b1ce7f0c6 ("block: Implement zone append emulation")
CC: stable@vger.kernel.org # 6.12
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have been using the following check
if (generation <= root->root_key.offset)
to make decisions about whether or not to visit a node during snapshot
delete. This is because for normal subvolumes this is set to 0, and for
snapshots it's set to the creation generation. The idea being that if
the generation of the node is less than or equal to our creation
generation then we don't need to visit that node, because it doesn't
belong to us, we can simply drop our reference and move on.
However reloc roots don't have their generation stored in
root->root_key.offset, instead that is the objectid of their
corresponding fs root. This means we can incorrectly not walk into
nodes that need to be dropped when deleting a reloc root.
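A toy model of why the comparison misfires for reloc roots. The ToyRoot
type, its start_gen field and all numbers are hypothetical and only
illustrate the mismatch in meaning of root_key.offset; this is not how
the fix obtains the starting generation.

```python
from dataclasses import dataclass


@dataclass
class ToyRoot:
    key_offset: int   # snapshot: creation generation; subvolume: 0;
                      # reloc root: objectid of the corresponding fs root
    is_reloc: bool
    start_gen: int    # hypothetical: generation the root really started at


def skip_node_old(node_gen, root):
    # The check quoted above, applied blindly to any root.
    return node_gen <= root.key_offset


def skip_node_with_real_gen(node_gen, root):
    start = root.start_gen if root.is_reloc else root.key_offset
    return node_gen <= start


# A reloc root for fs root objectid 257 that started at generation 100.
reloc = ToyRoot(key_offset=257, is_reloc=True, start_gen=100)
node_gen = 150  # node created after the reloc root, so it must be visited

old_skips = skip_node_old(node_gen, reloc)
new_skips = skip_node_with_real_gen(node_gen, reloc)
print(old_skips, new_skips)
```

With these made-up numbers the old check compares a generation against an
objectid and skips a node that should have been walked.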
There are a variety of consequences to making the wrong choice in two
distinct areas.
visit_node_for_delete()
1. False positive. We think we are newer than the block when we really
aren't. We don't visit the node and drop our reference to the node
and carry on. This would result in leaked space.
2. False negative. We do decide to walk down into a block that we
should have just dropped our reference to. However this means that
the child node will have refs > 1, so we will switch to
UPDATE_BACKREF, and then the subsequent walk_down_proc() will notice
that btrfs_header_owner(node) != root->root_key.objectid and it'll
break out of the loop, and then walk_up_proc() will drop our reference,
so this appears to be ok.
do_walk_down()
1. False positive. We are in UPDATE_BACKREF and incorrectly decide that
we are done and don't need to update the backref for our lower nodes.
This is another case that simply won't happen with relocation, as we
only have to do UPDATE_BACKREF if the node below us was shared and
didn't have FULL_BACKREF set, and since we don't own that node
because we're a reloc root we actually won't end up in this case.
2. False negative. Again this is tricky because as described above, we
simply wouldn't be here from relocation, because we don't own any of
the nodes because we never set btrfs_header_owner() to the reloc root
objectid, and we always use FULL_BACKREF, we never actually need to
set FULL_BACKREF on any children.
Having spent a lot of time stressing relocation/snapshot delete recently
I've not seen this pop up in practice. But this is objectively incorrect,
so fix this to get the correct starting generation based on the root
we're dropping to keep me from thinking there's a problem here.
CC: stable@vger.kernel.org
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We queue up inodes to be defrag'ed asynchronously, which means we do not
have their original file for readahead. This means that the code to
skip readahead on pre-content watched files will not run, and we could
potentially read in empty pages.
Handle this corner case by disabling defrag on files that are currently
being watched for pre-content events.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/4cc5bcea13db7904174353d08e85157356282a59.1731684329.git.josef@toxicpanda.com
Merge tag 'for-6.13-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few more fixes. Apart from the one liners and updated bio splitting
error handling there's a fix for subvolume mount with different flags.
This was known and fixed for some time but I've delayed it to give it
more testing.
- fix unbalanced locking when swapfile activation fails when the
subvolume gets deleted in the meantime
- add btrfs error handling after bio_split() calls that got error
handling recently
- during unmount, flush delalloc workers at the right time before the
cleaner thread is shut down
- fix regression in buffered write folio conversion, explicitly wait
for writeback as FGP_STABLE flag is currently a no-op on btrfs
- handle race in subvolume mount with different flags, the conversion
to the new mount API did not handle the case where multiple
subvolumes get mounted in parallel, which is a distro use case"
* tag 'for-6.13-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: flush delalloc workers queue before stopping cleaner kthread during unmount
btrfs: handle bio_split() errors
btrfs: properly wait for writeback before buffered write
btrfs: fix missing snapshot drew unlock when root is dead during swap activation
btrfs: fix mount failure due to remount races
Commit e546fe1da9 ("block: Rework bio_split() return value") changed
bio_split() so that it can return errors.
Add error handling for it in btrfs_split_bio() and ultimately
btrfs_submit_chunk(). As the bio is not submitted, the bio counter must
be decremented to pair with btrfs_bio_counter_inc_blocked().
Reviewed-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
Before commit e820dbeb6a ("btrfs: convert btrfs_buffered_write() to
use folios"), function prepare_one_folio() always waited for folio
writeback to finish before returning the folio.
However commit e820dbeb6a ("btrfs: convert btrfs_buffered_write() to
use folios") changed to use FGP_STABLE to do the writeback wait, but
FGP_STABLE is calling folio_wait_stable(), which only calls
folio_wait_writeback() if the address space has AS_STABLE_WRITES, which
is not set for btrfs inodes.
This means we will not wait for the folio writeback at all.
[CAUSE]
The cause is that FGP_STABLE does not wait for writeback unconditionally,
but only for address spaces with AS_STABLE_WRITES. Normally such a flag is
set when the super block has the SB_I_STABLE_WRITES flag.
Such a super block flag is set when the block device has hardware digest
support or has an internal checksum requirement.
I'd argue btrfs should set such a super block flag due to its default data
checksum behavior, but it is not set yet, so this means the FGP_STABLE
flag will have no effect at all.
(For NODATASUM inodes, we can skip the waiting in theory but that should
be an optimization in the future.)
This can lead to a data checksum mismatch, as we can modify the folio
while it's still under writeback, making the contents differ
from the contents at submission and checksum calculation.
[FIX]
Instead of fully relying on FGP_STABLE, manually do the folio writeback
waiting, until we set the address space or super block flag.
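The difference between the conditional and the unconditional wait can be
sketched with a toy model. The dictionaries and the two prepare_*()
helpers are hypothetical; only the condition "wait only if the mapping
asks for stable writes" is taken from the description above.

```python
def folio_wait_writeback(folio):
    # Toy stand-in: "waiting" simply marks writeback as finished.
    folio["writeback"] = False


def folio_wait_stable(folio, mapping):
    # Waits only if the mapping asks for stable writes.
    if mapping["stable_writes"]:
        folio_wait_writeback(folio)


def prepare_folio_relying_on_stable(folio, mapping):
    folio_wait_stable(folio, mapping)
    return folio


def prepare_folio_explicit_wait(folio, mapping):
    folio_wait_writeback(folio)
    return folio


# A mapping without the stable-writes flag, as described for btrfs inodes.
btrfs_like_mapping = {"stable_writes": False}

a = prepare_folio_relying_on_stable({"writeback": True}, btrfs_like_mapping)
b = prepare_folio_explicit_wait({"writeback": True}, btrfs_like_mapping)
print(a["writeback"], b["writeback"])
```

In the first case the folio is handed back while still under writeback,
which is the window in which its contents can be modified.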
Fixes: e820dbeb6a ("btrfs: convert btrfs_buffered_write() to use folios")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When activating a swap file we acquire the root's snapshot drew lock and
then check if the root is dead, failing and returning with -EPERM if it's
dead but without unlocking the root's snapshot lock. Fix this by adding
the missing unlock.
Fixes: 60021bd754 ("btrfs: prevent subvol with swapfile from being deleted")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
The following reproducer can cause btrfs mount to fail:
dev="/dev/test/scratch1"
mnt1="/mnt/test"
mnt2="/mnt/scratch"
mkfs.btrfs -f $dev
mount $dev $mnt1
btrfs subvolume create $mnt1/subvol1
btrfs subvolume create $mnt1/subvol2
umount $mnt1
mount $dev $mnt1 -o subvol=subvol1
while mount -o remount,ro $mnt1; do mount -o remount,rw $mnt1; done &
bg=$!
while mount $dev $mnt2 -o subvol=subvol2; do umount $mnt2; done
kill $bg
wait
umount -R $mnt1
umount -R $mnt2
The script will fail with the following error:
mount: /mnt/scratch: /dev/mapper/test-scratch1 already mounted on /mnt/test.
dmesg(1) may have more information after failed mount system call.
umount: /mnt/test: target is busy.
umount: /mnt/scratch/: not mounted
And there is no kernel error message.
[CAUSE]
During the btrfs mount, to support mounting different subvolumes with
different RO/RW flags, we need to detect that and retry if needed:
Retry with matching RO flags if the initial mount fails with -EBUSY.
The problem is, during that retry we do not hold any super block lock
(s_umount), which means there can be a remount process changing the RO
flags of the original fs super block.
If so, we can have an EBUSY error during retry. And this time we treat
any failure as an error, without any retry, causing the above EBUSY
mount failure.
[FIX]
The current retry behavior is racy because we do not have a super block,
thus no way to hold s_umount to prevent the race with remount.
Solve the root problem by allowing fc->sb_flags to mismatch from the
sb->s_flags at btrfs_get_tree_super().
Then at the re-entry point btrfs_get_tree_subvol(), manually check the
fc->sb_flags against sb->s_flags, if it's a RO->RW mismatch, then
reconfigure with the s_umount lock held.
Reported-by: Enno Gotthold <egotthold@suse.com>
Reported-by: Fabian Vogt <fvogt@suse.com>
[ Special thanks for the reproducer and early analysis pointing to btrfs. ]
Fixes: f044b31867 ("btrfs: handle the ro->rw transition for mounting different subvolumes")
Link: https://bugzilla.suse.com/show_bug.cgi?id=1231836
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Merge tag 'for-6.13-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- add lockdep annotations for io_uring/encoded read integration, inode
lock is held when returning to userspace
- properly reflect experimental config option to sysfs
- handle NULL root in case the rescue mode accepts invalid/damaged tree
roots (rescue=ibadroots)
- regression fix of a deadlock between transaction and extent locks
- fix pending bio accounting bug in encoded read ioctl
- fix NOWAIT mode when checking references for NOCOW files
- fix use-after-free in a rb-tree cleanup in ref-verify debugging tool
* tag 'for-6.13-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix lockdep warnings on io_uring encoded reads
btrfs: ref-verify: fix use-after-free after invalid ref action
btrfs: add a sanity check for btrfs root in btrfs_search_slot()
btrfs: don't loop for nowait writes when checking for cross references
btrfs: sysfs: advertise experimental features only if CONFIG_BTRFS_EXPERIMENTAL=y
btrfs: fix deadlock between transaction commits and extent locks
btrfs: fix use-after-free in btrfs_encoded_read_endio()
Lockdep doesn't like the fact that btrfs_uring_read_extent() returns to
userspace still holding the inode lock, even though we release it once
the I/O finishes. Add calls to rwsem_release() and rwsem_acquire_read() to
work round this.
Reported-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Fixes: 34310c442e ("btrfs: add io_uring command for encoded reads (ENCODED_READ ioctl)")
Signed-off-by: Mark Harmstone <maharmstone@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Syzbot reports a null-ptr-deref in btrfs_search_slot().
The reproducer is using rescue=ibadroots, and the extent tree root is
corrupted, thus the extent tree is NULL.
When scrub tries to search the extent tree to gather the needed extent
info, btrfs_search_slot() doesn't check if the target root is NULL or
not, resulting in the null-ptr-deref.
Add a sanity check for the btrfs root before using it in btrfs_search_slot().
Reported-by: syzbot+3030e17bd57a73d39bd7@syzkaller.appspotmail.com
Fixes: 42437a6386 ("btrfs: introduce mount option rescue=ignorebadroots")
Link: https://syzkaller.appspot.com/bug?extid=3030e17bd57a73d39bd7
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Tested-by: syzbot+3030e17bd57a73d39bd7@syzkaller.appspotmail.com
Signed-off-by: Lizhi Xu <lizhi.xu@windriver.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When checking for delayed refs when verifying if there are cross
references for a data extent, we stop if the path has nowait set and we
can't try lock the delayed ref head's mutex, returning -EAGAIN with the
goal of making a write fall back to a blocking context. However we ignore
the -EAGAIN at btrfs_cross_ref_exist() when check_delayed_ref() returns
it, and keep looping instead of immediately returning the -EAGAIN to the
caller.
Fix this by not looping if we get -EAGAIN and we have a nowait path.
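The intended control flow can be sketched in user-space C as follows. The
names sketch_path, check_delayed_ref_stub() and cross_ref_exist_sketch() are
hypothetical stand-ins, not the actual btrfs functions, and the contention
counter only simulates a delayed ref head mutex that is held elsewhere.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a btrfs path carrying the nowait flag. */
struct sketch_path {
	bool nowait;
};

/*
 * Hypothetical stand-in for check_delayed_ref(): fails with -EAGAIN for the
 * first *contended calls, simulating a mutex that cannot be taken yet.
 */
static int check_delayed_ref_stub(int *contended)
{
	if (*contended > 0) {
		(*contended)--;
		return -EAGAIN;
	}
	return 0;
}

static int cross_ref_exist_sketch(const struct sketch_path *path, int contended)
{
	int ret;

	do {
		ret = check_delayed_ref_stub(&contended);
		/* With nowait, hand -EAGAIN back so the caller can fall back. */
		if (ret == -EAGAIN && path->nowait)
			return ret;
	} while (ret == -EAGAIN);

	return ret;
}
```

A nowait caller sees -EAGAIN on the first contended attempt, while a blocking
caller keeps retrying until the check succeeds.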
Fixes: 26ce911446 ("btrfs: make can_nocow_extent nowait compatible")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We are advertising experimental features through sysfs if
CONFIG_BTRFS_DEBUG is set, without looking if CONFIG_BTRFS_EXPERIMENTAL
is set. This is wrong as it will result in reporting experimental
features as supported when CONFIG_BTRFS_EXPERIMENTAL is not set but
CONFIG_BTRFS_DEBUG is set.
Fix this by checking for CONFIG_BTRFS_EXPERIMENTAL instead of
CONFIG_BTRFS_DEBUG.
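The difference between the two checks can be sketched with plain
preprocessor conditionals. The helper names below are hypothetical and the
sketch defines CONFIG_BTRFS_DEBUG by hand to emulate a debug build without
CONFIG_BTRFS_EXPERIMENTAL. It is not the sysfs code itself.

```c
#include <stdbool.h>

/* Emulate a build with debugging enabled but experimental features off. */
#define CONFIG_BTRFS_DEBUG 1

/* Old behaviour: advertise based on the debug option. */
static bool advertise_experimental_old(void)
{
#ifdef CONFIG_BTRFS_DEBUG
	return true;
#else
	return false;
#endif
}

/* Fixed behaviour: advertise only if the experimental option is set. */
static bool advertise_experimental_new(void)
{
#ifdef CONFIG_BTRFS_EXPERIMENTAL
	return true;
#else
	return false;
#endif
}
```

With this configuration the old check reports experimental features as
supported while the new check does not.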
Fixes: 67cd3f2217 ("btrfs: split out CONFIG_BTRFS_EXPERIMENTAL from CONFIG_BTRFS_DEBUG")
Reviewed-by: Neal Gompa <neal@gompa.dev>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When running a workload with fsstress and duperemove (generic/561) we can
hit a deadlock related to transaction commits and locking extent ranges,
as described below.
Task A is hanging during a transaction commit, waiting for all other
writers to complete:
[178317.334817] INFO: task fsstress:555623 blocked for more than 120 seconds.
[178317.335693] Not tainted 6.12.0-rc6-btrfs-next-179+ #1
[178317.336528] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[178317.337673] task:fsstress state:D stack:0 pid:555623 tgid:555623 ppid:555620 flags:0x00004002
[178317.337679] Call Trace:
[178317.337681] <TASK>
[178317.337685] __schedule+0x364/0xbe0
[178317.337691] schedule+0x26/0xa0
[178317.337695] btrfs_commit_transaction+0x5c5/0x1050 [btrfs]
[178317.337769] ? start_transaction+0xc4/0x800 [btrfs]
[178317.337815] ? __pfx_autoremove_wake_function+0x10/0x10
[178317.337819] btrfs_mksubvol+0x381/0x640 [btrfs]
[178317.337878] btrfs_mksnapshot+0x7a/0xb0 [btrfs]
[178317.337935] __btrfs_ioctl_snap_create+0x1bb/0x1d0 [btrfs]
[178317.337995] btrfs_ioctl_snap_create_v2+0x103/0x130 [btrfs]
[178317.338053] btrfs_ioctl+0x29b/0x2a90 [btrfs]
[178317.338118] ? kmem_cache_alloc_noprof+0x5f/0x2c0
[178317.338126] ? getname_flags+0x45/0x1f0
[178317.338133] ? _raw_spin_unlock+0x15/0x30
[178317.338145] ? __x64_sys_ioctl+0x88/0xc0
[178317.338149] __x64_sys_ioctl+0x88/0xc0
[178317.338152] do_syscall_64+0x4a/0x110
[178317.338160] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[178317.338190] RIP: 0033:0x7f13c28e271b
Which corresponds to line 2361 of transaction.c:
$ cat -n fs/btrfs/transaction.c
(...)
2162 int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
2163 {
(...)
2349 spin_lock(&fs_info->trans_lock);
2350 add_pending_snapshot(trans);
2351 cur_trans->state = TRANS_STATE_COMMIT_DOING;
2352 spin_unlock(&fs_info->trans_lock);
2353
2354 /*
2355 * The thread has started/joined the transaction thus it holds the
2356 * lockdep map as a reader. It has to release it before acquiring the
2357 * lockdep map as a writer.
2358 */
2359 btrfs_lockdep_release(fs_info, btrfs_trans_num_writers);
2360 btrfs_might_wait_for_event(fs_info, btrfs_trans_num_writers);
2361 wait_event(cur_trans->writer_wait,
2362 atomic_read(&cur_trans->num_writers) == 1);
(...)
The transaction is in the TRANS_STATE_COMMIT_DOING state and so it's
waiting for all other existing writers to complete and release their
transaction handle.
Task B is running ordered extent completion and blocked waiting to lock an
extent range in an inode's io tree:
[178317.327411] INFO: task kworker/u48:8:554545 blocked for more than 120 seconds.
[178317.328630] Not tainted 6.12.0-rc6-btrfs-next-179+ #1
[178317.329635] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[178317.330872] task:kworker/u48:8 state:D stack:0 pid:554545 tgid:554545 ppid:2 flags:0x00004000
[178317.330878] Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
[178317.330944] Call Trace:
[178317.330945] <TASK>
[178317.330947] __schedule+0x364/0xbe0
[178317.330952] schedule+0x26/0xa0
[178317.330955] __lock_extent+0x337/0x3a0 [btrfs]
[178317.331014] ? __pfx_autoremove_wake_function+0x10/0x10
[178317.331017] btrfs_finish_one_ordered+0x47a/0xaa0 [btrfs]
[178317.331074] ? psi_group_change+0x132/0x2d0
[178317.331078] btrfs_work_helper+0xbd/0x370 [btrfs]
[178317.331140] process_scheduled_works+0xd3/0x460
[178317.331144] ? __pfx_worker_thread+0x10/0x10
[178317.331146] worker_thread+0x121/0x250
[178317.331149] ? __pfx_worker_thread+0x10/0x10
[178317.331151] kthread+0xe9/0x120
[178317.331154] ? __pfx_kthread+0x10/0x10
[178317.331157] ret_from_fork+0x2d/0x50
[178317.331159] ? __pfx_kthread+0x10/0x10
[178317.331162] ret_from_fork_asm+0x1a/0x30
This extent range locking happens after joining the current transaction,
so task A is waiting for task B to release its transaction handle
(decrementing the transaction's num_writers counter).
Task C, while doing a fiemap, tries to join the current transaction:
[242682.812815] task:pool state:D stack:0 pid:560767 tgid:560724 ppid:555622 flags:0x00004006
[242682.812827] Call Trace:
[242682.812856] <TASK>
[242682.812864] __schedule+0x364/0xbe0
[242682.812879] ? _raw_spin_unlock_irqrestore+0x23/0x40
[242682.812897] schedule+0x26/0xa0
[242682.812909] wait_current_trans+0xd6/0x130 [btrfs]
[242682.813148] ? __pfx_autoremove_wake_function+0x10/0x10
[242682.813162] start_transaction+0x3d4/0x800 [btrfs]
[242682.813399] btrfs_is_data_extent_shared+0xd2/0x440 [btrfs]
[242682.813723] fiemap_process_hole+0x2a2/0x300 [btrfs]
[242682.813995] extent_fiemap+0x9b8/0xb80 [btrfs]
[242682.814249] btrfs_fiemap+0x78/0xc0 [btrfs]
[242682.814501] do_vfs_ioctl+0x2db/0xa50
[242682.814519] __x64_sys_ioctl+0x6a/0xc0
[242682.814531] do_syscall_64+0x4a/0x110
[242682.814544] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[242682.814556] RIP: 0033:0x7efff595e71b
It tries to join the current transaction, but it can't because the
transaction is in the TRANS_STATE_COMMIT_DOING state, so
join_transaction() returns -EBUSY to start_transaction() and makes it
wait for the current transaction to complete. And while it's waiting
for the transaction to complete, it's holding an extent range locked
in the same inode that task B is operating, which causes a deadlock
between these 3 tasks. The extent range for the inode was locked at
the start of the fiemap operation, early at extent_fiemap().
In short these tasks deadlock because:
1) Task A is waiting for task B to release its transaction handle;
2) Task B is waiting to lock an extent range for an inode while holding a
transaction handle open;
3) Task C is waiting for the current transaction to complete (for task A
to finish the transaction commit) while holding the extent range for
the inode locked, so task B can't progress and release its transaction
handle.
This results in an ABBA deadlock involving transaction commits and extent
locks. Extent locks are higher level locks, like inode VFS locks, and
should always be acquired before joining or starting a transaction, but
recently commit 2206265f41 ("btrfs: remove code duplication in ordered
extent finishing") accidentally changed btrfs_finish_one_ordered() to do
the transaction join before locking the extent range.
Fix this by making sure that btrfs_finish_one_ordered() always locks the
extent before joining a transaction and add an explicit comment about the
need for this order.
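The ordering rule can be illustrated with a tiny user-space order checker.
This is a hypothetical sketch in the spirit of lockdep, not btrfs code: the
enum values, struct sketch_task and sketch_acquire() are invented for the
illustration and only model "extent lock first, transaction second".

```c
#include <stdbool.h>

/* Lower value = higher-level lock, which must be taken first. */
enum sketch_lock_level {
	SKETCH_EXTENT_LOCK = 1,	/* extent range lock in the inode's io tree */
	SKETCH_TRANSACTION = 2,	/* joining or starting a transaction */
};

/* Tracks the lowest-level (numerically highest) resource a task holds. */
struct sketch_task {
	int deepest_held;
};

/*
 * Record an acquisition and report whether it respects the ordering rule:
 * a task may not take a higher-level lock while holding a lower-level one.
 */
static bool sketch_acquire(struct sketch_task *task,
			   enum sketch_lock_level level)
{
	if ((int)level < task->deepest_held)
		return false;	/* e.g. extent lock after joining a transaction */
	task->deepest_held = level;
	return true;
}
```

Taking the extent lock and then joining the transaction passes the check,
while the reverse order, the one introduced by the offending commit, is
flagged.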
Fixes: 2206265f41 ("btrfs: remove code duplication in ordered extent finishing")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Shinichiro reported the following use-after-free that sometimes happens
in our CI system when running fstests' btrfs/284 on a TCMU runner device:
BUG: KASAN: slab-use-after-free in lock_release+0x708/0x780
Read of size 8 at addr ffff888106a83f18 by task kworker/u80:6/219
CPU: 8 UID: 0 PID: 219 Comm: kworker/u80:6 Not tainted 6.12.0-rc6-kts+ #15
Hardware name: Supermicro Super Server/X11SPi-TF, BIOS 3.3 02/21/2020
Workqueue: btrfs-endio btrfs_end_bio_work [btrfs]
Call Trace:
<TASK>
dump_stack_lvl+0x6e/0xa0
? lock_release+0x708/0x780
print_report+0x174/0x505
? lock_release+0x708/0x780
? __virt_addr_valid+0x224/0x410
? lock_release+0x708/0x780
kasan_report+0xda/0x1b0
? lock_release+0x708/0x780
? __wake_up+0x44/0x60
lock_release+0x708/0x780
? __pfx_lock_release+0x10/0x10
? __pfx_do_raw_spin_lock+0x10/0x10
? lock_is_held_type+0x9a/0x110
_raw_spin_unlock_irqrestore+0x1f/0x60
__wake_up+0x44/0x60
btrfs_encoded_read_endio+0x14b/0x190 [btrfs]
btrfs_check_read_bio+0x8d9/0x1360 [btrfs]
? lock_release+0x1b0/0x780
? trace_lock_acquire+0x12f/0x1a0
? __pfx_btrfs_check_read_bio+0x10/0x10 [btrfs]
? process_one_work+0x7e3/0x1460
? lock_acquire+0x31/0xc0
? process_one_work+0x7e3/0x1460
process_one_work+0x85c/0x1460
? __pfx_process_one_work+0x10/0x10
? assign_work+0x16c/0x240
worker_thread+0x5e6/0xfc0
? __pfx_worker_thread+0x10/0x10
kthread+0x2c3/0x3a0
? __pfx_kthread+0x10/0x10
ret_from_fork+0x31/0x70
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
Allocated by task 3661:
kasan_save_stack+0x30/0x50
kasan_save_track+0x14/0x30
__kasan_kmalloc+0xaa/0xb0
btrfs_encoded_read_regular_fill_pages+0x16c/0x6d0 [btrfs]
send_extent_data+0xf0f/0x24a0 [btrfs]
process_extent+0x48a/0x1830 [btrfs]
changed_cb+0x178b/0x2ea0 [btrfs]
btrfs_ioctl_send+0x3bf9/0x5c20 [btrfs]
_btrfs_ioctl_send+0x117/0x330 [btrfs]
btrfs_ioctl+0x184a/0x60a0 [btrfs]
__x64_sys_ioctl+0x12e/0x1a0
do_syscall_64+0x95/0x180
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Freed by task 3661:
kasan_save_stack+0x30/0x50
kasan_save_track+0x14/0x30
kasan_save_free_info+0x3b/0x70
__kasan_slab_free+0x4f/0x70
kfree+0x143/0x490
btrfs_encoded_read_regular_fill_pages+0x531/0x6d0 [btrfs]
send_extent_data+0xf0f/0x24a0 [btrfs]
process_extent+0x48a/0x1830 [btrfs]
changed_cb+0x178b/0x2ea0 [btrfs]
btrfs_ioctl_send+0x3bf9/0x5c20 [btrfs]
_btrfs_ioctl_send+0x117/0x330 [btrfs]
btrfs_ioctl+0x184a/0x60a0 [btrfs]
__x64_sys_ioctl+0x12e/0x1a0
do_syscall_64+0x95/0x180
entry_SYSCALL_64_after_hwframe+0x76/0x7e
The buggy address belongs to the object at ffff888106a83f00
which belongs to the cache kmalloc-rnd-07-96 of size 96
The buggy address is located 24 bytes inside of
freed 96-byte region [ffff888106a83f00, ffff888106a83f60)
The buggy address belongs to the physical page:
page: refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff888106a83800 pfn:0x106a83
flags: 0x17ffffc0000000(node=0|zone=2|lastcpupid=0x1fffff)
page_type: f5(slab)
raw: 0017ffffc0000000 ffff888100053680 ffffea0004917200 0000000000000004
raw: ffff888106a83800 0000000080200019 00000001f5000000 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff888106a83e00: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
ffff888106a83e80: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
>ffff888106a83f00: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
^
ffff888106a83f80: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
ffff888106a84000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
==================================================================
Further analyzing the trace and the crash dump's vmcore file shows that
the wake_up() call in btrfs_encoded_read_endio() is calling wake_up() on
the wait_queue that is in the private data passed to the end_io handler.
Commit 4ff47df40447 ("btrfs: move priv off stack in
btrfs_encoded_read_regular_fill_pages()") moved 'struct
btrfs_encoded_read_private' off the stack.
Before that commit one can see a corruption of the private data when
analyzing the vmcore after a crash:
*(struct btrfs_encoded_read_private *)0xffff88815626eec8 = {
.wait = (wait_queue_head_t){
.lock = (spinlock_t){
.rlock = (struct raw_spinlock){
.raw_lock = (arch_spinlock_t){
.val = (atomic_t){
.counter = (int)-2005885696,
},
.locked = (u8)0,
.pending = (u8)157,
.locked_pending = (u16)40192,
.tail = (u16)34928,
},
.magic = (unsigned int)536325682,
.owner_cpu = (unsigned int)29,
.owner = (void *)__SCT__tp_func_btrfs_transaction_commit+0x0 = 0x0,
.dep_map = (struct lockdep_map){
.key = (struct lock_class_key *)0xffff8881575a3b6c,
.class_cache = (struct lock_class *[2]){ 0xffff8882a71985c0, 0xffffea00066f5d40 },
.name = (const char *)0xffff88815626f100 = "",
.wait_type_outer = (u8)37,
.wait_type_inner = (u8)178,
.lock_type = (u8)154,
},
},
.__padding = (u8 [24]){ 0, 157, 112, 136, 50, 174, 247, 31, 29 },
.dep_map = (struct lockdep_map){
.key = (struct lock_class_key *)0xffff8881575a3b6c,
.class_cache = (struct lock_class *[2]){ 0xffff8882a71985c0, 0xffffea00066f5d40 },
.name = (const char *)0xffff88815626f100 = "",
.wait_type_outer = (u8)37,
.wait_type_inner = (u8)178,
.lock_type = (u8)154,
},
},
.head = (struct list_head){
.next = (struct list_head *)0x112cca,
.prev = (struct list_head *)0x47,
},
},
.pending = (atomic_t){
.counter = (int)-1491499288,
},
.status = (blk_status_t)130,
}
Here we can see several indicators of in-memory data corruption, e.g. the
large negative atomic values of ->pending or
->wait->lock->rlock->raw_lock->val, as well as the bogus spinlock magic
0x1ff7ae32 (decimal 536325682 above) instead of 0xdead4ead or the bogus
pointer values for ->wait->head.
To fix this, change atomic_dec_return() to atomic_dec_and_test(), as
atomic_dec_return() is defined as two instructions on x86_64, whereas
atomic_dec_and_test() is defined as a single atomic operation. With
atomic_dec_return(), the counter value can already be decremented while
the if statement in btrfs_encoded_read_endio() is not completely
processed, i.e. the 0 test has not completed. If another thread continues
executing btrfs_encoded_read_regular_fill_pages(), the atomic_dec_return()
there can see an already updated ->pending counter and continue by freeing
the private data. Back in the endio handler, the test for 0 then succeeds
and the wait_queue is woken up, resulting in a use-after-free.
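The decrement-and-test pattern can be sketched in user-space C11 as below.
This is an analogue under assumptions, not the kernel implementation:
struct read_priv_sketch and dec_and_test_sketch() are hypothetical names, and
atomic_fetch_sub() from <stdatomic.h> stands in for the kernel's atomic
helpers.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for struct btrfs_encoded_read_private. */
struct read_priv_sketch {
	atomic_int pending;
};

/*
 * User-space analogue of atomic_dec_and_test(): decrement the counter and
 * report whether this call brought it to zero, as one read-modify-write
 * operation. Exactly one caller can observe the transition to zero.
 */
static bool dec_and_test_sketch(atomic_int *v)
{
	/* atomic_fetch_sub() returns the value held before the subtraction. */
	return atomic_fetch_sub(v, 1) == 1;
}
```

In this sketch only the caller that gets true back would go on to wake the
waiter or release the private data.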
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Suggested-by: Damien Le Moal <Damien.LeMoal@wdc.com>
Fixes: 1881fba89b ("btrfs: add BTRFS_IOC_ENCODED_READ ioctl")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
- Constify range_contains() input parameters to prevent changes.
- Add support for displaying RCD capabilities in sysfs to support lspci for CXL device.
- Downgrade warning message to debug in cxl_probe_component_regs().
- Add support for adding a printf specifier '%pra' to emit 'struct range' content.
- Add sanity tests for 'struct resource'.
- Add documentation for special case.
- Add %pra for 'struct range'.
- Add %pra usage in CXL code.
- Add preparation code for DCD support
- Add range_overlaps().
- Add CDAT DSMAS table shared and read only flag in ACPICA.
- Add documentation to 'struct dev_dax_range'.
- Delay event buffer allocation in CXL PCI code until needed.
- Use guard() in cxl_dpa_set_mode().
- Refactor create region code to consolidate common code.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE5DAy15EJMCV1R6v9YGjFFmlTOEoFAmc84dMACgkQYGjFFmlT
OEoGTg//cSJlQ9X7+xZDbngnzpJwcLzQkR/FXDfe3obtmgs7woDJgNNcYnKSlgyf
wal47Q0UM/1Hv8Dtfrt62Ay1fmOvDL2GSpey35NVJGCEpIsfOqqk1zTCgfgwRHTO
MZJLnOSFUIlDYlVz8ljLNHnNqPjr7dCoUh9tdBefvkw59FqbkHNcWI8hG1lh1SR4
2frtJcqVg54S6vJa2eeWmNVpxz7RZvPFrb8TJzhdrGM8PkTMNFA2oJINAf0j00Ev
8/T6HXTxXvFtNhBH0dtMO1MFh1d6Qr/zFnX/gmrnPWl1l/12HFDMBIZIzq/Whjpo
+7hQ5xK3cwkMevFgFrAhwdZMj8maR84x1dbFItoThaoeDIQ4sGfyQEMPsbkZP/Sc
67i5hQFIBZc+ORLB0W+z9Da52ZFGyVw/xsCmDRzXCw4s7N2twpydIoA7Pvu9NN1X
3JVF35NrsRZ+PyuGWEitNjo0Rj6swNpBC5Xv/T1mgFtSgvVuk1T2QtSHJcPoQyzQ
zbijsCKmvJYbdJBnPiotdrBs1BUxBsP9dBT9IxWzMy6lcEpTJrYpUheRCk2tSHFa
Kk8O8IYNiBKZaSpN9UHKaGzr43H8gNbLf4svSIiu1lZJTSSdtWqfZZYjXFBgB1Vb
l2gBCDmPJ0y7WKZSCa53UmQiOusr+l3Pi+OflZEfCy6JxbSqTTM=
=GNlu
-----END PGP SIGNATURE-----
Merge tag 'cxl-for-6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl
Pull cxl updates from Dave Jiang:
- Constify range_contains() input parameters to prevent changes
- Add support for displaying RCD capabilities in sysfs to support lspci
for CXL device
- Downgrade warning message to debug in cxl_probe_component_regs()
- Add support for adding a printf specifier '%pra' to emit 'struct
range' content:
- Add sanity tests for 'struct resource'
- Add documentation for special case
- Add %pra for 'struct range'
- Add %pra usage in CXL code
- Add preparation code for DCD support:
- Add range_overlaps()
- Add CDAT DSMAS table shared and read only flag in ACPICA
- Add documentation to 'struct dev_dax_range'
- Delay event buffer allocation in CXL PCI code until needed
- Use guard() in cxl_dpa_set_mode()
- Refactor create region code to consolidate common code
* tag 'cxl-for-6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl:
cxl/region: Refactor common create region code
cxl/hdm: Use guard() in cxl_dpa_set_mode()
cxl/pci: Delay event buffer allocation
dax: Document struct dev_dax_range
ACPI/CDAT: Add CDAT/DSMAS shared and read only flag values
range: Add range_overlaps()
cxl/cdat: Use %pra for dpa range outputs
printf: Add print format (%pra) for struct range
Documentation/printf: struct resource add start == end special case
test printf: Add very basic struct resource tests
cxl: downgrade a warning message to debug level in cxl_probe_component_regs()
cxl/pci: Add sysfs attribute for CXL 1.1 device link status
cxl/core/regs: Add rcd_pcie_cap initialization
kernel/range: Const-ify range_contains parameters
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmc7S40QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpjHVD/43rDZ8ehs+IAAr6S0RemNX1SRG0mK2UOEb
kMoNogS7StO/c4JYW3JuzCyLRn5ZsgeWV/muqxwDEWQrmTGrvi+V45KikrZPwm3k
p0ump33qV9EU2jiR1MKZjtwK2P0CI7/DD3W8ww6IOvKbTT7RcqQcdHznvXArFBtc
xCuQPpayFG7ZasC+N9VaBwtiUEVgU3Ek9AFT7UVZRWajjHPNalQwaooJWayO0rEG
KdoW5yG0ryLrgCY2ACSvRLS+2s14EJtb8hgT08WKHTNgd5LxhSKxfsTapamua+7U
FdVS6Ij0tEkgu2jpvgj7QKO0Uw10Cnep2gj7RHts/LVewvkliS6XcheOzqRS1jWU
I2EI+UaGOZ11OUiw52VIveEVS5zV/NWhgy5BSP9LYEvXw0BUAHRDYGMem8o5G1V1
SWqjIM1UWvcQDlAnMF9FDVzojvjVUmYWvcAlFFztO8J0B7SavHR3NcfHwEf57reH
rNoUbi/9c4/wjJJF33gejiR5pU+ewy/Mk75GrtX3xpEqlztfRbf9/FbPCMEAO1KR
DF/b3lkUV9i2/BRW6a0SpZ5RDSmSYMnateel6TrPyVSRnpiSSFO8FrbynwUOa17b
6i49YDFWzzXOrR1YWDg6IEtTrcmBEmvi7F6aoDs020qUnL0hwLn1ZuoIxuiFEpor
Z0iFF1B/nw==
=PWTH
-----END PGP SIGNATURE-----
Merge tag 'for-6.13/block-20241118' of git://git.kernel.dk/linux
Pull block updates from Jens Axboe:
- NVMe updates via Keith:
- Use uring_cmd helper (Pavel)
- Host Memory Buffer allocation enhancements (Christoph)
- Target persistent reservation support (Guixin)
- Persistent reservation tracing (Guixin)
- NVMe 2.1 specification support (Keith)
- Rotational Meta Support (Matias, Wang, Keith)
- Volatile cache detection enhancement (Guixin)
- MD updates via Song:
- Maintainers update
- raid5 sync IO fix
- Enhance handling of faulty and blocked devices
- raid5-ppl atomic improvement
- md-bitmap fix
- Support for manually defining embedded partition tables
- Zone append fixes and cleanups
- Stop sending the queued requests in the plug list to the driver
->queue_rqs() handler in reverse order.
- Zoned write plug cleanups
- Clean up disk stats tracking and add support for disk stats for
passthrough IO
- Add preparatory support for file system atomic writes
- Add lockdep support for queue freezing. Already found a bunch of
issues, and some fixes for that are in here. More will be coming.
- Fix race between queue stopping/quiescing and IO queueing
- ublk recovery improvements
- Fix ublk mmap for 64k pages
- Various fixes and cleanups
* tag 'for-6.13/block-20241118' of git://git.kernel.dk/linux: (118 commits)
MAINTAINERS: Update git tree for mdraid subsystem
block: make struct rq_list available for !CONFIG_BLOCK
block/genhd: use seq_put_decimal_ull for diskstats decimal values
block: don't reorder requests in blk_mq_add_to_batch
block: don't reorder requests in blk_add_rq_to_plug
block: add a rq_list type
block: remove rq_list_move
virtio_blk: reverse request order in virtio_queue_rqs
nvme-pci: reverse request order in nvme_queue_rqs
btrfs: validate queue limits
block: export blk_validate_limits
nvmet: add tracing of reservation commands
nvme: parse reservation commands's action and rtype to string
nvmet: report ns's vwc not present
md/raid5: Increase r5conf.cache_name size
block: remove the ioprio field from struct request
block: remove the write_hint field from struct request
nvme: check ns's volatile write cache not present
nvme: add rotational support
nvme: use command set independent id ns if available
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmc0zT4ACgkQxWXV+ddt
WDtThRAAhzSSiHcJqTfCL5nHh7w85MNEVw28o1ETgXSYJmx0JOWLE7Znlp2FV7jj
IbYkFfF2gXJzYvRZkcXB/TAHV9KJG5yZIBZfccbM+9db9f8xkImVKMuqQRXPU41R
ppSCmqZTeujtt8ucsaJkMpm6pzECKJCJaGOsMJ8fiqKpo89dKO3eGAVboSbpPF4C
r0YmppiBwSP/cCXQCqWxZRbqPGN+lUgZpIGNRi157kehfmRHlVVJTO1pgqK8PCXb
uIT09Kulppfez8+1A10CPcniDTyinLik/qLTNlzdWoDBL4iNJMg0A0wsA04AJVf0
PdOS0REusiv3QcEIO6PefuRFRRfXcSLPpPDUceltJT5O0uM2gUqf2C7dEHXUGU3o
TdgYlbQpsJWpZ7VGWQDZeGGV04lOPQvu0LGLPgEerUQd5H9ABa0dX8Fn0sPhKsa8
whpAcdfE4rdNxB2OJFnqQeFq0z3cSjP/rvKlluCmAj97QYI+kiu3QyhemcT1YSC9
U7n5Ya9IzIYCN3ml54q3hEgyD0IVGGG20GuUmqC9XSP9mrQRC8I1g7v26AiOTrrk
VhgSdtMmphDxXudifsnYMaQ0Z1QqiUrW1SM/prAEOnBYCo75+HDsTgrq9ithgHoI
4xz4YXJyMRs18qfTJctXC1wmGuz5plTdQrwarHdNsELN5HEyqX4=
=aAcf
-----END PGP SIGNATURE-----
Merge tag 'for-6.13-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"Changes outside of btrfs: add io_uring command flag to track a dying
task (the rest will go via the block git tree).
User visible changes:
- wire encoded read (ioctl) to io_uring commands, this can be used on
its own, in the future this will allow 'send' to be asynchronous. As
a consequence, the encoded read ioctl can also work in non-blocking
mode
- new ioctl to wait for cleaned subvolumes, no need to use the
generic and root-only SEARCH_TREE ioctl, will be used by "btrfs
subvol sync"
- recognize different paths/symlinks for the same devices and don't
report them during rescanning, this can be observed with LVM or DM
- seeding device use case change, the sprout device (the one
capturing new writes) will not clear the read-only status of the
super block; this prevents accumulating space from deleted
snapshots
Performance improvements:
- reduce lock contention when traversing extent buffers
- reduce extent tree lock contention when searching for inline
backref
- switch from rb-trees to xarray for delayed ref tracking,
improvements due to better cache locality, branching factors and
more compact data structures
- enable extent map shrinker again (prevent memory exhaustion under
some types of IO load), reworked to run in a single worker thread
(there used to be problems causing long stalls under memory
pressure)
Core changes:
- raid-stripe-tree feature updates:
- make device replace and scrub work
- implement partial deletion of stripe extents
- new selftests
- split the config option BTRFS_DEBUG and add EXPERIMENTAL for
features that are experimental or with known problems so we don't
misuse debugging config for that
- subpage mode updates (sector < page):
- update compression implementations
- update writepage, writeback
- continued folio API conversions:
- buffered writes
- make buffered write copy one page at a time, preparatory work for
future integration with large folios, may cause performance drop
- proper locking of root item regarding starting send
- error handling improvements
- code cleanups and refactoring:
- dead code removal
- unused parameter reduction
- lockdep assertions"
* tag 'for-6.13-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (119 commits)
btrfs: send: check for read-only send root under critical section
btrfs: send: check for dead send root under critical section
btrfs: remove check for NULL fs_info at btrfs_folio_end_lock_bitmap()
btrfs: fix warning on PTR_ERR() against NULL device at btrfs_control_ioctl()
btrfs: fix a typo in btrfs_use_zone_append
btrfs: avoid superfluous calls to free_extent_map() in btrfs_encoded_read()
btrfs: simplify logic to decrement snapshot counter at btrfs_mksnapshot()
btrfs: remove hole from struct btrfs_delayed_node
btrfs: update stale comment for struct btrfs_delayed_ref_node::add_list
btrfs: add new ioctl to wait for cleaned subvolumes
btrfs: simplify range tracking in cow_file_range()
btrfs: remove conditional path allocation in btrfs_read_locked_inode()
btrfs: push cleanup into btrfs_read_locked_inode()
io_uring/cmd: let cmds to know about dying task
btrfs: add struct io_btrfs_cmd as type for io_uring_cmd_to_pdu()
btrfs: add io_uring command for encoded reads (ENCODED_READ ioctl)
btrfs: move priv off stack in btrfs_encoded_read_regular_fill_pages()
btrfs: don't sleep in btrfs_encoded_read() if IOCB_NOWAIT is set
btrfs: change btrfs_encoded_read() so that reading of extent is done by caller
btrfs: remove pointless iocb::ki_pos addition in btrfs_encoded_read()
...
Making sure that struct fd instances are destroyed in the same
scope where they'd been created, getting rid of reassignments
and passing them by reference, converting to CLASS(fd{,_pos,_raw}).
We are getting very close to having the memory safety of that stuff
trivial to verify.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZzdikAAKCRBZ7Krx/gZQ
69nJAQCmbQHK3TGUbQhOw6MJXOK9ezpyEDN3FZb4jsu38vTIdgEA6OxAYDO2m2g9
CN18glYmD3wRyU6Bwl4vGODouSJvDgA=
=gVH3
-----END PGP SIGNATURE-----
Merge tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull 'struct fd' class updates from Al Viro:
"The bulk of struct fd memory safety stuff
Making sure that struct fd instances are destroyed in the same scope
where they'd been created, getting rid of reassignments and passing
them by reference, converting to CLASS(fd{,_pos,_raw}).
We are getting very close to having the memory safety of that stuff
trivial to verify"
* tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (28 commits)
deal with the last remaing boolean uses of fd_file()
css_set_fork(): switch to CLASS(fd_raw, ...)
memcg_write_event_control(): switch to CLASS(fd)
assorted variants of irqfd setup: convert to CLASS(fd)
do_pollfd(): convert to CLASS(fd)
convert do_select()
convert vfs_dedupe_file_range().
convert cifs_ioctl_copychunk()
convert media_request_get_by_fd()
convert spu_run(2)
switch spufs_calls_{get,put}() to CLASS() use
convert cachestat(2)
convert do_preadv()/do_pwritev()
fdget(), more trivial conversions
fdget(), trivial conversions
privcmd_ioeventfd_assign(): don't open-code eventfd_ctx_fdget()
o2hb_region_dev_store(): avoid goto around fdget()/fdput()
introduce "fd_pos" class, convert fdget_pos() users to it.
fdget_raw() users: switch to CLASS(fd_raw)
convert vmsplice() to CLASS(fd)
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZzcUQAAKCRCRxhvAZXjc
onEpAQCUdwIBHpwmSIFvJFA9aNGpbLzi0dDSEIxuWYtp5qVuogD+ImccwqpG3kEi
Zq9vokdPpB1zbahxKl1mkvBG4G0GFQE=
=LbP6
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.13.pagecache' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs pagecache updates from Christian Brauner:
"Cleanup filesystem page flag usage: This continues the work to make
the mappedtodisk/owner_2 flag available to filesystems which don't use
buffer heads. Further patches remove uses of Private2. This brings us
very close to being rid of it entirely"
* tag 'vfs-6.13.pagecache' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
migrate: Remove references to Private2
ceph: Remove call to PagePrivate2()
btrfs: Switch from using the private_2 flag to owner_2
mm: Remove PageMappedToDisk
nilfs2: Convert nilfs_copy_buffer() to use folios
fs: Move clearing of mappedtodisk to buffer.c
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZzcToAAKCRCRxhvAZXjc
osL9AP948FFumJRC28gDJ4xp+X4eohNOfkgoEG8FTbF2zU6ulwD+O0pr26FqpFli
pqlG+38UdATImpfqqWjPbb72sBYcfQg=
=wLUh
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.13.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"Features:
- Fixup and improve NLM and kNFSD file lock callbacks
Last year both GFS2 and OCFS2 had some work done to make their
locking more robust when exported over NFS. Unfortunately, part of
that work caused both NLM (for NFSv3 exports) and kNFSD (for
NFSv4.1+ exports) to no longer send lock notifications to clients.
This in itself is not a huge problem because most NFS clients will
still poll the server in order to acquire a conflicted lock.
It's important for NLM and kNFSD that they do not block their
kernel threads inside filesystem's file_lock implementations
because that can produce deadlocks. We used to make sure of this by
only trusting that posix_lock_file() can correctly handle blocking
lock calls asynchronously, so the lock managers would only setup
their file_lock requests for async callbacks if the filesystem did
not define its own lock() file operation
However, when GFS2 and OCFS2 grew the capability to correctly
handle blocking lock requests asynchronously, they started
signalling this behavior with EXPORT_OP_ASYNC_LOCK, and the check
for also trusting posix_lock_file() was inadvertently dropped, so
now most filesystems no longer produce lock notifications when
exported over NFS
Fix this by using an fop_flag which greatly simplifies the problem
and grooms the way for future uses by both filesystems and lock
managers alike
- Add a sysctl to delete the dentry when a file is removed instead of
making it a negative dentry
Commit 681ce86235 ("vfs: Delete the associated dentry when
deleting a file") introduced an unconditional deletion of the
associated dentry when a file is removed. However, this led to
performance regressions in specific benchmarks, such as
filebench.sum_operations/s, prompting a revert in commit
4a4be1ad3a ("Revert "vfs: Delete the associated dentry when
deleting a file""). This reintroduces the concept conditionally
through a sysctl
- Expand the statmount() system call:
* Report the filesystem subtype in a new fs_subtype field to
e.g., report fuse filesystem subtypes
* Report the superblock source in a new sb_source field
* Add a new way to return filesystem specific mount options in an
option array that returns filesystem specific mount options
separated by zero bytes and unescaped. This allows callers to
retrieve filesystem specific mount options and immediately pass
them to e.g., fsconfig() without having to unescape or split
them
* Report security (LSM) specific mount options in a separate
security option array. We don't lump them together with
filesystem specific mount options as security mount options are
generic and most users aren't interested in them
The format is the same as for the filesystem specific mount
option array
- Support relative paths in fsconfig()'s FSCONFIG_SET_STRING command
- Optimize acl_permission_check() to avoid costly {g,u}id ownership
checks if possible
- Use smp_mb__after_spinlock() to avoid full smp_mb() in evict()
- Add synchronous wakeup support for ep_poll_callback.
Currently, epoll only uses wake_up() to wake up a task. But sometimes
there are epoll users which want to use the synchronous wakeup flag
to give a hint to the scheduler, e.g., the Android binder driver.
So add a wake_up_sync() define, and use wake_up_sync() when sync is
true in ep_poll_callback()
Fixes:
- Fix kernel documentation for inode_insert5() and iget5_locked()
- Annotate racy epoll check on file->f_ep
- Make F_DUPFD_QUERY associative
- Avoid filename buffer overrun in initramfs
- Don't let statmount() return empty strings
- Add a cond_resched() to dump_user_range() to avoid hogging the CPU
- Don't query the device logical blocksize multiple times for hfsplus
- Make filemap_read() check that the offset is positive or zero
Cleanups:
- Various typo fixes
- Cleanup wbc_attach_fdatawrite_inode()
- Add __releases annotation to wbc_attach_and_unlock_inode()
- Add hugetlbfs tracepoints
- Fix various vfs kernel doc parameters
- Remove obsolete TODO comment from io_cancel()
- Convert wbc_account_cgroup_owner() to take a folio
- Fix comments for BANDWIDTH_INTERVAL and wb_domain_writeout_add()
- Reorder struct posix_acl to save 8 bytes
- Annotate struct posix_acl with __counted_by()
- Replace one-element array with flexible array member in freevxfs
- Use idiomatic atomic64_inc_return() in alloc_mnt_ns()"
* tag 'vfs-6.13.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (35 commits)
statmount: retrieve security mount options
vfs: make evict() use smp_mb__after_spinlock instead of smp_mb
statmount: add flag to retrieve unescaped options
fs: add the ability for statmount() to report the sb_source
writeback: wbc_attach_fdatawrite_inode out of line
writeback: add a __releases annoation to wbc_attach_and_unlock_inode
fs: add the ability for statmount() to report the fs_subtype
fs: don't let statmount return empty strings
fs:aio: Remove TODO comment suggesting hash or array usage in io_cancel()
hfsplus: don't query the device logical block size multiple times
freevxfs: Replace one-element array with flexible array member
fs: optimize acl_permission_check()
initramfs: avoid filename buffer overrun
fs/writeback: convert wbc_account_cgroup_owner to take a folio
acl: Annotate struct posix_acl with __counted_by()
acl: Realign struct posix_acl to save 8 bytes
epoll: Add synchronous wakeup support for ep_poll_callback
coredump: add cond_resched() to dump_user_range
mm/page-writeback.c: Fix comment of wb_domain_writeout_add()
mm/page-writeback.c: Update comment for BANDWIDTH_INTERVAL
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZzcScQAKCRCRxhvAZXjc
oj+5AP4k822a77wc/3iPFk379naIvQ4dsrgemh0/Pb6ZvzvkFQEAi3vFCfzCDR2x
SkJF/RwXXKZv6U31QXMRt2Qo6wfBuAc=
=nVlm
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.13.mgtime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs multigrain timestamps from Christian Brauner:
"This is another try at implementing multigrain timestamps. This time
with significant help from the timekeeping maintainers to reduce the
performance impact.
Thomas provided a base branch that contains the required timekeeping
interfaces for the VFS. It serves as the base for the multi-grain
timestamp work:
- Multigrain timestamps allow the kernel to use fine-grained
timestamps when an inode's attributes are being actively observed
via ->getattr(). With this support, it's possible for a file to get
a fine-grained timestamp, and another modified after it to get a
coarse-grained stamp that is earlier than the fine-grained time. If
this happens then the files can appear to have been modified in
reverse order, which breaks VFS ordering guarantees.
To prevent this, a floor value is maintained for multigrain
timestamps. Whenever a fine-grained timestamp is handed out, record
it, and when later coarse-grained stamps are handed out, ensure
they are not earlier than that value. If the coarse-grained
timestamp is earlier than the fine-grained floor, return the floor
value instead.
The timekeeper changes add a static singleton atomic64_t into
timekeeper.c that is used to keep track of the latest fine-grained
time ever handed out. This is tracked as a monotonic ktime_t value
to ensure that it isn't affected by clock jumps. Because it is
updated at different times than the rest of the timekeeper object,
the floor value is managed independently of the timekeeper via a
cmpxchg() operation, and sits on its own cacheline.
Two new public timekeeper interfaces are added:
(1) ktime_get_coarse_real_ts64_mg() fills a timespec64 with the
later of the coarse-grained clock and the floor time
(2) ktime_get_real_ts64_mg() gets the fine-grained clock value,
and tries to swap it into the floor. A timespec64 is filled
with the result.
- The VFS has always used coarse-grained timestamps when updating the
ctime and mtime after a change. This has the benefit of allowing
filesystems to optimize away a lot of metadata updates, down to around
1 per jiffy, even when a file is under heavy writes.
Unfortunately, this has always been an issue when we're exporting
via NFSv3, which relies on timestamps to validate caches. A lot of
changes can happen in a jiffy, so timestamps aren't sufficient to
help the client decide when to invalidate the cache. Even with
NFSv4, a lot of exported filesystems don't properly support a
change attribute and are subject to the same problems with
timestamp granularity. Other applications have similar issues with
timestamps (e.g. backup applications).
If we were to always use fine-grained timestamps, that would
improve the situation, but that becomes rather expensive, as the
underlying filesystem would have to log a lot more metadata
updates.
This adds a way to only use fine-grained timestamps when they are
being actively queried. Use the (unused) top bit in
inode->i_ctime_nsec as a flag that indicates whether the current
timestamps have been queried via stat() or the like. When it's set,
we allow the kernel to use a fine-grained timestamp iff it's
necessary to make the ctime show a different value.
This solves the problem of being able to distinguish the timestamp
between updates, but introduces a new problem: it's now possible
for a file being changed to get a fine-grained timestamp. A file
that is altered just a bit later can then get a coarse-grained one
that appears older than the earlier fine-grained time. This
violates timestamp ordering guarantees.
This is where the earlier mentioned timekeeping interfaces help. A
global monotonic atomic64_t value is kept that acts as a timestamp
floor. When we go to stamp a file, we first get the later of the
current floor value and the current coarse-grained time. If the
inode ctime hasn't been queried then we just attempt to stamp it
with that value.
If it has been queried, then first see whether the current coarse
time is later than the existing ctime. If it is, then we accept
that value. If it isn't, then we get a fine-grained time and try to
swap that into the global floor. Whether that succeeds or fails, we
take the resulting floor time, convert it to realtime and try to
swap that into the ctime.
We take the result of the ctime swap whether it succeeds or fails,
since either is just as valid.
Filesystems can opt into this by setting the FS_MGTIME fstype flag.
Others should be unaffected (other than being subject to the same
floor value as multigrain filesystems)"
* tag 'vfs-6.13.mgtime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: reduce pointer chasing in is_mgtime() test
tmpfs: add support for multigrain timestamps
btrfs: convert to multigrain timestamps
ext4: switch to multigrain timestamps
xfs: switch to multigrain timestamps
Documentation: add a new file documenting multigrain timestamps
fs: add percpu counters for significant multigrain timestamp events
fs: tracepoints around multigrain timestamp events
fs: handle delegated timestamps in setattr_copy_mgtime
timekeeping: Add percpu counter for tracking floor swap events
timekeeping: Add interfaces for handling timestamps with a floor value
fs: have setattr_copy handle multigrain timestamps appropriately
fs: add infrastructure for multigrain timestamps
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmc2aGIACgkQxWXV+ddt
WDu0AQ/9FLfC/e3X2GjZ0auna/7A/rF8MPoATUdAyHn75Md6Hc8PXpi1YvMph+ba
pvoufOqrU66/g0UNeacgsp6rF4rKJHg0q9Id+7wueLnDr41g9paXjsLYItq4j26w
GusDiZvUDwuDmb70vlTXrgAfnjooIdSwqJTlIzJxvl4wrNzOiUlUJtTMzmUrwn/9
Lf/iByWlGcPKKBc+1ZzFz4HlVOZZSt9YePeJw2/Aul2OMtuI3RTTAL/NtjaFIlYc
pb+NHVqFrrfgC+xo68hLBmnsBfS41EGR58rYRjEuQo0+hARa8WbxL3DNA/E/Vi5X
dsq/wQVlD7IVIWCoF9J94/iyDdwlDOGFMoL6FUrJwDtPGN/v/xxtA6ruvuC7k5zy
bHCR8ZVrJWVaxE7u0Gtl+hFPpDTwNTR7SfvK69gxPfci1cN0m2wCNK02SEUJwV09
N82N2ENGGwyWS+nOl/ERB+7A0QxViMr3JpUrPzSYqsmn8bwDvovSjK2fFouJoSey
bpAzbFWj+OS0O9nnRqabTJDM/Tk9O0s0Ye76aUS+Vfk9d5EuVfAg6pHiOBcFDhsK
UEG9QbPltfh6LPDHCdV93HOOsC0uNxCTCSpbQ9LFGKBICQsPIX/vZeV45fNFJDLX
j5kEtHFVU3snU+jA97nvYXPRANDnnNx/EzXv7zo0Ye8L+plecBs=
=ssYj
-----END PGP SIGNATURE-----
Merge tag 'for-6.12-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fix from David Sterba:
"One more fix that seems urgent and good to have in 6.12 final.
It could potentially lead to unexpected transaction aborts, due to
wrong comparison and order of processing of delayed refs"
* tag 'for-6.12-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix incorrect comparison for delayed refs
When I reworked delayed ref comparison in cf4f04325b ("btrfs: move
->parent and ->ref_root into btrfs_delayed_ref_node"), I made a mistake
and returned -1 for the case where ref1->ref_root was greater than
ref2->ref_root. This is a subtle bug that can result in improper
delayed ref running order, which can result in transaction aborts.
Fixes: cf4f04325b ("btrfs: move ->parent and ->ref_root into btrfs_delayed_ref_node")
CC: stable@vger.kernel.org # 6.10+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Call blk_validate_limits on the queue limits used for zone append
splitting so that calculated values get filled in and any stacking
conflicts get caught.
Without this there isn't a max_zone_append_sectors limit as of commit
559218d43e ("block: pre-calculate max_zone_append_sectors").
Fixes: 559218d43e ("block: pre-calculate max_zone_append_sectors")
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20241113084541.34315-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We're checking if the send root is read-only without being under the
protection of the root's root_item_lock spinlock, which is what protects
the root's flags when clearing the read-only flag, done at
btrfs_ioctl_subvol_setflags(). Furthermore, it should be done in the
same critical section that increments the root's send_in_progress counter,
as btrfs_ioctl_subvol_setflags() clears the read-only flag in the same
critical section that checks the counter's value.
So fix this by moving the read-only check under the critical section
delimited by the root's root_item_lock which also increments the root's
send_in_progress counter.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We're checking if the send root is dead without the protection of the
root's root_item_lock spinlock, which is what protects the root's flags.
The inverse, setting the dead flag on a root, is done under the protection
of that lock, at btrfs_delete_subvolume(). Also checking and updating the
root's send_in_progress counter is supposed to be done in the same
critical section as checking for or setting the root dead flag, so that
these operations are done atomically as a single step (which is correctly
done by btrfs_delete_subvolume()).
So fix this by checking if the send root is dead in the same critical
section that updates the send_in_progress counter, which is protected by
the root's root_item_lock spinlock.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Smatch complains about possibly dereferencing a NULL fs_info at
btrfs_folio_end_lock_bitmap():
fs/btrfs/subpage.c:332 btrfs_folio_end_lock_bitmap() warn: variable dereferenced before check 'fs_info' (see line 326)
because we access fs_info to set the 'start_bit' variable before doing the
check for a NULL fs_info.
However fs_info is never NULL, since the only caller of
btrfs_folio_end_lock_bitmap() is extent_writepage(), where we have an
inode which always has a non-NULL fs_info.
So remove the check for a NULL fs_info at btrfs_folio_end_lock_bitmap().
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Smatch complains about calling PTR_ERR() against a NULL pointer:
fs/btrfs/super.c:2272 btrfs_control_ioctl() warn: passing zero to 'PTR_ERR'
Fix this by calling PTR_ERR() against the device pointer only if it
contains an error.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
REQ_OP_ZONE_APPNED -> REQ_OP_ZONE_APPEND.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Change the control flow of btrfs_encoded_read() so that it doesn't call
free_extent_map() when we know that this has already been done.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Mark Harmstone <maharmstone@fb.com>
Suggested-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's no point in having a 'snapshot_force_cow' variable to track if we
need to decrement the root->snapshot_force_cow counter, as we never jump
to the 'out' label after incrementing the counter. Simplify this by
removing the variable and always decrementing the counter before the 'out'
label, right after the call to btrfs_mksubvol().
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The comment refers to a list in the respective delayed ref head that no
longer exists (ref_list), it was replaced with a rbtree (ref_tree) in
commit 0e0adbcfdc ("btrfs: track refs in a rb_tree instead of a list").
So update the stale comment to refer to the rbtree instead of the old
list.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add a new unprivileged ioctl that will let the command
'btrfs subvolume sync' work without the (privileged) SEARCH_TREE ioctl.
There are several modes of operation, where the most common ones are to
wait on a specific subvolume or all currently queued for cleaning. This
is utilized e.g. in backup applications that delete subvolumes and wait
until they're cleaned to check for remaining space.
The other modes are for flexibility, e.g. for monitoring or
checkpoints in the queue of deleted subvolumes, again without the need
to use SEARCH_TREE.
Notes:
- waiting is interruptible, the timeout is set to 1 second and is not
configurable
- repeated calls to the ioctl see a different state, so this is
inherently racy when using e.g. the count or peek next/last
Use cases:
- a subvolume A was deleted, wait for cleaning (WAIT_FOR_ONE)
- a bunch of subvolumes were deleted, wait for all (WAIT_FOR_QUEUED or
PEEK_LAST + WAIT_FOR_ONE)
- count how many are queued (not blocking), for monitoring purposes
- report progress (PEEK_NEXT), may miss some if cleaning is quick
- own waiting in user space (PEEK_LAST until it's 0)
Signed-off-by: David Sterba <dsterba@suse.com>
Simplify tracking of the range processed by using cur_alloc_size only to
store the reserved part that may fail to the allocated extent. Remove
the ram_size as well since it is always equal to cur_alloc_size in the
context. Advance the start in normal path until extent allocation
succeeds and keep the start unchanged in the error handling path.
Passed the fstests generic/475 test a hundred times with quota
enabled, and a modified generic/475 test with the sleep time removed
another hundred times. About one tenth of the runs do enter the error
handling path due to failing to reserve an extent.
Suggested-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Haisu Wang <haisuwang@tencent.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Remove conditional path allocation from btrfs_read_locked_inode(). Add
an ASSERT(path) to indicate it should never be called with a NULL path.
Call btrfs_read_locked_inode() directly from btrfs_iget(). This causes
code duplication between btrfs_iget() and btrfs_iget_path(), but I
think this is justifiable as it removes the need for conditionally
allocating the path inside of btrfs_read_locked_inode(). This makes the
code easier to reason about and makes it clear who has the
responsibility of allocating and freeing the path.
Signed-off-by: Leo Martins <loemra.dev@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Move btrfs_add_inode_to_root() so it can be called from
btrfs_read_locked_inode(), no changes were made to the function.
Move cleanup code from btrfs_iget_path() to btrfs_read_locked_inode.
This improves readability and improves a leaky abstraction. Previously
btrfs_iget_path() had to handle a positive error case as a result of a
call to btrfs_search_slot(), but it makes more sense to handle this
closer to the source of the call.
Signed-off-by: Leo Martins <loemra.dev@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add struct io_btrfs_cmd as a wrapper type for io_uring_cmd_to_pdu(),
rather than using a raw pointer.
Suggested-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Mark Harmstone <maharmstone@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add an io_uring command for encoded reads, using the same interface as
the existing BTRFS_IOC_ENCODED_READ ioctl.
btrfs_uring_encoded_read() is an io_uring version of
btrfs_ioctl_encoded_read(), which validates the user input and calls
btrfs_encoded_read() to read the appropriate metadata. If we determine
that we need to read an extent from disk, we call
btrfs_encoded_read_regular_fill_pages() through
btrfs_uring_read_extent() to prepare the bio.
The existing btrfs_encoded_read_regular_fill_pages() is changed so that
if it is passed a valid uring_ctx, rather than waking up any waiting
threads it calls btrfs_uring_read_extent_endio(). This in turn copies
the read data back to userspace, and calls io_uring_cmd_done() to
complete the io_uring command.
Because we're potentially doing a non-blocking read,
btrfs_uring_read_extent() doesn't clean up after itself if it returns
-EIOCBQUEUED. Instead, it allocates a priv struct, populates the fields
there that we will need to unlock the inode and free our allocations,
and defers this to the btrfs_uring_read_finished() that gets called when
the bio completes.
Signed-off-by: Mark Harmstone <maharmstone@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Change btrfs_encoded_read_regular_fill_pages() so that the priv struct
is allocated rather than stored on the stack, in preparation for adding
an asynchronous mode to the function.
Signed-off-by: Mark Harmstone <maharmstone@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Change btrfs_encoded_read() so that it returns -EAGAIN rather than sleeps
if IOCB_NOWAIT is set in iocb->ki_flags. The conditions that require
sleeping are: inode lock, writeback, extent lock, ordered range.
Signed-off-by: Mark Harmstone <maharmstone@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Change the behaviour of btrfs_encoded_read() so that if it needs to read
an extent from disk, it leaves the extent and inode locked and returns
-EIOCBQUEUED. The caller is then responsible for doing the I/O via
btrfs_encoded_read_regular() and unlocking the extent and inode.
Signed-off-by: Mark Harmstone <maharmstone@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
iocb->ki_pos isn't used after this function, so there's no point in
changing its value.
Signed-off-by: Mark Harmstone <maharmstone@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
After the previous patch, which converted the rb-tree used to track
delayed ref heads into an xarray, the find_ref_head() function is now
used only by one caller which always passes false to the 'return_bigger'
argument. So remove the 'return_bigger' logic, simplifying the function,
and move all the function code to the single caller.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we use a red black tree (rb-tree) to track the delayed ref
heads (in struct btrfs_delayed_ref_root::href_root). This however is not
very efficient when the number of delayed ref heads is large (and it's
very common to be at least in the order of thousands) since rb-trees are
binary trees. For example for 10K delayed ref heads, the tree has a depth
of 13. Besides that, inserting into the tree requires navigating through
it and pulling useless cache lines in the process since the red black tree
nodes are embedded within the delayed ref head structure - on the other
hand, by being embedded, it requires no extra memory allocations.
We can improve this by using an xarray instead which has a much higher
branching factor than a red black tree (binary balanced tree) and is more
cache friendly and behaves like a resizable array, with a much better
search and insertion complexity than a red black tree. This only has one
small disadvantage which is that insertion will sometimes require
allocating memory for the xarray - which may fail (not that often since
it uses a kmem_cache) - but on the other hand we can reduce the delayed
ref head structure size by 24 bytes (from 152 down to 128 bytes) after
removing the embedded red black tree node, meaning that we can now fit
32 delayed ref heads per 4K page instead of 26, and that gain compensates
for the occasional memory allocations needed for the xarray nodes. We
also end up using only 2 cache lines instead of 3 per delayed ref head.
Running the following fs_mark test showed some improvements:
$ cat test.sh
#!/bin/bash
DEV=/dev/nullb0
MNT=/mnt/nullb0
MOUNT_OPTIONS="-o ssd"
FILES=100000
THREADS=$(nproc --all)
echo "performance" | \
tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
mkfs.btrfs -f $DEV
mount $MOUNT_OPTIONS $DEV $MNT
OPTS="-S 0 -L 5 -n $FILES -s 0 -t $THREADS -k"
for ((i = 1; i <= $THREADS; i++)); do
OPTS="$OPTS -d $MNT/d$i"
done
fs_mark $OPTS
umount $MNT
Before this patch:
FSUse% Count Size Files/sec App Overhead
10 1200000 0 171845.7 12253839
16 2400000 0 230898.7 12308254
23 3600000 0 212292.9 12467768
30 4800000 0 195737.8 12627554
46 6000000 0 171055.2 12783329
After this patch:
FSUse% Count Size Files/sec App Overhead
10 1200000 0 173835.0 12246131
16 2400000 0 233537.8 12271746
23 3600000 0 220398.7 12307737
30 4800000 0 204483.6 12392318
40 6000000 0 182923.3 12771843
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add some comments to struct btrfs_delayed_ref_root's fields to mention
what its spinlock protects.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The delayed refs lock must be held when calling add_delayed_ref_head(),
so assert that it's being held.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The delayed refs lock must be held when calling find_first_ref_head(), so
assert that it's being held.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have 3 callers for find_ref_head() so assert at find_ref_head() that we
have the delayed refs lock held, removing the assertion from one of its
callers (btrfs_find_delayed_ref_head()).
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
One of the following patches in the series will need to access fs_info at
btrfs_delete_ref_head(), so pass a fs_info argument to it.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
One of the following patches in the series will need to access fs_info in
the function find_ref_head(), so pass a fs_info argument to it as well as
to the functions btrfs_select_ref_head() and btrfs_find_delayed_ref_head()
which call find_ref_head().
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The unselect_delayed_ref_head() at extent-tree.c doesn't really belong in
that file as it's a delayed refs specific detail and therefore should be
at delayed-ref.c. Further its inverse, btrfs_select_ref_head(), is at
delayed-ref.c, so it only makes sense to have it there too.
So move unselect_delayed_ref_head() into delayed-ref.c and rename it to
btrfs_unselect_ref_head() so that its name closely matches its inverse
(btrfs_select_ref_head()).
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of doing it in two steps outside of delayed-ref.c, leaking low
level details such as locking, move the logic entirely to delayed-ref.c
under btrfs_select_ref_head(), reducing code and making things simpler
for the caller.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function only returns 0, meaning it was able to lock the delayed ref
head, or -EAGAIN in case it wasn't able to lock it. So simplify this and
use a boolean return type instead, returning true if it was able to lock
and false otherwise.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The atomic counter 'num_entries' is not used anymore, we increment it
and decrement it but then we don't ever read it to use for any logic.
Its last use was removed with commit 61a56a992f ("btrfs: delayed refs
pre-flushing should only run the heads we have"). So remove it.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of open coding it, use the find_first_ref_head() helper at
btrfs_destroy_delayed_refs(). This avoids duplicating the logic,
especially with the upcoming changes in subsequent patches.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When destroying delayed refs during a transaction abort, we have open
coded the removal of a delayed ref, which is also done by the static
helper function drop_delayed_ref(). So remove that duplicated code and
use drop_delayed_ref() instead.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The fs_info parameter is redundant because it can be extracted from the
transaction given as another parameter. So remove it and use the fs_info
accessible from the transaction.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The fs_info parameter is redundant because it can be extracted from the
transaction given as another parameter. So remove it and use the fs_info
accessible from the transaction.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's better suited at delayed-ref.c since it's about delayed refs and
contains logic to iterate over them (using the red black tree, doing all
the locking, freeing, etc), so move it from disk-io.c, which is pretty
big, into delayed-ref.c, hiding implementation details of how delayed
refs are tracked and managed. This also facilitates the next patches in
the series.
This change moves the code between files but also does the following
simple cleanups:
1) Rename the 'cache' variable to 'bg', since it's a block group
(the 'cache' logic comes from old days where the block group
structure was named 'btrfs_block_group_cache');
2) Move the 'ref' variable declaration to the scope of the inner
while loop, since it's not used outside that loop.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At btrfs_destroy_delayed_refs() it's unexpected to not find the block
group to which a delayed reference's extent belongs, so we have this
BUG_ON(), not just because it's highly unexpected but also because we
don't know what to do there.
Since we are in the transaction abort path, there's nothing we can do
other than proceed and clean up all the resources we can. So remove
the BUG_ON() and deal with a missing block group by logging an error
message, continuing to clean up everything we can that is related to the
current delayed ref head, and moving on to the other delayed refs.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When inserting an extent backref, in order to check whether refs other
than inline refs are used, we always search the tree with keep_locks set
in the path, which increases lock contention on the extent tree.
We do not need the parent node every time to determine whether normal
refs are used. It is only needed when the extent item is the last item
in a leaf.
Therefore, to reduce lock contention, we first search with keep_locks=0.
Only if the extent item happens to be the last item in the leaf do we
search a second time with keep_locks=1.
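The two-pass search can be sketched with a toy model. Everything here is hypothetical (`struct toy_leaf`, `struct toy_path`, `toy_search()`, `toy_lookup_extent()`): a sorted array stands in for a leaf, and a flag records whether the parent lock would have been kept.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy leaf: a sorted array of keys. */
struct toy_leaf {
	const unsigned long *keys;
	size_t nritems;
};

/* Hypothetical search state; 'parent_locked' models keep_locks being honored. */
struct toy_path {
	size_t slot;
	bool parent_locked;
	int searches;
};

/* Linear scan standing in for the tree search; returns true if found. */
static bool toy_search(const struct toy_leaf *leaf, unsigned long key,
		       bool keep_locks, struct toy_path *path)
{
	path->searches++;
	path->parent_locked = keep_locks;
	for (size_t i = 0; i < leaf->nritems; i++) {
		if (leaf->keys[i] == key) {
			path->slot = i;
			return true;
		}
	}
	return false;
}

/*
 * Search without keeping the parent locked first; repeat with keep_locks
 * set only when the item turns out to be the last one in the leaf.
 */
static bool toy_lookup_extent(const struct toy_leaf *leaf, unsigned long key,
			      struct toy_path *path)
{
	if (!toy_search(leaf, key, false, path))
		return false;
	if (path->slot == leaf->nritems - 1)
		return toy_search(leaf, key, true, path);
	return true;
}
```

In this model the second, more expensive search only happens for the last slot, so the common case never holds the parent lock.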
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Implement self-tests for partial deletion of RAID stripe-tree entries.
These two new tests cover both the deletion of the front of a RAID
stripe-tree stripe extent as well as truncation of an item to make it
smaller.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In our CI system, the RAID stripe tree configuration sometimes fails with
the following ASSERT():
  assertion failed: found_start >= start && found_end <= end, in fs/btrfs/raid-stripe-tree.c:64
This ASSERT() triggers because, for the initial design of the RAID
stripe-tree, I had the "one ordered-extent equals one bio" rule of zoned
btrfs in mind.
But for a RAID stripe-tree based system that is hosted on a regular
device rather than on a zoned storage device, this rule doesn't apply.
So in case the range we want to delete starts in the middle of the
previous item, grab the item and "truncate" its length. That is, clone
the item, subtract the deleted portion from the key's offset, delete the
old item and insert the new one.
In case the range to delete ends in the middle of an item, we have to
adjust both the item's key as well as the stripe extents and then
re-insert the modified clone into the tree after deleting the old stripe
extent.
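The arithmetic of the two cases can be sketched as follows. This is a simplified model, not the on-disk format: `struct toy_stripe_item` and both helpers are hypothetical, a single physical address stands in for the stripe extents, and the assumption that the physical address advances by the same amount as the logical start is mine.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical simplified stripe item: key (start, length) plus one stride. */
struct toy_stripe_item {
	uint64_t start;    /* logical start, stands in for the key's objectid */
	uint64_t length;   /* stands in for the key's offset */
	uint64_t physical; /* stands in for a stripe extent's physical address */
};

/* Deletion starts inside the item: keep the front, shorten the length. */
static struct toy_stripe_item toy_truncate_tail(struct toy_stripe_item item,
						uint64_t del_start)
{
	assert(del_start > item.start && del_start < item.start + item.length);
	item.length = del_start - item.start;
	return item;
}

/* Deletion ends inside the item: drop the front, shift key and stride. */
static struct toy_stripe_item toy_trim_front(struct toy_stripe_item item,
					     uint64_t del_end)
{
	assert(del_end > item.start && del_end < item.start + item.length);
	uint64_t diff = del_end - item.start;

	item.start += diff;
	item.length -= diff;
	item.physical += diff; /* assumption: physical moves with the logical start */
	return item;
}
```

Each helper returns a modified copy, mirroring the "clone, delete the old item, insert the new one" sequence described above.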
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When fgp_flags and gfp_flags are zero, use filemap_get_folio(A, B)
instead of __filemap_get_folio(A, B, 0, 0); there is no need for the
extra arguments 0, 0.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The buffered write path still relies heavily on the page interface.
Since we have converted it to copy page by page, it's much easier
to convert all involved functions to the folio interface. This involves:
- btrfs_copy_from_user()
- btrfs_drop_folio()
- prepare_uptodate_page()
- prepare_one_page()
- lock_and_cleanup_extent_if_need()
- btrfs_dirty_page()
All functions are changed to accept a folio parameter, and if the word
"page" is in the function name, it is changed to "folio" too.
The function btrfs_dirty_page() is exported for the v1 space cache, so
the v1 cache call site now converts its page to a folio for the new
interface.
There is also a small enhancement for prepare_one_folio(): instead of
manually waiting for the page writeback, let __filemap_get_folio()
handle that by using FGP_WRITEBEGIN, which implies
(FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE).
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently btrfs_buffered_write() prepares multiple pages at a time,
allowing better performance.
But the current trend is to support larger folios as an optimization,
instead of implementing our own multi-page optimization.
This is inspired by generic_perform_write(), which copies one folio
at a time.
Such a change prepares us to migrate to implementing the write_begin()
and write_end() callbacks, and makes every involved function a little
simpler.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_do_encoded_write() was converted to use folios in 400b172b8c,
but we're still allocating based on sizeof(struct page *) rather than
sizeof(struct folio *). There's no functional change.
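One way to avoid this kind of mismatch is to size the allocation from the variable rather than from a type name. A minimal userspace sketch follows, with `calloc()` standing in for the kernel allocator, opaque stand-ins for the kernel types, and a hypothetical helper name `alloc_folio_array()`.

```c
#include <assert.h>
#include <stdlib.h>

struct page;  /* opaque, hypothetical stand-ins for the kernel types */
struct folio;

/* Allocate a zeroed array of folio pointers, sized from the variable itself. */
static struct folio **alloc_folio_array(size_t nr)
{
	struct folio **folios = calloc(nr, sizeof(*folios));

	return folios;
}
```

Both pointer types have the same size, which is why the original mismatch caused no functional change; `sizeof(*folios)` simply keeps the expression correct if the element type changes again.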
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Mark Harmstone <maharmstone@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Remove hard-coded strings by using the str_yes_no() and str_no_yes()
helper functions.
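The effect of such a helper can be shown with a userspace stand-in. `toy_str_yes_no()` below only mirrors the behavior of the kernel's str_yes_no(), and `format_flag()` is a hypothetical caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Userspace stand-in mirroring the behavior of the kernel's str_yes_no(). */
static const char *toy_str_yes_no(bool v)
{
	return v ? "yes" : "no";
}

/* Instead of open coding 'enabled ? "yes" : "no"', call the helper. */
static int format_flag(char *buf, size_t len, const char *name, bool enabled)
{
	return snprintf(buf, len, "%s: %s", name, toy_str_yes_no(enabled));
}
```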
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since there is no user of reader locks, rename the writer locks to a
more generic name by removing the "_writer" part from the name.
Also rename btrfs_subpage::writer to btrfs_subpage::locked.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since commit d7172f52e9 ("btrfs: use per-buffer locking for
extent_buffer reading"), metadata read no longer relies on the subpage
reader locking.
This means we do not need to maintain a different metadata/data split
for locking, so we can convert the existing reader lock users by:
- add_ra_bio_pages()
  Convert to btrfs_folio_set_writer_lock()
- end_folio_read()
  Convert to btrfs_folio_end_writer_lock()
- begin_folio_read()
  Convert to btrfs_folio_set_writer_lock()
- folio_range_has_eb()
  Remove the subpage->readers checks, since it is always 0.
- Remove btrfs_subpage_start_reader() and btrfs_subpage_end_reader()
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function is not really suitable to lock a folio, as it lacks the
proper mapping checks, thus the locked folio may not even belong to
btrfs.
For that reason, the last user inside lock_delalloc_folios() has
already been removed, so we can remove this function.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If you follow the seed/sprout wiki, it suggests the following workflow:
  btrfstune -S 1 seed_dev
  mount seed_dev mnt
  btrfs device add sprout_dev
  mount -o remount,rw mnt
The first mount mounts the FS readonly, which results in not setting
BTRFS_FS_OPEN, and setting the readonly bit on the sb. The device add
somewhat surprisingly clears the readonly bit on the sb (though the
mount is still practically readonly, from the user's perspective...).
Finally, the remount checks the readonly bit on the sb against the flag
and sees no change, so it does not run the code intended to run on
ro->rw transitions, leaving BTRFS_FS_OPEN unset.
As a result, when the cleaner_kthread runs, it sees no BTRFS_FS_OPEN and
does no work. This results in leaking deleted snapshots until we run out
of space.
I propose fixing it at the first departure from what feels reasonable:
when we clear the readonly bit on the sb during device add.
A new fstest I have written reproduces the bug and confirms the fix.
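The sequence of state changes can be followed in a toy model. All names are hypothetical, the two booleans stand in for the sb readonly bit and BTRFS_FS_OPEN, and the model assumes that the fix amounts to leaving the readonly bit set during device add, which this message implies but does not spell out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy state: the sb readonly bit and the BTRFS_FS_OPEN flag. */
struct toy_fs {
	bool sb_rdonly;
	bool fs_open;
};

/* Read-only mount of the seed device: readonly bit set, FS_OPEN not set. */
static void toy_mount_ro(struct toy_fs *fs)
{
	fs->sb_rdonly = true;
	fs->fs_open = false;
}

/* Device add; 'clear_rdonly' selects the surprising behavior described above. */
static void toy_device_add(struct toy_fs *fs, bool clear_rdonly)
{
	if (clear_rdonly)
		fs->sb_rdonly = false;
}

/* Remount rw: the ro->rw work only runs if the readonly bit actually changes. */
static void toy_remount_rw(struct toy_fs *fs)
{
	if (!fs->sb_rdonly)
		return; /* no change seen, FS_OPEN stays as it was */
	fs->sb_rdonly = false;
	fs->fs_open = true;
}
```

Running mount, device add and remount with `clear_rdonly` set leaves `fs_open` false, which matches the cleaner doing no work; with it unset, the remount sees a real ro->rw transition.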
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
It's redundant to have the 'gen' variable since we already have the same
value in the local btrfs_tree_parent_check structure. So remove it and
instead use the structure's field.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's pointless to initialize the has_first_key field of the stack local
btrfs_tree_parent_check structure at btrfs_tree_parent_check() and at
btrfs_qgroup_trace_subtree() since all fields not explicitly initialized
are zeroed out. In the case of the first function it's a bit odd because
we are assigning 0 and the field is of type bool, however not incorrect
since a 0 is converted to false.
Just remove the explicit initializations due to their redundancy.
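The C rule relied on here can be checked with a small standalone example. `struct toy_parent_check` is a hypothetical, reduced structure and not the real btrfs_tree_parent_check definition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical reduced version of a parent-check structure. */
struct toy_parent_check {
	uint64_t transid;
	int level;
	bool has_first_key;
};

static struct toy_parent_check toy_make_check(uint64_t transid, int level)
{
	/* Members not named in the initializer are zero-initialized. */
	struct toy_parent_check check = {
		.transid = transid,
		.level = level,
	};

	return check;
}
```

Here `has_first_key` is false without being mentioned, so an explicit `.has_first_key = false` (or `= 0`) adds nothing.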
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The only caller of btrfs_verify_level_key() is read_block_for_search() and
it's passing 3 arguments to it that can be extracted from its on stack
variable of type struct btrfs_tree_parent_check.
So change btrfs_verify_level_key() to accept an argument of type
struct btrfs_tree_parent_check instead of level, first key and parent
transid arguments.
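The shape of the change can be sketched like this. The types and the function are hypothetical simplifications (`struct toy_key`, `struct toy_check`, `struct toy_eb`, `toy_verify_level_key()`), and the boolean return value is chosen for brevity, not taken from the real function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical simplified types. */
struct toy_key {
	uint64_t objectid;
	uint64_t offset;
};

struct toy_check {
	struct toy_key first_key;
	uint64_t transid;
	int level;
	bool has_first_key;
};

struct toy_eb {
	struct toy_key first_key;
	uint64_t generation;
	int level;
};

/*
 * Before: verify(eb, level, &first_key, parent_transid).
 * After: a single pointer to the structure that already holds all three.
 */
static bool toy_verify_level_key(const struct toy_eb *eb,
				 const struct toy_check *check)
{
	if (eb->level != check->level)
		return false;
	if (eb->generation != check->transid)
		return false;
	if (!check->has_first_key)
		return true;
	return eb->first_key.objectid == check->first_key.objectid &&
	       eb->first_key.offset == check->first_key.offset;
}
```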
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The level parameter passed to read_block_for_search() always matches the
level of the extent buffer passed in the "eb_ret" parameter, which we are
also extracting into the "parent_level" local variable.
So remove the level parameter and instead use the "parent_level" variable
which in fact has a better name (it's the level of the parent node from
which we are reading a child node/leaf).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that the extent map shrinker can only be run by a single task and runs
asynchronously as a work queue job, enable it as it can no longer cause
stalls on tasks allocating memory and entering the extent map shrinker
through the fs shrinker (implemented by btrfs_free_cached_objects()).
This is crucial to prevent exhaustion of memory due to unbounded extent
map creation, primarily with direct IO but also for buffered IO on files
with holes. This problem, for the direct IO case, was first reported in
the Link tag below. That report was added to a Link tag of the first patch
that introduced the extent map shrinker, commit 956a17d9d0 ("btrfs: add
a shrinker for extent maps"), however the Link tag disappeared somehow
from the committed patch (but was included in the submitted patch to the
mailing list), so adding it below for future reference.
Link: https://lore.kernel.org/linux-btrfs/13f94633dcf04d29aaf1f0a43d42c55e@amazon.com/
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The names for the members of struct btrfs_fs_info related to the extent
map shrinker are a bit too long, so rename them to be shorter by replacing
the "extent_map_" prefix with the "em_" prefix.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that the extent map shrinker can only be run by a single task (as a
work queue item) there is no need to keep the progress of the shrinker
protected by a spinlock and passing the progress to trace events as
parameters. So remove the lock and simplify the arguments for the trace
events.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently the extent map shrinker is run synchronously for kswapd tasks
that end up calling the fs shrinker (fs/super.c:super_cache_scan()).
This has some disadvantages and for some heavy workloads with memory
pressure it can cause some delays and stalls that make a machine
unresponsive for some periods. This happens because:
1) We can have several kswapd tasks on machines with multiple NUMA zones,
and running the extent map shrinker concurrently can cause high
contention on some spin locks, namely the spin locks that protect
the radix tree that tracks roots, the per root xarray that tracks
open inodes and the list of delayed iputs. This not only delays the
shrinker but also causes high CPU consumption and makes the task
running the shrinker monopolize a core, resulting in the symptoms
of an unresponsive system. This was noted in previous commits such as
commit ae1e766f62 ("btrfs: only run the extent map shrinker from
kswapd tasks");
2) The extent map shrinker's iteration over inodes can often be slow, even
after changing the data structure that tracks open inodes for a root
from a red black tree (up to kernel 6.10) to an xarray (kernel 6.10+).
Although the transition to the xarray made things a bit faster, it's
still somewhat slow - for example in a test scenario with 10000 inodes
that have no extent maps loaded, the extent map shrinker took between
5ms and 8ms, using a release, non-debug kernel. Iterating over the
extent maps of an inode can also be slow if we have an inode with many
thousands of extent maps, since we use a red black tree to track and
search extent maps. So having the extent map shrinker run synchronously
adds extra delay for other things a kswapd task does.
So make the extent map shrinker run asynchronously as a job for the
system unbounded workqueue, just like what we do for data and metadata
space reclaim jobs.
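The idea of handing the work off to a deferred job can be sketched in userspace. This is a sequential toy model, not the kernel implementation: the names are hypothetical, there is no real workqueue, and the compare-and-swap used to ensure that only one job is pending at a time is an assumption made for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical toy state: how many extent maps the pending job should scan. */
struct toy_shrinker {
	atomic_long nr_to_scan;
	long scanned_total;
};

/* Called from the reclaim path: request a job unless one is already pending. */
static bool toy_queue_shrinker(struct toy_shrinker *s, long nr)
{
	long expected = 0;

	/* Only the caller that swaps 0 -> nr gets to queue the work. */
	return atomic_compare_exchange_strong(&s->nr_to_scan, &expected, nr);
}

/* The deferred job: do the scan, then allow a new job to be queued. */
static void toy_run_shrinker(struct toy_shrinker *s)
{
	s->scanned_total += atomic_load(&s->nr_to_scan);
	atomic_store(&s->nr_to_scan, 0);
}
```

The caller returns immediately after queuing, so the task that triggered reclaim no longer pays for the scan itself.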
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Move the common code to remove an extent map from its inode's tree into a
helper function and use it, reducing duplicated code.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When crawling btree, if an eb cache miss occurs, we change to use the eb
read lock and release all previous locks (including the parent lock) to
reduce lock contention.
If an eb cache miss occurs in a leaf and IO is needed, before this
change we released locks only from level 2 and up, and we read the leaf's
content from disk while holding a lock on its parent (level 1), causing
unnecessary lock contention on the parent. After this change we release
locks from level 1 and up, lock level 0, and then read the leaf's
content from disk.
Because we have prepared the check parameters and we hold the read lock
of the eb, we can ensure that no race occurs during the check and causes
unexpected errors.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The compression heuristic pass does not need a level, so we can drop the
parameter.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Cascaded removal of fs_info that is not needed in several functions.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function got split in commit 6ab6ebb760 ("btrfs: split
alloc_log_tree()") and since then the transaction parameter has been
unused.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The only caller passes NULL, so we can drop the parameter. This has been
the case since the new mount option parser was added in 3bb17a25bc
("btrfs: add get_tree callback for new mount API").
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since the new mount option parser in commit ad21f15b0f ("btrfs:
switch to the new mount API") we don't pass the options like that
anymore.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The parameter was added in 8ff8466d29 ("btrfs: support subpage for
extent buffer page release") for page but hasn't been used since, so we
can drop it.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The mask parameter used for allocations got unified to GFP_NOFS and
removed from relevant functions in 1d12680044 ("btrfs: drop gfp from
parameter extent state helpers").
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The parameter duplicates what can be effectively obtained from
wc->refs[level - 1] and this is what's actually used inside. Added in
commit 2b73c7e761 ("btrfs: unify logic to decide if we need to walk
down into a node during snapshot delete").
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The parameter 'from' has never been used since commit b8d8e1fd57
("btrfs: introduce btrfs_write_check()"), this is for buffered write.
Direct io write needs it so it was probably an interface thing, but we
can drop it.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The file_offset parameter used to be passed to encoded read struct but
was removed in commit b665affe93 ("btrfs: remove unused members from
struct btrfs_encoded_read_private").
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We don't need offset for inline extents, they always start from 0.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We don't need the inode pointer to read an inline extent, it's all
accessible from the path pointer.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We don't need the user passed parameter, rescan is a filesystem
operation so fs_info is sufficient.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The LZO compression has only one level, we don't need to pass the
parameter.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The caller replace_path() runs under transaction but we don't need it in
btrfs_qgroup_add_swapped_blocks().
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We don't need fs_info here, everything is reachable from qgroup.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The parameter map used to be passed to scrub_extent() until
e02ee89baa ("btrfs: scrub: switch scrub_simple_mirror() to
scrub_stripe infrastructure"), where the scrub implementation was
completely reworked.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The parameter is unused and we can reach sctx from scrub stripe if
needed.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
None of the ref iteration callbacks needs the index parameter (this is
for the directory item iteration), so we can drop it.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
None of the ref iteration callbacks needs the num parameter (this is for
the directory item iteration), so we can drop it.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The parameter is unused and we can get it from space info if needed.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The parameter is not used, we can also reach it from the space info if
needed in the future.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>