After the conversion to folio interfaces (but without the patch to
enable larger folio allocation), there is an LTP report about observable
performance drop on metadata heavy operations.
https://lore.kernel.org/linux-btrfs/202312221750.571925bd-oliver.sang@intel.com/
This drop is caused by the extra code of calculating the
folio_size()/folio_shift(), instead of the old hard coded
PAGE_SIZE/PAGE_SHIFT.
To slightly reduce the overhead, just cache both folio_size and
folio_shift in extent_buffer.
The two new members (u32 folio_size and u8 folio_shift) are stored
inside the holes of extent_buffer. folio_size is shared with len, which
is reduced to u32. The size of eb does not change.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The variable @bio_offset was introduced in commit 7ffd27e378 ("btrfs:
pass bio_offset to check_data_csum() directly"), when we are still using
the same endio function for both data and metadata.
Later we had several changes to data and metadata endio functions:
- Data verification is handled by btrfs bio layer
- Split data and metadata endio paths
Now for data path we no longer do any verification in
end_bbio_data_read(), as the verification is handled by btrfs bio layer
already.
Thus there is no need for such bio_offset variable.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The parameter @pg_offset of btrfs_get_extent() is only utilized for
inlined extent, and we already have an ASSERT() and tree-checker, to
make sure we can only get inline extent at file offset 0.
Any invalid inline extent with non-zero file offset would be rejected by
tree-checker in the first place.
Thus the @pg_offset parameter is not really necessary, just remove it.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmXh1GgACgkQxWXV+ddt
WDtnvA/7BN7BZ6QmwWv9UyxhgSBtzI19AXPi/kBsssnnjNuzXoHFaVHj68lQCCOB
a4YjRxAg7nmwFGHdVDTdnwXgUECzqlVkeX9cXg1ZpJy0IfP9RriGedRlC/93z7aV
pg6DnKMh2FlkibK7yO6kRBR8RYLc5aVIytqHXgUeqbaquuhj2Hh8EpqRo2X0RsoE
wDXlK0qgrU8HyrA3fHdqKYPcm1+cYABGTCwGx65iRffy8vH+KFSAr71G8jOJVEUj
DgNWJCpBjXJNs0dsKrik5oGmvLd3GDBKinNX7R2mAvMAMGWrL+fVVTVTfBS/clUT
FBiVFNYCJuphMcO3Qjs6JIuEez0GuGEeh1m+tQ8W795At1FSiINtE5J7LjmJUl5X
FuUwOUpxco1lTXBLX149Y9kk7AEOaqYxy0XbH4r5bbKyuzQegRGB9/qQX4sSaCt3
3T+Td9PvS+6Jo+CDO0dsYhG/h3bsHeHtHGR6f2CiO/m1zHDnTX9aYVcLMM3hsrMI
8OUEy1jkuKnDZQuZuIWES/3V9FlJL34dR3Cb236Pv/yIH1iujIc27g0qXrC1vzPg
wnUL1wheLQ9IRLedXoiHtX2Y2ZfFQGQDrIKNCJFD+WNPkZYffih5QNTV/mPZmL80
9EbjoVTu+6rygzdD43R1RWvK0kFsY44RKhHreI8SItO+e3/0TAs=
=hMf8
-----END PGP SIGNATURE-----
Merge tag 'for-6.8-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix freeing allocated id for anon dev when snapshot creation fails
- fiemap fixes:
- followup for a recent deadlock fix, ranges that fiemap can access
can still race with ordered extent completion
- make sure fiemap with SYNC flag does not race with writes
* tag 'for-6.8-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix double free of anonymous device after snapshot creation failure
btrfs: ensure fiemap doesn't race with writes when FIEMAP_FLAG_SYNC is given
btrfs: fix race between ordered extent completion and fiemap
When FIEMAP_FLAG_SYNC is given to fiemap the expectation is that that
are no concurrent writes and we get a stable view of the inode's extent
layout.
When the flag is given we flush all IO (and wait for ordered extents to
complete) and then lock the inode in shared mode, however that leaves open
the possibility that a write might happen right after the flushing and
before locking the inode. So fix this by flushing again after locking the
inode - we leave the initial flushing before locking the inode to avoid
holding the lock and blocking other RO operations while waiting for IO
and ordered extents to complete. The second flushing while holding the
inode's lock will most of the time do nothing or very little since the
time window for new writes to have happened is small.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For fiemap we recently stopped locking the target extent range for the
whole duration of the fiemap call, in order to avoid a deadlock in a
scenario where the fiemap buffer happens to be a memory mapped range of
the same file. This use case is very unlikely to be useful in practice but
it may be triggered by fuzz testing (syzbot, etc).
However by not locking the target extent range for the whole duration of
the fiemap call we can race with an ordered extent. This happens like
this:
1) The fiemap task finishes processing a file extent item that covers
the file range [512K, 1M[, and that file extent item is the last item
in the leaf currently being processed;
2) And ordered extent for the file range [768K, 2M[, in COW mode,
completes (btrfs_finish_one_ordered()) and the file extent item
covering the range [512K, 1M[ is trimmed to cover the range
[512K, 768K[ and then a new file extent item for the range [768K, 2M[
is inserted in the inode's subvolume tree;
3) The fiemap task calls fiemap_next_leaf_item(), which then calls
btrfs_next_leaf() to find the next leaf / item. This finds that the
the next key following the one we previously processed (its type is
BTRFS_EXTENT_DATA_KEY and its offset is 512K), is the key corresponding
to the new file extent item inserted by the ordered extent, which has
a type of BTRFS_EXTENT_DATA_KEY and an offset of 768K;
4) Later the fiemap code ends up at emit_fiemap_extent() and triggers
the warning:
if (cache->offset + cache->len > offset) {
WARN_ON(1);
return -EINVAL;
}
Since we get 1M > 768K, because the previously emitted entry for the
old extent covering the file range [512K, 1M[ ends at an offset that
is greater than the new extent's start offset (768K). This makes fiemap
fail with -EINVAL besides triggering the warning that produces a stack
trace like the following:
[1621.677651] ------------[ cut here ]------------
[1621.677656] WARNING: CPU: 1 PID: 204366 at fs/btrfs/extent_io.c:2492 emit_fiemap_extent+0x84/0x90 [btrfs]
[1621.677899] Modules linked in: btrfs blake2b_generic (...)
[1621.677951] CPU: 1 PID: 204366 Comm: pool Not tainted 6.8.0-rc5-btrfs-next-151+ #1
[1621.677954] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
[1621.677956] RIP: 0010:emit_fiemap_extent+0x84/0x90 [btrfs]
[1621.678033] Code: 2b 4c 89 63 (...)
[1621.678035] RSP: 0018:ffffab16089ffd20 EFLAGS: 00010206
[1621.678037] RAX: 00000000004fa000 RBX: ffffab16089ffe08 RCX: 0000000000009000
[1621.678039] RDX: 00000000004f9000 RSI: 00000000004f1000 RDI: ffffab16089ffe90
[1621.678040] RBP: 00000000004f9000 R08: 0000000000001000 R09: 0000000000000000
[1621.678041] R10: 0000000000000000 R11: 0000000000001000 R12: 0000000041d78000
[1621.678043] R13: 0000000000001000 R14: 0000000000000000 R15: ffff9434f0b17850
[1621.678044] FS: 00007fa6e20006c0(0000) GS:ffff943bdfa40000(0000) knlGS:0000000000000000
[1621.678046] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1621.678048] CR2: 00007fa6b0801000 CR3: 000000012d404002 CR4: 0000000000370ef0
[1621.678053] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[1621.678055] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[1621.678056] Call Trace:
[1621.678074] <TASK>
[1621.678076] ? __warn+0x80/0x130
[1621.678082] ? emit_fiemap_extent+0x84/0x90 [btrfs]
[1621.678159] ? report_bug+0x1f4/0x200
[1621.678164] ? handle_bug+0x42/0x70
[1621.678167] ? exc_invalid_op+0x14/0x70
[1621.678170] ? asm_exc_invalid_op+0x16/0x20
[1621.678178] ? emit_fiemap_extent+0x84/0x90 [btrfs]
[1621.678253] extent_fiemap+0x766/0xa30 [btrfs]
[1621.678339] btrfs_fiemap+0x45/0x80 [btrfs]
[1621.678420] do_vfs_ioctl+0x1e4/0x870
[1621.678431] __x64_sys_ioctl+0x6a/0xc0
[1621.678434] do_syscall_64+0x52/0x120
[1621.678445] entry_SYSCALL_64_after_hwframe+0x6e/0x76
There's also another case where before calling btrfs_next_leaf() we are
processing a hole or a prealloc extent and we had several delalloc ranges
within that hole or prealloc extent. In that case if the ordered extents
complete before we find the next key, we may end up finding an extent item
with an offset smaller than (or equals to) the offset in cache->offset.
So fix this by changing emit_fiemap_extent() to address these three
scenarios like this:
1) For the first case, steps listed above, adjust the length of the
previously cached extent so that it does not overlap with the current
extent, emit the previous one and cache the current file extent item;
2) For the second case where he had a hole or prealloc extent with
multiple delalloc ranges inside the hole or prealloc extent's range,
and the current file extent item has an offset that matches the offset
in the fiemap cache, just discard what we have in the fiemap cache and
assign the current file extent item to the cache, since it's more up
to date;
3) For the third case where he had a hole or prealloc extent with
multiple delalloc ranges inside the hole or prealloc extent's range
and the offset of the file extent item we just found is smaller than
what we have in the cache, just skip the current file extent item
if its range end at or behind the cached extent's end, because we may
have emitted (to the fiemap user space buffer) delalloc ranges that
overlap with the current file extent item's range. If the file extent
item's range goes beyond the end offset of the cached extent, just
emit the cached extent and cache a subrange of the file extent item,
that goes from the end offset of the cached extent to the end offset
of the file extent item.
Dealing with those cases in those ways makes everything consistent by
reflecting the current state of file extent items in the btree and
without emitting extents that have overlapping ranges (which would be
confusing and violating expectations).
This issue could be triggered often with test case generic/561, and was
also hit and reported by Wang Yugui.
Reported-by: Wang Yugui <wangyugui@e16-tech.com>
Link: https://lore.kernel.org/linux-btrfs/20240223104619.701F.409509F4@e16-tech.com/
Fixes: b0ad381fa7 ("btrfs: fix deadlock with fiemap and extent locking")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmXV2z8ACgkQxWXV+ddt
WDsudxAAoKcp1DbuOtaOzG/XVnIKt36drK4cwyZnGo9PZ9vlgT6k+T0efto4DkOF
fNWy2d/9iGy9RHy4oxZL6ceb3rcWW0NbhiKHeTPNqL4ZCPa7t6bxMWXSYBh6pYgZ
6EUS6H9Don05F7rQs8rERc+VIW6u1HFTLn4wS1cmlTyTQzZwlk9B2V6KtDtHBi0k
B4CCxY6jX2bl7BncXdYteb13Xjg7+JnWvfSKb7ouSVnL8VEcGG13QkPFNV2Xsoi2
uDDsw+QKBEcPNgBIubBUwbLS5V5vYa1H1meUFJkciaeblHVlVIMN3h7+Y8VNKMTC
qpxEo3Hx6oqmw9LdEIU7WsvFs0JJum2fKOjOx3vr1d3AiyFG6W6lrm1fKwnp3dt7
dHdAYuo8+Q4rirGlMDcEoYqpy7AcV8QqtSYajdrdpB1dqHcHhukSNqJ0dx5lYElU
HtnMXD9vLc4uJDcl9Z1aTWEmB+7nj5HwukSnTqQhgwpCZM7mrz6pe1DuD5iKinG6
Yth9QFhgcbcnchPA+3SOtC6uuje9chHo6L6eTVacnKAKyKgLY8qTTsh3zYQNVhMX
M2aWcAkizq20vKe7JFxs7M/tClyuswTjOP6RYzeayY21Rn8gGwe7uhXhv5MKpEh5
TjXjiixsrxxsyyaED7Kl69i54BvmM/TI35p7Jbx6Ln7PPD8lf8M=
=pXyL
-----END PGP SIGNATURE-----
Merge tag 'for-6.8-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- Fix a deadlock in fiemap.
There was a big lock around the whole operation that can interfere
with a page fault and mkwrite.
Reducing the lock scope can also speed up fiemap
- Fix range condition for extent defragmentation which could lead to
worse layout in some cases
* tag 'for-6.8-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix deadlock with fiemap and extent locking
btrfs: defrag: avoid unnecessary defrag caused by incorrect extent size
While working on the patchset to remove extent locking I got a lockdep
splat with fiemap and pagefaulting with my new extent lock replacement
lock.
This deadlock exists with our normal code, we just don't have lockdep
annotations with the extent locking so we've never noticed it.
Since we're copying the fiemap extent to user space on every iteration
we have the chance of pagefaulting. Because we hold the extent lock for
the entire range we could mkwrite into a range in the file that we have
mmap'ed. This would deadlock with the following stack trace
[<0>] lock_extent+0x28d/0x2f0
[<0>] btrfs_page_mkwrite+0x273/0x8a0
[<0>] do_page_mkwrite+0x50/0xb0
[<0>] do_fault+0xc1/0x7b0
[<0>] __handle_mm_fault+0x2fa/0x460
[<0>] handle_mm_fault+0xa4/0x330
[<0>] do_user_addr_fault+0x1f4/0x800
[<0>] exc_page_fault+0x7c/0x1e0
[<0>] asm_exc_page_fault+0x26/0x30
[<0>] rep_movs_alternative+0x33/0x70
[<0>] _copy_to_user+0x49/0x70
[<0>] fiemap_fill_next_extent+0xc8/0x120
[<0>] emit_fiemap_extent+0x4d/0xa0
[<0>] extent_fiemap+0x7f8/0xad0
[<0>] btrfs_fiemap+0x49/0x80
[<0>] __x64_sys_ioctl+0x3e1/0xb50
[<0>] do_syscall_64+0x94/0x1a0
[<0>] entry_SYSCALL_64_after_hwframe+0x6e/0x76
I wrote an fstest to reproduce this deadlock without my replacement lock
and verified that the deadlock exists with our existing locking.
To fix this simply don't take the extent lock for the entire duration of
the fiemap. This is safe in general because we keep track of where we
are when we're searching the tree, so if an ordered extent updates in
the middle of our fiemap call we'll still emit the correct extents
because we know what offset we were on before.
The only place we maintain the lock is searching delalloc. Since the
delalloc stuff can change during writeback we want to lock the extent
range so we have a consistent view of delalloc at the time we're
checking to see if we need to set the delalloc flag.
With this patch applied we no longer deadlock with my testcase.
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmWYTmMACgkQxWXV+ddt
WDvPRg/+KgS5LV3nNC0MguYcTMQxmgeutIgXZIMfeA3v6EnFS7nj8leP4EPc6+bj
JPSkwj4u2vHVwpnTVuEAuJUXnmFY+Qu70nVy6bM2uOHOYTVBQ8zRVK4cErNNLWCp
OekDaADR53RrZ/xprlQ7b7Ph0Ch2uq9OrpH50IcyquEsH1ffkxlqwyrvth4/8dxC
6zgsFHWrbtVKJf0DYoQPpjEPz5tpdQ+xHZwtmf1cNlUgI1objODr/ZTqXtZqTfw4
/GwrtDPbEri53K/qjgr0dDH7pBVqD6PtnbgoHfYkiizZ0G7UkmlaK6rZIurtATJb
Yk/RCqCUp9tPC4yeFSewFMm1Y8Ae3rkUBG7rnYkvMmBspMqyh/kQAWSBimF5yk/y
vFEdFTe9AbdvP19Nw0CqovLzaO6RrOXCL1usnFvCmBgvF5gZAv63ZW1njP3ZoNta
wB8Rs6hxdRkph8Dk7yvYf54uUR+JyKqjHY6egg2qkKTjz0CSf6qQFyFZXpr81m97
gK4WN5SeP/P2ukRbBKKyzZ5IljUxZuVatvJa0tktd7kAbU26WLzofOJ7pX+iqimM
F2G7gKGJZykLY1WPntXBp9Dg97Ras2O5iViQ7ZKwRdOx1yZS5zzTYlIznHBAmXbL
UgXfVnpJH1xFdkvedNTn+Fz9BHNV1K2a2AT7VITj7sxz23z3aJA=
=4sw3
-----END PGP SIGNATURE-----
Merge tag 'for-6.8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"There are no exciting changes for users, it's been mostly API
conversions and some fixes or refactoring.
The mount API conversion is a base for future improvements that would
come with VFS. Metadata processing has been converted to folios, not
yet enabling the large folios but it's one patch away once everything
gets tested enough.
Core changes:
- convert extent buffers to folios:
- direct API conversion where possible
- performance can drop by a few percent on metadata heavy
workloads, the folio sizes are not constant and the calculations
add up in the item helpers
- both regular and subpage modes
- data cannot be converted yet, we need to port that to iomap and
there are some other generic changes required
- convert mount to the new API, should not be user visible:
- options deprecated long time ago have been removed: inode_cache,
recovery
- the new logic that splits mount to two phases slightly changes
timing of device scanning for multi-device filesystems
- LSM options will now work (like for selinux)
- convert delayed nodes radix tree to xarray, preserving the
preload-like logic that still allows to allocate with GFP_NOFS
- more validation of sysfs value of scrub_speed_max
- refactor chunk map structure, reduce size and improve performance
- extent map refactoring, smaller data structures, improved
performance
- reduce size of struct extent_io_tree, embedded in several
structures
- temporary pages used for compression are cached and attached to a
shrinker, this may slightly improve performance
- in zoned mode, remove redirty extent buffer tracking, zeros are
written in case an out-of-order is detected and proper data are
written to the actual write pointer
- cleanups, refactoring, error message improvements, updated tests
- verify and update branch name or tag
- remove unwanted text"
* tag 'for-6.8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (89 commits)
btrfs: pass btrfs_io_geometry into btrfs_max_io_len
btrfs: pass struct btrfs_io_geometry to set_io_stripe
btrfs: open code set_io_stripe for RAID56
btrfs: change block mapping to switch/case in btrfs_map_block
btrfs: factor out block mapping for single profiles
btrfs: factor out block mapping for RAID5/6
btrfs: reduce scope of data_stripes in btrfs_map_block
btrfs: factor out block mapping for RAID10
btrfs: factor out block mapping for DUP profiles
btrfs: factor out RAID1 block mapping
btrfs: factor out block-mapping for RAID0
btrfs: re-introduce struct btrfs_io_geometry
btrfs: factor out helper for single device IO check
btrfs: migrate btrfs_repair_io_failure() to folio interfaces
btrfs: migrate eb_bitmap_offset() to folio interfaces
btrfs: migrate various end io functions to folios
btrfs: migrate subpage code to folio interfaces
btrfs: migrate get_eb_page_index() and get_eb_offset_in_page() to folios
btrfs: don't double put our subpage reference in alloc_extent_buffer
btrfs: cleanup metadata page pointer usage
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZZUxRQAKCRCRxhvAZXjc
ov/QAQDzvge3oQ9MEymmOiyzzcF+HhAXBr+9oEsYJjFc1p0TsgEA61gXjZo7F1jY
KBqd6znOZCR+Waj0kIVJRAo/ISRBqQc=
=0bRl
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.8.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"This contains the usual miscellaneous features, cleanups, and fixes
for vfs and individual fses.
Features:
- Add Jan Kara as VFS reviewer
- Show correct device and inode numbers in proc/<pid>/maps for vma
files on stacked filesystems. This is now easily doable thanks to
the backing file work from the last cycles. This comes with
selftests
Cleanups:
- Remove a redundant might_sleep() from wait_on_inode()
- Initialize pointer with NULL, not 0
- Clarify comment on access_override_creds()
- Rework and simplify eventfd_signal() and eventfd_signal_mask()
helpers
- Process aio completions in batches to avoid needless wakeups
- Completely decouple struct mnt_idmap from namespaces. We now only
keep the actual idmapping around and don't stash references to
namespaces
- Reformat maintainer entries to indicate that a given subsystem
belongs to fs/
- Simplify fput() for files that were never opened
- Get rid of various pointless file helpers
- Rename various file helpers
- Rename struct file members after SLAB_TYPESAFE_BY_RCU switch from
last cycle
- Make relatime_need_update() return bool
- Use GFP_KERNEL instead of GFP_USER when allocating superblocks
- Replace deprecated ida_simple_*() calls with their current ida_*()
counterparts
Fixes:
- Fix comments on user namespace id mapping helpers. They aren't
kernel doc comments so they shouldn't be using /**
- s/Retuns/Returns/g in various places
- Add missing parameter documentation on can_move_mount_beneath()
- Rename i_mapping->private_data to i_mapping->i_private_data
- Fix a false-positive lockdep warning in pipe_write() for watch
queues
- Improve __fget_files_rcu() code generation to improve performance
- Only notify writer that pipe resizing has finished after setting
pipe->max_usage otherwise writers are never notified that the pipe
has been resized and hang
- Fix some kernel docs in hfsplus
- s/passs/pass/g in various places
- Fix kernel docs in ntfs
- Fix kcalloc() arguments order reported by gcc 14
- Fix uninitialized value in reiserfs"
* tag 'vfs-6.8.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (36 commits)
reiserfs: fix uninit-value in comp_keys
watch_queue: fix kcalloc() arguments order
ntfs: dir.c: fix kernel-doc function parameter warnings
fs: fix doc comment typo fs tree wide
selftests/overlayfs: verify device and inode numbers in /proc/pid/maps
fs/proc: show correct device and inode numbers in /proc/pid/maps
eventfd: Remove usage of the deprecated ida_simple_xx() API
fs: super: use GFP_KERNEL instead of GFP_USER for super block allocation
fs/hfsplus: wrapper.c: fix kernel-doc warnings
fs: add Jan Kara as reviewer
fs/inode: Make relatime_need_update return bool
pipe: wakeup wr_wait after setting max_usage
file: remove __receive_fd()
file: stop exposing receive_fd_user()
fs: replace f_rcuhead with f_task_work
file: remove pointless wrapper
file: s/close_fd_get_file()/file_close_fd()/g
Improve __fget_files_rcu() code generation (and thus __fget_light())
file: massage cleanup of files that failed to open
fs/pipe: Fix lockdep false-positive in watchqueue pipe_write()
...
[BUG]
Test case btrfs/002 would fail if larger folios are enabled for
metadata:
assertion failed: folio, in fs/btrfs/extent_io.c:4358
------------[ cut here ]------------
kernel BUG at fs/btrfs/extent_io.c:4358!
invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
CPU: 1 PID: 30916 Comm: fsstress Tainted: G OE 6.7.0-rc3-custom+ #128
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 2/2/2022
RIP: 0010:assert_eb_folio_uptodate+0x98/0xe0 [btrfs]
Call Trace:
<TASK>
extent_buffer_test_bit+0x3c/0x70 [btrfs]
free_space_test_bit+0xcd/0x140 [btrfs]
modify_free_space_bitmap+0x27a/0x430 [btrfs]
add_to_free_space_tree+0x8d/0x160 [btrfs]
__btrfs_free_extent.isra.0+0xef1/0x13c0 [btrfs]
__btrfs_run_delayed_refs+0x786/0x13c0 [btrfs]
btrfs_run_delayed_refs+0x33/0x120 [btrfs]
btrfs_commit_transaction+0xa2/0x1350 [btrfs]
iterate_supers+0x77/0xe0
ksys_sync+0x60/0xa0
__do_sys_sync+0xa/0x20
do_syscall_64+0x3f/0xf0
entry_SYSCALL_64_after_hwframe+0x6e/0x76
</TASK>
[CAUSE]
The function extent_buffer_test_bit() is not folio compatible.
It still assumes the old fixed page size, when an extent buffer with
large folio passed in, only eb->folios[0] is populated.
Then if the target bit range falls in the 2nd page of the folio, then we
would check eb->folios[1], and trigger the ASSERT().
[FIX]
Just migrate eb_bitmap_offset() to folio interfaces, using the
folio_size() to replace PAGE_SIZE.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we still go the old page based iterator functions, like
bio_for_each_segment_all(), we can hit middle pages of a folio (compound
page).
In that case if we set any page flag on those middle pages, we can
easily trigger VM_BUG_ON(), as for compound page flags, they should
follow their flag policies (normally only set on leading or tail pages).
To avoid such problem in the future full folio migration, here we do:
- Change from bio_for_each_segment_all() to bio_for_each_folio_all()
This completely removes the ability to access the middle page.
- Add extra ASSERT()s for data read/write paths
To ensure we only get single paged folio for data now.
- Rename those end io functions to follow a certain schema
* end_bbio_compressed_read()
* end_bbio_compressed_write()
These two endio functions don't set any page flags, as they use pages
not mapped to any address space.
They can be very good candidates for higher order folio testing.
And they are shared between compression and encoded IO.
* end_bbio_data_read()
* end_bbio_data_write()
* end_bbio_meta_read()
* end_bbio_meta_write()
The old function names are not unified:
- end_bio_extent_writepage()
- end_bio_extent_readpage()
- extent_buffer_write_end_io()
- extent_buffer_read_end_io()
They share no schema on where the "end_*io" string should be, nor can
be confusing just using "extent_buffer" and "extent" to distinguish
data and metadata paths.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Although subpage itself is conflicting with higher folio, since subpage
(sectorsize < PAGE_SIZE and nodesize < PAGE_SIZE) means we will never
need higher order folio, there is a hidden pitfall:
- btrfs_page_*() helpers
Those helpers are an abstraction to handle both subpage and non-subpage
cases, which means we're going to pass pages pointers to those helpers.
And since those helpers are shared between data and metadata paths, it's
unavoidable to let them to handle folios, including higher order
folios).
Meanwhile for true subpage case, we should only have a single page
backed folios anyway, thus add a new ASSERT() for btrfs_subpage_assert()
to ensure that.
Also since those helpers are shared between both data and metadata, add
some extra ASSERT()s for data path to make sure we only get single page
backed folio for now.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These two functions are still using the old page based code, which is
not going to handle larger folios at all.
The migration itself is going to involve the following changes:
- PAGE_SIZE -> folio_size()
- PAGE_SHIFT -> folio_shift()
- get_eb_page_index() -> get_eb_folio_index()
- get_eb_offset_in_page() -> get_eb_offset_in_folio()
And since we're going to support larger folios, although above straight
conversion is good enough, this patch would add extra comments in the
involved functions to explain why the same single line code can now
cover 3 cases:
- folio_size == PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The common, non-subpage case with per-page folio.
- folio_size > PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The incoming larger folio, non-subpage case.
- folio_size == PAGE_SIZE, sectorsize < PAGE_SIZE, nodesize < PAGE_SIZE
The existing subpage case, we won't larger folio anyway.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This fixes as case in "btrfs: refactor alloc_extent_buffer() to
allocate-then-attach method".
We have been seeing panics in the CI for the subpage stuff recently, it
happens on btrfs/187 but could potentially happen anywhere.
In the subpage case, if we race with somebody else inserting the same
extent buffer, the error case will end up calling
detach_extent_buffer_page() on the page twice.
This is done first in the bit
for (int i = 0; i < attached; i++)
detach_extent_buffer_page(eb, eb->pages[i];
and then again in btrfs_release_extent_buffer().
This works fine for !subpage because we're the only person who ever has
ourselves on the private, and so when we do the initial
detach_extent_buffer_page() we know we've completely removed it.
However for subpage we could be using this page private elsewhere, so
this results in a double put on the subpage, which can result in an
early freeing.
The fix here is to clear eb->pages[i] for everything we detach. Then
anything still attached to the eb is freed in
btrfs_release_extent_buffer().
Because of this change we must update
btrfs_release_extent_buffer_pages() to not use num_extent_folios,
because it assumes eb->folio[0] is set properly. Since this is only
interested in freeing any pages we have on the extent buffer we can
simply use INLINE_EXTENT_BUFFER_PAGES.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Although we have migrated extent_buffer::pages[] to folios[], we're
still mostly using the folio_page() help to grab the page.
This patch would do the following cleanups for metadata:
- Introduce num_extent_folios() helper
This is to replace most num_extent_pages() callers.
- Use num_extent_folios() to iterate future large folios
This allows us to use things like
bio_add_folio()/bio_add_folio_nofail(), and only set the needed flags
for the folio (aka the leading/tailing page), which reduces the loop
iteration to 1 for large folios.
- Change metadata related functions to use folio pointers
Including their function name, involving:
* attach_extent_buffer_page()
* detach_extent_buffer_page()
* page_range_has_eb()
* btrfs_release_extent_buffer_pages()
* btree_clear_page_dirty()
* btrfs_page_inc_eb_refs()
* btrfs_page_dec_eb_refs()
- Change btrfs_is_subpage() to accept an address_space pointer
This is to allow both page->mapping and folio->mapping to be utilized.
As data is still using the old per-page code, and may keep so for a
while.
- Special corner case place holder for future order mismatches between
extent buffer and inode filemap
For now it's just a block of comments and a dead ASSERT(), no real
handling yet.
The subpage code would still go page, just because subpage and large
folio are conflicting conditions, thus we don't need to bother subpage
with higher order folios at all. Just folio_page(folio, 0) would be
enough.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor styling tweaks ]
Signed-off-by: David Sterba <dsterba@suse.com>
For now extent_buffer::pages[] are still only accepting single page
pointer, thus we can migrate to folios pretty easily.
As for single page, page and folio are 1:1 mapped, including their page
flags.
This patch would just do the conversion from struct page to struct
folio, providing the first step to higher order folio in the future.
This conversion is pretty simple:
- extent_buffer::pages[] -> extent_buffer::folios[]
- page_address(eb->pages[i]) -> folio_address(eb->pages[i])
- eb->pages[i] -> folio_page(eb->folios[i], 0)
There would be more specific cleanups preparing for the incoming higher
order folio support.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently alloc_extent_buffer() utilizes find_or_create_page() to
allocate one page a time for an extent buffer.
This method has the following disadvantages:
- find_or_create_page() is the legacy way of allocating new pages
With the new folio infrastructure, find_or_create_page() is just
redirected to filemap_get_folio().
- Lacks the way to support higher order (order >= 1) folios
As we can not yet let filemap give us a higher order folio.
This patch would change the workflow by the following way:
Old | new
-----------------------------------+-------------------------------------
| ret = btrfs_alloc_page_array();
for (i = 0; i < num_pages; i++) { | for (i = 0; i < num_pages; i++) {
p = find_or_create_page(); | ret = filemap_add_folio();
/* Attach page private */ | /* Reuse page cache if needed */
/* Reused eb if needed */ |
| /* Attach page private and
| reuse eb if needed */
| }
By this we split the page allocation and private attaching into two
parts, allowing future updates to each part more easily, and migrate to
folio interfaces (especially for possible higher order folios).
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently, in struct extent_map, we use an unsigned int (32 bits) to
identify the compression type of an extent and an unsigned long (64 bits
on a 64 bits platform, 32 bits otherwise) for flags. We are only using
6 different flags, so an unsigned long is excessive and we can use flags
to identify the compression type instead of using a dedicated 32 bits
field.
We can easily have tens or hundreds of thousands (or more) of extent maps
on busy and large filesystems, specially with compression enabled or many
or large files with tons of small extents. So it's convenient to have the
extent_map structure as small as possible in order to use less memory.
So remove the compression type field from struct extent_map, use flags
to identify the compression type and shorten the flags field from an
unsigned long to a u32. This saves 8 bytes (on 64 bits platforms) and
reduces the size of the structure from 136 bytes down to 128 bytes, using
now only two cache lines, and increases the number of extent maps we can
have per 4K page from 30 to 32. By using a u32 for the flags instead of
an unsigned long, we no longer use test_bit(), set_bit() and clear_bit(),
but that level of atomicity is not needed as most flags are never cleared
once set (before adding an extent map to the tree), and the ones that can
be cleared or set after an extent map is added to the tree, are always
performed while holding the write lock on the extent map tree, while the
reader holds a lock on the tree or tests for a flag that never changes
once the extent map is in the tree (such as compression flags).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently btrfs extent buffer helpers are doing all the cross-page
handling, as there is no guarantee that all those eb pages are
contiguous.
However on systems with enough memory, there is a very high chance the
page cache for btree_inode are allocated with physically contiguous
pages.
In that case, we can skip all the complex cross-page handling, thus
speeding up the code.
This patch adds a new member, extent_buffer::addr, which is only set to
non-NULL if all the extent buffer pages are physically contiguous.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use memset_page() in memset_extent_buffer() instead of opencoding it.
This does not not change any functionality.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
One a zoned filesystem, never clear the dirty flag of an extent buffer,
but instead mark it as zeroout.
On writeout, when encountering a marked extent_buffer, zero it out.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
EXTENT_BUFFER_ZONED_ZEROOUT better describes the state of the extent buffer,
namely it is written as all zeros. This is needed in zoned mode, to
preserve I/O ordering.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
As a cleanup and preparation for future folio migration, this patch
would replace all page->private to folio version. This includes:
- PagePrivate()
-> folio_test_private()
- page->private
-> folio_get_private()
- attach_page_private()
-> folio_attach_private()
- detach_page_private()
-> folio_detach_private()
Since we're here, also remove the forced cast on page->private, since
it's (void *) already, we don't really need to do the cast.
For now even if we missed some call sites, it won't cause any problem
yet, as we're only using order 0 folio (single page), thus all those
folio/page flags should be synced.
But for the future conversion to utilize higher order folio, the page
<-> folio flag sync is no longer guaranteed, thus we have to migrate to
utilize folio flags.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The EXTENT_QGROUP_RESERVED bit is used to "lock" regions of the file for
duplicate reservations. That is two writes to that range in one
transaction shouldn't create two reservations, as the reservation will
only be freed once when the write finally goes down. Therefore, it is
never OK to clear that bit without freeing the associated qgroup
reserve. At this point, we don't want to be freeing the reserve, so mask
off the bit.
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
If btrfs_alloc_page_array() fail to allocate all pages but part of the
slots, then the partially allocated pages would be leaked in function
btrfs_submit_compressed_read().
[CAUSE]
As explicitly stated, if btrfs_alloc_page_array() returned -ENOMEM,
caller is responsible to free the partially allocated pages.
For the existing call sites, most of them are fine:
- btrfs_raid_bio::stripe_pages
Handled by free_raid_bio().
- extent_buffer::pages[]
Handled btrfs_release_extent_buffer_pages().
- scrub_stripe::pages[]
Handled by release_scrub_stripe().
But there is one exception in btrfs_submit_compressed_read(), if
btrfs_alloc_page_array() failed, we didn't cleanup the array and freed
the array pointer directly.
Initially there is still the error handling in commit dd137dd1f2
("btrfs: factor out allocating an array of pages"), but later in commit
544fe4a903 ("btrfs: embed a btrfs_bio into struct compressed_bio"),
the error handling is removed, leading to the possible memory leak.
[FIX]
This patch would add back the error handling first, then to prevent such
situation from happening again, also
Make btrfs_alloc_page_array() to free the allocated pages as a extra
safety net, then we don't need to add the error handling to
btrfs_submit_compressed_read().
Fixes: 544fe4a903 ("btrfs: embed a btrfs_bio into struct compressed_bio")
CC: stable@vger.kernel.org # 6.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It is hard to find where mapping->private_lock, mapping->private_list and
mapping->private_data are used, due to private_XXX being a relatively
common name for variables and structure members in the kernel. To fit
with other members of struct address_space, rename them all to have an
i_ prefix. Tested with an allmodconfig build.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/20231117215823.2821906-1-willy@infradead.org
Acked-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
The semantics of test_range_bit() with filled == 0 is now in it's own
helper so test_range_bit will check the whole range unconditionally.
The detection logic is flipped and assumes success by default and
catches exceptions.
Signed-off-by: David Sterba <dsterba@suse.com>
The existing helper test_range_bit works in two ways, checks if the whole
range contains all the bits, or stop on the first occurrence. By adding
a specific helper for the latter case, the inner loop can be simplified
and contains fewer conditionals, making it a bit faster.
There's no caller that uses the cached state pointer so this reduces the
argument count further.
Signed-off-by: David Sterba <dsterba@suse.com>
A long time ago, we had some metadata chunks which started at sector
boundary but not aligned to nodesize boundary.
This led to some older filesystems which can have tree blocks only
aligned to sectorsize, but not nodesize.
Later 'btrfs check' gained the ability to detect and warn about such tree
blocks, and kernel fixed the chunk allocation behavior, nowadays those
tree blocks should be pretty rare.
But in the future, if we want to migrate metadata to folio, we cannot
have such tree blocks, as filemap_add_folio() requires the page index to
be aligned with the folio number of pages. Such unaligned tree blocks
can lead to VM_BUG_ON().
So this patch adds extra warning for those unaligned tree blocks, as a
preparation for the future folio migration.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function btrfsic_mount() is part of the deprecated check-integrity
functionality.
Now let's remove the main entry point of check-integrity, and thankfully
most of the check-integrity code is self-contained inside
check-integrity.c, we can safely remove the function without huge
changes to btrfs code base.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Function name in the comment does not bring much value to code not
exposed as API and we don't stick to the kdoc format anymore. Update
formatting of parameter descriptions.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit f98b6215d7 ("btrfs: extent_io: do extra check for extent buffer
read write functions") changed how we handle invalid extent buffer range
for read_extent_buffer().
Previously if the range is invalid we just set the destination to zero,
but after the patch we do nothing and error out.
This can lead to smatch static checker errors like:
fs/btrfs/print-tree.c:186 print_uuid_item() error: uninitialized symbol 'subvol_id'.
fs/btrfs/tests/extent-io-tests.c:338 check_eb_bitmap() error: uninitialized symbol 'has'.
fs/btrfs/tests/extent-io-tests.c:353 check_eb_bitmap() error: uninitialized symbol 'has'.
fs/btrfs/uuid-tree.c:203 btrfs_uuid_tree_remove() error: uninitialized symbol 'read_subid'.
fs/btrfs/uuid-tree.c:353 btrfs_uuid_tree_iterate() error: uninitialized symbol 'subid_le'.
fs/btrfs/uuid-tree.c:72 btrfs_uuid_tree_lookup() error: uninitialized symbol 'data'.
fs/btrfs/volumes.c:7415 btrfs_dev_stats_value() error: uninitialized symbol 'val'.
Fix those warnings by reverting back to the old memset() behavior.
By this we keep the static checker happy and would still make a lot of
noise when such invalid ranges are passed in.
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Fixes: f98b6215d7 ("btrfs: extent_io: do extra check for extent buffer read write functions")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have been consistently seeing hangs with generic/648 in our subpage
GitHub CI setup. This is a classic deadlock, we are calling
btrfs_read_folio() on a folio, which requires holding the folio lock on
the folio, and then finding a ordered extent that overlaps that range
and calling btrfs_start_ordered_extent(), which then tries to write out
the dirty page, which requires taking the folio lock and then we
deadlock.
The hang happens because we're writing to range [1271750656, 1271767040),
page index [77621, 77622], and page 77621 is !Uptodate. It is also Dirty,
so we call btrfs_read_folio() for 77621 and which does
btrfs_lock_and_flush_ordered_range() for that range, and we find an ordered
extent which is [1271644160, 1271746560), page index [77615, 77621].
The page indexes overlap, but the actual bytes don't overlap. We're
holding the page lock for 77621, then call
btrfs_lock_and_flush_ordered_range() which tries to flush the dirty
page, and tries to lock 77621 again and then we deadlock.
The byte ranges do not overlap, but with subpage support if we clear
uptodate on any portion of the page we mark the entire thing as not
uptodate.
We have been clearing page uptodate on write errors, but no other file
system does this, and is in fact incorrect. This doesn't hurt us in the
!subpage case because we can't end up with overlapped ranges that don't
also overlap on the page.
Fix this by not clearing uptodate when we have a write error. The only
thing we should be doing in this case is setting the mapping error and
carrying on. This makes it so we would no longer call
btrfs_read_folio() on the page as it's uptodate and eliminates the
deadlock.
With this patch we're now able to make it through a full fstests run on
our subpage blocksize VMs.
Note for stable backports: this probably goes beyond 6.1 but the code
has been cleaned up and clearing the uptodate bit must be verified on
each version independently.
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We currently advance the meta_write_pointer in
btrfs_check_meta_write_pointer(). That makes it necessary to revert it
when locking the buffer failed. Instead, we can advance it just before
sending the buffer.
Also, this is necessary for the following commit. In the commit, it needs
to release the zoned_meta_io_lock to allow IOs to come in and wait for them
to fill the currently active block group. If we advance the
meta_write_pointer before locking the extent buffer, the following extent
buffer can pass the meta_write_pointer check, resulting in an unaligned
write failure.
Advancing the pointer is still thread-safe as the extent buffer is locked.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we have writeback_control passed to
btrfs_check_meta_write_pointer(), we can move the wbc condition in
submit_eb_page() to btrfs_check_meta_write_pointer() and return int.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For metadata write out on the zoned mode, we call
btrfs_check_meta_write_pointer() to check if an extent buffer to be written
is aligned to the write pointer.
We look up a block group containing the extent buffer for every extent
buffer, which takes unnecessary effort as the writing extent buffers are
mostly contiguous.
Introduce "zoned_bg" to cache the block group working on. Also, while
at it, rename "cache" to "block_group".
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduce btrfs_eb_write_context to consolidate writeback_control and the
exntent buffer context. This will help adding a block group context as
well.
While at it, move the eb context setting before
btrfs_check_meta_write_pointer(). We can set it here because we anyway need
to skip pages in the same eb if that eb is rejected by
btrfs_check_meta_write_pointer().
Suggested-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BACKGROUND]
Currently memove_extent_buffer() does a loop where it strop at any page
boundary inside [dst_offset, dst_offset + len) or [src_offset,
src_offset + len).
This is mostly allowing us to do copy_pages(), but if we're going to use
folios we will need to handle multi-page (the old behavior) or single
folio (the new optimization).
The current code would be a burden for future changes.
[ENHANCEMENT]
Instead of sticking with copy_pages(), here we utilize the new
__write_extent_buffer() helper to handle the writes.
Unlike the refactoring in memcpy_extent_buffer(), we can not just rely
on the write_extent_buffer() and only handle page boundaries inside src
range.
The function write_extent_buffer() itself is still doing forward
writing, thus it cannot handle the following case: (already in the
extent buffer memory operation tests, cross page overlapping run 2)
Src Page boundary
|///////|
|///|////|
Dst
In the above case, if we just follow page boundary in the src range, we
have no need to do any split, just one __write_extent_buffer() with
use_memmove = true.
But __write_extent_buffer() would split the dst range into two,
so it first copies the beginning part of the src range into the first half
of the dst range.
After this operation, the beginning of the dst range is already updated,
causing corruption.
So we have to follow the old behavior of handling both page boundaries.
And since we're the last caller of copy_pages(), we can remove it
completely.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BACKGROUND]
Currently memcpy_extent_buffer() does a loop where it would stop at
any page boundary inside [dst_offset, dst_offset + len) or [src_offset,
src_offset + len).
This is mostly allowing us to do copy_pages(), but if we're going to use
folios we will need to handle multi-page (the old behavior) or single
folio (the new optimization).
The current code would be a burden for future changes.
[ENHANCEMENT]
There is a hidden pitfall of the naming memcpy_extent_buffer(), unlike
regular memcpy(), this function can handle overlapping ranges.
So here we extract write_extent_buffer() into a new internal helper,
__write_extent_buffer(), and add a new parameter @use_memmove, to
indicate whether we should use memmove() or regular memcpy().
Now we can go __write_extent_buffer() to handle writing into the dst
range, with proper overlapping detection.
This has a tiny change to the chance of calling memmove().
As the split only happens at the source range page boundaries, the
memcpy/memmove() range would be slightly larger than the old code,
thus slightly increase the chance we call memmove() other than memcopy().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_clone_extent_buffer() calls copy_page() at each iteration but we
can copy all pages at the end in one go if there were no errors.
This would make later conversion to folios easier.
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BACKGROUND]
copy_extent_buffer_full() currently does different handling for regular
and subpage cases, for regular cases it does a page by page copying.
For subpage cases, it just copies the content.
This is fine for the page based extent buffer code, but for the incoming
folio conversion, it can be a burden to add a new branch just to handle
all the different combinations (subpage vs regular, one single folio vs
multi pages).
[ENHANCE]
Instead of handling the different combinations, just go one single
handling for all cases, utilizing write_extent_buffer() to do the
copying.
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Helpers write_extent_buffer_chunk_tree_uuid() and
write_extent_buffer_fsid(), they can be implemented by
write_extent_buffer().
These two helpers are not that frequently used, they only get called
during initialization of a new tree block. There is not much need for
those slightly optimized versions. And since they can be easily
converted to one write_extent_buffer() call, define them as inline
helpers.
This would make later page/folio switch much easier, as all change only
need to happen in write_extent_buffer().
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BACKGROUND]
Currently we handle extent bitmaps manually in
extent_buffer_bitmap_set() and extent_buffer_bitmap_clear().
Although with various helpers like eb_bitmap_offset() it's still a little
messy to read. The code seems to be a copy of bitmap_set(), but with
all the cross-page handling embedded into the code.
[ENHANCEMENT]
This patch would enhance the readability by introducing two helpers:
- memset_extent_buffer()
To handle the byte aligned range, thus all the cross-page handling is
done there.
- extent_buffer_get_byte()
This for the first and the last byte operations, which only need to
grab one byte, thus no need for any cross-page handling.
So we can split both extent_buffer_bitmap_set() and
extent_buffer_bitmap_clear() into 3 parts:
- Handle the first byte
If the range fits inside the first byte, we can exit early.
- Handle the byte aligned part
This is the part which can have cross-page operations, and it would
be handled by memset_extent_buffer().
- Handle the last byte
This refactoring does not only make the code a little easier to read,
but also makes later folio/page switch much easier, as the switch only
needs to be done inside memset_extent_buffer() and extent_buffer_get_byte().
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
extent_write_locked_range currently expects that either all or no
pages are dirty when it is called. Bur run_delalloc_zoned is called
directly in the writepages path, and has the dirty bit cleared only
for locked_page and which the extent_write_cache_pages currently
operates. It currently works around this by redirtying locked_page,
but that is a bit inefficient and cumbersome. Pass a locked_page
argument to run_delalloc_zoned so that clearing the dirty bit can
be skipped on just that page.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
compress_file_range needs to clear the dirty bit before handing off work
to the compression worker threads to prevent processes coming in through
mmap and changing the file contents while the compression is accessing
the data (See commit 4adaa61102 ("Btrfs: fix race between mmap writes
and compression").
But when compress_file_range decides to not compress the data, it falls
back to submit_uncompressed_range which uses extent_write_locked_range
to write the uncompressed data. extent_write_locked_range currently
expects all pages to be marked dirty so that it can clear the dirty
bit itself, and thus compress_file_range has to redirty the page range.
Redirtying the page range is rather inefficient and also pointless,
so instead pass a pages_dirty parameter to extent_write_locked_range
and skip the redirty game entirely.
Note that compress_file_range was even redirtying the locked_page twice
given that extent_range_clear_dirty_for_io already redirties all pages
in the range, which must include locked_page if there is one.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of a separate page_started argument that tells the callers that
btrfs_run_delalloc_range already started writeback by itself, overload
the return value with a positive 1 in additio to 0 and a negative error
code to indicate that is has already started writeback, and remove the
nr_written argument as that caller can calculate it directly based on
the range, and in fact already does so for the case where writeback
wasn't started yet.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently writepage_delalloc adds to delalloc_to_write in every loop
operation. That is not only more work than doing it once after the
loop, but can also over-increment the counter due to rounding errors
when a new loop iteration starts with an offset into a page.
Add a new page_start variable instead of recaculation that value over
and over, move the delalloc_to_write calculation out of the loop, use
the DIV_ROUND_UP helper instead of open coding it and remove the pointless
found local variable.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The return value from extent_write_locked_range is ignored, and that's
fine because the error reporting happens through the mapping and
ordered_extent.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
end_extent_writepage is a small helper that combines a call to
btrfs_mark_ordered_io_finished with conditional error-only calls to
btrfs_page_clear_uptodate and mapping_set_error with a somewhat
unfortunate calling convention that passes and inclusive end instead
of the len expected by the underlying functions.
Remove end_extent_writepage and open code it in the 4 callers. Out
of those two already are error-only and thus don't need the extra
conditional, and one already has the mapping_set_error, so a duplicate
call can be avoided.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_writepage_endio_finish_ordered is a small wrapper around
btrfs_mark_ordered_io_finished that just changs the argument passing
slightly, and adds a tracepoint.
Move the tracpoint to btrfs_mark_ordered_io_finished, which means
it now also covers the error handling in btrfs_cleanup_ordered_extent
and switch all callers to just call btrfs_mark_ordered_io_finished
directly.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is a lot of complexity in __process_pages_contig to deal with the
PAGE_LOCK case that can return an error unlike all the other actions.
Open code the page iteration for page locking in lock_delalloc_pages and
remove all the now unused code from __process_pages_contig.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Initially we preallocate btrfs_subpage structure in the main loop of
alloc_extent_buffer().
But later commit fbca46eb46 ("btrfs: make nodesize >= PAGE_SIZE case
to reuse the non-subpage routine") has made sure we only go subpage
routine if our nodesize is smaller than PAGE_SIZE.
This means for that case, we only need to allocate the subpage structure
once anyway.
So this patch would make the preallocation out of the main loop. This
would slightly reduce the workload when we hold the page lock, and make
code a little easier to read.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Simplify code pattern of 'folio->index + folio_nr_pages(folio)' by using
the existing helper folio_next_index().
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Minjie Du <duminjie@vivo.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
bio_ctrl->len_to_oe_boundary is used to make sure we stay inside a zone
as we submit bios for writes. Every time we add a page to the bio, we
decrement those bytes from len_to_oe_boundary, and then we submit the
bio if we happen to hit zero.
Most of the time, len_to_oe_boundary gets set to U32_MAX.
submit_extent_page() adds pages into our bio, and the size of the bio
ends up limited by:
- Are we contiguous on disk?
- Does bio_add_page() allow us to stuff more in?
- is len_to_oe_boundary > 0?
The len_to_oe_boundary math starts with U32_MAX, which isn't page or
sector aligned, and subtracts from it until it hits zero. In the
non-zoned case, the last IO we submit before we hit zero is going to be
unaligned, triggering BUGs.
This is hard to trigger because bio_add_page() isn't going to make a bio
of U32_MAX size unless you give it a perfect set of pages and fully
contiguous extents on disk. We can hit it pretty reliably while making
large swapfiles during provisioning because the machine is freshly
booted, mostly idle, and the disk is freshly formatted. It's also
possible to trigger with reads when read_ahead_kb is set to 4GB.
The code has been clean up and shifted around a few times, but this flaw
has been lurking since the counter was added. I think the commit
24e6c80822 ("btrfs: simplify main loop in submit_extent_page") ended
up exposing the bug.
The fix used here is to skip doing math on len_to_oe_boundary unless
we've changed it from the default U32_MAX value. bio_add_page() is the
real limit we want, and there's no reason to do extra math when block
layer is doing it for us.
Sample reproducer, note you'll need to change the path to the bdi and
device:
SUBVOL=/btrfs/swapvol
SWAPFILE=$SUBVOL/swapfile
SZMB=8192
mkfs.btrfs -f /dev/vdb
mount /dev/vdb /btrfs
btrfs subvol create $SUBVOL
chattr +C $SUBVOL
dd if=/dev/zero of=$SWAPFILE bs=1M count=$SZMB
sync
echo 4 > /proc/sys/vm/drop_caches
echo 4194304 > /sys/class/bdi/btrfs-2/read_ahead_kb
while true; do
echo 1 > /proc/sys/vm/drop_caches
echo 1 > /proc/sys/vm/drop_caches
dd of=/dev/zero if=$SWAPFILE bs=4096M count=2 iflag=fullblock
done
Fixes: 24e6c80822 ("btrfs: simplify main loop in submit_extent_page")
CC: stable@vger.kernel.org # 6.4+
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
__extent_writepage could have started on more pages than the one it was
called for. This happens regularly for zoned file systems, and in theory
could happen for compressed I/O if the worker thread was executed very
quickly. For such pages extent_write_cache_pages waits for writeback
to complete before moving on to the next page, which is highly inefficient
as it blocks the flusher thread.
Port over the PageDirty check that was added to write_cache_pages in
commit 515f4a037f ("mm: write_cache_pages optimise page cleaning") to
fix this.
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
extent_write_cache_pages stops writing pages as soon as nr_to_write hits
zero. That is the right thing for opportunistic writeback, but incorrect
for data integrity writeback, which needs to ensure that no dirty pages
are left in the range. Thus only stop the writeback for WB_SYNC_NONE
if nr_to_write hits 0.
This is a port of write_cache_pages changes in commit 05fe478dd0
("mm: write_cache_pages integrity fix").
Note that I've only trigger the problem with other changes to the btrfs
writeback code, but this condition seems worthwhile fixing anyway.
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
[ updated comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
Use the btrfs_finish_ordered_extent helper to complete compressed writes
using the bbio->ordered pointer instead of requiring an rbtree lookup
in the otherwise equivalent btrfs_mark_ordered_io_finished called from
btrfs_writepage_endio_finish_ordered.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This prepares for switching to more efficient ordered_extent processing
and already removes the forth and back conversion from len to end back to
len.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add a pointer to the ordered_extent to the existing union in struct
btrfs_bio, so all code dealing with data write bios can just use a
pointer dereference to retrieve the ordered_extent instead of doing
multiple rbtree lookups per I/O.
The reference to this ordered_extent is dropped at end I/O time,
which implies that an extra one must be acquired when the bio is split.
This also requires moving the btrfs_extract_ordered_extent call into
btrfs_split_bio so that the invariant of always having a valid
ordered_extent reference for the btrfs_bio is kept.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently buffered writeback bios are allowed to span multiple
ordered_extents, although that basically never actually happens since
commit 4a445b7b61 ("btrfs: don't merge pages into bio if their page
offset is not contiguous").
Supporting bios than span ordered_extents complicates the file
checksumming code, and prevents us from adding an ordered_extent pointer
to the btrfs_bio structure. Use the existing code to limit a bio to
single ordered_extent for zoned device writes for all writes.
This allows to remove the REQ_BTRFS_ONE_ORDERED flags, and the
handling of multiple ordered_extents in btrfs_csum_one_bio.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When extent_write_locked_range was originally added, it was only used
writing back compressed pages from an async helper thread. But it is
now also used for writing back pages on zoned devices, where it is
called directly from the ->writepage context. In this case we want to
be able to pass on the writeback_control instead of creating a new one,
and more importantly want to use all the normal cgroup interaction
instead of potentially deferring writeback to another helper.
Fixes: 898793d992 ("btrfs: zoned: write out partially allocated region")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
__extent_writepage does a lot of things that make no sense for
extent_write_locked_range, given that extent_write_locked_range itself is
called from __extent_writepage either directly or through a workqueue,
and all this work has already been done in the first invocation and the
pages haven't been unlocked since. Call __extent_writepage_io directly
instead and open code the logic tracked in
btrfs_bio_ctrl::extent_locked.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Move the nr_to_write accounting from __extent_writepage_io to
__extent_writepage_io as we'll grow another __extent_writepage_io that
doesn't want this accounting soon. Also drop the obsolete comment -
decrementing a counter in the on-stack writeback_control data structure
doesn't need the page lock.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
__extent_writepage_io is never called for compressed or inline extents,
or holes. Remove the not quite working code for them and replace it with
asserts that these cases don't happen.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that the btrfs writeback code has stopped using PageError, using
PAGE_SET_ERROR to just set the per-address_space error flag is confusing.
Open code the mapping_set_error calls in the callers and remove
the PAGE_SET_ERROR flag.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
PageError is not used by the VFS/MM and deprecated because it uses up a
page bit and has no coherent rules. Instead read errors are usually
propagated by not setting or clearing the uptodate bit, and write errors
are propagated through the address_space. Btrfs now only sets the flag
and never clears it for data pages, so just remove all places setting it,
and the subpage error bit.
Note that the error propagation for superblock writes that work on the
block device mapping still uses PageError for now, but that will be
addressed in a separate series.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
__extent_writepage currenly sets PageError whenever any error happens,
and the also checks for PageError to decide if to call error handling.
This leads to very unclear responsibility for cleaning up on errors.
In the VM and generic writeback helpers the basic idea is that once
I/O is fired off all error handling responsibility is delegated to the
end I/O handler. But if that end I/O handler sets the PageError bit,
and the submitter checks it, the bit could in some cases leak into the
submission context for fast enough I/O.
Fix this by simply not checking PageError and just using the local
ret variable to check for submission errors. This also fundamentally
solves the long problem documented in a comment in __extent_writepage
by never leaking the error bit into the submission context.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_verify_page is called from the readpage completion handler, which
is only used to read pages, or parts of pages that aren't uptodate yet.
The only case where PageError could be set on a page in btrfs is if we
had a previous writeback error, but in that case we won't called readpage
on it, as it has previously been marked uptodate.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Also clear the uptodate bit to make sure the page isn't seen as uptodate
in the page cache if fsverity verification fails.
Fixes: 146054090b ("btrfs: initial fsverity support")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Split all the conditionals for the fsverity calls in end_page_read into
a btrfs_verify_page helper to keep the code readable and make additional
refactoring easier.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The range_end field in struct writeback_control is inclusive, just like
the end parameter passed to extent_write_locked_range. Not doing this
could cause extra writeout, which is harmless but suboptimal.
Fixes: 771ed689d2 ("Btrfs: Optimize compressed writeback and reads")
CC: stable@vger.kernel.org # 5.9+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In function end_bio_extent_readpage() we call
endio_readpage_release_extent() to unlock the extent io tree.
However we pass PageUptodate(page) as @uptodate parameter for it, while
for previous end_page_read() call, we use a dedicated @uptodate local
variable.
This is not a big deal, as even for subpage cases, either the bio only
covers part of the page, then the @uptodate is always false, and the
subpage ranges can still be merged.
But for the sake of consistency, always use @uptodate variable when
possible.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently alloc_extent_buffer() would make the extent buffer uptodate if
the corresponding pages are also uptodate.
But this check is only checking PageUptodate, which is fine for regular
cases, but not for subpage cases, as we can have multiple extent buffers
in the same page.
So here we go btrfs_page_test_uptodate() instead.
The old code doesn't cause any problem, but is not efficient, as it
would cause extra metadata read even if the range is already uptodate.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is a bug report that assert_eb_page_uptodate() gets triggered for
free space tree metadata.
Without proper dump for the subpage bitmaps it's much harder to debug.
Thus this patch would dump all the subpage bitmaps (split them into
their own bitmaps) for a easier debugging.
The output would look like this:
(Dumped after a tree block got read from disk)
page:000000006e34bf49 refcount:4 mapcount:0 mapping:0000000067661ac4 index:0x1d1 pfn:0x110e9
memcg:ffff0000d7d62000
aops:btree_aops [btrfs] ino:1
flags: 0x8000000000002002(referenced|private|zone=2)
page_type: 0xffffffff()
raw: 8000000000002002 0000000000000000 dead000000000122 ffff00000188bed0
raw: 00000000000001d1 ffff0000c7992700 00000004ffffffff ffff0000d7d62000
page dumped because: btrfs subpage dump
BTRFS warning (device dm-1): start=30490624 len=16384 page=30474240 bitmaps: uptodate=4-7 error= dirty= writeback= ordered= checked=
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that all extent state bit helpers effectively take the GFP_NOFS mask
(and GFP_NOWAIT is encoded in the bits) we can remove the parameter.
This reduces stack consumption in many functions and simplifies a lot of
code.
Net effect on module on a release build:
text data bss dec hex filename
1250432 20985 16088 1287505 13a551 pre/btrfs.ko
1247074 20985 16088 1284147 139833 post/btrfs.ko
DELTA: -3358
Signed-off-by: David Sterba <dsterba@suse.com>
Most of the code in write_one_subpage_eb and write_one_eb is shared,
so merge the two functions into one.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of locking and unlocking every page or the extent, just add a
new EXTENT_BUFFER_READING bit that mirrors EXTENT_BUFFER_WRITEBACK
for synchronizing threads trying to read an extent_buffer and to wait
for I/O completion.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The only place that reads in pages and thus marks them uptodate for
the btree inode is read_extent_buffer_pages. Which means that either
pages are already uptodate from an old buffer when creating a new
one in alloc_extent_buffer, or they will be updated by ca call
to read_extent_buffer_pages. This means the checks for uptodate
pages in read_extent_buffer_pages and read_extent_buffer_subpage are
superfluous and can be removed.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
PageError is only used to limit the uptodate check in
assert_eb_page_uptodate. But we have a much more useful flag indicating
the exact condition we are about with the EXTENT_BUFFER_WRITE_ERR flag,
so use that instead and help the kernel toward eventually removing
PageError.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
No need to track the number of pages under I/O now that each
extent_buffer is read and written using a single bio. For the
read side we need to grab an extra reference for the duration of
the I/O to prevent eviction, though.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we always use a single bio to write an extent_buffer, the buffer
can be passed to the end_io handler as private data. This allows
to simplify the metadata write end I/O handler, and merge the subpage
end_io handler into the main one.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The btrfs_bio_ctrl machinery is overkill for writing extent_buffers
as we always operate on PAGE_SIZE chunks (or one smaller one for the
subpage case) that are contiguous and are guaranteed to fit into a
single bio. Replace it with open coded btrfs_bio_alloc, __bio_add_page
and btrfs_submit_bio calls.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Locking the pages in lock_extent_buffer_for_io only for the non-subpage
case is very confusing. Move it to write_one_eb to mirror the subpage
case and simplify the code. Now lock_extent_buffer_for_io does not leave
all the pages locked and each is individually locked/unlocked in
write_one_eb.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Stop trying to cluster writes of multiple extent_buffers into a single
bio. There is no need for that as the blk_plug mechanism used all the
way up in writeback_inodes_wb gives us the same I/O pattern even with
multiple bios. Removing the clustering simplifies
lock_extent_buffer_for_io a lot and will also allow passing the eb
as private data to the end I/O handler.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
lock_extent_buffer_for_io never returns a negative error value, so switch
the return value to a simple bool.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
[ keep noinline_for_stack ]
Signed-off-by: David Sterba <dsterba@suse.com>
Only subpage metadata reads lock the extent. Don't try to unlock it and
waste cycles in the extent tree lookup for PAGE_SIZE or larger metadata.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we always use a single bio to read an extent_buffer, the buffer
can be passed to the end_io handler as private data. This allows
implementing a much simplified dedicated end I/O handler for metadata
reads.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Given that read recovery for data I/O is handled in the storage layer,
the mirror_num argument to btrfs_submit_compressed_read is always 0,
so remove it.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The btrfs_bio_ctrl machinery is overkill for reading extent_buffers
as we always operate on PAGE_SIZE chunks (or one smaller one for the
subpage case) that are contiguous and are guaranteed to fit into a
single bio. Replace it with open coded btrfs_bio_alloc, __bio_add_page
and btrfs_submit_bio calls in a helper function shared between
the subpage and node size >= PAGE_SIZE cases.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently read_extent_buffer_pages skips pages that are already uptodate
when reading in an extent_buffer. While this reduces the amount of data
read, it increases the number of I/O operations as we now need to do
multiple I/Os when reading an extent buffer with one or more uptodate
pages in the middle of it. On any modern storage device, be that hard
drives or SSDs this actually decreases I/O performance. Fortunately
this case is pretty rare as the pages are always initially read together
and then aged the same way. Besides simplifying the code a bit as-is
this will allow for major simplifications to the I/O completion handler
later on.
Note that the case where all pages are uptodate is still handled by an
optimized fast path that does not read any data from disk.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Call btrfs_page_clear_uptodate instead of ClearPageUptodate to properly
manage the uptodate bit for the subpage case.
Reported-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
extent_buffer_under_io is only used in extent_io.c, so mark it static.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When btrfs_redirty_list_add redirties a buffer, it also acquires
an extra reference that is released on transaction commit. But
this is not required as buffers that are dirty or under writeback
are never freed (look for calls to extent_buffer_under_io())).
Remove the extra reference and the infrastructure used to drop it
again.
History behind redirty logic:
In the first place, it used releasing_list to hold all the
to-be-released extent buffers, and decided which buffers to re-dirty at
the commit time. Then, in a later version, the behaviour got changed to
re-dirty a necessary buffer and add re-dirtied one to the list in
btrfs_free_tree_block(). In short, the list was there mostly for the
patch series' historical reason.
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
[ add Naohiro's comment regarding history ]
Signed-off-by: David Sterba <dsterba@suse.com>
dirty_metadata_bytes is decremented in both places that clear the dirty
bit in a buffer, but only incremented in btrfs_mark_buffer_dirty, which
means that a buffer that is redirtied using btrfs_redirty_list_add won't
be added to dirty_metadata_bytes, but it will be subtracted when written
out, leading an inconsistency in the counter.
Move the dirty_metadata_bytes from btrfs_mark_buffer_dirty into
set_extent_buffer_dirty to also account for the redirty case, and remove
the now unused set_extent_buffer_dirty return value.
Fixes: d3575156f6 ("btrfs: zoned: redirty released extent buffers")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we're doing a lot of work for btrfs_bio:
- Checksum verification for data read bios
- Bio splits if it crosses stripe boundary
- Read repair for data read bios
However for the incoming scrub patches, we don't want this extra
functionality at all, just plain logical + mirror -> physical mapping
ability.
Thus here we do the following changes:
- Introduce btrfs_bio::fs_info
This is for the new scrub specific btrfs_bio, which would not populate
btrfs_bio::inode.
Thus we need such new member to grab a fs_info
This new member will always be populated.
- Replace @inode argument with @fs_info for btrfs_bio_init() and its
caller
Since @inode is no longer a mandatory member, replace it with
@fs_info, and let involved users populate @inode.
- Skip checksum verification and generation if @bbio->inode is NULL
- Add extra ASSERT()s
To make sure:
* bbio->inode is properly set for involved read repair path
* if @file_offset is set, bbio->inode is also populated
- Grab @fs_info from @bbio directly
We can no longer go @bbio->inode->root->fs_info, as bbio->inode can be
NULL. This involves:
* btrfs_simple_end_io()
* should_async_write()
* btrfs_wq_submit_bio()
* btrfs_use_zone_append()
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
REQ_CGROUP_PUNT is a bit annoying as it is hard to follow and adds
a branch to the bio submission hot path. To fix this, export
blkcg_punt_bio_submit and let btrfs call it directly. Add a new
REQ_FS_PRIVATE flag for btrfs to indicate to it's own low-level
bio submission code that a punt to the cgroup submission helper
is required.
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
punt_to_cgroup is only used by extent_write_locked_range, but that
function also directly controls the bio flags for the actual submission.
Remove th punt_to_cgroup field, and just set REQ_CGROUP_PUNT directly
in extent_write_locked_range.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Return the containing struct btrfs_bio instead of the less type safe
struct bio from btrfs_bio_alloc.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
The bio in struct btrfs_bio_ctrl must be a btrfs_bio, so store a pointer
to the btrfs_bio for better type checking.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
struct btrfs_bio now has an always valid inode pointer that can be used
to find the inode in submit_one_bio, so use that and initialize all
variables for which it is possible at declaration time.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_submit_compressed_read expects the bio passed to it to be embedded
into a btrfs_bio structure. Pass the btrfs_bio directly to increase type
safety and make the code self-documenting.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_submit_bio expects the bio passed to it to be embedded into a
btrfs_bio structure. Pass the btrfs_bio directly to increase type
safety and make the code self-documenting.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Nowadays calc_bio_boundaries() is a relatively simple function that only
guarantees the one bio equals to one ordered extent rule for uncompressed
Zone Append bios.
Sink it into it's only caller alloc_new_bio().
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
bio_add_page always adds either the entire range passed to it or nothing.
Based on that btrfs_bio_add_page can only return a length smaller than
the passed in one when hitting the ordered extent limit, which can only
happen for writes. Given that compressed writes never even use this code
path, this means that all the special cases for compressed extent offset
handling are dead code.
Reflow submit_extent_page to take advantage of this by inlining
btrfs_bio_add_page and handling the ordered extent limit by decrementing
it for each added range and thus significantly simplifying the loop.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Different loop iterations in btrfs_bio_add_page not only have the same
contiguity parameters, but also any non-initial operation operates on a
fresh bio anyway.
Factor out the contiguity check into a new btrfs_bio_is_contig and only
call it once in submit_extent_page before descending into the
bio_add_page loop.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Remove the has_error and saved_ret variables, and just jump to a goto
label for error handling from the only place returning an error from the
main loop.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
submit_extent_page always returns 0 since commit d5e4377d50 ("btrfs:
split zone append bios in btrfs_submit_bio"). Change it to a void return
type and remove all the unreachable error handling code in the callers.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Update the compress_type in the btrfs_bio_ctrl after forcing out the
previous bio in btrfs_do_readpage, so that alloc_new_bio can just use
the compress_type member in struct btrfs_bio_ctrl instead of passing the
same information redundantly as a function argument.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Rename this_bio_flag to compress_type to match the surrounding code
and better document the intent. Also use the proper enum type instead
of unsigned long.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The compress_type can only change on a per-extent basis. So instead of
checking it for every page in btrfs_bio_add_page, do the check once in
btrfs_do_readpage, which is the only caller of btrfs_bio_add_page and
submit_extent_page that deals with compressed extents.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of passing down the wbc pointer the deep call chain, just
add it to the btrfs_bio_ctrl structure.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The sync_io flag is equivalent to wbc->sync_mode == WB_SYNC_ALL, so
just check for that and remove the separate flag.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The bio op and flags never change over the life time of a bio_ctrl,
so move it in there instead of passing it down the deep call chain
all the way down to alloc_new_bio.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If force_bio_submit, submit_extent_page simply calls submit_one_bio as
the first thing. This can just be moved to the only caller that sets
force_bio_submit to true.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When read_extent_buffer_subpage calls submit_extent_page, it does
so on a freshly initialized btrfs_bio_ctrl structure that can't have
a valid bio to submit. Clear the force_bio_submit parameter to false
as there is nothing to submit.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Embed a btrfs_bio into struct compressed_bio. This avoids potential
(so far theoretical) deadlocks due to nesting of btrfs_bioset allocations
for the original read bio and the compressed bio, and avoids an extra
memory allocation in the I/O path.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
F_SEAL_EXEC") which permits the setting of the memfd execute bit at
memfd creation time, with the option of sealing the state of the X bit.
- Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset()
thread-safe for pmd unshare") which addresses a rare race condition
related to PMD unsharing.
- Several folioification patch serieses from Matthew Wilcox, Vishal
Moola, Sidhartha Kumar and Lorenzo Stoakes
- Johannes Weiner has a series ("mm: push down lock_page_memcg()") which
does perform some memcg maintenance and cleanup work.
- SeongJae Park has added DAMOS filtering to DAMON, with the series
"mm/damon/core: implement damos filter". These filters provide users
with finer-grained control over DAMOS's actions. SeongJae has also done
some DAMON cleanup work.
- Kairui Song adds a series ("Clean up and fixes for swap").
- Vernon Yang contributed the series "Clean up and refinement for maple
tree".
- Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series. It
adds to MGLRU an LRU of memcgs, to improve the scalability of global
reclaim.
- David Hildenbrand has added some userfaultfd cleanup work in the
series "mm: uffd-wp + change_protection() cleanups".
- Christoph Hellwig has removed the generic_writepages() library
function in the series "remove generic_writepages".
- Baolin Wang has performed some maintenance on the compaction code in
his series "Some small improvements for compaction".
- Sidhartha Kumar is doing some maintenance work on struct page in his
series "Get rid of tail page fields".
- David Hildenbrand contributed some cleanup, bugfixing and
generalization of pte management and of pte debugging in his series "mm:
support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with swap
PTEs".
- Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation
flag in the series "Discard __GFP_ATOMIC".
- Sergey Senozhatsky has improved zsmalloc's memory utilization with his
series "zsmalloc: make zspage chain size configurable".
- Joey Gouly has added prctl() support for prohibiting the creation of
writeable+executable mappings. The previous BPF-based approach had
shortcomings. See "mm: In-kernel support for memory-deny-write-execute
(MDWE)".
- Waiman Long did some kmemleak cleanup and bugfixing in the series
"mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF".
- T.J. Alumbaugh has contributed some MGLRU cleanup work in his series
"mm: multi-gen LRU: improve".
- Jiaqi Yan has provided some enhancements to our memory error
statistics reporting, mainly by presenting the statistics on a per-node
basis. See the series "Introduce per NUMA node memory error
statistics".
- Mel Gorman has a second and hopefully final shot at fixing a CPU-hog
regression in compaction via his series "Fix excessive CPU usage during
compaction".
- Christoph Hellwig does some vmalloc maintenance work in the series
"cleanup vfree and vunmap".
- Christoph Hellwig has removed block_device_operations.rw_page() in ths
series "remove ->rw_page".
- We get some maple_tree improvements and cleanups in Liam Howlett's
series "VMA tree type safety and remove __vma_adjust()".
- Suren Baghdasaryan has done some work on the maintainability of our
vm_flags handling in the series "introduce vm_flags modifier functions".
- Some pagemap cleanup and generalization work in Mike Rapoport's series
"mm, arch: add generic implementation of pfn_valid() for FLATMEM" and
"fixups for generic implementation of pfn_valid()"
- Baoquan He has done some work to make /proc/vmallocinfo and
/proc/kcore better represent the real state of things in his series
"mm/vmalloc.c: allow vread() to read out vm_map_ram areas".
- Jason Gunthorpe rationalized the GUP system's interface to the rest of
the kernel in the series "Simplify the external interface for GUP".
- SeongJae Park wishes to migrate people from DAMON's debugfs interface
over to its sysfs interface. To support this, we'll temporarily be
printing warnings when people use the debugfs interface. See the series
"mm/damon: deprecate DAMON debugfs interface".
- Andrey Konovalov provided the accurately named "lib/stackdepot: fixes
and clean-ups" series.
- Huang Ying has provided a dramatic reduction in migration's TLB flush
IPI rates with the series "migrate_pages(): batch TLB flushing".
- Arnd Bergmann has some objtool fixups in "objtool warning fixes".
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY/PoPQAKCRDdBJ7gKXxA
jlvpAPsFECUBBl20qSue2zCYWnHC7Yk4q9ytTkPB/MMDrFEN9wD/SNKEm2UoK6/K
DmxHkn0LAitGgJRS/W9w81yrgig9tAQ=
=MlGs
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- Daniel Verkamp has contributed a memfd series ("mm/memfd: add
F_SEAL_EXEC") which permits the setting of the memfd execute bit at
memfd creation time, with the option of sealing the state of the X
bit.
- Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset()
thread-safe for pmd unshare") which addresses a rare race condition
related to PMD unsharing.
- Several folioification patch serieses from Matthew Wilcox, Vishal
Moola, Sidhartha Kumar and Lorenzo Stoakes
- Johannes Weiner has a series ("mm: push down lock_page_memcg()")
which does perform some memcg maintenance and cleanup work.
- SeongJae Park has added DAMOS filtering to DAMON, with the series
"mm/damon/core: implement damos filter".
These filters provide users with finer-grained control over DAMOS's
actions. SeongJae has also done some DAMON cleanup work.
- Kairui Song adds a series ("Clean up and fixes for swap").
- Vernon Yang contributed the series "Clean up and refinement for maple
tree".
- Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series. It
adds to MGLRU an LRU of memcgs, to improve the scalability of global
reclaim.
- David Hildenbrand has added some userfaultfd cleanup work in the
series "mm: uffd-wp + change_protection() cleanups".
- Christoph Hellwig has removed the generic_writepages() library
function in the series "remove generic_writepages".
- Baolin Wang has performed some maintenance on the compaction code in
his series "Some small improvements for compaction".
- Sidhartha Kumar is doing some maintenance work on struct page in his
series "Get rid of tail page fields".
- David Hildenbrand contributed some cleanup, bugfixing and
generalization of pte management and of pte debugging in his series
"mm: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with
swap PTEs".
- Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation
flag in the series "Discard __GFP_ATOMIC".
- Sergey Senozhatsky has improved zsmalloc's memory utilization with
his series "zsmalloc: make zspage chain size configurable".
- Joey Gouly has added prctl() support for prohibiting the creation of
writeable+executable mappings.
The previous BPF-based approach had shortcomings. See "mm: In-kernel
support for memory-deny-write-execute (MDWE)".
- Waiman Long did some kmemleak cleanup and bugfixing in the series
"mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF".
- T.J. Alumbaugh has contributed some MGLRU cleanup work in his series
"mm: multi-gen LRU: improve".
- Jiaqi Yan has provided some enhancements to our memory error
statistics reporting, mainly by presenting the statistics on a
per-node basis. See the series "Introduce per NUMA node memory error
statistics".
- Mel Gorman has a second and hopefully final shot at fixing a CPU-hog
regression in compaction via his series "Fix excessive CPU usage
during compaction".
- Christoph Hellwig does some vmalloc maintenance work in the series
"cleanup vfree and vunmap".
- Christoph Hellwig has removed block_device_operations.rw_page() in
ths series "remove ->rw_page".
- We get some maple_tree improvements and cleanups in Liam Howlett's
series "VMA tree type safety and remove __vma_adjust()".
- Suren Baghdasaryan has done some work on the maintainability of our
vm_flags handling in the series "introduce vm_flags modifier
functions".
- Some pagemap cleanup and generalization work in Mike Rapoport's
series "mm, arch: add generic implementation of pfn_valid() for
FLATMEM" and "fixups for generic implementation of pfn_valid()"
- Baoquan He has done some work to make /proc/vmallocinfo and
/proc/kcore better represent the real state of things in his series
"mm/vmalloc.c: allow vread() to read out vm_map_ram areas".
- Jason Gunthorpe rationalized the GUP system's interface to the rest
of the kernel in the series "Simplify the external interface for
GUP".
- SeongJae Park wishes to migrate people from DAMON's debugfs interface
over to its sysfs interface. To support this, we'll temporarily be
printing warnings when people use the debugfs interface. See the
series "mm/damon: deprecate DAMON debugfs interface".
- Andrey Konovalov provided the accurately named "lib/stackdepot: fixes
and clean-ups" series.
- Huang Ying has provided a dramatic reduction in migration's TLB flush
IPI rates with the series "migrate_pages(): batch TLB flushing".
- Arnd Bergmann has some objtool fixups in "objtool warning fixes".
* tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (505 commits)
include/linux/migrate.h: remove unneeded externs
mm/memory_hotplug: cleanup return value handing in do_migrate_range()
mm/uffd: fix comment in handling pte markers
mm: change to return bool for isolate_movable_page()
mm: hugetlb: change to return bool for isolate_hugetlb()
mm: change to return bool for isolate_lru_page()
mm: change to return bool for folio_isolate_lru()
objtool: add UACCESS exceptions for __tsan_volatile_read/write
kmsan: disable ftrace in kmsan core code
kasan: mark addr_has_metadata __always_inline
mm: memcontrol: rename memcg_kmem_enabled()
sh: initialize max_mapnr
m68k/nommu: add missing definition of ARCH_PFN_OFFSET
mm: percpu: fix incorrect size in pcpu_obj_full_size()
maple_tree: reduce stack usage with gcc-9 and earlier
mm: page_alloc: call panic() when memoryless node allocation fails
mm: multi-gen LRU: avoid futile retries
migrate_pages: move THP/hugetlb migration support check to simplify code
migrate_pages: batch flushing TLB
migrate_pages: share more code between _unmap and _move
...
struct btrfs_bio has all the information needed for btrfs_use_append, so
pass that instead of a btrfs_inode and file_offset.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of digging into the bio_vec in submit_one_bio, set file_offset at
bio allocation time from the provided parameter. This also ensures that
the file_offset is available all the time when building up the bio
payload.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_ordered_extent->disk_bytenr can be rewritten by the zoned I/O
completion handler, and thus in general is not a good idea to limit I/O
size. But the maximum bio size calculation can easily be done using the
file_offset fields in the btrfs_ordered_extent and btrfs_bio structures,
so switch to that instead.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_clear_buffer_dirty just does the test_clear_bit() and then calls
clear_extent_buffer_dirty and does the dirty metadata accounting.
Combine this into clear_extent_buffer_dirty and make the result
btrfs_clear_buffer_dirty.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We only add if we set the extent buffer dirty, and we subtract when we
clear the extent buffer dirty. If we end up in set_btree_ioerr we have
already cleared the buffer dirty, and we aren't resetting dirty on the
extent buffer, so this is simply wrong.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The current btrfs zoned device support is a little cumbersome in the data
I/O path as it requires the callers to not issue I/O larger than the
supported ZONE_APPEND size of the underlying device. This leads to a lot
of extra accounting. Instead change btrfs_submit_bio so that it can take
write bios of arbitrary size and form from the upper layers, and just
split them internally to the ZONE_APPEND queue limits. Then remove all
the upper layer warts catering to limited write sized on zoned devices,
including the extra refcount in the compressed_bio.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Call btrfs_submit_bio and btrfs_submit_compressed_read directly from
submit_one_bio now that all additional functionality has moved into
btrfs_submit_bio.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Remove btrfs_bio_ctrl::len_to_stripe_boundary, so that buffer
I/O will no longer limit its bio size according to stripe length
now that btrfs_submit_bio can split bios at stripe boundaries.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
[hch: simplify calc_bio_boundaries a little more]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Move the code that splits the ordered extents and records the physical
location for them to the storage layer so that the higher level consumers
don't have to care about physical block numbers at all. This will also
allow to eventually remove accounting for the zone append write sizes in
the upper layer with a little bit more block layer work.
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
struct io_failure_record and the io_failure_tree tree are unused now,
so remove them. This in turn makes struct btrfs_inode smaller by 16
bytes.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently btrfs handles checksum validation and repair in the end I/O
handler for the btrfs_bio. This leads to a lot of duplicate code
plus issues with varying semantics or bugs, e.g.
- the until recently broken repair for compressed extents
- the fact that encoded reads validate the checksums but do not kick
of read repair
- the inconsistent checking of the BTRFS_FS_STATE_NO_CSUMS flag
This commit revamps the checksum validation and repair code to instead
work below the btrfs_submit_bio interfaces.
In case of a checksum failure (or a plain old I/O error), the repair
is now kicked off before the upper level ->end_io handler is invoked.
Progress of an in-progress repair is tracked by a small structure
that is allocated using a mempool for each original bio with failed
sectors, which holds a reference to the original bio. This new
structure is allocated using a mempool to guarantee forward progress
even under memory pressure. The mempool will be replenished when
the repair completes, just as the mempools backing the bios.
There is one significant behavior change here: If repair fails or
is impossible to start with, the whole bio will be failed to the
upper layer. This is the behavior that all I/O submitters except
for buffered I/O already emulated in their end_io handler. For
buffered I/O this now means that a large readahead request can
fail due to a single bad sector, but as readahead errors are ignored
the following readpage if the sector is actually accessed will
still be able to read. This also matches the I/O failure handling
in other file systems.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
All callers of btrfs_submit_bio that want to validate checksums
currently have to store a copy of the iter in the btrfs_bio. Move
the assignment into common code.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All btrfs_bio I/Os are associated with an inode. Add a pointer to that
inode, which will allow to simplify a lot of calling conventions, and
which will be needed in the I/O completion path in the future.
This grow the btrfs_bio structure by a pointer, but that grows will
be offset by the removal of the device pointer soon.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently fiemap does not take the inode's lock (VFS lock), it only locks
a file range in the inode's io tree. This however can lead to a deadlock
if we have a concurrent fsync on the file and fiemap code triggers a fault
when accessing the user space buffer with fiemap_fill_next_extent(). The
deadlock happens on the inode's i_mmap_lock semaphore, which is taken both
by fsync and btrfs_page_mkwrite(). This deadlock was recently reported by
syzbot and triggers a trace like the following:
task:syz-executor361 state:D stack:20264 pid:5668 ppid:5119 flags:0x00004004
Call Trace:
<TASK>
context_switch kernel/sched/core.c:5293 [inline]
__schedule+0x995/0xe20 kernel/sched/core.c:6606
schedule+0xcb/0x190 kernel/sched/core.c:6682
wait_on_state fs/btrfs/extent-io-tree.c:707 [inline]
wait_extent_bit+0x577/0x6f0 fs/btrfs/extent-io-tree.c:751
lock_extent+0x1c2/0x280 fs/btrfs/extent-io-tree.c:1742
find_lock_delalloc_range+0x4e6/0x9c0 fs/btrfs/extent_io.c:488
writepage_delalloc+0x1ef/0x540 fs/btrfs/extent_io.c:1863
__extent_writepage+0x736/0x14e0 fs/btrfs/extent_io.c:2174
extent_write_cache_pages+0x983/0x1220 fs/btrfs/extent_io.c:3091
extent_writepages+0x219/0x540 fs/btrfs/extent_io.c:3211
do_writepages+0x3c3/0x680 mm/page-writeback.c:2581
filemap_fdatawrite_wbc+0x11e/0x170 mm/filemap.c:388
__filemap_fdatawrite_range mm/filemap.c:421 [inline]
filemap_fdatawrite_range+0x175/0x200 mm/filemap.c:439
btrfs_fdatawrite_range fs/btrfs/file.c:3850 [inline]
start_ordered_ops fs/btrfs/file.c:1737 [inline]
btrfs_sync_file+0x4ff/0x1190 fs/btrfs/file.c:1839
generic_write_sync include/linux/fs.h:2885 [inline]
btrfs_do_write_iter+0xcd3/0x1280 fs/btrfs/file.c:1684
call_write_iter include/linux/fs.h:2189 [inline]
new_sync_write fs/read_write.c:491 [inline]
vfs_write+0x7dc/0xc50 fs/read_write.c:584
ksys_write+0x177/0x2a0 fs/read_write.c:637
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x3d/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f7d4054e9b9
RSP: 002b:00007f7d404fa2f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 00007f7d405d87a0 RCX: 00007f7d4054e9b9
RDX: 0000000000000090 RSI: 0000000020000000 RDI: 0000000000000006
RBP: 00007f7d405a51d0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 61635f65646f6e69
R13: 65646f7475616f6e R14: 7261637369646f6e R15: 00007f7d405d87a8
</TASK>
INFO: task syz-executor361:5697 blocked for more than 145 seconds.
Not tainted 6.2.0-rc3-syzkaller-00376-g7c6984405241 #0
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:syz-executor361 state:D stack:21216 pid:5697 ppid:5119 flags:0x00004004
Call Trace:
<TASK>
context_switch kernel/sched/core.c:5293 [inline]
__schedule+0x995/0xe20 kernel/sched/core.c:6606
schedule+0xcb/0x190 kernel/sched/core.c:6682
rwsem_down_read_slowpath+0x5f9/0x930 kernel/locking/rwsem.c:1095
__down_read_common+0x54/0x2a0 kernel/locking/rwsem.c:1260
btrfs_page_mkwrite+0x417/0xc80 fs/btrfs/inode.c:8526
do_page_mkwrite+0x19e/0x5e0 mm/memory.c:2947
wp_page_shared+0x15e/0x380 mm/memory.c:3295
handle_pte_fault mm/memory.c:4949 [inline]
__handle_mm_fault mm/memory.c:5073 [inline]
handle_mm_fault+0x1b79/0x26b0 mm/memory.c:5219
do_user_addr_fault+0x69b/0xcb0 arch/x86/mm/fault.c:1428
handle_page_fault arch/x86/mm/fault.c:1519 [inline]
exc_page_fault+0x7a/0x110 arch/x86/mm/fault.c:1575
asm_exc_page_fault+0x22/0x30 arch/x86/include/asm/idtentry.h:570
RIP: 0010:copy_user_short_string+0xd/0x40 arch/x86/lib/copy_user_64.S:233
Code: 74 0a 89 (...)
RSP: 0018:ffffc9000570f330 EFLAGS: 00050202
RAX: ffffffff843e6601 RBX: 00007fffffffefc8 RCX: 0000000000000007
RDX: 0000000000000000 RSI: ffffc9000570f3e0 RDI: 0000000020000120
RBP: ffffc9000570f490 R08: 0000000000000000 R09: fffff52000ae1e83
R10: fffff52000ae1e83 R11: 1ffff92000ae1e7c R12: 0000000000000038
R13: ffffc9000570f3e0 R14: 0000000020000120 R15: ffffc9000570f3e0
copy_user_generic arch/x86/include/asm/uaccess_64.h:37 [inline]
raw_copy_to_user arch/x86/include/asm/uaccess_64.h:58 [inline]
_copy_to_user+0xe9/0x130 lib/usercopy.c:34
copy_to_user include/linux/uaccess.h:169 [inline]
fiemap_fill_next_extent+0x22e/0x410 fs/ioctl.c:144
emit_fiemap_extent+0x22d/0x3c0 fs/btrfs/extent_io.c:3458
fiemap_process_hole+0xa00/0xad0 fs/btrfs/extent_io.c:3716
extent_fiemap+0xe27/0x2100 fs/btrfs/extent_io.c:3922
btrfs_fiemap+0x172/0x1e0 fs/btrfs/inode.c:8209
ioctl_fiemap fs/ioctl.c:219 [inline]
do_vfs_ioctl+0x185b/0x2980 fs/ioctl.c:810
__do_sys_ioctl fs/ioctl.c:868 [inline]
__se_sys_ioctl+0x83/0x170 fs/ioctl.c:856
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x3d/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f7d4054e9b9
RSP: 002b:00007f7d390d92f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007f7d405d87b0 RCX: 00007f7d4054e9b9
RDX: 0000000020000100 RSI: 00000000c020660b RDI: 0000000000000005
RBP: 00007f7d405a51d0 R08: 00007f7d390d9700 R09: 0000000000000000
R10: 00007f7d390d9700 R11: 0000000000000246 R12: 61635f65646f6e69
R13: 65646f7475616f6e R14: 7261637369646f6e R15: 00007f7d405d87b8
</TASK>
What happens is the following:
1) Task A is doing an fsync, enters btrfs_sync_file() and flushes delalloc
before locking the inode and the i_mmap_lock semaphore, that is, before
calling btrfs_inode_lock();
2) After task A flushes delalloc and before it calls btrfs_inode_lock(),
another task dirties a page;
3) Task B starts a fiemap without FIEMAP_FLAG_SYNC, so the page dirtied
at step 2 remains dirty and unflushed. Then when it enters
extent_fiemap() and it locks a file range that includes the range of
the page dirtied in step 2;
4) Task A calls btrfs_inode_lock() and locks the inode (VFS lock) and the
inode's i_mmap_lock semaphore in write mode. Then it tries to flush
delalloc by calling start_ordered_ops(), which will block, at
find_lock_delalloc_range(), when trying to lock the range of the page
dirtied at step 2, since this range was locked by the fiemap task (at
step 3);
5) Task B generates a page fault when accessing the user space fiemap
buffer with a call to fiemap_fill_next_extent().
The fault handler needs to call btrfs_page_mkwrite() for some other
page of our inode, and there we deadlock when trying to lock the
inode's i_mmap_lock semaphore in read mode, since the fsync task locked
it in write mode (step 4) and the fsync task can not progress because
it's waiting to lock a file range that is currently locked by us (the
fiemap task, step 3).
Fix this by taking the inode's lock (VFS lock) in shared mode when
entering fiemap. This effectively serializes fiemap with fsync (except the
most expensive part of fsync, the log sync), preventing this deadlock.
Reported-by: syzbot+cc35f55c41e34c30dcb5@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/00000000000032dc7305f2a66f46@google.com/
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Convert function to use folios throughout. This is in preparation for the
removal of find_get_pages_range_tag(). Now also supports large folios.
Link: https://lkml.kernel.org/r/20230104211448.4804-8-vishal.moola@gmail.com
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Convert function to use folios throughout. This is in preparation for the
removal of find_get_pages_range_tag().
Link: https://lkml.kernel.org/r/20230104211448.4804-7-vishal.moola@gmail.com
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
[BUG]
There is a bug report that on a RAID0 NVMe btrfs system, under heavy
write load the filesystem can flip RO randomly.
With extra debugging, it shows some tree blocks failed to pass their
level checks, and if that happens at critical path of a transaction, we
abort the transaction:
BTRFS error (device nvme0n1p3): level verify failed on logical 5446121209856 mirror 1 wanted 0 found 1
BTRFS error (device nvme0n1p3: state A): Transaction aborted (error -5)
BTRFS: error (device nvme0n1p3: state A) in btrfs_finish_ordered_io:3343: errno=-5 IO failure
BTRFS info (device nvme0n1p3: state EA): forced readonly
[CAUSE]
The reporter has already bisected to commit 947a629988 ("btrfs: move
tree block parentness check into validate_extent_buffer()").
And with extra debugging, it shows we can have btrfs_tree_parent_check
filled with all zeros in the following call trace:
submit_one_bio+0xd4/0xe0
submit_extent_page+0x142/0x550
read_extent_buffer_pages+0x584/0x9c0
? __pfx_end_bio_extent_readpage+0x10/0x10
? folio_unlock+0x1d/0x50
btrfs_read_extent_buffer+0x98/0x150
read_tree_block+0x43/0xa0
read_block_for_search+0x266/0x370
btrfs_search_slot+0x351/0xd30
? lock_is_held_type+0xe8/0x140
btrfs_lookup_csum+0x63/0x150
btrfs_csum_file_blocks+0x197/0x6c0
? sched_clock_cpu+0x9f/0xc0
? lock_release+0x14b/0x440
? _raw_read_unlock+0x29/0x50
btrfs_finish_ordered_io+0x441/0x860
btrfs_work_helper+0xfe/0x400
? lock_is_held_type+0xe8/0x140
process_one_work+0x294/0x5b0
worker_thread+0x4f/0x3a0
? __pfx_worker_thread+0x10/0x10
kthread+0xf5/0x120
? __pfx_kthread+0x10/0x10
ret_from_fork+0x2c/0x50
Currently we only copy the btrfs_tree_parent_check structure into bbio
at read_extent_buffer_pages() after we have assembled the bbio.
But as shown above, submit_extent_page() itself can already submit the
bbio, leaving the bbio->parent_check uninitialized, and cause the false
alert.
[FIX]
Instead of copying @check into bbio after bbio is assembled, we pass
@check in btrfs_bio_ctrl::parent_check, and copy the content of
parent_check in submit_one_bio() for metadata read.
By this we should be able to pass the needed info for metadata endio
verification, and fix the false alert.
Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CABXGCsNzVxo4iq-tJSGm_kO1UggHXgq6CdcHDL=z5FL4njYXSQ@mail.gmail.com/
Fixes: 947a629988 ("btrfs: move tree block parentness check into validate_extent_buffer()")
Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is simply the same thing as btrfs_item_nr_offset(leaf, 0), so
remove this helper and replace it's usage with the above statement.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is a change needed for extent tree v2, as we will be growing the
header size. This exists in btrfs-progs currently, and not having it
makes syncing accessors.[ch] more problematic. So make this change to
set us up for extent tree v2 and match what btrfs-progs does to make
syncing easier.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is actually a change for extent tree v2, but it exists in
btrfs-progs but not in the kernel. This makes it annoying to sync
accessors.h with btrfs-progs, and since this is the way I need it for
extent-tree v2 simply update these helpers to take the extent buffer in
order to make syncing possible now, and make the extent tree v2 stuff
easier moving forward.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is only used in ctree.c, with the exception of zero'ing out extent
buffers we're getting ready to write out. In theory we shouldn't have
an extent buffer with 0 items that we're writing out, however I'd rather
be safe than sorry so open code it in extent_io.c, and then copy the
helper into ctree.c. This will make it easier to sync accessors.[ch]
into btrfs-progs, as this requires a helper that isn't defined in
accessors.h.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
repair_io_failure ties directly into all the glory low-level details of
mapping a bio with a logic address to the actual physical location.
Move it right below btrfs_submit_bio to keep all the related logic
together.
Also move btrfs_repair_eb_io_failure to its caller in disk-io.c now that
repair_io_failure is available in a header.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The code used by btrfs_submit_bio only interacts with the rest of
volumes.c through __btrfs_map_block (which itself is a more generic
version of two exported helpers) and does not really have anything
to do with volumes.c. Create a new bio.c file and a bio.h header
going along with it for the btrfs_bio-based storage layer, which
will grow even more going forward.
Also update the file with my copyright notice given that a large
part of the moved code was written or rewritten by me.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
If dev-replace failed to re-construct its data/metadata, the kernel
message would be incorrect for the missing device:
BTRFS info (device dm-1): dev_replace from <missing disk> (devid 2) to /dev/mapper/test-scratch2 started
BTRFS error (device dm-1): failed to rebuild valid logical 38862848 for dev (efault)
Note the above "dev (efault)" of the second line.
While the first line is properly reporting "<missing disk>".
[CAUSE]
Although dev-replace is using btrfs_dev_name(), the heavy lifting work
is still done by scrub (scrub is reused by both dev-replace and regular
scrub).
Unfortunately scrub code never uses btrfs_dev_name() helper, as it's
only declared locally inside dev-replace.c.
[FIX]
Fix the output by:
- Move the btrfs_dev_name() helper to volumes.h
- Use btrfs_dev_name() to replace open-coded rcu_str_deref() calls
Only zoned code is not touched, as I'm not familiar with degraded
zoned code.
- Constify return value and parameter
Now the output looks pretty sane:
BTRFS info (device dm-1): dev_replace from <missing disk> (devid 2) to /dev/mapper/test-scratch2 started
BTRFS error (device dm-1): failed to rebuild valid logical 38862848 for dev <missing disk>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
During fiemap, whenever we find a hole or prealloc extent, we will look
for delalloc in that range, and one of the things we do for that is to
find out ranges in the inode's io_tree marked with EXTENT_DELALLOC, using
calls to count_range_bits().
Since we process file extents from left to right, if we have a file with
several holes or prealloc extents, we benefit from keeping a cached extent
state record for calls to count_range_bits(). Most of the time the last
extent state record we visited in one call to count_range_bits() matches
the first extent state record we will use in the next call to
count_range_bits(), so there's a benefit here. So use an extent state
record to cache results from count_range_bits() calls during fiemap.
This change is part of a patchset that has the goal to make performance
better for applications that use lseek's SEEK_HOLE and SEEK_DATA modes to
iterate over the extents of a file. Two examples are the cp program from
coreutils 9.0+ and the tar program (when using its --sparse / -S option).
A sample test and results are listed in the changelog of the last patch
in the series:
1/9 btrfs: remove leftover setting of EXTENT_UPTODATE state in an inode's io_tree
2/9 btrfs: add an early exit when searching for delalloc range for lseek/fiemap
3/9 btrfs: skip unnecessary delalloc searches during lseek/fiemap
4/9 btrfs: search for delalloc more efficiently during lseek/fiemap
5/9 btrfs: remove no longer used btrfs_next_extent_map()
6/9 btrfs: allow passing a cached state record to count_range_bits()
7/9 btrfs: update stale comment for count_range_bits()
8/9 btrfs: use cached state when looking for delalloc ranges with fiemap
9/9 btrfs: use cached state when looking for delalloc ranges with lseek
Reported-by: Wang Yugui <wangyugui@e16-tech.com>
Link: https://lore.kernel.org/linux-btrfs/20221106073028.71F9.409509F4@e16-tech.com/
Link: https://lore.kernel.org/linux-btrfs/CAL3q7H5NSVicm7nYBJ7x8fFkDpno8z3PYt5aPU43Bajc1H0h1Q@mail.gmail.com/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We don't need to set the EXTENT_UPDATE bit in an inode's io_tree to mark a
range as uptodate, we rely on the pages themselves being uptodate - page
reading is not triggered for already uptodate pages. Recently we removed
most use of the EXTENT_UPTODATE for buffered IO with commit 52b029f427
("btrfs: remove unnecessary EXTENT_UPTODATE state in buffered I/O path"),
but there were a few leftovers, namely when reading from holes and
successfully finishing read repair.
These leftovers are unnecessarily making an inode's tree larger and deeper,
slowing down searches on it. So remove all the leftovers.
This change is part of a patchset that has the goal to make performance
better for applications that use lseek's SEEK_HOLE and SEEK_DATA modes to
iterate over the extents of a file. Two examples are the cp program from
coreutils 9.0+ and the tar program (when using its --sparse / -S option).
A sample test and results are listed in the changelog of the last patch
in the series:
1/9 btrfs: remove leftover setting of EXTENT_UPTODATE state in an inode's io_tree
2/9 btrfs: add an early exit when searching for delalloc range for lseek/fiemap
3/9 btrfs: skip unnecessary delalloc searches during lseek/fiemap
4/9 btrfs: search for delalloc more efficiently during lseek/fiemap
5/9 btrfs: remove no longer used btrfs_next_extent_map()
6/9 btrfs: allow passing a cached state record to count_range_bits()
7/9 btrfs: update stale comment for count_range_bits()
8/9 btrfs: use cached state when looking for delalloc ranges with fiemap
9/9 btrfs: use cached state when looking for delalloc ranges with lseek
Reported-by: Wang Yugui <wangyugui@e16-tech.com>
Link: https://lore.kernel.org/linux-btrfs/20221106073028.71F9.409509F4@e16-tech.com/
Link: https://lore.kernel.org/linux-btrfs/CAL3q7H5NSVicm7nYBJ7x8fFkDpno8z3PYt5aPU43Bajc1H0h1Q@mail.gmail.com/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BACKGROUND]
Although both btrfs metadata and data has their read time verification
done at endio time (btrfs_validate_metadata_buffer() and
btrfs_verify_data_csum()), metadata has extra verification, mostly
parentness check including first key/transid/owner_root/level, done at
read_tree_block() and btrfs_read_extent_buffer().
On the other hand, all the data verification is done at endio context.
[ENHANCEMENT]
This patch will make a new union in btrfs_bio, taking the space of the
old data checksums, thus it will not increase the memory usage.
With that extra btrfs_tree_parent_check inside btrfs_bio, we can just
pass the check parameter into read_extent_buffer_pages(), and before
submitting the bio, we can copy the check structure into btrfs_bio.
And finally at endio time, we can grab btrfs_bio::parent_check and pass
it to validate_extent_buffer(), to move the remaining checks into it.
This brings the following benefits:
- Much simpler btrfs_read_extent_buffer()
Now it only needs to iterate through all mirrors.
- Simpler read-time transid check
Previously we go verify_parent_transid() after reading out the extent
buffer.
Now the transid check is done inside the endio function, no other
code can modify the content.
Thus no need to use the extent lock anymore.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function is for internal interfaces so we should use the
btrfs_inode.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function is for internal interfaces so we should use the
btrfs_inode.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function is for internal interfaces so we should use the
btrfs_inode.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function is for internal interfaces so we should use the
btrfs_inode.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function is for internal interfaces so we should use the
btrfs_inode.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function is for internal interfaces so we should use the
btrfs_inode.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function is for internal interfaces so we should use the
btrfs_inode.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Compression and direct io don't work together so the compression
parameter can be dropped after previous patch that changed the call
to direct.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's a function pointer passed to btrfs_repair_one_sector that will
submit the right bio for repair. However there are only two callbacks,
for buffered and for direct IO. This can be simplified to a bool-based
switch and call either function, indirect calls in this case is an
unnecessary abstraction. This allows to remove the submit_bio_hook_t
typedef.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The two structures appear on the same call paths, btrfs_bio_ctrl is
embedded in extent_page_data and we pass bio_ctrl to some functions.
After merging there are fewer indirections and we have only one control
structure. The packing remains same.
The btrfs_bio_ctrl was selected as the target structure as the operation
is closer to bio processing.
Structure layout:
struct btrfs_bio_ctrl {
struct bio * bio; /* 0 8 */
int mirror_num; /* 8 4 */
enum btrfs_compression_type compress_type; /* 12 4 */
u32 len_to_stripe_boundary; /* 16 4 */
u32 len_to_oe_boundary; /* 20 4 */
btrfs_bio_end_io_t end_io_func; /* 24 8 */
bool extent_locked; /* 32 1 */
bool sync_io; /* 33 1 */
/* size: 40, cachelines: 1, members: 8 */
/* padding: 6 */
/* last cacheline: 40 bytes */
};
Signed-off-by: David Sterba <dsterba@suse.com>
The semantics of the two members is a boolean, so change the type
accordingly. We have space in extent_page_data due to alignment there's
no change in size.
Signed-off-by: David Sterba <dsterba@suse.com>
This will make syncing fs.h to user space a little easier if we can pull
the super block specific helpers out of fs.h and put them in super.h.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We already have a dev-replace.h, simply move these prototypes and
helpers into dev-replace.h where they belong.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Move these out of ctree.h into file.h to cut down on code in ctree.h.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Move these prototypes out of ctree.h and into file-item.h.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Update, reformat or reword function comments. This also removes the kdoc
marker so we don't get reports when the function name is missing.
Changes made:
- remove kdoc markers
- reformat the brief description to be a proper sentence
- reword to imperative voice
- align parameter list
- fix typos
Signed-off-by: David Sterba <dsterba@suse.com>
This is a large patch, but because they're all macros it's impossible to
split up. Simply copy all of the item accessors in ctree.h and paste
them in accessors.h, and then update any files to include the header so
everything compiles.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ reformat comments, style fixups ]
Signed-off-by: David Sterba <dsterba@suse.com>
We're going to use fs.h to hold fs wide related helpers and definitions,
move the FS_STATE enum and related helpers to fs.h, and then update all
files that need these definitions to include fs.h.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have done read endio in an async thread for a very, very long time,
which makes the use of GFP_ATOMIC and unlock_extent_atomic() unneeded in
our read endio path. We've noticed under heavy memory pressure in our
fleet that we can fail these allocations, and then often trip a
BUG_ON(!allocation), which isn't an ideal outcome. Begin to address
this by simply not using GFP_ATOMIC, which will allow us to do things
like actually allocate a extent state when doing
set_extent_bits(UPTODATE) in the endio handler.
End io handlers are not called in atomic context, besides we have been
allocating failrec with GFP_NOFS so we'd notice there's a problem.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
During fiemap, when determining if a data extent is shared or not, if we
don't find the extent is directly shared, then we need to determine if
it's shared through subtrees. For that we need to resolve the indirect
reference we found in order to figure out the path in the inode's fs tree,
which is a path starting at the fs tree's root node and going down to the
leaf that contains the file extent item that points to the data extent.
We then proceed to determine if any extent buffer in that path is shared
with other trees or not.
Currently whenever we find the data extent that a file extent item points
to is not directly shared, we always resolve the path in the fs tree, and
then check if any extent buffer in the path is shared. This is a lot of
work and when we have file extent items that belong to the same leaf, we
have the same path, so we only need to calculate it once.
This change does that, it keeps track of the current and previous leaf,
and when we find that a data extent is not directly shared, we try to
compute the fs tree path only once and then use it for every other file
extent item in the same leaf, using the existing cached path result for
the leaf as long as the cache results are valid.
This saves us from doing expensive b+tree searches in the fs tree of our
target inode, as well as other minor work.
The following test was run on a non-debug kernel (Debian's default kernel
config):
$ cat test-with-snapshots.sh
#!/bin/bash
DEV=/dev/sdi
MNT=/mnt/sdi
umount $DEV &> /dev/null
mkfs.btrfs -f $DEV
# Use compression to quickly create files with a lot of extents
# (each with a size of 128K).
mount -o compress=lzo $DEV $MNT
# 40G gives 327680 extents, each with a size of 128K.
xfs_io -f -c "pwrite -S 0xab -b 1M 0 40G" $MNT/foobar
# Add some more files to increase the size of the fs and extent
# trees (in the real world there's a lot of files and extents
# from other files).
xfs_io -f -c "pwrite -S 0xcd -b 1M 0 20G" $MNT/file1
xfs_io -f -c "pwrite -S 0xef -b 1M 0 20G" $MNT/file2
xfs_io -f -c "pwrite -S 0x73 -b 1M 0 20G" $MNT/file3
# Create a snapshot so all the extents become indirectly shared
# through subtrees, with a generation less than or equals to the
# generation used to create the snapshot.
btrfs subvolume snapshot -r $MNT $MNT/snap1
umount $MNT
mount -o compress=lzo $DEV $MNT
start=$(date +%s%N)
filefrag $MNT/foobar
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "fiemap took $dur milliseconds (metadata not cached)"
echo
start=$(date +%s%N)
filefrag $MNT/foobar
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "fiemap took $dur milliseconds (metadata cached)"
umount $MNT
Result before applying this patch:
(...)
/mnt/sdi/foobar: 327680 extents found
fiemap took 1204 milliseconds (metadata not cached)
/mnt/sdi/foobar: 327680 extents found
fiemap took 729 milliseconds (metadata cached)
Result after applying this patch:
(...)
/mnt/sdi/foobar: 327680 extents found
fiemap took 732 milliseconds (metadata not cached)
/mnt/sdi/foobar: 327680 extents found
fiemap took 421 milliseconds (metadata cached)
That's a -46.1% total reduction for the metadata not cached case, and
a -42.2% reduction for the cached metadata case.
The test is somewhat limited in the sense the gains may be higher in
practice, because in the test the filesystem is small, so we have small
fs and extent trees, plus there's no concurrent access to the trees as
well, therefore no lock contention there.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When calling btrfs_is_data_extent_shared() we pass two ulists that were
allocated by the caller. This is because the single caller, fiemap, calls
btrfs_is_data_extent_shared() multiple times and the ulists can be reused,
instead of allocating new ones before each call and freeing them after
each call.
Now that we have a context structure/object that we pass to
btrfs_is_data_extent_shared(), we can move those ulists to it, and hide
their allocation and the context's allocation in a helper function, as
well as the freeing of the ulists and the context object. This allows to
reduce the number of parameters passed to btrfs_is_data_extent_shared(),
the need to pass the ulists from extent_fiemap() to fiemap_process_hole()
and having the caller deal with allocating and releasing the ulists.
Also rename one of the ulists from 'tmp' / 'tmp_ulist' to 'refs', since
that's a much better name as it reflects what the list is used for (and
matching the argument name for find_parent_nodes()).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Right now we are using a struct btrfs_backref_shared_cache to pass state
across multiple btrfs_is_data_extent_shared() calls. The structure's name
closely follows its current purpose, which is to cache previous checks
for the sharedness of metadata extents. However we will start using the
structure for more things other than caching sharedness checks, so rename
it to struct btrfs_backref_share_check_ctx.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we pass a root and an inode number as arguments for
btrfs_is_data_extent_shared() and the inode number is always from an
inode that belongs to that root (it wouldn't make sense otherwise).
In every context that we call btrfs_is_data_extent_shared() (fiemap only),
we have an inode available, so directly pass the inode to the function
instead of a root and inode number. This reduces the number of parameters
and it makes the function's signature conform to most other functions we
have.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When allocating an extent buffer, at __alloc_extent_buffer(), there's no
point in explicitly assigning zero to the bflags field of the new extent
buffer because we allocated it with kmem_cache_zalloc().
So just remove the redundant initialization, it saves one mov instruction
in the generated assembly code for x86_64 ("movq $0x0,0x10(%rax)").
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At btrfs_clone_extent_buffer(), before allocating the pages array for the
new extent buffer we are calling memset() to zero out the pages array of
the extent buffer. This is pointless however, because the extent buffer
already has every element in its pages array pointing to NULL, as it was
allocated with kmem_cache_zalloc(). The memset() was introduced with
commit dd137dd1f2 ("btrfs: factor out allocating an array of pages"),
but even before that commit we already depended on the pages array being
initialized to NULL for the error paths that need to call
btrfs_release_extent_buffer().
So remove the memset(), it's useless and slightly increases the object
text size.
Before this change:
$ size fs/btrfs/extent_io.o
text data bss dec hex filename
70580 5469 40 76089 12939 fs/btrfs/extent_io.o
After this change:
$ size fs/btrfs/extent_io.o
text data bss dec hex filename
70564 5469 40 76073 12929 fs/btrfs/extent_io.o
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We don't use a cached state here at all, which generally makes sense as
async reads are going to unlock at endio time. However for blocking
reads we will call wait_extent_bit() for our range. Since the
lock_extent() stuff will return the cached_state for the start of the
range this is a helpful optimization to have for this case, we'll have
the exact state we want to wait on. Add a cached state here and simply
throw it away if we're a non-blocking read, otherwise we'll get a small
improvement by eliminating some tree searches.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently if we fail to lock a range we'll return the start of the range
that we failed to lock. We'll then search down to this range and wait
on any extent states in this range.
However we can avoid this search altogether if we simply cache the
extent_state that had the contention. We can pass this into
wait_extent_bit() and start from that extent_state without doing the
search. In the most optimistic case we can avoid all searches, more
likely we'll avoid the initial search and have to perform the search
after we wait on the failed state, or worst case we must search both
times which is what currently happens.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With nowait becoming more pervasive throughout our codebase go ahead and
add a cached_state to try_lock_extent(). This allows us to be faster
about clearing the locked area if we have contention, and then gives us
the same optimization for unlock if we are able to lock the range.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
linux-next for a couple of months without, to my knowledge, any negative
reports (or any positive ones, come to that).
- Also the Maple Tree from Liam R. Howlett. An overlapping range-based
tree for vmas. It it apparently slight more efficient in its own right,
but is mainly targeted at enabling work to reduce mmap_lock contention.
Liam has identified a number of other tree users in the kernel which
could be beneficially onverted to mapletrees.
Yu Zhao has identified a hard-to-hit but "easy to fix" lockdep splat
(https://lkml.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com).
This has yet to be addressed due to Liam's unfortunately timed
vacation. He is now back and we'll get this fixed up.
- Dmitry Vyukov introduces KMSAN: the Kernel Memory Sanitizer. It uses
clang-generated instrumentation to detect used-unintialized bugs down to
the single bit level.
KMSAN keeps finding bugs. New ones, as well as the legacy ones.
- Yang Shi adds a userspace mechanism (madvise) to induce a collapse of
memory into THPs.
- Zach O'Keefe has expanded Yang Shi's madvise(MADV_COLLAPSE) to support
file/shmem-backed pages.
- userfaultfd updates from Axel Rasmussen
- zsmalloc cleanups from Alexey Romanov
- cleanups from Miaohe Lin: vmscan, hugetlb_cgroup, hugetlb and memory-failure
- Huang Ying adds enhancements to NUMA balancing memory tiering mode's
page promotion, with a new way of detecting hot pages.
- memcg updates from Shakeel Butt: charging optimizations and reduced
memory consumption.
- memcg cleanups from Kairui Song.
- memcg fixes and cleanups from Johannes Weiner.
- Vishal Moola provides more folio conversions
- Zhang Yi removed ll_rw_block() :(
- migration enhancements from Peter Xu
- migration error-path bugfixes from Huang Ying
- Aneesh Kumar added ability for a device driver to alter the memory
tiering promotion paths. For optimizations by PMEM drivers, DRM
drivers, etc.
- vma merging improvements from Jakub Matěn.
- NUMA hinting cleanups from David Hildenbrand.
- xu xin added aditional userspace visibility into KSM merging activity.
- THP & KSM code consolidation from Qi Zheng.
- more folio work from Matthew Wilcox.
- KASAN updates from Andrey Konovalov.
- DAMON cleanups from Kaixu Xia.
- DAMON work from SeongJae Park: fixes, cleanups.
- hugetlb sysfs cleanups from Muchun Song.
- Mike Kravetz fixes locking issues in hugetlbfs and in hugetlb core.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY0HaPgAKCRDdBJ7gKXxA
joPjAQDZ5LlRCMWZ1oxLP2NOTp6nm63q9PWcGnmY50FjD/dNlwEAnx7OejCLWGWf
bbTuk6U2+TKgJa4X7+pbbejeoqnt5QU=
=xfWx
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- Yu Zhao's Multi-Gen LRU patches are here. They've been under test in
linux-next for a couple of months without, to my knowledge, any
negative reports (or any positive ones, come to that).
- Also the Maple Tree from Liam Howlett. An overlapping range-based
tree for vmas. It it apparently slightly more efficient in its own
right, but is mainly targeted at enabling work to reduce mmap_lock
contention.
Liam has identified a number of other tree users in the kernel which
could be beneficially onverted to mapletrees.
Yu Zhao has identified a hard-to-hit but "easy to fix" lockdep splat
at [1]. This has yet to be addressed due to Liam's unfortunately
timed vacation. He is now back and we'll get this fixed up.
- Dmitry Vyukov introduces KMSAN: the Kernel Memory Sanitizer. It uses
clang-generated instrumentation to detect used-unintialized bugs down
to the single bit level.
KMSAN keeps finding bugs. New ones, as well as the legacy ones.
- Yang Shi adds a userspace mechanism (madvise) to induce a collapse of
memory into THPs.
- Zach O'Keefe has expanded Yang Shi's madvise(MADV_COLLAPSE) to
support file/shmem-backed pages.
- userfaultfd updates from Axel Rasmussen
- zsmalloc cleanups from Alexey Romanov
- cleanups from Miaohe Lin: vmscan, hugetlb_cgroup, hugetlb and
memory-failure
- Huang Ying adds enhancements to NUMA balancing memory tiering mode's
page promotion, with a new way of detecting hot pages.
- memcg updates from Shakeel Butt: charging optimizations and reduced
memory consumption.
- memcg cleanups from Kairui Song.
- memcg fixes and cleanups from Johannes Weiner.
- Vishal Moola provides more folio conversions
- Zhang Yi removed ll_rw_block() :(
- migration enhancements from Peter Xu
- migration error-path bugfixes from Huang Ying
- Aneesh Kumar added ability for a device driver to alter the memory
tiering promotion paths. For optimizations by PMEM drivers, DRM
drivers, etc.
- vma merging improvements from Jakub Matěn.
- NUMA hinting cleanups from David Hildenbrand.
- xu xin added aditional userspace visibility into KSM merging
activity.
- THP & KSM code consolidation from Qi Zheng.
- more folio work from Matthew Wilcox.
- KASAN updates from Andrey Konovalov.
- DAMON cleanups from Kaixu Xia.
- DAMON work from SeongJae Park: fixes, cleanups.
- hugetlb sysfs cleanups from Muchun Song.
- Mike Kravetz fixes locking issues in hugetlbfs and in hugetlb core.
Link: https://lkml.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com [1]
* tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (555 commits)
hugetlb: allocate vma lock for all sharable vmas
hugetlb: take hugetlb vma_lock when clearing vma_lock->vma pointer
hugetlb: fix vma lock handling during split vma and range unmapping
mglru: mm/vmscan.c: fix imprecise comments
mm/mglru: don't sync disk for each aging cycle
mm: memcontrol: drop dead CONFIG_MEMCG_SWAP config symbol
mm: memcontrol: use do_memsw_account() in a few more places
mm: memcontrol: deprecate swapaccounting=0 mode
mm: memcontrol: don't allocate cgroup swap arrays when memcg is disabled
mm/secretmem: remove reduntant return value
mm/hugetlb: add available_huge_pages() func
mm: remove unused inline functions from include/linux/mm_inline.h
selftests/vm: add selftest for MADV_COLLAPSE of uffd-minor memory
selftests/vm: add file/shmem MADV_COLLAPSE selftest for cleared pmd
selftests/vm: add thp collapse shmem testing
selftests/vm: add thp collapse file and tmpfs testing
selftests/vm: modularize thp collapse memory operations
selftests/vm: dedup THP helpers
mm/khugepaged: add tracepoint to hpage_collapse_scan_file()
mm/madvise: add file and shmem support to MADV_COLLAPSE
...
For function submit_extent_page() and alloc_new_bio(), we have an
argument @end_io_func to indicate the end io function.
But that function never change inside any call site of them, thus no
need to pass the pointer around everywhere.
There is a better match for the lifespan of all the call sites, as we
have btrfs_bio_ctrl structure, thus we can put the endio function
pointer there, and grab the pointer every time we allocate a new bio.
Also add extra ASSERT()s to make sure every call site of
submit_extent_page() and alloc_new_bio() has properly set the pointer
inside btrfs_bio_ctrl.
This removes one argument from the already long argument list of
submit_extent_page().
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Normally we put (page, pg_len, pg_offset) arguments together, just like
what __bio_add_page() does.
But in submit_extent_page(), what we got is, (page, disk_bytenr, pg_len,
pg_offset), which sometimes can be confusing.
Change the order to (disk_bytenr, page, pg_len, pg_offset) to make it
to follow the common schema.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since commit 390ed29b81 ("btrfs: refactor submit_extent_page() to make
bio and its flag tracing easier"), we are using bio_ctrl structure to
replace some of arguments of submit_extent_page().
But unfortunately that commit didn't update the comment for
submit_extent_page(), thus some arguments are stale like:
- bio_ret
- mirror_num
Those are all contained in bio_ctrl now.
- prev_bio_flags
We no longer use this flag to determine if we can merge bios.
Update the comment for submit_extent_page() to keep it up-to-date.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is defined in btrfs_inode.h, and dereferences btrfs_root and
btrfs_fs_info, both of which aren't defined in btrfs_inode.h.
Additionally, in many places we already have root or fs_info, so this
helper often makes the code harder to read. So delete the helper and
simply open code it in the few places that we use it.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is a separate I/O failure tree to track the fail reads, so remove
the extra EXTENT_DAMAGED bit in the I/O tree as it's set but never used.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of taking up a whole argument to indicate we're clearing
everything in a range, simply add another EXTENT bit to control this,
and then update all the callers to drop this argument from the
clear_extent_bit variants.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When trying to release the extent states due to memory pressure we'll
set all the bits except LOCKED, NODATASUM, and DELALLOC_NEW. This
includes some of the CTL bits, which isn't really a problem but isn't
correct either. Exclude the CTL bits from this clearing.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have two variants of lock/unlock extent, one set that takes a cached
state, another that does not. This is slightly annoying, and generally
speaking there are only a few places where we don't have a cached state.
Simplify this by making lock_extent/unlock_extent the only variant and
make it take a cached state, then convert all the callers appropriately.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is only used in the case that we are clearing EXTENT_LOCKED, so
infer this value from the bits passed in instead of taking it as an
argument.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is still huge, but unfortunately I cannot make it smaller without
renaming tree_search() and changing all the callers to use the new name,
then moving those chunks and then changing the name back. This feels
like too much churn for code movement, so I've limited this to only
things that called tree_search(). With this patch all of the
extent_io_tree code is now in extent-io-tree.c.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These are the last few helpers that do not rely on tree_search() and
who's other helpers are exported and in extent-io-tree.c already. Move
these across now in order to make the core move smaller.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In order to avoid moving all of the related code at once temporarily
export all of the extent state related helpers. Then move these helpers
into extent-io-tree.c. We will clean up the exports and make them
static in followup patches.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A lot of the various internals of extent_io_tree call these two
functions for insert or searching the rb tree for entries, so
temporarily export them and then move them to extent-io-tree.c. We
can't move tree_search() without renaming it, and I don't want to
introduce a bunch of churn just to do that, so move these functions
first and then we can move a few big functions and then the remaining
users of tree_search().
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This helper is used by a lot of the core extent_io_tree helpers, so
temporarily export it and move it into extent-io-tree.c in order to make
it straightforward to migrate the helpers in batches.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is used by the subpage code in addition to lock_extent_bits, so
export it so we can move it out of extent_io.c
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These are just variants and wrappers around the actual work horses of
the extent state. Extract these out of extent_io.c.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We only call these functions from the qgroup code which doesn't call
with EXTENT_BIT_LOCKED. These are BUG_ON()'s that exist to keep us
developers from using these functions with EXTENT_BIT_LOCKED, so convert
them to ASSERT()'s.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Start cleaning up extent_io.c by moving the extent state code out of it.
This patch starts with the extent state allocation code and the
extent_io_tree init code.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We're going to move this code in stages, but while we're doing that we
need to export these helpers so we can more easily move the code into
the new file.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we have the add/del functions generic so that we can use them
for both extent buffers and extent states. We want to separate this
code however, so separate these helpers into per-object helpers in
anticipation of the split.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In order to help separate the extent buffer from the extent io tree code
we need to break up the init functions.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we're using find_first_extent_bit_state to check if our state
contains the given failrec range, however this is more of an internal
extent_io_tree helper, and is technically unsafe to use because we're
accessing the state outside of the extent_io_tree lock.
Instead use the normal helper find_first_extent_bit which returns the
range of the extent state we find in find_first_extent_bit_state and use
that to do our sanity checking.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We still have this oddity of stashing the io_failure_record in the
extent state for the io_failure_tree, which is leftover from when we
used to stuff private pointers in extent_io_trees.
However this doesn't make a lot of sense for the io failure records, we
can simply use a normal rb_tree for this. This will allow us to further
simplify the extent_io_tree code by removing the io_failure_rec pointer
from the extent state.
Convert the io_failure_tree to an rb tree + spinlock in the inode, and
then use our rb tree simple helpers to insert and find failed records.
This greatly cleans up this code and makes it easier to separate out the
extent_io_tree code.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These are internally used functions and are not used outside of
extent_io.c.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is exported, so rename it to btrfs_clean_io_failure. Additionally
we are passing in the io tree's and such from the inode, so instead of
doing all that simply pass in the inode itself and get all the
components we need directly inside of btrfs_clean_io_failure.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The current fiemap implementation does not scale very well with the number
of extents a file has. This is both because the main algorithm to find out
the extents has a high algorithmic complexity and because for each extent
we have to check if it's shared. This second part, checking if an extent
is shared, is significantly improved by the two previous patches in this
patchset, while the first part is improved by this specific patch. Every
now and then we get reports from users mentioning fiemap is too slow or
even unusable for files with a very large number of extents, such as the
two recent reports referred to by the Link tags at the bottom of this
change log.
To understand why the part of finding which extents a file has is very
inefficient, consider the example of doing a full ranged fiemap against
a file that has over 100K extents (normal for example for a file with
more than 10G of data and using compression, which limits the extent size
to 128K). When we enter fiemap at extent_fiemap(), the following happens:
1) Before entering the main loop, we call get_extent_skip_holes() to get
the first extent map. This leads us to btrfs_get_extent_fiemap(), which
in turn calls btrfs_get_extent(), to find the first extent map that
covers the file range [0, LLONG_MAX).
btrfs_get_extent() will first search the inode's extent map tree, to
see if we have an extent map there that covers the range. If it does
not find one, then it will search the inode's subvolume b+tree for a
fitting file extent item. After finding the file extent item, it will
allocate an extent map, fill it in with information extracted from the
file extent item, and add it to the inode's extent map tree (which
requires a search for insertion in the tree).
2) Then we enter the main loop at extent_fiemap(), emit the details of
the extent, and call again get_extent_skip_holes(), with a start
offset matching the end of the extent map we previously processed.
We end up at btrfs_get_extent() again, will search the extent map tree
and then search the subvolume b+tree for a file extent item if we could
not find an extent map in the extent tree. We allocate an extent map,
fill it in with the details in the file extent item, and then insert
it into the extent map tree (yet another search in this tree).
3) The second step is repeated over and over, until we have processed the
whole file range. Each iteration ends at btrfs_get_extent(), which
does a red black tree search on the extent map tree, then searches the
subvolume b+tree, allocates an extent map and then does another search
in the extent map tree in order to insert the extent map.
In the best scenario we have all the extent maps already in the extent
tree, and so for each extent we do a single search on a red black tree,
so we have a complexity of O(n log n).
In the worst scenario we don't have any extent map already loaded in
the extent map tree, or have very few already there. In this case the
complexity is much higher since we do:
- A red black tree search on the extent map tree, which has O(log n)
complexity, initially very fast since the tree is empty or very
small, but as we end up allocating extent maps and adding them to
the tree when we don't find them there, each subsequent search on
the tree gets slower, since it's getting bigger and bigger after
each iteration.
- A search on the subvolume b+tree, also O(log n) complexity, but it
has items for all inodes in the subvolume, not just items for our
inode. Plus on a filesystem with concurrent operations on other
inodes, we can block doing the search due to lock contention on
b+tree nodes/leaves.
- Allocate an extent map - this can block, and can also fail if we
are under serious memory pressure.
- Do another search on the extent maps red black tree, with the goal
of inserting the extent map we just allocated. Again, after every
iteration this tree is getting bigger by 1 element, so after many
iterations the searches are slower and slower.
- We will not need the allocated extent map anymore, so it's pointless
to add it to the extent map tree. It's just wasting time and memory.
In short we end up searching the extent map tree multiple times, on a
tree that is growing bigger and bigger after each iteration. And
besides that we visit the same leaf of the subvolume b+tree many times,
since a leaf with the default size of 16K can easily have more than 200
file extent items.
This is very inefficient overall. This patch changes the algorithm to
instead iterate over the subvolume b+tree, visiting each leaf only once,
and only searching in the extent map tree for file ranges that have holes
or prealloc extents, in order to figure out if we have delalloc there.
It will never allocate an extent map and add it to the extent map tree.
This is very similar to what was previously done for the lseek's hole and
data seeking features.
Also, the current implementation relying on extent maps for figuring out
which extents we have is not correct. This is because extent maps can be
merged even if they represent different extents - we do this to minimize
memory utilization and keep extent map trees smaller. For example if we
have two extents that are contiguous on disk, once we load the two extent
maps, they get merged into a single one - however if only one of the
extents is shared, we end up reporting both as shared or both as not
shared, which is incorrect.
This reproducer triggers that bug:
$ cat fiemap-bug.sh
#!/bin/bash
DEV=/dev/sdj
MNT=/mnt/sdj
mkfs.btrfs -f $DEV
mount $DEV $MNT
# Create a file with two 256K extents.
# Since there is no other write activity, they will be contiguous,
# and their extent maps merged, despite having two distinct extents.
xfs_io -f -c "pwrite -S 0xab 0 256K" \
-c "fsync" \
-c "pwrite -S 0xcd 256K 256K" \
-c "fsync" \
$MNT/foo
# Now clone only the second extent into another file.
xfs_io -f -c "reflink $MNT/foo 256K 0 256K" $MNT/bar
# Filefrag will report a single 512K extent, and say it's not shared.
echo
filefrag -v $MNT/foo
umount $MNT
Running the reproducer:
$ ./fiemap-bug.sh
wrote 262144/262144 bytes at offset 0
256 KiB, 64 ops; 0.0038 sec (65.479 MiB/sec and 16762.7030 ops/sec)
wrote 262144/262144 bytes at offset 262144
256 KiB, 64 ops; 0.0040 sec (61.125 MiB/sec and 15647.9218 ops/sec)
linked 262144/262144 bytes at offset 0
256 KiB, 1 ops; 0.0002 sec (1.034 GiB/sec and 4237.2881 ops/sec)
Filesystem type is: 9123683e
File size of /mnt/sdj/foo is 524288 (128 blocks of 4096 bytes)
ext: logical_offset: physical_offset: length: expected: flags:
0: 0.. 127: 3328.. 3455: 128: last,eof
/mnt/sdj/foo: 1 extent found
We end up reporting that we have a single 512K that is not shared, however
we have two 256K extents, and the second one is shared. Changing the
reproducer to clone instead the first extent into file 'bar', makes us
report a single 512K extent that is shared, which is algo incorrect since
we have two 256K extents and only the first one is shared.
This patch is part of a larger patchset that is comprised of the following
patches:
btrfs: allow hole and data seeking to be interruptible
btrfs: make hole and data seeking a lot more efficient
btrfs: remove check for impossible block start for an extent map at fiemap
btrfs: remove zero length check when entering fiemap
btrfs: properly flush delalloc when entering fiemap
btrfs: allow fiemap to be interruptible
btrfs: rename btrfs_check_shared() to a more descriptive name
btrfs: speedup checking for extent sharedness during fiemap
btrfs: skip unnecessary extent buffer sharedness checks during fiemap
btrfs: make fiemap more efficient and accurate reporting extent sharedness
The patchset was tested on a machine running a non-debug kernel (Debian's
default config) and compared the tests below on a branch without the
patchset versus the same branch with the whole patchset applied.
The following test for a large compressed file without holes:
$ cat fiemap-perf-test.sh
#!/bin/bash
DEV=/dev/sdi
MNT=/mnt/sdi
mkfs.btrfs -f $DEV
mount -o compress=lzo $DEV $MNT
# 40G gives 327680 128K file extents (due to compression).
xfs_io -f -c "pwrite -S 0xab -b 1M 0 20G" $MNT/foobar
umount $MNT
mount -o compress=lzo $DEV $MNT
start=$(date +%s%N)
filefrag $MNT/foobar
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "fiemap took $dur milliseconds (metadata not cached)"
start=$(date +%s%N)
filefrag $MNT/foobar
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "fiemap took $dur milliseconds (metadata cached)"
umount $MNT
Before patchset:
$ ./fiemap-perf-test.sh
(...)
/mnt/sdi/foobar: 327680 extents found
fiemap took 3597 milliseconds (metadata not cached)
/mnt/sdi/foobar: 327680 extents found
fiemap took 2107 milliseconds (metadata cached)
After patchset:
$ ./fiemap-perf-test.sh
(...)
/mnt/sdi/foobar: 327680 extents found
fiemap took 1214 milliseconds (metadata not cached)
/mnt/sdi/foobar: 327680 extents found
fiemap took 684 milliseconds (metadata cached)
That's a speedup of about 3x for both cases (no metadata cached and all
metadata cached).
The test provided by Pavel (first Link tag at the bottom), which uses
files with a large number of holes, was also used to measure the gains,
and it consists on a small C program and a shell script to invoke it.
The C program is the following:
$ cat pavels-test.c
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>
#define FILE_INTERVAL (1<<13) /* 8Kb */
long long interval(struct timeval t1, struct timeval t2)
{
long long val = 0;
val += (t2.tv_usec - t1.tv_usec);
val += (t2.tv_sec - t1.tv_sec) * 1000 * 1000;
return val;
}
int main(int argc, char **argv)
{
struct fiemap fiemap = {};
struct timeval t1, t2;
char data = 'a';
struct stat st;
int fd, off, file_size = FILE_INTERVAL;
if (argc != 3 && argc != 2) {
printf("usage: %s <path> [size]\n", argv[0]);
return 1;
}
if (argc == 3)
file_size = atoi(argv[2]);
if (file_size < FILE_INTERVAL)
file_size = FILE_INTERVAL;
file_size -= file_size % FILE_INTERVAL;
fd = open(argv[1], O_RDWR | O_CREAT | O_TRUNC, 0644);
if (fd < 0) {
perror("open");
return 1;
}
for (off = 0; off < file_size; off += FILE_INTERVAL) {
if (pwrite(fd, &data, 1, off) != 1) {
perror("pwrite");
close(fd);
return 1;
}
}
if (ftruncate(fd, file_size)) {
perror("ftruncate");
close(fd);
return 1;
}
if (fstat(fd, &st) < 0) {
perror("fstat");
close(fd);
return 1;
}
printf("size: %ld\n", st.st_size);
printf("actual size: %ld\n", st.st_blocks * 512);
fiemap.fm_length = FIEMAP_MAX_OFFSET;
gettimeofday(&t1, NULL);
if (ioctl(fd, FS_IOC_FIEMAP, &fiemap) < 0) {
perror("fiemap");
close(fd);
return 1;
}
gettimeofday(&t2, NULL);
printf("fiemap: fm_mapped_extents = %d\n",
fiemap.fm_mapped_extents);
printf("time = %lld us\n", interval(t1, t2));
close(fd);
return 0;
}
$ gcc -o pavels_test pavels_test.c
And the wrapper shell script:
$ cat fiemap-pavels-test.sh
#!/bin/bash
DEV=/dev/sdi
MNT=/mnt/sdi
mkfs.btrfs -f -O no-holes $DEV
mount $DEV $MNT
echo
echo "*********** 256M ***********"
echo
./pavels-test $MNT/testfile $((1 << 28))
echo
./pavels-test $MNT/testfile $((1 << 28))
echo
echo "*********** 512M ***********"
echo
./pavels-test $MNT/testfile $((1 << 29))
echo
./pavels-test $MNT/testfile $((1 << 29))
echo
echo "*********** 1G ***********"
echo
./pavels-test $MNT/testfile $((1 << 30))
echo
./pavels-test $MNT/testfile $((1 << 30))
umount $MNT
Running his reproducer before applying the patchset:
*********** 256M ***********
size: 268435456
actual size: 134217728
fiemap: fm_mapped_extents = 32768
time = 4003133 us
size: 268435456
actual size: 134217728
fiemap: fm_mapped_extents = 32768
time = 4895330 us
*********** 512M ***********
size: 536870912
actual size: 268435456
fiemap: fm_mapped_extents = 65536
time = 30123675 us
size: 536870912
actual size: 268435456
fiemap: fm_mapped_extents = 65536
time = 33450934 us
*********** 1G ***********
size: 1073741824
actual size: 536870912
fiemap: fm_mapped_extents = 131072
time = 224924074 us
size: 1073741824
actual size: 536870912
fiemap: fm_mapped_extents = 131072
time = 217239242 us
Running it after applying the patchset:
*********** 256M ***********
size: 268435456
actual size: 134217728
fiemap: fm_mapped_extents = 32768
time = 29475 us
size: 268435456
actual size: 134217728
fiemap: fm_mapped_extents = 32768
time = 29307 us
*********** 512M ***********
size: 536870912
actual size: 268435456
fiemap: fm_mapped_extents = 65536
time = 58996 us
size: 536870912
actual size: 268435456
fiemap: fm_mapped_extents = 65536
time = 59115 us
*********** 1G ***********
size: 1073741824
actual size: 536870912
fiemap: fm_mapped_extents = 116251
time = 124141 us
size: 1073741824
actual size: 536870912
fiemap: fm_mapped_extents = 131072
time = 119387 us
The speedup is massive, both on the first fiemap call and on the second
one as well, as his test creates files with many holes and small extents
(every extent follows a hole and precedes another hole).
For the 256M file we go from 4 seconds down to 29 milliseconds in the
first run, and then from 4.9 seconds down to 29 milliseconds again in the
second run, a speedup of 138x and 169x, respectively.
For the 512M file we go from 30.1 seconds down to 59 milliseconds in the
first run, and then from 33.5 seconds down to 59 milliseconds again in the
second run, a speedup of 510x and 568x, respectively.
For the 1G file, we go from 225 seconds down to 124 milliseconds in the
first run, and then from 217 seconds down to 119 milliseconds in the
second run, a speedup of 1815x and 1824x, respectively.
Reported-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Link: https://lore.kernel.org/linux-btrfs/21dd32c6-f1f9-f44a-466a-e18fdc6788a7@virtuozzo.com/
Reported-by: Dominique MARTINET <dominique.martinet@atmark-techno.com>
Link: https://lore.kernel.org/linux-btrfs/Ysace25wh5BbLd5f@atmark-techno.com/
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
During fiemap, for each file extent we find, we must check if it's shared
or not. The sharedness check starts by verifying if the extent is directly
shared (its refcount in the extent tree is > 1), and if it is not directly
shared, then we will check if every node in the subvolume b+tree leading
from the root to the leaf that has the file extent item (in reverse order),
is shared (through snapshots).
However this second step is not needed if our extent was created in a
transaction more recent than the last transaction where a snapshot of the
inode's root happened, because it can't be shared indirectly (through
shared subtrees) without a snapshot created in a more recent transaction.
So grab the generation of the extent from the extent map and pass it to
btrfs_is_data_extent_shared(), which will skip this second phase when the
generation is more recent than the root's last snapshot value. Note that
we skip this optimization if the extent map is the result of merging 2
or more extent maps, because in this case its generation is the maximum
of the generations of all merged extent maps.
The fact the we use extent maps and they can be merged despite the
underlying extents being distinct (different file extent items in the
subvolume b+tree and different extent items in the extent b+tree), can
result in some bugs when reporting shared extents. But this is a problem
of the current implementation of fiemap relying on extent maps.
One example where we get incorrect results is:
$ cat fiemap-bug.sh
#!/bin/bash
DEV=/dev/sdj
MNT=/mnt/sdj
mkfs.btrfs -f $DEV
mount $DEV $MNT
# Create a file with two 256K extents.
# Since there is no other write activity, they will be contiguous,
# and their extent maps merged, despite having two distinct extents.
xfs_io -f -c "pwrite -S 0xab 0 256K" \
-c "fsync" \
-c "pwrite -S 0xcd 256K 256K" \
-c "fsync" \
$MNT/foo
# Now clone only the second extent into another file.
xfs_io -f -c "reflink $MNT/foo 256K 0 256K" $MNT/bar
# Filefrag will report a single 512K extent, and say it's not shared.
echo
filefrag -v $MNT/foo
umount $MNT
Running the reproducer:
$ ./fiemap-bug.sh
wrote 262144/262144 bytes at offset 0
256 KiB, 64 ops; 0.0038 sec (65.479 MiB/sec and 16762.7030 ops/sec)
wrote 262144/262144 bytes at offset 262144
256 KiB, 64 ops; 0.0040 sec (61.125 MiB/sec and 15647.9218 ops/sec)
linked 262144/262144 bytes at offset 0
256 KiB, 1 ops; 0.0002 sec (1.034 GiB/sec and 4237.2881 ops/sec)
Filesystem type is: 9123683e
File size of /mnt/sdj/foo is 524288 (128 blocks of 4096 bytes)
ext: logical_offset: physical_offset: length: expected: flags:
0: 0.. 127: 3328.. 3455: 128: last,eof
/mnt/sdj/foo: 1 extent found
We end up reporting that we have a single 512K that is not shared, however
we have two 256K extents, and the second one is shared. Changing the
reproducer to clone instead the first extent into file 'bar', makes us
report a single 512K extent that is shared, which is algo incorrect since
we have two 256K extents and only the first one is shared.
This is z problem that existed before this change, and remains after this
change, as it can't be easily fixed. The next patch in the series reworks
fiemap to primarily use file extent items instead of extent maps (except
for checking for delalloc ranges), with the goal of improving its
scalability and performance, but it also ends up fixing this particular
bug caused by extent map merging.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
One of the most expensive tasks performed during fiemap is to check if
an extent is shared. This task has two major steps:
1) Check if the data extent is shared. This implies checking the extent
item in the extent tree, checking delayed references, etc. If we
find the data extent is directly shared, we terminate immediately;
2) If the data extent is not directly shared (its extent item has a
refcount of 1), then it may be shared if we have snapshots that share
subtrees of the inode's subvolume b+tree. So we check if the leaf
containing the file extent item is shared, then its parent node, then
the parent node of the parent node, etc, until we reach the root node
or we find one of them is shared - in which case we stop immediately.
During fiemap we process the extents of a file from left to right, from
file offset 0 to EOF. This means that we iterate b+tree leaves from left
to right, and has the implication that we keep repeating that second step
above several times for the same b+tree path of the inode's subvolume
b+tree.
For example, if we have two file extent items in leaf X, and the path to
leaf X is A -> B -> C -> X, then when we try to determine if the data
extent referenced by the first extent item is shared, we check if the data
extent is shared - if it's not, then we check if leaf X is shared, if not,
then we check if node C is shared, if not, then check if node B is shared,
if not than check if node A is shared. When we move to the next file
extent item, after determining the data extent is not shared, we repeat
the checks for X, C, B and A - doing all the expensive searches in the
extent tree, delayed refs, etc. If we have thousands of tile extents, then
we keep repeating the sharedness checks for the same paths over and over.
On a file that has no shared extents or only a small portion, it's easy
to see that this scales terribly with the number of extents in the file
and the sizes of the extent and subvolume b+trees.
This change eliminates the repeated sharedness check on extent buffers
by caching the results of the last path used. The results can be used as
long as no snapshots were created since they were cached (for not shared
extent buffers) or no roots were dropped since they were cached (for
shared extent buffers). This greatly reduces the time spent by fiemap for
files with thousands of extents and/or large extent and subvolume b+trees.
Example performance test:
$ cat fiemap-perf-test.sh
#!/bin/bash
DEV=/dev/sdi
MNT=/mnt/sdi
mkfs.btrfs -f $DEV
mount -o compress=lzo $DEV $MNT
# 40G gives 327680 128K file extents (due to compression).
xfs_io -f -c "pwrite -S 0xab -b 1M 0 40G" $MNT/foobar
umount $MNT
mount -o compress=lzo $DEV $MNT
start=$(date +%s%N)
filefrag $MNT/foobar
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "fiemap took $dur milliseconds (metadata not cached)"
start=$(date +%s%N)
filefrag $MNT/foobar
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "fiemap took $dur milliseconds (metadata cached)"
umount $MNT
Before this patch:
$ ./fiemap-perf-test.sh
(...)
/mnt/sdi/foobar: 327680 extents found
fiemap took 3597 milliseconds (metadata not cached)
/mnt/sdi/foobar: 327680 extents found
fiemap took 2107 milliseconds (metadata cached)
After this patch:
$ ./fiemap-perf-test.sh
(...)
/mnt/sdi/foobar: 327680 extents found
fiemap took 1646 milliseconds (metadata not cached)
/mnt/sdi/foobar: 327680 extents found
fiemap took 698 milliseconds (metadata cached)
That's about 2.2x faster when no metadata is cached, and about 3x faster
when all metadata is cached. On a real filesystem with many other files,
data, directories, etc, the b+trees will be 2 or 3 levels higher,
therefore this optimization will have a higher impact.
Several reports of a slow fiemap show up often, the two Link tags below
refer to two recent reports of such slowness. This patch, together with
the next ones in the series, is meant to address that.
Link: https://lore.kernel.org/linux-btrfs/21dd32c6-f1f9-f44a-466a-e18fdc6788a7@virtuozzo.com/
Link: https://lore.kernel.org/linux-btrfs/Ysace25wh5BbLd5f@atmark-techno.com/
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function btrfs_check_shared() is supposed to be used to check if a
data extent is shared, but its name is too generic, may easily cause
confusion in the sense that it may be used for metadata extents.
So rename it to btrfs_is_data_extent_shared(), which will also make it
less confusing after the next change that adds a backref lookup cache for
the b+tree nodes that lead to the leaf that contains the file extent item
that points to the target data extent.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's no point to check for a 0 length at extent_fiemap(), as before
calling it, we called fiemap_prep() at btrfs_fiemap(), which already
checks for a zero length and returns the same -EINVAL error. So remove
the pointless check.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
During fiemap we are testing if an extent map has a block start with a
value of EXTENT_MAP_LAST_BYTE, but that is never set on an extent map,
and never was according to git history. So remove that useless check.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently btrfs_bio end I/O handling is a bit of a mess. The bi_end_io
handler and bi_private pointer of the embedded struct bio are both used
to handle the completion of the high-level btrfs_bio and for the I/O
completion for the low-level device that the embedded bio ends up being
sent to.
To support this bi_end_io and bi_private are saved into the
btrfs_io_context structure and then restored after the bio sent to the
underlying device has completed the actual I/O.
Untangle this by adding an end I/O handler and private data to struct
btrfs_bio for the high-level btrfs_bio based completions, and leave the
actual bio bi_end_io handler and bi_private pointer entirely to the
low-level device I/O.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Pass the operation to btrfs_bio_alloc, matching what bio_alloc_bioset
set does.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
volumes.c is the place that implements the storage layer using the
btrfs_bio structure, so move the bio_set and allocation helpers there
as well.
To make up for the new initialization boilerplate, merge the two
init/exit helpers in extent_io.c into a single one.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs never uses bio integrity data itself, so don't allocate
the integrity pools for btrfs_bioset.
This patch is a revert of the commit b208c2f7ce ("btrfs: Fix crash due
to not allocating integrity data for a set"). The integrity data pool
is not needed, the bio-integrity code now handles allocating the
integrity payload without that.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
After we copied data to page cache in buffered I/O, we
1. Insert a EXTENT_UPTODATE state into inode's io_tree, by
endio_readpage_release_extent(), set_extent_delalloc() or
set_extent_defrag().
2. Set page uptodate before we unlock the page.
But the only place we check io_tree's EXTENT_UPTODATE state is in
btrfs_do_readpage(). We know we enter btrfs_do_readpage() only when we
have a non-uptodate page, so it is unnecessary to set EXTENT_UPTODATE.
For example, when performing a buffered random read:
fio --rw=randread --ioengine=libaio --direct=0 --numjobs=4 \
--filesize=32G --size=4G --bs=4k --name=job \
--filename=/mnt/file --name=job
Then check how many extent_state in io_tree:
cat /proc/slabinfo | grep btrfs_extent_state | awk '{print $2}'
w/o this patch, we got 640567 btrfs_extent_state.
w/ this patch, we got 204 btrfs_extent_state.
Maintaining such a big tree brings overhead since every I/O needs to insert
EXTENT_LOCKED, insert EXTENT_UPTODATE, then remove EXTENT_LOCKED. And in
every insert or remove, we need to lock io_tree, do tree search, alloc or
dealloc extent states. By removing unnecessary EXTENT_UPTODATE, we keep
io_tree in a minimal size and reduce overhead when performing buffered I/O.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Ethan Lien <ethanlien@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use `atomic_try_cmpxchg(ptr, &old, new)` instead of
`atomic_cmpxchg(ptr, old, new) == old` in free_extent_buffer. This
has two benefits:
- The x86 cmpxchg instruction returns success in the ZF flag, so this
change saves a compare after cmpxchg, as well as a related move
instruction in the front of cmpxchg.
- atomic_try_cmpxchg implicitly assigns the *ptr value to &old when
cmpxchg fails, enabling further code simplifications.
This patch has no functional change.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Convert to use folios throughout. This is in preparation for the removal
of find_get_pages_contig(). Now also supports large folios.
Since we may receive more than nr_pages pages, nr_pages may underflow.
Since nr_pages > 0 is equivalent to index <= end_index, we replaced it
with this check instead.
Link: https://lkml.kernel.org/r/20220824004023.77310-3-vishal.moola@gmail.com
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Acked-by: David Sterba <dsterba@suse.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Chris Mason <clm@fb.com>
Cc: David Sterba <dsterb@suse.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmMLY9oACgkQxWXV+ddt
WDue/w/8C3ZF8nLAI/sMrUpef2vSD62bvkKRRS45wzR2uod6yc0Fle9upzBssJQZ
qO3mQ53+QV+imCq7dY5mmtmwCUJNmbV5gbiMoF1OoV9TYtpZb/NIDklSX8se2eJX
drdAWQr2pYwU2M4duA4IEW08TvQ2TFh0JiqMi0aYM5apyL80uv3WniOu+xpRipA3
CMFAnDqayIgQ5OIsedqNy2MBLBopodUL5PZv/H7/g6KSKIuAZP9zgg1eKPfaz2t3
HO183ubmMbVtxgxeu+EnvCkg/iQ5hQiuGmyi0FLYMs/A6/NglwBnIJU5jCMQhcp6
HO5+FSUn6lHQetVzt2uHb9Lo+gX4FtCaHqVv1bXT62lnmDsZO1D7RVSg1Fra+CY+
jJmi8vvIbfbYlSZPZlJANoWe8ODOMVPk+pM4SFHlxOWGAY6HViX2RfHnIjNj5x9O
iDSTGvH6++nBF1Wu2/Xja/VKZ1avxRyTu2srW8JOF62j/tTU/EoPJcO9rxXOBBmC
Hi4UmJ690p3h5xZeeiyE8CmaSlPtfdCcnc/97FnusEjBao9O7THX0PCDVJX6VBkm
hVk01Z6+az1UNcD18KecvCpKYF/At4WpjaUGgf7q+LBfJXuXA6jfzOVDJMKV3TFd
n1yMFg+duGj90l8gT0aa/VQiBlUlnzQKz6ceqyKkPccwveNis6I=
=p8YV
-----END PGP SIGNATURE-----
Merge tag 'for-6.0-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Fixes:
- check that subvolume is writable when changing xattrs from security
namespace
- fix memory leak in device lookup helper
- update generation of hole file extent item when merging holes
- fix space cache corruption and potential double allocations; this
is a rare bug but can be serious once it happens, stable backports
and analysis tool will be provided
- fix error handling when deleting root references
- fix crash due to assert when attempting to cancel suspended device
replace, add message what to do if mount fails due to missing
replace item
Regressions:
- don't merge pages into bio if their page offset is not contiguous
- don't allow large NOWAIT direct reads, this could lead to short
reads eg. in io_uring"
* tag 'for-6.0-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: add info when mount fails due to stale replace target
btrfs: replace: drop assert for suspended replace
btrfs: fix silent failure when deleting root reference
btrfs: fix space cache corruption and potential double allocations
btrfs: don't allow large NOWAIT direct reads
btrfs: don't merge pages into bio if their page offset is not contiguous
btrfs: update generation of hole file extent item when merging holes
btrfs: fix possible memory leak in btrfs_get_dev_args_from_path()
btrfs: check if root is readonly while setting security xattr
[BUG]
Zygo reported on latest development branch, he could hit
ASSERT()/BUG_ON() caused crash when doing RAID5 recovery (intentionally
corrupt one disk, and let btrfs to recover the data during read/scrub).
And The following minimal reproducer can cause extent state leakage at
rmmod time:
mkfs.btrfs -f -d raid5 -m raid5 $dev1 $dev2 $dev3 -b 1G > /dev/null
mount $dev1 $mnt
fsstress -w -d $mnt -n 25 -s 1660807876
sync
fssum -A -f -w /tmp/fssum.saved $mnt
umount $mnt
# Wipe the dev1 but keeps its super block
xfs_io -c "pwrite -S 0x0 1m 1023m" $dev1
mount $dev1 $mnt
fssum -r /tmp/fssum.saved $mnt > /dev/null
umount $mnt
rmmod btrfs
This will lead to the following extent states leakage:
BTRFS: state leak: start 499712 end 503807 state 5 in tree 1 refs 1
BTRFS: state leak: start 495616 end 499711 state 5 in tree 1 refs 1
BTRFS: state leak: start 491520 end 495615 state 5 in tree 1 refs 1
BTRFS: state leak: start 487424 end 491519 state 5 in tree 1 refs 1
BTRFS: state leak: start 483328 end 487423 state 5 in tree 1 refs 1
BTRFS: state leak: start 479232 end 483327 state 5 in tree 1 refs 1
BTRFS: state leak: start 475136 end 479231 state 5 in tree 1 refs 1
BTRFS: state leak: start 471040 end 475135 state 5 in tree 1 refs 1
[CAUSE]
Since commit 7aa51232e2 ("btrfs: pass a btrfs_bio to
btrfs_repair_one_sector"), we always use btrfs_bio->file_offset to
determine the file offset of a page.
But that usage assume that, one bio has all its page having a continuous
page offsets.
Unfortunately that's not true, btrfs only requires the logical bytenr
contiguous when assembling its bios.
From above script, we have one bio looks like this:
fssum-27671 submit_one_bio: bio logical=217739264 len=36864
fssum-27671 submit_one_bio: r/i=5/261 page_offset=466944 <<<
fssum-27671 submit_one_bio: r/i=5/261 page_offset=724992 <<<
fssum-27671 submit_one_bio: r/i=5/261 page_offset=729088
fssum-27671 submit_one_bio: r/i=5/261 page_offset=733184
fssum-27671 submit_one_bio: r/i=5/261 page_offset=737280
fssum-27671 submit_one_bio: r/i=5/261 page_offset=741376
fssum-27671 submit_one_bio: r/i=5/261 page_offset=745472
fssum-27671 submit_one_bio: r/i=5/261 page_offset=749568
fssum-27671 submit_one_bio: r/i=5/261 page_offset=753664
Note that the 1st and the 2nd page has non-contiguous page offsets.
This means, at repair time, we will have completely wrong file offset
passed in:
kworker/u32:2-19927 btrfs_repair_one_sector: r/i=5/261 page_off=729088 file_off=475136 bio_offset=8192
Since the file offset is incorrect, we latter incorrectly set the extent
states, and no way to really release them.
Thus later it causes the leakage.
In fact, this can be even worse, since the file offset is incorrect, we
can hit cases like the incorrect file offset belongs to a HOLE, and
later cause btrfs_num_copies() to trigger error, finally hit
BUG_ON()/ASSERT() later.
[FIX]
Add an extra condition in btrfs_bio_add_page() for uncompressed IO.
Now we will have more strict requirement for bio pages:
- They should all have the same mapping
(the mapping check is already implied by the call chain)
- Their logical bytenr should be adjacent
This is the same as the old condition.
- Their page_offset() (file offset) should be adjacent
This is the new check.
This would result a slightly increased amount of bios from btrfs
(needs holes and inside the same stripe boundary to trigger).
But this would greatly reduce the confusion, as it's pretty common
to assume a btrfs bio would only contain continuous page cache.
Later we may need extra cleanups, as we no longer needs to handle gaps
between page offsets in endio functions.
Currently this should be the minimal patch to fix commit 7aa51232e2
("btrfs: pass a btrfs_bio to btrfs_repair_one_sector").
Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Fixes: 7aa51232e2 ("btrfs: pass a btrfs_bio to btrfs_repair_one_sector")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmL/dJgACgkQxWXV+ddt
WDtUmg/9Fje7L+jAtQLqvJYvvGCFCCs8gkm+Rwh9MLYHcEDnBtRWzDWYTKcfO9WW
mPmiNlNPAbnAw/9apeJDvFOrd4Vqr8ZD3Vr0flXl5EJ9QXSTL6JXfaiLS1o6rQ0q
OXqxVTh5B5aEmfsEhyJBLzsZGdISJbr60dAE8/ZDX9wo8cDavZ6YxToIroUeizGO
dx2DY5A7OQRKTEf4aJgu4zcm1Sq+U5A8M1pcfU4Rhb/YapVgcCc2wrdrw4NOkNJu
B/NZFcD0BcF2bZW7uHT5vr8rGzj2xZUCkuggBQ6/0/h7OdRXIzHCanMw4lPy602v
IfZASF7eYkq7FHRbANj/WAYHOFl41vS+whqZ+sEI/qrW+ZyrODIzS67sbU/Bsa7+
ZoL6RSlIbcMgz7XVqcc5d8bKNK/Hc3MCyMWlLT0XScm05BiJy5O2iEZ7UOJ4r8E+
6J/pFD1bNdacp6UrzwEjyXuSufvp4pJdNWn5ttIWYTBygXMT8AJ403yIheiV3KA3
SkoMj4A54tF3G7NhkzfR5sC7hlgcA0njJLzloicmNgP7E12vmcDL2rTdTt/Yl0cw
3w6ztJS+1sWocFSJ43lE2muHlj7wW7QDYPvul1t6yWgO9wtWECo7c/ASeRl0zPkP
atAJtsr3uV9/aA2ae9QeWzut1W3hbE5NFQOLPcF/iXN+kxMmesI=
=HgV0
-----END PGP SIGNATURE-----
Merge tag 'for-6.0-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few short fixes and a lockdep warning fix (needs moving some code):
- tree-log replay fixes:
- fix error handling when looking up extent refs
- fix warning when setting inode number of links
- relocation fixes:
- reset block group read-only status when relocation fails
- unset control structure if transaction fails when starting
to process a block group
- add lockdep annotations to fix a warning during relocation
where blocks temporarily belong to another tree and can lead
to reversed dependencies
- tree-checker verifies that extent items don't overlap"
* tag 'for-6.0-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: tree-checker: check for overlapping extent items
btrfs: fix warning during log replay when bumping inode link count
btrfs: fix lost error handling when looking up extended ref on log replay
btrfs: fix lockdep splat with reloc root extent buffers
btrfs: move lockdep class helpers to locking.c
btrfs: unset reloc control if transaction commit fails in prepare_to_relocate()
btrfs: reset RO counter on block group if we fail to relocate
We have been hitting the following lockdep splat with btrfs/187 recently
WARNING: possible circular locking dependency detected
5.19.0-rc8+ #775 Not tainted
------------------------------------------------------
btrfs/752500 is trying to acquire lock:
ffff97e1875a97b8 (btrfs-treloc-02#2){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110
but task is already holding lock:
ffff97e1875a9278 (btrfs-tree-01/1){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #2 (btrfs-tree-01/1){+.+.}-{3:3}:
down_write_nested+0x41/0x80
__btrfs_tree_lock+0x24/0x110
btrfs_init_new_buffer+0x7d/0x2c0
btrfs_alloc_tree_block+0x120/0x3b0
__btrfs_cow_block+0x136/0x600
btrfs_cow_block+0x10b/0x230
btrfs_search_slot+0x53b/0xb70
btrfs_lookup_inode+0x2a/0xa0
__btrfs_update_delayed_inode+0x5f/0x280
btrfs_async_run_delayed_root+0x24c/0x290
btrfs_work_helper+0xf2/0x3e0
process_one_work+0x271/0x590
worker_thread+0x52/0x3b0
kthread+0xf0/0x120
ret_from_fork+0x1f/0x30
-> #1 (btrfs-tree-01){++++}-{3:3}:
down_write_nested+0x41/0x80
__btrfs_tree_lock+0x24/0x110
btrfs_search_slot+0x3c3/0xb70
do_relocation+0x10c/0x6b0
relocate_tree_blocks+0x317/0x6d0
relocate_block_group+0x1f1/0x560
btrfs_relocate_block_group+0x23e/0x400
btrfs_relocate_chunk+0x4c/0x140
btrfs_balance+0x755/0xe40
btrfs_ioctl+0x1ea2/0x2c90
__x64_sys_ioctl+0x88/0xc0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
-> #0 (btrfs-treloc-02#2){+.+.}-{3:3}:
__lock_acquire+0x1122/0x1e10
lock_acquire+0xc2/0x2d0
down_write_nested+0x41/0x80
__btrfs_tree_lock+0x24/0x110
btrfs_lock_root_node+0x31/0x50
btrfs_search_slot+0x1cb/0xb70
replace_path+0x541/0x9f0
merge_reloc_root+0x1d6/0x610
merge_reloc_roots+0xe2/0x260
relocate_block_group+0x2c8/0x560
btrfs_relocate_block_group+0x23e/0x400
btrfs_relocate_chunk+0x4c/0x140
btrfs_balance+0x755/0xe40
btrfs_ioctl+0x1ea2/0x2c90
__x64_sys_ioctl+0x88/0xc0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
other info that might help us debug this:
Chain exists of:
btrfs-treloc-02#2 --> btrfs-tree-01 --> btrfs-tree-01/1
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(btrfs-tree-01/1);
lock(btrfs-tree-01);
lock(btrfs-tree-01/1);
lock(btrfs-treloc-02#2);
*** DEADLOCK ***
7 locks held by btrfs/752500:
#0: ffff97e292fdf460 (sb_writers#12){.+.+}-{0:0}, at: btrfs_ioctl+0x208/0x2c90
#1: ffff97e284c02050 (&fs_info->reclaim_bgs_lock){+.+.}-{3:3}, at: btrfs_balance+0x55f/0xe40
#2: ffff97e284c00878 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: btrfs_relocate_block_group+0x236/0x400
#3: ffff97e292fdf650 (sb_internal#2){.+.+}-{0:0}, at: merge_reloc_root+0xef/0x610
#4: ffff97e284c02378 (btrfs_trans_num_writers){++++}-{0:0}, at: join_transaction+0x1a8/0x5a0
#5: ffff97e284c023a0 (btrfs_trans_num_extwriters){++++}-{0:0}, at: join_transaction+0x1a8/0x5a0
#6: ffff97e1875a9278 (btrfs-tree-01/1){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110
stack backtrace:
CPU: 1 PID: 752500 Comm: btrfs Not tainted 5.19.0-rc8+ #775
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
Call Trace:
dump_stack_lvl+0x56/0x73
check_noncircular+0xd6/0x100
? lock_is_held_type+0xe2/0x140
__lock_acquire+0x1122/0x1e10
lock_acquire+0xc2/0x2d0
? __btrfs_tree_lock+0x24/0x110
down_write_nested+0x41/0x80
? __btrfs_tree_lock+0x24/0x110
__btrfs_tree_lock+0x24/0x110
btrfs_lock_root_node+0x31/0x50
btrfs_search_slot+0x1cb/0xb70
? lock_release+0x137/0x2d0
? _raw_spin_unlock+0x29/0x50
? release_extent_buffer+0x128/0x180
replace_path+0x541/0x9f0
merge_reloc_root+0x1d6/0x610
merge_reloc_roots+0xe2/0x260
relocate_block_group+0x2c8/0x560
btrfs_relocate_block_group+0x23e/0x400
btrfs_relocate_chunk+0x4c/0x140
btrfs_balance+0x755/0xe40
btrfs_ioctl+0x1ea2/0x2c90
? lock_is_held_type+0xe2/0x140
? lock_is_held_type+0xe2/0x140
? __x64_sys_ioctl+0x88/0xc0
__x64_sys_ioctl+0x88/0xc0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
This isn't necessarily new, it's just tricky to hit in practice. There
are two competing things going on here. With relocation we create a
snapshot of every fs tree with a reloc tree. Any extent buffers that
get initialized here are initialized with the reloc root lockdep key.
However since it is a snapshot, any blocks that are currently in cache
that originally belonged to the fs tree will have the normal tree
lockdep key set. This creates the lock dependency of
reloc tree -> normal tree
for the extent buffer locking during the first phase of the relocation
as we walk down the reloc root to relocate blocks.
However this is problematic because the final phase of the relocation is
merging the reloc root into the original fs root. This involves
searching down to any keys that exist in the original fs root and then
swapping the relocated block and the original fs root block. We have to
search down to the fs root first, and then go search the reloc root for
the block we need to replace. This creates the dependency of
normal tree -> reloc tree
which is why lockdep complains.
Additionally even if we were to fix this particular mismatch with a
different nesting for the merge case, we're still slotting in a block
that has a owner of the reloc root objectid into a normal tree, so that
block will have its lockdep key set to the tree reloc root, and create a
lockdep splat later on when we wander into that block from the fs root.
Unfortunately the only solution here is to make sure we do not set the
lockdep key to the reloc tree lockdep key normally, and then reset any
blocks we wander into from the reloc root when we're doing the merged.
This solves the problem of having mixed tree reloc keys intermixed with
normal tree keys, and then allows us to make sure in the merge case we
maintain the lock order of
normal tree -> reloc tree
We handle this by setting a bit on the reloc root when we do the search
for the block we want to relocate, and any block we search into or COW
at that point gets set to the reloc tree key. This works correctly
because we only ever COW down to the parent node, so we aren't resetting
the key for the block we're linking into the fs root.
With this patch we no longer have the lockdep splat in btrfs/187.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmLnyNUACgkQxWXV+ddt
WDt9vA/9HcF+v5EkknyW07tatTap/Hm/ZB86Z5OZi6ikwIEcHsWhp3rUICejm88e
GecDPIluDtCtyD6x4stuqkwOm22aDP5q2T9H6+gyw92ozyb436OV1Z8IrmftzXKY
EpZO70PHZT+E6E/WYvyoTmmoCrjib7YlqCWZZhSLUFpsqqlOInmHEH49PW6KvM4r
acUZ/RxHurKdmI3kNY6ECbAQl6CASvtTdYcVCx8fT2zN0azoLIQxpYa7n/9ca1R6
8WnYilCbLbNGtcUXvO2M3tMZ4/5kvxrwQsUn93ccCJYuiN0ASiDXbLZ2g4LZ+n56
JGu+y5v5oBwjpVf+46cuvnENP5BQ61594WPseiVjrqODWnPjN28XkcVC0XmPsiiZ
lszeHO2cuIrIFoCah8ELMl8usu8+qxfXmPxIXtPu9rEyKsDtOjxVYc8SMXqLp0qQ
qYtBoFm0JcZHqtZRpB+dhQ37/xXtH4ljUi/mI6x8iALVujeR273URs7yO9zgIdeW
uZoFtbwpHFLUk+TL7Ku82/zOXp3fCwtDpNmlYbxeMbea/be3ShjncM4+mYzvHYri
dYON2LFrq+mnRDqtIXTCaAYwX7zU8Y18Ev9QwlNll8dKlKwS89+jpqLoa+eVYy3c
/HitHFza70KxmOj4dvDVZlzDpPvl7kW1UBkmskg4u3jnNWzedkM=
=sS1q
-----END PGP SIGNATURE-----
Merge tag 'for-5.20-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"This brings some long awaited changes, the send protocol bump,
otherwise lots of small improvements and fixes. The main core part is
reworking bio handling, cleaning up the submission and endio and
improving error handling.
There are some changes outside of btrfs adding helpers or updating
API, listed at the end of the changelog.
Features:
- sysfs:
- export chunk size, in debug mode add tunable for setting its size
- show zoned among features (was only in debug mode)
- show commit stats (number, last/max/total duration)
- send protocol updated to 2
- new commands:
- ability write larger data chunks than 64K
- send raw compressed extents (uses the encoded data ioctls),
ie. no decompression on send side, no compression needed on
receive side if supported
- send 'otime' (inode creation time) among other timestamps
- send file attributes (a.k.a file flags and xflags)
- this is first version bump, backward compatibility on send and
receive side is provided
- there are still some known and wanted commands that will be
implemented in the near future, another version bump will be
needed, however we want to minimize that to avoid causing
usability issues
- print checksum type and implementation at mount time
- don't print some messages at mount (mentioned as people asked about
it), we want to print messages namely for new features so let's
make some space for that
- big metadata - this has been supported for a long time and is
not a feature that's worth mentioning
- skinny metadata - same reason, set by default by mkfs
Performance improvements:
- reduced amount of reserved metadata for delayed items
- when inserted items can be batched into one leaf
- when deleting batched directory index items
- when deleting delayed items used for deletion
- overall improved count of files/sec, decreased subvolume lock
contention
- metadata item access bounds checker micro-optimized, with a few
percent of improved runtime for metadata-heavy operations
- increase direct io limit for read to 256 sectors, improved
throughput by 3x on sample workload
Notable fixes:
- raid56
- reduce parity writes, skip sectors of stripe when there are no
data updates
- restore reading from on-disk data instead of using stripe cache,
this reduces chances to damage correct data due to RMW cycle
- refuse to replay log with unknown incompat read-only feature bit
set
- zoned
- fix page locking when COW fails in the middle of allocation
- improved tracking of active zones, ZNS drives may limit the
number and there are ENOSPC errors due to that limit and not
actual lack of space
- adjust maximum extent size for zone append so it does not cause
late ENOSPC due to underreservation
- mirror reading error messages show the mirror number
- don't fallback to buffered IO for NOWAIT direct IO writes, we don't
have the NOWAIT semantics for buffered io yet
- send, fix sending link commands for existing file paths when there
are deleted and created hardlinks for same files
- repair all mirrors for profiles with more than 1 copy (raid1c34)
- fix repair of compressed extents, unify where error detection and
repair happen
Core changes:
- bio completion cleanups
- don't double defer compression bios
- simplify endio workqueues
- add more data to btrfs_bio to avoid allocation for read requests
- rework bio error handling so it's same what block layer does,
the submission works and errors are consumed in endio
- when asynchronous bio offload fails fall back to synchronous
checksum calculation to avoid errors under writeback or memory
pressure
- new trace points
- raid56 events
- ordered extent operations
- super block log_root_transid deprecated (never used)
- mixed_backref and big_metadata sysfs feature files removed, they've
been default for sufficiently long time, there are no known users
and mixed_backref could be confused with mixed_groups
Non-btrfs changes, API updates:
- minor highmem API update to cover const arguments
- switch all kmap/kmap_atomic to kmap_local
- remove redundant flush_dcache_page()
- address_space_operations::writepage callback removed
- add bdev_max_segments() helper"
* tag 'for-5.20-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (163 commits)
btrfs: don't call btrfs_page_set_checked in finish_compressed_bio_read
btrfs: fix repair of compressed extents
btrfs: remove the start argument to check_data_csum and export
btrfs: pass a btrfs_bio to btrfs_repair_one_sector
btrfs: simplify the pending I/O counting in struct compressed_bio
btrfs: repair all known bad mirrors
btrfs: merge btrfs_dev_stat_print_on_error with its only caller
btrfs: join running log transaction when logging new name
btrfs: simplify error handling in btrfs_lookup_dentry
btrfs: send: always use the rbtree based inode ref management infrastructure
btrfs: send: fix sending link commands for existing file paths
btrfs: send: introduce recorded_ref_alloc and recorded_ref_free
btrfs: zoned: wait until zone is finished when allocation didn't progress
btrfs: zoned: write out partially allocated region
btrfs: zoned: activate necessary block group
btrfs: zoned: activate metadata block group on flush_space
btrfs: zoned: disable metadata overcommit for zoned
btrfs: zoned: introduce space_info->active_total_bytes
btrfs: zoned: finish least available block group on data bg allocation
btrfs: let can_allocate_chunk return error
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmLko3gQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpmQaD/90NKFj4v8I456TUQyg1jimXEsL+e84E6o2
ALWVb6JzQvlPVQXNLnK5YKIunMWOTtTMz0nyB8sVRwVJVJO0P5d7QopAkZM8fkyU
MK5OCzoryENw4DTc2wJS4in6cSbGylIuN74wMzlf7+M67JTImfoZQhbTMcjwzZfn
b3OlL6sID7zMXwGcuOJPZyUJICCpDhzdSF9JXqKma5PQuG2SBmQyvFxJAcsoFBPc
YetnoRIOIN6yBvsIZaPaYq7XI9MIvF0e67EQtyCEHj4tHpyVnyDWkeObVFULsISU
gGEKbkYPvNUzRAU5Q1NBBHh1tTfkf/MaUxTuZwoEwZ/s04IGBGMmrZGyfvdfzYo6
M7NwSEg/TrUSNfTwn65mQi7uOXu1pGkJrqz84Flm8u9Qid9Vd7LExLG5p/ggnWdH
5th93MDEmtEg29e9DXpEAuS5d0t3TtSvosflaKpyfNNfr+P0rWCN6GM/uW62VUTK
ls69SQh/AQJRbg64jU4xper6WhaYtSXK7TKEnxJycoEn9gYNyCcdot2uekth0xRH
ChHGmRlteiqe/y4uFWn/2dcxWjoleiHbFjTaiRL75WVl8wIDEjw02LGuoZ61Ss9H
WOV+MT7KqNjBGe6lreUY+O/PO02dzmoR6heJXN19p8zr/pBuLCTGX7UpO7rzgaBR
4N1HEozvIw==
=celk
-----END PGP SIGNATURE-----
Merge tag 'for-5.20/block-2022-07-29' of git://git.kernel.dk/linux-block
Pull block updates from Jens Axboe:
- Improve the type checking of request flags (Bart)
- Ensure queue mapping for a single queues always picks the right queue
(Bart)
- Sanitize the io priority handling (Jan)
- rq-qos race fix (Jinke)
- Reserved tags handling improvements (John)
- Separate memory alignment from file/disk offset aligment for O_DIRECT
(Keith)
- Add new ublk driver, userspace block driver using io_uring for
communication with the userspace backend (Ming)
- Use try_cmpxchg() to cleanup the code in various spots (Uros)
- Finally remove bdevname() (Christoph)
- Clean up the zoned device handling (Christoph)
- Clean up independent access range support (Christoph)
- Clean up and improve block sysfs handling (Christoph)
- Clean up and improve teardown of block devices.
This turns the usual two step process into something that is simpler
to implement and handle in block drivers (Christoph)
- Clean up chunk size handling (Christoph)
- Misc cleanups and fixes (Bart, Bo, Dan, GuoYong, Jason, Keith, Liu,
Ming, Sebastian, Yang, Ying)
* tag 'for-5.20/block-2022-07-29' of git://git.kernel.dk/linux-block: (178 commits)
ublk_drv: fix double shift bug
ublk_drv: make sure that correct flags(features) returned to userspace
ublk_drv: fix error handling of ublk_add_dev
ublk_drv: fix lockdep warning
block: remove __blk_get_queue
block: call blk_mq_exit_queue from disk_release for never added disks
blk-mq: fix error handling in __blk_mq_alloc_disk
ublk: defer disk allocation
ublk: rewrite ublk_ctrl_get_queue_affinity to not rely on hctx->cpumask
ublk: fold __ublk_create_dev into ublk_ctrl_add_dev
ublk: cleanup ublk_ctrl_uring_cmd
ublk: simplify ublk_ch_open and ublk_ch_release
ublk: remove the empty open and release block device operations
ublk: remove UBLK_IO_F_PREFLUSH
ublk: add a MAINTAINERS entry
block: don't allow the same type rq_qos add more than once
mmc: fix disk/queue leak in case of adding disk failure
ublk_drv: fix an IS_ERR() vs NULL check
ublk: remove UBLK_IO_F_INTEGRITY
ublk_drv: remove unneeded semicolon
...
Currently the checksum of compressed extents is verified based on the
compressed data and the lower btrfs_bio, but the actual repair process
is driven by end_bio_extent_readpage on the upper btrfs_bio for the
decompressed data.
This has a bunch of issues, including not being able to properly
communicate the failed mirror up in case that the I/O submission got
preempted, a general loss of if an error was an I/O error or a checksum
verification failure, but most importantly that this design causes
btrfs_clean_io_failure to eventually write back the uncompressed good
data onto the disk sectors that are supposed to contain compressed data.
Fix this by moving the repair to the lower btrfs_bio. To do so, a fair
amount of code has to be reshuffled:
a) the lower btrfs_bio now needs a valid csum pointer. The easiest way
to achieve that is to pass NULL btrfs_lookup_bio_sums and just use
the btrfs_bio management of csums. For a compressed_bio that is
split into multiple btrfs_bios this means additional memory
allocations, but the code becomes a lot more regular.
b) checksum verification now runs directly on the lower btrfs_bio instead
of the compressed_bio. This actually nicely simplifies the end I/O
processing.
c) btrfs_repair_one_sector can't just look up the logical address for
the file offset any more, as there is no corresponding relative
offsets that apply to the file offset and the logic address for
compressed extents. Instead require that the saved bvec_iter in the
btrfs_bio is filled out for all read bios and use that, which again
removes a fair amount of code.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Pass the btrfs_bio instead of the plain bio to btrfs_repair_one_sector,
and remove the start and failed_mirror arguments in favor of deriving
them from the btrfs_bio. For this to work ensure that the file_offset
field is also initialized for buffered I/O.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When there is more than a single level of redundancy there can also be
multiple bad mirrors, and the current read repair code only repairs the
last bad one.
Restructure btrfs_repair_one_sector so that it records the originally
failed mirror and the number of copies, and then repair all known bad
copies until we reach the originally failed copy in clean_io_failure.
Note that this also means the read repair reads will always start from
the next bad mirror and not mirror 0.
This fixes btrfs/265 in xfstests.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
On zoned filesystem, data write out is limited by max_zone_append_size,
and a large ordered extent is split according the size of a bio. OTOH,
the number of extents to be written is calculated using
BTRFS_MAX_EXTENT_SIZE, and that estimated number is used to reserve the
metadata bytes to update and/or create the metadata items.
The metadata reservation is done at e.g, btrfs_buffered_write() and then
released according to the estimation changes. Thus, if the number of extent
increases massively, the reserved metadata can run out.
The increase of the number of extents easily occurs on zoned filesystem
if BTRFS_MAX_EXTENT_SIZE > max_zone_append_size. And, it causes the
following warning on a small RAM environment with disabling metadata
over-commit (in the following patch).
[75721.498492] ------------[ cut here ]------------
[75721.505624] BTRFS: block rsv 1 returned -28
[75721.512230] WARNING: CPU: 24 PID: 2327559 at fs/btrfs/block-rsv.c:537 btrfs_use_block_rsv+0x560/0x760 [btrfs]
[75721.581854] CPU: 24 PID: 2327559 Comm: kworker/u64:10 Kdump: loaded Tainted: G W 5.18.0-rc2-BTRFS-ZNS+ #109
[75721.597200] Hardware name: Supermicro Super Server/H12SSL-NT, BIOS 2.0 02/22/2021
[75721.607310] Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
[75721.616209] RIP: 0010:btrfs_use_block_rsv+0x560/0x760 [btrfs]
[75721.646649] RSP: 0018:ffffc9000fbdf3e0 EFLAGS: 00010286
[75721.654126] RAX: 0000000000000000 RBX: 0000000000004000 RCX: 0000000000000000
[75721.663524] RDX: 0000000000000004 RSI: 0000000000000008 RDI: fffff52001f7be6e
[75721.672921] RBP: ffffc9000fbdf420 R08: 0000000000000001 R09: ffff889f8d1fc6c7
[75721.682493] R10: ffffed13f1a3f8d8 R11: 0000000000000001 R12: ffff88980a3c0e28
[75721.692284] R13: ffff889b66590000 R14: ffff88980a3c0e40 R15: ffff88980a3c0e8a
[75721.701878] FS: 0000000000000000(0000) GS:ffff889f8d000000(0000) knlGS:0000000000000000
[75721.712601] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[75721.720726] CR2: 000055d12e05c018 CR3: 0000800193594000 CR4: 0000000000350ee0
[75721.730499] Call Trace:
[75721.735166] <TASK>
[75721.739886] btrfs_alloc_tree_block+0x1e1/0x1100 [btrfs]
[75721.747545] ? btrfs_alloc_logged_file_extent+0x550/0x550 [btrfs]
[75721.756145] ? btrfs_get_32+0xea/0x2d0 [btrfs]
[75721.762852] ? btrfs_get_32+0xea/0x2d0 [btrfs]
[75721.769520] ? push_leaf_left+0x420/0x620 [btrfs]
[75721.776431] ? memcpy+0x4e/0x60
[75721.781931] split_leaf+0x433/0x12d0 [btrfs]
[75721.788392] ? btrfs_get_token_32+0x580/0x580 [btrfs]
[75721.795636] ? push_for_double_split.isra.0+0x420/0x420 [btrfs]
[75721.803759] ? leaf_space_used+0x15d/0x1a0 [btrfs]
[75721.811156] btrfs_search_slot+0x1bc3/0x2790 [btrfs]
[75721.818300] ? lock_downgrade+0x7c0/0x7c0
[75721.824411] ? free_extent_buffer.part.0+0x107/0x200 [btrfs]
[75721.832456] ? split_leaf+0x12d0/0x12d0 [btrfs]
[75721.839149] ? free_extent_buffer.part.0+0x14f/0x200 [btrfs]
[75721.846945] ? free_extent_buffer+0x13/0x20 [btrfs]
[75721.853960] ? btrfs_release_path+0x4b/0x190 [btrfs]
[75721.861429] btrfs_csum_file_blocks+0x85c/0x1500 [btrfs]
[75721.869313] ? rcu_read_lock_sched_held+0x16/0x80
[75721.876085] ? lock_release+0x552/0xf80
[75721.881957] ? btrfs_del_csums+0x8c0/0x8c0 [btrfs]
[75721.888886] ? __kasan_check_write+0x14/0x20
[75721.895152] ? do_raw_read_unlock+0x44/0x80
[75721.901323] ? _raw_write_lock_irq+0x60/0x80
[75721.907983] ? btrfs_global_root+0xb9/0xe0 [btrfs]
[75721.915166] ? btrfs_csum_root+0x12b/0x180 [btrfs]
[75721.921918] ? btrfs_get_global_root+0x820/0x820 [btrfs]
[75721.929166] ? _raw_write_unlock+0x23/0x40
[75721.935116] ? unpin_extent_cache+0x1e3/0x390 [btrfs]
[75721.942041] btrfs_finish_ordered_io.isra.0+0xa0c/0x1dc0 [btrfs]
[75721.949906] ? try_to_wake_up+0x30/0x14a0
[75721.955700] ? btrfs_unlink_subvol+0xda0/0xda0 [btrfs]
[75721.962661] ? rcu_read_lock_sched_held+0x16/0x80
[75721.969111] ? lock_acquire+0x41b/0x4c0
[75721.974982] finish_ordered_fn+0x15/0x20 [btrfs]
[75721.981639] btrfs_work_helper+0x1af/0xa80 [btrfs]
[75721.988184] ? _raw_spin_unlock_irq+0x28/0x50
[75721.994643] process_one_work+0x815/0x1460
[75722.000444] ? pwq_dec_nr_in_flight+0x250/0x250
[75722.006643] ? do_raw_spin_trylock+0xbb/0x190
[75722.013086] worker_thread+0x59a/0xeb0
[75722.018511] kthread+0x2ac/0x360
[75722.023428] ? process_one_work+0x1460/0x1460
[75722.029431] ? kthread_complete_and_exit+0x30/0x30
[75722.036044] ret_from_fork+0x22/0x30
[75722.041255] </TASK>
[75722.045047] irq event stamp: 0
[75722.049703] hardirqs last enabled at (0): [<0000000000000000>] 0x0
[75722.057610] hardirqs last disabled at (0): [<ffffffff8118a94a>] copy_process+0x1c1a/0x66b0
[75722.067533] softirqs last enabled at (0): [<ffffffff8118a989>] copy_process+0x1c59/0x66b0
[75722.077423] softirqs last disabled at (0): [<0000000000000000>] 0x0
[75722.085335] ---[ end trace 0000000000000000 ]---
To fix the estimation, we need to introduce fs_info->max_extent_size to
replace BTRFS_MAX_EXTENT_SIZE, which allow setting the different size for
regular vs zoned filesystem.
Set fs_info->max_extent_size to BTRFS_MAX_EXTENT_SIZE by default. On zoned
filesystem, it is set to fs_info->max_zone_append_size.
CC: stable@vger.kernel.org # 5.12+
Fixes: d8e3fb106f ("btrfs: zoned: use ZONE_APPEND write for zoned mode")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Same as in commit 21b4ee7029 ("xfs: drop ->writepage completely"): we
can remove the callback as it's only used in one place - single page
writeback from memory reclaim and is not called for cgroup writeback at
all.
We only allow such writeback from kswapd, not from direct memory
reclaim, and so it is rarely used. When it comes from kswapd, it is
effectively random dirty page shoot-down, which is horrible for IO
patterns. We can rely on background writeback to clean all dirty pages
in an efficient way and not let it be interrupted by kswapd.
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Simplify helper to return only next and prev pointers, we don't need all
the node/parent/prev/next pointers of __etree_search as there are now
other specialized helpers. Rename parameters so they follow the naming.
Signed-off-by: David Sterba <dsterba@suse.com>
With a slight extension of tree_search_for_insert (fill the return node
and parent return parameters) we can avoid calling __etree_search from
tree_search, that could be removed eventually in followup patches.
Signed-off-by: David Sterba <dsterba@suse.com>
The call chain from
tree_search
tree_search_for_insert
__etree_search
can be open coded and allow further simplifications, here we need a tree
search with fallback to the next node in case it's not found. This is
represented as __etree_search parameters next_ret=valid, prev_ret=NULL.
Signed-off-by: David Sterba <dsterba@suse.com>
In two cases the exact location where to insert the extent state is
known at the call time so we don't need to pass it to insert_state that
takes the fast path.
Signed-off-by: David Sterba <dsterba@suse.com>
The bits are passed to all extent state helpers for no apparent reason,
the value only read and never updated so remove the indirection and pass
it directly. Also unify the type to u32 where needed.
Signed-off-by: David Sterba <dsterba@suse.com>
Let callers of insert_state to set up the extent state to allow further
simplifications of the parameters.
Signed-off-by: David Sterba <dsterba@suse.com>
The rbtree search is a known pattern and can be open coded, allowing to
remove the tree_insert and further cleanups.
Signed-off-by: David Sterba <dsterba@suse.com>
Preparatory work to remove tree_insert from extent_io.c, the rbtree
search loop is a known and simple so it can be open coded.
Signed-off-by: David Sterba <dsterba@suse.com>
submit_one_bio always works on the bio and compression flags from a
btrfs_bio_ctrl structure. Pass the explicitly and clean up the
calling conventions by handling a NULL bio in submit_one_bio, and
using the btrfs_bio_ctrl to pass the mirror number as well.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Merge end_write_bio and flush_write_bio into a single submit_write_bio
helper, that either submits the bio or ends it if a negative errno was
passed in. This consolidates a lot of duplicated checks in the callers.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
submit_one_bio is only used for page cache I/O, so the inode can be
trivially derived from the first page in the bio.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The bios submitted from btrfs_map_bio don't really interact with the
rest of btrfs and the only btrfs_bio member actually used in the
low-level bios is the pointer to the btrfs_io_context used for endio
handler.
Use a union in struct btrfs_io_stripe that allows the endio handler to
find the btrfs_io_context and remove the spurious ->device assignment
so that a plain fs_bio_set bio can be used for the low-level bios
allocated inside btrfs_map_bio.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Set REQ_META in btrfs_submit_metadata_bio instead of the various callers.
We'll start relying on this flag inside of btrfs in a bit, and this
ensures it is always set correctly.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Split btrfs_submit_data_bio into one helper for reads and one for writes.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Both memzero_page and memcpy_to_page already call flush_dcache_page so
we can remove the calls from btrfs code.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Untangle the goto and move the code it jumps to so it goes in the order
of the most likely states first.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
Add a helper to end I/O on a single sector, which will come in handy
with the new read repair code.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function submit_data_read_repair() is only called for buffered data
read path, thus those members can be calculated using bvec directly:
- start
start = page_offset(bvec->bv_page) + bvec->bv_offset;
- end
end = start + bvec->bv_len - 1;
- page
page = bvec->bv_page;
- pgoff
pgoff = bvec->bv_offset;
Thus we can safely replace those 4 parameters with just one bio_vec.
Also remove the unused return value.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
[hch: also remove the return value]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
The <linux/mm.h> already provides the PAGE_ALIGNED macro. Let's
use it instead of IS_ALIGNED and passing PAGE_SIZE directly.
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Fanjun Kong <bh1scw@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmLRpPgACgkQxWXV+ddt
WDtu/BAAnfx7CXKIfWKpz6FZEio9Qb3mUHVOglyKzqR0qB72OdrC1dQMvEWPJc6h
N65di6+8tTNmRIlaFBMU0MDHODR2aDRpDtlR9eUzUuidTc4iOp1fi31uBwl31r7b
k8mCZBc/IAdfH13lBtcfkb2HGid7rik5ZC6Kx/glMcqh647QkSMAleupUsIYHKsK
IgcUWuN3wFIUK2WVgsja7+ljlwIHBHKRp9yrEYw+ef/B0NCNKvOnrIOPJzO7nxMP
1FbqJ6F7u7HjoMFcMwn5rbV/BoIwSSvXyKRqOW+EhGeQR/imVmkH9jXJ7wXdblSz
IvSqaZ0DaWWSvivdMpwbr8Z0Cu4iIYhVY6PSA0hukR63qB5GwKKJ6j1L0zoYoz8C
IDWJPW03FNRIu5ZOduvUQ3qG7jcJQZ3WPCCfrDST1cO2xHT/7f65Tjz4k0hvp4za
edITetC1mEv310CHeGsJaLxGYPNrRe38VZYPxgJ7yFpteGYjh0ZwsuyUHb4MH1no
JWwgElNW+m1BatdWSUBYk6xhqod1s2LOFPNqo7jNlv8I27hPViqCbBA2i9FlkXf+
FwL5kWyJXs69gfjUIj59381Z0U1VdA1tvU8GP2m2+JvIDS6ooAcZj7yEQ69mCZxi
2RFJIU0NFbnc/5j2ARSzOTGs9glDD0yffgXJM+cK+TWsQ3AC31I=
=/47A
-----END PGP SIGNATURE-----
Merge tag 'for-5.19-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs reverts from David Sterba:
"Due to a recent report [1] we need to revert the radix tree to xarray
conversion patches.
There's a problem with sleeping under spinlock, when xa_insert could
allocate memory under pressure. We use GFP_NOFS so this is a real
problem that we unfortunately did not discover during review.
I'm sorry to do such change at rc6 time but the revert is IMO the
safer option, there are patches to use mutex instead of the spin locks
but that would need more testing. The revert branch has been tested on
a few setups, all seem ok.
The conversion to xarray will be revisited in the future"
Link: https://lore.kernel.org/linux-btrfs/cover.1657097693.git.fdmanana@suse.com/ [1]
* tag 'for-5.19-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Revert "btrfs: turn delayed_nodes_tree into an XArray"
Revert "btrfs: turn name_cache radix tree into XArray in send_ctx"
Revert "btrfs: turn fs_info member buffer_radix into XArray"
Revert "btrfs: turn fs_roots_radix in btrfs_fs_info into an XArray"
This reverts commit 8ee922689d.
Revert the xarray conversion, there's a problem with potential
sleep-inside-spinlock [1] when calling xa_insert that triggers GFP_NOFS
allocation. The radix tree used the preloading mechanism to avoid
sleeping but this is not available in xarray.
Conversion from spin lock to mutex is possible but at time of rc6 is
riskier than a clean revert.
[1] https://lore.kernel.org/linux-btrfs/cover.1657097693.git.fdmanana@suse.com/
Reported-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Improve static type checking by using the enum req_op type for variables
that represent a request operation and the new blk_opf_t type for
variables that represent request flags.
Acked-by: David Sterba <dsterba@suse.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220714180729.1065367-51-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmK4dV4ACgkQxWXV+ddt
WDs4uQ/7B0XqPK05NJntJfwnuIoT/yOreKf47wt/6DyFV3CDMFte/qzaZwthwu6P
F0GMpSYAlVszLlML5elvF9VXymlV+e+QROtbD6QCNLNW1IwHA7ZiF5fV/a1Rj930
XSuaDyVFPAK7892RR6yMQ20IeMBuvqiAhXWEzaIJ2tIcAHn+fP+VkY8Nc0aZj3iC
mI+ep4n93karDxmnHVGUxJTxAe0l/uNopx+fYBWQDj7HuoMLo0Cu+rAdv0gRIxi2
RWUBkR4e4PBwV1OFScwNCsljjt6bHdUHrtdB3fo5Hzu9cO5hHdL7NEsKB1K2w7rV
bgNuNqfj6Y4xUBchAfQO5CCJ9ISci5KoJ4RBpk6EprZR3QN40kN8GPlhi2519K7w
F3d8jolDDHlkqxIsqoe47MYOcSepNEadVNsiYKb0rM6doilfxyXiu6dtTFMrC8Vy
K2HDCdTyuIgw+TnwqT1puaUwxiIL8DFJf1CVyjwGuQ4UgaIEkHXKIsCssyyJ76Jh
QkWX1aeRldbfkVArJWHQWqDQopx9pFBz1gjlws0YjAsU5YijOOXva464P9Rxg+Gq
4pRlgnO48joQam9bRirP2Z6yhqa4O6jkzKDOXSYduAUYD7IMfpsYnz09wKS95jj+
QCrR7VmKnpQdsXg5a/mqyacfIH30ph002VywRxPiFM89Syd25yo=
=rUrf
-----END PGP SIGNATURE-----
Merge tag 'for-5.19-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- zoned relocation fixes:
- fix critical section end for extent writeback, this could lead
to out of order write
- prevent writing to previous data relocation block group if space
gets low
- reflink fixes:
- fix race between reflinking and ordered extent completion
- proper error handling when block reserve migration fails
- add missing inode iversion/mtime/ctime updates on each iteration
when replacing extents
- fix deadlock when running fsync/fiemap/commit at the same time
- fix false-positive KCSAN report regarding pid tracking for read locks
and data race
- minor documentation update and link to new site
* tag 'for-5.19-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Documentation: update btrfs list of features and link to readthedocs.io
btrfs: fix deadlock with fsync+fiemap+transaction commit
btrfs: don't set lock_owner when locking extent buffer for reading
btrfs: zoned: fix critical section of relocation inode writeback
btrfs: zoned: prevent allocation from previous data relocation BG
btrfs: do not BUG_ON() on failure to migrate space when replacing extents
btrfs: add missing inode updates on each iteration when replacing extents
btrfs: fix race between reflinking and ordered extent completion
We use btrfs_zoned_data_reloc_{lock,unlock} to allow only one process to
write out to the relocation inode. That critical section must include all
the IO submission for the inode. However, flush_write_bio() in
extent_writepages() is out of the critical section, causing an IO
submission outside of the lock. This leads to an out of the order IO
submission and fail the relocation process.
Fix it by extending the critical section.
Fixes: 35156d8527 ("btrfs: zoned: only allow one process to add pages to a relocation inode")
CC: stable@vger.kernel.org # 5.16+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>