Remove references to the page lock and page->mapping. Also btrfs folios
can never be swizzled into swap (mentioned in extent_write_cache_pages()).
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Only allocate the btrfs_encoded_read_private structure for asynchronous
(io_uring) mode.
There's no need to allocate an object from slab in the synchronous mode. In
such a case stack can be happily used as it used to be before 68d3b27e05
("btrfs: move priv off stack in btrfs_encoded_read_regular_fill_pages()")
which was a preparation for the async mode.
While at it, fix the comment to reflect the atomic => refcount change in
d29662695e ("btrfs: fix use-after-free waiting for encoded read endios").
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmfLM5wACgkQxWXV+ddt
WDsK3A/7BEIUzin4CpmhBkFQamPCLjLL+Zz2etmoYWCrKnNPRMutVbsgeRM43cBt
NXMD4RSoeXO/aYzrPhe4KMP4a5PkI02v2CEpPJqMRPmbADGyExx5Vnh68ioZWQbi
N54Sd5LqhMT9FcViG46VJXr+MOBKIzO8++TxswIrCDO+6X/Y39+xZGxj4DXrnF38
zgvxbILbiH+7vC1m9NV8K7Vl0jp36hQKcCjJYCfohbVoFQiyvmuh2x0LDL2HnIfH
VpREP+eo/a3ZO8vPo7+4HZ5DVf5AolulbEC6myxsvFScLhWlh218plVyuv4QyACW
RYDm9MqLqfqOkEDgj+Tb0C4s6uyVon5xbRL3aNbSE73KnUVeb/bB77qAejjzAkIr
MvEEeEJp0H34OZm2fnUyFIu3ShDcSif1qH0rCOm1rBeqYZZsX7ny7TvKIqkgrsKk
JbzgpYLyzzqTHs9QERw3OUhIBuefFCs4HlUeukLbUCdqI+ruPp5s76jfHQnT3dzG
ad5CUW8eHf6mkU93dUlQIeDJSVPdaanf0Whomk3eOKgBeu8+gNp9R41kKJ7UtoA9
GG504bqNjSe8t0sVmSyuE30BWAQWYnyCSY/9u46JrB6MtfWv+wikU/Nox4qZjM4d
UhhWkDTELaTngcYkbm5+MD0DkkglTeqEbR9gCM21c9xiJrojhcw=
=v6KI
-----END PGP SIGNATURE-----
Merge tag 'for-6.14-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix leaked extent map after error when reading chunks
- replace use of deprecated strncpy
- in zoned mode, fixed range when ulocking extent range, causing a hang
* tag 'for-6.14-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix a leaked chunk map issue in read_one_chunk()
btrfs: replace deprecated strncpy() with strscpy()
btrfs: zoned: fix extent range end unlock in cow_file_range()
Some filesystems, such as NFS, cifs, ceph, and fuse, do not have
complete control of sequencing on the actual filesystem (e.g. on a
different server) and may find that the inode created for a mkdir
request already exists in the icache and dcache by the time the mkdir
request returns. For example, if the filesystem is mounted twice the
directory could be visible on the other mount before it is on the
original mount, and a pair of name_to_handle_at(), open_by_handle_at()
calls could instantiate the directory inode with an IS_ROOT() dentry
before the first mkdir returns.
This means that the dentry passed to ->mkdir() may not be the one that
is associated with the inode after the ->mkdir() completes. Some
callers need to interact with the inode after the ->mkdir completes and
they currently need to perform a lookup in the (rare) case that the
dentry is no longer hashed.
This lookup-after-mkdir requires that the directory remains locked to
avoid races. Planned future patches to lock the dentry rather than the
directory will mean that this lookup cannot be performed atomically with
the mkdir.
To remove this barrier, this patch changes ->mkdir to return the
resulting dentry if it is different from the one passed in.
Possible returns are:
NULL - the directory was created and no other dentry was used
ERR_PTR() - an error occurred
non-NULL - this other dentry was spliced in
This patch only changes file-systems to return "ERR_PTR(err)" instead of
"err" or equivalent transformations. Subsequent patches will make
further changes to some file-systems to return a correct dentry.
Not all filesystems reliably result in a positive hashed dentry:
- NFS, cifs, hostfs will sometimes need to perform a lookup of
the name to get inode information. Races could result in this
returning something different. Note that this lookup is
non-atomic which is what we are trying to avoid. Placing the
lookup in filesystem code means it only happens when the filesystem
has no other option.
- kernfs and tracefs leave the dentry negative and the ->revalidate
operation ensures that lookup will be called to correctly populate
the dentry. This could be fixed but I don't think it is important
to any of the users of vfs_mkdir() which look at the dentry.
The recommendation to use
d_drop();d_splice_alias()
is ugly but fits with current practice. A planned future patch will
change this.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: NeilBrown <neilb@suse.de>
Link: https://lore.kernel.org/r/20250227013949.536172-2-neilb@suse.de
Signed-off-by: Christian Brauner <brauner@kernel.org>
Running generic/751 on the for-next branch often results in a hang like
below. They are both stack by locking an extent. This suggests someone
forget to unlock an extent.
INFO: task kworker/u128:1:12 blocked for more than 323 seconds.
Not tainted 6.13.0-BTRFS-ZNS+ #503
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u128:1 state:D stack:0 pid:12 tgid:12 ppid:2 flags:0x00004000
Workqueue: btrfs-fixup btrfs_work_helper [btrfs]
Call Trace:
<TASK>
__schedule+0x534/0xdd0
schedule+0x39/0x140
__lock_extent+0x31b/0x380 [btrfs]
? __pfx_autoremove_wake_function+0x10/0x10
btrfs_writepage_fixup_worker+0xf1/0x3a0 [btrfs]
btrfs_work_helper+0xff/0x480 [btrfs]
? lock_release+0x178/0x2c0
process_one_work+0x1ee/0x570
? srso_return_thunk+0x5/0x5f
worker_thread+0x1d1/0x3b0
? __pfx_worker_thread+0x10/0x10
kthread+0x10b/0x230
? __pfx_kthread+0x10/0x10
ret_from_fork+0x30/0x50
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
INFO: task kworker/u134:0:184 blocked for more than 323 seconds.
Not tainted 6.13.0-BTRFS-ZNS+ #503
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u134:0 state:D stack:0 pid:184 tgid:184 ppid:2 flags:0x00004000
Workqueue: writeback wb_workfn (flush-btrfs-4)
Call Trace:
<TASK>
__schedule+0x534/0xdd0
schedule+0x39/0x140
__lock_extent+0x31b/0x380 [btrfs]
? __pfx_autoremove_wake_function+0x10/0x10
find_lock_delalloc_range+0xdb/0x260 [btrfs]
writepage_delalloc+0x12f/0x500 [btrfs]
? srso_return_thunk+0x5/0x5f
extent_write_cache_pages+0x232/0x840 [btrfs]
btrfs_writepages+0x72/0x130 [btrfs]
do_writepages+0xe7/0x260
? srso_return_thunk+0x5/0x5f
? lock_acquire+0xd2/0x300
? srso_return_thunk+0x5/0x5f
? find_held_lock+0x2b/0x80
? wbc_attach_and_unlock_inode.part.0+0x102/0x250
? wbc_attach_and_unlock_inode.part.0+0x102/0x250
__writeback_single_inode+0x5c/0x4b0
writeback_sb_inodes+0x22d/0x550
__writeback_inodes_wb+0x4c/0xe0
wb_writeback+0x2f6/0x3f0
wb_workfn+0x32a/0x510
process_one_work+0x1ee/0x570
? srso_return_thunk+0x5/0x5f
worker_thread+0x1d1/0x3b0
? __pfx_worker_thread+0x10/0x10
kthread+0x10b/0x230
? __pfx_kthread+0x10/0x10
ret_from_fork+0x30/0x50
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
This happens because we have another success path for the zoned mode. When
there is no active zone available, btrfs_reserve_extent() returns
-EAGAIN. In this case, we have two reactions.
(1) If the given range is never allocated, we can only wait for someone
to finish a zone, so wait on BTRFS_FS_NEED_ZONE_FINISH bit and retry
afterward.
(2) Or, if some allocations are already done, we must bail out and let
the caller to send IOs for the allocation. This is because these IOs
may be necessary to finish a zone.
The commit 06f3642847 ("btrfs: do proper folio cleanup when
cow_file_range() failed") moved the unlock code from the inside of the
loop to the outside. So, previously, the allocated extents are unlocked
just after the allocation and so before returning from the function.
However, they are no longer unlocked on the case (2) above. That caused
the hang issue.
Fix the issue by modifying the 'end' to the end of the allocated
range. Then, we can exit the loop and the same unlock code can properly
handle the case.
Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Fixes: 06f3642847 ("btrfs: do proper folio cleanup when cow_file_range() failed")
CC: stable@vger.kernel.org
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
indivudual patches which are described in their changelogs.
- "Allocate and free frozen pages" from Matthew Wilcox reorganizes the
page allocator so we end up with the ability to allocate and free
zero-refcount pages. So that callers (ie, slab) can avoid a refcount
inc & dec.
- "Support large folios for tmpfs" from Baolin Wang teaches tmpfs to use
large folios other than PMD-sized ones.
- "Fix mm/rodata_test" from Petr Tesarik performs some maintenance and
fixes for this small built-in kernel selftest.
- "mas_anode_descend() related cleanup" from Wei Yang tidies up part of
the mapletree code.
- "mm: fix format issues and param types" from Keren Sun implements a
few minor code cleanups.
- "simplify split calculation" from Wei Yang provides a few fixes and a
test for the mapletree code.
- "mm/vma: make more mmap logic userland testable" from Lorenzo Stoakes
continues the work of moving vma-related code into the (relatively) new
mm/vma.c.
- "mm/page_alloc: gfp flags cleanups for alloc_contig_*()" from David
Hildenbrand cleans up and rationalizes handling of gfp flags in the page
allocator.
- "readahead: Reintroduce fix for improper RA window sizing" from Jan
Kara is a second attempt at fixing a readahead window sizing issue. It
should reduce the amount of unnecessary reading.
- "synchronously scan and reclaim empty user PTE pages" from Qi Zheng
addresses an issue where "huge" amounts of pte pagetables are
accumulated
(https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/).
Qi's series addresses this windup by synchronously freeing PTE memory
within the context of madvise(MADV_DONTNEED).
- "selftest/mm: Remove warnings found by adding compiler flags" from
Muhammad Usama Anjum fixes some build warnings in the selftests code
when optional compiler warnings are enabled.
- "mm: don't use __GFP_HARDWALL when migrating remote pages" from David
Hildenbrand tightens the allocator's observance of __GFP_HARDWALL.
- "pkeys kselftests improvements" from Kevin Brodsky implements various
fixes and cleanups in the MM selftests code, mainly pertaining to the
pkeys tests.
- "mm/damon: add sample modules" from SeongJae Park enhances DAMON to
estimate application working set size.
- "memcg/hugetlb: Rework memcg hugetlb charging" from Joshua Hahn
provides some cleanups to memcg's hugetlb charging logic.
- "mm/swap_cgroup: remove global swap cgroup lock" from Kairui Song
removes the global swap cgroup lock. A speedup of 10% for a tmpfs-based
kernel build was demonstrated.
- "zram: split page type read/write handling" from Sergey Senozhatsky
has several fixes and cleaups for zram in the area of zram_write_page().
A watchdog softlockup warning was eliminated.
- "move pagetable_*_dtor() to __tlb_remove_table()" from Kevin Brodsky
cleans up the pagetable destructor implementations. A rare
use-after-free race is fixed.
- "mm/debug: introduce and use VM_WARN_ON_VMG()" from Lorenzo Stoakes
simplifies and cleans up the debugging code in the VMA merging logic.
- "Account page tables at all levels" from Kevin Brodsky cleans up and
regularizes the pagetable ctor/dtor handling. This results in
improvements in accounting accuracy.
- "mm/damon: replace most damon_callback usages in sysfs with new core
functions" from SeongJae Park cleans up and generalizes DAMON's sysfs
file interface logic.
- "mm/damon: enable page level properties based monitoring" from
SeongJae Park increases the amount of information which is presented in
response to DAMOS actions.
- "mm/damon: remove DAMON debugfs interface" from SeongJae Park removes
DAMON's long-deprecated debugfs interfaces. Thus the migration to sysfs
is completed.
- "mm/hugetlb: Refactor hugetlb allocation resv accounting" from Peter
Xu cleans up and generalizes the hugetlb reservation accounting.
- "mm: alloc_pages_bulk: small API refactor" from Luiz Capitulino
removes a never-used feature of the alloc_pages_bulk() interface.
- "mm/damon: extend DAMOS filters for inclusion" from SeongJae Park
extends DAMOS filters to support not only exclusion (rejecting), but
also inclusion (allowing) behavior.
- "Add zpdesc memory descriptor for zswap.zpool" from Alex Shi
"introduces a new memory descriptor for zswap.zpool that currently
overlaps with struct page for now. This is part of the effort to reduce
the size of struct page and to enable dynamic allocation of memory
descriptors."
- "mm, swap: rework of swap allocator locks" from Kairui Song redoes and
simplifies the swap allocator locking. A speedup of 400% was
demonstrated for one workload. As was a 35% reduction for kernel build
time with swap-on-zram.
- "mm: update mips to use do_mmap(), make mmap_region() internal" from
Lorenzo Stoakes reworks MIPS's use of mmap_region() so that
mmap_region() can be made MM-internal.
- "mm/mglru: performance optimizations" from Yu Zhao fixes a few MGLRU
regressions and otherwise improves MGLRU performance.
- "Docs/mm/damon: add tuning guide and misc updates" from SeongJae Park
updates DAMON documentation.
- "Cleanup for memfd_create()" from Isaac Manjarres does that thing.
- "mm: hugetlb+THP folio and migration cleanups" from David Hildenbrand
provides various cleanups in the areas of hugetlb folios, THP folios and
migration.
- "Uncached buffered IO" from Jens Axboe implements the new
RWF_DONTCACHE flag which provides synchronous dropbehind for pagecache
reading and writing. To permite userspace to address issues with
massive buildup of useless pagecache when reading/writing fast devices.
- "selftests/mm: virtual_address_range: Reduce memory" from Thomas
Weißschuh fixes and optimizes some of the MM selftests.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZ5a+cwAKCRDdBJ7gKXxA
jtoyAP9R58oaOKPJuTizEKKXvh/RpMyD6sYcz/uPpnf+cKTZxQEAqfVznfWlw/Lz
uC3KRZYhmd5YrxU4o+qjbzp9XWX/xAE=
=Ib2s
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2025-01-26-14-59' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
"The various patchsets are summarized below. Plus of course many
indivudual patches which are described in their changelogs.
- "Allocate and free frozen pages" from Matthew Wilcox reorganizes
the page allocator so we end up with the ability to allocate and
free zero-refcount pages. So that callers (ie, slab) can avoid a
refcount inc & dec
- "Support large folios for tmpfs" from Baolin Wang teaches tmpfs to
use large folios other than PMD-sized ones
- "Fix mm/rodata_test" from Petr Tesarik performs some maintenance
and fixes for this small built-in kernel selftest
- "mas_anode_descend() related cleanup" from Wei Yang tidies up part
of the mapletree code
- "mm: fix format issues and param types" from Keren Sun implements a
few minor code cleanups
- "simplify split calculation" from Wei Yang provides a few fixes and
a test for the mapletree code
- "mm/vma: make more mmap logic userland testable" from Lorenzo
Stoakes continues the work of moving vma-related code into the
(relatively) new mm/vma.c
- "mm/page_alloc: gfp flags cleanups for alloc_contig_*()" from David
Hildenbrand cleans up and rationalizes handling of gfp flags in the
page allocator
- "readahead: Reintroduce fix for improper RA window sizing" from Jan
Kara is a second attempt at fixing a readahead window sizing issue.
It should reduce the amount of unnecessary reading
- "synchronously scan and reclaim empty user PTE pages" from Qi Zheng
addresses an issue where "huge" amounts of pte pagetables are
accumulated:
https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/
Qi's series addresses this windup by synchronously freeing PTE
memory within the context of madvise(MADV_DONTNEED)
- "selftest/mm: Remove warnings found by adding compiler flags" from
Muhammad Usama Anjum fixes some build warnings in the selftests
code when optional compiler warnings are enabled
- "mm: don't use __GFP_HARDWALL when migrating remote pages" from
David Hildenbrand tightens the allocator's observance of
__GFP_HARDWALL
- "pkeys kselftests improvements" from Kevin Brodsky implements
various fixes and cleanups in the MM selftests code, mainly
pertaining to the pkeys tests
- "mm/damon: add sample modules" from SeongJae Park enhances DAMON to
estimate application working set size
- "memcg/hugetlb: Rework memcg hugetlb charging" from Joshua Hahn
provides some cleanups to memcg's hugetlb charging logic
- "mm/swap_cgroup: remove global swap cgroup lock" from Kairui Song
removes the global swap cgroup lock. A speedup of 10% for a
tmpfs-based kernel build was demonstrated
- "zram: split page type read/write handling" from Sergey Senozhatsky
has several fixes and cleaups for zram in the area of
zram_write_page(). A watchdog softlockup warning was eliminated
- "move pagetable_*_dtor() to __tlb_remove_table()" from Kevin
Brodsky cleans up the pagetable destructor implementations. A rare
use-after-free race is fixed
- "mm/debug: introduce and use VM_WARN_ON_VMG()" from Lorenzo Stoakes
simplifies and cleans up the debugging code in the VMA merging
logic
- "Account page tables at all levels" from Kevin Brodsky cleans up
and regularizes the pagetable ctor/dtor handling. This results in
improvements in accounting accuracy
- "mm/damon: replace most damon_callback usages in sysfs with new
core functions" from SeongJae Park cleans up and generalizes
DAMON's sysfs file interface logic
- "mm/damon: enable page level properties based monitoring" from
SeongJae Park increases the amount of information which is
presented in response to DAMOS actions
- "mm/damon: remove DAMON debugfs interface" from SeongJae Park
removes DAMON's long-deprecated debugfs interfaces. Thus the
migration to sysfs is completed
- "mm/hugetlb: Refactor hugetlb allocation resv accounting" from
Peter Xu cleans up and generalizes the hugetlb reservation
accounting
- "mm: alloc_pages_bulk: small API refactor" from Luiz Capitulino
removes a never-used feature of the alloc_pages_bulk() interface
- "mm/damon: extend DAMOS filters for inclusion" from SeongJae Park
extends DAMOS filters to support not only exclusion (rejecting),
but also inclusion (allowing) behavior
- "Add zpdesc memory descriptor for zswap.zpool" from Alex Shi
introduces a new memory descriptor for zswap.zpool that currently
overlaps with struct page for now. This is part of the effort to
reduce the size of struct page and to enable dynamic allocation of
memory descriptors
- "mm, swap: rework of swap allocator locks" from Kairui Song redoes
and simplifies the swap allocator locking. A speedup of 400% was
demonstrated for one workload. As was a 35% reduction for kernel
build time with swap-on-zram
- "mm: update mips to use do_mmap(), make mmap_region() internal"
from Lorenzo Stoakes reworks MIPS's use of mmap_region() so that
mmap_region() can be made MM-internal
- "mm/mglru: performance optimizations" from Yu Zhao fixes a few
MGLRU regressions and otherwise improves MGLRU performance
- "Docs/mm/damon: add tuning guide and misc updates" from SeongJae
Park updates DAMON documentation
- "Cleanup for memfd_create()" from Isaac Manjarres does that thing
- "mm: hugetlb+THP folio and migration cleanups" from David
Hildenbrand provides various cleanups in the areas of hugetlb
folios, THP folios and migration
- "Uncached buffered IO" from Jens Axboe implements the new
RWF_DONTCACHE flag which provides synchronous dropbehind for
pagecache reading and writing. To permite userspace to address
issues with massive buildup of useless pagecache when
reading/writing fast devices
- "selftests/mm: virtual_address_range: Reduce memory" from Thomas
Weißschuh fixes and optimizes some of the MM selftests"
* tag 'mm-stable-2025-01-26-14-59' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (321 commits)
mm/compaction: fix UBSAN shift-out-of-bounds warning
s390/mm: add missing ctor/dtor on page table upgrade
kasan: sw_tags: use str_on_off() helper in kasan_init_sw_tags()
tools: add VM_WARN_ON_VMG definition
mm/damon/core: use str_high_low() helper in damos_wmark_wait_us()
seqlock: add missing parameter documentation for raw_seqcount_try_begin()
mm/page-writeback: consolidate wb_thresh bumping logic into __wb_calc_thresh
mm/page_alloc: remove the incorrect and misleading comment
zram: remove zcomp_stream_put() from write_incompressible_page()
mm: separate move/undo parts from migrate_pages_batch()
mm/kfence: use str_write_read() helper in get_access_type()
selftests/mm/mkdirty: fix memory leak in test_uffdio_copy()
kasan: hw_tags: Use str_on_off() helper in kasan_init_hw_tags()
selftests/mm: virtual_address_range: avoid reading from VM_IO mappings
selftests/mm: vm_util: split up /proc/self/smaps parsing
selftests/mm: virtual_address_range: unmap chunks after validation
selftests/mm: virtual_address_range: mmap() without PROT_WRITE
selftests/memfd/memfd_test: fix possible NULL pointer dereference
mm: add FGP_DONTCACHE folio creation flag
mm: call filemap_fdatawrite_range_kick() after IOCB_DONTCACHE issue
...
Remove highest_bit and lowest_bit. After the HDD allocation path has been
removed, the only purpose of these two fields is to determine whether the
device is full or not, which can instead be determined by checking the
inuse_pages.
Link: https://lkml.kernel.org/r/20250113175732.48099-6-ryncsn@gmail.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Barry Song <v-songbaohua@oppo.com>
Cc: Chis Li <chrisl@kernel.org>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Hugh Dickens <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The function btrfs_cleanup_ordered_extents() is only called in error
handling path, and the last caller with a @locked_folio parameter was
removed to fix a bug in the btrfs_run_delalloc_range() error handling.
There is no need to pass @locked_folio parameter anymore.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All the error handling bugs I hit so far are all -ENOSPC from either:
- cow_file_range()
- run_delalloc_nocow()
- submit_uncompressed_range()
Previously when those functions failed, there was no error message at
all, making the debugging much harder.
So here we introduce extra error messages for:
- cow_file_range()
- run_delalloc_nocow()
- submit_uncompressed_range()
- writepage_delalloc() when btrfs_run_delalloc_range() failed
- extent_writepage() when extent_writepage_io() failed
One example of the new debug error messages is the following one:
run fstests generic/750 at 2024-12-08 12:41:41
BTRFS: device fsid 461b25f5-e240-4543-8deb-e7c2bd01a6d3 devid 1 transid 8 /dev/mapper/test-scratch1 (253:4) scanned by mount (2436600)
BTRFS info (device dm-4): first mount of filesystem 461b25f5-e240-4543-8deb-e7c2bd01a6d3
BTRFS info (device dm-4): using crc32c (crc32c-arm64) checksum algorithm
BTRFS info (device dm-4): forcing free space tree for sector size 4096 with page size 65536
BTRFS info (device dm-4): using free-space-tree
BTRFS warning (device dm-4): read-write for sector size 4096 with page size 65536 is experimental
BTRFS info (device dm-4): checking UUID tree
BTRFS error (device dm-4): cow_file_range failed, root=363 inode=412 start=503808 len=98304: -28
BTRFS error (device dm-4): run_delalloc_nocow failed, root=363 inode=412 start=503808 len=98304: -28
BTRFS error (device dm-4): failed to run delalloc range, root=363 ino=412 folio=458752 submit_bitmap=11-15 start=503808 len=98304: -28
Which shows an error from cow_file_range() which is called inside a
nocow write attempt, along with the extra bitmap from
writepage_delalloc().
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
With CONFIG_DEBUG_VM set, test case generic/476 has some chance to crash
with the following VM_BUG_ON_FOLIO():
BTRFS error (device dm-3): cow_file_range failed, start 1146880 end 1253375 len 106496 ret -28
BTRFS error (device dm-3): run_delalloc_nocow failed, start 1146880 end 1253375 len 106496 ret -28
page: refcount:4 mapcount:0 mapping:00000000592787cc index:0x12 pfn:0x10664
aops:btrfs_aops [btrfs] ino:101 dentry name(?):"f1774"
flags: 0x2fffff80004028(uptodate|lru|private|node=0|zone=2|lastcpupid=0xfffff)
page dumped because: VM_BUG_ON_FOLIO(!folio_test_locked(folio))
------------[ cut here ]------------
kernel BUG at mm/page-writeback.c:2992!
Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
CPU: 2 UID: 0 PID: 3943513 Comm: kworker/u24:15 Tainted: G OE 6.12.0-rc7-custom+ #87
Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022
Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs]
pc : folio_clear_dirty_for_io+0x128/0x258
lr : folio_clear_dirty_for_io+0x128/0x258
Call trace:
folio_clear_dirty_for_io+0x128/0x258
btrfs_folio_clamp_clear_dirty+0x80/0xd0 [btrfs]
__process_folios_contig+0x154/0x268 [btrfs]
extent_clear_unlock_delalloc+0x5c/0x80 [btrfs]
run_delalloc_nocow+0x5f8/0x760 [btrfs]
btrfs_run_delalloc_range+0xa8/0x220 [btrfs]
writepage_delalloc+0x230/0x4c8 [btrfs]
extent_writepage+0xb8/0x358 [btrfs]
extent_write_cache_pages+0x21c/0x4e8 [btrfs]
btrfs_writepages+0x94/0x150 [btrfs]
do_writepages+0x74/0x190
filemap_fdatawrite_wbc+0x88/0xc8
start_delalloc_inodes+0x178/0x3a8 [btrfs]
btrfs_start_delalloc_roots+0x174/0x280 [btrfs]
shrink_delalloc+0x114/0x280 [btrfs]
flush_space+0x250/0x2f8 [btrfs]
btrfs_async_reclaim_data_space+0x180/0x228 [btrfs]
process_one_work+0x164/0x408
worker_thread+0x25c/0x388
kthread+0x100/0x118
ret_from_fork+0x10/0x20
Code: 910a8021 a90363f7 a9046bf9 94012379 (d4210000)
---[ end trace 0000000000000000 ]---
[CAUSE]
The first two lines of extra debug messages show the problem is caused
by the error handling of run_delalloc_nocow().
E.g. we have the following dirtied range (4K blocksize 4K page size):
0 16K 32K
|//////////////////////////////////////|
| Pre-allocated |
And the range [0, 16K) has a preallocated extent.
- Enter run_delalloc_nocow() for range [0, 16K)
Which found range [0, 16K) is preallocated, can do the proper NOCOW
write.
- Enter fallback_to_fow() for range [16K, 32K)
Since the range [16K, 32K) is not backed by preallocated extent, we
have to go COW.
- cow_file_range() failed for range [16K, 32K)
So cow_file_range() will do the clean up by clearing folio dirty,
unlock the folios.
Now the folios in range [16K, 32K) is unlocked.
- Enter extent_clear_unlock_delalloc() from run_delalloc_nocow()
Which is called with PAGE_START_WRITEBACK to start page writeback.
But folios can only be marked writeback when it's properly locked,
thus this triggered the VM_BUG_ON_FOLIO().
Furthermore there is another hidden but common bug that
run_delalloc_nocow() is not clearing the folio dirty flags in its error
handling path.
This is the common bug shared between run_delalloc_nocow() and
cow_file_range().
[FIX]
- Clear folio dirty for range [@start, @cur_offset)
Introduce a helper, cleanup_dirty_folios(), which
will find and lock the folio in the range, clear the dirty flag and
start/end the writeback, with the extra handling for the
@locked_folio.
- Introduce a helper to clear folio dirty, start and end writeback
- Introduce a helper to record the last failed COW range end
This is to trace which range we should skip, to avoid double
unlocking.
- Skip the failed COW range for the error handling
CC: stable@vger.kernel.org
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
When testing with COW fixup marked as BUG_ON() (this is involved with the
new pin_user_pages*() change, which should not result new out-of-band
dirty pages), I hit a crash triggered by the BUG_ON() from hitting COW
fixup path.
This BUG_ON() happens just after a failed btrfs_run_delalloc_range():
BTRFS error (device dm-2): failed to run delalloc range, root 348 ino 405 folio 65536 submit_bitmap 6-15 start 90112 len 106496: -28
------------[ cut here ]------------
kernel BUG at fs/btrfs/extent_io.c:1444!
Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
CPU: 0 UID: 0 PID: 434621 Comm: kworker/u24:8 Tainted: G OE 6.12.0-rc7-custom+ #86
Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022
Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs]
pc : extent_writepage_io+0x2d4/0x308 [btrfs]
lr : extent_writepage_io+0x2d4/0x308 [btrfs]
Call trace:
extent_writepage_io+0x2d4/0x308 [btrfs]
extent_writepage+0x218/0x330 [btrfs]
extent_write_cache_pages+0x1d4/0x4b0 [btrfs]
btrfs_writepages+0x94/0x150 [btrfs]
do_writepages+0x74/0x190
filemap_fdatawrite_wbc+0x88/0xc8
start_delalloc_inodes+0x180/0x3b0 [btrfs]
btrfs_start_delalloc_roots+0x174/0x280 [btrfs]
shrink_delalloc+0x114/0x280 [btrfs]
flush_space+0x250/0x2f8 [btrfs]
btrfs_async_reclaim_data_space+0x180/0x228 [btrfs]
process_one_work+0x164/0x408
worker_thread+0x25c/0x388
kthread+0x100/0x118
ret_from_fork+0x10/0x20
Code: aa1403e1 9402f3ef aa1403e0 9402f36f (d4210000)
---[ end trace 0000000000000000 ]---
[CAUSE]
That failure is mostly from cow_file_range(), where we can hit -ENOSPC.
Although the -ENOSPC is already a bug related to our space reservation
code, let's just focus on the error handling.
For example, we have the following dirty range [0, 64K) of an inode,
with 4K sector size and 4K page size:
0 16K 32K 48K 64K
|///////////////////////////////////////|
|#######################################|
Where |///| means page are still dirty, and |###| means the extent io
tree has EXTENT_DELALLOC flag.
- Enter extent_writepage() for page 0
- Enter btrfs_run_delalloc_range() for range [0, 64K)
- Enter cow_file_range() for range [0, 64K)
- Function btrfs_reserve_extent() only reserved one 16K extent
So we created extent map and ordered extent for range [0, 16K)
0 16K 32K 48K 64K
|////////|//////////////////////////////|
|<- OE ->|##############################|
And range [0, 16K) has its delalloc flag cleared.
But since we haven't yet submit any bio, involved 4 pages are still
dirty.
- Function btrfs_reserve_extent() returns with -ENOSPC
Now we have to run error cleanup, which will clear all
EXTENT_DELALLOC* flags and clear the dirty flags for the remaining
ranges:
0 16K 32K 48K 64K
|////////| |
| | |
Note that range [0, 16K) still has its pages dirty.
- Some time later, writeback is triggered again for the range [0, 16K)
since the page range still has dirty flags.
- btrfs_run_delalloc_range() will do nothing because there is no
EXTENT_DELALLOC flag.
- extent_writepage_io() finds page 0 has no ordered flag
Which falls into the COW fixup path, triggering the BUG_ON().
Unfortunately this error handling bug dates back to the introduction of
btrfs. Thankfully with the abuse of COW fixup, at least it won't crash
the kernel.
[FIX]
Instead of immediately unlocking the extent and folios, we keep the extent
and folios locked until either erroring out or the whole delalloc range
finished.
When the whole delalloc range finished without error, we just unlock the
whole range with PAGE_SET_ORDERED (and PAGE_UNLOCK for !keep_locked
cases), with EXTENT_DELALLOC and EXTENT_LOCKED cleared.
And the involved folios will be properly submitted, with their dirty
flags cleared during submission.
For the error path, it will be a little more complex:
- The range with ordered extent allocated (range (1))
We only clear the EXTENT_DELALLOC and EXTENT_LOCKED, as the remaining
flags are cleaned up by
btrfs_mark_ordered_io_finished()->btrfs_finish_one_ordered().
For folios we finish the IO (clear dirty, start writeback and
immediately finish the writeback) and unlock the folios.
- The range with reserved extent but no ordered extent (range(2))
- The range we never touched (range(3))
For both range (2) and range(3) the behavior is not changed.
Now even if cow_file_range() failed halfway with some successfully
reserved extents/ordered extents, we will keep all folios clean, so
there will be no future writeback triggered on them.
CC: stable@vger.kernel.org
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
If we failed to compress the range, or cannot reserve a large enough
data extent (e.g. too fragmented free space), we will fall back to
submit_uncompressed_range().
But inside submit_uncompressed_range(), run_delalloc_cow() can also fail
due to -ENOSPC or any other error.
In that case there are 3 bugs in the error handling:
1) Double freeing for the same ordered extent
This can lead to crash due to ordered extent double accounting
2) Start/end writeback without updating the subpage writeback bitmap
3) Unlock the folio without clear the subpage lock bitmap
Both bugs 2) and 3) will crash the kernel if the btrfs block size is
smaller than folio size, as the next time the folio gets writeback/lock
updates, subpage will find the bitmap already have the range set,
triggering an ASSERT().
[CAUSE]
Bug 1) happens in the following call chain:
submit_uncompressed_range()
|- run_delalloc_cow()
| |- cow_file_range()
| |- btrfs_reserve_extent()
| Failed with -ENOSPC or whatever error
|
|- btrfs_clean_up_ordered_extents()
| |- btrfs_mark_ordered_io_finished()
| Which cleans all the ordered extents in the async_extent range.
|
|- btrfs_mark_ordered_io_finished()
Which cleans the folio range.
The finished ordered extents may not be immediately removed from the
ordered io tree, as they are removed inside a work queue.
So the second btrfs_mark_ordered_io_finished() may find the finished but
not-yet-removed ordered extents, and double free them.
Furthermore, the second btrfs_mark_ordered_io_finished() is not subpage
compatible, as it uses fixed folio_pos() with PAGE_SIZE, which can cover
other ordered extents.
Bugs 2) and 3) are more straightforward, btrfs just calls folio_unlock(),
folio_start_writeback() and folio_end_writeback(), other than the helpers
which handle subpage cases.
[FIX]
For bug 1) since the first btrfs_cleanup_ordered_extents() call is
handling the whole range, we should not do the second
btrfs_mark_ordered_io_finished() call.
And for the first btrfs_cleanup_ordered_extents(), we no longer need to
pass the @locked_page parameter, as we are already in the async extent
context, thus will never rely on the error handling inside
btrfs_run_delalloc_range().
So just let the btrfs_clean_up_ordered_extents() handle every folio
equally.
For bug 2) we should not even call
folio_start_writeback()/folio_end_writeback() anymore.
As the error handling protocol, cow_file_range() should clear
dirty flag and start/finish the writeback for the whole range passed in.
For bug 3) just change the folio_unlock() to btrfs_folio_end_lock()
helper.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
When running btrfs with block size (4K) smaller than page size (64K,
aarch64), there is a very high chance to crash the kernel at
generic/750, with the following messages:
(before the call traces, there are 3 extra debug messages added)
BTRFS warning (device dm-3): read-write for sector size 4096 with page size 65536 is experimental
BTRFS info (device dm-3): checking UUID tree
hrtimer: interrupt took 5451385 ns
BTRFS error (device dm-3): cow_file_range failed, root=4957 inode=257 start=1605632 len=69632: -28
BTRFS error (device dm-3): run_delalloc_nocow failed, root=4957 inode=257 start=1605632 len=69632: -28
BTRFS error (device dm-3): failed to run delalloc range, root=4957 ino=257 folio=1572864 submit_bitmap=8-15 start=1605632 len=69632: -28
------------[ cut here ]------------
WARNING: CPU: 2 PID: 3020984 at ordered-data.c:360 can_finish_ordered_extent+0x370/0x3b8 [btrfs]
CPU: 2 UID: 0 PID: 3020984 Comm: kworker/u24:1 Tainted: G OE 6.13.0-rc1-custom+ #89
Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022
Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs]
pc : can_finish_ordered_extent+0x370/0x3b8 [btrfs]
lr : can_finish_ordered_extent+0x1ec/0x3b8 [btrfs]
Call trace:
can_finish_ordered_extent+0x370/0x3b8 [btrfs] (P)
can_finish_ordered_extent+0x1ec/0x3b8 [btrfs] (L)
btrfs_mark_ordered_io_finished+0x130/0x2b8 [btrfs]
extent_writepage+0x10c/0x3b8 [btrfs]
extent_write_cache_pages+0x21c/0x4e8 [btrfs]
btrfs_writepages+0x94/0x160 [btrfs]
do_writepages+0x74/0x190
filemap_fdatawrite_wbc+0x74/0xa0
start_delalloc_inodes+0x17c/0x3b0 [btrfs]
btrfs_start_delalloc_roots+0x17c/0x288 [btrfs]
shrink_delalloc+0x11c/0x280 [btrfs]
flush_space+0x288/0x328 [btrfs]
btrfs_async_reclaim_data_space+0x180/0x228 [btrfs]
process_one_work+0x228/0x680
worker_thread+0x1bc/0x360
kthread+0x100/0x118
ret_from_fork+0x10/0x20
---[ end trace 0000000000000000 ]---
BTRFS critical (device dm-3): bad ordered extent accounting, root=4957 ino=257 OE offset=1605632 OE len=16384 to_dec=16384 left=0
BTRFS critical (device dm-3): bad ordered extent accounting, root=4957 ino=257 OE offset=1622016 OE len=12288 to_dec=12288 left=0
Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008
BTRFS critical (device dm-3): bad ordered extent accounting, root=4957 ino=257 OE offset=1634304 OE len=8192 to_dec=4096 left=0
CPU: 1 UID: 0 PID: 3286940 Comm: kworker/u24:3 Tainted: G W OE 6.13.0-rc1-custom+ #89
Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022
Workqueue: btrfs_work_helper [btrfs] (btrfs-endio-write)
pstate: 404000c5 (nZcv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : process_one_work+0x110/0x680
lr : worker_thread+0x1bc/0x360
Call trace:
process_one_work+0x110/0x680 (P)
worker_thread+0x1bc/0x360 (L)
worker_thread+0x1bc/0x360
kthread+0x100/0x118
ret_from_fork+0x10/0x20
Code: f84086a1 f9000fe1 53041c21 b9003361 (f9400661)
---[ end trace 0000000000000000 ]---
Kernel panic - not syncing: Oops: Fatal exception
SMP: stopping secondary CPUs
SMP: failed to stop secondary CPUs 2-3
Dumping ftrace buffer:
(ftrace buffer empty)
Kernel Offset: 0x275bb9540000 from 0xffff800080000000
PHYS_OFFSET: 0xffff8fbba0000000
CPU features: 0x100,00000070,00801250,8201720b
[CAUSE]
The above warning is triggered immediately after the delalloc range
failure, this happens in the following sequence:
- Range [1568K, 1636K) is dirty
1536K 1568K 1600K 1636K 1664K
| |/////////|////////| |
Where 1536K, 1600K and 1664K are page boundaries (64K page size)
- Enter extent_writepage() for page 1536K
- Enter run_delalloc_nocow() with locked page 1536K and range
[1568K, 1636K)
This is due to the inode having preallocated extents.
- Enter cow_file_range() with locked page 1536K and range
[1568K, 1636K)
- btrfs_reserve_extent() only reserved two extents
The main loop of cow_file_range() only reserved two data extents,
Now we have:
1536K 1568K 1600K 1636K 1664K
| |<-->|<--->|/|///////| |
1584K 1596K
Range [1568K, 1596K) has an ordered extent reserved.
- btrfs_reserve_extent() failed inside cow_file_range() for file offset
1596K
This is already a bug in our space reservation code, but for now let's
focus on the error handling path.
Now cow_file_range() returned -ENOSPC.
- btrfs_run_delalloc_range() do error cleanup <<< ROOT CAUSE
Call btrfs_cleanup_ordered_extents() with locked folio 1536K and range
[1568K, 1636K)
Function btrfs_cleanup_ordered_extents() normally needs to skip the
ranges inside the folio, as it will normally be cleaned up by
extent_writepage().
Such split error handling is already problematic in the first place.
What's worse is the folio range skipping itself, which is not taking
subpage cases into consideration at all, it will only skip the range
if the page start >= the range start.
In our case, the page start < the range start, since for subpage cases
we can have delalloc ranges inside the folio but not covering the
folio.
So it doesn't skip the page range at all.
This means all the ordered extents, both [1568K, 1584K) and
[1584K, 1596K) will be marked as IOERR.
And these two ordered extents have no more pending ios, they are marked
finished, and *QUEUED* to be deleted from the io tree.
- extent_writepage() do error cleanup
Call btrfs_mark_ordered_io_finished() for the range [1536K, 1600K).
Although ranges [1568K, 1584K) and [1584K, 1596K) are finished, the
deletion from io tree is async, it may or may not happen at this
time.
If the ranges have not yet been removed, we will do double cleaning on
those ranges, triggering the above ordered extent warnings.
In theory there are other bugs, like the cleanup in extent_writepage()
can cause double accounting on ranges that are submitted asynchronously
(compression for example).
But that's much harder to trigger because normally we do not mix regular
and compression delalloc ranges.
[FIX]
The folio range split is already buggy and not subpage compatible, it
was introduced a long time ago where subpage support was not even considered.
So instead of splitting the ordered extents cleanup into the folio range
and out of folio range, do all the cleanup inside writepage_delalloc().
- Pass @NULL as locked_folio for btrfs_cleanup_ordered_extents() in
btrfs_run_delalloc_range()
- Skip the btrfs_cleanup_ordered_extents() if writepage_delalloc()
failed
So all ordered extents are only cleaned up by
btrfs_run_delalloc_range().
- Handle the ranges that already have ordered extents allocated
If part of the folio already has ordered extent allocated, and
btrfs_run_delalloc_range() failed, we also need to cleanup that range.
Now we have a concentrated error handling for ordered extents during
btrfs_run_delalloc_range().
Fixes: d1051d6ebf ("btrfs: Fix error handling in btrfs_cleanup_ordered_extents")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have several places explicitly calling btrfs_mark_buffer_dirty() but
that is not necessarily since the target leaf came from a path that was
obtained for a btree search function that modifies the btree, something
like btrfs_insert_empty_item() or anything else that ends up calling
btrfs_search_slot() with a value of 1 for its 'cow' argument.
These just make the code more verbose, confusing and add a little extra
overhead and well as increase the module's text size, so remove them.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
During renames we are grouping transaction aborts that can be due to a
failure of one of several function calls. While this makes the code less
verbose, it makes it harder to debug as we end up not knowing from which
function call we got an error.
So change this to trigger a transaction abort after each function call
failure, so that when we get a transaction abort message we know exactly
which function call failed, helping us to debug issues.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of passing a root and an objectid which matches an inode number,
pass the inode instead, since the root is always the root associated to
the inode and the objectid is the number of that inode.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All callers of can_nocow_extent() now pass a value of false for its
'strict' argument, making it redundant. So remove the argument from
can_nocow_extent() as well as can_nocow_file_extent(),
btrfs_cross_ref_exist() and check_committed_ref(), because this
argument was used just to influence the behavior of check_committed_ref().
Also remove the 'strict' field from struct can_nocow_file_extent_args,
which is now always false as well, as its value is taken from the
argument to can_nocow_extent().
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since commit e1e577aafe41 ("btrfs: store fs_info in space_info"), we have
the fs_info in a space_info. So, we can drop fs_info argument from
btrfs_update_space_info_*. There is no behavior change.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmdw8AgACgkQxWXV+ddt
WDsL4w/+Ib5WGmd2Rjn1+1X9U5dzrEb+/072UBAhwwaqOOUTlBofeyRSdYqFB0oZ
aucRMXdXPpVe1xrXsj0WsOZmPsuZT46Eh2ALqqZP5fO1sgBkJ2WmQF0Ei7uypfb+
abQwiEO2IaMMwt2XgDNzbpZS7oVNGEXHzoHF0R/deL4FoBDNMsbCfRnW+L9++tWU
dUSpafLhgMMwivJN07VJYwU4ZVXsBhmKv2qI8WpJ5w9kJb1ssN692CvBOVjhuSYd
A8IMV84dW2KO37fmPqN36QAWotz4mKpv8yrhjJvrix7nAOcXe3TXFUhaFBh1Vmzg
G5bhkqYcNP6UHT7CIcLZE1mdv6ZAKTp0zSNCh2Uu51+MJL2tIQVjTaUQhbkYLnLN
9DS2dXz4ksm9ISrjr2tmPe4kgyNQIrp5TCdwXu3CYs+AaU7yKeEBukZ7mXcp/e/W
TdLKvzPRLMED8mGlFBwg2QbOvcJJ663UW2esyv6DvC61F3tXyiV2RXSC/1qF+RyZ
FBJvvEevensQlASn1NScuQV+iEQpMo2lMURnRjSG8dGhwMmHpW3wifa2TJDyBzWS
AH0MriQA9nsYQTkPGPnqr46/BAhFG2vEfVlX20Sk9S0PTBLu8YRy/o2evcV67J8v
zGaa5pa7fQPbEjRv4Rthdb4R2VIFkZTOtIZSZfjHkPDjtvS7ahU=
=NwGH
-----END PGP SIGNATURE-----
Merge tag 'for-6.13-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few more fixes that accumulated over the last two weeks, fixing some
user reported problems:
- swapfile fixes:
- conditional reschedule in the activation loop
- fix race with memory mapped file when activating
- make activation loop interruptible
- rework and fix extent sharing checks
- folio fixes:
- in send, recheck folio mapping after unlock
- in relocation, recheck folio mapping after unlock
- fix waiting for encoded read io_uring requests
- fix transaction atomicity when enabling simple quotas
- move COW block trace point before the block gets freed
- print various sizes in sysfs with correct endianity"
* tag 'for-6.13-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: sysfs: fix direct super block member reads
btrfs: fix transaction atomicity bug when enabling simple quotas
btrfs: avoid monopolizing a core when activating a swap file
btrfs: allow swap activation to be interruptible
btrfs: fix swap file activation failure due to extents that used to be shared
btrfs: fix race with memory mapped writes when activating swap file
btrfs: check folio mapping after unlock in put_file_data()
btrfs: check folio mapping after unlock in relocate_one_folio()
btrfs: fix use-after-free when COWing tree bock and tracing is enabled
btrfs: fix use-after-free waiting for encoded read endios
During swap activation we iterate over the extents of a file and we can
have many thousands of them, so we can end up in a busy loop monopolizing
a core. Avoid this by doing a voluntary reschedule after processing each
extent.
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
During swap activation we iterate over the extents of a file, then do
several checks for each extent, some of which may take some significant
time such as checking if an extent is shared. Since a file can have
many thousands of extents, this can be a very slow operation and it's
currently not interruptible. I had a bug during development of a previous
patch that resulted in an infinite loop when iterating the extents, so
a core was busy looping and I couldn't cancel the operation, which is very
annoying and requires a reboot. So make the loop interruptible by checking
for fatal signals at the end of each iteration and stopping immediately if
there is one.
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When activating a swap file, to determine if an extent is shared we use
can_nocow_extent(), which ends up at btrfs_cross_ref_exist(). That helper
is meant to be quick because it's used in the NOCOW write path, when
flushing delalloc and when doing a direct IO write, however it does return
some false positives, meaning it may indicate that an extent is shared
even if it's no longer the case. For the write path this is fine, we just
do a unnecessary COW operation instead of doing a more rigorous check
which would be too heavy (calling btrfs_is_data_extent_shared()).
However when activating a swap file, the false positives simply result
in a failure, which is confusing for users/applications. One particular
case where this happens is when a data extent only has 1 reference but
that reference is not inlined in the extent item located in the extent
tree - this happens when we create more than 33 references for an extent
and then delete those 33 references plus every other non-inline reference
except one. The function check_committed_ref() assumes that if the size
of an extent item doesn't match the size of struct btrfs_extent_item
plus the size of an inline reference (plus an owner reference in case
simple quotas are enabled), then the extent is shared - that is not the
case however, we can have a single reference but it's not inlined - the
reason we do this is to be fast and avoid inspecting non-inline references
which may be located in another leaf of the extent tree, slowing down
write paths.
The following test script reproduces the bug:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdi
MNT=/mnt/sdi
NUM_CLONES=50
umount $DEV &> /dev/null
run_test()
{
local sync_after_add_reflinks=$1
local sync_after_remove_reflinks=$2
mkfs.btrfs -f $DEV > /dev/null
#mkfs.xfs -f $DEV > /dev/null
mount $DEV $MNT
touch $MNT/foo
chmod 0600 $MNT/foo
# On btrfs the file must be NOCOW.
chattr +C $MNT/foo &> /dev/null
xfs_io -s -c "pwrite -b 1M 0 1M" $MNT/foo
mkswap $MNT/foo
for ((i = 1; i <= $NUM_CLONES; i++)); do
touch $MNT/foo_clone_$i
chmod 0600 $MNT/foo_clone_$i
# On btrfs the file must be NOCOW.
chattr +C $MNT/foo_clone_$i &> /dev/null
cp --reflink=always $MNT/foo $MNT/foo_clone_$i
done
if [ $sync_after_add_reflinks -ne 0 ]; then
# Flush delayed refs and commit current transaction.
sync -f $MNT
fi
# Remove the original file and all clones except the last.
rm -f $MNT/foo
for ((i = 1; i < $NUM_CLONES; i++)); do
rm -f $MNT/foo_clone_$i
done
if [ $sync_after_remove_reflinks -ne 0 ]; then
# Flush delayed refs and commit current transaction.
sync -f $MNT
fi
# Now use the last clone as a swap file. It should work since
# its extent are not shared anymore.
swapon $MNT/foo_clone_${NUM_CLONES}
swapoff $MNT/foo_clone_${NUM_CLONES}
umount $MNT
}
echo -e "\nTest without sync after creating and removing clones"
run_test 0 0
echo -e "\nTest with sync after creating clones"
run_test 1 0
echo -e "\nTest with sync after removing clones"
run_test 0 1
echo -e "\nTest with sync after creating and removing clones"
run_test 1 1
Running the test:
$ ./test.sh
Test without sync after creating and removing clones
wrote 1048576/1048576 bytes at offset 0
1 MiB, 1 ops; 0.0017 sec (556.793 MiB/sec and 556.7929 ops/sec)
Setting up swapspace version 1, size = 1020 KiB (1044480 bytes)
no label, UUID=a6b9c29e-5ef4-4689-a8ac-bc199c750f02
swapon: /mnt/sdi/foo_clone_50: swapon failed: Invalid argument
swapoff: /mnt/sdi/foo_clone_50: swapoff failed: Invalid argument
Test with sync after creating clones
wrote 1048576/1048576 bytes at offset 0
1 MiB, 1 ops; 0.0036 sec (271.739 MiB/sec and 271.7391 ops/sec)
Setting up swapspace version 1, size = 1020 KiB (1044480 bytes)
no label, UUID=5e9008d6-1f7a-4948-a1b4-3f30aba20a33
swapon: /mnt/sdi/foo_clone_50: swapon failed: Invalid argument
swapoff: /mnt/sdi/foo_clone_50: swapoff failed: Invalid argument
Test with sync after removing clones
wrote 1048576/1048576 bytes at offset 0
1 MiB, 1 ops; 0.0103 sec (96.665 MiB/sec and 96.6651 ops/sec)
Setting up swapspace version 1, size = 1020 KiB (1044480 bytes)
no label, UUID=916c2740-fa9f-4385-9f06-29c3f89e4764
Test with sync after creating and removing clones
wrote 1048576/1048576 bytes at offset 0
1 MiB, 1 ops; 0.0031 sec (314.268 MiB/sec and 314.2678 ops/sec)
Setting up swapspace version 1, size = 1020 KiB (1044480 bytes)
no label, UUID=06aab1dd-4d90-49c0-bd9f-3a8db4e2f912
swapon: /mnt/sdi/foo_clone_50: swapon failed: Invalid argument
swapoff: /mnt/sdi/foo_clone_50: swapoff failed: Invalid argument
Fix this by reworking btrfs_swap_activate() to instead of using extent
maps and checking for shared extents with can_nocow_extent(), iterate
over the inode's file extent items and use the accurate
btrfs_is_data_extent_shared().
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When activating the swap file we flush all delalloc and wait for ordered
extent completion, so that we don't miss any delalloc and extents before
we check that the file's extent layout is usable for a swap file and
activate the swap file. We are called with the inode's VFS lock acquired,
so we won't race with buffered and direct IO writes, however we can still
race with memory mapped writes since they don't acquire the inode's VFS
lock. The race window is between flushing all delalloc and locking the
whole file's extent range, since memory mapped writes lock an extent range
with the length of a page.
Fix this by acquiring the inode's mmap lock before we flush delalloc.
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Fix a use-after-free in the I/O completion path for encoded reads by
using a completion instead of a wait_queue for synchronizing the
destruction of 'struct btrfs_encoded_read_private'.
Fixes: 1881fba89b ("btrfs: add BTRFS_IOC_ENCODED_READ ioctl")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmdYzmoACgkQxWXV+ddt
WDv5GxAAnCsGctNax89x/VpCDZynRghrkxlzu/4kG/pqxsJyzlgXDFtzHAEewSMs
MYL+WCZLYpeKB5FpZq98mDJVLGNMG+9wqkx1bH/xy2ajBGZTeQe5pnkXMNlv9U1O
SX34t8nzOdTCENDnQeRc5I2vTcsQRhgHoVjJkAYdWdhcD9fs6xHKZRe+himlstSn
46ioKzEKSR3ztEUW4ycPF379g7d4kTR0hkk3pu5Nxe7ER8iq+jNSWXj0mzKg7mpJ
KxP56VgY0OrsiUcJr2qFZ1hQIp810puaAuM4C1lLgRplECHxtLbP9JvL9Rr7a8Ox
68tuThyLEpQtR59078jIX3RK6CwVi15rKb/ZkLZkW19TNSAAfM5qrB146hLBUM4T
16WaiJ0x9lVkH2oYQv8zbNZiqDxPhPUdS/JArNAcQYk9ma+C1hCsxPQ/N5yoWH/C
OABJddNR83sm4VTXu3Nci1EB8QuEoOuihYO6CdRkJ3PPNDuQiG6gwnoA2zqSihhy
L5fQaLSWAUsLczarHZrvAi9Y0rfG66QzqGR+A1K/8qMTQ8pSCupd+LfqVa21QpI1
Awx/wVFzsAm7z9CrnPTRJe+JSlBDQdeXWX7pDhhkXgwbCsMVSf3dbBweCD3o1EiM
BVI7SfEgImlbatd0QvDp9FcsnEqp90SCi+99U+zZCmQ1SW8CEC0=
=+DUB
-----END PGP SIGNATURE-----
Merge tag 'for-6.13-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few more fixes. Apart from the one liners and updated bio splitting
error handling there's a fix for subvolume mount with different flags.
This was known and fixed for some time but I've delayed it to give it
more testing.
- fix unbalanced locking when swapfile activation fails when the
subvolume gets deleted in the meantime
- add btrfs error handling after bio_split() calls that got error
handling recently
- during unmount, flush delalloc workers at the right time before the
cleaner thread is shut down
- fix regression in buffered write folio conversion, explicitly wait
for writeback as FGP_STABLE flag is currently a no-op on btrfs
- handle race in subvolume mount with different flags, the conversion
to the new mount API did not handle the case where multiple
subvolumes get mounted in parallel, which is a distro use case"
* tag 'for-6.13-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: flush delalloc workers queue before stopping cleaner kthread during unmount
btrfs: handle bio_split() errors
btrfs: properly wait for writeback before buffered write
btrfs: fix missing snapshot drew unlock when root is dead during swap activation
btrfs: fix mount failure due to remount races
When activating a swap file we acquire the root's snapshot drew lock and
then check if the root is dead, failing and returning with -EPERM if it's
dead but without unlocking the root's snapshot lock. Fix this by adding
the missing unlock.
Fixes: 60021bd754 ("btrfs: prevent subvol with swapfile from being deleted")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmdPMbMACgkQxWXV+ddt
WDtFSA/+Kd61BPKaiZFF0yjOsjBOlu8WMHFLO5xa2ZRqV3HUzm6rdO4wUSq3Eyqg
lrFszPfLA0REZIEdY2rqDKJduk1MZZg6NY7Pvn+/ByGOorM0Ym1BJoydtlUN5o2Y
AWeaDxs6LBMwRqpai0+AkikufK/jK7QfhHci+Oo2XmOv1C39DQbkXO76SYh/yERt
CcZNaSjm9DUlLxOOpkefxYvpPW4Uv4NSBfh9aymX/u1VxXqeuMuzPZqwZO+nwl/p
M1yr9fcbrqh3yKC+JjhD7xmOJM3x4c2PmzSkQTdepOuAlQvuQ/iFD+Zc1YjY0XlT
Fl938rdTKULgCarR5rdsXqdlFRnOprlgt0J1Pdf+GipTVU0EY3WU343HCz6h8pmG
F/NPvlCahkEtk1UCguL92NOBeAf0adWhuYfKjkuxYuL5ZTvzsOl2ymF6Vlja0Y39
VK4exjG4ilDESxweinJe53k3QLDwvUc2h0D291QVoo2X06dhXda8oOVK+6lR5mij
zDtXqurjlCybT1W7op6RcWMY0TaA8IR3Bo9oU1YbVcctuc86/X8SERv1G8MQtXhh
tevOVb/cKy7gEw9q73OtrH/J2EAklnsesnpH2sdt7WUEzOQPw9BIwEa/NRbjvoqn
n2ts5KktBhNjM1s0elHBf7o4+OaTrTlTrp0HelXyONZucPrzDQ4=
=4mFO
-----END PGP SIGNATURE-----
Merge tag 'for-6.13-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- add lockdep annotations for io_uring/encoded read integration, inode
lock is held when returning to userspace
- properly reflect experimental config option to sysfs
- handle NULL root in case the rescue mode accepts invalid/damaged tree
roots (rescue=ibadroot)
- regression fix of a deadlock between transaction and extent locks
- fix pending bio accounting bug in encoded read ioctl
- fix NOWAIT mode when checking references for NOCOW files
- fix use-after-free in a rb-tree cleanup in ref-verify debugging tool
* tag 'for-6.13-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix lockdep warnings on io_uring encoded reads
btrfs: ref-verify: fix use-after-free after invalid ref action
btrfs: add a sanity check for btrfs root in btrfs_search_slot()
btrfs: don't loop for nowait writes when checking for cross references
btrfs: sysfs: advertise experimental features only if CONFIG_BTRFS_EXPERIMENTAL=y
btrfs: fix deadlock between transaction commits and extent locks
btrfs: fix use-after-free in btrfs_encoded_read_endio()
When running a workload with fsstress and duperemove (generic/561) we can
hit a deadlock related to transaction commits and locking extent ranges,
as described below.
Task A hanging during a transaction commit, waiting for all other writers
to complete:
[178317.334817] INFO: task fsstress:555623 blocked for more than 120 seconds.
[178317.335693] Not tainted 6.12.0-rc6-btrfs-next-179+ #1
[178317.336528] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[178317.337673] task:fsstress state:D stack:0 pid:555623 tgid:555623 ppid:555620 flags:0x00004002
[178317.337679] Call Trace:
[178317.337681] <TASK>
[178317.337685] __schedule+0x364/0xbe0
[178317.337691] schedule+0x26/0xa0
[178317.337695] btrfs_commit_transaction+0x5c5/0x1050 [btrfs]
[178317.337769] ? start_transaction+0xc4/0x800 [btrfs]
[178317.337815] ? __pfx_autoremove_wake_function+0x10/0x10
[178317.337819] btrfs_mksubvol+0x381/0x640 [btrfs]
[178317.337878] btrfs_mksnapshot+0x7a/0xb0 [btrfs]
[178317.337935] __btrfs_ioctl_snap_create+0x1bb/0x1d0 [btrfs]
[178317.337995] btrfs_ioctl_snap_create_v2+0x103/0x130 [btrfs]
[178317.338053] btrfs_ioctl+0x29b/0x2a90 [btrfs]
[178317.338118] ? kmem_cache_alloc_noprof+0x5f/0x2c0
[178317.338126] ? getname_flags+0x45/0x1f0
[178317.338133] ? _raw_spin_unlock+0x15/0x30
[178317.338145] ? __x64_sys_ioctl+0x88/0xc0
[178317.338149] __x64_sys_ioctl+0x88/0xc0
[178317.338152] do_syscall_64+0x4a/0x110
[178317.338160] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[178317.338190] RIP: 0033:0x7f13c28e271b
Which corresponds to line 2361 of transaction.c:
$ cat -n fs/btrfs/transaction.c
(...)
2162 int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
2163 {
(...)
2349 spin_lock(&fs_info->trans_lock);
2350 add_pending_snapshot(trans);
2351 cur_trans->state = TRANS_STATE_COMMIT_DOING;
2352 spin_unlock(&fs_info->trans_lock);
2353
2354 /*
2355 * The thread has started/joined the transaction thus it holds the
2356 * lockdep map as a reader. It has to release it before acquiring the
2357 * lockdep map as a writer.
2358 */
2359 btrfs_lockdep_release(fs_info, btrfs_trans_num_writers);
2360 btrfs_might_wait_for_event(fs_info, btrfs_trans_num_writers);
2361 wait_event(cur_trans->writer_wait,
2362 atomic_read(&cur_trans->num_writers) == 1);
(...)
The transaction is in the TRANS_STATE_COMMIT_DOING state and so it's
waiting for all other existing writers to complete and release their
transaction handle.
Task B is running ordered extent completion and blocked waiting to lock an
extent range in an inode's io tree:
[178317.327411] INFO: task kworker/u48:8:554545 blocked for more than 120 seconds.
[178317.328630] Not tainted 6.12.0-rc6-btrfs-next-179+ #1
[178317.329635] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[178317.330872] task:kworker/u48:8 state:D stack:0 pid:554545 tgid:554545 ppid:2 flags:0x00004000
[178317.330878] Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
[178317.330944] Call Trace:
[178317.330945] <TASK>
[178317.330947] __schedule+0x364/0xbe0
[178317.330952] schedule+0x26/0xa0
[178317.330955] __lock_extent+0x337/0x3a0 [btrfs]
[178317.331014] ? __pfx_autoremove_wake_function+0x10/0x10
[178317.331017] btrfs_finish_one_ordered+0x47a/0xaa0 [btrfs]
[178317.331074] ? psi_group_change+0x132/0x2d0
[178317.331078] btrfs_work_helper+0xbd/0x370 [btrfs]
[178317.331140] process_scheduled_works+0xd3/0x460
[178317.331144] ? __pfx_worker_thread+0x10/0x10
[178317.331146] worker_thread+0x121/0x250
[178317.331149] ? __pfx_worker_thread+0x10/0x10
[178317.331151] kthread+0xe9/0x120
[178317.331154] ? __pfx_kthread+0x10/0x10
[178317.331157] ret_from_fork+0x2d/0x50
[178317.331159] ? __pfx_kthread+0x10/0x10
[178317.331162] ret_from_fork_asm+0x1a/0x30
This extent range locking happens after joining the current transaction,
so task A is waiting for task B to release its transaction handle
(decrementing the transaction's num_writers counter).
Task C while doing a fiemap it tries to join the current transaction:
[242682.812815] task:pool state:D stack:0 pid:560767 tgid:560724 ppid:555622 flags:0x00004006
[242682.812827] Call Trace:
[242682.812856] <TASK>
[242682.812864] __schedule+0x364/0xbe0
[242682.812879] ? _raw_spin_unlock_irqrestore+0x23/0x40
[242682.812897] schedule+0x26/0xa0
[242682.812909] wait_current_trans+0xd6/0x130 [btrfs]
[242682.813148] ? __pfx_autoremove_wake_function+0x10/0x10
[242682.813162] start_transaction+0x3d4/0x800 [btrfs]
[242682.813399] btrfs_is_data_extent_shared+0xd2/0x440 [btrfs]
[242682.813723] fiemap_process_hole+0x2a2/0x300 [btrfs]
[242682.813995] extent_fiemap+0x9b8/0xb80 [btrfs]
[242682.814249] btrfs_fiemap+0x78/0xc0 [btrfs]
[242682.814501] do_vfs_ioctl+0x2db/0xa50
[242682.814519] __x64_sys_ioctl+0x6a/0xc0
[242682.814531] do_syscall_64+0x4a/0x110
[242682.814544] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[242682.814556] RIP: 0033:0x7efff595e71b
It tries to join the current transaction, but it can't because the
transaction is in the TRANS_STATE_COMMIT_DOING state, so
join_transaction() returns -EBUSY to start_transaction() and makes it
wait for the current transaction to complete. And while it's waiting
for the transaction to complete, it's holding an extent range locked
in the same inode that task B is operating, which causes a deadlock
between these 3 tasks. The extent range for the inode was locked at
the start of the fiemap operation, early at extent_fiemap().
In short these tasks deadlock because:
1) Task A is waiting for task B to release its transaction handle;
2) Task B is waiting to lock an extent range for an inode while holding a
transaction handle open;
3) Task C is waiting for the current transaction to complete (for task A
to finish the transaction commit) while holding the extent range for
the inode locked, so task B can't progress and release its transaction
handle.
This results in an ABBA deadlock involving transaction commits and extent
locks. Extent locks are higher level locks, like inode VFS locks, and
should always be acquired before joining or starting a transaction, but
recently commit 2206265f41 ("btrfs: remove code duplication in ordered
extent finishing") accidentally changed btrfs_finish_one_ordered() to do
the transaction join before locking the extent range.
Fix this by making sure that btrfs_finish_one_ordered() always locks the
extent before joining a transaction and add an explicit comment about the
need for this order.
Fixes: 2206265f41 ("btrfs: remove code duplication in ordered extent finishing")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Shinichiro reported the following use-after free that sometimes is
happening in our CI system when running fstests' btrfs/284 on a TCMU
runner device:
BUG: KASAN: slab-use-after-free in lock_release+0x708/0x780
Read of size 8 at addr ffff888106a83f18 by task kworker/u80:6/219
CPU: 8 UID: 0 PID: 219 Comm: kworker/u80:6 Not tainted 6.12.0-rc6-kts+ #15
Hardware name: Supermicro Super Server/X11SPi-TF, BIOS 3.3 02/21/2020
Workqueue: btrfs-endio btrfs_end_bio_work [btrfs]
Call Trace:
<TASK>
dump_stack_lvl+0x6e/0xa0
? lock_release+0x708/0x780
print_report+0x174/0x505
? lock_release+0x708/0x780
? __virt_addr_valid+0x224/0x410
? lock_release+0x708/0x780
kasan_report+0xda/0x1b0
? lock_release+0x708/0x780
? __wake_up+0x44/0x60
lock_release+0x708/0x780
? __pfx_lock_release+0x10/0x10
? __pfx_do_raw_spin_lock+0x10/0x10
? lock_is_held_type+0x9a/0x110
_raw_spin_unlock_irqrestore+0x1f/0x60
__wake_up+0x44/0x60
btrfs_encoded_read_endio+0x14b/0x190 [btrfs]
btrfs_check_read_bio+0x8d9/0x1360 [btrfs]
? lock_release+0x1b0/0x780
? trace_lock_acquire+0x12f/0x1a0
? __pfx_btrfs_check_read_bio+0x10/0x10 [btrfs]
? process_one_work+0x7e3/0x1460
? lock_acquire+0x31/0xc0
? process_one_work+0x7e3/0x1460
process_one_work+0x85c/0x1460
? __pfx_process_one_work+0x10/0x10
? assign_work+0x16c/0x240
worker_thread+0x5e6/0xfc0
? __pfx_worker_thread+0x10/0x10
kthread+0x2c3/0x3a0
? __pfx_kthread+0x10/0x10
ret_from_fork+0x31/0x70
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
Allocated by task 3661:
kasan_save_stack+0x30/0x50
kasan_save_track+0x14/0x30
__kasan_kmalloc+0xaa/0xb0
btrfs_encoded_read_regular_fill_pages+0x16c/0x6d0 [btrfs]
send_extent_data+0xf0f/0x24a0 [btrfs]
process_extent+0x48a/0x1830 [btrfs]
changed_cb+0x178b/0x2ea0 [btrfs]
btrfs_ioctl_send+0x3bf9/0x5c20 [btrfs]
_btrfs_ioctl_send+0x117/0x330 [btrfs]
btrfs_ioctl+0x184a/0x60a0 [btrfs]
__x64_sys_ioctl+0x12e/0x1a0
do_syscall_64+0x95/0x180
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Freed by task 3661:
kasan_save_stack+0x30/0x50
kasan_save_track+0x14/0x30
kasan_save_free_info+0x3b/0x70
__kasan_slab_free+0x4f/0x70
kfree+0x143/0x490
btrfs_encoded_read_regular_fill_pages+0x531/0x6d0 [btrfs]
send_extent_data+0xf0f/0x24a0 [btrfs]
process_extent+0x48a/0x1830 [btrfs]
changed_cb+0x178b/0x2ea0 [btrfs]
btrfs_ioctl_send+0x3bf9/0x5c20 [btrfs]
_btrfs_ioctl_send+0x117/0x330 [btrfs]
btrfs_ioctl+0x184a/0x60a0 [btrfs]
__x64_sys_ioctl+0x12e/0x1a0
do_syscall_64+0x95/0x180
entry_SYSCALL_64_after_hwframe+0x76/0x7e
The buggy address belongs to the object at ffff888106a83f00
which belongs to the cache kmalloc-rnd-07-96 of size 96
The buggy address is located 24 bytes inside of
freed 96-byte region [ffff888106a83f00, ffff888106a83f60)
The buggy address belongs to the physical page:
page: refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff888106a83800 pfn:0x106a83
flags: 0x17ffffc0000000(node=0|zone=2|lastcpupid=0x1fffff)
page_type: f5(slab)
raw: 0017ffffc0000000 ffff888100053680 ffffea0004917200 0000000000000004
raw: ffff888106a83800 0000000080200019 00000001f5000000 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff888106a83e00: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
ffff888106a83e80: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
>ffff888106a83f00: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
^
ffff888106a83f80: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
ffff888106a84000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
==================================================================
Further analyzing the trace and the crash dump's vmcore file shows that
the wake_up() call in btrfs_encoded_read_endio() is calling wake_up() on
the wait_queue that is in the private data passed to the end_io handler.
Commit 4ff47df40447 ("btrfs: move priv off stack in
btrfs_encoded_read_regular_fill_pages()") moved 'struct
btrfs_encoded_read_private' off the stack.
Before that commit one can see a corruption of the private data when
analyzing the vmcore after a crash:
*(struct btrfs_encoded_read_private *)0xffff88815626eec8 = {
.wait = (wait_queue_head_t){
.lock = (spinlock_t){
.rlock = (struct raw_spinlock){
.raw_lock = (arch_spinlock_t){
.val = (atomic_t){
.counter = (int)-2005885696,
},
.locked = (u8)0,
.pending = (u8)157,
.locked_pending = (u16)40192,
.tail = (u16)34928,
},
.magic = (unsigned int)536325682,
.owner_cpu = (unsigned int)29,
.owner = (void *)__SCT__tp_func_btrfs_transaction_commit+0x0 = 0x0,
.dep_map = (struct lockdep_map){
.key = (struct lock_class_key *)0xffff8881575a3b6c,
.class_cache = (struct lock_class *[2]){ 0xffff8882a71985c0, 0xffffea00066f5d40 },
.name = (const char *)0xffff88815626f100 = "",
.wait_type_outer = (u8)37,
.wait_type_inner = (u8)178,
.lock_type = (u8)154,
},
},
.__padding = (u8 [24]){ 0, 157, 112, 136, 50, 174, 247, 31, 29 },
.dep_map = (struct lockdep_map){
.key = (struct lock_class_key *)0xffff8881575a3b6c,
.class_cache = (struct lock_class *[2]){ 0xffff8882a71985c0, 0xffffea00066f5d40 },
.name = (const char *)0xffff88815626f100 = "",
.wait_type_outer = (u8)37,
.wait_type_inner = (u8)178,
.lock_type = (u8)154,
},
},
.head = (struct list_head){
.next = (struct list_head *)0x112cca,
.prev = (struct list_head *)0x47,
},
},
.pending = (atomic_t){
.counter = (int)-1491499288,
},
.status = (blk_status_t)130,
}
Here we can see several indicators of in-memory data corruption, e.g. the
large negative atomic values of ->pending or
->wait->lock->rlock->raw_lock->val, as well as the bogus spinlock magic
0x1ff7ae32 (decimal 536325682 above) instead of 0xdead4ead or the bogus
pointer values for ->wait->head.
To fix this, change atomic_dec_return() to atomic_dec_and_test() to fix the
corruption, as atomic_dec_return() is defined as two instructions on
x86_64, whereas atomic_dec_and_test() is defined as a single atomic
operation. This can lead to a situation where counter value is already
decremented but the if statement in btrfs_encoded_read_endio() is not
completely processed, i.e. the 0 test has not completed. If another thread
continues executing btrfs_encoded_read_regular_fill_pages() the
atomic_dec_return() there can see an already updated ->pending counter and
continues by freeing the private data. Continuing in the endio handler the
test for 0 succeeds and the wait_queue is woken up, resulting in a
use-after-free.
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Suggested-by: Damien Le Moal <Damien.LeMoal@wdc.com>
Fixes: 1881fba89b ("btrfs: add BTRFS_IOC_ENCODED_READ ioctl")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmc0zT4ACgkQxWXV+ddt
WDtThRAAhzSSiHcJqTfCL5nHh7w85MNEVw28o1ETgXSYJmx0JOWLE7Znlp2FV7jj
IbYkFfF2gXJzYvRZkcXB/TAHV9KJG5yZIBZfccbM+9db9f8xkImVKMuqQRXPU41R
ppSCmqZTeujtt8ucsaJkMpm6pzECKJCJaGOsMJ8fiqKpo89dKO3eGAVboSbpPF4C
r0YmppiBwSP/cCXQCqWxZRbqPGN+lUgZpIGNRi157kehfmRHlVVJTO1pgqK8PCXb
uIT09Kulppfez8+1A10CPcniDTyinLik/qLTNlzdWoDBL4iNJMg0A0wsA04AJVf0
PdOS0REusiv3QcEIO6PefuRFRRfXcSLPpPDUceltJT5O0uM2gUqf2C7dEHXUGU3o
TdgYlbQpsJWpZ7VGWQDZeGGV04lOPQvu0LGLPgEerUQd5H9ABa0dX8Fn0sPhKsa8
whpAcdfE4rdNxB2OJFnqQeFq0z3cSjP/rvKlluCmAj97QYI+kiu3QyhemcT1YSC9
U7n5Ya9IzIYCN3ml54q3hEgyD0IVGGG20GuUmqC9XSP9mrQRC8I1g7v26AiOTrrk
VhgSdtMmphDxXudifsnYMaQ0Z1QqiUrW1SM/prAEOnBYCo75+HDsTgrq9ithgHoI
4xz4YXJyMRs18qfTJctXC1wmGuz5plTdQrwarHdNsELN5HEyqX4=
=aAcf
-----END PGP SIGNATURE-----
Merge tag 'for-6.13-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"Changes outside of btrfs: add io_uring command flag to track a dying
task (the rest will go via the block git tree).
User visible changes:
- wire encoded read (ioctl) to io_uring commands, this can be used on
itself, in the future this will allow 'send' to be asynchronous. As
a consequence, the encoded read ioctl can also work in non-blocking
mode
- new ioctl to wait for cleaned subvolumes, no need to use the
generic and root-only SEARCH_TREE ioctl, will be used by "btrfs
subvol sync"
- recognize different paths/symlinks for the same devices and don't
report them during rescanning, this can be observed with LVM or DM
- seeding device use case change, the sprout device (the one
capturing new writes) will not clear the read-only status of the
super block; this prevents accumulating space from deleted
snapshots
Performance improvements:
- reduce lock contention when traversing extent buffers
- reduce extent tree lock contention when searching for inline
backref
- switch from rb-trees to xarray for delayed ref tracking,
improvements due to better cache locality, branching factors and
more compact data structures
- enable extent map shrinker again (prevent memory exhaustion under
some types of IO load), reworked to run in a single worker thread
(there used to be problems causing long stalls under memory
pressure)
Core changes:
- raid-stripe-tree feature updates:
- make device replace and scrub work
- implement partial deletion of stripe extents
- new selftests
- split the config option BTRFS_DEBUG and add EXPERIMENTAL for
features that are experimental or with known problems so we don't
misuse debugging config for that
- subpage mode updates (sector < page):
- update compression implementations
- update writepage, writeback
- continued folio API conversions:
- buffered writes
- make buffered write copy one page at a time, preparatory work for
future integration with large folios, may cause performance drop
- proper locking of root item regarding starting send
- error handling improvements
- code cleanups and refactoring:
- dead code removal
- unused parameter reduction
- lockdep assertions"
* tag 'for-6.13-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (119 commits)
btrfs: send: check for read-only send root under critical section
btrfs: send: check for dead send root under critical section
btrfs: remove check for NULL fs_info at btrfs_folio_end_lock_bitmap()
btrfs: fix warning on PTR_ERR() against NULL device at btrfs_control_ioctl()
btrfs: fix a typo in btrfs_use_zone_append
btrfs: avoid superfluous calls to free_extent_map() in btrfs_encoded_read()
btrfs: simplify logic to decrement snapshot counter at btrfs_mksnapshot()
btrfs: remove hole from struct btrfs_delayed_node
btrfs: update stale comment for struct btrfs_delayed_ref_node::add_list
btrfs: add new ioctl to wait for cleaned subvolumes
btrfs: simplify range tracking in cow_file_range()
btrfs: remove conditional path allocation in btrfs_read_locked_inode()
btrfs: push cleanup into btrfs_read_locked_inode()
io_uring/cmd: let cmds to know about dying task
btrfs: add struct io_btrfs_cmd as type for io_uring_cmd_to_pdu()
btrfs: add io_uring command for encoded reads (ENCODED_READ ioctl)
btrfs: move priv off stack in btrfs_encoded_read_regular_fill_pages()
btrfs: don't sleep in btrfs_encoded_read() if IOCB_NOWAIT is set
btrfs: change btrfs_encoded_read() so that reading of extent is done by caller
btrfs: remove pointless iocb::ki_pos addition in btrfs_encoded_read()
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZzcUQAAKCRCRxhvAZXjc
onEpAQCUdwIBHpwmSIFvJFA9aNGpbLzi0dDSEIxuWYtp5qVuogD+ImccwqpG3kEi
Zq9vokdPpB1zbahxKl1mkvBG4G0GFQE=
=LbP6
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.13.pagecache' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs pagecache updates from Christian Brauner:
"Cleanup filesystem page flag usage: This continues the work to make
the mappedtodisk/owner_2 flag available to filesystems which don't use
buffer heads. Further patches remove uses of Private2. This brings us
very close to being rid of it entirely"
* tag 'vfs-6.13.pagecache' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
migrate: Remove references to Private2
ceph: Remove call to PagePrivate2()
btrfs: Switch from using the private_2 flag to owner_2
mm: Remove PageMappedToDisk
nilfs2: Convert nilfs_copy_buffer() to use folios
fs: Move clearing of mappedtodisk to buffer.c
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZzcToAAKCRCRxhvAZXjc
osL9AP948FFumJRC28gDJ4xp+X4eohNOfkgoEG8FTbF2zU6ulwD+O0pr26FqpFli
pqlG+38UdATImpfqqWjPbb72sBYcfQg=
=wLUh
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.13.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"Features:
- Fixup and improve NLM and kNFSD file lock callbacks
Last year both GFS2 and OCFS2 had some work done to make their
locking more robust when exported over NFS. Unfortunately, part of
that work caused both NLM (for NFS v3 exports) and kNFSD (for
NFSv4.1+ exports) to no longer send lock notifications to clients
This in itself is not a huge problem because most NFS clients will
still poll the server in order to acquire a conflicted lock
It's important for NLM and kNFSD that they do not block their
kernel threads inside filesystem's file_lock implementations
because that can produce deadlocks. We used to make sure of this by
only trusting that posix_lock_file() can correctly handle blocking
lock calls asynchronously, so the lock managers would only setup
their file_lock requests for async callbacks if the filesystem did
not define its own lock() file operation
However, when GFS2 and OCFS2 grew the capability to correctly
handle blocking lock requests asynchronously, they started
signalling this behavior with EXPORT_OP_ASYNC_LOCK, and the check
for also trusting posix_lock_file() was inadvertently dropped, so
now most filesystems no longer produce lock notifications when
exported over NFS
Fix this by using an fop_flag which greatly simplifies the problem
and grooms the way for future uses by both filesystems and lock
managers alike
- Add a sysctl to delete the dentry when a file is removed instead of
making it a negative dentry
Commit 681ce86235 ("vfs: Delete the associated dentry when
deleting a file") introduced an unconditional deletion of the
associated dentry when a file is removed. However, this led to
performance regressions in specific benchmarks, such as
ilebench.sum_operations/s, prompting a revert in commit
4a4be1ad3a ("Revert "vfs: Delete the associated dentry when
deleting a file""). This reintroduces the concept conditionally
through a sysctl
- Expand the statmount() system call:
* Report the filesystem subtype in a new fs_subtype field to
e.g., report fuse filesystem subtypes
* Report the superblock source in a new sb_source field
* Add a new way to return filesystem specific mount options in an
option array that returns filesystem specific mount options
separated by zero bytes and unescaped. This allows caller's to
retrieve filesystem specific mount options and immediately pass
them to e.g., fsconfig() without having to unescape or split
them
* Report security (LSM) specific mount options in a separate
security option array. We don't lump them together with
filesystem specific mount options as security mount options are
generic and most users aren't interested in them
The format is the same as for the filesystem specific mount
option array
- Support relative paths in fsconfig()'s FSCONFIG_SET_STRING command
- Optimize acl_permission_check() to avoid costly {g,u}id ownership
checks if possible
- Use smp_mb__after_spinlock() to avoid full smp_mb() in evict()
- Add synchronous wakeup support for ep_poll_callback.
Currently, epoll only uses wake_up() to wake up task. But sometimes
there are epoll users which want to use the synchronous wakeup flag
to give a hint to the scheduler, e.g., the Android binder driver.
So add a wake_up_sync() define, and use wake_up_sync() when sync is
true in ep_poll_callback()
Fixes:
- Fix kernel documentation for inode_insert5() and iget5_locked()
- Annotate racy epoll check on file->f_ep
- Make F_DUPFD_QUERY associative
- Avoid filename buffer overrun in initramfs
- Don't let statmount() return empty strings
- Add a cond_resched() to dump_user_range() to avoid hogging the CPU
- Don't query the device logical blocksize multiple times for hfsplus
- Make filemap_read() check that the offset is positive or zero
Cleanups:
- Various typo fixes
- Cleanup wbc_attach_fdatawrite_inode()
- Add __releases annotation to wbc_attach_and_unlock_inode()
- Add hugetlbfs tracepoints
- Fix various vfs kernel doc parameters
- Remove obsolete TODO comment from io_cancel()
- Convert wbc_account_cgroup_owner() to take a folio
- Fix comments for BANDWITH_INTERVAL and wb_domain_writeout_add()
- Reorder struct posix_acl to save 8 bytes
- Annotate struct posix_acl with __counted_by()
- Replace one-element array with flexible array member in freevxfs
- Use idiomatic atomic64_inc_return() in alloc_mnt_ns()"
* tag 'vfs-6.13.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (35 commits)
statmount: retrieve security mount options
vfs: make evict() use smp_mb__after_spinlock instead of smp_mb
statmount: add flag to retrieve unescaped options
fs: add the ability for statmount() to report the sb_source
writeback: wbc_attach_fdatawrite_inode out of line
writeback: add a __releases annoation to wbc_attach_and_unlock_inode
fs: add the ability for statmount() to report the fs_subtype
fs: don't let statmount return empty strings
fs:aio: Remove TODO comment suggesting hash or array usage in io_cancel()
hfsplus: don't query the device logical block size multiple times
freevxfs: Replace one-element array with flexible array member
fs: optimize acl_permission_check()
initramfs: avoid filename buffer overrun
fs/writeback: convert wbc_account_cgroup_owner to take a folio
acl: Annotate struct posix_acl with __counted_by()
acl: Realign struct posix_acl to save 8 bytes
epoll: Add synchronous wakeup support for ep_poll_callback
coredump: add cond_resched() to dump_user_range
mm/page-writeback.c: Fix comment of wb_domain_writeout_add()
mm/page-writeback.c: Update comment for BANDWIDTH_INTERVAL
...
Change the control flow of btrfs_encoded_read() so that it doesn't call
free_extent_map() when we know that this has already been done.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Mark Harmstone <maharmstone@fb.com>
Suggested-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Simplify tracking of the range processed by using cur_alloc_size only to
store the reserved part that may fail to the allocated extent. Remove
the ram_size as well since it is always equal to cur_alloc_size in the
context. Advance the start in normal path until extent allocation
succeeds and keep the start unchanged in the error handling path.
Passed the fstest generic/475 test for a hundred times with quota
enabled. And a modified generic/475 test by removing the sleep time
for a hundred times. About one tenth of the tests do enter the error
handling path due to fail to reserve extent.
Suggested-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Haisu Wang <haisuwang@tencent.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Remove conditional path allocation from btrfs_read_locked_inode(). Add
an ASSERT(path) to indicate it should never be called with a NULL path.
Call btrfs_read_locked_inode() directly from btrfs_iget(). This causes
code duplication between btrfs_iget() and btrfs_iget_path(), but I
think this is justifiable as it removes the need for conditionally
allocating the path inside of btrfs_read_locked_inode(). This makes the
code easier to reason about and makes it clear who has the
responsibility of allocating and freeing the path.
Signed-off-by: Leo Martins <loemra.dev@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Move btrfs_add_inode_to_root() so it can be called from
btrfs_read_locked_inode(), no changes were made to the function.
Move cleanup code from btrfs_iget_path() to btrfs_read_locked_inode.
This improves readability and improves a leaky abstraction. Previously
btrfs_iget_path() had to handle a positive error case as a result of a
call to btrfs_search_slot(), but it makes more sense to handle this
closer to the source of the call.
Signed-off-by: Leo Martins <loemra.dev@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add an io_uring command for encoded reads, using the same interface as
the existing BTRFS_IOC_ENCODED_READ ioctl.
btrfs_uring_encoded_read() is an io_uring version of
btrfs_ioctl_encoded_read(), which validates the user input and calls
btrfs_encoded_read() to read the appropriate metadata. If we determine
that we need to read an extent from disk, we call
btrfs_encoded_read_regular_fill_pages() through
btrfs_uring_read_extent() to prepare the bio.
The existing btrfs_encoded_read_regular_fill_pages() is changed so that
if it is passed a valid uring_ctx, rather than waking up any waiting
threads it calls btrfs_uring_read_extent_endio(). This in turn copies
the read data back to userspace, and calls io_uring_cmd_done() to
complete the io_uring command.
Because we're potentially doing a non-blocking read,
btrfs_uring_read_extent() doesn't clean up after itself if it returns
-EIOCBQUEUED. Instead, it allocates a priv struct, populates the fields
there that we will need to unlock the inode and free our allocations,
and defers this to the btrfs_uring_read_finished() that gets called when
the bio completes.
Signed-off-by: Mark Harmstone <maharmstone@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Change btrfs_encoded_read_regular_fill_pages() so that the priv struct
is allocated rather than stored on the stack, in preparation for adding
an asynchronous mode to the function.
Signed-off-by: Mark Harmstone <maharmstone@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Change btrfs_encoded_read() so that it returns -EAGAIN rather than sleeps
if IOCB_NOWAIT is set in iocb->ki_flags. The conditions that require
sleeping are: inode lock, writeback, extent lock, ordered range.
Signed-off-by: Mark Harmstone <maharmstone@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Change the behaviour of btrfs_encoded_read() so that if it needs to read
an extent from disk, it leaves the extent and inode locked and returns
-EIOCBQUEUED. The caller is then responsible for doing the I/O via
btrfs_encoded_read_regular() and unlocking the extent and inode.
Signed-off-by: Mark Harmstone <maharmstone@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
iocb->ki_pos isn't used after this function, so there's no point in
changing its value.
Signed-off-by: Mark Harmstone <maharmstone@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When fgp_flags and gfp_flags are zero, use filemap_get_folio(A, B)
instead of __filemap_get_folio(A, B, 0, 0)—no need for the extra
arguments 0, 0.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_do_encoded_write() was converted to use folios in 400b172b8c,
but we're still allocating based on sizeof(struct page *) rather than
sizeof(struct folio *). There's no functional change.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Mark Harmstone <maharmstone@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The file_offset parameter used to be passed to encoded read struct but
was removed in commit b665affe93 ("btrfs: remove unused members from
struct btrfs_encoded_read_private").
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We don't need offset for inline extents, they always start from 0.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We don't need the inode pointer to read inline extent, it's all
accessible from the path pointer.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function btrfs_set_range_writeback() was originally a callback for
metadata and data, to mark a range with writeback flag.
Then it was converted into a common function call for both metadata and
data.
From the very beginning, the function had been only called on a full page,
later converted to handle range inside a page.
But it never needed to handle multiple pages, and since commit
8189197425 ("btrfs: refactor __extent_writepage_io() to do
sector-by-sector submission") the function was only called on a
sector-by-sector basis.
This makes the function unnecessary, and can be converted to a simple
btrfs_folio_set_writeback() call instead.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we BUG_ON() in btrfs_finish_one_ordered() if we are finishing
an ordered extent that is flagged as NOCOW, but it's checksum list is
not empty.
This is clearly a logic error which we can recover from by aborting the
transaction.
For developer builds which enable CONFIG_BTRFS_ASSERT, also ASSERT()
that the list is empty.
Suggested-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Remove the duplicated transaction joining, block reserve setting and raid
extent inserting in btrfs_finish_ordered_extent().
While at it, also abort the transaction in case inserting a RAID
stripe-tree entry fails.
Suggested-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Previously for btrfs with sector size smaller than page size (subpage),
we only allow compression if the range is fully page aligned.
This is to work around the asynchronous submission of compressed range,
which delayed the page unlock and writeback into a workqueue,
furthermore asynchronous submission can lock multiple sector range
across page boundary.
Such asynchronous submission makes it very hard to co-operate with other
regular writes.
With the recent changes to the subpage folio unlock path, now
asynchronous submission of compressed pages can co-operate with regular
submission, so enable sector perfect compression if it's an experimental
build.
The ETA for moving this feature out of experimental is 6.15, and I hope
all remaining corner cases can be exposed before that.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For btrfs with sector size < page size (e.g. 4K sector size, 64K page
size), and enable the sector perfect compression support, then the
following dirty range can lead to problems:
0 32K 64K 96K 128K
| |///////||//////| |/|
124K
In above case, if we start writeback for that inode, the last dirty
range [124K, 128K) will not be submitted and cause reserved space
leakage:
- Start writeback for page 0
We find the range [32K, 96K) is suitable for compression, and queue it
into a workqueue to do the delayed compression and submission.
- Compression happens for range [32K, 96K)
Function extent_range_clear_dirty_for_io() is called, however it is
only doing full page handling, not considering any the extra bitmaps
for subpage cases.
That function will clear page dirty for both page 0 and page 64K.
- Writeback for the inode is done
Because page 64K has its dirty flag cleared, it will not be considered
as a writeback target.
This means the range [124K, 128K) will not be submitted, and reserved
space for it will be leaked.
Fix this problem by using the subpage helper to clear the dirty flag.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmct+40ACgkQxWXV+ddt
WDvCtRAAp0rheEu14hpVvWE2//+6u9Gx7Wfjzbj0+o4zBRWdg7BigFxfeb6JsH/E
2TjuWdcoP/OMV9ghCBQAQxySAPtsxH7skkyNy2UcMk5byBIrNvhw9auP5GXXlrhK
jSKDD4yfOMb++8LhrLevgTrijNyjLqaKXruw9a1Pmc3gxpdNmnMEySsQaF62o2Sm
YC3jwi0KpNAhu2qyJ6TnPgd5zf3BTM0JAeuB019IZW4WoeRTOdcPe7S7gqqJwZ+e
lL0D2/lfIE1lKvLE266Fab4FAQiJV07rozYj25XHiDpqThCxnJVOZCEHasOQ1PRy
d6j3RmGPqJYAYfQL1L+FH2hsS1BVZfVyCV1V7A/cN+lAffBfnROnf13C3gJ15Nbx
3lTyjBPQQw2WpfdmeyF3ikbrjZ8AfahChQO+mMnLN7oAWdIwWX5MRB+cwfWTxzA/
P8upz6HSTpSwy8nXdq264q1KkyCjx0Wv+8iyU7LirN2fCcEchA12HAIaOBeHedgh
PrGZDqrkZccQQxAvU5H7hQv0hZkGK8qba381oYHO09g72VM6ysuBU7tGrPZrlZYB
CvYTCwNZ/lqI8ikrcHOyUO1SPR9SaaWej1mWgBJ69ZIfg+ZuMtOMl171DU4S/i2V
iYgYoN8eCqTQWdaX5kk+3LWmK8fSU7F/KSDtJtT1KxkaSwCacfY=
=TQzP
-----END PGP SIGNATURE-----
Merge tag 'for-6.12-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few more one-liners that fix some user visible problems:
- use correct range when clearing qgroup reservations after COW
- properly reset freed delayed ref list head
- fix ro/rw subvolume mounts to be backward compatible with old and
new mount API"
* tag 'for-6.12-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix the length of reserved qgroup to free
btrfs: reinitialize delayed ref list after deleting it from the list
btrfs: fix per-subvolume RO/RW flags with new mount API
The dealloc flag may be cleared and the extent won't reach the disk in
cow_file_range when errors path. The reserved qgroup space is freed in
commit 30479f31d4 ("btrfs: fix qgroup reserve leaks in
cow_file_range"). However, the length of untouched region to free needs
to be adjusted with the correct remaining region size.
Fixes: 30479f31d4 ("btrfs: fix qgroup reserve leaks in cow_file_range")
CC: stable@vger.kernel.org # 6.11+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Haisu Wang <haisuwang@tencent.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Most of the callers of wbc_account_cgroup_owner() are converting a folio
to page before calling the function. wbc_account_cgroup_owner() is
converting the page back to a folio to call mem_cgroup_css_from_folio().
Convert wbc_account_cgroup_owner() to take a folio instead of a page,
and convert all callers to pass a folio directly except f2fs.
Convert the page to folio for all the callers from f2fs as they were the
only callers calling wbc_account_cgroup_owner() with a page. As f2fs is
already in the process of converting to folios, these call sites might
also soon be calling wbc_account_cgroup_owner() with a folio directly in
the future.
No functional changes. Only compile tested.
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Link: https://lore.kernel.org/r/20240926140121.203821-1-kernel@pankajraghav.com
Acked-by: David Sterba <dsterba@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmcZGqsACgkQxWXV+ddt
WDsZTg//dSLAAswT3dTAvWDt7rGxPhJC1V+gnzWdj/0Q3CUimNaw7zHp1QHtJDFT
S+3TVvJJ2JDvOnbi3u24s/bL5YdkWcvIyy87oVE4trJMbQPc/E45pkYFqF3TRQAH
wEYAVEnO9f9WUY+ekxX2XBzhmKp3xol93j9BBHDXOF9kHsDC5lI0D5YVYEhRY8Qu
D4GcqAhqUj5Hj6I6ppiqO47NCJBNzw1Se9QsgruPpmItRbB0/LYJFhUfevwosTKg
xVpkRVntFqQjkIdIdBfBv/ZGWfxJyM7K4M49QwLUqUQfxugu7BiGYuEjkBkiy07a
pZDEOF9s8wUZsxvRVohqKlhL0zHBF+/pAANowYuhKNW1sqKt4GVCdN3V34AbH8ST
JIbPvC2g1tUzIc2wckE8GO/NnsNR4r3k6iPB53MdCHrIo3jnENOeb9wF0GiVDb6s
OrhCa3ph2ps80YC1aCnc4Jr/yV2ONebSivnqvHCIUQEZfpjtAc07atm0i946/7nx
eiBE+9zSVJZB0LoooIVz2I3HX3lRnm3Wwi4nj8U/sNL/IbGaHNCurIETR23NivYP
yWVql2njwE+yMc8q9YZXs4MBdKsSGP6eGJW3ZwKi6ru2PJdf5eIib+ffvR4bLqXD
UUDfMyC1esGsQB24sc8wppk97wmmMrdqQUj9WnmNlTP8FUFggvQ=
=sYxX
-----END PGP SIGNATURE-----
Merge tag 'for-6.12-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- mount option fixes:
- fix handling of compression mount options on remount
- reject rw remount in case there are options that don't work
in read-write mode (like rescue options)
- fix zone accounting of unusable space
- fix in-memory corruption when merging extent maps
- fix delalloc range locking for sector < page
- use more convenient default value of drop subtree threshold, clean
more subvolumes without the fallback to marking quotas inconsistent
- fix smatch warning about incorrect value passed to ERR_PTR
* tag 'for-6.12-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix passing 0 to ERR_PTR in btrfs_search_dir_index_item()
btrfs: reject ro->rw reconfiguration if there are hard ro requirements
btrfs: fix read corruption due to race with extent map merging
btrfs: fix the delalloc range locking if sector size < page size
btrfs: qgroup: set a more sane default value for subtree drop threshold
btrfs: clear force-compress on remount when compress mount option is given
btrfs: zoned: fix zone unusable accounting for freed reserved extent
The ret may be zero in btrfs_search_dir_index_item() and should not
passed to ERR_PTR(). Now btrfs_unlink_subvol() is the only caller to
this, reconstructed it to check ERR_PTR(-ENOENT) while ret >= 0.
This fixes smatch warnings:
fs/btrfs/dir-item.c:353
btrfs_search_dir_index_item() warn: passing zero to 'ERR_PTR'
Fixes: 9dcbe16fcc ("btrfs: use btrfs_for_each_slot in btrfs_search_dir_index_item")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmb/9agACgkQxWXV+ddt
WDtKFA/5AW88osk/1k/NVhOvOa0xPr5XyDLq1n8Gaxfy8uHlHAc8wdsvJzCDMS0M
qUOD/tOPhRI0HGXPiKD767erwbyXiZAcCTkSd8x5jlXy1hVjUHQSKO//JxD0vtAZ
jOscoUA1wJJutopCXcppnoUUFE2753edEg0w2EtUXMpfqivqOmMCR+1ZtKkfaNJo
oRuZCq3Oi8hu7Wsvmh4Etq/9MvGM+xovXAMAji6Op8nsP1jJlzWztpEUogLOQH2S
IhDFFxP9shBV9JjV+HSyXcAYr8VArH6HtYjapR9oajCH0pvSjLYQQPq/qQ0//8Hb
SHr4YP6RBbpEIvvaQeA7vwsckBHNBOrSAbEgRQ2+zdmiwha6SRIEVyXYh5LUwcYT
WnVozbk7ZX9rD1jOVhvgouFG6vUz8A6/qt3BD028bVcyMvBXW4gsEduCMVlFGmlN
D6+hNY6J08j4HUEGnPk7fAYi/lk5OEK1p5yarUgsOQ3GqWWS0ywkrfbmbWwyC+Ff
AxggFTl9YodU5RMs7EU2GeHkLU6LnXgevk6FFm0JzsLtT/BbEP7pj6tOot4Msl1e
2ovqFiSbuPNg5Wr70ZBRO9LDIAYtTZy1UrVR/YCSLzm+wZsXMelDIHMQfE7+Yodp
O5Cud23AanRTwjErvNl3X4rhut8rrI1FsR89gDyKK1EPG+mwofo=
=yslr
-----END PGP SIGNATURE-----
Merge tag 'for-6.12-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- in incremental send, fix invalid clone operation for file that got
its size decreased
- fix __counted_by() annotation of send path cache entries, we do not
store the terminating NUL
- fix a longstanding bug in relocation (and quite hard to hit by
chance), drop back reference cache that can get out of sync after
transaction commit
- wait for fixup worker kthread before finishing umount
- add missing raid-stripe-tree extent for NOCOW files, zoned mode
cannot have NOCOW files but RST is meant to be a standalone feature
- handle transaction start error during relocation, avoid potential
NULL pointer dereference of relocation control structure (reported by
syzbot)
- disable module-wide rate limiting of debug level messages
- minor fix to tracepoint definition (reported by checkpatch.pl)
* tag 'for-6.12-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: disable rate limiting when debug enabled
btrfs: wait for fixup workers before stopping cleaner kthread during umount
btrfs: fix a NULL pointer dereference when failed to start a new trasacntion
btrfs: send: fix invalid clone operation for file that got its size decreased
btrfs: tracepoints: end assignment with semicolon at btrfs_qgroup_extent event class
btrfs: drop the backref cache during relocation if we commit
btrfs: also add stripe entries for NOCOW writes
btrfs: send: fix buffer overflow detection when copying path to cache entry
We are close to removing the private_2 flag, so switch btrfs to using
owner_2 for its ordered flag. This is mostly used by buffer head
filesystems, so btrfs can use it because it doesn't use buffer heads.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/20241002040111.1023018-5-willy@infradead.org
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
asm/unaligned.h is always an include of asm-generic/unaligned.h;
might as well move that thing to linux/unaligned.h and include
that - there's nothing arch-specific in that header.
auto-generated by the following:
for i in `git grep -l -w asm/unaligned.h`; do
sed -i -e "s/asm\/unaligned.h/linux\/unaligned.h/" $i
done
for i in `git grep -l -w asm-generic/unaligned.h`; do
sed -i -e "s/asm-generic\/unaligned.h/linux\/unaligned.h/" $i
done
git mv include/asm-generic/unaligned.h include/linux/unaligned.h
git mv tools/include/asm-generic/unaligned.h tools/include/linux/unaligned.h
sed -i -e "/unaligned.h/d" include/asm-generic/Kbuild
sed -i -e "s/__ASM_GENERIC/__LINUX/" include/linux/unaligned.h tools/include/linux/unaligned.h
NOCOW writes do not generate stripe_extent entries in the RAID stripe
tree, as the RAID stripe-tree feature initially was designed with a
zoned filesystem in mind and on a zoned filesystem, we do not allow NOCOW
writes. But the RAID stripe-tree feature is independent from the zoned
feature, so we must also do NOCOW writes for RAID stripe-tree filesystems.
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversion between page
and folio. Based on the previous patch, the compression path can be
directly used in folio without converting to page.
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversion between page
and folio. And page_to_inode() can be replaced with folio_to_inode() now.
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversion between page
and folio. Now clear_page_extent_mapped() can deal with a folio
directly, so change its name to clear_folio_extent_mapped().
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's only one caller inode_should_defrag() that passes NULL to
btrfs_add_inode_defrag() so we can drop it an simplify the code.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function does not follow the pattern where the underscores would be
justified, so rename it.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function name is a bit misleading as it submits the btrfs_bio
(bbio), rename it so we can use btrfs_submit_bio() when an actual bio is
submitted.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add a comment to document the complicated locked_page unlock logic in
cow_file_range_inline. The specifically tricky part is that a caller
just up the stack converts ret == 0 to ret == 1 and then another
caller far up the callstack handles ret == 1 as a success, AND returns
without cleanup in that case, both of which "feel" unnatural and led to
the original bug.
Try to document that somewhat specific callstack logic here to explain
the weird un-setting of locked_folio on success.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of getting a page and using that to clear dirty for io, use the
folio helper and use the appropriate folio functions.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We only use a page to copy in the data for the inline extent. Use a
folio for this instead.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We already use a lot of functions here that use folios, update the
function to use __filemap_get_folio instead of find_get_page and then
use the folio directly.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently this already uses a folio for most things, update it to take a
folio and update all the page usage with the corresponding folio usage.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We only pass this into read_inline_extent, change it to take a folio and
update the callers.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of using a page, use a folio instead, take a folio as an
argument, and update the callers appropriately.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Update uncompress_inline to take a folio and update it's usage
accordingly.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now the fixup creator and consumer use folios, change this to use a
folio as well.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of a page, use a folio for btrfs_writepage_cow_fixup. We
already have a folio at the only caller, and the fixup worker uses
folios.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function heavily messes with pages, instead update it to use a
folio.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This mostly uses folios already, update it to take a folio and update
the rest of the function to use the folio instead of the page.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of passing in the page for ->locked_page, make it hold a
locked_folio and then update the users of async_chunk to act
accordingly.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that every function that btrfs_run_delalloc_range calls takes a
folio, update it to take a folio and update the callers.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This just passes the page into the compressed machinery to keep track of
the locked page. Update this to take a folio and convert it to a page
where appropriate.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that btrfs_cleanup_ordered_extents is operating mostly with folios,
update it to use a folio instead of a page, and the update the function
and the callers as appropriate.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We walk through pages in this function and clear ordered, and the
function for this uses folios. Update the function to use a folio for
this whole operation.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now all of the functions that use locked_page in run_delalloc_nocow take
a folio, update it to take a folio and update the caller.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With this we can pass the folio directly into cow_file_range().
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Convert this to take a folio and pass it into all of the various cleanup
functions. Update the callers to pass in a folio instead.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we want the folio in this function, convert it to take a folio
directly and use that.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We pass the folio into extent_write_locked_range, go ahead and take a
folio to pass along, and update the callers to pass in a folio.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This mostly uses folios, convert it to take a folio instead and update
the callers to pass in the folio.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of taking the locked page, take the locked folio so we can pass
that into __process_folios_contig.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We only need a folio now, make it take a folio as an argument and update
all of the callers.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Unlink changes the link count on the target inode. POSIX mandates that
the ctime must also change when this occurs.
According to https://pubs.opengroup.org/onlinepubs/9699919799/functions/unlink.html:
"Upon successful completion, unlink() shall mark for update the last data
modification and last file status change timestamps of the parent
directory. Also, if the file's link count is not 0, the last file status
change timestamp of the file shall be marked for update."
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add link to the opengroup docs ]
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmazbGYACgkQxWXV+ddt
WDsQAw//Z3XjjylTZPuHNk/AiMe5oochxB5T9ZracQOzG0o70gj1w/UQIZBkSzp1
66g8I4YdbwvEXKDg9Oi/GPDSON3GuhAiLXp+0Y/reeD/totgvrhROuJ3mIk5CZ0H
B4fIKH3xCKLQan26Opgju4qjum7+AR7ekFveM6GicxXXb3eAYALgoEFt63eZjZVu
7myak78gmBuK5QdGHH+onEhn+HfC57UTGBqu1bJsSOQC7dANkU+WzmgbH6FeOHqx
2T/lN/tu2tBBoF4zMvC472Zjmj4PnubNQnwwv0oJ8Z2Y0yIY95joZV0uXIXO72oz
BMQC6s0cltiTn1Tfe4iIWn+ZNjcfGAZO7aoD5NcJb2F/Mz5mZuMhtq3BND0wJ8/+
SuYk7PxX7tcOaFrDAn3Ne7XHsD7r5lLkFICXkzcNG4dqkBUxR3dN4Oi8KKMFrgCP
kTpf0/lAkYKYSrU86Yn1zhwRLaH8jm3fulVUY8i/p0tJnpsW8AmeME1Sk97Bhxbb
y6zt8+MPscPuQi3jPsBevaYd8Q8BIT34vVtU/jNmGAyqEv1wVYQRRK2Ma4sJJfNk
EEQV9i4VqXWLc17dcWVZrG6lEsVRIIBE/2Adhth6Myq73bkunpB9ZNNNZApf1T2j
Wn5gPzkuQd6FWrXe8V7oSN9rT1OPn9uHL6BmSUWhHUCuFeATgoc=
=03Gf
-----END PGP SIGNATURE-----
Merge tag 'for-6.11-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix double inode unlock for direct IO sync writes (reported by
syzbot)
- fix root tree id/name map definitions, don't use fixed size buffers
for name (reported by -Werror=unterminated-string-initialization)
- fix qgroup reserve leaks in bufferd write path
- update scrub status structure more often so it can be reported in
user space more accurately and let 'resume' not repeat work
- in preparation to remove space cache v1 in the future print a warning
if it's detected
* tag 'for-6.11-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: avoid using fixed char array size for tree names
btrfs: fix double inode unlock for direct IO sync writes
btrfs: emit a warning about space cache v1 being deprecated
btrfs: fix qgroup reserve leaks in cow_file_range
btrfs: implement launder_folio for clearing dirty page reserve
btrfs: scrub: update last_physical after scrubbing one stripe
btrfs: factor out stripe length calculation into a helper
In the buffered write path, the dirty page owns the qgroup reserve until
it creates an ordered_extent.
Therefore, any errors that occur before the ordered_extent is created
must free that reservation, or else the space is leaked. The fstest
generic/475 exercises various IO error paths, and is able to trigger
errors in cow_file_range where we fail to get to allocating the ordered
extent. Note that because we *do* clear delalloc, we are likely to
remove the inode from the delalloc list, so the inodes/pages to not have
invalidate/launder called on them in the commit abort path.
This results in failures at the unmount stage of the test that look like:
BTRFS: error (device dm-8 state EA) in cleanup_transaction:2018: errno=-5 IO failure
BTRFS: error (device dm-8 state EA) in btrfs_replace_file_extents:2416: errno=-5 IO failure
BTRFS warning (device dm-8 state EA): qgroup 0/5 has unreleased space, type 0 rsv 28672
------------[ cut here ]------------
WARNING: CPU: 3 PID: 22588 at fs/btrfs/disk-io.c:4333 close_ctree+0x222/0x4d0 [btrfs]
Modules linked in: btrfs blake2b_generic libcrc32c xor zstd_compress raid6_pq
CPU: 3 PID: 22588 Comm: umount Kdump: loaded Tainted: G W 6.10.0-rc7-gab56fde445b8 #21
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
RIP: 0010:close_ctree+0x222/0x4d0 [btrfs]
RSP: 0018:ffffb4465283be00 EFLAGS: 00010202
RAX: 0000000000000001 RBX: ffffa1a1818e1000 RCX: 0000000000000001
RDX: 0000000000000000 RSI: ffffb4465283bbe0 RDI: ffffa1a19374fcb8
RBP: ffffa1a1818e13c0 R08: 0000000100028b16 R09: 0000000000000000
R10: 0000000000000003 R11: 0000000000000003 R12: ffffa1a18ad7972c
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
FS: 00007f9168312b80(0000) GS:ffffa1a4afcc0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f91683c9140 CR3: 000000010acaa000 CR4: 00000000000006f0
Call Trace:
<TASK>
? close_ctree+0x222/0x4d0 [btrfs]
? __warn.cold+0x8e/0xea
? close_ctree+0x222/0x4d0 [btrfs]
? report_bug+0xff/0x140
? handle_bug+0x3b/0x70
? exc_invalid_op+0x17/0x70
? asm_exc_invalid_op+0x1a/0x20
? close_ctree+0x222/0x4d0 [btrfs]
generic_shutdown_super+0x70/0x160
kill_anon_super+0x11/0x40
btrfs_kill_super+0x11/0x20 [btrfs]
deactivate_locked_super+0x2e/0xa0
cleanup_mnt+0xb5/0x150
task_work_run+0x57/0x80
syscall_exit_to_user_mode+0x121/0x130
do_syscall_64+0xab/0x1a0
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f916847a887
---[ end trace 0000000000000000 ]---
BTRFS error (device dm-8 state EA): qgroup reserved space leaked
Cases 2 and 3 in the out_reserve path both pertain to this type of leak
and must free the reserved qgroup data. Because it is already an error
path, I opted not to handle the possible errors in
btrfs_free_qgroup_data.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
In the buffered write path, dirty pages can be said to "own" the qgroup
reservation until they create an ordered_extent. It is possible for
there to be outstanding dirty pages when a transaction is aborted, in
which case there is no cancellation path for freeing this reservation
and it is leaked.
We do already walk the list of outstanding delalloc inodes in
btrfs_destroy_delalloc_inodes() and call invalidate_inode_pages2() on them.
This does *not* call btrfs_invalidate_folio(), as one might guess, but
rather calls launder_folio() and release_folio(). Since this is a
reservation associated with dirty pages only, rather than something
associated with the private bit (ordered_extent is cancelled separately
already in the cleanup transaction path), implementing this release
should be done via launder_folio.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmapmOQACgkQxWXV+ddt
WDsXVhAAi4X+xt3o4jcN3IAu08JCQAAyXnFWC3lvn7sqYjSrcccI6ZT4/gAbHss+
qrifakRGoYQ7fAjYBmhw48HqPmHtI2OQjIUDaIqHQOS68aXShBo9HiE460HRY4GT
QV/KT0w37E2/R0EDR9gyjLq3ZA3/raxN1n+LNCFhRWmtsAEZrk4XzsADWb05YkIq
1QBa92DzEhVpd04X8YHIYBgRidWbcYST6xhoWdyL9VZ1pzZsISq5LH67D4f/J1KU
gXNf+ZnF9DXsQnptJrMsjhx61seJ2F0/vozFZ+l6SjRr0jeysmrJI0dxqQc/hUga
gbLmdha6ztKdn03JOIL+lfdZYzICFl/2fekSWI2SNcag+TYszACjlFOyHusOgKsa
3qQwzVB699FheWO5nrOOvOtgq0ZqGsrIvhIXLhA7/bVpNavPnUB7IQCcs8n89ImQ
hUIebfX1FZnYXTrB6Hhm92LUb0lyLSlW1we3SSmaAMiy1TiXHG7hO2G/sIbOPAJC
5VzdHf0DEjzEdjmTrGOV7JBfy5JmMK56oN8viZS95p70DYxNGvEOhLs/8n5twpri
MWV8GElcOjjC+KnGnUH72spsnEKONpdzyccG9kiZEgkEi4csgHSxrkSmAehYD6i6
MFYk+i7jvZ1VsbOulmdGOLbHS7whxi9pWb/CT3KKF1Ei5/v07bU=
=JdOX
-----END PGP SIGNATURE-----
Merge tag 'for-6.11-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix regression in extent map rework when handling insertion of
overlapping compressed extent
- fix unexpected file length when appending to a file using direct io
and buffer not faulted in
- in zoned mode, fix accounting of unusable space when flipping
read-only block group back to read-write
- fix page locking when COWing an inline range, assertion failure found
by syzbot
- fix calculation of space info in debugging print
- tree-checker, add validation of data reference item
- fix a few -Wmaybe-uninitialized build warnings
* tag 'for-6.11-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: initialize location to fix -Wmaybe-uninitialized in btrfs_lookup_dentry()
btrfs: fix corruption after buffer fault in during direct IO append write
btrfs: zoned: fix zone_unusable accounting on making block group read-write again
btrfs: do not subtract delalloc from avail bytes
btrfs: make cow_file_range_inline() honor locked_page on error
btrfs: fix corrupt read due to bad offset of a compressed extent map
btrfs: tree-checker: validate dref root and objectid
Some arch + compiler combinations report a potentially unused variable
location in btrfs_lookup_dentry(). This is a false alert as the variable
is passed by value and always valid or there's an error. The compilers
cannot probably reason about that although btrfs_inode_by_name() is in
the same file.
> + /kisskb/src/fs/btrfs/inode.c: error: 'location.objectid' may be used
+uninitialized in this function [-Werror=maybe-uninitialized]: => 5603:9
> + /kisskb/src/fs/btrfs/inode.c: error: 'location.type' may be used
+uninitialized in this function [-Werror=maybe-uninitialized]: => 5674:5
m68k-gcc8/m68k-allmodconfig
mips-gcc8/mips-allmodconfig
powerpc-gcc5/powerpc-all{mod,yes}config
powerpc-gcc5/ppc64_defconfig
Initialize it to zero, this should fix the warnings and won't change the
behaviour as btrfs_inode_by_name() accepts only a root or inode item
types, otherwise returns an error.
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Link: https://lore.kernel.org/linux-btrfs/bd4e9928-17b3-9257-8ba7-6b7f9bbb639a@linux-m68k.org/
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The btrfs buffered write path runs through __extent_writepage() which
has some tricky return value handling for writepage_delalloc().
Specifically, when that returns 1, we exit, but for other return values
we continue and end up calling btrfs_folio_end_all_writers(). If the
folio has been unlocked (note that we check the PageLocked bit at the
start of __extent_writepage()), this results in an assert panic like
this one from syzbot:
BTRFS: error (device loop0 state EAL) in free_log_tree:3267: errno=-5 IO failure
BTRFS warning (device loop0 state EAL): Skipping commit of aborted transaction.
BTRFS: error (device loop0 state EAL) in cleanup_transaction:2018: errno=-5 IO failure
assertion failed: folio_test_locked(folio), in fs/btrfs/subpage.c:871
------------[ cut here ]------------
kernel BUG at fs/btrfs/subpage.c:871!
Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI
CPU: 1 PID: 5090 Comm: syz-executor225 Not tainted
6.10.0-syzkaller-05505-gb1bc554e009e #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 06/27/2024
RIP: 0010:btrfs_folio_end_all_writers+0x55b/0x610 fs/btrfs/subpage.c:871
Code: e9 d3 fb ff ff e8 25 22 c2 fd 48 c7 c7 c0 3c 0e 8c 48 c7 c6 80 3d
0e 8c 48 c7 c2 60 3c 0e 8c b9 67 03 00 00 e8 66 47 ad 07 90 <0f> 0b e8
6e 45 b0 07 4c 89 ff be 08 00 00 00 e8 21 12 25 fe 4c 89
RSP: 0018:ffffc900033d72e0 EFLAGS: 00010246
RAX: 0000000000000045 RBX: 00fff0000000402c RCX: 663b7a08c50a0a00
RDX: 0000000000000000 RSI: 0000000080000000 RDI: 0000000000000000
RBP: ffffc900033d73b0 R08: ffffffff8176b98c R09: 1ffff9200067adfc
R10: dffffc0000000000 R11: fffff5200067adfd R12: 0000000000000001
R13: dffffc0000000000 R14: 0000000000000000 R15: ffffea0001cbee80
FS: 0000000000000000(0000) GS:ffff8880b9500000(0000)
knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f5f076012f8 CR3: 000000000e134000 CR4: 00000000003506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
__extent_writepage fs/btrfs/extent_io.c:1597 [inline]
extent_write_cache_pages fs/btrfs/extent_io.c:2251 [inline]
btrfs_writepages+0x14d7/0x2760 fs/btrfs/extent_io.c:2373
do_writepages+0x359/0x870 mm/page-writeback.c:2656
filemap_fdatawrite_wbc+0x125/0x180 mm/filemap.c:397
__filemap_fdatawrite_range mm/filemap.c:430 [inline]
__filemap_fdatawrite mm/filemap.c:436 [inline]
filemap_flush+0xdf/0x130 mm/filemap.c:463
btrfs_release_file+0x117/0x130 fs/btrfs/file.c:1547
__fput+0x24a/0x8a0 fs/file_table.c:422
task_work_run+0x24f/0x310 kernel/task_work.c:222
exit_task_work include/linux/task_work.h:40 [inline]
do_exit+0xa2f/0x27f0 kernel/exit.c:877
do_group_exit+0x207/0x2c0 kernel/exit.c:1026
__do_sys_exit_group kernel/exit.c:1037 [inline]
__se_sys_exit_group kernel/exit.c:1035 [inline]
__x64_sys_exit_group+0x3f/0x40 kernel/exit.c:1035
x64_sys_call+0x2634/0x2640
arch/x86/include/generated/asm/syscalls_64.h:232
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f5f075b70c9
Code: Unable to access opcode bytes at
0x7f5f075b709f.
I was hitting the same issue by doing hundreds of accelerated runs of
generic/475, which also hits IO errors by design.
I instrumented that reproducer with bpftrace and found that the
undesirable folio_unlock was coming from the following callstack:
folio_unlock+5
__process_pages_contig+475
cow_file_range_inline.constprop.0+230
cow_file_range+803
btrfs_run_delalloc_range+566
writepage_delalloc+332
__extent_writepage # inlined in my stacktrace, but I added it here
extent_write_cache_pages+622
Looking at the bisected-to patch in the syzbot report, Josef realized
that the logic of the cow_file_range_inline error path subtly changing.
In the past, on error, it jumped to out_unlock in cow_file_range(),
which honors the locked_page, so when we ultimately call
folio_end_all_writers(), the folio of interest is still locked. After
the change, we always unlocked ignoring the locked_page, on both success
and error. On the success path, this all results in returning 1 to
__extent_writepage(), which skips the folio_end_all_writers() call,
which makes it OK to have unlocked.
Fix the bug by wiring the locked_page into cow_file_range_inline() and
only setting locked_page to NULL on success.
Reported-by: syzbot+a14d8ac9af3a2a4fd0c8@syzkaller.appspotmail.com
Fixes: 0586d0a89e ("btrfs: move extent bit and page cleanup into cow_file_range_inline")
CC: stable@vger.kernel.org # 6.10+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmaVN3MACgkQxWXV+ddt
WDtpIRAAl+1NjsEj8e5V/UYn8Jr06ujTOnrkR3PCTICxDHbUaMLkQEw21H0K/ogQ
3fOiEVpSlZOfKdYXtXaMQbC0jd/Af2eA10Uht96nAEjAtxu1uJ4cFZGu2meNdXZP
xUioivJ/CElMPH2aluG6FaQvUTqmhrEr8tSoYbxzQmUd434q9kqqyjtw1tfzYDG1
VDn2f7ykhpB/8P0aoqgWSshWTmaCzG0GkuI28o1o0iZUIF/P9TKdzxlLRW6BVHE7
T2oGLEQjN1GQbCH75L4IeNJDkCBVfcDcbZkUDJ/ae4Pt/jJQTFY53YIP9wXFZQnd
mdfHmK7Atpsk75ATftYSq+ENkbQ5fsuut5CD63u54gAqA4M1FncDXTAWS1Y30F76
P8juSCmsSy0o3gTflDIo/IMdntoh/JmncwwStF6oKzmyUZZzzarsqM8mc1P03ZNt
3ttlnbY7lC1TDAlD5J2wXE0INCT2pN+4C9IToWdRypeuLu6qrI7cQ0oylyp9OVQM
t9umTXm0B6s1cyqEDjJf0xJZS/JTHYwu7S4EmAJwicgiLpOjABVTmO8021rVmDJy
TAUu6yEhSsrTT6Dxm7/2Et1EEOKFF5hhsG1SiGD9oUIZK6B5+0waT+rbkEWl7osR
4/TAv2zX6tuCc7HIW0fQloM/6/Gyd5wcDVaQNDUzFA075uKstwY=
=k5d3
-----END PGP SIGNATURE-----
Merge tag 'for-6.11-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"The highlights are new logic behind background block group reclaim,
automatic removal of qgroup after removing a subvolume and new
'rescue=' mount options.
The rest is optimizations, cleanups and refactoring.
User visible features:
- dynamic block group reclaim:
- tunable framework to avoid situations where eager data
allocations prevent creating new metadata chunks due to lack of
unallocated space
- reuse sysfs knob bg_reclaim_threshold (otherwise used only in
zoned mode) for a fixed value threshold
- new on/off sysfs knob "dynamic_reclaim" calculating the value
based on heuristics, aiming to keep spare working space for
relocating chunks but not to needlessly relocate partially
utilized block groups or reclaim newly allocated ones
- stats are exported in sysfs per block group type, files
"reclaim_*"
- this may increase IO load at unexpected times but the corner
case of no allocatable block groups is known to be worse
- automatically remove qgroup of deleted subvolumes:
- adjust qgroup removal conditions, make sure all related
subvolume data are already removed, or return EBUSY, also take
into account setting of sysfs drop_subtree_threshold
- also works in squota mode
- mount option updates: new modes of 'rescue=' that allow to mount
images (read-only) that could have been partially converted by user
space tools
- ignoremetacsums - invalid metadata checksums are ignored
- ignoresuperflags - super block flags that track conversion in
progress (like UUID or checksums)
Core:
- size of struct btrfs_inode is now below 1024 (on a release config),
improved memory packing and other secondary effects
- switch tracking of open inodes from rb-tree to xarray, minor
performance improvement
- reduce number of empty transaction commits when there are no dirty
data/metadata
- memory allocation optimizations (reduced numbers, reordering out of
critical sections)
- extent map structure optimizations and refactoring, more sanity
checks
- more subpage in zoned mode preparations or fixes
- general snapshot code cleanups, improvements and documentation
- tree-checker updates: more file extent ram_bytes fixes, continued
- raid-stripe-tree update (not backward compatible):
- remove extent encoding field from the structure, can be inferred
from other information
- requires btrfs-progs 6.9.1 or newer
- cleanups and refactoring
- error message updates
- error handling improvements
- return type and parameter cleanups and improvements"
* tag 'for-6.11-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (152 commits)
btrfs: fix extent map use-after-free when adding pages to compressed bio
btrfs: fix bitmap leak when loading free space cache on duplicate entry
btrfs: remove the BUG_ON() inside extent_range_clear_dirty_for_io()
btrfs: move extent_range_clear_dirty_for_io() into inode.c
btrfs: enhance compression error messages
btrfs: fix data race when accessing the last_trans field of a root
btrfs: rename the extra_gfp parameter of btrfs_alloc_page_array()
btrfs: remove the extra_gfp parameter from btrfs_alloc_folio_array()
btrfs: introduce new "rescue=ignoresuperflags" mount option
btrfs: introduce new "rescue=ignoremetacsums" mount option
btrfs: output the unrecognized super block flags as hex
btrfs: remove unused Opt enums
btrfs: tree-checker: add extra ram_bytes and disk_num_bytes check
btrfs: fix the ram_bytes assignment for truncated ordered extents
btrfs: make validate_extent_map() catch ram_bytes mismatch
btrfs: ignore incorrect btrfs_file_extent_item::ram_bytes
btrfs: cleanup the bytenr usage inside btrfs_extent_item_to_extent_map()
btrfs: fix typo in error message in btrfs_validate_super()
btrfs: move the direct IO code into its own file
btrfs: pass a btrfs_inode to btrfs_set_prop()
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZpEG2wAKCRCRxhvAZXjc
ooW/AQDzyY+xNGt4OPMvlyFUHd5RcyiLsMhYrkKc3FaIFjesVgD+PFW5PPW12c0V
Z4VHg9w1HDDuUn4XvELs7OXZpek7RgU=
=eDC8
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.11.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs inode / dentry updates from Christian Brauner:
"This contains smaller performance improvements to inodes and dentries:
inode:
- Add rcu based inode lookup variants.
They avoid one inode hash lock acquire in the common case thereby
significantly reducing contention. We already support RCU-based
operations but didn't take advantage of them during inode
insertion.
Callers of iget_locked() get the improvement without any code
changes. Callers that need a custom callback can switch to
iget5_locked_rcu() as e.g., did btrfs.
With 20 threads each walking a dedicated 1000 dirs * 1000 files
directory tree to stat(2) on a 32 core + 24GB ram vm:
before: 3.54s user 892.30s system 1966% cpu 45.549 total
after: 3.28s user 738.66s system 1955% cpu 37.932 total (-16.7%)
Long-term we should pick up the effort to introduce more
fine-grained locking and possibly improve on the currently used
hash implementation.
- Start zeroing i_state in inode_init_always() instead of doing it in
individual filesystems.
This allows us to remove an unneeded lock acquire in new_inode()
and not burden individual filesystems with this.
dcache:
- Move d_lockref out of the area used by RCU lookup to avoid
cacheline ping poing because the embedded name is sharing a
cacheline with d_lockref.
- Fix dentry size on 32bit with CONFIG_SMP=y so it does actually end
up with 128 bytes in total"
* tag 'vfs-6.11.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: fix dentry size
vfs: move d_lockref out of the area used by RCU lookup
bcachefs: remove now spurious i_state initialization
xfs: remove now spurious i_state initialization in xfs_inode_alloc
vfs: partially sanitize i_state zeroing on inode creation
xfs: preserve i_state around inode_init_always in xfs_reinit_inode
btrfs: use iget5_locked_rcu
vfs: add rcu-based find_inode variants for iget ops
Previously we had a BUG_ON() inside extent_range_clear_dirty_for_io(), as
we expected all involved folios to be still locked, thus no folio should be
missing.
However for extent_range_clear_dirty_for_io() itself, we can skip the
missing folio and handle the remaining ones, and return an error if
there is anything wrong.
Remove the BUG_ON() and let the caller to handle the error.
In the caller we do not have a quick way to cleanup the error, but all
the compression routines would handle the missing folio as an error and
properly error out, so we only need to do an ASSERT() for developers,
while for non-debug build the compression routine would handle the
error correctly.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function is only used inside inode.c by compress_file_range(),
so move it to inode.c and unexport it.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is only one caller utilizing the @extra_gfp parameter,
alloc_eb_folio_array(). And in that case the extra_gfp is only assigned
to __GFP_NOFAIL.
Rename the @extra_gfp parameter to @nofail to indicate that.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[HICCUP]
After adding extra checks on btrfs_file_extent_item::ram_bytes to
tree-checker, running fsstress leads to tree-checker warning at write time,
as we created file extent items with an invalid ram_bytes.
All those offending file extents have offset 0, and ram_bytes matching
num_bytes, and smaller than disk_num_bytes.
This would also trigger the recently enhanced btrfs-check, which catches
such mismatches and report them as minor errors.
[CAUSE]
When a folio/page is invalidated and it is part of a submitted OE, we
mark the OE truncated just to the beginning of the folio/page.
And for truncated OE, we insert the file extent item with incorrect
value for ram_bytes (using num_bytes instead of the usual value).
This is not a big deal for end users, as we do not utilize the ram_bytes
field for regular non-compressed extents.
This mismatch is just a small violation against on-disk format.
[FIX]
Fix it by removing the override on btrfs_file_extent_item::ram_bytes.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The direct IO code is over a thousand lines and it's currently spread
between file.c and inode.c, which makes it not easy to locate some parts
of it sometimes. Also inode.c is about 11 thousand lines and file.c about
4 thousand lines, both too big. So move all the direct IO code into a
dedicated file, so that it's easy to locate all its code and reduce the
sizes of inode.c and file.c.
This is a pure move of code without any other changes except export a
a couple functions from inode.c (get_extent_allocation_hint() and
create_io_em()) because they are used in inode.c and the new direct-io.c
file, and a couple functions from file.c (btrfs_buffered_write() and
btrfs_write_check()) because they are used both in file.c and in the new
direct-io.c file.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pass a struct btrfs_inode to btrfs_compress_heuristic() as it's an
internal interface, allowing to remove some use of BTRFS_I.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
The structure is internal so we should use struct btrfs_inode for that,
allowing to remove some use of BTRFS_I.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
Pass a struct btrfs_inode to btrfs_readdir_get_delayed_items() as it's
an internal interface, allowing to remove some use of BTRFS_I.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
Pass a struct btrfs_inode to btrfs_readdir_put_delayed_items() as it's
an internal interface, allowing to remove some use of BTRFS_I.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
It's pointless to pass a super block argument to btrfs_iget_locked()
because we always pass a root and from it we can get the super block
through:
root->fs_info->sb
So remove the super block argument.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's pointless to pass a super block argument to btrfs_iget_path() because
we always pass a root and from it we can get the super block through:
root->fs_info->sb
So remove the super block argument.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's pointless to pass a super block argument to btrfs_iget() because we
always pass a root and from it we can get the super block through:
root->fs_info->sb
So remove the super block argument.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When creating and deleting a subvolume, after starting a transaction we
are explicitly calling btrfs_record_root_in_trans() for the root which we
passed to btrfs_start_transaction(). This is pointless because at
transaction.c:start_transaction() we end up doing that call, regardless
of whether we actually start a new transaction or join an existing one,
and if we were not it would mean the root item of that root would not
be updated in the root tree when committing the transaction, leading to
problems easy to spot with fstests for example.
Remove these redundant calls. They were introduced with commit
74e9795812 ("btrfs: qgroup: fix qgroup prealloc rsv leak in subvolume
operations").
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The following 3 parameters can be cleaned up using btrfs_file_extent
structure:
- len
btrfs_file_extent::num_bytes
- orig_block_len
btrfs_file_extent::disk_num_bytes
- ram_bytes
btrfs_file_extent::ram_bytes
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Most parameters of create_io_em() can be replaced by the members with
the same name inside btrfs_file_extent.
Do a direct parameters cleanup here.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All parameters after @filepos of btrfs_alloc_ordered_extent() can be
replaced with btrfs_file_extent structure.
This patch does the cleanup, meanwhile some points to note:
- Move btrfs_file_extent structure to ordered-data.h
The structure is needed by both btrfs_alloc_ordered_extent() and
can_nocow_extent(), but since btrfs_inode.h includes
ordered-data.h, so we need to move the structure to ordered-data.h.
- Move the special handling of NOCOW/PREALLOC into
btrfs_alloc_ordered_extent()
This is to allow btrfs_split_ordered_extent() to properly split them
for DIO.
For now just move the handling into btrfs_alloc_ordered_extent() to
simplify the callers.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The following functions and structures can be simplified using the
btrfs_file_extent structure:
- can_nocow_extent()
No need to return ram_bytes/orig_block_len through the parameter list,
the @file_extent parameter contains all the needed info.
- can_nocow_file_extent_args
The following members are no longer needed:
* disk_bytenr
This one is confusing as it's not really the
btrfs_file_extent_item::disk_bytenr, but where the IO would be,
thus it's file_extent::disk_bytenr + file_extent::offset now.
* num_bytes
Now file_extent::num_bytes.
* extent_offset
Now file_extent::offset.
* disk_num_bytes
Now file_extent::disk_num_bytes.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The member extent_map::block_start can be calculated from
extent_map::disk_bytenr + extent_map::offset for regular extents.
And otherwise just extent_map::disk_bytenr.
And this is already validated by the validate_extent_map(). Now we can
remove the member.
However there is a special case in btrfs_create_dio_extent() where we
for NOCOW/PREALLOC ordered extents cannot directly use the resulting
btrfs_file_extent, as btrfs_split_ordered_extent() cannot handle them
yet.
So for that call site, we pass file_extent->disk_bytenr +
file_extent->num_bytes as disk_bytenr for the ordered extent, and 0 for
offset.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The extent_map::block_len is either extent_map::len (non-compressed
extent) or extent_map::disk_num_bytes (compressed extent).
Since we already have sanity checks to do the cross-checks between the
new and old members, we can drop the old extent_map::block_len now.
For most call sites, they can manually select extent_map::len or
extent_map::disk_num_bytes, since most if not all of them have checked
if the extent is compressed.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since we have extent_map::offset, the old extent_map::orig_start is just
extent_map::start - extent_map::offset for non-hole/inline extents.
And since the new extent_map::offset is already verified by
validate_extent_map() while the old orig_start is not, let's just remove
the old member from all call sites.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduce two new members for extent_map:
- disk_bytenr
- offset
Both are matching the members with the same name inside
btrfs_file_extent_items.
For now this patch only touches those members when:
- Reading btrfs_file_extent_items from disk
- Inserting new holes
- Merging two extent maps
With the new disk_bytenr and disk_num_bytes, doing merging would be a
little more complex, as we have 3 different cases:
* Both extent maps are referring to the same data extents
|<----- data extent A ----->|
|<- em 1 ->|<- em 2 ->|
* Both extent maps are referring to different data extents
|<-- data extent A -->|<-- data extent B -->|
|<- em 1 ->|<- em 2 ->|
* One of the extent maps is referring to a merged and larger data
extent that covers both extent maps
This is not really valid case other than some selftests.
So this test case would be removed.
A new helper merge_ondisk_extents() is introduced to handle the above
valid cases.
To properly assign values for those new members, a new btrfs_file_extent
parameter is introduced to all the involved call sites.
- For NOCOW writes the btrfs_file_extent would be exposed from
can_nocow_file_extent().
- For other writes, the members can be easily calculated
As most of them have 0 offset and utilizing the whole on-disk data
extent.
The exception is encoded write, but thankfully that interface provided
offset directly and all other needed info.
For now, both the old members (block_start/block_len/orig_start) are
co-existing with the new members (disk_bytenr/offset), meanwhile all the
critical code is still using the old members only.
The cleanup will happen later after all the old and new members are
properly validated.
There would be some re-ordering for the assignment of the extent_map
members, now we follow the new ordering:
- start and len
Or file_pos and num_bytes for other structures.
- disk_bytenr and disk_num_bytes
- offset and ram_bytes
- compression
So expect some seemingly unrelated line movement.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently function can_nocow_extent() only returns members needed for
extent_map.
However since we will soon change the extent_map structure to be more
like btrfs_file_extent_item, we want to expose the expected file extent
caused by the NOCOW write for future usage.
This introduces a new structure, btrfs_file_extent, to be a more
memory access friendly representation of btrfs_file_extent_item.
And use that structure to expose the expected file extent caused by the
NOCOW write.
For now there is no user of the new structure yet.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This would make it very obvious that the member just matches
btrfs_file_extent_item::disk_num_bytes.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently the core of the fiemap code lives in extent_io.c, which does
not make any sense because it's not related to extent IO at all (and it
was not as well before the big rewrite of fiemap I did some time ago).
The entry point for fiemap, btrfs_fiemap(), lives in inode.c since it's
an inode operation.
Since there's a significant amount of fiemap code, move all of it into a
dedicated file, including its entry point inode.c:btrfs_fiemap().
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When running 'make W=2' there are a few reports where a variable of the
same name is declared in a nested block. In all the cases we can use the
one declared in the parent block, no problematic cases were found.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of passing a (VFS) inode pointer argument, pass a btrfs_inode
instead, as this is generally what we do for internal APIs, making it
more consistent with most of the code base. This will later allow to
help to remove a lot of BTRFS_I() calls in btrfs_sync_file().
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
On 64 bits platforms we don't really need to have a dedicated member (the
objectid field) for the inode's number since we store in the VFS inode's
i_ino member, which is an unsigned long and this type is 64 bits wide on
64 bits platforms. We only need that field in case we are on a 32 bits
platform because the unsigned long type is 32 bits wide on such platforms
See commit 33345d0152 ("Btrfs: Always use 64bit inode number") regarding
this 64/32 bits detail.
The objectid field of struct btrfs_inode is also used to store the ID of
a root for directories that are stubs for unreferenced roots. In such
cases the inode is a directory and has the BTRFS_INODE_ROOT_STUB runtime
flag set.
So in order to reduce the size of btrfs_inode structure on 64 bits
platforms we can remove the objectid member and use the VFS inode's i_ino
member instead whenever we need to get the inode number. In case the inode
is a root stub (BTRFS_INODE_ROOT_STUB set) we can use the member
last_reflink_trans to store the ID of the unreferenced root, since such
inode is a directory and reflinks can't be done against directories.
So remove the objectid fields for 64 bits platforms and alias the
last_reflink_trans field with a name of ref_root_id in a union.
On a release kernel config, this reduces the size of struct btrfs_inode
from 1040 bytes down to 1032 bytes.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently struct btrfs_inode has a key member, named "location", that is
either:
1) The key of the inode's item. In this case the objectid is the number
of the inode;
2) A key stored in a dir entry with a type of BTRFS_ROOT_ITEM_KEY, for
the case where we have a root that is a snapshot of a subvolume that
points to other subvolumes. In this case the objectid is the ID of
a subvolume inside the snapshotted parent subvolume.
The key is only used to lookup the inode item for the first case, while
for the second it's never used since it corresponds to directory stubs
created with new_simple_dir() and which are marked as dummy, so there's
no actual inode item to ever update. In the second case we only check
the key type at btrfs_ino() for 32 bits platforms and its objectid is
only needed for unlink.
Instead of using a key we can do fine with just the objectid, since we
can generate the key whenever we need it having only the objectid, as
in all use cases the type is always BTRFS_INODE_ITEM_KEY and the offset
is always 0.
So use only an objectid instead of a full key. This reduces the size of
struct btrfs_inode from 1048 bytes down to 1040 bytes on a release kernel.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When not using the NO_HOLES feature we always allocate an io tree for an
inode's file_extent_tree. This is wasteful because that io tree is only
used for regular files, so we allocate more memory than needed for inodes
that represent directories or symlinks for example, or for inodes that
correspond to free space inodes.
So improve on this by allocating the io tree only for inodes of regular
files that are not free space inodes.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The index_cnt field of struct btrfs_inode is used only for two purposes:
1) To store the index for the next entry added to a directory;
2) For the data relocation inode to track the logical start address of the
block group currently being relocated.
For the relocation case we use index_cnt because it's not used for
anything else in the relocation use case - we could have used other fields
that are not used by relocation such as defrag_bytes, last_unlink_trans
or last_reflink_trans for example (among others).
Since the csum_bytes field is not used for directories, do the following
changes:
1) Put index_cnt and csum_bytes in a union, and index_cnt is only
initialized when the inode is a directory. The csum_bytes is only
accessed in IO paths for regular files, so we're fine here;
2) Use the defrag_bytes field for relocation, since the data relocation
inode is never used for defrag purposes. And to make the naming better,
alias it to reloc_block_group_start by using a union.
This reduces the size of struct btrfs_inode by 8 bytes in a release
kernel, from 1056 bytes down to 1048 bytes.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we use the spinlock inode_lock from struct btrfs_root to
serialize access to two different data structures:
1) The delayed inodes xarray (struct btrfs_root::delayed_nodes);
2) The inodes xarray (struct btrfs_root::inodes).
Instead of using our own lock, we can use the spinlock that is part of the
xarray implementation, by using the xa_lock() and xa_unlock() APIs and
using the xarray APIs with the double underscore prefix that don't take
the xarray locks and assume the caller is using xa_lock() and xa_unlock().
So remove the spinlock inode_lock from struct btrfs_root and use the
corresponding xarray locks. This brings 2 benefits:
1) We reduce the size of struct btrfs_root, from 1336 bytes down to
1328 bytes on a 64 bits release kernel config;
2) We reduce lock contention by not using anymore the same lock for
changing two different and unrelated xarrays.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Make btrfs_iget_path() simpler and easier to read by avoiding nesting of
if-then-else statements and having an error label to do all the error
handling instead of repeating it a couple times.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When creating a new inode, at btrfs_create_new_inode(), one of the very
last steps is to add the inode to the root's inodes xarray. This often
requires allocating memory which may fail (even though xarrays have a
dedicated kmem_cache which make it less likely to fail), and at that point
we are forced to abort the current transaction (as some, but not all, of
the inode metadata was added to its subvolume btree).
To avoid a transaction abort, preallocate memory for the xarray early at
btrfs_create_new_inode(), so that if we fail we don't need to abort the
transaction and the insertion into the xarray is guaranteed to succeed.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we use a red black tree (rb-tree) to track the currently open
inodes of a root (in struct btrfs_root::inode_tree). This however is not
very efficient when the number of inodes is large since rb-trees are
binary trees. For example for 100K open inodes, the tree has a depth of
17. Besides that, inserting into the tree requires navigating through it
and pulling useless cache lines in the process since the red black tree
nodes are embedded within the btrfs inode - on the other hand, by being
embedded, it requires no extra memory allocations.
We can improve this by using an xarray instead, which is efficient when
indices are densely clustered (such as inode numbers), is more cache
friendly and behaves like a resizable array, with a much better search
and insertion complexity than a red black tree. This only has one small
disadvantage which is that insertion will sometimes require allocating
memory for the xarray - which may fail (not that often since it uses a
kmem_cache) - but on the other hand we can reduce the btrfs inode
structure size by 24 bytes (from 1080 down to 1056 bytes) after removing
the embedded red black tree node, which after the next patches will allow
to reduce the size of the structure to 1024 bytes, meaning we will be able
to store 4 inodes per 4K page instead of 3 inodes.
This change does a straightforward change to use an xarray, and results
in a transaction abort if we can't allocate memory for the xarray when
creating an inode - but the next patch changes things so that we don't
need to abort.
Running the following fs_mark test showed some improvements:
$ cat test.sh
#!/bin/bash
DEV=/dev/nullb0
MNT=/mnt/nullb0
MOUNT_OPTIONS="-o ssd"
FILES=100000
THREADS=$(nproc --all)
echo "performance" | \
tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
mkfs.btrfs -f $DEV
mount $MOUNT_OPTIONS $DEV $MNT
OPTS="-S 0 -L 5 -n $FILES -s 0 -t $THREADS -k"
for ((i = 1; i <= $THREADS; i++)); do
OPTS="$OPTS -d $MNT/d$i"
done
fs_mark $OPTS
umount $MNT
Before this patch:
FSUse% Count Size Files/sec App Overhead
10 1200000 0 92081.6 12505547
16 2400000 0 138222.6 13067072
23 3600000 0 148833.1 13290336
43 4800000 0 97864.7 13931248
53 6000000 0 85597.3 14384313
After this patch:
FSUse% Count Size Files/sec App Overhead
10 1200000 0 93225.1 12571078
16 2400000 0 146720.3 12805007
23 3600000 0 160626.4 13073835
46 4800000 0 116286.2 13802927
53 6000000 0 90087.9 14754892
The test was run with a release kernel config (Debian's default config).
Also capturing the insertion times into the rb tree and into the xarray,
that is measuring the duration of the old function inode_tree_add() and
the duration of the new btrfs_add_inode_to_root() function, gave the
following results (in nanoseconds):
Before this patch, inode_tree_add() execution times:
Count: 5000000
Range: 0.000 - 5536887.000; Mean: 775.674; Median: 729.000; Stddev: 4820.961
Percentiles: 90th: 1015.000; 95th: 1139.000; 99th: 1397.000
0.000 - 7.816: 40 |
7.816 - 37.858: 209 |
37.858 - 170.278: 6059 |
170.278 - 753.961: 2754890 #####################################################
753.961 - 3326.728: 2232312 ###########################################
3326.728 - 14667.018: 4366 |
14667.018 - 64652.943: 852 |
64652.943 - 284981.761: 550 |
284981.761 - 1256150.914: 221 |
1256150.914 - 5536887.000: 7 |
After this patch, btrfs_add_inode_to_root() execution times:
Count: 5000000
Range: 0.000 - 2900652.000; Mean: 272.148; Median: 241.000; Stddev: 2873.369
Percentiles: 90th: 342.000; 95th: 432.000; 99th: 572.000
0.000 - 7.264: 104 |
7.264 - 33.145: 352 |
33.145 - 140.081: 109606 #
140.081 - 581.930: 4840090 #####################################################
581.930 - 2407.590: 43532 |
2407.590 - 9950.979: 2245 |
9950.979 - 41119.278: 514 |
41119.278 - 169902.616: 155 |
169902.616 - 702018.539: 47 |
702018.539 - 2900652.000: 9 |
Average, percentiles, standard deviation, etc, are all much better.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmaG0jUACgkQxWXV+ddt
WDsg4Q//f7xRooomsRaRvIDN4Add17jyZWuhreMhxKKvdtoRzLq/YZUldGJt1XXu
tVpCvKkVvrJt5UY3b1czfA8GTHeOInZSoqyV8ZaxOAcg6cfKrUtLFi1wAKGU6BK/
S9Zz/9lfbjXhG4PnZW7GRFmH0n68p/dmS+kXPVVPhj7CibxuZJtifflK1r6rHw+Y
PDDnMIPqbrdK+xBfR+SEpG+1uIEFvI6SgsAZG6p0mzcbbUINGp7SpU3v2NVuvdIX
tjTlZhz+D2do4dPk3RJ4Z6dpL+VlZnR1F9CZn6SmPcdGWn6mTkywoukJPP62iHYO
b8Xr4j1mKDPcu55OEPWwhYWOvRP4FvltrGB+35xsQoPlKBBKxSMFgNyEkdfvjtKU
qL9qRYJvm8D2OBqdv5BnawKRSouPcR4KPiM9JSGWRcW6vdOiY3Cd1mnAR8EA1LSd
SRYmBlrZIIjOpEeGkEy2MeNKEH9L5B1wAU21VQd+cAlCKBFUmFb1csZ/vCJJ1B34
swedmirNx0lyZTNBGQsDipkxSM0c/0xj/ysG8N41bLvPUM35x0K9xw5ZqBlTXrrg
Lln51s6TeqSzfEvKdDGikBL3dVhT5I6acRUchQU9YSZ3Bqxj3QCTNnOZMZQsCFkp
CMohduRVHiK4LHQTkU++q8bT8xX20kh/1HIJhAjdnDbrlRY4oTU=
=tyXO
-----END PGP SIGNATURE-----
Merge tag 'for-6.10-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix folio refcounting when releasing them (encoded write, dummy
extent buffer)
- fix out of bounds read when checking qgroup inherit data
- fix how configurable chunk size is handled in zoned mode
- in the ref-verify tool, fix uninitialized return value when checking
extent owner ref and simple quota are not enabled
* tag 'for-6.10-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix folio refcount in __alloc_dummy_extent_buffer()
btrfs: fix folio refcount in btrfs_do_encoded_write()
btrfs: fix uninitialized return value in the ref-verify tool
btrfs: always do the basic checks for btrfs_qgroup_inherit structure
btrfs: zoned: fix calc_available_free_space() for zoned mode
The conversion to folios switched __free_page() to __folio_put() in the
error path in btrfs_do_encoded_write().
However, this gets the page refcounting wrong. If we do hit that error
path (I reproduced by modifying btrfs_do_encoded_write to pretend to
always fail in a way that jumps to out_folios and running the fstests
case btrfs/281), then we always hit the following BUG freeing the folio:
BUG: Bad page state in process btrfs pfn:40ab0b
page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x61be5 pfn:0x40ab0b
flags: 0x5ffff0000000000(node=0|zone=2|lastcpupid=0x1ffff)
raw: 05ffff0000000000 0000000000000000 dead000000000122 0000000000000000
raw: 0000000000061be5 0000000000000000 00000001ffffffff 0000000000000000
page dumped because: nonzero _refcount
Call Trace:
<TASK>
dump_stack_lvl+0x3d/0xe0
bad_page+0xea/0xf0
free_unref_page+0x8e1/0x900
? __mem_cgroup_uncharge+0x69/0x90
__folio_put+0xe6/0x190
btrfs_do_encoded_write+0x445/0x780
? current_time+0x25/0xd0
btrfs_do_write_iter+0x2cc/0x4b0
btrfs_ioctl_encoded_write+0x2b6/0x340
It turns out __free_page() decreases the page reference count while
__folio_put() does not. Switch __folio_put() to folio_put() which
decreases the folio reference count first.
Fixes: 400b172b8c ("btrfs: compression: migrate compression/decompression paths to folios")
Tested-by: Ed Tomlinson <edtoml@gmail.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With 20 threads each walking a dedicated 1000 dirs * 1000 files
directory tree to stat(2) on a 32 core + 24GB ram vm:
before: 3.54s user 892.30s system 1966% cpu 45.549 total
after: 3.28s user 738.66s system 1955% cpu 37.932 total (-16.7%)
Benchmark can be found here: https://people.freebsd.org/~mjg/fstree.tgz
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://lore.kernel.org/r/20240611173824.535995-3-mjguzik@gmail.com
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmZCE4MACgkQxWXV+ddt
WDudtQ//WjXcHtY3I6NJtDhPsIOG3Qjg9mA0shp73X4djJtZoGCdgL7dq+fTp5lk
Wu6/XY5g+CSttTgwF4eyHgUSJOptKWY0XQDWxX5VR8WCM2qmUZ7SedlrBED9GNDM
rN/3egmc74OGwnqyQq3I/2qYLByXFj66tsvW3UBjLNB8vMHajjw1idj9ujipioHq
ySStPCHkPMwuhEzw9+CTe3W47VUSb5Ug3XDhAZXvxT99oDHn1m+CxKQwcona/IPH
1El8PmZ7JetaT9ZO3DICBICfCyo+2SSy/KXYypXXE+nzNZhbhC0V9N7Uqm1c91C0
aRglsJZCXmHBD4BPLvkls6CqEIvMc7FvcNCqQlrbRT6PlfX91/XaeDq4l3RUcuPn
mGShsdHUiwbPMWYVwqVUKd0IPiktF1R7yigTjYSkEFJTL6HFTrBqV/2fAMUsMfPc
8gyzYMCPQld73WmrnXZQPKvmzO/LvE0gS5cPapokGwoXstq9n3iYd4ypN0wN6sif
1jwy3efNzWXXMYV0WzcihKwFMm2fqp/pl9bXq/zwn2CunfIX4WTsaQ2NmJf81jqF
qFNjlr8S3qO7AvIOs+R2XY9E3VjfzeDADzvjpQy5J/ZYbcHBcxxdYDhg+QGhe5nB
eNmR51oL1pHSjU2M8PxATL8JxKkX2BvX6u64lVojaw4rxUlyFC0=
=MMpE
-----END PGP SIGNATURE-----
Merge tag 'for-6.10-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"This update brings a few minor performance improvements, otherwise
there's a lot of refactoring, cleanups and other sort of not user
visible changes.
Performance improvements:
- inline b-tree locking functions, improvement in metadata-heavy
changes
- relax locking on a range that's being reflinked, allows read
operations to run in parallel
- speed up NOCOW write checks (throughput +9% on a sample test)
- extent locking ranges have been reduced in several places, namely
around delayed ref processing
Core:
- more page to folio conversions:
- relocation
- send
- compression
- inline extent handling
- super block write and wait
- extent_map structure optimizations:
- reduced structure size
- code simplifications
- add shrinker for allocated objects, the numbers can go high and
could exhaust memory on smaller systems (reported) as they may
not get an opportunity to be freed fast enough
- extent locking optimizations:
- reduce locking ranges where it does not seem to be necessary and
are safe due to other means of synchronization
- potential improvements due to lower contention,
allocation/freeing and state management operations of extent
state tracking structures
- delayed ref cleanups and simplifications
- updated trace points
- improved error handling, warnings and assertions
- cleanups and refactoring, unification of error handling paths"
* tag 'for-6.10-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (122 commits)
btrfs: qgroup: fix initialization of auto inherit array
btrfs: count super block write errors in device instead of tracking folio error state
btrfs: use the folio iterator in btrfs_end_super_write()
btrfs: convert super block writes to folio in write_dev_supers()
btrfs: convert super block writes to folio in wait_dev_supers()
bio: Export bio_add_folio_nofail to modules
btrfs: remove duplicate included header from fs.h
btrfs: add a cached state to extent_clear_unlock_delalloc
btrfs: push extent lock down in submit_one_async_extent
btrfs: push lock_extent down in cow_file_range()
btrfs: move can_cow_file_range_inline() outside of the extent lock
btrfs: push lock_extent into cow_file_range_inline
btrfs: push extent lock into cow_file_range
btrfs: push extent lock into run_delalloc_cow
btrfs: remove unlock_extent from run_delalloc_compressed
btrfs: push extent lock down in run_delalloc_nocow
btrfs: adjust while loop condition in run_delalloc_nocow
btrfs: push extent lock into run_delalloc_nocow
btrfs: push the extent lock into btrfs_run_delalloc_range
btrfs: lock extent when doing inline extent in compression
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZj3HuwAKCRCRxhvAZXjc
orYvAQCZOr68uJaEaXAArYTdnMdQ6HIzG+FVlwrqtrhz0BV07wEAqgmtSR9XKh+L
0+DNepg4R8PZOHH371eSSsLNRCUCkAs=
=SVsU
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.10.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"This contains the usual miscellaneous features, cleanups, and fixes
for vfs and individual fses.
Features:
- Free up FMODE_* bits. I've freed up bits 6, 7, 8, and 24. That
means we now have six free FMODE_* bits in total (but bit #6
already got used for FMODE_WRITE_RESTRICTED)
- Add FOP_HUGE_PAGES flag (follow-up to FMODE_* cleanup)
- Add fd_raw cleanup class so we can make use of automatic cleanup
provided by CLASS(fd_raw, f)(fd) for O_PATH fds as well
- Optimize seq_puts()
- Simplify __seq_puts()
- Add new anon_inode_getfile_fmode() api to allow specifying f_mode
instead of open-coding it in multiple places
- Annotate struct file_handle with __counted_by() and use
struct_size()
- Warn in get_file() whether f_count resurrection from zero is
attempted (epoll/drm discussion)
- Folio-sophize aio
- Export the subvolume id in statx() for both btrfs and bcachefs
- Relax linkat(AT_EMPTY_PATH) requirements
- Add F_DUPFD_QUERY fcntl() allowing to compare two file descriptors
for dup*() equality replacing kcmp()
Cleanups:
- Compile out swapfile inode checks when swap isn't enabled
- Use (1 << n) notation for FMODE_* bitshifts for clarity
- Remove redundant variable assignment in fs/direct-io
- Cleanup uses of strncpy in orangefs
- Speed up and cleanup writeback
- Move fsparam_string_empty() helper into header since it's currently
open-coded in multiple places
- Add kernel-doc comments to proc_create_net_data_write()
- Don't needlessly read dentry->d_flags twice
Fixes:
- Fix out-of-range warning in nilfs2
- Fix ecryptfs overflow due to wrong encryption packet size
calculation
- Fix overly long line in xfs file_operations (follow-up to FMODE_*
cleanup)
- Don't raise FOP_BUFFER_{R,W}ASYNC for directories in xfs (follow-up
to FMODE_* cleanup)
- Don't call xfs_file_open from xfs_dir_open (follow-up to FMODE_*
cleanup)
- Fix stable offset api to prevent endless loops
- Fix afs file server rotations
- Prevent xattr node from overflowing the eraseblock in jffs2
- Move fdinfo PTRACE_MODE_READ procfs check into the .permission()
operation instead of .open() operation since this caused userspace
regressions"
* tag 'vfs-6.10.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (39 commits)
afs: Fix fileserver rotation getting stuck
selftests: add F_DUPDFD_QUERY selftests
fcntl: add F_DUPFD_QUERY fcntl()
file: add fd_raw cleanup class
fs: WARN when f_count resurrection is attempted
seq_file: Simplify __seq_puts()
seq_file: Optimize seq_puts()
proc: Move fdinfo PTRACE_MODE_READ check into the inode .permission operation
fs: Create anon_inode_getfile_fmode()
xfs: don't call xfs_file_open from xfs_dir_open
xfs: drop fop_flags for directories
xfs: fix overly long line in the file_operations
shmem: Fix shmem_rename2()
libfs: Add simple_offset_rename() API
libfs: Fix simple_offset_rename_exchange()
jffs2: prevent xattr node from overflowing the eraseblock
vfs, swap: compile out IS_SWAPFILE() on swapless configs
vfs: relax linkat() AT_EMPTY_PATH - aka flink() - requirements
fs/direct-io: remove redundant assignment to variable retval
fs/dcache: Re-use value stored to dentry->d_flags instead of re-reading
...
Now that we have the lock_extent tightly coupled with
extent_clear_unlock_delalloc we can add a cached state to
extent_clear_unlock_delalloc and benefit from skipping the extra lookup
when we're doing cow.
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We don't need to include the time we spend in the allocator under our
extent lock protection, move it after the allocator and make sure we
lock the extent in the error case to ensure we're not clearing these
bits without the extent lock held.
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we've got the extent lock pushed into cow_file_range() we can
push it further down into the allocation loop. This allows us to only
hold the extent lock during the dropping of the extent map range and
inserting the ordered extent.
This makes the error case a little trickier as we'll now have to lock
the range before clearing any of the other extent bits for the range,
but this is the error path so is less performance critical.
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These checks aren't reliant on the extent lock. Move this up into
cow_file_range_inline(), and then update encoded writes to call this
check before calling __cow_file_range_inline(). This will allow us to
skip the extent lock if we're not able to inline the given extent.
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we've pushed the lock_extent() into cow_file_range() we can
push the extent locking into cow_file_range_inline() and move the
lock_extent in cow_file_range() to after we call
cow_file_range_inline().
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that cow_file_range is the only function that is called with the
range locked, push this call into cow_file_range so we can further
narrow the scope.
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is used by zoned but also as the fallback for uncompressed extents
when we fail to compress the ranges. Push the extent lock into
run_dealloc_cow(), and adjust the compression case to take the extent
lock after calling run_delalloc_cow().
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since we immediately unlock the extent range when we enter
run_delalloc_compressed() simply move the lock_extent() down to cover
cow_file_range() and then remove the unlock_extent() from
run_delalloc_compressed.
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
run_delalloc_nocow is a little special because we use the file extents
to see if we can nocow a range. We don't actually need the protection
of the extent lock to look at the file extents at this point however.
We are currently holding the page lock for this range, so we are
protected from anybody who would simultaneously be modifying the file
extent items for this range.
* mmap() - we're holding the page lock.
* buffered writes - we're holding the page lock.
* direct writes - we're holding the page lock and direct IO has to flush
page cache before it's able to continue.
* fallocate() - all callers flush the range and wait on ordered extents
while holding the inode lock and the mmap lock, so we are again saved
by the page lock.
We want to use the extent lock to protect
1) The mapping tree for the given range.
2) The ordered extents for the given range.
3) The io_tree for the given range.
Push the extent lock down to cover these operations. In the
fallback_to_cow() case we simply lock before doing anything and rely on
the cow_file_range() helper to handle it's range properly.
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have the following pattern
while (1) {
if (cur_offset > end)
break;
}
Which is just
while (cur_offset <= end) {
...
}
so adjust the code to be more clear.
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>