Commit Graph

2858 Commits

Author SHA1 Message Date
Qu Wenruo
1d2fbb7f1f btrfs: allow compression even if the range is not page aligned
Previously for btrfs with sector size smaller than page size (subpage),
we only allow compression if the range is fully page aligned.

This is to work around the asynchronous submission of compressed range,
which delayed the page unlock and writeback into a workqueue,
furthermore asynchronous submission can lock multiple sector range
across page boundary.

Such asynchronous submission makes it very hard to co-operate with other
regular writes.

With the recent changes to the subpage folio unlock path, now
asynchronous submission of compressed pages can co-operate with regular
submission, so enable sector perfect compression if it's an experimental
build.

The ETA for moving this feature out of experimental is 6.15, and I hope
all remaining corner cases can be exposed before that.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11 14:34:13 +01:00
Qu Wenruo
a4ef54dbb5 btrfs: make extent_range_clear_dirty_for_io() to handle sector size < page size cases
For btrfs with sector size < page size (e.g. 4K sector size, 64K page
size), and enable the sector perfect compression support, then the
following dirty range can lead to problems:

   0     32K     64K     96K    128K
   |     |///////||//////|    |/|
                              124K

In above case, if we start writeback for that inode, the last dirty
range [124K, 128K) will not be submitted and cause reserved space
leakage:

- Start writeback for page 0
  We find the range [32K, 96K) is suitable for compression, and queue it
  into a workqueue to do the delayed compression and submission.

- Compression happens for range [32K, 96K)
  Function extent_range_clear_dirty_for_io() is called, however it is
  only doing full page handling, not considering any the extra bitmaps
  for subpage cases.

  That function will clear page dirty for both page 0 and page 64K.

- Writeback for the inode is done
  Because page 64K has its dirty flag cleared, it will not be considered
  as a writeback target.

This means the range [124K, 128K) will not be submitted, and reserved
space for it will be leaked.

Fix this problem by using the subpage helper to clear the dirty flag.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11 14:34:12 +01:00
Linus Torvalds
9183e033ec for-6.12-rc6-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmct+40ACgkQxWXV+ddt
 WDvCtRAAp0rheEu14hpVvWE2//+6u9Gx7Wfjzbj0+o4zBRWdg7BigFxfeb6JsH/E
 2TjuWdcoP/OMV9ghCBQAQxySAPtsxH7skkyNy2UcMk5byBIrNvhw9auP5GXXlrhK
 jSKDD4yfOMb++8LhrLevgTrijNyjLqaKXruw9a1Pmc3gxpdNmnMEySsQaF62o2Sm
 YC3jwi0KpNAhu2qyJ6TnPgd5zf3BTM0JAeuB019IZW4WoeRTOdcPe7S7gqqJwZ+e
 lL0D2/lfIE1lKvLE266Fab4FAQiJV07rozYj25XHiDpqThCxnJVOZCEHasOQ1PRy
 d6j3RmGPqJYAYfQL1L+FH2hsS1BVZfVyCV1V7A/cN+lAffBfnROnf13C3gJ15Nbx
 3lTyjBPQQw2WpfdmeyF3ikbrjZ8AfahChQO+mMnLN7oAWdIwWX5MRB+cwfWTxzA/
 P8upz6HSTpSwy8nXdq264q1KkyCjx0Wv+8iyU7LirN2fCcEchA12HAIaOBeHedgh
 PrGZDqrkZccQQxAvU5H7hQv0hZkGK8qba381oYHO09g72VM6ysuBU7tGrPZrlZYB
 CvYTCwNZ/lqI8ikrcHOyUO1SPR9SaaWej1mWgBJ69ZIfg+ZuMtOMl171DU4S/i2V
 iYgYoN8eCqTQWdaX5kk+3LWmK8fSU7F/KSDtJtT1KxkaSwCacfY=
 =TQzP
 -----END PGP SIGNATURE-----

Merge tag 'for-6.12-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "A few more one-liners that fix some user visible problems:

   - use correct range when clearing qgroup reservations after COW

   - properly reset freed delayed ref list head

   - fix ro/rw subvolume mounts to be backward compatible with old and
     new mount API"

* tag 'for-6.12-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix the length of reserved qgroup to free
  btrfs: reinitialize delayed ref list after deleting it from the list
  btrfs: fix per-subvolume RO/RW flags with new mount API
2024-11-08 07:31:03 -10:00
Haisu Wang
2b084d8205 btrfs: fix the length of reserved qgroup to free
The dealloc flag may be cleared and the extent won't reach the disk in
cow_file_range when errors path. The reserved qgroup space is freed in
commit 30479f31d4 ("btrfs: fix qgroup reserve leaks in
cow_file_range"). However, the length of untouched region to free needs
to be adjusted with the correct remaining region size.

Fixes: 30479f31d4 ("btrfs: fix qgroup reserve leaks in cow_file_range")
CC: stable@vger.kernel.org # 6.11+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Haisu Wang <haisuwang@tencent.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-07 02:08:29 +01:00
Pankaj Raghav
30dac24e14
fs/writeback: convert wbc_account_cgroup_owner to take a folio
Most of the callers of wbc_account_cgroup_owner() are converting a folio
to page before calling the function. wbc_account_cgroup_owner() is
converting the page back to a folio to call mem_cgroup_css_from_folio().

Convert wbc_account_cgroup_owner() to take a folio instead of a page,
and convert all callers to pass a folio directly except f2fs.

Convert the page to folio for all the callers from f2fs as they were the
only callers calling wbc_account_cgroup_owner() with a page. As f2fs is
already in the process of converting to folios, these call sites might
also soon be calling wbc_account_cgroup_owner() with a folio directly in
the future.

No functional changes. Only compile tested.

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Link: https://lore.kernel.org/r/20240926140121.203821-1-kernel@pankajraghav.com
Acked-by: David Sterba <dsterba@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-28 13:26:54 +01:00
Linus Torvalds
4e46774408 for-6.12-rc4-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmcZGqsACgkQxWXV+ddt
 WDsZTg//dSLAAswT3dTAvWDt7rGxPhJC1V+gnzWdj/0Q3CUimNaw7zHp1QHtJDFT
 S+3TVvJJ2JDvOnbi3u24s/bL5YdkWcvIyy87oVE4trJMbQPc/E45pkYFqF3TRQAH
 wEYAVEnO9f9WUY+ekxX2XBzhmKp3xol93j9BBHDXOF9kHsDC5lI0D5YVYEhRY8Qu
 D4GcqAhqUj5Hj6I6ppiqO47NCJBNzw1Se9QsgruPpmItRbB0/LYJFhUfevwosTKg
 xVpkRVntFqQjkIdIdBfBv/ZGWfxJyM7K4M49QwLUqUQfxugu7BiGYuEjkBkiy07a
 pZDEOF9s8wUZsxvRVohqKlhL0zHBF+/pAANowYuhKNW1sqKt4GVCdN3V34AbH8ST
 JIbPvC2g1tUzIc2wckE8GO/NnsNR4r3k6iPB53MdCHrIo3jnENOeb9wF0GiVDb6s
 OrhCa3ph2ps80YC1aCnc4Jr/yV2ONebSivnqvHCIUQEZfpjtAc07atm0i946/7nx
 eiBE+9zSVJZB0LoooIVz2I3HX3lRnm3Wwi4nj8U/sNL/IbGaHNCurIETR23NivYP
 yWVql2njwE+yMc8q9YZXs4MBdKsSGP6eGJW3ZwKi6ru2PJdf5eIib+ffvR4bLqXD
 UUDfMyC1esGsQB24sc8wppk97wmmMrdqQUj9WnmNlTP8FUFggvQ=
 =sYxX
 -----END PGP SIGNATURE-----

Merge tag 'for-6.12-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - mount option fixes:
     - fix handling of compression mount options on remount
     - reject rw remount in case there are options that don't work
       in read-write mode (like rescue options)

 - fix zone accounting of unusable space

 - fix in-memory corruption when merging extent maps

 - fix delalloc range locking for sector < page

 - use more convenient default value of drop subtree threshold, clean
   more subvolumes without the fallback to marking quotas inconsistent

 - fix smatch warning about incorrect value passed to ERR_PTR

* tag 'for-6.12-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix passing 0 to ERR_PTR in btrfs_search_dir_index_item()
  btrfs: reject ro->rw reconfiguration if there are hard ro requirements
  btrfs: fix read corruption due to race with extent map merging
  btrfs: fix the delalloc range locking if sector size < page size
  btrfs: qgroup: set a more sane default value for subtree drop threshold
  btrfs: clear force-compress on remount when compress mount option is given
  btrfs: zoned: fix zone unusable accounting for freed reserved extent
2024-10-24 13:04:15 -07:00
Yue Haibing
75f49c3dc7 btrfs: fix passing 0 to ERR_PTR in btrfs_search_dir_index_item()
The ret may be zero in btrfs_search_dir_index_item() and should not
passed to ERR_PTR(). Now btrfs_unlink_subvol() is the only caller to
this, reconstructed it to check ERR_PTR(-ENOENT) while ret >= 0.

This fixes smatch warnings:

fs/btrfs/dir-item.c:353
  btrfs_search_dir_index_item() warn: passing zero to 'ERR_PTR'

Fixes: 9dcbe16fcc ("btrfs: use btrfs_for_each_slot in btrfs_search_dir_index_item")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-10-22 16:10:55 +02:00
Linus Torvalds
79eb2c07af for-6.12-rc1-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmb/9agACgkQxWXV+ddt
 WDtKFA/5AW88osk/1k/NVhOvOa0xPr5XyDLq1n8Gaxfy8uHlHAc8wdsvJzCDMS0M
 qUOD/tOPhRI0HGXPiKD767erwbyXiZAcCTkSd8x5jlXy1hVjUHQSKO//JxD0vtAZ
 jOscoUA1wJJutopCXcppnoUUFE2753edEg0w2EtUXMpfqivqOmMCR+1ZtKkfaNJo
 oRuZCq3Oi8hu7Wsvmh4Etq/9MvGM+xovXAMAji6Op8nsP1jJlzWztpEUogLOQH2S
 IhDFFxP9shBV9JjV+HSyXcAYr8VArH6HtYjapR9oajCH0pvSjLYQQPq/qQ0//8Hb
 SHr4YP6RBbpEIvvaQeA7vwsckBHNBOrSAbEgRQ2+zdmiwha6SRIEVyXYh5LUwcYT
 WnVozbk7ZX9rD1jOVhvgouFG6vUz8A6/qt3BD028bVcyMvBXW4gsEduCMVlFGmlN
 D6+hNY6J08j4HUEGnPk7fAYi/lk5OEK1p5yarUgsOQ3GqWWS0ywkrfbmbWwyC+Ff
 AxggFTl9YodU5RMs7EU2GeHkLU6LnXgevk6FFm0JzsLtT/BbEP7pj6tOot4Msl1e
 2ovqFiSbuPNg5Wr70ZBRO9LDIAYtTZy1UrVR/YCSLzm+wZsXMelDIHMQfE7+Yodp
 O5Cud23AanRTwjErvNl3X4rhut8rrI1FsR89gDyKK1EPG+mwofo=
 =yslr
 -----END PGP SIGNATURE-----

Merge tag 'for-6.12-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - in incremental send, fix invalid clone operation for file that got
   its size decreased

 - fix __counted_by() annotation of send path cache entries, we do not
   store the terminating NUL

 - fix a longstanding bug in relocation (and quite hard to hit by
   chance), drop back reference cache that can get out of sync after
   transaction commit

 - wait for fixup worker kthread before finishing umount

 - add missing raid-stripe-tree extent for NOCOW files, zoned mode
   cannot have NOCOW files but RST is meant to be a standalone feature

 - handle transaction start error during relocation, avoid potential
   NULL pointer dereference of relocation control structure (reported by
   syzbot)

 - disable module-wide rate limiting of debug level messages

 - minor fix to tracepoint definition (reported by checkpatch.pl)

* tag 'for-6.12-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: disable rate limiting when debug enabled
  btrfs: wait for fixup workers before stopping cleaner kthread during umount
  btrfs: fix a NULL pointer dereference when failed to start a new trasacntion
  btrfs: send: fix invalid clone operation for file that got its size decreased
  btrfs: tracepoints: end assignment with semicolon at btrfs_qgroup_extent event class
  btrfs: drop the backref cache during relocation if we commit
  btrfs: also add stripe entries for NOCOW writes
  btrfs: send: fix buffer overflow detection when copying path to cache entry
2024-10-04 10:05:13 -07:00
Matthew Wilcox (Oracle)
a6752a6e7f
btrfs: Switch from using the private_2 flag to owner_2
We are close to removing the private_2 flag, so switch btrfs to using
owner_2 for its ordered flag.  This is mostly used by buffer head
filesystems, so btrfs can use it because it doesn't use buffer heads.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/20241002040111.1023018-5-willy@infradead.org
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-04 09:24:25 +02:00
Al Viro
5f60d5f6bb move asm/unaligned.h to linux/unaligned.h
asm/unaligned.h is always an include of asm-generic/unaligned.h;
might as well move that thing to linux/unaligned.h and include
that - there's nothing arch-specific in that header.

auto-generated by the following:

for i in `git grep -l -w asm/unaligned.h`; do
	sed -i -e "s/asm\/unaligned.h/linux\/unaligned.h/" $i
done
for i in `git grep -l -w asm-generic/unaligned.h`; do
	sed -i -e "s/asm-generic\/unaligned.h/linux\/unaligned.h/" $i
done
git mv include/asm-generic/unaligned.h include/linux/unaligned.h
git mv tools/include/asm-generic/unaligned.h tools/include/linux/unaligned.h
sed -i -e "/unaligned.h/d" include/asm-generic/Kbuild
sed -i -e "s/__ASM_GENERIC/__LINUX/" include/linux/unaligned.h tools/include/linux/unaligned.h
2024-10-02 17:23:23 -04:00
Johannes Thumshirn
97f9782276 btrfs: also add stripe entries for NOCOW writes
NOCOW writes do not generate stripe_extent entries in the RAID stripe
tree, as the RAID stripe-tree feature initially was designed with a
zoned filesystem in mind and on a zoned filesystem, we do not allow NOCOW
writes. But the RAID stripe-tree feature is independent from the zoned
feature, so we must also do NOCOW writes for RAID stripe-tree filesystems.

Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-10-01 19:09:04 +02:00
Li Zetao
aeb6d88148 btrfs: convert btrfs_decompress() to take a folio
The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversion between page
and folio. Based on the previous patch, the compression path can be
directly used in folio without converting to page.

Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:21 +02:00
Li Zetao
046c0d6596 btrfs: convert try_release_extent_mapping() to take a folio
The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversion between page
and folio. And page_to_inode() can be replaced with folio_to_inode() now.

Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:21 +02:00
Li Zetao
266a9361a4 btrfs: convert clear_page_extent_mapped() to take a folio
The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversion between page
and folio. Now clear_page_extent_mapped() can deal with a folio
directly, so change its name to clear_folio_extent_mapped().

Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:20 +02:00
David Sterba
11e3107d47 btrfs: drop transaction parameter from btrfs_add_inode_defrag()
There's only one caller inode_should_defrag() that passes NULL to
btrfs_add_inode_defrag() so we can drop it an simplify the code.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:19 +02:00
David Sterba
06de42c5a9 btrfs: rename __extent_writepage() and drop double underscores
The function does not follow the pattern where the underscores would be
justified, so rename it.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:19 +02:00
David Sterba
792e86ef31 btrfs: rename btrfs_submit_bio() to btrfs_submit_bbio()
The function name is a bit misleading as it submits the btrfs_bio
(bbio), rename it so we can use btrfs_submit_bio() when an actual bio is
submitted.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:19 +02:00
Boris Burkov
f8e9f4a76d btrfs: add comment about locking in cow_file_range_inline()
Add a comment to document the complicated locked_page unlock logic in
cow_file_range_inline. The specifically tricky part is that a caller
just up the stack converts ret == 0 to ret == 1 and then another
caller far up the callstack handles ret == 1 as a success, AND returns
without cleanup in that case, both of which "feel" unnatural and led to
the original bug.

Try to document that somewhat specific callstack logic here to explain
the weird un-setting of locked_folio on success.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:17 +02:00
Josef Bacik
5fe1912449 btrfs: convert extent_range_clear_dirty_for_io() to use a folio
Instead of getting a page and using that to clear dirty for io, use the
folio helper and use the appropriate folio functions.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:17 +02:00
Josef Bacik
c86d3aac81 btrfs: convert insert_inline_extent() to use a folio
We only use a page to copy in the data for the inline extent.  Use a
folio for this instead.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:17 +02:00
Josef Bacik
1bbf3a3aea btrfs: convert btrfs_set_range_writeback() to use a folio
We already use a lot of functions here that use folios, update the
function to use __filemap_get_folio instead of find_get_page and then
use the folio directly.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:17 +02:00
Josef Bacik
dfc9e3017a btrfs: convert wait_subpage_spinlock() to only use a folio
Currently this already uses a folio for most things, update it to take a
folio and update all the page usage with the corresponding folio usage.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:16 +02:00
Josef Bacik
dce9ef9412 btrfs: convert btrfs_get_extent() to take a folio
We only pass this into read_inline_extent, change it to take a folio and
update the callers.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:16 +02:00
Josef Bacik
220e77c412 btrfs: convert read_inline_extent() to use a folio
Instead of using a page, use a folio instead, take a folio as an
argument, and update the callers appropriately.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:16 +02:00
Josef Bacik
752965824b btrfs: convert uncompress_inline() to take a folio
Update uncompress_inline to take a folio and update it's usage
accordingly.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:16 +02:00
Josef Bacik
1b5125bbd4 btrfs: convert struct btrfs_writepage_fixup to use a folio
Now the fixup creator and consumer use folios, change this to use a
folio as well.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:16 +02:00
Josef Bacik
d71b53c3cb btrfs: convert btrfs_writepage_cow_fixup() to use folio
Instead of a page, use a folio for btrfs_writepage_cow_fixup.  We
already have a folio at the only caller, and the fixup worker uses
folios.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:16 +02:00
Josef Bacik
7d003cc2b3 btrfs: convert btrfs_writepage_fixup_worker() to use a folio
This function heavily messes with pages, instead update it to use a
folio.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:16 +02:00
Josef Bacik
0d11706810 btrfs: convert submit_uncompressed_range() to take a folio
This mostly uses folios already, update it to take a folio and update
the rest of the function to use the folio instead of the page.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:16 +02:00
Josef Bacik
3ed984b5d0 btrfs: convert struct async_chunk to hold a folio
Instead of passing in the page for ->locked_page, make it hold a
locked_folio and then update the users of async_chunk to act
accordingly.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:15 +02:00
Josef Bacik
2609c9289f btrfs: convert btrfs_run_delalloc_range() to take a folio
Now that every function that btrfs_run_delalloc_range calls takes a
folio, update it to take a folio and update the callers.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:15 +02:00
Josef Bacik
d9c750272d btrfs: convert run_delalloc_compressed() to take a folio
This just passes the page into the compressed machinery to keep track of
the locked page.  Update this to take a folio and convert it to a page
where appropriate.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:15 +02:00
Josef Bacik
94cea66d1c btrfs: convert btrfs_cleanup_ordered_extents() to take a folio
Now that btrfs_cleanup_ordered_extents is operating mostly with folios,
update it to use a folio instead of a page, and the update the function
and the callers as appropriate.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:15 +02:00
Josef Bacik
b38ec94ab9 btrfs: convert btrfs_cleanup_ordered_extents() to use folios
We walk through pages in this function and clear ordered, and the
function for this uses folios. Update the function to use a folio for
this whole operation.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:15 +02:00
Josef Bacik
42a5947b1c btrfs: convert run_delalloc_nocow() to take a folio
Now all of the functions that use locked_page in run_delalloc_nocow take
a folio, update it to take a folio and update the caller.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:15 +02:00
Josef Bacik
39bbc56a9c btrfs: convert fallback_to_cow() to take a folio
With this we can pass the folio directly into cow_file_range().

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:15 +02:00
Josef Bacik
4cf7e0562f btrfs: convert cow_file_range() to take a folio
Convert this to take a folio and pass it into all of the various cleanup
functions.  Update the callers to pass in a folio instead.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:15 +02:00
Josef Bacik
9f5db28074 btrfs: convert cow_file_range_inline() to take a folio
Now that we want the folio in this function, convert it to take a folio
directly and use that.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:15 +02:00
Josef Bacik
2cdc1fbb1b btrfs: convert run_delalloc_cow() to take a folio
We pass the folio into extent_write_locked_range, go ahead and take a
folio to pass along, and update the callers to pass in a folio.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:15 +02:00
Josef Bacik
01e11841f0 btrfs: convert extent_write_locked_range() to take a folio
This mostly uses folios, convert it to take a folio instead and update
the callers to pass in the folio.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:15 +02:00
Josef Bacik
a67f540582 btrfs: convert extent_clear_unlock_delalloc() to take a folio
Instead of taking the locked page, take the locked folio so we can pass
that into __process_folios_contig.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:14 +02:00
Josef Bacik
a79228011c btrfs: convert btrfs_mark_ordered_io_finished() to take a folio
We only need a folio now, make it take a folio as an argument and update
all of the callers.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:14 +02:00
Jeff Layton
3bc2ac2f8f btrfs: update target inode's ctime on unlink
Unlink changes the link count on the target inode. POSIX mandates that
the ctime must also change when this occurs.

According to https://pubs.opengroup.org/onlinepubs/9699919799/functions/unlink.html:

"Upon successful completion, unlink() shall mark for update the last data
 modification and last file status change timestamps of the parent
 directory. Also, if the file's link count is not 0, the last file status
 change timestamp of the file shall be marked for update."

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add link to the opengroup docs ]
Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-15 20:35:44 +02:00
Linus Torvalds
6a0e382640 for-6.11-rc2-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmazbGYACgkQxWXV+ddt
 WDsQAw//Z3XjjylTZPuHNk/AiMe5oochxB5T9ZracQOzG0o70gj1w/UQIZBkSzp1
 66g8I4YdbwvEXKDg9Oi/GPDSON3GuhAiLXp+0Y/reeD/totgvrhROuJ3mIk5CZ0H
 B4fIKH3xCKLQan26Opgju4qjum7+AR7ekFveM6GicxXXb3eAYALgoEFt63eZjZVu
 7myak78gmBuK5QdGHH+onEhn+HfC57UTGBqu1bJsSOQC7dANkU+WzmgbH6FeOHqx
 2T/lN/tu2tBBoF4zMvC472Zjmj4PnubNQnwwv0oJ8Z2Y0yIY95joZV0uXIXO72oz
 BMQC6s0cltiTn1Tfe4iIWn+ZNjcfGAZO7aoD5NcJb2F/Mz5mZuMhtq3BND0wJ8/+
 SuYk7PxX7tcOaFrDAn3Ne7XHsD7r5lLkFICXkzcNG4dqkBUxR3dN4Oi8KKMFrgCP
 kTpf0/lAkYKYSrU86Yn1zhwRLaH8jm3fulVUY8i/p0tJnpsW8AmeME1Sk97Bhxbb
 y6zt8+MPscPuQi3jPsBevaYd8Q8BIT34vVtU/jNmGAyqEv1wVYQRRK2Ma4sJJfNk
 EEQV9i4VqXWLc17dcWVZrG6lEsVRIIBE/2Adhth6Myq73bkunpB9ZNNNZApf1T2j
 Wn5gPzkuQd6FWrXe8V7oSN9rT1OPn9uHL6BmSUWhHUCuFeATgoc=
 =03Gf
 -----END PGP SIGNATURE-----

Merge tag 'for-6.11-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - fix double inode unlock for direct IO sync writes (reported by
   syzbot)

 - fix root tree id/name map definitions, don't use fixed size buffers
   for name (reported by -Werror=unterminated-string-initialization)

 - fix qgroup reserve leaks in bufferd write path

 - update scrub status structure more often so it can be reported in
   user space more accurately and let 'resume' not repeat work

 - in preparation to remove space cache v1 in the future print a warning
   if it's detected

* tag 'for-6.11-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: avoid using fixed char array size for tree names
  btrfs: fix double inode unlock for direct IO sync writes
  btrfs: emit a warning about space cache v1 being deprecated
  btrfs: fix qgroup reserve leaks in cow_file_range
  btrfs: implement launder_folio for clearing dirty page reserve
  btrfs: scrub: update last_physical after scrubbing one stripe
  btrfs: factor out stripe length calculation into a helper
2024-08-07 09:53:41 -07:00
Boris Burkov
30479f31d4 btrfs: fix qgroup reserve leaks in cow_file_range
In the buffered write path, the dirty page owns the qgroup reserve until
it creates an ordered_extent.

Therefore, any errors that occur before the ordered_extent is created
must free that reservation, or else the space is leaked. The fstest
generic/475 exercises various IO error paths, and is able to trigger
errors in cow_file_range where we fail to get to allocating the ordered
extent. Note that because we *do* clear delalloc, we are likely to
remove the inode from the delalloc list, so the inodes/pages to not have
invalidate/launder called on them in the commit abort path.

This results in failures at the unmount stage of the test that look like:

  BTRFS: error (device dm-8 state EA) in cleanup_transaction:2018: errno=-5 IO failure
  BTRFS: error (device dm-8 state EA) in btrfs_replace_file_extents:2416: errno=-5 IO failure
  BTRFS warning (device dm-8 state EA): qgroup 0/5 has unreleased space, type 0 rsv 28672
  ------------[ cut here ]------------
  WARNING: CPU: 3 PID: 22588 at fs/btrfs/disk-io.c:4333 close_ctree+0x222/0x4d0 [btrfs]
  Modules linked in: btrfs blake2b_generic libcrc32c xor zstd_compress raid6_pq
  CPU: 3 PID: 22588 Comm: umount Kdump: loaded Tainted: G W          6.10.0-rc7-gab56fde445b8 #21
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
  RIP: 0010:close_ctree+0x222/0x4d0 [btrfs]
  RSP: 0018:ffffb4465283be00 EFLAGS: 00010202
  RAX: 0000000000000001 RBX: ffffa1a1818e1000 RCX: 0000000000000001
  RDX: 0000000000000000 RSI: ffffb4465283bbe0 RDI: ffffa1a19374fcb8
  RBP: ffffa1a1818e13c0 R08: 0000000100028b16 R09: 0000000000000000
  R10: 0000000000000003 R11: 0000000000000003 R12: ffffa1a18ad7972c
  R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
  FS:  00007f9168312b80(0000) GS:ffffa1a4afcc0000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007f91683c9140 CR3: 000000010acaa000 CR4: 00000000000006f0
  Call Trace:
   <TASK>
   ? close_ctree+0x222/0x4d0 [btrfs]
   ? __warn.cold+0x8e/0xea
   ? close_ctree+0x222/0x4d0 [btrfs]
   ? report_bug+0xff/0x140
   ? handle_bug+0x3b/0x70
   ? exc_invalid_op+0x17/0x70
   ? asm_exc_invalid_op+0x1a/0x20
   ? close_ctree+0x222/0x4d0 [btrfs]
   generic_shutdown_super+0x70/0x160
   kill_anon_super+0x11/0x40
   btrfs_kill_super+0x11/0x20 [btrfs]
   deactivate_locked_super+0x2e/0xa0
   cleanup_mnt+0xb5/0x150
   task_work_run+0x57/0x80
   syscall_exit_to_user_mode+0x121/0x130
   do_syscall_64+0xab/0x1a0
   entry_SYSCALL_64_after_hwframe+0x77/0x7f
  RIP: 0033:0x7f916847a887
  ---[ end trace 0000000000000000 ]---
  BTRFS error (device dm-8 state EA): qgroup reserved space leaked

Cases 2 and 3 in the out_reserve path both pertain to this type of leak
and must free the reserved qgroup data. Because it is already an error
path, I opted not to handle the possible errors in
btrfs_free_qgroup_data.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-01 17:26:52 +02:00
Boris Burkov
872617a089 btrfs: implement launder_folio for clearing dirty page reserve
In the buffered write path, dirty pages can be said to "own" the qgroup
reservation until they create an ordered_extent. It is possible for
there to be outstanding dirty pages when a transaction is aborted, in
which case there is no cancellation path for freeing this reservation
and it is leaked.

We do already walk the list of outstanding delalloc inodes in
btrfs_destroy_delalloc_inodes() and call invalidate_inode_pages2() on them.

This does *not* call btrfs_invalidate_folio(), as one might guess, but
rather calls launder_folio() and release_folio(). Since this is a
reservation associated with dirty pages only, rather than something
associated with the private bit (ordered_extent is cancelled separately
already in the cleanup transaction path), implementing this release
should be done via launder_folio.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-01 17:26:40 +02:00
Linus Torvalds
e4fc196f5b for-6.11-rc1-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmapmOQACgkQxWXV+ddt
 WDsXVhAAi4X+xt3o4jcN3IAu08JCQAAyXnFWC3lvn7sqYjSrcccI6ZT4/gAbHss+
 qrifakRGoYQ7fAjYBmhw48HqPmHtI2OQjIUDaIqHQOS68aXShBo9HiE460HRY4GT
 QV/KT0w37E2/R0EDR9gyjLq3ZA3/raxN1n+LNCFhRWmtsAEZrk4XzsADWb05YkIq
 1QBa92DzEhVpd04X8YHIYBgRidWbcYST6xhoWdyL9VZ1pzZsISq5LH67D4f/J1KU
 gXNf+ZnF9DXsQnptJrMsjhx61seJ2F0/vozFZ+l6SjRr0jeysmrJI0dxqQc/hUga
 gbLmdha6ztKdn03JOIL+lfdZYzICFl/2fekSWI2SNcag+TYszACjlFOyHusOgKsa
 3qQwzVB699FheWO5nrOOvOtgq0ZqGsrIvhIXLhA7/bVpNavPnUB7IQCcs8n89ImQ
 hUIebfX1FZnYXTrB6Hhm92LUb0lyLSlW1we3SSmaAMiy1TiXHG7hO2G/sIbOPAJC
 5VzdHf0DEjzEdjmTrGOV7JBfy5JmMK56oN8viZS95p70DYxNGvEOhLs/8n5twpri
 MWV8GElcOjjC+KnGnUH72spsnEKONpdzyccG9kiZEgkEi4csgHSxrkSmAehYD6i6
 MFYk+i7jvZ1VsbOulmdGOLbHS7whxi9pWb/CT3KKF1Ei5/v07bU=
 =JdOX
 -----END PGP SIGNATURE-----

Merge tag 'for-6.11-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - fix regression in extent map rework when handling insertion of
   overlapping compressed extent

 - fix unexpected file length when appending to a file using direct io
   and buffer not faulted in

 - in zoned mode, fix accounting of unusable space when flipping
   read-only block group back to read-write

 - fix page locking when COWing an inline range, assertion failure found
   by syzbot

 - fix calculation of space info in debugging print

 - tree-checker, add validation of data reference item

 - fix a few -Wmaybe-uninitialized build warnings

* tag 'for-6.11-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: initialize location to fix -Wmaybe-uninitialized in btrfs_lookup_dentry()
  btrfs: fix corruption after buffer fault in during direct IO append write
  btrfs: zoned: fix zone_unusable accounting on making block group read-write again
  btrfs: do not subtract delalloc from avail bytes
  btrfs: make cow_file_range_inline() honor locked_page on error
  btrfs: fix corrupt read due to bad offset of a compressed extent map
  btrfs: tree-checker: validate dref root and objectid
2024-07-30 19:28:36 -07:00
David Sterba
b8e947e9f6 btrfs: initialize location to fix -Wmaybe-uninitialized in btrfs_lookup_dentry()
Some arch + compiler combinations report a potentially unused variable
location in btrfs_lookup_dentry(). This is a false alert as the variable
is passed by value and always valid or there's an error. The compilers
cannot probably reason about that although btrfs_inode_by_name() is in
the same file.

   >  + /kisskb/src/fs/btrfs/inode.c: error: 'location.objectid' may be used
   +uninitialized in this function [-Werror=maybe-uninitialized]:  => 5603:9
   >  + /kisskb/src/fs/btrfs/inode.c: error: 'location.type' may be used
   +uninitialized in this function [-Werror=maybe-uninitialized]:  => 5674:5

   m68k-gcc8/m68k-allmodconfig
   mips-gcc8/mips-allmodconfig
   powerpc-gcc5/powerpc-all{mod,yes}config
   powerpc-gcc5/ppc64_defconfig

Initialize it to zero, this should fix the warnings and won't change the
behaviour as btrfs_inode_by_name() accepts only a root or inode item
types, otherwise returns an error.

Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Link: https://lore.kernel.org/linux-btrfs/bd4e9928-17b3-9257-8ba7-6b7f9bbb639a@linux-m68k.org/
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-30 15:33:06 +02:00
Boris Burkov
478574370b btrfs: make cow_file_range_inline() honor locked_page on error
The btrfs buffered write path runs through __extent_writepage() which
has some tricky return value handling for writepage_delalloc().
Specifically, when that returns 1, we exit, but for other return values
we continue and end up calling btrfs_folio_end_all_writers(). If the
folio has been unlocked (note that we check the PageLocked bit at the
start of __extent_writepage()), this results in an assert panic like
this one from syzbot:

  BTRFS: error (device loop0 state EAL) in free_log_tree:3267: errno=-5 IO failure
  BTRFS warning (device loop0 state EAL): Skipping commit of aborted transaction.
  BTRFS: error (device loop0 state EAL) in cleanup_transaction:2018: errno=-5 IO failure
  assertion failed: folio_test_locked(folio), in fs/btrfs/subpage.c:871
  ------------[ cut here ]------------
  kernel BUG at fs/btrfs/subpage.c:871!
  Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI
  CPU: 1 PID: 5090 Comm: syz-executor225 Not tainted
  6.10.0-syzkaller-05505-gb1bc554e009e #0
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
  Google 06/27/2024
  RIP: 0010:btrfs_folio_end_all_writers+0x55b/0x610 fs/btrfs/subpage.c:871
  Code: e9 d3 fb ff ff e8 25 22 c2 fd 48 c7 c7 c0 3c 0e 8c 48 c7 c6 80 3d
  0e 8c 48 c7 c2 60 3c 0e 8c b9 67 03 00 00 e8 66 47 ad 07 90 <0f> 0b e8
  6e 45 b0 07 4c 89 ff be 08 00 00 00 e8 21 12 25 fe 4c 89
  RSP: 0018:ffffc900033d72e0 EFLAGS: 00010246
  RAX: 0000000000000045 RBX: 00fff0000000402c RCX: 663b7a08c50a0a00
  RDX: 0000000000000000 RSI: 0000000080000000 RDI: 0000000000000000
  RBP: ffffc900033d73b0 R08: ffffffff8176b98c R09: 1ffff9200067adfc
  R10: dffffc0000000000 R11: fffff5200067adfd R12: 0000000000000001
  R13: dffffc0000000000 R14: 0000000000000000 R15: ffffea0001cbee80
  FS:  0000000000000000(0000) GS:ffff8880b9500000(0000)
  knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007f5f076012f8 CR3: 000000000e134000 CR4: 00000000003506f0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
  <TASK>
  __extent_writepage fs/btrfs/extent_io.c:1597 [inline]
  extent_write_cache_pages fs/btrfs/extent_io.c:2251 [inline]
  btrfs_writepages+0x14d7/0x2760 fs/btrfs/extent_io.c:2373
  do_writepages+0x359/0x870 mm/page-writeback.c:2656
  filemap_fdatawrite_wbc+0x125/0x180 mm/filemap.c:397
  __filemap_fdatawrite_range mm/filemap.c:430 [inline]
  __filemap_fdatawrite mm/filemap.c:436 [inline]
  filemap_flush+0xdf/0x130 mm/filemap.c:463
  btrfs_release_file+0x117/0x130 fs/btrfs/file.c:1547
  __fput+0x24a/0x8a0 fs/file_table.c:422
  task_work_run+0x24f/0x310 kernel/task_work.c:222
  exit_task_work include/linux/task_work.h:40 [inline]
  do_exit+0xa2f/0x27f0 kernel/exit.c:877
  do_group_exit+0x207/0x2c0 kernel/exit.c:1026
  __do_sys_exit_group kernel/exit.c:1037 [inline]
  __se_sys_exit_group kernel/exit.c:1035 [inline]
  __x64_sys_exit_group+0x3f/0x40 kernel/exit.c:1035
  x64_sys_call+0x2634/0x2640
  arch/x86/include/generated/asm/syscalls_64.h:232
  do_syscall_x64 arch/x86/entry/common.c:52 [inline]
  do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
  entry_SYSCALL_64_after_hwframe+0x77/0x7f
  RIP: 0033:0x7f5f075b70c9
  Code: Unable to access opcode bytes at
  0x7f5f075b709f.

I was hitting the same issue by doing hundreds of accelerated runs of
generic/475, which also hits IO errors by design.

I instrumented that reproducer with bpftrace and found that the
undesirable folio_unlock was coming from the following callstack:

  folio_unlock+5
  __process_pages_contig+475
  cow_file_range_inline.constprop.0+230
  cow_file_range+803
  btrfs_run_delalloc_range+566
  writepage_delalloc+332
  __extent_writepage # inlined in my stacktrace, but I added it here
  extent_write_cache_pages+622

Looking at the bisected-to patch in the syzbot report, Josef realized
that the logic of the cow_file_range_inline error path subtly changing.
In the past, on error, it jumped to out_unlock in cow_file_range(),
which honors the locked_page, so when we ultimately call
folio_end_all_writers(), the folio of interest is still locked. After
the change, we always unlocked ignoring the locked_page, on both success
and error. On the success path, this all results in returning 1 to
__extent_writepage(), which skips the folio_end_all_writers() call,
which makes it OK to have unlocked.

Fix the bug by wiring the locked_page into cow_file_range_inline() and
only setting locked_page to NULL on success.

Reported-by: syzbot+a14d8ac9af3a2a4fd0c8@syzkaller.appspotmail.com
Fixes: 0586d0a89e ("btrfs: move extent bit and page cleanup into cow_file_range_inline")
CC: stable@vger.kernel.org # 6.10+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-29 19:20:51 +02:00
Linus Torvalds
a1b547f0f2 for-6.11-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmaVN3MACgkQxWXV+ddt
 WDtpIRAAl+1NjsEj8e5V/UYn8Jr06ujTOnrkR3PCTICxDHbUaMLkQEw21H0K/ogQ
 3fOiEVpSlZOfKdYXtXaMQbC0jd/Af2eA10Uht96nAEjAtxu1uJ4cFZGu2meNdXZP
 xUioivJ/CElMPH2aluG6FaQvUTqmhrEr8tSoYbxzQmUd434q9kqqyjtw1tfzYDG1
 VDn2f7ykhpB/8P0aoqgWSshWTmaCzG0GkuI28o1o0iZUIF/P9TKdzxlLRW6BVHE7
 T2oGLEQjN1GQbCH75L4IeNJDkCBVfcDcbZkUDJ/ae4Pt/jJQTFY53YIP9wXFZQnd
 mdfHmK7Atpsk75ATftYSq+ENkbQ5fsuut5CD63u54gAqA4M1FncDXTAWS1Y30F76
 P8juSCmsSy0o3gTflDIo/IMdntoh/JmncwwStF6oKzmyUZZzzarsqM8mc1P03ZNt
 3ttlnbY7lC1TDAlD5J2wXE0INCT2pN+4C9IToWdRypeuLu6qrI7cQ0oylyp9OVQM
 t9umTXm0B6s1cyqEDjJf0xJZS/JTHYwu7S4EmAJwicgiLpOjABVTmO8021rVmDJy
 TAUu6yEhSsrTT6Dxm7/2Et1EEOKFF5hhsG1SiGD9oUIZK6B5+0waT+rbkEWl7osR
 4/TAv2zX6tuCc7HIW0fQloM/6/Gyd5wcDVaQNDUzFA075uKstwY=
 =k5d3
 -----END PGP SIGNATURE-----

Merge tag 'for-6.11-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs updates from David Sterba:
 "The highlights are new logic behind background block group reclaim,
  automatic removal of qgroup after removing a subvolume and new
  'rescue=' mount options.

  The rest is optimizations, cleanups and refactoring.

  User visible features:

   - dynamic block group reclaim:
      - tunable framework to avoid situations where eager data
        allocations prevent creating new metadata chunks due to lack of
        unallocated space
      - reuse sysfs knob bg_reclaim_threshold (otherwise used only in
        zoned mode) for a fixed value threshold
      - new on/off sysfs knob "dynamic_reclaim" calculating the value
        based on heuristics, aiming to keep spare working space for
        relocating chunks but not to needlessly relocate partially
        utilized block groups or reclaim newly allocated ones
      - stats are exported in sysfs per block group type, files
        "reclaim_*"
      - this may increase IO load at unexpected times but the corner
        case of no allocatable block groups is known to be worse

   - automatically remove qgroup of deleted subvolumes:
      - adjust qgroup removal conditions, make sure all related
        subvolume data are already removed, or return EBUSY, also take
        into account setting of sysfs drop_subtree_threshold
      - also works in squota mode

   - mount option updates: new modes of 'rescue=' that allow to mount
     images (read-only) that could have been partially converted by user
     space tools
      - ignoremetacsums  - invalid metadata checksums are ignored
      - ignoresuperflags - super block flags that track conversion in
                           progress (like UUID or checksums)

  Core:

   - size of struct btrfs_inode is now below 1024 (on a release config),
     improved memory packing and other secondary effects

   - switch tracking of open inodes from rb-tree to xarray, minor
     performance improvement

   - reduce number of empty transaction commits when there are no dirty
     data/metadata

   - memory allocation optimizations (reduced numbers, reordering out of
     critical sections)

   - extent map structure optimizations and refactoring, more sanity
     checks

   - more subpage in zoned mode preparations or fixes

   - general snapshot code cleanups, improvements and documentation

   - tree-checker updates: more file extent ram_bytes fixes, continued

   - raid-stripe-tree update (not backward compatible):
      - remove extent encoding field from the structure, can be inferred
        from other information
      - requires btrfs-progs 6.9.1 or newer

   - cleanups and refactoring
      - error message updates
      - error handling improvements
      - return type and parameter cleanups and improvements"

* tag 'for-6.11-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (152 commits)
  btrfs: fix extent map use-after-free when adding pages to compressed bio
  btrfs: fix bitmap leak when loading free space cache on duplicate entry
  btrfs: remove the BUG_ON() inside extent_range_clear_dirty_for_io()
  btrfs: move extent_range_clear_dirty_for_io() into inode.c
  btrfs: enhance compression error messages
  btrfs: fix data race when accessing the last_trans field of a root
  btrfs: rename the extra_gfp parameter of btrfs_alloc_page_array()
  btrfs: remove the extra_gfp parameter from btrfs_alloc_folio_array()
  btrfs: introduce new "rescue=ignoresuperflags" mount option
  btrfs: introduce new "rescue=ignoremetacsums" mount option
  btrfs: output the unrecognized super block flags as hex
  btrfs: remove unused Opt enums
  btrfs: tree-checker: add extra ram_bytes and disk_num_bytes check
  btrfs: fix the ram_bytes assignment for truncated ordered extents
  btrfs: make validate_extent_map() catch ram_bytes mismatch
  btrfs: ignore incorrect btrfs_file_extent_item::ram_bytes
  btrfs: cleanup the bytenr usage inside btrfs_extent_item_to_extent_map()
  btrfs: fix typo in error message in btrfs_validate_super()
  btrfs: move the direct IO code into its own file
  btrfs: pass a btrfs_inode to btrfs_set_prop()
  ...
2024-07-17 12:38:04 -07:00
Linus Torvalds
2aae1d67fd vfs-6.11.inode
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZpEG2wAKCRCRxhvAZXjc
 ooW/AQDzyY+xNGt4OPMvlyFUHd5RcyiLsMhYrkKc3FaIFjesVgD+PFW5PPW12c0V
 Z4VHg9w1HDDuUn4XvELs7OXZpek7RgU=
 =eDC8
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.11.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs inode / dentry updates from Christian Brauner:
 "This contains smaller performance improvements to inodes and dentries:

  inode:

   - Add rcu based inode lookup variants.

     They avoid one inode hash lock acquire in the common case thereby
     significantly reducing contention. We already support RCU-based
     operations but didn't take advantage of them during inode
     insertion.

     Callers of iget_locked() get the improvement without any code
     changes. Callers that need a custom callback can switch to
     iget5_locked_rcu() as e.g., did btrfs.

     With 20 threads each walking a dedicated 1000 dirs * 1000 files
     directory tree to stat(2) on a 32 core + 24GB ram vm:

        before: 3.54s user 892.30s system 1966% cpu 45.549 total
        after:  3.28s user 738.66s system 1955% cpu 37.932 total (-16.7%)

     Long-term we should pick up the effort to introduce more
     fine-grained locking and possibly improve on the currently used
     hash implementation.

   - Start zeroing i_state in inode_init_always() instead of doing it in
     individual filesystems.

     This allows us to remove an unneeded lock acquire in new_inode()
     and not burden individual filesystems with this.

  dcache:

   - Move d_lockref out of the area used by RCU lookup to avoid
     cacheline ping poing because the embedded name is sharing a
     cacheline with d_lockref.

   - Fix dentry size on 32bit with CONFIG_SMP=y so it does actually end
     up with 128 bytes in total"

* tag 'vfs-6.11.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  fs: fix dentry size
  vfs: move d_lockref out of the area used by RCU lookup
  bcachefs: remove now spurious i_state initialization
  xfs: remove now spurious i_state initialization in xfs_inode_alloc
  vfs: partially sanitize i_state zeroing on inode creation
  xfs: preserve i_state around inode_init_always in xfs_reinit_inode
  btrfs: use iget5_locked_rcu
  vfs: add rcu-based find_inode variants for iget ops
2024-07-15 11:39:44 -07:00
Qu Wenruo
a39484371d btrfs: remove the BUG_ON() inside extent_range_clear_dirty_for_io()
Previously we had a BUG_ON() inside extent_range_clear_dirty_for_io(), as
we expected all involved folios to be still locked, thus no folio should be
missing.

However for extent_range_clear_dirty_for_io() itself, we can skip the
missing folio and handle the remaining ones, and return an error if
there is anything wrong.

Remove the BUG_ON() and let the caller to handle the error.
In the caller we do not have a quick way to cleanup the error, but all
the compression routines would handle the missing folio as an error and
properly error out, so we only need to do an ASSERT() for developers,
while for non-debug build the compression routine would handle the
error correctly.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:52:25 +02:00
Qu Wenruo
af61081fb5 btrfs: move extent_range_clear_dirty_for_io() into inode.c
The function is only used inside inode.c by compress_file_range(),
so move it to inode.c and unexport it.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:52:25 +02:00
Qu Wenruo
0fbf6cbd72 btrfs: rename the extra_gfp parameter of btrfs_alloc_page_array()
There is only one caller utilizing the @extra_gfp parameter,
alloc_eb_folio_array().  And in that case the extra_gfp is only assigned
to __GFP_NOFAIL.

Rename the @extra_gfp parameter to @nofail to indicate that.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:30 +02:00
Qu Wenruo
896c8b92dd btrfs: fix the ram_bytes assignment for truncated ordered extents
[HICCUP]
After adding extra checks on btrfs_file_extent_item::ram_bytes to
tree-checker, running fsstress leads to tree-checker warning at write time,
as we created file extent items with an invalid ram_bytes.

All those offending file extents have offset 0, and ram_bytes matching
num_bytes, and smaller than disk_num_bytes.

This would also trigger the recently enhanced btrfs-check, which catches
such mismatches and report them as minor errors.

[CAUSE]
When a folio/page is invalidated and it is part of a submitted OE, we
mark the OE truncated just to the beginning of the folio/page.

And for truncated OE, we insert the file extent item with incorrect
value for ram_bytes (using num_bytes instead of the usual value).

This is not a big deal for end users, as we do not utilize the ram_bytes
field for regular non-compressed extents.
This mismatch is just a small violation against on-disk format.

[FIX]
Fix it by removing the override on btrfs_file_extent_item::ram_bytes.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:29 +02:00
Filipe Manana
9aa29a20b7 btrfs: move the direct IO code into its own file
The direct IO code is over a thousand lines and it's currently spread
between file.c and inode.c, which makes it not easy to locate some parts
of it sometimes. Also inode.c is about 11 thousand lines and file.c about
4 thousand lines, both too big. So move all the direct IO code into a
dedicated file, so that it's easy to locate all its code and reduce the
sizes of inode.c and file.c.

This is a pure move of code without any other changes except export a
a couple functions from inode.c (get_extent_allocation_hint() and
create_io_em()) because they are used in inode.c and the new direct-io.c
file, and a couple functions from file.c (btrfs_buffered_write() and
btrfs_write_check()) because they are used both in file.c and in the new
direct-io.c file.

Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:29 +02:00
David Sterba
e2877c2a03 btrfs: pass a btrfs_inode to btrfs_compress_heuristic()
Pass a struct btrfs_inode to btrfs_compress_heuristic() as it's an
internal interface, allowing to remove some use of BTRFS_I.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:28 +02:00
David Sterba
a1f4e3d7bd btrfs: switch btrfs_ordered_extent::inode to struct btrfs_inode
The structure is internal so we should use struct btrfs_inode for that,
allowing to remove some use of BTRFS_I.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:28 +02:00
David Sterba
a0d7e98ced btrfs: pass a btrfs_inode to btrfs_readdir_get_delayed_items()
Pass a struct btrfs_inode to btrfs_readdir_get_delayed_items() as it's
an internal interface, allowing to remove some use of BTRFS_I.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:28 +02:00
David Sterba
849c01ae90 btrfs: pass a btrfs_inode to btrfs_readdir_put_delayed_items()
Pass a struct btrfs_inode to btrfs_readdir_put_delayed_items() as it's
an internal interface, allowing to remove some use of BTRFS_I.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:28 +02:00
Filipe Manana
b751915765 btrfs: remove super block argument from btrfs_iget_locked()
It's pointless to pass a super block argument to btrfs_iget_locked()
because we always pass a root and from it we can get the super block
through:

   root->fs_info->sb

So remove the super block argument.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:25 +02:00
Filipe Manana
d383eb69eb btrfs: remove super block argument from btrfs_iget_path()
It's pointless to pass a super block argument to btrfs_iget_path() because
we always pass a root and from it we can get the super block through:

   root->fs_info->sb

So remove the super block argument.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:25 +02:00
Filipe Manana
d13240dd0a btrfs: remove super block argument from btrfs_iget()
It's pointless to pass a super block argument to btrfs_iget() because we
always pass a root and from it we can get the super block through:

   root->fs_info->sb

So remove the super block argument.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:25 +02:00
Filipe Manana
ebc7c7678e btrfs: remove pointless code when creating and deleting a subvolume
When creating and deleting a subvolume, after starting a transaction we
are explicitly calling btrfs_record_root_in_trans() for the root which we
passed to btrfs_start_transaction(). This is pointless because at
transaction.c:start_transaction() we end up doing that call, regardless
of whether we actually start a new transaction or join an existing one,
and if we were not it would mean the root item of that root would not
be updated in the root tree when committing the transaction, leading to
problems easy to spot with fstests for example.

Remove these redundant calls. They were introduced with commit
74e9795812 ("btrfs: qgroup: fix qgroup prealloc rsv leak in subvolume
operations").

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:24 +02:00
Qu Wenruo
04ef7631bf btrfs: cleanup duplicated parameters related to btrfs_create_dio_extent()
The following 3 parameters can be cleaned up using btrfs_file_extent
structure:

- len
  btrfs_file_extent::num_bytes

- orig_block_len
  btrfs_file_extent::disk_num_bytes

- ram_bytes
  btrfs_file_extent::ram_bytes

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:21 +02:00
Qu Wenruo
9fec848b3a btrfs: cleanup duplicated parameters related to create_io_em()
Most parameters of create_io_em() can be replaced by the members with
the same name inside btrfs_file_extent.

Do a direct parameters cleanup here.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:21 +02:00
Qu Wenruo
e9ea31fb5c btrfs: cleanup duplicated parameters related to btrfs_alloc_ordered_extent
All parameters after @filepos of btrfs_alloc_ordered_extent() can be
replaced with btrfs_file_extent structure.

This patch does the cleanup, meanwhile some points to note:

- Move btrfs_file_extent structure to ordered-data.h
  The structure is needed by both btrfs_alloc_ordered_extent() and
  can_nocow_extent(), but since btrfs_inode.h includes
  ordered-data.h, so we need to move the structure to ordered-data.h.

- Move the special handling of NOCOW/PREALLOC into
  btrfs_alloc_ordered_extent()
  This is to allow btrfs_split_ordered_extent() to properly split them
  for DIO.
  For now just move the handling into btrfs_alloc_ordered_extent() to
  simplify the callers.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:21 +02:00
Qu Wenruo
cdc627e65c btrfs: cleanup duplicated parameters related to can_nocow_file_extent_args
The following functions and structures can be simplified using the
btrfs_file_extent structure:

- can_nocow_extent()
  No need to return ram_bytes/orig_block_len through the parameter list,
  the @file_extent parameter contains all the needed info.

- can_nocow_file_extent_args
  The following members are no longer needed:

  * disk_bytenr
    This one is confusing as it's not really the
    btrfs_file_extent_item::disk_bytenr, but where the IO would be,
    thus it's file_extent::disk_bytenr + file_extent::offset now.

  * num_bytes
    Now file_extent::num_bytes.

  * extent_offset
    Now file_extent::offset.

  * disk_num_bytes
    Now file_extent::disk_num_bytes.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:21 +02:00
Qu Wenruo
c77a8c6100 btrfs: remove extent_map::block_start member
The member extent_map::block_start can be calculated from
extent_map::disk_bytenr + extent_map::offset for regular extents.
And otherwise just extent_map::disk_bytenr.

And this is already validated by the validate_extent_map().  Now we can
remove the member.

However there is a special case in btrfs_create_dio_extent() where we
for NOCOW/PREALLOC ordered extents cannot directly use the resulting
btrfs_file_extent, as btrfs_split_ordered_extent() cannot handle them
yet.

So for that call site, we pass file_extent->disk_bytenr +
file_extent->num_bytes as disk_bytenr for the ordered extent, and 0 for
offset.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:21 +02:00
Qu Wenruo
e28b851ed9 btrfs: remove extent_map::block_len member
The extent_map::block_len is either extent_map::len (non-compressed
extent) or extent_map::disk_num_bytes (compressed extent).

Since we already have sanity checks to do the cross-checks between the
new and old members, we can drop the old extent_map::block_len now.

For most call sites, they can manually select extent_map::len or
extent_map::disk_num_bytes, since most if not all of them have checked
if the extent is compressed.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:20 +02:00
Qu Wenruo
4aa7b5d178 btrfs: remove extent_map::orig_start member
Since we have extent_map::offset, the old extent_map::orig_start is just
extent_map::start - extent_map::offset for non-hole/inline extents.

And since the new extent_map::offset is already verified by
validate_extent_map() while the old orig_start is not, let's just remove
the old member from all call sites.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:20 +02:00
Qu Wenruo
3d2ac99224 btrfs: introduce new members for extent_map
Introduce two new members for extent_map:

- disk_bytenr
- offset

Both are matching the members with the same name inside
btrfs_file_extent_items.

For now this patch only touches those members when:

- Reading btrfs_file_extent_items from disk
- Inserting new holes
- Merging two extent maps
  With the new disk_bytenr and disk_num_bytes, doing merging would be a
  little more complex, as we have 3 different cases:

  * Both extent maps are referring to the same data extents
    |<----- data extent A ----->|
       |<- em 1 ->|<- em 2 ->|

  * Both extent maps are referring to different data extents
    |<-- data extent A -->|<-- data extent B -->|
               |<- em 1 ->|<- em 2 ->|

  * One of the extent maps is referring to a merged and larger data
    extent that covers both extent maps

    This is not really valid case other than some selftests.
    So this test case would be removed.

  A new helper merge_ondisk_extents() is introduced to handle the above
  valid cases.

To properly assign values for those new members, a new btrfs_file_extent
parameter is introduced to all the involved call sites.

- For NOCOW writes the btrfs_file_extent would be exposed from
  can_nocow_file_extent().

- For other writes, the members can be easily calculated
  As most of them have 0 offset and utilizing the whole on-disk data
  extent.
  The exception is encoded write, but thankfully that interface provided
  offset directly and all other needed info.

For now, both the old members (block_start/block_len/orig_start) are
co-existing with the new members (disk_bytenr/offset), meanwhile all the
critical code is still using the old members only.

The cleanup will happen later after all the old and new members are
properly validated.

There would be some re-ordering for the assignment of the extent_map
members, now we follow the new ordering:

- start and len
  Or file_pos and num_bytes for other structures.

- disk_bytenr and disk_num_bytes
- offset and ram_bytes
- compression

So expect some seemingly unrelated line movement.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:20 +02:00
Qu Wenruo
87a6962f73 btrfs: export the expected file extent through can_nocow_extent()
Currently function can_nocow_extent() only returns members needed for
extent_map.

However since we will soon change the extent_map structure to be more
like btrfs_file_extent_item, we want to expose the expected file extent
caused by the NOCOW write for future usage.

This introduces a new structure, btrfs_file_extent, to be a more
memory access friendly representation of btrfs_file_extent_item.
And use that structure to expose the expected file extent caused by the
NOCOW write.

For now there is no user of the new structure yet.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:20 +02:00
Qu Wenruo
e8fe524da0 btrfs: rename extent_map::orig_block_len to disk_num_bytes
This would make it very obvious that the member just matches
btrfs_file_extent_item::disk_num_bytes.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:20 +02:00
Filipe Manana
8996f61ab9 btrfs: move fiemap code into its own file
Currently the core of the fiemap code lives in extent_io.c, which does
not make any sense because it's not related to extent IO at all (and it
was not as well before the big rewrite of fiemap I did some time ago).
The entry point for fiemap, btrfs_fiemap(), lives in inode.c since it's
an inode operation.

Since there's a significant amount of fiemap code, move all of it into a
dedicated file, including its entry point inode.c:btrfs_fiemap().

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:20 +02:00
David Sterba
9c5e1fb024 btrfs: remove duplicate name variable declarations
When running 'make W=2' there are a few reports where a variable of the
same name is declared in a nested block. In all the cases we can use the
one declared in the parent block, no problematic cases were found.

Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:18 +02:00
Filipe Manana
e641e323ab btrfs: pass a btrfs_inode to btrfs_wait_ordered_range()
Instead of passing a (VFS) inode pointer argument, pass a btrfs_inode
instead, as this is generally what we do for internal APIs, making it
more consistent with most of the code base. This will later allow to
help to remove a lot of BTRFS_I() calls in btrfs_sync_file().

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:18 +02:00
Filipe Manana
7a7bc21449 btrfs: remove objectid from struct btrfs_inode on 64 bits platforms
On 64 bits platforms we don't really need to have a dedicated member (the
objectid field) for the inode's number since we store in the VFS inode's
i_ino member, which is an unsigned long and this type is 64 bits wide on
64 bits platforms. We only need that field in case we are on a 32 bits
platform because the unsigned long type is 32 bits wide on such platforms
See commit 33345d0152 ("Btrfs: Always use 64bit inode number") regarding
this 64/32 bits detail.

The objectid field of struct btrfs_inode is also used to store the ID of
a root for directories that are stubs for unreferenced roots. In such
cases the inode is a directory and has the BTRFS_INODE_ROOT_STUB runtime
flag set.

So in order to reduce the size of btrfs_inode structure on 64 bits
platforms we can remove the objectid member and use the VFS inode's i_ino
member instead whenever we need to get the inode number. In case the inode
is a root stub (BTRFS_INODE_ROOT_STUB set) we can use the member
last_reflink_trans to store the ID of the unreferenced root, since such
inode is a directory and reflinks can't be done against directories.

So remove the objectid fields for 64 bits platforms and alias the
last_reflink_trans field with a name of ref_root_id in a union.
On a release kernel config, this reduces the size of struct btrfs_inode
from 1040 bytes down to 1032 bytes.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:17 +02:00
Filipe Manana
068fc8f914 btrfs: remove location key from struct btrfs_inode
Currently struct btrfs_inode has a key member, named "location", that is
either:

1) The key of the inode's item. In this case the objectid is the number
   of the inode;

2) A key stored in a dir entry with a type of BTRFS_ROOT_ITEM_KEY, for
   the case where we have a root that is a snapshot of a subvolume that
   points to other subvolumes. In this case the objectid is the ID of
   a subvolume inside the snapshotted parent subvolume.

The key is only used to lookup the inode item for the first case, while
for the second it's never used since it corresponds to directory stubs
created with new_simple_dir() and which are marked as dummy, so there's
no actual inode item to ever update. In the second case we only check
the key type at btrfs_ino() for 32 bits platforms and its objectid is
only needed for unlink.

Instead of using a key we can do fine with just the objectid, since we
can generate the key whenever we need it having only the objectid, as
in all use cases the type is always BTRFS_INODE_ITEM_KEY and the offset
is always 0.

So use only an objectid instead of a full key. This reduces the size of
struct btrfs_inode from 1048 bytes down to 1040 bytes on a release kernel.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:17 +02:00
Filipe Manana
3d7db6e8bd btrfs: don't allocate file extent tree for non regular files
When not using the NO_HOLES feature we always allocate an io tree for an
inode's file_extent_tree. This is wasteful because that io tree is only
used for regular files, so we allocate more memory than needed for inodes
that represent directories or symlinks for example, or for inodes that
correspond to free space inodes.

So improve on this by allocating the io tree only for inodes of regular
files that are not free space inodes.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:17 +02:00
Filipe Manana
d9891ae28b btrfs: unify index_cnt and csum_bytes from struct btrfs_inode
The index_cnt field of struct btrfs_inode is used only for two purposes:

1) To store the index for the next entry added to a directory;

2) For the data relocation inode to track the logical start address of the
   block group currently being relocated.

For the relocation case we use index_cnt because it's not used for
anything else in the relocation use case - we could have used other fields
that are not used by relocation such as defrag_bytes, last_unlink_trans
or last_reflink_trans for example (among others).

Since the csum_bytes field is not used for directories, do the following
changes:

1) Put index_cnt and csum_bytes in a union, and index_cnt is only
   initialized when the inode is a directory. The csum_bytes is only
   accessed in IO paths for regular files, so we're fine here;

2) Use the defrag_bytes field for relocation, since the data relocation
   inode is never used for defrag purposes. And to make the naming better,
   alias it to reloc_block_group_start by using a union.

This reduces the size of struct btrfs_inode by 8 bytes in a release
kernel, from 1056 bytes down to 1048 bytes.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:17 +02:00
Filipe Manana
e2844cce75 btrfs: remove inode_lock from struct btrfs_root and use xarray locks
Currently we use the spinlock inode_lock from struct btrfs_root to
serialize access to two different data structures:

1) The delayed inodes xarray (struct btrfs_root::delayed_nodes);
2) The inodes xarray (struct btrfs_root::inodes).

Instead of using our own lock, we can use the spinlock that is part of the
xarray implementation, by using the xa_lock() and xa_unlock() APIs and
using the xarray APIs with the double underscore prefix that don't take
the xarray locks and assume the caller is using xa_lock() and xa_unlock().

So remove the spinlock inode_lock from struct btrfs_root and use the
corresponding xarray locks. This brings 2 benefits:

1) We reduce the size of struct btrfs_root, from 1336 bytes down to
   1328 bytes on a 64 bits release kernel config;

2) We reduce lock contention by not using anymore the same lock for
   changing two different and unrelated xarrays.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:17 +02:00
Filipe Manana
d25f4ec176 btrfs: reduce nesting and deduplicate error handling at btrfs_iget_path()
Make btrfs_iget_path() simpler and easier to read by avoiding nesting of
if-then-else statements and having an error label to do all the error
handling instead of repeating it a couple times.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:17 +02:00
Filipe Manana
061ea8581b btrfs: preallocate inodes xarray entry to avoid transaction abort
When creating a new inode, at btrfs_create_new_inode(), one of the very
last steps is to add the inode to the root's inodes xarray. This often
requires allocating memory which may fail (even though xarrays have a
dedicated kmem_cache which make it less likely to fail), and at that point
we are forced to abort the current transaction (as some, but not all, of
the inode metadata was added to its subvolume btree).

To avoid a transaction abort, preallocate memory for the xarray early at
btrfs_create_new_inode(), so that if we fail we don't need to abort the
transaction and the insertion into the xarray is guaranteed to succeed.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:17 +02:00
Filipe Manana
310b2f5d5a btrfs: use an xarray to track open inodes in a root
Currently we use a red black tree (rb-tree) to track the currently open
inodes of a root (in struct btrfs_root::inode_tree). This however is not
very efficient when the number of inodes is large since rb-trees are
binary trees. For example for 100K open inodes, the tree has a depth of
17. Besides that, inserting into the tree requires navigating through it
and pulling useless cache lines in the process since the red black tree
nodes are embedded within the btrfs inode - on the other hand, by being
embedded, it requires no extra memory allocations.

We can improve this by using an xarray instead, which is efficient when
indices are densely clustered (such as inode numbers), is more cache
friendly and behaves like a resizable array, with a much better search
and insertion complexity than a red black tree. This only has one small
disadvantage which is that insertion will sometimes require allocating
memory for the xarray - which may fail (not that often since it uses a
kmem_cache) - but on the other hand we can reduce the btrfs inode
structure size by 24 bytes (from 1080 down to 1056 bytes) after removing
the embedded red black tree node, which after the next patches will allow
to reduce the size of the structure to 1024 bytes, meaning we will be able
to store 4 inodes per 4K page instead of 3 inodes.

This change does a straightforward change to use an xarray, and results
in a transaction abort if we can't allocate memory for the xarray when
creating an inode - but the next patch changes things so that we don't
need to abort.

Running the following fs_mark test showed some improvements:

    $ cat test.sh
    #!/bin/bash

    DEV=/dev/nullb0
    MNT=/mnt/nullb0
    MOUNT_OPTIONS="-o ssd"
    FILES=100000
    THREADS=$(nproc --all)

    echo "performance" | \
        tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

    mkfs.btrfs -f $DEV
    mount $MOUNT_OPTIONS $DEV $MNT

    OPTS="-S 0 -L 5 -n $FILES -s 0 -t $THREADS -k"
    for ((i = 1; i <= $THREADS; i++)); do
        OPTS="$OPTS -d $MNT/d$i"
    done

    fs_mark $OPTS

    umount $MNT

Before this patch:

    FSUse%        Count         Size    Files/sec     App Overhead
        10      1200000            0      92081.6         12505547
        16      2400000            0     138222.6         13067072
        23      3600000            0     148833.1         13290336
        43      4800000            0      97864.7         13931248
        53      6000000            0      85597.3         14384313

After this patch:

    FSUse%        Count         Size    Files/sec     App Overhead
        10      1200000            0      93225.1         12571078
        16      2400000            0     146720.3         12805007
        23      3600000            0     160626.4         13073835
        46      4800000            0     116286.2         13802927
        53      6000000            0      90087.9         14754892

The test was run with a release kernel config (Debian's default config).

Also capturing the insertion times into the rb tree and into the xarray,
that is measuring the duration of the old function inode_tree_add() and
the duration of the new btrfs_add_inode_to_root() function, gave the
following results (in nanoseconds):

Before this patch, inode_tree_add() execution times:

   Count: 5000000
   Range:  0.000 - 5536887.000; Mean: 775.674; Median: 729.000; Stddev: 4820.961
   Percentiles:  90th: 1015.000; 95th: 1139.000; 99th: 1397.000
         0.000 -       7.816:      40 |
         7.816 -      37.858:     209 |
        37.858 -     170.278:    6059 |
       170.278 -     753.961: 2754890 #####################################################
       753.961 -    3326.728: 2232312 ###########################################
      3326.728 -   14667.018:    4366 |
     14667.018 -   64652.943:     852 |
     64652.943 -  284981.761:     550 |
    284981.761 - 1256150.914:     221 |
   1256150.914 - 5536887.000:       7 |

After this patch, btrfs_add_inode_to_root() execution times:

   Count: 5000000
   Range:  0.000 - 2900652.000; Mean: 272.148; Median: 241.000; Stddev: 2873.369
   Percentiles:  90th: 342.000; 95th: 432.000; 99th: 572.000
        0.000 -       7.264:     104 |
        7.264 -      33.145:     352 |
       33.145 -     140.081:  109606 #
      140.081 -     581.930: 4840090 #####################################################
      581.930 -    2407.590:   43532 |
     2407.590 -    9950.979:    2245 |
     9950.979 -   41119.278:     514 |
    41119.278 -  169902.616:     155 |
   169902.616 -  702018.539:      47 |
   702018.539 - 2900652.000:       9 |

Average, percentiles, standard deviation, etc, are all much better.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:17 +02:00
Linus Torvalds
661e504db0 for-6.10-rc6-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmaG0jUACgkQxWXV+ddt
 WDsg4Q//f7xRooomsRaRvIDN4Add17jyZWuhreMhxKKvdtoRzLq/YZUldGJt1XXu
 tVpCvKkVvrJt5UY3b1czfA8GTHeOInZSoqyV8ZaxOAcg6cfKrUtLFi1wAKGU6BK/
 S9Zz/9lfbjXhG4PnZW7GRFmH0n68p/dmS+kXPVVPhj7CibxuZJtifflK1r6rHw+Y
 PDDnMIPqbrdK+xBfR+SEpG+1uIEFvI6SgsAZG6p0mzcbbUINGp7SpU3v2NVuvdIX
 tjTlZhz+D2do4dPk3RJ4Z6dpL+VlZnR1F9CZn6SmPcdGWn6mTkywoukJPP62iHYO
 b8Xr4j1mKDPcu55OEPWwhYWOvRP4FvltrGB+35xsQoPlKBBKxSMFgNyEkdfvjtKU
 qL9qRYJvm8D2OBqdv5BnawKRSouPcR4KPiM9JSGWRcW6vdOiY3Cd1mnAR8EA1LSd
 SRYmBlrZIIjOpEeGkEy2MeNKEH9L5B1wAU21VQd+cAlCKBFUmFb1csZ/vCJJ1B34
 swedmirNx0lyZTNBGQsDipkxSM0c/0xj/ysG8N41bLvPUM35x0K9xw5ZqBlTXrrg
 Lln51s6TeqSzfEvKdDGikBL3dVhT5I6acRUchQU9YSZ3Bqxj3QCTNnOZMZQsCFkp
 CMohduRVHiK4LHQTkU++q8bT8xX20kh/1HIJhAjdnDbrlRY4oTU=
 =tyXO
 -----END PGP SIGNATURE-----

Merge tag 'for-6.10-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - fix folio refcounting when releasing them (encoded write, dummy
   extent buffer)

 - fix out of bounds read when checking qgroup inherit data

 - fix how configurable chunk size is handled in zoned mode

 - in the ref-verify tool, fix uninitialized return value when checking
   extent owner ref and simple quota are not enabled

* tag 'for-6.10-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix folio refcount in __alloc_dummy_extent_buffer()
  btrfs: fix folio refcount in btrfs_do_encoded_write()
  btrfs: fix uninitialized return value in the ref-verify tool
  btrfs: always do the basic checks for btrfs_qgroup_inherit structure
  btrfs: zoned: fix calc_available_free_space() for zoned mode
2024-07-04 10:27:37 -07:00
Boris Burkov
da0386c1c7 btrfs: fix folio refcount in btrfs_do_encoded_write()
The conversion to folios switched __free_page() to __folio_put() in the
error path in btrfs_do_encoded_write().

However, this gets the page refcounting wrong. If we do hit that error
path (I reproduced by modifying btrfs_do_encoded_write to pretend to
always fail in a way that jumps to out_folios and running the fstests
case btrfs/281), then we always hit the following BUG freeing the folio:

  BUG: Bad page state in process btrfs  pfn:40ab0b
  page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x61be5 pfn:0x40ab0b
   flags: 0x5ffff0000000000(node=0|zone=2|lastcpupid=0x1ffff)
  raw: 05ffff0000000000 0000000000000000 dead000000000122 0000000000000000
  raw: 0000000000061be5 0000000000000000 00000001ffffffff 0000000000000000
  page dumped because: nonzero _refcount
  Call Trace:
  <TASK>
  dump_stack_lvl+0x3d/0xe0
  bad_page+0xea/0xf0
  free_unref_page+0x8e1/0x900
  ? __mem_cgroup_uncharge+0x69/0x90
  __folio_put+0xe6/0x190
  btrfs_do_encoded_write+0x445/0x780
  ? current_time+0x25/0xd0
  btrfs_do_write_iter+0x2cc/0x4b0
  btrfs_ioctl_encoded_write+0x2b6/0x340

It turns out __free_page() decreases the page reference count while
__folio_put() does not. Switch __folio_put() to folio_put() which
decreases the folio reference count first.

Fixes: 400b172b8c ("btrfs: compression: migrate compression/decompression paths to folios")
Tested-by: Ed Tomlinson <edtoml@gmail.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-04 02:18:45 +02:00
Mateusz Guzik
3a8e2f99f1 btrfs: use iget5_locked_rcu
With 20 threads each walking a dedicated 1000 dirs * 1000 files
directory tree to stat(2) on a 32 core + 24GB ram vm:

before: 3.54s user 892.30s system 1966% cpu 45.549 total
after:  3.28s user 738.66s system 1955% cpu 37.932 total (-16.7%)

Benchmark can be found here: https://people.freebsd.org/~mjg/fstree.tgz

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://lore.kernel.org/r/20240611173824.535995-3-mjguzik@gmail.com
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-06-12 14:10:00 +02:00
Linus Torvalds
a3d1f54d7a for-6.10-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmZCE4MACgkQxWXV+ddt
 WDudtQ//WjXcHtY3I6NJtDhPsIOG3Qjg9mA0shp73X4djJtZoGCdgL7dq+fTp5lk
 Wu6/XY5g+CSttTgwF4eyHgUSJOptKWY0XQDWxX5VR8WCM2qmUZ7SedlrBED9GNDM
 rN/3egmc74OGwnqyQq3I/2qYLByXFj66tsvW3UBjLNB8vMHajjw1idj9ujipioHq
 ySStPCHkPMwuhEzw9+CTe3W47VUSb5Ug3XDhAZXvxT99oDHn1m+CxKQwcona/IPH
 1El8PmZ7JetaT9ZO3DICBICfCyo+2SSy/KXYypXXE+nzNZhbhC0V9N7Uqm1c91C0
 aRglsJZCXmHBD4BPLvkls6CqEIvMc7FvcNCqQlrbRT6PlfX91/XaeDq4l3RUcuPn
 mGShsdHUiwbPMWYVwqVUKd0IPiktF1R7yigTjYSkEFJTL6HFTrBqV/2fAMUsMfPc
 8gyzYMCPQld73WmrnXZQPKvmzO/LvE0gS5cPapokGwoXstq9n3iYd4ypN0wN6sif
 1jwy3efNzWXXMYV0WzcihKwFMm2fqp/pl9bXq/zwn2CunfIX4WTsaQ2NmJf81jqF
 qFNjlr8S3qO7AvIOs+R2XY9E3VjfzeDADzvjpQy5J/ZYbcHBcxxdYDhg+QGhe5nB
 eNmR51oL1pHSjU2M8PxATL8JxKkX2BvX6u64lVojaw4rxUlyFC0=
 =MMpE
 -----END PGP SIGNATURE-----

Merge tag 'for-6.10-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs updates from David Sterba:
 "This update brings a few minor performance improvements, otherwise
  there's a lot of refactoring, cleanups and other sort of not user
  visible changes.

  Performance improvements:

   - inline b-tree locking functions, improvement in metadata-heavy
     changes

   - relax locking on a range that's being reflinked, allows read
     operations to run in parallel

   - speed up NOCOW write checks (throughput +9% on a sample test)

   - extent locking ranges have been reduced in several places, namely
     around delayed ref processing

  Core:

   - more page to folio conversions:
      - relocation
      - send
      - compression
      - inline extent handling
      - super block write and wait

   - extent_map structure optimizations:
      - reduced structure size
      - code simplifications
      - add shrinker for allocated objects, the numbers can go high and
        could exhaust memory on smaller systems (reported) as they may
        not get an opportunity to be freed fast enough

   - extent locking optimizations:
      - reduce locking ranges where it does not seem to be necessary and
        are safe due to other means of synchronization
      - potential improvements due to lower contention,
        allocation/freeing and state management operations of extent
        state tracking structures

   - delayed ref cleanups and simplifications

   - updated trace points

   - improved error handling, warnings and assertions

   - cleanups and refactoring, unification of error handling paths"

* tag 'for-6.10-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (122 commits)
  btrfs: qgroup: fix initialization of auto inherit array
  btrfs: count super block write errors in device instead of tracking folio error state
  btrfs: use the folio iterator in btrfs_end_super_write()
  btrfs: convert super block writes to folio in write_dev_supers()
  btrfs: convert super block writes to folio in wait_dev_supers()
  bio: Export bio_add_folio_nofail to modules
  btrfs: remove duplicate included header from fs.h
  btrfs: add a cached state to extent_clear_unlock_delalloc
  btrfs: push extent lock down in submit_one_async_extent
  btrfs: push lock_extent down in cow_file_range()
  btrfs: move can_cow_file_range_inline() outside of the extent lock
  btrfs: push lock_extent into cow_file_range_inline
  btrfs: push extent lock into cow_file_range
  btrfs: push extent lock into run_delalloc_cow
  btrfs: remove unlock_extent from run_delalloc_compressed
  btrfs: push extent lock down in run_delalloc_nocow
  btrfs: adjust while loop condition in run_delalloc_nocow
  btrfs: push extent lock into run_delalloc_nocow
  btrfs: push the extent lock into btrfs_run_delalloc_range
  btrfs: lock extent when doing inline extent in compression
  ...
2024-05-14 17:25:36 -07:00
Linus Torvalds
1b0aabcc9a vfs-6.10.misc
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZj3HuwAKCRCRxhvAZXjc
 orYvAQCZOr68uJaEaXAArYTdnMdQ6HIzG+FVlwrqtrhz0BV07wEAqgmtSR9XKh+L
 0+DNepg4R8PZOHH371eSSsLNRCUCkAs=
 =SVsU
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.10.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull misc vfs updates from Christian Brauner:
 "This contains the usual miscellaneous features, cleanups, and fixes
  for vfs and individual fses.

  Features:

   - Free up FMODE_* bits. I've freed up bits 6, 7, 8, and 24. That
     means we now have six free FMODE_* bits in total (but bit #6
     already got used for FMODE_WRITE_RESTRICTED)

   - Add FOP_HUGE_PAGES flag (follow-up to FMODE_* cleanup)

   - Add fd_raw cleanup class so we can make use of automatic cleanup
     provided by CLASS(fd_raw, f)(fd) for O_PATH fds as well

   - Optimize seq_puts()

   - Simplify __seq_puts()

   - Add new anon_inode_getfile_fmode() api to allow specifying f_mode
     instead of open-coding it in multiple places

   - Annotate struct file_handle with __counted_by() and use
     struct_size()

   - Warn in get_file() whether f_count resurrection from zero is
     attempted (epoll/drm discussion)

   - Folio-sophize aio

   - Export the subvolume id in statx() for both btrfs and bcachefs

   - Relax linkat(AT_EMPTY_PATH) requirements

   - Add F_DUPFD_QUERY fcntl() allowing to compare two file descriptors
     for dup*() equality replacing kcmp()

  Cleanups:

   - Compile out swapfile inode checks when swap isn't enabled

   - Use (1 << n) notation for FMODE_* bitshifts for clarity

   - Remove redundant variable assignment in fs/direct-io

   - Cleanup uses of strncpy in orangefs

   - Speed up and cleanup writeback

   - Move fsparam_string_empty() helper into header since it's currently
     open-coded in multiple places

   - Add kernel-doc comments to proc_create_net_data_write()

   - Don't needlessly read dentry->d_flags twice

  Fixes:

   - Fix out-of-range warning in nilfs2

   - Fix ecryptfs overflow due to wrong encryption packet size
     calculation

   - Fix overly long line in xfs file_operations (follow-up to FMODE_*
     cleanup)

   - Don't raise FOP_BUFFER_{R,W}ASYNC for directories in xfs (follow-up
     to FMODE_* cleanup)

   - Don't call xfs_file_open from xfs_dir_open (follow-up to FMODE_*
     cleanup)

   - Fix stable offset api to prevent endless loops

   - Fix afs file server rotations

   - Prevent xattr node from overflowing the eraseblock in jffs2

   - Move fdinfo PTRACE_MODE_READ procfs check into the .permission()
     operation instead of .open() operation since this caused userspace
     regressions"

* tag 'vfs-6.10.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (39 commits)
  afs: Fix fileserver rotation getting stuck
  selftests: add F_DUPDFD_QUERY selftests
  fcntl: add F_DUPFD_QUERY fcntl()
  file: add fd_raw cleanup class
  fs: WARN when f_count resurrection is attempted
  seq_file: Simplify __seq_puts()
  seq_file: Optimize seq_puts()
  proc: Move fdinfo PTRACE_MODE_READ check into the inode .permission operation
  fs: Create anon_inode_getfile_fmode()
  xfs: don't call xfs_file_open from xfs_dir_open
  xfs: drop fop_flags for directories
  xfs: fix overly long line in the file_operations
  shmem: Fix shmem_rename2()
  libfs: Add simple_offset_rename() API
  libfs: Fix simple_offset_rename_exchange()
  jffs2: prevent xattr node from overflowing the eraseblock
  vfs, swap: compile out IS_SWAPFILE() on swapless configs
  vfs: relax linkat() AT_EMPTY_PATH - aka flink() - requirements
  fs/direct-io: remove redundant assignment to variable retval
  fs/dcache: Re-use value stored to dentry->d_flags instead of re-reading
  ...
2024-05-13 11:40:06 -07:00
Josef Bacik
6b0a63a4fa btrfs: add a cached state to extent_clear_unlock_delalloc
Now that we have the lock_extent tightly coupled with
extent_clear_unlock_delalloc we can add a cached state to
extent_clear_unlock_delalloc and benefit from skipping the extra lookup
when we're doing cow.

Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:10 +02:00
Josef Bacik
8325f41a56 btrfs: push extent lock down in submit_one_async_extent
We don't need to include the time we spend in the allocator under our
extent lock protection, move it after the allocator and make sure we
lock the extent in the error case to ensure we're not clearing these
bits without the extent lock held.

Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:10 +02:00
Josef Bacik
d456c25dbb btrfs: push lock_extent down in cow_file_range()
Now that we've got the extent lock pushed into cow_file_range() we can
push it further down into the allocation loop.  This allows us to only
hold the extent lock during the dropping of the extent map range and
inserting the ordered extent.

This makes the error case a little trickier as we'll now have to lock
the range before clearing any of the other extent bits for the range,
but this is the error path so is less performance critical.

Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:10 +02:00
Josef Bacik
cd241a8f55 btrfs: move can_cow_file_range_inline() outside of the extent lock
These checks aren't reliant on the extent lock.  Move this up into
cow_file_range_inline(), and then update encoded writes to call this
check before calling __cow_file_range_inline().  This will allow us to
skip the extent lock if we're not able to inline the given extent.

Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:10 +02:00
Josef Bacik
0ab540995a btrfs: push lock_extent into cow_file_range_inline
Now that we've pushed the lock_extent() into cow_file_range() we can
push the extent locking into cow_file_range_inline() and move the
lock_extent in cow_file_range() to after we call
cow_file_range_inline().

Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:10 +02:00
Josef Bacik
a0766d8f35 btrfs: push extent lock into cow_file_range
Now that cow_file_range is the only function that is called with the
range locked, push this call into cow_file_range so we can further
narrow the scope.

Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:10 +02:00
Josef Bacik
00009d7bcb btrfs: push extent lock into run_delalloc_cow
This is used by zoned but also as the fallback for uncompressed extents
when we fail to compress the ranges.  Push the extent lock into
run_dealloc_cow(), and adjust the compression case to take the extent
lock after calling run_delalloc_cow().

Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:09 +02:00
Josef Bacik
0e128d4e41 btrfs: remove unlock_extent from run_delalloc_compressed
Since we immediately unlock the extent range when we enter
run_delalloc_compressed() simply move the lock_extent() down to cover
cow_file_range() and then remove the unlock_extent() from
run_delalloc_compressed.

Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:09 +02:00
Josef Bacik
aa56b0aa91 btrfs: push extent lock down in run_delalloc_nocow
run_delalloc_nocow is a little special because we use the file extents
to see if we can nocow a range.  We don't actually need the protection
of the extent lock to look at the file extents at this point however.
We are currently holding the page lock for this range, so we are
protected from anybody who would simultaneously be modifying the file
extent items for this range.

* mmap() - we're holding the page lock.
* buffered writes - we're holding the page lock.
* direct writes - we're holding the page lock and direct IO has to flush
  page cache before it's able to continue.
* fallocate() - all callers flush the range and wait on ordered extents
  while holding the inode lock and the mmap lock, so we are again saved
  by the page lock.

We want to use the extent lock to protect

1) The mapping tree for the given range.
2) The ordered extents for the given range.
3) The io_tree for the given range.

Push the extent lock down to cover these operations.  In the
fallback_to_cow() case we simply lock before doing anything and rely on
the cow_file_range() helper to handle it's range properly.

Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:09 +02:00
Josef Bacik
0ed30c17f6 btrfs: adjust while loop condition in run_delalloc_nocow
We have the following pattern

while (1) {
	if (cur_offset > end)
		break;
}

Which is just

while (cur_offset <= end) {
	...
}

so adjust the code to be more clear.

Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:09 +02:00
Josef Bacik
7c9acd440f btrfs: push extent lock into run_delalloc_nocow
run_delalloc_nocow is a bit special as it walks through the file extents
for the inode and determines what it can nocow and what it can't.  This
is the more complicated area for extent locking, so start with this
function.

Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:09 +02:00
Josef Bacik
c0707c9e1e btrfs: push the extent lock into btrfs_run_delalloc_range
We want to limit the scope of the extent lock to be around operations
that can change in flight.  Currently we hold the extent lock through
the entire writepage operation, which isn't really necessary.

We want to protect to make sure nobody has updated DELALLOC.  In
find_lock_delalloc_range we must lock the range in order to validate the
contents of our io_tree.  However once we've done that we're safe to
unlock the range and continue, as we have the page lock already held for
the range.

We are protected from all operations at this point.

* mmap() - we're holding the page lock, thus are protected.
* buffered writes - again, we're protected because we take the page lock
  for the first and last page in our range for buffered writes so we
  won't create new delalloc ranges in this area.
* direct IO - we invalidate pagecache before attempting to write a new
  area, which requires the page lock, so again are protected once we're
  holding the page lock on this range.

Additionally this behavior actually already exists for compressed, we
unlock the range as soon as we start to process the async extents, and
re-lock it during compression.  So this is completely safe, and makes
the locking more consistent.

Make this simple by just pushing the extent lock into
btrfs_run_delalloc_range.  From there followup patches will push the
lock further down into its users.

Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:09 +02:00
Josef Bacik
7034674b8a btrfs: lock extent when doing inline extent in compression
We currently don't lock the extent when we're doing a
cow_file_range_inline() for a compressed extent.  This isn't a problem
necessarily, but it's inconsistent with the rest of our usage of
cow_file_range_inline().  This also leads to some extra weird logic
around whether the extent is locked or not.  Fix this to lock the extent
before calling cow_file_range_inline() in compression to make it
consistent with the rest of the inline users.  In future patches this
will be pushed down into the cow_file_range_inline() helper, so we're
fine with the quick and dirty locking here.  This patch exists to make
the behavior change obvious.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:09 +02:00
Josef Bacik
0586d0a89e btrfs: move extent bit and page cleanup into cow_file_range_inline
We duplicate the extent cleanup for cow_file_range_inline() in the cow
and compressed case.  The encoded case doesn't need to do cleanup the
same way, so rename cow_file_range_inline to __cow_file_range_inline and
then make cow_file_range_inline handle the extent cleanup appropriately,
and update the callers.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:09 +02:00
Josef Bacik
0332967b4d btrfs: unlock all the pages with successful inline extent creation
Since 4750af3bbe ("btrfs: prevent extent_clear_unlock_delalloc() to
unlock page not locked by __process_pages_contig()") we have been
unlocking the locked page manually instead of via
extent_clear_unlock_delalloc() because of subpage blocksize support.
However we actually disable inline extent creation for subpage blocksize
support, so this behavior isn't necessary.  Remove this code and
comment, if at some point the subpage blocksize code grows support for
inline extents this can be re-evaluated.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:09 +02:00
Josef Bacik
6eecfa2240 btrfs: push all inline logic into cow_file_range
Currently we have a lot of duplicated checks of

if (start == 0 && fs_info->sectorsize == PAGE_SIZE)
	cow_file_range_inline();

Instead of duplicating this check everywhere, consolidate all of the
inline extent logic into a helper which documents all of the checks and
then use that helper inside of cow_file_range_inline().  With this we
can clean up all of the calls to either unconditionally call
cow_file_range_inline(), or at least reduce the checks we're doing
before we call cow_file_range_inline();

Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:09 +02:00
Josef Bacik
aa5ccf2917 btrfs: handle errors in btrfs_reloc_clone_csums properly
In the cow path we will clone the reloc csums for relocated data
extents, and if there's an error we already have an ordered extent and
rely on the ordered extent finishing to clean everything up.

There's a problem however, we don't mark the ordered extent with an
error, we pretend like everything was just fine.  If we were at the end
of our range we won't actually bubble up this error anywhere, and we
could end up inserting an extent that doesn't have csums where it should
have them.

Fix this by adding a helper to mark the ordered extent with an error,
and then use this when we fail to lookup the csums in
btrfs_reloc_clone_csums.  Use this helper in the other place where we
use the same pattern while we're here.

This will prevent us from erroneously inserting the extent that doesn't
have the required checksums.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:09 +02:00
Qu Wenruo
e98bf64f7a btrfs: add extra sanity checks for create_io_em()
The function create_io_em() is called before we submit an IO, to update
the in-memory extent map for the involved range.

This patch changes the following aspects:

- Does not allow BTRFS_ORDERED_NOCOW type
  For real NOCOW (excluding NOCOW writes into preallocated ranges)
  writes, we never call create_io_em(), as we does not need to update
  the extent map at all.

  So remove the sanity check allowing BTRFS_ORDERED_NOCOW type.

- Add extra sanity checks
  * PREALLOC
    - @block_len == len
      For uncompressed writes.

  * REGULAR
    - @block_len == @orig_block_len == @ram_bytes == @len
      We're creating a new uncompressed extent, and referring all of it.

    - @orig_start == @start
      We haven no offset inside the extent.

  * COMPRESSED
    - valid @compress_type
    - @len <= @ram_bytes
      This is to co-operate with encoded writes, which can cause a new
      file extent referring only part of a uncompressed extent.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:08 +02:00
Filipe Manana
de6f14e83e btrfs: make try_release_extent_mapping() return a bool
Currently try_release_extent_mapping() as an int return type, but we
use it as a boolean. Its only caller, the release folio callback, also
returns a boolean which corresponds to try_release_extent_mapping()'s
return value. So change its return value type to bool as well as its
helper try_release_extent_state().

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:07 +02:00
Josef Bacik
e094f48040 btrfs: change root->root_key.objectid to btrfs_root_id()
A comment from Filipe on one of my previous cleanups brought my
attention to a new helper we have for getting the root id of a root,
which makes it easier to read in the code.

The changes where made with the following Coccinelle semantic patch:

// <smpl>
@@
expression E,E1;
@@
(
 E->root_key.objectid = E1
|
- E->root_key.objectid
+ btrfs_root_id(E)
)
// </smpl>

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor style fixups ]
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:06 +02:00
Filipe Manana
26c0fae3e7 btrfs: use btrfs_find_first_inode() at btrfs_prune_dentries()
Currently btrfs_prune_dentries() has open code to find the first inode in
a root with a minimum inode number. Remove that code and make it use the
helper btrfs_find_first_inode() for that task.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:04 +02:00
Filipe Manana
5e485ac6f0 btrfs: export find_next_inode() as btrfs_find_first_inode()
Export the relocation private helper find_next_inode() to inode.c, as this
same logic is also used at btrfs_prune_dentries() and will be used by an
upcoming change that adds an extent map shrinker. The next patch will
change btrfs_prune_dentries() to use this helper.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:04 +02:00
Filipe Manana
0a308f8095 btrfs: pass an inode to btrfs_add_extent_mapping()
Instead of passing fs_info and extent map tree arguments to
btrfs_add_extent_mapping(), we can pass an inode instead, as extent maps
are always inserted in the extent map tree of an inode, and the fs_info
can be extracted from the inode (inode->root->fs_info). The only exception
is in the self tests where we allocate an extent map tree and then use it
to insert/update/remove extent maps. However the tests can be changed to
use a test inode and then use the inode's extent map tree.

So change btrfs_add_extent_mapping() to have an inode as an argument
instead of a fs_info and an extent map tree. This reduces the number of
parameters and will also be needed for an upcoming change.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:03 +02:00
Filipe Manana
236e3107fc btrfs: open code csum_exist_in_range()
The csum_exist_in_range() function is now too trivial and is only used in
one place, so open code it in its single caller.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:03 +02:00
Filipe Manana
8d2a83a97f btrfs: make NOCOW checks for existence of checksums in a range more efficient
Before deciding if we can do a NOCOW write into a range, one of the things
we have to do is check if there are checksum items for that range. We do
that through the btrfs_lookup_csums_list() function, which searches for
checksums and adds them to a list supplied by the caller.

But all we need is to check if there is any checksum, we don't need to
look for all of them and collect them into a list, which requires more
search time in the checksums tree, allocating memory for checksums items
to add to the list, copy checksums from a leaf into those list items,
then free that memory, etc. This is all unnecessary overhead, wasting
mostly CPU time, and perhaps some occasional IO if we need to read from
disk any extent buffers.

So change btrfs_lookup_csums_list() to allow to return immediately in
case it finds any checksum, without the need to add it to a list and read
it from a leaf. This is accomplished by allowing a NULL list parameter and
making the function return 1 if it found any checksum, 0 if it didn't
found any, and a negative value in case of an error.

The following test with fio was used to measure performance:

  $ cat test.sh
  #!/bin/bash

  DEV=/dev/nullb0
  MNT=/mnt/nullb0

  cat <<EOF > /tmp/fio-job.ini
  [global]
  name=fio-rand-write
  filename=$MNT/fio-rand-write
  rw=randwrite
  bssplit=4k/20:8k/20:16k/20:32k/20:64k/20
  direct=1
  numjobs=16
  fallocate=posix
  time_based
  runtime=300

  [file1]
  size=8G
  ioengine=io_uring
  iodepth=16
  EOF

  umount $MNT &> /dev/null
  mkfs.btrfs -f $DEV
  mount -o ssd $DEV $MNT

  fio /tmp/fio-job.ini
  umount $MNT

The test was run on a release kernel (Debian's default kernel config).

The results before this patch:

  WRITE: bw=139MiB/s (146MB/s), 8204KiB/s-9504KiB/s (8401kB/s-9732kB/s), io=17.0GiB (18.3GB), run=125317-125344msec

The results after this patch:

  WRITE: bw=153MiB/s (160MB/s), 9241KiB/s-10.0MiB/s (9463kB/s-10.5MB/s), io=17.0GiB (18.3GB), run=114054-114071msec

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:03 +02:00
Filipe Manana
afcb80624f btrfs: remove search_commit parameter from btrfs_lookup_csums_list()
All the callers of btrfs_lookup_csums_list() pass a value of 0 as the
"search_commit" parameter. So remove it and make the function behave as
to always search from the regular root.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:03 +02:00
Filipe Manana
0ddefc2a7c btrfs: move btrfs_page_mkwrite() from inode.c into file.c
btrfs_page_mkwrite() is a struct vm_operations_struct callback and we
define that structure in file.c. Currently the function is in inode.c and
has to be exported to be used in file.c, which makes no sense because it's
not used anywhere else. So move btrfs_page_mkwrite() from inode.c and into
file.c.

While at it do a few minor style changes:

1) Capitalize the first word of every comment and end each sentence with
   punctuation;

2) Avoid splitting some statements into two lines when everything fits in
   85 characters or less.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:03 +02:00
Filipe Manana
47f6944877 btrfs: remove pointless return value assignment at btrfs_finish_one_ordered()
At btrfs_finish_one_ordered() it's pointless to assign 0 to the 'ret'
variable because if it has a non-zero value (error), we have already
jumped to the 'out' label. So remove that redundant assignment.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:02 +02:00
Filipe Manana
2e438442ba btrfs: remove not needed mod_start and mod_len from struct extent_map
The mod_start and mod_len fields of struct extent_map were introduced by
commit 4e2f84e63d ("Btrfs: improve fsync by filtering extents that we
want") in order to avoid too low performance when fsyncing a file that
keeps getting extent maps merge, because it resulted in each fsync logging
again csum ranges that were already merged before.

We don't need this anymore as extent maps in the list of modified extents
are never merged with other extent maps and once we log an extent map we
remove it from the list of modified extent maps, so it's never logged
twice.

So remove the mod_start and mod_len fields from struct extent_map and use
instead the start and len fields when logging checksums in the fast fsync
path. This also makes EXTENT_FLAG_FILLING unused so remove it as well.

Running the reproducer from the commit mentioned before, with a larger
number of extents and against a null block device, so that IO is fast
and we can better see any impact from searching checksums items and
logging them, gave the following results from dd:

Before this change:

   409600000 bytes (410 MB, 391 MiB) copied, 22.948 s, 17.8 MB/s

After this change:

   409600000 bytes (410 MB, 391 MiB) copied, 22.9997 s, 17.8 MB/s

So no changes in throughput.
The test was done in a release kernel (non-debug, Debian's default kernel
config) and its steps are the following:

   $ mkfs.btrfs -f /dev/nullb0
   $ mount /dev/sdb /mnt
   $ dd if=/dev/zero of=/mnt/foobar bs=4k count=100000 oflag=sync
   $ umount /mnt

This also reduces the size of struct extent_map from 128 bytes down to 112
bytes, so now we can have 36 extents maps per 4K page instead of 32.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:02 +02:00
Qu Wenruo
400b172b8c btrfs: compression: migrate compression/decompression paths to folios
For both compression and decompression paths, we always require a
"struct page **pages" and "unsigned long nr_pages", this involves quite
some part of the btrfs compression paths:

- All the compression entry points

- compressed_bio structure
  This affects both compression and decompression.

- async_extent structure

Unfortunately with all those involved parts, there is no good way to
split the conversion into smaller patches while still passing compiling.
So do this in one big conversion in one go.

Please note this is direct page->folio conversion, no change on the page
sized folio requirement yet.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor style fixups ]
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:02 +02:00
Qu Wenruo
ae0d22a7fc btrfs: migrate insert_inline_extent() to folio interfaces
Since insert_inline_extent() now only accepts a single page, it's much
easier to convert it to use folio interfaces.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:02 +02:00
Qu Wenruo
eb1fa9ab47 btrfs: make insert_inline_extent() accept one page directly
Since our inline extent cannot accept anything larger than a sector,
there is really no need to pass all the compressed pages to
insert_inline_extent().

And just in case, expand the ASSERT()s to make sure we only try inline
with compressed size no larger than sectorsize.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:02 +02:00
Qu Wenruo
98fe01af7e btrfs: compression: convert page allocation to folio interfaces
Currently we have two wrappers to allocate and free a page for
compression usage:

- btrfs_alloc_compr_page()
- btrfs_free_compr_page()

The allocator would try to grab a page from the pool, and only allocate
a new page if the pool is empty.

The reclaimer would check if the pool is full, and if not full it would
put the page into the pool.

This patch converts both helpers to use folio interfaces, and allowing
further conversion of compression path to folios.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:02 +02:00
Anand Jain
5e45b044b7 btrfs: rename err to ret in btrfs_cont_expand()
Unify naming of return value to the preferred way.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:00 +02:00
Anand Jain
c3a1cc8ff4 btrfs: rename err to ret in btrfs_rmdir()
Unify naming of return value to the preferred way.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:00 +02:00
Filipe Manana
c66f2afc71 btrfs: remove pointless writepages callback wrapper
There's no point in having a static writepages callback in inode.c that
does nothing besides calling extent_writepages from extent_io.c.
So just remove the callback at inode.c and rename extent_writepages()
to btrfs_writepages().

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:00 +02:00
Filipe Manana
7938d38b94 btrfs: remove pointless readahead callback wrapper
There's no point in having a static readahead callback in inode.c that
does nothing besides calling extent_readahead() from extent_io.c.
So just remove the callback at inode.c and rename extent_readahead()
to btrfs_readahead().

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07 21:31:00 +02:00
Sweet Tea Dorminy
131a821a24 btrfs: fallback if compressed IO fails for ENOSPC
In commit b4ccace878 ("btrfs: refactor submit_compressed_extents()"), if
an async extent compressed but failed to find enough space, we changed
from falling back to an uncompressed write to just failing the write
altogether. The principle was that if there's not enough space to write
the compressed version of the data, there can't possibly be enough space
to write the larger, uncompressed version of the data.

However, this isn't necessarily true: due to fragmentation, there could
be enough discontiguous free blocks to write the uncompressed version,
but not enough contiguous free blocks to write the smaller but
unsplittable compressed version.

This has occurred to an internal workload which relied on write()'s
return value indicating there was space. While rare, it has happened a
few times.

Thus, in order to prevent early ENOSPC, re-add a fallback to
uncompressed writing.

Fixes: b4ccace878 ("btrfs: refactor submit_compressed_extents()")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Co-developed-by: Neal Gompa <neal@gompa.dev>
Signed-off-by: Neal Gompa <neal@gompa.dev>
Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-18 01:46:52 +02:00
Boris Burkov
3c6f0c5ecc btrfs: make btrfs_clear_delalloc_extent() free delalloc reserve
Currently, this call site in btrfs_clear_delalloc_extent() only converts
the reservation. We are marking it not delalloc, so I don't think it
makes sense to keep the rsv around.  This is a path where we are not
sure to join a transaction, so it leads to incorrect free-ing during
umount.

Helps with the pass rate of generic/269 and generic/475.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-02 19:19:08 +02:00
Boris Burkov
74e9795812 btrfs: qgroup: fix qgroup prealloc rsv leak in subvolume operations
Create subvolume, create snapshot and delete subvolume all use
btrfs_subvolume_reserve_metadata() to reserve metadata for the changes
done to the parent subvolume's fs tree, which cannot be mediated in the
normal way via start_transaction. When quota groups (squota or qgroups)
are enabled, this reserves qgroup metadata of type PREALLOC. Once the
operation is associated to a transaction, we convert PREALLOC to
PERTRANS, which gets cleared in bulk at the end of the transaction.

However, the error paths of these three operations were not implementing
this lifecycle correctly. They unconditionally converted the PREALLOC to
PERTRANS in a generic cleanup step regardless of errors or whether the
operation was fully associated to a transaction or not. This resulted in
error paths occasionally converting this rsv to PERTRANS without calling
record_root_in_trans successfully, which meant that unless that root got
recorded in the transaction by some other thread, the end of the
transaction would not free that root's PERTRANS, leaking it. Ultimately,
this resulted in hitting a WARN in CONFIG_BTRFS_DEBUG builds at unmount
for the leaked reservation.

The fix is to ensure that every qgroup PREALLOC reservation observes the
following properties:

1. any failure before record_root_in_trans is called successfully
   results in freeing the PREALLOC reservation.
2. after record_root_in_trans, we convert to PERTRANS, and now the
   transaction owns freeing the reservation.

This patch enforces those properties on the three operations. Without
it, generic/269 with squotas enabled at mkfs time would fail in ~5-10
runs on my system. With this patch, it ran successfully 1000 times in a
row.

Fixes: e85fde5162 ("btrfs: qgroup: fix qgroup meta rsv leak for subvolume operations")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-02 19:18:23 +02:00
Kent Overstreet
2a82bb0294
statx: stx_subvol
Add a new statx field for (sub)volume identifiers, as implemented by
btrfs and bcachefs.

This includes bcachefs support; we'll definitely want btrfs support as
well.

Link: https://lore.kernel.org/linux-fsdevel/2uvhm6gweyl7iyyp2xpfryvcu2g3padagaeqcbiavjyiis6prl@yjm725bizncq/
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Link: https://lore.kernel.org/r/20240308022914.196982-1-kent.overstreet@linux.dev
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-03-26 09:01:18 +01:00
Chengming Zhou
ef5a05c557 btrfs: remove SLAB_MEM_SPREAD flag use
The SLAB_MEM_SPREAD flag used to be implemented in SLAB, which was
removed as of v6.8-rc1, so it became a dead flag since the commit
16a1d96835 ("mm/slab: remove mm/slab.c and slab_def.h"). And the
series[1] went on to mark it obsolete to avoid confusion for users.
Here we can just remove all its users, which has no functional change.

[1] https://lore.kernel.org/all/20240223-slab-cleanup-flags-v2-1-02f1753e8303@suse.cz/

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-05 17:13:23 +01:00
David Sterba
5a8a57f9a4 btrfs: merge btrfs_del_delalloc_inode() helpers
The helpers btrfs_del_delalloc_inode() and __btrfs_del_delalloc_inode()
don't follow the pattern when the "__" helper does a special case and
are in fact reversed regarding the naming. We can merge them into one as
there's only one place that needs to be open coded.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04 16:24:54 +01:00
David Sterba
636d91d7ee btrfs: delete BUG_ON in btrfs_init_locked_inode()
The purpose of the BUG_ON is not clear. The helper btrfs_grab_root()
could return a NULL in case args->root would be a NULL or if there are
zero references. Then we check if the root pointer stored in the inode
still exists.

The whole call chain is for iget:

btrfs_iget
  btrfs_iget_path
    btrfs_iget_locked
      iget5_locked
	btrfs_init_locked_inode

which is called from many contexts where we the root pointer is used and
we can safely assume has enough references.

Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04 16:24:51 +01:00
David Sterba
6fbc6f4ac1 btrfs: handle invalid root reference found in may_destroy_subvol()
The may_destroy_subvol() looks up a root by a key, allowing to do an
inexact search when key->offset is -1.  It's never expected to find such
item, as it would break the allowed range of a root id.

Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04 16:24:51 +01:00
David Sterba
dbe6cda68f btrfs: push errors up from add_async_extent()
The memory allocation error in add_async_extent() is not handled
properly, return an error and push the BUG_ON to the caller. Handling it
there is not trivial so at least make it visible.

Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04 16:24:50 +01:00
Filipe Manana
4e94ee80e1 btrfs: remove do_list variable at btrfs_clear_delalloc_extent()
The "do_list" variable has a rather confusing name, so remove it and
directly use btrfs_is_free_space_inode() instead.

Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04 16:24:50 +01:00
Filipe Manana
99c15fec32 btrfs: remove do_list variable at btrfs_set_delalloc_extent()
The "do_list" variable is only used once, plus its name/meaning is a bit
confusing, so remove it and directory use btrfs_is_free_space_inode().

Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04 16:24:50 +01:00
Filipe Manana
d23626d8bc btrfs: use assertion instead of BUG_ON when adding/removing to delalloc list
When adding or removing and inode to/from the root's delalloc list,
instead of using a BUG_ON() to validate list emptiness, use ASSERT()
since this is to check logic errors rather than real errors.

Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04 16:24:50 +01:00
Filipe Manana
b5d5639259 btrfs: add lockdep assertion to remaining delalloc callbacks
The merge and split callbacks for an inode's io tree are supposed to be
called while the io tree's spinlock is being held, so that the given
extent_state records are stable, not modified or freed while the callbacks
are using them. So add lockdep assertions in the callbacks.

Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04 16:24:50 +01:00
Filipe Manana
bdc0f89e06 btrfs: reduce inode lock critical section when setting and clearing delalloc
When setting and clearing a delalloc range, at btrfs_set_delalloc_extent()
and btrfs_clear_delalloc_extent(), we are adding/removing the inode
to/from the root's list of delalloc inodes while under the protection of
the inode's lock. This however is not needed, we can add and remove the
inode to the root's list without holding the inode's lock because here
we are under the protection of the io tree's lock, reducing the size of
the critical section delimited by the inode's lock. The inode's lock is
used in many other places such as when finishing an ordered extent (when
calling btrfs_update_inode_bytes() or btrfs_delalloc_release_metadata(),
or decreasing the number of outstanding extents) or when reserving space
when doing a buffered or direct IO write (calls to functions from
delalloc-space.c).

So move the inode add/remove operations to the root's list of delalloc
inodes to outside the critical section delimited by the inode's lock.
This also allows us to get rid of the BTRFS_INODE_IN_DELALLOC_LIST flag
since we can rely on the inode's delalloc bytes counter to determine if
the inode is or is not in the list.

The following fio based test, that exercises IO to multiple files in the
same subvolume, was used to test:

   $ cat test.sh
   #!/bin/bash

   DEV=/dev/nullb0
   MNT=/mnt/nullb0
   MOUNT_OPTIONS="-o ssd"

   mkfs.btrfs -f $DEV &> /dev/null
   mount $MOUNT_OPTIONS $DEV $MNT

   fio --direct=0 --ioengine=sync --thread --directory=$MNT \
       --invalidate=1 --group_reporting=1 \
       --new_group --rw=randwrite --size=50m --numjobs=200 \
       --bs=4k --fsync_on_close=0 --fallocate=none --end_fsync=0 \
       --name=foo --filename_format=FioWorkloads.\$jobnum

   umount $MNT

The test was run on a non-debug kernel (Debian's default kernel config)
against a 16G null block device.

Result before this patch:

   WRITE: bw=81.9MiB/s (85.9MB/s), 81.9MiB/s-81.9MiB/s (85.9MB/s-85.9MB/s), io=9.77GiB (10.5GB), run=122136-122136msec

Result after this patch:

   WRITE: bw=86.8MiB/s (91.0MB/s), 86.8MiB/s-86.8MiB/s (91.0MB/s-91.0MB/s), io=9.77GiB (10.5GB), run=115180-115180msec

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04 16:24:50 +01:00
Filipe Manana
f4f15454fa btrfs: rename btrfs_add_delalloc_inodes() to singular form
The function btrfs_add_delalloc_inodes() adds a single inode its root's
list of delalloc inodes, so it doesn't make any sense at all for the
function's name to be plural. Rename it to the singular form
btrfs_add_delalloc_inode() to avoid any confusion.

Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04 16:24:50 +01:00
Filipe Manana
f23f89524b btrfs: assert root delalloc lock is held at __btrfs_del_delalloc_inode()
This function requires the delalloc lock of the inode's root to be held,
so assert it's held.

Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04 16:24:50 +01:00
Filipe Manana
f5169f12d7 btrfs: stop passing root argument to __btrfs_del_delalloc_inode()
There's no need to pass a root argument to __btrfs_del_delalloc_inode()
and btrfs_del_delalloc_inode(), we can just pass the inode since the root
is always the root associated to that inode. Some remove the root argument
from these functions.

Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04 16:24:49 +01:00
Filipe Manana
8a46e55a6c btrfs: stop passing root argument to btrfs_add_delalloc_inodes()
There's no need to pass a root argument to btrfs_add_delalloc_inodes(), we
can just pass the inode since the root is always the root associated to
the inode in the context it's called. So remove it and have the single
caller pass only the inode.

Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04 16:24:49 +01:00
David Sterba
41044b41ad btrfs: add helper to get fs_info from struct inode pointer
Add a convenience helper to get a fs_info from a VFS inode pointer
instead of open coding the chain or using btrfs_sb() that in some cases
does one more pointer hop.  This is implemented as a macro (still with
type checking) so we don't need full definitions of struct btrfs_inode,
btrfs_root or btrfs_fs_info.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04 16:24:49 +01:00
David Sterba
b33d2e535f btrfs: add helpers to get fs_info from page/folio pointers
Add convenience helpers to get a fs_info from a page or folio pointer
instead of open coding the chain or using btrfs_sb() that in some cases
does one more pointer hop.  This is implemented as a macro (still with
type checking) so we don't need full definitions of struct page, folio,
btrfs_root and btrfs_fs_info. The latter can't be static inlines as this
would create loop between ctree.h <-> fs.h, or the headers would have to
be restructured.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04 16:24:49 +01:00
David Sterba
c8293894af btrfs: add helpers to get inode from page/folio pointers
Add convenience helpers to get a struct btrfs_inode from a page or folio
pointer instead of open coding the chain or intermediate BTRFS_I. This
is implemented as a macro (still with type checking) so we don't need
full definitions of struct page or address_space.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04 16:24:49 +01:00
David Sterba
c03c89f821 btrfs: handle errors returned from unpin_extent_cache()
We've had numerous attempts to let function unpin_extent_cache() return
void as it only returns 0. There are still error cases to handle so do
that, in addition to the verbose messages. The only caller
btrfs_finish_one_ordered() will now abort the transaction, previously it
let it continue which could lead to further problems.

Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04 16:24:46 +01:00
David Sterba
2b712e3bb2 btrfs: remove unused included headers
With help of neovim, LSP and clangd we can identify header files that
are not actually needed to be included in the .c files. This is focused
only on removal (with minor fixups), further cleanups are possible but
will require doing the header files properly with forward declarations,
minimized includes and include-what-you-use care.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04 16:24:46 +01:00
David Sterba
4e00422ee6 btrfs: replace sb::s_blocksize by fs_info::sectorsize
The block size stored in the super block is used by subsystems outside
of btrfs and it's a copy of fs_info::sectorsize. Unify that to always
use our sectorsize, with the exception of mount where we first need to
use fixed values (4K) until we read the super block and can set the
sectorsize.

Replace all uses, in most cases it's fewer pointer indirections.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04 16:24:46 +01:00
Goldwyn Rodrigues
df055afe9b btrfs: page to folio conversion in btrfs_truncate_block()
Convert use of struct page to struct folio inside btrfs_truncate_block().
The only page based function is set_page_extent_mapped(). All other
functions have folio equivalents.

Had to use __filemap_get_folio() because filemap_grab_folio() does not
allow passing allocation mask as a parameter.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04 16:24:45 +01:00
Qu Wenruo
8bab0a3066 btrfs: remove the pg_offset parameter from btrfs_get_extent()
The parameter @pg_offset of btrfs_get_extent() is only utilized for
inlined extent, and we already have an ASSERT() and tree-checker, to
make sure we can only get inline extent at file offset 0.

Any invalid inline extent with non-zero file offset would be rejected by
tree-checker in the first place.

Thus the @pg_offset parameter is not really necessary, just remove it.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04 16:24:45 +01:00
Linus Torvalds
7505aa147a for-6.8-rc6-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmXh1GgACgkQxWXV+ddt
 WDtnvA/7BN7BZ6QmwWv9UyxhgSBtzI19AXPi/kBsssnnjNuzXoHFaVHj68lQCCOB
 a4YjRxAg7nmwFGHdVDTdnwXgUECzqlVkeX9cXg1ZpJy0IfP9RriGedRlC/93z7aV
 pg6DnKMh2FlkibK7yO6kRBR8RYLc5aVIytqHXgUeqbaquuhj2Hh8EpqRo2X0RsoE
 wDXlK0qgrU8HyrA3fHdqKYPcm1+cYABGTCwGx65iRffy8vH+KFSAr71G8jOJVEUj
 DgNWJCpBjXJNs0dsKrik5oGmvLd3GDBKinNX7R2mAvMAMGWrL+fVVTVTfBS/clUT
 FBiVFNYCJuphMcO3Qjs6JIuEez0GuGEeh1m+tQ8W795At1FSiINtE5J7LjmJUl5X
 FuUwOUpxco1lTXBLX149Y9kk7AEOaqYxy0XbH4r5bbKyuzQegRGB9/qQX4sSaCt3
 3T+Td9PvS+6Jo+CDO0dsYhG/h3bsHeHtHGR6f2CiO/m1zHDnTX9aYVcLMM3hsrMI
 8OUEy1jkuKnDZQuZuIWES/3V9FlJL34dR3Cb236Pv/yIH1iujIc27g0qXrC1vzPg
 wnUL1wheLQ9IRLedXoiHtX2Y2ZfFQGQDrIKNCJFD+WNPkZYffih5QNTV/mPZmL80
 9EbjoVTu+6rygzdD43R1RWvK0kFsY44RKhHreI8SItO+e3/0TAs=
 =hMf8
 -----END PGP SIGNATURE-----

Merge tag 'for-6.8-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - fix freeing allocated id for anon dev when snapshot creation fails

 - fiemap fixes:
     - followup for a recent deadlock fix, ranges that fiemap can access
       can still race with ordered extent completion
     - make sure fiemap with SYNC flag does not race with writes

* tag 'for-6.8-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix double free of anonymous device after snapshot creation failure
  btrfs: ensure fiemap doesn't race with writes when FIEMAP_FLAG_SYNC is given
  btrfs: fix race between ordered extent completion and fiemap
2024-03-01 12:29:20 -08:00
Filipe Manana
418b090277 btrfs: ensure fiemap doesn't race with writes when FIEMAP_FLAG_SYNC is given
When FIEMAP_FLAG_SYNC is given to fiemap the expectation is that that
are no concurrent writes and we get a stable view of the inode's extent
layout.

When the flag is given we flush all IO (and wait for ordered extents to
complete) and then lock the inode in shared mode, however that leaves open
the possibility that a write might happen right after the flushing and
before locking the inode. So fix this by flushing again after locking the
inode - we leave the initial flushing before locking the inode to avoid
holding the lock and blocking other RO operations while waiting for IO
and ordered extents to complete. The second flushing while holding the
inode's lock will most of the time do nothing or very little since the
time window for new writes to have happened is small.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-02-29 22:34:07 +01:00
Linus Torvalds
1f3a3e2aae for-6.8-rc4-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmXMewsACgkQxWXV+ddt
 WDtFUBAAkEU/hxB4YsLn2JEdp3wc80w5/qKkPaYHsI2ncvc3RFiG+tqSY7BakMgE
 Kkdl8ouNX3p/S62ykIBQTKZnOTk7FgKlClAQtgKn1afexqABsP2mifnh40Dzf7eA
 VvEl7chnRT6oeivtQkB+BtgOzaOUp4j/8oAivRN8NKNwTxojV4g9PErKSOWfVQSq
 3zlrLJbe6era43SpnexkjZHn4Fy4CN+C7FMm+pT/yKzZi2oBZs9BvNZGhIkdnzcK
 MftrY9dSGO3CDD2Kvrz3lEm7ZB83wCpm+GTDN7iJx2y+yeW+aHjshFkJr1ApEZQa
 lsWTnj3hk3yHoOPUuLlchw5JcFb/dFZ1Ztdwkunf8nmt5a3O/5Zf+Csgze8c+Iii
 MJQKi0B/bNQ7cSEwRt36s75kROBItZmHCZmSBlOpT1LXSDQMJ9lvEnv/fPQdcHHF
 WMEmk5O5IoGYv5kx5wIoWv27HKE/bDwH6RjkxEd/n17XP+PcfHY4K0o0CGtfwS8g
 hdy9RI9X8dbf3ZPrxtsgQ2T8btWs68A4S6nwcSuY5HK0WNmvRh47eLfCI6S6XGJs
 hHkppLcc+WTXOskCA+ABdm9hgeAPZkCSpuQSmC2HBt8gRv8XqO7z4cZ/up2N+tES
 ZOJSrJb97nusOcxY0pLexnD6eI3pQxzGMiPONlC1Re8CdjZ0l+4=
 =RRGT
 -----END PGP SIGNATURE-----

Merge tag 'for-6.8-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "A few regular fixes and one fix for space reservation regression since
  6.7 that users have been reporting:

   - fix over-reservation of metadata chunks due to not keeping proper
     balance between global block reserve and delayed refs reserve; in
     practice this leaves behind empty metadata block groups, the
     workaround is to reclaim them by using the '-musage=1' balance
     filter

   - other space reservation fixes:
      - do not delete unused block group if it may be used soon
      - do not reserve space for checksums for NOCOW files

   - fix extent map assertion failure when writing out free space inode

   - reject encoded write if inode has nodatasum flag set

   - fix chunk map leak when loading block group zone info"

* tag 'for-6.8-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: don't refill whole delayed refs block reserve when starting transaction
  btrfs: zoned: fix chunk map leak when loading block group zone info
  btrfs: reject encoded write if inode has nodatasum flag set
  btrfs: don't reserve space for checksums when writing to nocow files
  btrfs: add new unused block groups to the list of unused block groups
  btrfs: do not delete unused block group if it may be used soon
  btrfs: add and use helper to check if block group is used
  btrfs: don't drop extent_map for free space inode on write error
2024-02-14 15:47:02 -08:00
Filipe Manana
1bd96c92c6 btrfs: reject encoded write if inode has nodatasum flag set
Currently we allow an encoded write against inodes that have the NODATASUM
flag set, either because they are NOCOW files or they were created while
the filesystem was mounted with "-o nodatasum". This results in having
compressed extents without corresponding checksums, which is a filesystem
inconsistency reported by 'btrfs check'.

For example, running btrfs/281 with MOUNT_OPTIONS="-o nodatacow" triggers
this and 'btrfs check' errors out with:

   [1/7] checking root items
   [2/7] checking extents
   [3/7] checking free space tree
   [4/7] checking fs roots
   root 256 inode 257 errors 1040, bad file extent, some csum missing
   root 256 inode 258 errors 1040, bad file extent, some csum missing
   ERROR: errors found in fs roots
   (...)

So reject encoded writes if the target inode has NODATASUM set.

CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-02-13 18:38:05 +01:00
Josef Bacik
5571e41ec6 btrfs: don't drop extent_map for free space inode on write error
While running the CI for an unrelated change I hit the following panic
with generic/648 on btrfs_holes_spacecache.

assertion failed: block_start != EXTENT_MAP_HOLE, in fs/btrfs/extent_io.c:1385
------------[ cut here ]------------
kernel BUG at fs/btrfs/extent_io.c:1385!
invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
CPU: 1 PID: 2695096 Comm: fsstress Kdump: loaded Tainted: G        W          6.8.0-rc2+ #1
RIP: 0010:__extent_writepage_io.constprop.0+0x4c1/0x5c0
Call Trace:
 <TASK>
 extent_write_cache_pages+0x2ac/0x8f0
 extent_writepages+0x87/0x110
 do_writepages+0xd5/0x1f0
 filemap_fdatawrite_wbc+0x63/0x90
 __filemap_fdatawrite_range+0x5c/0x80
 btrfs_fdatawrite_range+0x1f/0x50
 btrfs_write_out_cache+0x507/0x560
 btrfs_write_dirty_block_groups+0x32a/0x420
 commit_cowonly_roots+0x21b/0x290
 btrfs_commit_transaction+0x813/0x1360
 btrfs_sync_file+0x51a/0x640
 __x64_sys_fdatasync+0x52/0x90
 do_syscall_64+0x9c/0x190
 entry_SYSCALL_64_after_hwframe+0x6e/0x76

This happens because we fail to write out the free space cache in one
instance, come back around and attempt to write it again.  However on
the second pass through we go to call btrfs_get_extent() on the inode to
get the extent mapping.  Because this is a new block group, and with the
free space inode we always search the commit root to avoid deadlocking
with the tree, we find nothing and return a EXTENT_MAP_HOLE for the
requested range.

This happens because the first time we try to write the space cache out
we hit an error, and on an error we drop the extent mapping.  This is
normal for normal files, but the free space cache inode is special.  We
always expect the extent map to be correct.  Thus the second time
through we end up with a bogus extent map.

Since we're deprecating this feature, the most straightforward way to
fix this is to simply skip dropping the extent map range for this failed
range.

I shortened the test by using error injection to stress the area to make
it easier to reproduce.  With this patch in place we no longer panic
with my error injection test.

CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-02-09 20:28:28 +01:00
Linus Torvalds
5d9248eed4 for-6.8-rc1-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmWurp4ACgkQxWXV+ddt
 WDsqSg/+OS5/1Cr2W6/3ns2hannEeAzYUeoRDNhNHluHOSufXS52QTckQdiA62BO
 iMKGoIxZIn9BQPlvil1hi+jIEt/9qsRt/Qc6oBnzvlto21tJCoS486PJAShu6Sj5
 jXKxtR7d6WrJEfk65uzatk1SbRguRKFxSrFlkaOeOHAmWsD54p/BnsZ/pqxPjF8W
 LOFvwdhbTw3pzQ873b+hJg16rm4IenAnuazZNmXRdSufgdPEcArv0l7fMr4xTBvO
 DBQXoM5GBGVHV2+IsrZiK39p7khz9ej2Ob4rps/x6PduC+GPxGtm6iLy8dZts+hV
 D1FOHh3fqWmV2LQIzLNNu9N7sj5sF5dNFRZHSkq4qFNVNQYfvyFg43iJKfUnMY/s
 puUm7ElSF3tLC2pRys0m/jDfkykZVFFZzbayfYQn+jRKuUASyXnWqmCKlljkLJD5
 ekFXPpor+SQzQso9x0OpAjkSIUmmYFqSvoJCCczPFoo/3EDPv4C6VGOPEQyN6dDH
 nBjn7fLXmn4hpdEKia+LU1MhajFis+SUlmjaoTh7UfCCzXDosDOPThRC1Kx0rNlY
 t4KON8pMUCK3iGEce+7iOSwEImDDU4B7DUARey/sF0C8cs7jRsX8bf8eFTrEId8M
 4C2sLmTw0JJ5n2I2soyTi9fHrGJnJamUlzp/hLrp8JyMzy6qBrs=
 =38MW
 -----END PGP SIGNATURE-----

Merge tag 'for-6.8-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - zoned mode fixes:
     - fix slowdown when writing large file sequentially by looking up
       block groups with enough space faster
     - locking fixes when activating a zone

 - new mount API fixes:
     - preserve mount options for a ro/rw mount of the same subvolume

 - scrub fixes:
     - fix use-after-free in case the chunk length is not aligned to
       64K, this does not happen normally but has been reported on
       images converted from ext4
     - similar alignment check was missing with raid-stripe-tree

 - subvolume deletion fixes:
     - prevent calling ioctl on already deleted subvolume
     - properly track flag tracking a deleted subvolume

 - in subpage mode, fix decompression of an inline extent (zlib, lzo,
   zstd)

 - fix crash when starting writeback on a folio, after integration with
   recent MM changes this needs to be started conditionally

 - reject unknown flags in defrag ioctl

 - error handling, API fixes, minor warning fixes

* tag 'for-6.8-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: scrub: limit RST scrub to chunk boundary
  btrfs: scrub: avoid use-after-free when chunk length is not 64K aligned
  btrfs: don't unconditionally call folio_start_writeback in subpage
  btrfs: use the original mount's mount options for the legacy reconfigure
  btrfs: don't warn if discard range is not aligned to sector
  btrfs: tree-checker: fix inline ref size in error messages
  btrfs: zstd: fix and simplify the inline extent decompression
  btrfs: lzo: fix and simplify the inline extent decompression
  btrfs: zlib: fix and simplify the inline extent decompression
  btrfs: defrag: reject unknown flags of btrfs_ioctl_defrag_range_args
  btrfs: avoid copying BTRFS_ROOT_SUBVOL_DEAD flag to snapshot of subvolume being deleted
  btrfs: don't abort filesystem when attempting to snapshot deleted subvolume
  btrfs: zoned: fix lock ordering in btrfs_zone_activate()
  btrfs: fix unbalanced unlock of mapping_tree_lock
  btrfs: ref-verify: free ref cache before clearing mount opt
  btrfs: fix kvcalloc() arguments order in btrfs_ioctl_send()
  btrfs: zoned: optimize hint byte for zoned allocator
  btrfs: zoned: factor out prepare_allocation_zoned()
2024-01-22 13:29:42 -08:00
Omar Sandoval
3324d05478 btrfs: avoid copying BTRFS_ROOT_SUBVOL_DEAD flag to snapshot of subvolume being deleted
Sweet Tea spotted a race between subvolume deletion and snapshotting
that can result in the root item for the snapshot having the
BTRFS_ROOT_SUBVOL_DEAD flag set. The race is:

Thread 1                                      | Thread 2
----------------------------------------------|----------
btrfs_delete_subvolume                        |
  btrfs_set_root_flags(BTRFS_ROOT_SUBVOL_DEAD)|
                                              |btrfs_mksubvol
                                              |  down_read(subvol_sem)
                                              |  create_snapshot
                                              |    ...
                                              |    create_pending_snapshot
                                              |      copy root item from source
  down_write(subvol_sem)                      |

This flag is only checked in send and swap activate, which this would
cause to fail mysteriously.

create_snapshot() now checks the root refs to reject a deleted
subvolume, so we can fix this by locking subvol_sem earlier so that the
BTRFS_ROOT_SUBVOL_DEAD flag and the root refs are updated atomically.

CC: stable@vger.kernel.org # 4.14+
Reported-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-01-12 02:00:21 +01:00
Linus Torvalds
affc5af36b for-6.8-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmWYTmMACgkQxWXV+ddt
 WDvPRg/+KgS5LV3nNC0MguYcTMQxmgeutIgXZIMfeA3v6EnFS7nj8leP4EPc6+bj
 JPSkwj4u2vHVwpnTVuEAuJUXnmFY+Qu70nVy6bM2uOHOYTVBQ8zRVK4cErNNLWCp
 OekDaADR53RrZ/xprlQ7b7Ph0Ch2uq9OrpH50IcyquEsH1ffkxlqwyrvth4/8dxC
 6zgsFHWrbtVKJf0DYoQPpjEPz5tpdQ+xHZwtmf1cNlUgI1objODr/ZTqXtZqTfw4
 /GwrtDPbEri53K/qjgr0dDH7pBVqD6PtnbgoHfYkiizZ0G7UkmlaK6rZIurtATJb
 Yk/RCqCUp9tPC4yeFSewFMm1Y8Ae3rkUBG7rnYkvMmBspMqyh/kQAWSBimF5yk/y
 vFEdFTe9AbdvP19Nw0CqovLzaO6RrOXCL1usnFvCmBgvF5gZAv63ZW1njP3ZoNta
 wB8Rs6hxdRkph8Dk7yvYf54uUR+JyKqjHY6egg2qkKTjz0CSf6qQFyFZXpr81m97
 gK4WN5SeP/P2ukRbBKKyzZ5IljUxZuVatvJa0tktd7kAbU26WLzofOJ7pX+iqimM
 F2G7gKGJZykLY1WPntXBp9Dg97Ras2O5iViQ7ZKwRdOx1yZS5zzTYlIznHBAmXbL
 UgXfVnpJH1xFdkvedNTn+Fz9BHNV1K2a2AT7VITj7sxz23z3aJA=
 =4sw3
 -----END PGP SIGNATURE-----

Merge tag 'for-6.8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs updates from David Sterba:
 "There are no exciting changes for users, it's been mostly API
  conversions and some fixes or refactoring.

  The mount API conversion is a base for future improvements that would
  come with VFS. Metadata processing has been converted to folios, not
  yet enabling the large folios but it's one patch away once everything
  gets tested enough.

  Core changes:

   - convert extent buffers to folios:
      - direct API conversion where possible
      - performance can drop by a few percent on metadata heavy
        workloads, the folio sizes are not constant and the calculations
        add up in the item helpers
      - both regular and subpage modes
      - data cannot be converted yet, we need to port that to iomap and
        there are some other generic changes required

   - convert mount to the new API, should not be user visible:
      - options deprecated long time ago have been removed: inode_cache,
        recovery
      - the new logic that splits mount to two phases slightly changes
        timing of device scanning for multi-device filesystems
      - LSM options will now work (like for selinux)

   - convert delayed nodes radix tree to xarray, preserving the
     preload-like logic that still allows to allocate with GFP_NOFS

   - more validation of sysfs value of scrub_speed_max

   - refactor chunk map structure, reduce size and improve performance

   - extent map refactoring, smaller data structures, improved
     performance

   - reduce size of struct extent_io_tree, embedded in several
     structures

   - temporary pages used for compression are cached and attached to a
     shrinker, this may slightly improve performance

   - in zoned mode, remove redirty extent buffer tracking, zeros are
     written in case an out-of-order is detected and proper data are
     written to the actual write pointer

   - cleanups, refactoring, error message improvements, updated tests

   - verify and update branch name or tag

   - remove unwanted text"

* tag 'for-6.8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (89 commits)
  btrfs: pass btrfs_io_geometry into btrfs_max_io_len
  btrfs: pass struct btrfs_io_geometry to set_io_stripe
  btrfs: open code set_io_stripe for RAID56
  btrfs: change block mapping to switch/case in btrfs_map_block
  btrfs: factor out block mapping for single profiles
  btrfs: factor out block mapping for RAID5/6
  btrfs: reduce scope of data_stripes in btrfs_map_block
  btrfs: factor out block mapping for RAID10
  btrfs: factor out block mapping for DUP profiles
  btrfs: factor out RAID1 block mapping
  btrfs: factor out block-mapping for RAID0
  btrfs: re-introduce struct btrfs_io_geometry
  btrfs: factor out helper for single device IO check
  btrfs: migrate btrfs_repair_io_failure() to folio interfaces
  btrfs: migrate eb_bitmap_offset() to folio interfaces
  btrfs: migrate various end io functions to folios
  btrfs: migrate subpage code to folio interfaces
  btrfs: migrate get_eb_page_index() and get_eb_offset_in_page() to folios
  btrfs: don't double put our subpage reference in alloc_extent_buffer
  btrfs: cleanup metadata page pointer usage
  ...
2024-01-10 09:27:40 -08:00
Linus Torvalds
fb46e22a9e Many singleton patches against the MM code. The patch series which
are included in this merge do the following:
 
 - Peng Zhang has done some mapletree maintainance work in the
   series
 
 	"maple_tree: add mt_free_one() and mt_attr() helpers"
 	"Some cleanups of maple tree"
 
 - In the series "mm: use memmap_on_memory semantics for dax/kmem"
   Vishal Verma has altered the interworking between memory-hotplug
   and dax/kmem so that newly added 'device memory' can more easily
   have its memmap placed within that newly added memory.
 
 - Matthew Wilcox continues folio-related work (including a few
   fixes) in the patch series
 
 	"Add folio_zero_tail() and folio_fill_tail()"
 	"Make folio_start_writeback return void"
 	"Fix fault handler's handling of poisoned tail pages"
 	"Convert aops->error_remove_page to ->error_remove_folio"
 	"Finish two folio conversions"
 	"More swap folio conversions"
 
 - Kefeng Wang has also contributed folio-related work in the series
 
 	"mm: cleanup and use more folio in page fault"
 
 - Jim Cromie has improved the kmemleak reporting output in the
   series "tweak kmemleak report format".
 
 - In the series "stackdepot: allow evicting stack traces" Andrey
   Konovalov to permits clients (in this case KASAN) to cause
   eviction of no longer needed stack traces.
 
 - Charan Teja Kalla has fixed some accounting issues in the page
   allocator's atomic reserve calculations in the series "mm:
   page_alloc: fixes for high atomic reserve caluculations".
 
 - Dmitry Rokosov has added to the samples/ dorectory some sample
   code for a userspace memcg event listener application.  See the
   series "samples: introduce cgroup events listeners".
 
 - Some mapletree maintanance work from Liam Howlett in the series
   "maple_tree: iterator state changes".
 
 - Nhat Pham has improved zswap's approach to writeback in the
   series "workload-specific and memory pressure-driven zswap
   writeback".
 
 - DAMON/DAMOS feature and maintenance work from SeongJae Park in
   the series
 
 	"mm/damon: let users feed and tame/auto-tune DAMOS"
 	"selftests/damon: add Python-written DAMON functionality tests"
 	"mm/damon: misc updates for 6.8"
 
 - Yosry Ahmed has improved memcg's stats flushing in the series
   "mm: memcg: subtree stats flushing and thresholds".
 
 - In the series "Multi-size THP for anonymous memory" Ryan Roberts
   has added a runtime opt-in feature to transparent hugepages which
   improves performance by allocating larger chunks of memory during
   anonymous page faults.
 
 - Matthew Wilcox has also contributed some cleanup and maintenance
   work against eh buffer_head code int he series "More buffer_head
   cleanups".
 
 - Suren Baghdasaryan has done work on Andrea Arcangeli's series
   "userfaultfd move option".  UFFDIO_MOVE permits userspace heap
   compaction algorithms to move userspace's pages around rather than
   UFFDIO_COPY'a alloc/copy/free.
 
 - Stefan Roesch has developed a "KSM Advisor", in the series
   "mm/ksm: Add ksm advisor".  This is a governor which tunes KSM's
   scanning aggressiveness in response to userspace's current needs.
 
 - Chengming Zhou has optimized zswap's temporary working memory
   use in the series "mm/zswap: dstmem reuse optimizations and
   cleanups".
 
 - Matthew Wilcox has performed some maintenance work on the
   writeback code, both code and within filesystems.  The series is
   "Clean up the writeback paths".
 
 - Andrey Konovalov has optimized KASAN's handling of alloc and
   free stack traces for secondary-level allocators, in the series
   "kasan: save mempool stack traces".
 
 - Andrey also performed some KASAN maintenance work in the series
   "kasan: assorted clean-ups".
 
 - David Hildenbrand has gone to town on the rmap code.  Cleanups,
   more pte batching, folio conversions and more.  See the series
   "mm/rmap: interface overhaul".
 
 - Kinsey Ho has contributed some maintenance work on the MGLRU
   code in the series "mm/mglru: Kconfig cleanup".
 
 - Matthew Wilcox has contributed lruvec page accounting code
   cleanups in the series "Remove some lruvec page accounting
   functions".
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZZyF2wAKCRDdBJ7gKXxA
 jjWjAP42LHvGSjp5M+Rs2rKFL0daBQsrlvy6/jCHUequSdWjSgEAmOx7bc5fbF27
 Oa8+DxGM9C+fwqZ/7YxU2w/WuUmLPgU=
 =0NHs
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2024-01-08-15-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:
 "Many singleton patches against the MM code. The patch series which are
  included in this merge do the following:

   - Peng Zhang has done some mapletree maintainance work in the series

	'maple_tree: add mt_free_one() and mt_attr() helpers'
	'Some cleanups of maple tree'

   - In the series 'mm: use memmap_on_memory semantics for dax/kmem'
     Vishal Verma has altered the interworking between memory-hotplug
     and dax/kmem so that newly added 'device memory' can more easily
     have its memmap placed within that newly added memory.

   - Matthew Wilcox continues folio-related work (including a few fixes)
     in the patch series

	'Add folio_zero_tail() and folio_fill_tail()'
	'Make folio_start_writeback return void'
	'Fix fault handler's handling of poisoned tail pages'
	'Convert aops->error_remove_page to ->error_remove_folio'
	'Finish two folio conversions'
	'More swap folio conversions'

   - Kefeng Wang has also contributed folio-related work in the series

	'mm: cleanup and use more folio in page fault'

   - Jim Cromie has improved the kmemleak reporting output in the series
     'tweak kmemleak report format'.

   - In the series 'stackdepot: allow evicting stack traces' Andrey
     Konovalov to permits clients (in this case KASAN) to cause eviction
     of no longer needed stack traces.

   - Charan Teja Kalla has fixed some accounting issues in the page
     allocator's atomic reserve calculations in the series 'mm:
     page_alloc: fixes for high atomic reserve caluculations'.

   - Dmitry Rokosov has added to the samples/ dorectory some sample code
     for a userspace memcg event listener application. See the series
     'samples: introduce cgroup events listeners'.

   - Some mapletree maintanance work from Liam Howlett in the series
     'maple_tree: iterator state changes'.

   - Nhat Pham has improved zswap's approach to writeback in the series
     'workload-specific and memory pressure-driven zswap writeback'.

   - DAMON/DAMOS feature and maintenance work from SeongJae Park in the
     series

	'mm/damon: let users feed and tame/auto-tune DAMOS'
	'selftests/damon: add Python-written DAMON functionality tests'
	'mm/damon: misc updates for 6.8'

   - Yosry Ahmed has improved memcg's stats flushing in the series 'mm:
     memcg: subtree stats flushing and thresholds'.

   - In the series 'Multi-size THP for anonymous memory' Ryan Roberts
     has added a runtime opt-in feature to transparent hugepages which
     improves performance by allocating larger chunks of memory during
     anonymous page faults.

   - Matthew Wilcox has also contributed some cleanup and maintenance
     work against eh buffer_head code int he series 'More buffer_head
     cleanups'.

   - Suren Baghdasaryan has done work on Andrea Arcangeli's series
     'userfaultfd move option'. UFFDIO_MOVE permits userspace heap
     compaction algorithms to move userspace's pages around rather than
     UFFDIO_COPY'a alloc/copy/free.

   - Stefan Roesch has developed a 'KSM Advisor', in the series 'mm/ksm:
     Add ksm advisor'. This is a governor which tunes KSM's scanning
     aggressiveness in response to userspace's current needs.

   - Chengming Zhou has optimized zswap's temporary working memory use
     in the series 'mm/zswap: dstmem reuse optimizations and cleanups'.

   - Matthew Wilcox has performed some maintenance work on the writeback
     code, both code and within filesystems. The series is 'Clean up the
     writeback paths'.

   - Andrey Konovalov has optimized KASAN's handling of alloc and free
     stack traces for secondary-level allocators, in the series 'kasan:
     save mempool stack traces'.

   - Andrey also performed some KASAN maintenance work in the series
     'kasan: assorted clean-ups'.

   - David Hildenbrand has gone to town on the rmap code. Cleanups, more
     pte batching, folio conversions and more. See the series 'mm/rmap:
     interface overhaul'.

   - Kinsey Ho has contributed some maintenance work on the MGLRU code
     in the series 'mm/mglru: Kconfig cleanup'.

   - Matthew Wilcox has contributed lruvec page accounting code cleanups
     in the series 'Remove some lruvec page accounting functions'"

* tag 'mm-stable-2024-01-08-15-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (361 commits)
  mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER
  mm, treewide: introduce NR_PAGE_ORDERS
  selftests/mm: add separate UFFDIO_MOVE test for PMD splitting
  selftests/mm: skip test if application doesn't has root privileges
  selftests/mm: conform test to TAP format output
  selftests: mm: hugepage-mmap: conform to TAP format output
  selftests/mm: gup_test: conform test to TAP format output
  mm/selftests: hugepage-mremap: conform test to TAP format output
  mm/vmstat: move pgdemote_* out of CONFIG_NUMA_BALANCING
  mm: zsmalloc: return -ENOSPC rather than -EINVAL in zs_malloc while size is too large
  mm/memcontrol: remove __mod_lruvec_page_state()
  mm/khugepaged: use a folio more in collapse_file()
  slub: use a folio in __kmalloc_large_node
  slub: use folio APIs in free_large_kmalloc()
  slub: use alloc_pages_node() in alloc_slab_page()
  mm: remove inc/dec lruvec page state functions
  mm: ratelimit stat flush from workingset shrinker
  kasan: stop leaking stack trace handles
  mm/mglru: remove CONFIG_TRANSPARENT_HUGEPAGE
  mm/mglru: add dummy pmd_dirty()
  ...
2024-01-09 11:18:47 -08:00
Qu Wenruo
55151ea9ec btrfs: migrate subpage code to folio interfaces
Although subpage itself is conflicting with higher folio, since subpage
(sectorsize < PAGE_SIZE and nodesize < PAGE_SIZE) means we will never
need higher order folio, there is a hidden pitfall:

- btrfs_page_*() helpers

Those helpers are an abstraction to handle both subpage and non-subpage
cases, which means we're going to pass pages pointers to those helpers.

And since those helpers are shared between data and metadata paths, it's
unavoidable to let them to handle folios, including higher order
folios).

Meanwhile for true subpage case, we should only have a single page
backed folios anyway, thus add a new ASSERT() for btrfs_subpage_assert()
to ensure that.

Also since those helpers are shared between both data and metadata, add
some extra ASSERT()s for data path to make sure we only get single page
backed folio for now.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15 23:03:58 +01:00
Qu Wenruo
13df3775ef btrfs: cleanup metadata page pointer usage
Although we have migrated extent_buffer::pages[] to folios[], we're
still mostly using the folio_page() help to grab the page.

This patch would do the following cleanups for metadata:

- Introduce num_extent_folios() helper
  This is to replace most num_extent_pages() callers.

- Use num_extent_folios() to iterate future large folios
  This allows us to use things like
  bio_add_folio()/bio_add_folio_nofail(), and only set the needed flags
  for the folio (aka the leading/tailing page), which reduces the loop
  iteration to 1 for large folios.

- Change metadata related functions to use folio pointers
  Including their function name, involving:
  * attach_extent_buffer_page()
  * detach_extent_buffer_page()
  * page_range_has_eb()
  * btrfs_release_extent_buffer_pages()
  * btree_clear_page_dirty()
  * btrfs_page_inc_eb_refs()
  * btrfs_page_dec_eb_refs()

- Change btrfs_is_subpage() to accept an address_space pointer
  This is to allow both page->mapping and folio->mapping to be utilized.
  As data is still using the old per-page code, and may keep so for a
  while.

- Special corner case place holder for future order mismatches between
  extent buffer and inode filemap
  For now it's  just a block of comments and a dead ASSERT(), no real
  handling yet.

The subpage code would still go page, just because subpage and large
folio are conflicting conditions, thus we don't need to bother subpage
with higher order folios at all. Just folio_page(folio, 0) would be
enough.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor styling tweaks ]
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15 23:01:04 +01:00
Qu Wenruo
09e6cef19c btrfs: refactor alloc_extent_buffer() to allocate-then-attach method
Currently alloc_extent_buffer() utilizes find_or_create_page() to
allocate one page a time for an extent buffer.

This method has the following disadvantages:

- find_or_create_page() is the legacy way of allocating new pages
  With the new folio infrastructure, find_or_create_page() is just
  redirected to filemap_get_folio().

- Lacks the way to support higher order (order >= 1) folios
  As we can not yet let filemap give us a higher order folio.

This patch would change the workflow by the following way:

		Old		   |		new
-----------------------------------+-------------------------------------
                                   | ret = btrfs_alloc_page_array();
for (i = 0; i < num_pages; i++) {  | for (i = 0; i < num_pages; i++) {
    p = find_or_create_page();     |     ret = filemap_add_folio();
    /* Attach page private */      |     /* Reuse page cache if needed */
    /* Reused eb if needed */      |
				   |     /* Attach page private and
				   |        reuse eb if needed */
				   | }

By this we split the page allocation and private attaching into two
parts, allowing future updates to each part more easily, and migrate to
folio interfaces (especially for possible higher order folios).

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15 23:01:04 +01:00
David Sterba
6140ba8a0a btrfs: switch btrfs_root::delayed_nodes_tree to xarray from radix-tree
The radix-tree has been superseded by the xarray
(https://lwn.net/Articles/745073), this patch converts the
btrfs_root::delayed_nodes, the APIs are used in a simple way.

First idea is to do xa_insert() but this would require GFP_ATOMIC
allocation which we want to avoid if possible. The preload mechanism of
radix-tree can be emulated within the xarray API.

- xa_reserve() with GFP_NOFS outside of the lock, the reserved entry
  is inserted atomically at most once

- xa_store() under a lock, in case something races in we can detect that
  and xa_load() returns a valid pointer

All uses of xa_load() must check for a valid pointer in case they manage
to get between the xa_reserve() and xa_store(), this is handled in
btrfs_get_delayed_node().

Otherwise the functionality is equivalent, xarray implements the
radix-tree and there should be no performance difference.

The patch continues the efforts started in 253bf57555 ("btrfs: turn
delayed_nodes_tree into an XArray") and fixes the problems with locking
and GFP flags 088aea3b97 ("Revert "btrfs: turn delayed_nodes_tree
into an XArray"").

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15 23:01:03 +01:00
Filipe Manana
f86f7a75e2 btrfs: use the flags of an extent map to identify the compression type
Currently, in struct extent_map, we use an unsigned int (32 bits) to
identify the compression type of an extent and an unsigned long (64 bits
on a 64 bits platform, 32 bits otherwise) for flags. We are only using
6 different flags, so an unsigned long is excessive and we can use flags
to identify the compression type instead of using a dedicated 32 bits
field.

We can easily have tens or hundreds of thousands (or more) of extent maps
on busy and large filesystems, specially with compression enabled or many
or large files with tons of small extents. So it's convenient to have the
extent_map structure as small as possible in order to use less memory.

So remove the compression type field from struct extent_map, use flags
to identify the compression type and shorten the flags field from an
unsigned long to a u32. This saves 8 bytes (on 64 bits platforms) and
reduces the size of the structure from 136 bytes down to 128 bytes, using
now only two cache lines, and increases the number of extent maps we can
have per 4K page from 30 to 32. By using a u32 for the flags instead of
an unsigned long, we no longer use test_bit(), set_bit() and clear_bit(),
but that level of atomicity is not needed as most flags are never cleared
once set (before adding an extent map to the tree), and the ones that can
be cleared or set after an extent map is added to the tree, are always
performed while holding the write lock on the extent map tree, while the
reader holds a lock on the tree or tests for a flag that never changes
once the extent map is in the tree (such as compression flags).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15 22:59:02 +01:00
Filipe Manana
00deaf04df btrfs: log messages at unpin_extent_range() during unexpected cases
At unpin_extent_range() we trigger a WARN_ON() when we don't find an
extent map or we find one with a start offset not matching the start
offset of the target range. This however isn't very useful for debugging
because:

1) We don't know which condition was triggered, as they are both in the
   same WARN_ON() call;

2) We don't know which inode was affected, from which root, for which
   range, what's the start offset of the extent map, and so on.

So trigger a separate warning for each case and log a message for each
case providing information about the inode, its root, the target range,
the generation and the start offset of the extent map we found.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15 22:59:01 +01:00
David Sterba
637e6e0f50 btrfs: allocate btrfs_inode::file_extent_tree only without NO_HOLES
The file_extent_tree was added in 41a2ee75aa ("btrfs: introduce
per-inode file extent tree") so we have an explicit mapping of the file
extents to know where it is safe to update i_size. When the feature
NO_HOLES is enabled, and it's been a mkfs default since 5.15, the tree
is not necessary.

To save some space in the inode, allocate the tree only when necessary.
This reduces size by 16 bytes from 1096 to 1080 on a x86_64 release
config.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15 22:59:01 +01:00
Josef Bacik
ed9b50a13e btrfs: cache that we don't have security.capability set
When profiling a workload I noticed we were constantly calling getxattr.
These were mostly coming from __remove_privs, which will lookup if
security.capability exists to remove it.  However instrumenting getxattr
showed we get called nearly constantly on an idle machine on a lot of
accesses.

These are wasteful and not free.  Other security LSMs have a way to
cache their results, but capability doesn't have this, so it's asking us
all the time for the xattr.

Fix this by setting a flag in our inode that it doesn't have a
security.capability xattr.  We set this on new inodes and after a failed
lookup of security.capability.  If we set this xattr at all we'll clear
the flag.

I haven't found a test in fsperf that this makes a visible difference
on, but I assume fs_mark related tests would show it clearly.  This is a
perf report output of the smallfiles100k run where it shows 20% of our
time spent in __remove_privs because we're looking up the non-existent
xattr.

--21.86%--btrfs_write_check.constprop.0
  --21.62%--__file_remove_privs
    --21.55%--security_inode_need_killpriv
      --21.54%--cap_inode_need_killpriv
        --21.53%--__vfs_getxattr
          --20.89%--btrfs_getxattr

Obviously this is just CPU time in a mostly IO bound test, so the actual
effect of removing this callchain is minimal.  However in just normal
testing of an idle system tracing showed around 100 getxattr calls per
minute, and with this patch there are 0.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15 20:27:05 +01:00
David Sterba
738290c056 btrfs: always set extent_io_tree::inode and drop fs_info
The extent_io_tree is embedded in several structures, notably in struct
btrfs_inode.  The fs_info is only used for reporting errors and for
reference in trace points. We can get to the pointer through the inode,
but not all io trees set it. However, we always know the owner and
can recognize if inode is valid.  For access helpers are provided, const
variant for the trace points.

This reduces size of extent_io_tree by 8 bytes and following structures
in turn:

- btrfs_inode		1104 -> 1088
- btrfs_device		 520 ->  512
- btrfs_root		1360 -> 1344
- btrfs_transaction	 456 ->  440
- btrfs_fs_info		3600 -> 3592
- reloc_control		1520 -> 1512

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15 20:27:02 +01:00
David Sterba
516095cdf0 btrfs: move lockdep class setting out of extent_io_tree_init
The per-inode file extent tree was added in 41a2ee75aa ("btrfs:
introduce per-inode file extent tree"), it's the only tree type
that requires the lockdep class. Move it to the file where it is
actually used.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15 20:27:02 +01:00
Filipe Manana
7dc66abb5a btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:

1) To actually represent extents for inodes;
2) To represent chunk mappings.

This is odd and has several disadvantages:

1) To create a chunk map, we need to do two memory allocations: one for
   an extent_map structure and another one for a map_lookup structure, so
   more potential for an allocation failure and more complicated code to
   manage and link two structures;

2) For a chunk map we actually only use 3 fields (24 bytes) of the
   respective extent map structure: the 'start' field to have the logical
   start address of the chunk, the 'len' field to have the chunk's size,
   and the 'orig_block_len' field to contain the chunk's stripe size.

   Besides wasting a memory, it's also odd and not intuitive at all to
   have the stripe size in a field named 'orig_block_len'.

   We are also using 'block_len' of the extent_map structure to contain
   the chunk size, so we have 2 fields for the same value, 'len' and
   'block_len', which is pointless;

3) When an extent map is associated to a chunk mapping, we set the bit
   EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
   'map_lookup' point to the associated map_lookup structure. This means
   that for an extent map associated to an inode extent, we are not using
   this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);

4) Extent maps associated to a chunk mapping are never merged or split so
   it's pointless to use the existing extent map infrastructure.

So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:

1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.

This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.

We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15 20:27:02 +01:00
Qu Wenruo
cfbf07e278 btrfs: migrate to use folio private instead of page private
As a cleanup and preparation for future folio migration, this patch
would replace all page->private to folio version.  This includes:

- PagePrivate()
  -> folio_test_private()

- page->private
  -> folio_get_private()

- attach_page_private()
  -> folio_attach_private()

- detach_page_private()
  -> folio_detach_private()

Since we're here, also remove the forced cast on page->private, since
it's (void *) already, we don't really need to do the cast.

For now even if we missed some call sites, it won't cause any problem
yet, as we're only using order 0 folio (single page), thus all those
folio/page flags should be synced.

But for the future conversion to utilize higher order folio, the page
<-> folio flag sync is no longer guaranteed, thus we have to migrate to
utilize folio flags.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15 20:27:01 +01:00
David Sterba
9ba965dca3 btrfs: use page alloc/free wrappers for compression pages
This is a preparation for managing compression pages in a cache-like
manner, instead of asking the allocator each time. The common allocation
and free wrappers are introduced and are functionally equivalent to the
current code.

The freeing helpers need to be carefully placed where the last reference
is dropped.  This is either after directly allocating (error handling)
or when there are no other users of the pages (after copying the contents).

It's safe to not use the helper and use put_page() that will handle the
reference count. Not using the helper means there's lower number of
pages that could be reused without passing them back to allocator.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15 20:27:01 +01:00
Linus Torvalds
bdb2701f0b for-6.7-rc5-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmV5rTIACgkQxWXV+ddt
 WDuLUg/+Ix/CeA+JY6VZMA2kBHMzmRexSjYONWfQwIL7LPBy4sOuSEaTZt+QQMs+
 AEKau1YfTgo7e9S2DlbZhIWp6P87VFui7Q1E99uJEmKelakvf94DbMrufPTTKjaD
 JG2KB6LsD59yWwfbGHEAVVNGSMRk2LDXzcUWMK6/uzu/7Bcr4ataOymWd86/blUV
 cw5g87uAHpBn+R1ARTf1CkqyYiI9UldNUJmW1q7dwxOyYG+weUtJImosw2Uda76y
 wQXAFQAH3vsFzTC+qjC9Vz7cnyAX9qAw48ODRH7rIT1BQ3yAFQbfXE20jJ/fSE+C
 lz3p05tA9373KAOtLUHmANBwe3NafCnlut6ZYRfpTcEzUslAO5PnajPaHh5Al7uC
 Iwdpy49byoyVFeNf0yECBsuDP8s86HlUALF8mdJabPI1Kl66MUea6KgS1oyO3pCB
 hfqLbpofV4JTywtIRLGQTQvzSwkjPHTbSwtZ9nftTw520a5f7memDu5vi4XzFd+B
 NrJxmz2DrMRlwrLgWg9OXXgx1riWPvHnIoqzjG5W6A9N74Ud1/oz7t3VzjGSQ5S2
 UikRB6iofPE0deD8IF6H6DvFfvQxU9d9BJ6IS9V2zRt5vdgJ2w08FlqbLZewSY4x
 iaQ+L7UYKDjC9hdosXVNu/6fAspyBVdSp2NbKk14fraZtNAoPNs=
 =uF/Q
 -----END PGP SIGNATURE-----

Merge tag 'for-6.7-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
  "Some fixes to quota accounting code, mostly around error handling and
   correctness:

   - free reserves on various error paths, after IO errors or
     transaction abort

   - don't clear reserved range at the folio release time, it'll be
     properly cleared after final write

   - fix integer overflow due to int used when passing around size of
     freed reservations

   - fix a regression in squota accounting that missed some cases with
     delayed refs"

* tag 'for-6.7-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: ensure releasing squota reserve on head refs
  btrfs: don't clear qgroup reserved bit in release_folio
  btrfs: free qgroup pertrans reserve on transaction abort
  btrfs: fix qgroup_free_reserved_data int overflow
  btrfs: free qgroup reserve when ORDERED_IOERR is set
2023-12-14 11:53:00 -08:00
Matthew Wilcox (Oracle)
af7628d6ec fs: convert error_remove_page to error_remove_folio
There were already assertions that we were not passing a tail page to
error_remove_page(), so make the compiler enforce that by converting
everything to pass and use a folio.

Link: https://lkml.kernel.org/r/20231117161447.2461643-7-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-10 16:51:42 -08:00
Boris Burkov
9e65bfca24 btrfs: fix qgroup_free_reserved_data int overflow
The reserved data counter and input parameter is a u64, but we
inadvertently accumulate it in an int. Overflowing that int results in
freeing the wrong amount of data and breaking reserve accounting.

Unfortunately, this overflow rot spreads from there, as the qgroup
release/free functions rely on returning an int to take advantage of
negative values for error codes.

Therefore, the full fix is to return the "released" or "freed" amount by
a u64 argument and to return 0 or negative error code via the return
value.

Most of the call sites simply ignore the return value, though some
of them handle the error and count the returned bytes. Change all of
them accordingly.

CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-06 22:32:46 +01:00
Linus Torvalds
9bacdd8996 for-6.7-rc1-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmVSO50ACgkQxWXV+ddt
 WDuiyg/7BZviFAyiQMAzpA319qRJZ+EemfTdF/k69q4axGYuvqVdXKnpOV44AR4I
 dKcHLOPpDZIxsh8lFytkm1UEAHptw1v7A64c+gcdjGK0tAA7aKbw/1nmNysowT23
 L0v2+34hkBUfG8A3uVgOwL1rjItEX5Fl54slpVsazSqlEbKrqC4MGNjmqdp3IOeC
 qfXTgkvkXmm8s8NyoJybKewM9Aw0tmK0jkAFHA+2sgcZPYKXjqWGv9KUOsXnCx5o
 3kPWIRT1sj4q2qzrgP14Q12O6qPLZ2/0oTBhi6nhj8+N1yiH+USS5zBITegF+w2n
 leQeVHtyBYHlPYQSQlCIZy7+10gkePvs+JmoAuL8YFISnGYnvOZqCeArlV7cnNI3
 CQt7ZBER5Dqw78Y756usUhpYrLWa9kOpcPVRmjJ/R62+TY1FkkyY7irETbn5EGjI
 NlhEa4PMYeYpAOccoxWEm9tIiiVD1abURhVBdn3Znfcb1Sv/lrGBlo9DYGFCxbBh
 xU1JP7sly8w0aPLqCbn1X3VY8dXp+CeYz4FQabHjQA/zr9lF08/pRYj3haAbYAyH
 0KphXurwz/YqY+LmRg7SbQ/KMgBAiBV8Qk9JyNvdvaQbnYnq7CWdpoHcpZu3mvpb
 HLGoXew58kZaSfxLHlcT5wwYlbq0rooXRstuFg2+BBcOFOMCQfw=
 =GM+1
 -----END PGP SIGNATURE-----

Merge tag 'for-6.7-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - fix potential overflow in returned value from SEARCH_TREE_V2
   ioctl on 32bit architecture

 - zoned mode fixes:

     - drop unnecessary write pointer check for RAID0/RAID1/RAID10
       profiles, now it works because of raid-stripe-tree

     - wait for finishing the zone when direct IO needs a new
       allocation

 - simple quota fixes:

     - pass correct owning root pointer when cleaning up an
       aborted transaction

     - fix leaking some structures when processing delayed refs

     - change key type number of BTRFS_EXTENT_OWNER_REF_KEY,
       reorder it before inline refs that are supposed to be
       sorted, keeping the original number would complicate a lot
       of things; this change needs an updated version of
       btrfs-progs to work and filesystems need to be recreated

 - fix error pointer dereference after failure to allocate fs
   devices

 - fix race between accounting qgroup extents and removing a
   qgroup

* tag 'for-6.7-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: make OWNER_REF_KEY type value smallest among inline refs
  btrfs: fix qgroup record leaks when using simple quotas
  btrfs: fix race between accounting qgroup extents and removing a qgroup
  btrfs: fix error pointer dereference after failure to allocate fs devices
  btrfs: make found_logical_ret parameter mandatory for function queue_scrub_stripe()
  btrfs: get correct owning_root when dropping snapshot
  btrfs: zoned: wait for data BG to be finished on direct IO allocation
  btrfs: zoned: drop no longer valid write pointer check
  btrfs: directly return 0 on no error code in btrfs_insert_raid_extent()
  btrfs: use u64 for buffer sizes in the tree search ioctls
2023-11-13 09:09:12 -08:00
Naohiro Aota
776a838f1f btrfs: zoned: wait for data BG to be finished on direct IO allocation
Running the fio command below on a ZNS device results in "Resource
temporarily unavailable" error.

  $ sudo fio --name=w --directory=/mnt --filesize=1GB --bs=16MB --numjobs=16 \
        --rw=write --ioengine=libaio --iodepth=128 --direct=1

  fio: io_u error on file /mnt/w.2.0: Resource temporarily unavailable: write offset=117440512, buflen=16777216
  fio: io_u error on file /mnt/w.2.0: Resource temporarily unavailable: write offset=134217728, buflen=16777216
  ...

This happens because -EAGAIN error returned from btrfs_reserve_extent()
called from btrfs_new_extent_direct() is spilling over to the userland.

btrfs_reserve_extent() returns -EAGAIN when there is no active zone
available. Then, the caller should wait for some other on-going IO to
finish a zone and retry the allocation.

This logic is already implemented for buffered write in cow_file_range(),
but it is missing for the direct IO counterpart. Implement the same logic
for it.

Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Fixes: 2ce543f478 ("btrfs: zoned: wait until zone is finished when allocation didn't progress")
CC: stable@vger.kernel.org # 6.1+
Tested-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-03 16:39:01 +01:00
Linus Torvalds
d5acbc60fa for-6.7-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmU/xAEACgkQxWXV+ddt
 WDvYKg//SjTimA5Nins9mb4jdz8n+dDeZnQhKzy3FqInU41EzDRc4WwnEODmDlTa
 AyU9rGB3k0JNSUc075jZFCyLqq/ARiOqRi4x33Gk0ckIlc4X5OgBoqP2XkPh0VlP
 txskLCrmhc3pwyR4ErlFDX2jebIUXfkv39bJuE40grGvUatRe+WNq0ERIrgO8RAr
 Rc3hBotMH8AIqfD1L6j1ZiZIAyrOkT1BJMuqeoq27/gJZn/MRhM9TCrMTzfWGaoW
 SxPrQiCDEN3KECsOY/caroMn3AekDijg/ley1Nf7Z0N6oEV+n4VWWPBFE9HhRz83
 9fIdvSbGjSJF6ekzTjcVXPAbcuKZFzeqOdBRMIW3TIUo7mZQyJTVkMsc1y/NL2Z3
 9DhlRLIzvWJJjt1CEK0u18n5IU+dGngdktbhWWIuIlo8r+G/iKR/7zqU92VfWLHL
 Z7/eh6HgH5zr2bm+yKORbrUjkv4IVhGVarW8D4aM+MCG0lFN2GaPcJCCUrp4n7rZ
 PzpQbxXa38ANBk6hsp4ndS8TJSBL9moY8tumzLcKg97nzNMV6KpBdV/G6/QfRLCN
 3kM6UbwTAkMwGcQS86Mqx6s04ORLnQeD6f7N6X4Ppx0Mi/zkjI2HkRuvQGp12B0v
 iZjCCZAYY2Iu+/TU0GrCXSss/grzIAUPzM9msyV3XGO/VBpwdec=
 =9TVx
 -----END PGP SIGNATURE-----

Merge tag 'for-6.7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs updates from David Sterba:
 "New features:

   - raid-stripe-tree

     New tree for logical file extent mapping where the physical mapping
     may not match on multiple devices. This is now used in zoned mode
     to implement RAID0/RAID1* profiles, but can be used in non-zoned
     mode as well. The support for RAID56 is in development and will
     eventually fix the problems with the current implementation. This
     is a backward incompatible feature and has to be enabled at mkfs
     time.

   - simple quota accounting (squota)

     A simplified mode of qgroup that accounts all space on the initial
     extent owners (a subvolume), the snapshots are then cheap to create
     and delete. The deletion of snapshots in fully accounting qgroups
     is a known CPU/IO performance bottleneck.

     The squota is not suitable for the general use case but works well
     for containers where the original subvolume exists for the whole
     time. This is a backward incompatible feature as it needs extending
     some structures, but can be enabled on an existing filesystem.

   - temporary filesystem fsid (temp_fsid)

     The fsid identifies a filesystem and is hard coded in the
     structures, which disallows mounting the same fsid found on
     different devices.

     For a single device filesystem this is not strictly necessary, a
     new temporary fsid can be generated on mount e.g. after a device is
     cloned. This will be used by Steam Deck for root partition A/B
     testing, or can be used for VM root images.

  Other user visible changes:

   - filesystems with partially finished metadata_uuid conversion cannot
     be mounted anymore and the uuid fixup has to be done by btrfs-progs
     (btrfstune).

  Performance improvements:

   - reduce reservations for checksum deletions (with enabled free space
     tree by factor of 4), on a sample workload on file with many
     extents the deletion time decreased by 12%

   - make extent state merges more efficient during insertions, reduce
     rb-tree iterations (run time of critical functions reduced by 5%)

  Core changes:

   - the integrity check functionality has been removed, this was a
     debugging feature and removal does not affect other integrity
     checks like checksums or tree-checker

   - space reservation changes:

      - more efficient delayed ref reservations, this avoids building up
        too much work or overusing or exhausting the global block
        reserve in some situations

      - move delayed refs reservation to the transaction start time,
        this prevents some ENOSPC corner cases related to exhaustion of
        global reserve

      - improvements in reducing excessive reservations for block group
        items

      - adjust overcommit logic in near full situations, account for one
        more chunk to eventually allocate metadata chunk, this is mostly
        relevant for small filesystems (<10GiB)

   - single device filesystems are scanned but not registered (except
     seed devices), this allows temp_fsid to work

   - qgroup iterations do not need GFP_ATOMIC allocations anymore

   - cleanups, refactoring, reduced data structure size, function
     parameter simplifications, error handling fixes"

* tag 'for-6.7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (156 commits)
  btrfs: open code timespec64 in struct btrfs_inode
  btrfs: remove redundant log root tree index assignment during log sync
  btrfs: remove redundant initialization of variable dirty in btrfs_update_time()
  btrfs: sysfs: show temp_fsid feature
  btrfs: disable the device add feature for temp-fsid
  btrfs: disable the seed feature for temp-fsid
  btrfs: update comment for temp-fsid, fsid, and metadata_uuid
  btrfs: remove pointless empty log context list check when syncing log
  btrfs: update comment for struct btrfs_inode::lock
  btrfs: remove pointless barrier from btrfs_sync_file()
  btrfs: add and use helpers for reading and writing last_trans_committed
  btrfs: add and use helpers for reading and writing fs_info->generation
  btrfs: add and use helpers for reading and writing log_transid
  btrfs: add and use helpers for reading and writing last_log_commit
  btrfs: support cloned-device mount capability
  btrfs: add helper function find_fsid_by_disk
  btrfs: stop reserving excessive space for block group item insertions
  btrfs: stop reserving excessive space for block group item updates
  btrfs: reorder btrfs_inode to fill gaps
  btrfs: open code btrfs_ordered_inode_tree in btrfs_inode
  ...
2023-10-30 10:42:06 -10:00
Jeff Layton
b1c38a1338
btrfs: convert to new timestamp accessors
Convert to using the new inode timestamp accessor functions.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20231004185347.80880-21-jlayton@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-18 13:26:19 +02:00
David Sterba
c6e8f898f5 btrfs: open code timespec64 in struct btrfs_inode
The type of timespec64::tv_nsec is 'unsigned long', while we have only
u32 for on-disk and in-memory. This wastes a few bytes in btrfs_inode.
Add separate members for sec and nsec with the corresponding type width.
This creates a 4 byte hole in btrfs_inode which can be utilized in the
future.

Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:19 +02:00
Colin Ian King
a666ce9bab btrfs: remove redundant initialization of variable dirty in btrfs_update_time()
The variable dirty is initialized with a value that is never read, it
is being re-assigned later on. Remove the redundant initialization.
Cleans up clang scan build warning:

  fs/btrfs/inode.c:5965:7: warning: Value stored to 'dirty' during its
  initialization is never read [deadcode.DeadStores]

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:19 +02:00
Filipe Manana
4a4f8fe2b0 btrfs: add and use helpers for reading and writing fs_info->generation
Currently the generation field of struct btrfs_fs_info is always modified
while holding fs_info->trans_lock locked. Most readers will access this
field without taking that lock but while holding a transaction handle,
which is safe to do due to the transaction life cycle.

However there are other readers that are neither holding the lock nor
holding a transaction handle open:

1) When reading an inode from disk, at btrfs_read_locked_inode();

2) When reading the generation to expose it to sysfs, at
   btrfs_generation_show();

3) Early in the fsync path, at skip_inode_logging();

4) When creating a hole at btrfs_cont_expand(), during write paths,
   truncate and reflinking;

5) In the fs_info ioctl (btrfs_ioctl_fs_info());

6) While mounting the filesystem, in the open_ctree() path. In these
   cases it's safe to directly read fs_info->generation as no one
   can concurrently start a transaction and update fs_info->generation.

In case of the fsync path, races here should be harmless, and in the worst
case they may cause a fsync to log an inode when it's not really needed,
so nothing bad from a functional perspective. In the other cases it's not
so clear if functional problems may arise, though in case 1 rare things
like a load/store tearing [1] may cause the BTRFS_INODE_NEEDS_FULL_SYNC
flag not being set on an inode and therefore result in incorrect logging
later on in case a fsync call is made.

To avoid data race warnings from tools like KCSAN and other issues such
as load and store tearing (amongst others, see [1]), create helpers to
access the generation field of struct btrfs_fs_info using READ_ONCE() and
WRITE_ONCE(), and use these helpers where needed.

[1] https://lwn.net/Articles/793253/

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:17 +02:00
David Sterba
54c6537146 btrfs: open code btrfs_ordered_inode_tree in btrfs_inode
The structure btrfs_ordered_inode_tree is used only in one place, in
btrfs_inode. The structure itself has a 4 byte hole which is wasted
space.

Move the btrfs_ordered_inode_tree members to btrfs_inode with a common
prefix 'ordered_tree_' where the hole can be utilized and shrink inode
size.

Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:16 +02:00
David Sterba
893fe24399 btrfs: change test_range_bit to scan the whole range
The semantics of test_range_bit() with filled == 0 is now in it's own
helper so test_range_bit will check the whole range unconditionally.
The detection logic is flipped and assumes success by default and
catches exceptions.

Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:14 +02:00
David Sterba
99be1a66e1 btrfs: add specific helper for range bit test exists
The existing helper test_range_bit works in two ways, checks if the whole
range contains all the bits, or stop on the first occurrence.  By adding
a specific helper for the latter case, the inner loop can be simplified
and contains fewer conditionals, making it a bit faster.

There's no caller that uses the cached state pointer so this reduces the
argument count further.

Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:14 +02:00
Filipe Manana
0a325e620e btrfs: remove redundant root argument from maybe_insert_hole()
The root argument for maybe_insert_hole() always matches the root of the
given inode, so remove the root argument and get it from the inode
argument.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:12 +02:00
Filipe Manana
04bd8e9410 btrfs: remove redundant root argument from btrfs_delayed_update_inode()
The root argument for btrfs_delayed_update_inode() always matches the root
of the given inode, so remove the root argument and get it from the inode
argument.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:12 +02:00
Filipe Manana
07a274a886 btrfs: remove redundant root argument from btrfs_update_inode_item()
The root argument for btrfs_update_inode_item() always matches the root of
the given inode, so remove the root argument and get it from the inode
argument.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:12 +02:00
Filipe Manana
8b9d032225 btrfs: remove redundant root argument from btrfs_update_inode()
The root argument for btrfs_update_inode() always matches the root of the
given inode, so remove the root argument and get it from the inode
argument.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:12 +02:00
Filipe Manana
0a5d0dc55f btrfs: remove redundant root argument from btrfs_update_inode_fallback()
The root argument for btrfs_update_inode_fallback() always matches the
root of the given inode, so remove the root argument and get it from the
inode argument.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:12 +02:00
Filipe Manana
cddaaacca9 btrfs: remove noinline from btrfs_update_inode()
The noinline attribute of btrfs_update_inode() is pointless as the
function is exported and widely used, so remove it.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:12 +02:00
Filipe Manana
2199cb0f5e btrfs: simplify error check condition at btrfs_dirty_inode()
The following condition at btrfs_dirty_inode() is redundant:

  if (ret && (ret == -ENOSPC || ret == -EDQUOT))

The first check for a non-zero 'ret' value is pointless, we can simplify
this to simply:

  if (ret == -ENOSPC || ret == -EDQUOT)

Not only this makes it easier to read, it also slightly reduces the text
size of the btrfs kernel module:

  $ size fs/btrfs/btrfs.ko.before
     text	   data	    bss	    dec	    hex	filename
  1641400	 168265	  16864	1826529	 1bdee1	fs/btrfs/btrfs.ko.before

  $ size fs/btrfs/btrfs.ko.after
     text	   data	    bss	    dec	    hex	filename
  1641224	 168181	  16864	1826269	 1bdddd	fs/btrfs/btrfs.ko.after

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:12 +02:00
David Sterba
078b8b90b8 btrfs: merge ordered work callbacks in btrfs_work into one
There are two callbacks defined in btrfs_work but only two actually make
use of them, otherwise there are NULLs. We can get rid of the freeing
callback making it a special case of the normal work. This reduces the
size of btrfs_work by 8 bytes, final layout:

struct btrfs_work {
        btrfs_func_t               func;                 /*     0     8 */
        btrfs_ordered_func_t       ordered_func;         /*     8     8 */
        struct work_struct         normal_work;          /*    16    32 */
        struct list_head           ordered_list;         /*    48    16 */
        /* --- cacheline 1 boundary (64 bytes) --- */
        struct btrfs_workqueue *   wq;                   /*    64     8 */
        long unsigned int          flags;                /*    72     8 */

        /* size: 80, cachelines: 2, members: 6 */
        /* last cacheline: 16 bytes */
};

This in turn reduces size of other structures (on a release config):

- async_chunk			 160 ->  152
- async_submit_bio		 152 ->  144
- btrfs_async_delayed_work	 104 ->   96
- btrfs_caching_control		 176 ->  168
- btrfs_delalloc_work		 144 ->  136
- btrfs_fs_info			3608 -> 3600
- btrfs_ordered_extent		 440 ->  424
- btrfs_writepage_fixup		 104 ->   96

Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:10 +02:00
Johannes Thumshirn
02c372e1f0 btrfs: add support for inserting raid stripe extents
Add support for inserting stripe extents into the raid stripe tree on
completion of every write that needs an extra logical-to-physical
translation when using RAID.

Inserting the stripe extents happens after the data I/O has completed,
this is done to

  a) support zone-append and
  b) rule out the possibility of a RAID-write-hole.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:09 +02:00
Filipe Manana
50564b651d btrfs: abort transaction on generation mismatch when marking eb as dirty
When marking an extent buffer as dirty, at btrfs_mark_buffer_dirty(),
we check if its generation matches the running transaction and if not we
just print a warning. Such mismatch is an indicator that something really
went wrong and only printing a warning message (and stack trace) is not
enough to prevent a corruption. Allowing a transaction to commit with such
an extent buffer will trigger an error if we ever try to read it from disk
due to a generation mismatch with its parent generation.

So abort the current transaction with -EUCLEAN if we notice a generation
mismatch. For this we need to pass a transaction handle to
btrfs_mark_buffer_dirty() which is always available except in test code,
in which case we can pass NULL since it operates on dummy extent buffers
and all test roots have a single node/leaf (root node at level 0).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:07 +02:00
David Sterba
9580503bcb btrfs: reformat remaining kdoc style comments
Function name in the comment does not bring much value to code not
exposed as API and we don't stick to the kdoc format anymore. Update
formatting of parameter descriptions.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:04 +02:00
Linus Torvalds
a229cf67ab for-6.6-rc2-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmULIZUACgkQxWXV+ddt
 WDv77Q//ZiKpmevPQmQfUtmV8WwMfD2a9zRlKBpGggwtrD4mf3CYRLnOpTm81MPO
 vFIuYacBn+9UXqp2j/IbvNWfQAPQNVDxSPXx66uba93RJc+bB1J3TydxcEyJ7fr4
 dwhLLk01jttfk0+rnjF34fmXiHSTtI6D2WeaLCzUbaPLw4SZ+ul+GAdeF3P174iO
 OMNBUln7hK00Q7j8kFf4j6SW1yIIKMTl6MfOFJYanIqzx51PYFFVtKwoCr0Vt53v
 ZHbgrK582ZJO6pKF9kJF/1tqrY9/Df8jzgSypK8pew/SukMOrf7iVwrmhietuhKA
 92j5sxKhCRyq6Qg6ZwC0jyk+oMqrT8r+q3r38a5qDJx/9Q279vkXBqQnACfLjmnH
 6+sNdkY5/uBWnDMh/+d6yBtfbdW5DtuET4McYpJt1Nk2St/f3UzPaL4LcNkDXNPk
 t1Q4W4v0KS1V8TbsLfdD629CMghxQNKVs1XqyCAbUq9ub4LE2CtL3lDm730qZoZt
 +LM7+sAxEOJC6yqYfdEbcIc8l27Hl5nZEzamcvMrRz61N85/8Jx4Sq2b6VSE9TCE
 hNEWAL5sOjhuhmUPhatYC+KO1P6NDP+Yg99yZCZIT9s/P1oK5H+aETshWX+lvJ+Q
 Ai+qzKvp2ERHFcE+R5qIXs/uX7azpzjqsRZxY2/zdp70ugQDSXE=
 =0eEg
 -----END PGP SIGNATURE-----

Merge tag 'for-6.6-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "A few more followup fixes to the directory listing.

  People have noticed different behaviour compared to other filesystems
  after changes in 6.5. This is now unified to more "logical" and
  expected behaviour while still within POSIX. And a few more fixes for
  stable.

   - change behaviour of readdir()/rewinddir() when new directory
     entries are created after opendir(), properly tracking the last
     entry

   - fix race in readdir when multiple threads can set the last entry
     index for a directory

  Additionally:

   - use exclusive lock when direct io might need to drop privs and call
     notify_change()

   - don't clear uptodate bit on page after an error, this may lead to a
     deadlock in subpage mode

   - fix waiting pattern when multiple readers block on Merkle tree
     data, switch to folios"

* tag 'for-6.6-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix race between reading a directory and adding entries to it
  btrfs: refresh dir last index during a rewinddir(3) call
  btrfs: set last dir index to the current last index when opening dir
  btrfs: don't clear uptodate on write errors
  btrfs: file_remove_privs needs an exclusive lock in direct io write
  btrfs: convert btrfs_read_merkle_tree_page() to use a folio
2023-09-20 11:03:45 -07:00