Commit Graph

2858 Commits

Author SHA1 Message Date
David Sterba
4e00422ee6 btrfs: replace sb::s_blocksize by fs_info::sectorsize
The block size stored in the super block is used by subsystems outside
of btrfs and it's a copy of fs_info::sectorsize. Unify that to always
use our sectorsize, with the exception of mount where we first need to
use fixed values (4K) until we read the super block and can set the
sectorsize.

Replace all uses, in most cases it's fewer pointer indirections.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04 16:24:46 +01:00
Goldwyn Rodrigues
df055afe9b btrfs: page to folio conversion in btrfs_truncate_block()
Convert use of struct page to struct folio inside btrfs_truncate_block().
The only page based function is set_page_extent_mapped(). All other
functions have folio equivalents.

Had to use __filemap_get_folio() because filemap_grab_folio() does not
allow passing allocation mask as a parameter.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04 16:24:45 +01:00
Qu Wenruo
8bab0a3066 btrfs: remove the pg_offset parameter from btrfs_get_extent()
The parameter @pg_offset of btrfs_get_extent() is only utilized for
inlined extent, and we already have an ASSERT() and tree-checker, to
make sure we can only get inline extent at file offset 0.

Any invalid inline extent with non-zero file offset would be rejected by
tree-checker in the first place.

Thus the @pg_offset parameter is not really necessary, just remove it.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04 16:24:45 +01:00
Linus Torvalds
7505aa147a for-6.8-rc6-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmXh1GgACgkQxWXV+ddt
 WDtnvA/7BN7BZ6QmwWv9UyxhgSBtzI19AXPi/kBsssnnjNuzXoHFaVHj68lQCCOB
 a4YjRxAg7nmwFGHdVDTdnwXgUECzqlVkeX9cXg1ZpJy0IfP9RriGedRlC/93z7aV
 pg6DnKMh2FlkibK7yO6kRBR8RYLc5aVIytqHXgUeqbaquuhj2Hh8EpqRo2X0RsoE
 wDXlK0qgrU8HyrA3fHdqKYPcm1+cYABGTCwGx65iRffy8vH+KFSAr71G8jOJVEUj
 DgNWJCpBjXJNs0dsKrik5oGmvLd3GDBKinNX7R2mAvMAMGWrL+fVVTVTfBS/clUT
 FBiVFNYCJuphMcO3Qjs6JIuEez0GuGEeh1m+tQ8W795At1FSiINtE5J7LjmJUl5X
 FuUwOUpxco1lTXBLX149Y9kk7AEOaqYxy0XbH4r5bbKyuzQegRGB9/qQX4sSaCt3
 3T+Td9PvS+6Jo+CDO0dsYhG/h3bsHeHtHGR6f2CiO/m1zHDnTX9aYVcLMM3hsrMI
 8OUEy1jkuKnDZQuZuIWES/3V9FlJL34dR3Cb236Pv/yIH1iujIc27g0qXrC1vzPg
 wnUL1wheLQ9IRLedXoiHtX2Y2ZfFQGQDrIKNCJFD+WNPkZYffih5QNTV/mPZmL80
 9EbjoVTu+6rygzdD43R1RWvK0kFsY44RKhHreI8SItO+e3/0TAs=
 =hMf8
 -----END PGP SIGNATURE-----

Merge tag 'for-6.8-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - fix freeing allocated id for anon dev when snapshot creation fails

 - fiemap fixes:
     - followup for a recent deadlock fix, ranges that fiemap can access
       can still race with ordered extent completion
     - make sure fiemap with SYNC flag does not race with writes

* tag 'for-6.8-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix double free of anonymous device after snapshot creation failure
  btrfs: ensure fiemap doesn't race with writes when FIEMAP_FLAG_SYNC is given
  btrfs: fix race between ordered extent completion and fiemap
2024-03-01 12:29:20 -08:00
Filipe Manana
418b090277 btrfs: ensure fiemap doesn't race with writes when FIEMAP_FLAG_SYNC is given
When FIEMAP_FLAG_SYNC is given to fiemap the expectation is that that
are no concurrent writes and we get a stable view of the inode's extent
layout.

When the flag is given we flush all IO (and wait for ordered extents to
complete) and then lock the inode in shared mode, however that leaves open
the possibility that a write might happen right after the flushing and
before locking the inode. So fix this by flushing again after locking the
inode - we leave the initial flushing before locking the inode to avoid
holding the lock and blocking other RO operations while waiting for IO
and ordered extents to complete. The second flushing while holding the
inode's lock will most of the time do nothing or very little since the
time window for new writes to have happened is small.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-02-29 22:34:07 +01:00
Linus Torvalds
1f3a3e2aae for-6.8-rc4-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmXMewsACgkQxWXV+ddt
 WDtFUBAAkEU/hxB4YsLn2JEdp3wc80w5/qKkPaYHsI2ncvc3RFiG+tqSY7BakMgE
 Kkdl8ouNX3p/S62ykIBQTKZnOTk7FgKlClAQtgKn1afexqABsP2mifnh40Dzf7eA
 VvEl7chnRT6oeivtQkB+BtgOzaOUp4j/8oAivRN8NKNwTxojV4g9PErKSOWfVQSq
 3zlrLJbe6era43SpnexkjZHn4Fy4CN+C7FMm+pT/yKzZi2oBZs9BvNZGhIkdnzcK
 MftrY9dSGO3CDD2Kvrz3lEm7ZB83wCpm+GTDN7iJx2y+yeW+aHjshFkJr1ApEZQa
 lsWTnj3hk3yHoOPUuLlchw5JcFb/dFZ1Ztdwkunf8nmt5a3O/5Zf+Csgze8c+Iii
 MJQKi0B/bNQ7cSEwRt36s75kROBItZmHCZmSBlOpT1LXSDQMJ9lvEnv/fPQdcHHF
 WMEmk5O5IoGYv5kx5wIoWv27HKE/bDwH6RjkxEd/n17XP+PcfHY4K0o0CGtfwS8g
 hdy9RI9X8dbf3ZPrxtsgQ2T8btWs68A4S6nwcSuY5HK0WNmvRh47eLfCI6S6XGJs
 hHkppLcc+WTXOskCA+ABdm9hgeAPZkCSpuQSmC2HBt8gRv8XqO7z4cZ/up2N+tES
 ZOJSrJb97nusOcxY0pLexnD6eI3pQxzGMiPONlC1Re8CdjZ0l+4=
 =RRGT
 -----END PGP SIGNATURE-----

Merge tag 'for-6.8-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "A few regular fixes and one fix for space reservation regression since
  6.7 that users have been reporting:

   - fix over-reservation of metadata chunks due to not keeping proper
     balance between global block reserve and delayed refs reserve; in
     practice this leaves behind empty metadata block groups, the
     workaround is to reclaim them by using the '-musage=1' balance
     filter

   - other space reservation fixes:
      - do not delete unused block group if it may be used soon
      - do not reserve space for checksums for NOCOW files

   - fix extent map assertion failure when writing out free space inode

   - reject encoded write if inode has nodatasum flag set

   - fix chunk map leak when loading block group zone info"

* tag 'for-6.8-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: don't refill whole delayed refs block reserve when starting transaction
  btrfs: zoned: fix chunk map leak when loading block group zone info
  btrfs: reject encoded write if inode has nodatasum flag set
  btrfs: don't reserve space for checksums when writing to nocow files
  btrfs: add new unused block groups to the list of unused block groups
  btrfs: do not delete unused block group if it may be used soon
  btrfs: add and use helper to check if block group is used
  btrfs: don't drop extent_map for free space inode on write error
2024-02-14 15:47:02 -08:00
Filipe Manana
1bd96c92c6 btrfs: reject encoded write if inode has nodatasum flag set
Currently we allow an encoded write against inodes that have the NODATASUM
flag set, either because they are NOCOW files or they were created while
the filesystem was mounted with "-o nodatasum". This results in having
compressed extents without corresponding checksums, which is a filesystem
inconsistency reported by 'btrfs check'.

For example, running btrfs/281 with MOUNT_OPTIONS="-o nodatacow" triggers
this and 'btrfs check' errors out with:

   [1/7] checking root items
   [2/7] checking extents
   [3/7] checking free space tree
   [4/7] checking fs roots
   root 256 inode 257 errors 1040, bad file extent, some csum missing
   root 256 inode 258 errors 1040, bad file extent, some csum missing
   ERROR: errors found in fs roots
   (...)

So reject encoded writes if the target inode has NODATASUM set.

CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-02-13 18:38:05 +01:00
Josef Bacik
5571e41ec6 btrfs: don't drop extent_map for free space inode on write error
While running the CI for an unrelated change I hit the following panic
with generic/648 on btrfs_holes_spacecache.

assertion failed: block_start != EXTENT_MAP_HOLE, in fs/btrfs/extent_io.c:1385
------------[ cut here ]------------
kernel BUG at fs/btrfs/extent_io.c:1385!
invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
CPU: 1 PID: 2695096 Comm: fsstress Kdump: loaded Tainted: G        W          6.8.0-rc2+ #1
RIP: 0010:__extent_writepage_io.constprop.0+0x4c1/0x5c0
Call Trace:
 <TASK>
 extent_write_cache_pages+0x2ac/0x8f0
 extent_writepages+0x87/0x110
 do_writepages+0xd5/0x1f0
 filemap_fdatawrite_wbc+0x63/0x90
 __filemap_fdatawrite_range+0x5c/0x80
 btrfs_fdatawrite_range+0x1f/0x50
 btrfs_write_out_cache+0x507/0x560
 btrfs_write_dirty_block_groups+0x32a/0x420
 commit_cowonly_roots+0x21b/0x290
 btrfs_commit_transaction+0x813/0x1360
 btrfs_sync_file+0x51a/0x640
 __x64_sys_fdatasync+0x52/0x90
 do_syscall_64+0x9c/0x190
 entry_SYSCALL_64_after_hwframe+0x6e/0x76

This happens because we fail to write out the free space cache in one
instance, come back around and attempt to write it again.  However on
the second pass through we go to call btrfs_get_extent() on the inode to
get the extent mapping.  Because this is a new block group, and with the
free space inode we always search the commit root to avoid deadlocking
with the tree, we find nothing and return a EXTENT_MAP_HOLE for the
requested range.

This happens because the first time we try to write the space cache out
we hit an error, and on an error we drop the extent mapping.  This is
normal for normal files, but the free space cache inode is special.  We
always expect the extent map to be correct.  Thus the second time
through we end up with a bogus extent map.

Since we're deprecating this feature, the most straightforward way to
fix this is to simply skip dropping the extent map range for this failed
range.

I shortened the test by using error injection to stress the area to make
it easier to reproduce.  With this patch in place we no longer panic
with my error injection test.

CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-02-09 20:28:28 +01:00
Linus Torvalds
5d9248eed4 for-6.8-rc1-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmWurp4ACgkQxWXV+ddt
 WDsqSg/+OS5/1Cr2W6/3ns2hannEeAzYUeoRDNhNHluHOSufXS52QTckQdiA62BO
 iMKGoIxZIn9BQPlvil1hi+jIEt/9qsRt/Qc6oBnzvlto21tJCoS486PJAShu6Sj5
 jXKxtR7d6WrJEfk65uzatk1SbRguRKFxSrFlkaOeOHAmWsD54p/BnsZ/pqxPjF8W
 LOFvwdhbTw3pzQ873b+hJg16rm4IenAnuazZNmXRdSufgdPEcArv0l7fMr4xTBvO
 DBQXoM5GBGVHV2+IsrZiK39p7khz9ej2Ob4rps/x6PduC+GPxGtm6iLy8dZts+hV
 D1FOHh3fqWmV2LQIzLNNu9N7sj5sF5dNFRZHSkq4qFNVNQYfvyFg43iJKfUnMY/s
 puUm7ElSF3tLC2pRys0m/jDfkykZVFFZzbayfYQn+jRKuUASyXnWqmCKlljkLJD5
 ekFXPpor+SQzQso9x0OpAjkSIUmmYFqSvoJCCczPFoo/3EDPv4C6VGOPEQyN6dDH
 nBjn7fLXmn4hpdEKia+LU1MhajFis+SUlmjaoTh7UfCCzXDosDOPThRC1Kx0rNlY
 t4KON8pMUCK3iGEce+7iOSwEImDDU4B7DUARey/sF0C8cs7jRsX8bf8eFTrEId8M
 4C2sLmTw0JJ5n2I2soyTi9fHrGJnJamUlzp/hLrp8JyMzy6qBrs=
 =38MW
 -----END PGP SIGNATURE-----

Merge tag 'for-6.8-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - zoned mode fixes:
     - fix slowdown when writing large file sequentially by looking up
       block groups with enough space faster
     - locking fixes when activating a zone

 - new mount API fixes:
     - preserve mount options for a ro/rw mount of the same subvolume

 - scrub fixes:
     - fix use-after-free in case the chunk length is not aligned to
       64K, this does not happen normally but has been reported on
       images converted from ext4
     - similar alignment check was missing with raid-stripe-tree

 - subvolume deletion fixes:
     - prevent calling ioctl on already deleted subvolume
     - properly track flag tracking a deleted subvolume

 - in subpage mode, fix decompression of an inline extent (zlib, lzo,
   zstd)

 - fix crash when starting writeback on a folio, after integration with
   recent MM changes this needs to be started conditionally

 - reject unknown flags in defrag ioctl

 - error handling, API fixes, minor warning fixes

* tag 'for-6.8-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: scrub: limit RST scrub to chunk boundary
  btrfs: scrub: avoid use-after-free when chunk length is not 64K aligned
  btrfs: don't unconditionally call folio_start_writeback in subpage
  btrfs: use the original mount's mount options for the legacy reconfigure
  btrfs: don't warn if discard range is not aligned to sector
  btrfs: tree-checker: fix inline ref size in error messages
  btrfs: zstd: fix and simplify the inline extent decompression
  btrfs: lzo: fix and simplify the inline extent decompression
  btrfs: zlib: fix and simplify the inline extent decompression
  btrfs: defrag: reject unknown flags of btrfs_ioctl_defrag_range_args
  btrfs: avoid copying BTRFS_ROOT_SUBVOL_DEAD flag to snapshot of subvolume being deleted
  btrfs: don't abort filesystem when attempting to snapshot deleted subvolume
  btrfs: zoned: fix lock ordering in btrfs_zone_activate()
  btrfs: fix unbalanced unlock of mapping_tree_lock
  btrfs: ref-verify: free ref cache before clearing mount opt
  btrfs: fix kvcalloc() arguments order in btrfs_ioctl_send()
  btrfs: zoned: optimize hint byte for zoned allocator
  btrfs: zoned: factor out prepare_allocation_zoned()
2024-01-22 13:29:42 -08:00
Omar Sandoval
3324d05478 btrfs: avoid copying BTRFS_ROOT_SUBVOL_DEAD flag to snapshot of subvolume being deleted
Sweet Tea spotted a race between subvolume deletion and snapshotting
that can result in the root item for the snapshot having the
BTRFS_ROOT_SUBVOL_DEAD flag set. The race is:

Thread 1                                      | Thread 2
----------------------------------------------|----------
btrfs_delete_subvolume                        |
  btrfs_set_root_flags(BTRFS_ROOT_SUBVOL_DEAD)|
                                              |btrfs_mksubvol
                                              |  down_read(subvol_sem)
                                              |  create_snapshot
                                              |    ...
                                              |    create_pending_snapshot
                                              |      copy root item from source
  down_write(subvol_sem)                      |

This flag is only checked in send and swap activate, which this would
cause to fail mysteriously.

create_snapshot() now checks the root refs to reject a deleted
subvolume, so we can fix this by locking subvol_sem earlier so that the
BTRFS_ROOT_SUBVOL_DEAD flag and the root refs are updated atomically.

CC: stable@vger.kernel.org # 4.14+
Reported-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-01-12 02:00:21 +01:00
Linus Torvalds
affc5af36b for-6.8-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmWYTmMACgkQxWXV+ddt
 WDvPRg/+KgS5LV3nNC0MguYcTMQxmgeutIgXZIMfeA3v6EnFS7nj8leP4EPc6+bj
 JPSkwj4u2vHVwpnTVuEAuJUXnmFY+Qu70nVy6bM2uOHOYTVBQ8zRVK4cErNNLWCp
 OekDaADR53RrZ/xprlQ7b7Ph0Ch2uq9OrpH50IcyquEsH1ffkxlqwyrvth4/8dxC
 6zgsFHWrbtVKJf0DYoQPpjEPz5tpdQ+xHZwtmf1cNlUgI1objODr/ZTqXtZqTfw4
 /GwrtDPbEri53K/qjgr0dDH7pBVqD6PtnbgoHfYkiizZ0G7UkmlaK6rZIurtATJb
 Yk/RCqCUp9tPC4yeFSewFMm1Y8Ae3rkUBG7rnYkvMmBspMqyh/kQAWSBimF5yk/y
 vFEdFTe9AbdvP19Nw0CqovLzaO6RrOXCL1usnFvCmBgvF5gZAv63ZW1njP3ZoNta
 wB8Rs6hxdRkph8Dk7yvYf54uUR+JyKqjHY6egg2qkKTjz0CSf6qQFyFZXpr81m97
 gK4WN5SeP/P2ukRbBKKyzZ5IljUxZuVatvJa0tktd7kAbU26WLzofOJ7pX+iqimM
 F2G7gKGJZykLY1WPntXBp9Dg97Ras2O5iViQ7ZKwRdOx1yZS5zzTYlIznHBAmXbL
 UgXfVnpJH1xFdkvedNTn+Fz9BHNV1K2a2AT7VITj7sxz23z3aJA=
 =4sw3
 -----END PGP SIGNATURE-----

Merge tag 'for-6.8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs updates from David Sterba:
 "There are no exciting changes for users, it's been mostly API
  conversions and some fixes or refactoring.

  The mount API conversion is a base for future improvements that would
  come with VFS. Metadata processing has been converted to folios, not
  yet enabling the large folios but it's one patch away once everything
  gets tested enough.

  Core changes:

   - convert extent buffers to folios:
      - direct API conversion where possible
      - performance can drop by a few percent on metadata heavy
        workloads, the folio sizes are not constant and the calculations
        add up in the item helpers
      - both regular and subpage modes
      - data cannot be converted yet, we need to port that to iomap and
        there are some other generic changes required

   - convert mount to the new API, should not be user visible:
      - options deprecated long time ago have been removed: inode_cache,
        recovery
      - the new logic that splits mount to two phases slightly changes
        timing of device scanning for multi-device filesystems
      - LSM options will now work (like for selinux)

   - convert delayed nodes radix tree to xarray, preserving the
     preload-like logic that still allows to allocate with GFP_NOFS

   - more validation of sysfs value of scrub_speed_max

   - refactor chunk map structure, reduce size and improve performance

   - extent map refactoring, smaller data structures, improved
     performance

   - reduce size of struct extent_io_tree, embedded in several
     structures

   - temporary pages used for compression are cached and attached to a
     shrinker, this may slightly improve performance

   - in zoned mode, remove redirty extent buffer tracking, zeros are
     written in case an out-of-order is detected and proper data are
     written to the actual write pointer

   - cleanups, refactoring, error message improvements, updated tests

   - verify and update branch name or tag

   - remove unwanted text"

* tag 'for-6.8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (89 commits)
  btrfs: pass btrfs_io_geometry into btrfs_max_io_len
  btrfs: pass struct btrfs_io_geometry to set_io_stripe
  btrfs: open code set_io_stripe for RAID56
  btrfs: change block mapping to switch/case in btrfs_map_block
  btrfs: factor out block mapping for single profiles
  btrfs: factor out block mapping for RAID5/6
  btrfs: reduce scope of data_stripes in btrfs_map_block
  btrfs: factor out block mapping for RAID10
  btrfs: factor out block mapping for DUP profiles
  btrfs: factor out RAID1 block mapping
  btrfs: factor out block-mapping for RAID0
  btrfs: re-introduce struct btrfs_io_geometry
  btrfs: factor out helper for single device IO check
  btrfs: migrate btrfs_repair_io_failure() to folio interfaces
  btrfs: migrate eb_bitmap_offset() to folio interfaces
  btrfs: migrate various end io functions to folios
  btrfs: migrate subpage code to folio interfaces
  btrfs: migrate get_eb_page_index() and get_eb_offset_in_page() to folios
  btrfs: don't double put our subpage reference in alloc_extent_buffer
  btrfs: cleanup metadata page pointer usage
  ...
2024-01-10 09:27:40 -08:00
Linus Torvalds
fb46e22a9e Many singleton patches against the MM code. The patch series which
are included in this merge do the following:
 
 - Peng Zhang has done some mapletree maintainance work in the
   series
 
 	"maple_tree: add mt_free_one() and mt_attr() helpers"
 	"Some cleanups of maple tree"
 
 - In the series "mm: use memmap_on_memory semantics for dax/kmem"
   Vishal Verma has altered the interworking between memory-hotplug
   and dax/kmem so that newly added 'device memory' can more easily
   have its memmap placed within that newly added memory.
 
 - Matthew Wilcox continues folio-related work (including a few
   fixes) in the patch series
 
 	"Add folio_zero_tail() and folio_fill_tail()"
 	"Make folio_start_writeback return void"
 	"Fix fault handler's handling of poisoned tail pages"
 	"Convert aops->error_remove_page to ->error_remove_folio"
 	"Finish two folio conversions"
 	"More swap folio conversions"
 
 - Kefeng Wang has also contributed folio-related work in the series
 
 	"mm: cleanup and use more folio in page fault"
 
 - Jim Cromie has improved the kmemleak reporting output in the
   series "tweak kmemleak report format".
 
 - In the series "stackdepot: allow evicting stack traces" Andrey
   Konovalov to permits clients (in this case KASAN) to cause
   eviction of no longer needed stack traces.
 
 - Charan Teja Kalla has fixed some accounting issues in the page
   allocator's atomic reserve calculations in the series "mm:
   page_alloc: fixes for high atomic reserve caluculations".
 
 - Dmitry Rokosov has added to the samples/ dorectory some sample
   code for a userspace memcg event listener application.  See the
   series "samples: introduce cgroup events listeners".
 
 - Some mapletree maintanance work from Liam Howlett in the series
   "maple_tree: iterator state changes".
 
 - Nhat Pham has improved zswap's approach to writeback in the
   series "workload-specific and memory pressure-driven zswap
   writeback".
 
 - DAMON/DAMOS feature and maintenance work from SeongJae Park in
   the series
 
 	"mm/damon: let users feed and tame/auto-tune DAMOS"
 	"selftests/damon: add Python-written DAMON functionality tests"
 	"mm/damon: misc updates for 6.8"
 
 - Yosry Ahmed has improved memcg's stats flushing in the series
   "mm: memcg: subtree stats flushing and thresholds".
 
 - In the series "Multi-size THP for anonymous memory" Ryan Roberts
   has added a runtime opt-in feature to transparent hugepages which
   improves performance by allocating larger chunks of memory during
   anonymous page faults.
 
 - Matthew Wilcox has also contributed some cleanup and maintenance
   work against eh buffer_head code int he series "More buffer_head
   cleanups".
 
 - Suren Baghdasaryan has done work on Andrea Arcangeli's series
   "userfaultfd move option".  UFFDIO_MOVE permits userspace heap
   compaction algorithms to move userspace's pages around rather than
   UFFDIO_COPY'a alloc/copy/free.
 
 - Stefan Roesch has developed a "KSM Advisor", in the series
   "mm/ksm: Add ksm advisor".  This is a governor which tunes KSM's
   scanning aggressiveness in response to userspace's current needs.
 
 - Chengming Zhou has optimized zswap's temporary working memory
   use in the series "mm/zswap: dstmem reuse optimizations and
   cleanups".
 
 - Matthew Wilcox has performed some maintenance work on the
   writeback code, both code and within filesystems.  The series is
   "Clean up the writeback paths".
 
 - Andrey Konovalov has optimized KASAN's handling of alloc and
   free stack traces for secondary-level allocators, in the series
   "kasan: save mempool stack traces".
 
 - Andrey also performed some KASAN maintenance work in the series
   "kasan: assorted clean-ups".
 
 - David Hildenbrand has gone to town on the rmap code.  Cleanups,
   more pte batching, folio conversions and more.  See the series
   "mm/rmap: interface overhaul".
 
 - Kinsey Ho has contributed some maintenance work on the MGLRU
   code in the series "mm/mglru: Kconfig cleanup".
 
 - Matthew Wilcox has contributed lruvec page accounting code
   cleanups in the series "Remove some lruvec page accounting
   functions".
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZZyF2wAKCRDdBJ7gKXxA
 jjWjAP42LHvGSjp5M+Rs2rKFL0daBQsrlvy6/jCHUequSdWjSgEAmOx7bc5fbF27
 Oa8+DxGM9C+fwqZ/7YxU2w/WuUmLPgU=
 =0NHs
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2024-01-08-15-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:
 "Many singleton patches against the MM code. The patch series which are
  included in this merge do the following:

   - Peng Zhang has done some mapletree maintainance work in the series

	'maple_tree: add mt_free_one() and mt_attr() helpers'
	'Some cleanups of maple tree'

   - In the series 'mm: use memmap_on_memory semantics for dax/kmem'
     Vishal Verma has altered the interworking between memory-hotplug
     and dax/kmem so that newly added 'device memory' can more easily
     have its memmap placed within that newly added memory.

   - Matthew Wilcox continues folio-related work (including a few fixes)
     in the patch series

	'Add folio_zero_tail() and folio_fill_tail()'
	'Make folio_start_writeback return void'
	'Fix fault handler's handling of poisoned tail pages'
	'Convert aops->error_remove_page to ->error_remove_folio'
	'Finish two folio conversions'
	'More swap folio conversions'

   - Kefeng Wang has also contributed folio-related work in the series

	'mm: cleanup and use more folio in page fault'

   - Jim Cromie has improved the kmemleak reporting output in the series
     'tweak kmemleak report format'.

   - In the series 'stackdepot: allow evicting stack traces' Andrey
     Konovalov to permits clients (in this case KASAN) to cause eviction
     of no longer needed stack traces.

   - Charan Teja Kalla has fixed some accounting issues in the page
     allocator's atomic reserve calculations in the series 'mm:
     page_alloc: fixes for high atomic reserve caluculations'.

   - Dmitry Rokosov has added to the samples/ dorectory some sample code
     for a userspace memcg event listener application. See the series
     'samples: introduce cgroup events listeners'.

   - Some mapletree maintanance work from Liam Howlett in the series
     'maple_tree: iterator state changes'.

   - Nhat Pham has improved zswap's approach to writeback in the series
     'workload-specific and memory pressure-driven zswap writeback'.

   - DAMON/DAMOS feature and maintenance work from SeongJae Park in the
     series

	'mm/damon: let users feed and tame/auto-tune DAMOS'
	'selftests/damon: add Python-written DAMON functionality tests'
	'mm/damon: misc updates for 6.8'

   - Yosry Ahmed has improved memcg's stats flushing in the series 'mm:
     memcg: subtree stats flushing and thresholds'.

   - In the series 'Multi-size THP for anonymous memory' Ryan Roberts
     has added a runtime opt-in feature to transparent hugepages which
     improves performance by allocating larger chunks of memory during
     anonymous page faults.

   - Matthew Wilcox has also contributed some cleanup and maintenance
     work against eh buffer_head code int he series 'More buffer_head
     cleanups'.

   - Suren Baghdasaryan has done work on Andrea Arcangeli's series
     'userfaultfd move option'. UFFDIO_MOVE permits userspace heap
     compaction algorithms to move userspace's pages around rather than
     UFFDIO_COPY'a alloc/copy/free.

   - Stefan Roesch has developed a 'KSM Advisor', in the series 'mm/ksm:
     Add ksm advisor'. This is a governor which tunes KSM's scanning
     aggressiveness in response to userspace's current needs.

   - Chengming Zhou has optimized zswap's temporary working memory use
     in the series 'mm/zswap: dstmem reuse optimizations and cleanups'.

   - Matthew Wilcox has performed some maintenance work on the writeback
     code, both code and within filesystems. The series is 'Clean up the
     writeback paths'.

   - Andrey Konovalov has optimized KASAN's handling of alloc and free
     stack traces for secondary-level allocators, in the series 'kasan:
     save mempool stack traces'.

   - Andrey also performed some KASAN maintenance work in the series
     'kasan: assorted clean-ups'.

   - David Hildenbrand has gone to town on the rmap code. Cleanups, more
     pte batching, folio conversions and more. See the series 'mm/rmap:
     interface overhaul'.

   - Kinsey Ho has contributed some maintenance work on the MGLRU code
     in the series 'mm/mglru: Kconfig cleanup'.

   - Matthew Wilcox has contributed lruvec page accounting code cleanups
     in the series 'Remove some lruvec page accounting functions'"

* tag 'mm-stable-2024-01-08-15-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (361 commits)
  mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER
  mm, treewide: introduce NR_PAGE_ORDERS
  selftests/mm: add separate UFFDIO_MOVE test for PMD splitting
  selftests/mm: skip test if application doesn't has root privileges
  selftests/mm: conform test to TAP format output
  selftests: mm: hugepage-mmap: conform to TAP format output
  selftests/mm: gup_test: conform test to TAP format output
  mm/selftests: hugepage-mremap: conform test to TAP format output
  mm/vmstat: move pgdemote_* out of CONFIG_NUMA_BALANCING
  mm: zsmalloc: return -ENOSPC rather than -EINVAL in zs_malloc while size is too large
  mm/memcontrol: remove __mod_lruvec_page_state()
  mm/khugepaged: use a folio more in collapse_file()
  slub: use a folio in __kmalloc_large_node
  slub: use folio APIs in free_large_kmalloc()
  slub: use alloc_pages_node() in alloc_slab_page()
  mm: remove inc/dec lruvec page state functions
  mm: ratelimit stat flush from workingset shrinker
  kasan: stop leaking stack trace handles
  mm/mglru: remove CONFIG_TRANSPARENT_HUGEPAGE
  mm/mglru: add dummy pmd_dirty()
  ...
2024-01-09 11:18:47 -08:00
Qu Wenruo
55151ea9ec btrfs: migrate subpage code to folio interfaces
Although subpage itself is conflicting with higher folio, since subpage
(sectorsize < PAGE_SIZE and nodesize < PAGE_SIZE) means we will never
need higher order folio, there is a hidden pitfall:

- btrfs_page_*() helpers

Those helpers are an abstraction to handle both subpage and non-subpage
cases, which means we're going to pass pages pointers to those helpers.

And since those helpers are shared between data and metadata paths, it's
unavoidable to let them to handle folios, including higher order
folios).

Meanwhile for true subpage case, we should only have a single page
backed folios anyway, thus add a new ASSERT() for btrfs_subpage_assert()
to ensure that.

Also since those helpers are shared between both data and metadata, add
some extra ASSERT()s for data path to make sure we only get single page
backed folio for now.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15 23:03:58 +01:00
Qu Wenruo
13df3775ef btrfs: cleanup metadata page pointer usage
Although we have migrated extent_buffer::pages[] to folios[], we're
still mostly using the folio_page() help to grab the page.

This patch would do the following cleanups for metadata:

- Introduce num_extent_folios() helper
  This is to replace most num_extent_pages() callers.

- Use num_extent_folios() to iterate future large folios
  This allows us to use things like
  bio_add_folio()/bio_add_folio_nofail(), and only set the needed flags
  for the folio (aka the leading/tailing page), which reduces the loop
  iteration to 1 for large folios.

- Change metadata related functions to use folio pointers
  Including their function name, involving:
  * attach_extent_buffer_page()
  * detach_extent_buffer_page()
  * page_range_has_eb()
  * btrfs_release_extent_buffer_pages()
  * btree_clear_page_dirty()
  * btrfs_page_inc_eb_refs()
  * btrfs_page_dec_eb_refs()

- Change btrfs_is_subpage() to accept an address_space pointer
  This is to allow both page->mapping and folio->mapping to be utilized.
  As data is still using the old per-page code, and may keep so for a
  while.

- Special corner case place holder for future order mismatches between
  extent buffer and inode filemap
  For now it's  just a block of comments and a dead ASSERT(), no real
  handling yet.

The subpage code would still go page, just because subpage and large
folio are conflicting conditions, thus we don't need to bother subpage
with higher order folios at all. Just folio_page(folio, 0) would be
enough.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor styling tweaks ]
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15 23:01:04 +01:00
Qu Wenruo
09e6cef19c btrfs: refactor alloc_extent_buffer() to allocate-then-attach method
Currently alloc_extent_buffer() utilizes find_or_create_page() to
allocate one page a time for an extent buffer.

This method has the following disadvantages:

- find_or_create_page() is the legacy way of allocating new pages
  With the new folio infrastructure, find_or_create_page() is just
  redirected to filemap_get_folio().

- Lacks the way to support higher order (order >= 1) folios
  As we can not yet let filemap give us a higher order folio.

This patch would change the workflow by the following way:

		Old		   |		new
-----------------------------------+-------------------------------------
                                   | ret = btrfs_alloc_page_array();
for (i = 0; i < num_pages; i++) {  | for (i = 0; i < num_pages; i++) {
    p = find_or_create_page();     |     ret = filemap_add_folio();
    /* Attach page private */      |     /* Reuse page cache if needed */
    /* Reused eb if needed */      |
				   |     /* Attach page private and
				   |        reuse eb if needed */
				   | }

By this we split the page allocation and private attaching into two
parts, allowing future updates to each part more easily, and migrate to
folio interfaces (especially for possible higher order folios).

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15 23:01:04 +01:00
David Sterba
6140ba8a0a btrfs: switch btrfs_root::delayed_nodes_tree to xarray from radix-tree
The radix-tree has been superseded by the xarray
(https://lwn.net/Articles/745073), this patch converts the
btrfs_root::delayed_nodes, the APIs are used in a simple way.

First idea is to do xa_insert() but this would require GFP_ATOMIC
allocation which we want to avoid if possible. The preload mechanism of
radix-tree can be emulated within the xarray API.

- xa_reserve() with GFP_NOFS outside of the lock, the reserved entry
  is inserted atomically at most once

- xa_store() under a lock, in case something races in we can detect that
  and xa_load() returns a valid pointer

All uses of xa_load() must check for a valid pointer in case they manage
to get between the xa_reserve() and xa_store(), this is handled in
btrfs_get_delayed_node().

Otherwise the functionality is equivalent, xarray implements the
radix-tree and there should be no performance difference.

The patch continues the efforts started in 253bf57555 ("btrfs: turn
delayed_nodes_tree into an XArray") and fixes the problems with locking
and GFP flags 088aea3b97 ("Revert "btrfs: turn delayed_nodes_tree
into an XArray"").

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15 23:01:03 +01:00
Filipe Manana
f86f7a75e2 btrfs: use the flags of an extent map to identify the compression type
Currently, in struct extent_map, we use an unsigned int (32 bits) to
identify the compression type of an extent and an unsigned long (64 bits
on a 64 bits platform, 32 bits otherwise) for flags. We are only using
6 different flags, so an unsigned long is excessive and we can use flags
to identify the compression type instead of using a dedicated 32 bits
field.

We can easily have tens or hundreds of thousands (or more) of extent maps
on busy and large filesystems, specially with compression enabled or many
or large files with tons of small extents. So it's convenient to have the
extent_map structure as small as possible in order to use less memory.

So remove the compression type field from struct extent_map, use flags
to identify the compression type and shorten the flags field from an
unsigned long to a u32. This saves 8 bytes (on 64 bits platforms) and
reduces the size of the structure from 136 bytes down to 128 bytes, using
now only two cache lines, and increases the number of extent maps we can
have per 4K page from 30 to 32. By using a u32 for the flags instead of
an unsigned long, we no longer use test_bit(), set_bit() and clear_bit(),
but that level of atomicity is not needed as most flags are never cleared
once set (before adding an extent map to the tree), and the ones that can
be cleared or set after an extent map is added to the tree, are always
performed while holding the write lock on the extent map tree, while the
reader holds a lock on the tree or tests for a flag that never changes
once the extent map is in the tree (such as compression flags).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15 22:59:02 +01:00
Filipe Manana
00deaf04df btrfs: log messages at unpin_extent_range() during unexpected cases
At unpin_extent_range() we trigger a WARN_ON() when we don't find an
extent map or we find one with a start offset not matching the start
offset of the target range. This however isn't very useful for debugging
because:

1) We don't know which condition was triggered, as they are both in the
   same WARN_ON() call;

2) We don't know which inode was affected, from which root, for which
   range, what's the start offset of the extent map, and so on.

So trigger a separate warning for each case and log a message for each
case providing information about the inode, its root, the target range,
the generation and the start offset of the extent map we found.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15 22:59:01 +01:00
David Sterba
637e6e0f50 btrfs: allocate btrfs_inode::file_extent_tree only without NO_HOLES
The file_extent_tree was added in 41a2ee75aa ("btrfs: introduce
per-inode file extent tree") so we have an explicit mapping of the file
extents to know where it is safe to update i_size. When the feature
NO_HOLES is enabled, and it's been a mkfs default since 5.15, the tree
is not necessary.

To save some space in the inode, allocate the tree only when necessary.
This reduces size by 16 bytes from 1096 to 1080 on a x86_64 release
config.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15 22:59:01 +01:00
Josef Bacik
ed9b50a13e btrfs: cache that we don't have security.capability set
When profiling a workload I noticed we were constantly calling getxattr.
These were mostly coming from __remove_privs, which will lookup if
security.capability exists to remove it.  However instrumenting getxattr
showed we get called nearly constantly on an idle machine on a lot of
accesses.

These are wasteful and not free.  Other security LSMs have a way to
cache their results, but capability doesn't have this, so it's asking us
all the time for the xattr.

Fix this by setting a flag in our inode that it doesn't have a
security.capability xattr.  We set this on new inodes and after a failed
lookup of security.capability.  If we set this xattr at all we'll clear
the flag.

I haven't found a test in fsperf that this makes a visible difference
on, but I assume fs_mark related tests would show it clearly.  This is a
perf report output of the smallfiles100k run where it shows 20% of our
time spent in __remove_privs because we're looking up the non-existent
xattr.

--21.86%--btrfs_write_check.constprop.0
  --21.62%--__file_remove_privs
    --21.55%--security_inode_need_killpriv
      --21.54%--cap_inode_need_killpriv
        --21.53%--__vfs_getxattr
          --20.89%--btrfs_getxattr

Obviously this is just CPU time in a mostly IO bound test, so the actual
effect of removing this callchain is minimal.  However in just normal
testing of an idle system tracing showed around 100 getxattr calls per
minute, and with this patch there are 0.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15 20:27:05 +01:00
David Sterba
738290c056 btrfs: always set extent_io_tree::inode and drop fs_info
The extent_io_tree is embedded in several structures, notably in struct
btrfs_inode.  The fs_info is only used for reporting errors and for
reference in trace points. We can get to the pointer through the inode,
but not all io trees set it. However, we always know the owner and
can recognize if inode is valid.  For access helpers are provided, const
variant for the trace points.

This reduces size of extent_io_tree by 8 bytes and following structures
in turn:

- btrfs_inode		1104 -> 1088
- btrfs_device		 520 ->  512
- btrfs_root		1360 -> 1344
- btrfs_transaction	 456 ->  440
- btrfs_fs_info		3600 -> 3592
- reloc_control		1520 -> 1512

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15 20:27:02 +01:00
David Sterba
516095cdf0 btrfs: move lockdep class setting out of extent_io_tree_init
The per-inode file extent tree was added in 41a2ee75aa ("btrfs:
introduce per-inode file extent tree"), it's the only tree type
that requires the lockdep class. Move it to the file where it is
actually used.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15 20:27:02 +01:00
Filipe Manana
7dc66abb5a btrfs: use a dedicated data structure for chunk maps
Currently we abuse the extent_map structure for two purposes:

1) To actually represent extents for inodes;
2) To represent chunk mappings.

This is odd and has several disadvantages:

1) To create a chunk map, we need to do two memory allocations: one for
   an extent_map structure and another one for a map_lookup structure, so
   more potential for an allocation failure and more complicated code to
   manage and link two structures;

2) For a chunk map we actually only use 3 fields (24 bytes) of the
   respective extent map structure: the 'start' field to have the logical
   start address of the chunk, the 'len' field to have the chunk's size,
   and the 'orig_block_len' field to contain the chunk's stripe size.

   Besides wasting a memory, it's also odd and not intuitive at all to
   have the stripe size in a field named 'orig_block_len'.

   We are also using 'block_len' of the extent_map structure to contain
   the chunk size, so we have 2 fields for the same value, 'len' and
   'block_len', which is pointless;

3) When an extent map is associated to a chunk mapping, we set the bit
   EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
   'map_lookup' point to the associated map_lookup structure. This means
   that for an extent map associated to an inode extent, we are not using
   this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);

4) Extent maps associated to a chunk mapping are never merged or split so
   it's pointless to use the existing extent map infrastructure.

So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:

1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.

This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.

We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extents map per 4K
page instead of 28.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15 20:27:02 +01:00
Qu Wenruo
cfbf07e278 btrfs: migrate to use folio private instead of page private
As a cleanup and preparation for future folio migration, this patch
would replace all page->private to folio version.  This includes:

- PagePrivate()
  -> folio_test_private()

- page->private
  -> folio_get_private()

- attach_page_private()
  -> folio_attach_private()

- detach_page_private()
  -> folio_detach_private()

Since we're here, also remove the forced cast on page->private, since
it's (void *) already, we don't really need to do the cast.

For now even if we missed some call sites, it won't cause any problem
yet, as we're only using order 0 folio (single page), thus all those
folio/page flags should be synced.

But for the future conversion to utilize higher order folio, the page
<-> folio flag sync is no longer guaranteed, thus we have to migrate to
utilize folio flags.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15 20:27:01 +01:00
David Sterba
9ba965dca3 btrfs: use page alloc/free wrappers for compression pages
This is a preparation for managing compression pages in a cache-like
manner, instead of asking the allocator each time. The common allocation
and free wrappers are introduced and are functionally equivalent to the
current code.

The freeing helpers need to be carefully placed where the last reference
is dropped.  This is either after directly allocating (error handling)
or when there are no other users of the pages (after copying the contents).

It's safe to not use the helper and use put_page() that will handle the
reference count. Not using the helper means there's lower number of
pages that could be reused without passing them back to allocator.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15 20:27:01 +01:00
Linus Torvalds
bdb2701f0b for-6.7-rc5-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmV5rTIACgkQxWXV+ddt
 WDuLUg/+Ix/CeA+JY6VZMA2kBHMzmRexSjYONWfQwIL7LPBy4sOuSEaTZt+QQMs+
 AEKau1YfTgo7e9S2DlbZhIWp6P87VFui7Q1E99uJEmKelakvf94DbMrufPTTKjaD
 JG2KB6LsD59yWwfbGHEAVVNGSMRk2LDXzcUWMK6/uzu/7Bcr4ataOymWd86/blUV
 cw5g87uAHpBn+R1ARTf1CkqyYiI9UldNUJmW1q7dwxOyYG+weUtJImosw2Uda76y
 wQXAFQAH3vsFzTC+qjC9Vz7cnyAX9qAw48ODRH7rIT1BQ3yAFQbfXE20jJ/fSE+C
 lz3p05tA9373KAOtLUHmANBwe3NafCnlut6ZYRfpTcEzUslAO5PnajPaHh5Al7uC
 Iwdpy49byoyVFeNf0yECBsuDP8s86HlUALF8mdJabPI1Kl66MUea6KgS1oyO3pCB
 hfqLbpofV4JTywtIRLGQTQvzSwkjPHTbSwtZ9nftTw520a5f7memDu5vi4XzFd+B
 NrJxmz2DrMRlwrLgWg9OXXgx1riWPvHnIoqzjG5W6A9N74Ud1/oz7t3VzjGSQ5S2
 UikRB6iofPE0deD8IF6H6DvFfvQxU9d9BJ6IS9V2zRt5vdgJ2w08FlqbLZewSY4x
 iaQ+L7UYKDjC9hdosXVNu/6fAspyBVdSp2NbKk14fraZtNAoPNs=
 =uF/Q
 -----END PGP SIGNATURE-----

Merge tag 'for-6.7-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
  "Some fixes to quota accounting code, mostly around error handling and
   correctness:

   - free reserves on various error paths, after IO errors or
     transaction abort

   - don't clear reserved range at the folio release time, it'll be
     properly cleared after final write

   - fix integer overflow due to int used when passing around size of
     freed reservations

   - fix a regression in squota accounting that missed some cases with
     delayed refs"

* tag 'for-6.7-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: ensure releasing squota reserve on head refs
  btrfs: don't clear qgroup reserved bit in release_folio
  btrfs: free qgroup pertrans reserve on transaction abort
  btrfs: fix qgroup_free_reserved_data int overflow
  btrfs: free qgroup reserve when ORDERED_IOERR is set
2023-12-14 11:53:00 -08:00
Matthew Wilcox (Oracle)
af7628d6ec fs: convert error_remove_page to error_remove_folio
There were already assertions that we were not passing a tail page to
error_remove_page(), so make the compiler enforce that by converting
everything to pass and use a folio.

Link: https://lkml.kernel.org/r/20231117161447.2461643-7-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-10 16:51:42 -08:00
Boris Burkov
9e65bfca24 btrfs: fix qgroup_free_reserved_data int overflow
The reserved data counter and input parameter is a u64, but we
inadvertently accumulate it in an int. Overflowing that int results in
freeing the wrong amount of data and breaking reserve accounting.

Unfortunately, this overflow rot spreads from there, as the qgroup
release/free functions rely on returning an int to take advantage of
negative values for error codes.

Therefore, the full fix is to return the "released" or "freed" amount by
a u64 argument and to return 0 or negative error code via the return
value.

Most of the call sites simply ignore the return value, though some
of them handle the error and count the returned bytes. Change all of
them accordingly.

CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-06 22:32:46 +01:00
Linus Torvalds
9bacdd8996 for-6.7-rc1-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmVSO50ACgkQxWXV+ddt
 WDuiyg/7BZviFAyiQMAzpA319qRJZ+EemfTdF/k69q4axGYuvqVdXKnpOV44AR4I
 dKcHLOPpDZIxsh8lFytkm1UEAHptw1v7A64c+gcdjGK0tAA7aKbw/1nmNysowT23
 L0v2+34hkBUfG8A3uVgOwL1rjItEX5Fl54slpVsazSqlEbKrqC4MGNjmqdp3IOeC
 qfXTgkvkXmm8s8NyoJybKewM9Aw0tmK0jkAFHA+2sgcZPYKXjqWGv9KUOsXnCx5o
 3kPWIRT1sj4q2qzrgP14Q12O6qPLZ2/0oTBhi6nhj8+N1yiH+USS5zBITegF+w2n
 leQeVHtyBYHlPYQSQlCIZy7+10gkePvs+JmoAuL8YFISnGYnvOZqCeArlV7cnNI3
 CQt7ZBER5Dqw78Y756usUhpYrLWa9kOpcPVRmjJ/R62+TY1FkkyY7irETbn5EGjI
 NlhEa4PMYeYpAOccoxWEm9tIiiVD1abURhVBdn3Znfcb1Sv/lrGBlo9DYGFCxbBh
 xU1JP7sly8w0aPLqCbn1X3VY8dXp+CeYz4FQabHjQA/zr9lF08/pRYj3haAbYAyH
 0KphXurwz/YqY+LmRg7SbQ/KMgBAiBV8Qk9JyNvdvaQbnYnq7CWdpoHcpZu3mvpb
 HLGoXew58kZaSfxLHlcT5wwYlbq0rooXRstuFg2+BBcOFOMCQfw=
 =GM+1
 -----END PGP SIGNATURE-----

Merge tag 'for-6.7-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - fix potential overflow in returned value from SEARCH_TREE_V2
   ioctl on 32bit architecture

 - zoned mode fixes:

     - drop unnecessary write pointer check for RAID0/RAID1/RAID10
       profiles, now it works because of raid-stripe-tree

     - wait for finishing the zone when direct IO needs a new
       allocation

 - simple quota fixes:

     - pass correct owning root pointer when cleaning up an
       aborted transaction

     - fix leaking some structures when processing delayed refs

     - change key type number of BTRFS_EXTENT_OWNER_REF_KEY,
       reorder it before inline refs that are supposed to be
       sorted, keeping the original number would complicate a lot
       of things; this change needs an updated version of
       btrfs-progs to work and filesystems need to be recreated

 - fix error pointer dereference after failure to allocate fs
   devices

 - fix race between accounting qgroup extents and removing a
   qgroup

* tag 'for-6.7-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: make OWNER_REF_KEY type value smallest among inline refs
  btrfs: fix qgroup record leaks when using simple quotas
  btrfs: fix race between accounting qgroup extents and removing a qgroup
  btrfs: fix error pointer dereference after failure to allocate fs devices
  btrfs: make found_logical_ret parameter mandatory for function queue_scrub_stripe()
  btrfs: get correct owning_root when dropping snapshot
  btrfs: zoned: wait for data BG to be finished on direct IO allocation
  btrfs: zoned: drop no longer valid write pointer check
  btrfs: directly return 0 on no error code in btrfs_insert_raid_extent()
  btrfs: use u64 for buffer sizes in the tree search ioctls
2023-11-13 09:09:12 -08:00
Naohiro Aota
776a838f1f btrfs: zoned: wait for data BG to be finished on direct IO allocation
Running the fio command below on a ZNS device results in "Resource
temporarily unavailable" error.

  $ sudo fio --name=w --directory=/mnt --filesize=1GB --bs=16MB --numjobs=16 \
        --rw=write --ioengine=libaio --iodepth=128 --direct=1

  fio: io_u error on file /mnt/w.2.0: Resource temporarily unavailable: write offset=117440512, buflen=16777216
  fio: io_u error on file /mnt/w.2.0: Resource temporarily unavailable: write offset=134217728, buflen=16777216
  ...

This happens because -EAGAIN error returned from btrfs_reserve_extent()
called from btrfs_new_extent_direct() is spilling over to the userland.

btrfs_reserve_extent() returns -EAGAIN when there is no active zone
available. Then, the caller should wait for some other on-going IO to
finish a zone and retry the allocation.

This logic is already implemented for buffered write in cow_file_range(),
but it is missing for the direct IO counterpart. Implement the same logic
for it.

Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Fixes: 2ce543f478 ("btrfs: zoned: wait until zone is finished when allocation didn't progress")
CC: stable@vger.kernel.org # 6.1+
Tested-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-03 16:39:01 +01:00
Linus Torvalds
d5acbc60fa for-6.7-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmU/xAEACgkQxWXV+ddt
 WDvYKg//SjTimA5Nins9mb4jdz8n+dDeZnQhKzy3FqInU41EzDRc4WwnEODmDlTa
 AyU9rGB3k0JNSUc075jZFCyLqq/ARiOqRi4x33Gk0ckIlc4X5OgBoqP2XkPh0VlP
 txskLCrmhc3pwyR4ErlFDX2jebIUXfkv39bJuE40grGvUatRe+WNq0ERIrgO8RAr
 Rc3hBotMH8AIqfD1L6j1ZiZIAyrOkT1BJMuqeoq27/gJZn/MRhM9TCrMTzfWGaoW
 SxPrQiCDEN3KECsOY/caroMn3AekDijg/ley1Nf7Z0N6oEV+n4VWWPBFE9HhRz83
 9fIdvSbGjSJF6ekzTjcVXPAbcuKZFzeqOdBRMIW3TIUo7mZQyJTVkMsc1y/NL2Z3
 9DhlRLIzvWJJjt1CEK0u18n5IU+dGngdktbhWWIuIlo8r+G/iKR/7zqU92VfWLHL
 Z7/eh6HgH5zr2bm+yKORbrUjkv4IVhGVarW8D4aM+MCG0lFN2GaPcJCCUrp4n7rZ
 PzpQbxXa38ANBk6hsp4ndS8TJSBL9moY8tumzLcKg97nzNMV6KpBdV/G6/QfRLCN
 3kM6UbwTAkMwGcQS86Mqx6s04ORLnQeD6f7N6X4Ppx0Mi/zkjI2HkRuvQGp12B0v
 iZjCCZAYY2Iu+/TU0GrCXSss/grzIAUPzM9msyV3XGO/VBpwdec=
 =9TVx
 -----END PGP SIGNATURE-----

Merge tag 'for-6.7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs updates from David Sterba:
 "New features:

   - raid-stripe-tree

     New tree for logical file extent mapping where the physical mapping
     may not match on multiple devices. This is now used in zoned mode
     to implement RAID0/RAID1* profiles, but can be used in non-zoned
     mode as well. The support for RAID56 is in development and will
     eventually fix the problems with the current implementation. This
     is a backward incompatible feature and has to be enabled at mkfs
     time.

   - simple quota accounting (squota)

     A simplified mode of qgroup that accounts all space on the initial
     extent owners (a subvolume), the snapshots are then cheap to create
     and delete. The deletion of snapshots in fully accounting qgroups
     is a known CPU/IO performance bottleneck.

     The squota is not suitable for the general use case but works well
     for containers where the original subvolume exists for the whole
     time. This is a backward incompatible feature as it needs extending
     some structures, but can be enabled on an existing filesystem.

   - temporary filesystem fsid (temp_fsid)

     The fsid identifies a filesystem and is hard coded in the
     structures, which disallows mounting the same fsid found on
     different devices.

     For a single device filesystem this is not strictly necessary, a
     new temporary fsid can be generated on mount e.g. after a device is
     cloned. This will be used by Steam Deck for root partition A/B
     testing, or can be used for VM root images.

  Other user visible changes:

   - filesystems with partially finished metadata_uuid conversion cannot
     be mounted anymore and the uuid fixup has to be done by btrfs-progs
     (btrfstune).

  Performance improvements:

   - reduce reservations for checksum deletions (with enabled free space
     tree by factor of 4), on a sample workload on file with many
     extents the deletion time decreased by 12%

   - make extent state merges more efficient during insertions, reduce
     rb-tree iterations (run time of critical functions reduced by 5%)

  Core changes:

   - the integrity check functionality has been removed, this was a
     debugging feature and removal does not affect other integrity
     checks like checksums or tree-checker

   - space reservation changes:

      - more efficient delayed ref reservations, this avoids building up
        too much work or overusing or exhausting the global block
        reserve in some situations

      - move delayed refs reservation to the transaction start time,
        this prevents some ENOSPC corner cases related to exhaustion of
        global reserve

      - improvements in reducing excessive reservations for block group
        items

      - adjust overcommit logic in near full situations, account for one
        more chunk to eventually allocate metadata chunk, this is mostly
        relevant for small filesystems (<10GiB)

   - single device filesystems are scanned but not registered (except
     seed devices), this allows temp_fsid to work

   - qgroup iterations do not need GFP_ATOMIC allocations anymore

   - cleanups, refactoring, reduced data structure size, function
     parameter simplifications, error handling fixes"

* tag 'for-6.7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (156 commits)
  btrfs: open code timespec64 in struct btrfs_inode
  btrfs: remove redundant log root tree index assignment during log sync
  btrfs: remove redundant initialization of variable dirty in btrfs_update_time()
  btrfs: sysfs: show temp_fsid feature
  btrfs: disable the device add feature for temp-fsid
  btrfs: disable the seed feature for temp-fsid
  btrfs: update comment for temp-fsid, fsid, and metadata_uuid
  btrfs: remove pointless empty log context list check when syncing log
  btrfs: update comment for struct btrfs_inode::lock
  btrfs: remove pointless barrier from btrfs_sync_file()
  btrfs: add and use helpers for reading and writing last_trans_committed
  btrfs: add and use helpers for reading and writing fs_info->generation
  btrfs: add and use helpers for reading and writing log_transid
  btrfs: add and use helpers for reading and writing last_log_commit
  btrfs: support cloned-device mount capability
  btrfs: add helper function find_fsid_by_disk
  btrfs: stop reserving excessive space for block group item insertions
  btrfs: stop reserving excessive space for block group item updates
  btrfs: reorder btrfs_inode to fill gaps
  btrfs: open code btrfs_ordered_inode_tree in btrfs_inode
  ...
2023-10-30 10:42:06 -10:00
Jeff Layton
b1c38a1338
btrfs: convert to new timestamp accessors
Convert to using the new inode timestamp accessor functions.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20231004185347.80880-21-jlayton@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-18 13:26:19 +02:00
David Sterba
c6e8f898f5 btrfs: open code timespec64 in struct btrfs_inode
The type of timespec64::tv_nsec is 'unsigned long', while we have only
u32 for on-disk and in-memory. This wastes a few bytes in btrfs_inode.
Add separate members for sec and nsec with the corresponding type width.
This creates a 4 byte hole in btrfs_inode which can be utilized in the
future.

Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:19 +02:00
Colin Ian King
a666ce9bab btrfs: remove redundant initialization of variable dirty in btrfs_update_time()
The variable dirty is initialized with a value that is never read, it
is being re-assigned later on. Remove the redundant initialization.
Cleans up clang scan build warning:

  fs/btrfs/inode.c:5965:7: warning: Value stored to 'dirty' during its
  initialization is never read [deadcode.DeadStores]

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:19 +02:00
Filipe Manana
4a4f8fe2b0 btrfs: add and use helpers for reading and writing fs_info->generation
Currently the generation field of struct btrfs_fs_info is always modified
while holding fs_info->trans_lock locked. Most readers will access this
field without taking that lock but while holding a transaction handle,
which is safe to do due to the transaction life cycle.

However there are other readers that are neither holding the lock nor
holding a transaction handle open:

1) When reading an inode from disk, at btrfs_read_locked_inode();

2) When reading the generation to expose it to sysfs, at
   btrfs_generation_show();

3) Early in the fsync path, at skip_inode_logging();

4) When creating a hole at btrfs_cont_expand(), during write paths,
   truncate and reflinking;

5) In the fs_info ioctl (btrfs_ioctl_fs_info());

6) While mounting the filesystem, in the open_ctree() path. In these
   cases it's safe to directly read fs_info->generation as no one
   can concurrently start a transaction and update fs_info->generation.

In case of the fsync path, races here should be harmless, and in the worst
case they may cause a fsync to log an inode when it's not really needed,
so nothing bad from a functional perspective. In the other cases it's not
so clear if functional problems may arise, though in case 1 rare things
like a load/store tearing [1] may cause the BTRFS_INODE_NEEDS_FULL_SYNC
flag not being set on an inode and therefore result in incorrect logging
later on in case a fsync call is made.

To avoid data race warnings from tools like KCSAN and other issues such
as load and store tearing (amongst others, see [1]), create helpers to
access the generation field of struct btrfs_fs_info using READ_ONCE() and
WRITE_ONCE(), and use these helpers where needed.

[1] https://lwn.net/Articles/793253/

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:17 +02:00
David Sterba
54c6537146 btrfs: open code btrfs_ordered_inode_tree in btrfs_inode
The structure btrfs_ordered_inode_tree is used only in one place, in
btrfs_inode. The structure itself has a 4 byte hole which is wasted
space.

Move the btrfs_ordered_inode_tree members to btrfs_inode with a common
prefix 'ordered_tree_' where the hole can be utilized and shrink inode
size.

Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:16 +02:00
David Sterba
893fe24399 btrfs: change test_range_bit to scan the whole range
The semantics of test_range_bit() with filled == 0 is now in it's own
helper so test_range_bit will check the whole range unconditionally.
The detection logic is flipped and assumes success by default and
catches exceptions.

Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:14 +02:00
David Sterba
99be1a66e1 btrfs: add specific helper for range bit test exists
The existing helper test_range_bit works in two ways, checks if the whole
range contains all the bits, or stop on the first occurrence.  By adding
a specific helper for the latter case, the inner loop can be simplified
and contains fewer conditionals, making it a bit faster.

There's no caller that uses the cached state pointer so this reduces the
argument count further.

Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:14 +02:00
Filipe Manana
0a325e620e btrfs: remove redundant root argument from maybe_insert_hole()
The root argument for maybe_insert_hole() always matches the root of the
given inode, so remove the root argument and get it from the inode
argument.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:12 +02:00
Filipe Manana
04bd8e9410 btrfs: remove redundant root argument from btrfs_delayed_update_inode()
The root argument for btrfs_delayed_update_inode() always matches the root
of the given inode, so remove the root argument and get it from the inode
argument.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:12 +02:00
Filipe Manana
07a274a886 btrfs: remove redundant root argument from btrfs_update_inode_item()
The root argument for btrfs_update_inode_item() always matches the root of
the given inode, so remove the root argument and get it from the inode
argument.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:12 +02:00
Filipe Manana
8b9d032225 btrfs: remove redundant root argument from btrfs_update_inode()
The root argument for btrfs_update_inode() always matches the root of the
given inode, so remove the root argument and get it from the inode
argument.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:12 +02:00
Filipe Manana
0a5d0dc55f btrfs: remove redundant root argument from btrfs_update_inode_fallback()
The root argument for btrfs_update_inode_fallback() always matches the
root of the given inode, so remove the root argument and get it from the
inode argument.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:12 +02:00
Filipe Manana
cddaaacca9 btrfs: remove noinline from btrfs_update_inode()
The noinline attribute of btrfs_update_inode() is pointless as the
function is exported and widely used, so remove it.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:12 +02:00
Filipe Manana
2199cb0f5e btrfs: simplify error check condition at btrfs_dirty_inode()
The following condition at btrfs_dirty_inode() is redundant:

  if (ret && (ret == -ENOSPC || ret == -EDQUOT))

The first check for a non-zero 'ret' value is pointless, we can simplify
this to simply:

  if (ret == -ENOSPC || ret == -EDQUOT)

Not only this makes it easier to read, it also slightly reduces the text
size of the btrfs kernel module:

  $ size fs/btrfs/btrfs.ko.before
     text	   data	    bss	    dec	    hex	filename
  1641400	 168265	  16864	1826529	 1bdee1	fs/btrfs/btrfs.ko.before

  $ size fs/btrfs/btrfs.ko.after
     text	   data	    bss	    dec	    hex	filename
  1641224	 168181	  16864	1826269	 1bdddd	fs/btrfs/btrfs.ko.after

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:12 +02:00
David Sterba
078b8b90b8 btrfs: merge ordered work callbacks in btrfs_work into one
There are two callbacks defined in btrfs_work but only two actually make
use of them, otherwise there are NULLs. We can get rid of the freeing
callback making it a special case of the normal work. This reduces the
size of btrfs_work by 8 bytes, final layout:

struct btrfs_work {
        btrfs_func_t               func;                 /*     0     8 */
        btrfs_ordered_func_t       ordered_func;         /*     8     8 */
        struct work_struct         normal_work;          /*    16    32 */
        struct list_head           ordered_list;         /*    48    16 */
        /* --- cacheline 1 boundary (64 bytes) --- */
        struct btrfs_workqueue *   wq;                   /*    64     8 */
        long unsigned int          flags;                /*    72     8 */

        /* size: 80, cachelines: 2, members: 6 */
        /* last cacheline: 16 bytes */
};

This in turn reduces size of other structures (on a release config):

- async_chunk			 160 ->  152
- async_submit_bio		 152 ->  144
- btrfs_async_delayed_work	 104 ->   96
- btrfs_caching_control		 176 ->  168
- btrfs_delalloc_work		 144 ->  136
- btrfs_fs_info			3608 -> 3600
- btrfs_ordered_extent		 440 ->  424
- btrfs_writepage_fixup		 104 ->   96

Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:10 +02:00
Johannes Thumshirn
02c372e1f0 btrfs: add support for inserting raid stripe extents
Add support for inserting stripe extents into the raid stripe tree on
completion of every write that needs an extra logical-to-physical
translation when using RAID.

Inserting the stripe extents happens after the data I/O has completed,
this is done to

  a) support zone-append and
  b) rule out the possibility of a RAID-write-hole.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:09 +02:00
Filipe Manana
50564b651d btrfs: abort transaction on generation mismatch when marking eb as dirty
When marking an extent buffer as dirty, at btrfs_mark_buffer_dirty(),
we check if its generation matches the running transaction and if not we
just print a warning. Such mismatch is an indicator that something really
went wrong and only printing a warning message (and stack trace) is not
enough to prevent a corruption. Allowing a transaction to commit with such
an extent buffer will trigger an error if we ever try to read it from disk
due to a generation mismatch with its parent generation.

So abort the current transaction with -EUCLEAN if we notice a generation
mismatch. For this we need to pass a transaction handle to
btrfs_mark_buffer_dirty() which is always available except in test code,
in which case we can pass NULL since it operates on dummy extent buffers
and all test roots have a single node/leaf (root node at level 0).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:07 +02:00
David Sterba
9580503bcb btrfs: reformat remaining kdoc style comments
Function name in the comment does not bring much value to code not
exposed as API and we don't stick to the kdoc format anymore. Update
formatting of parameter descriptions.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:04 +02:00
Linus Torvalds
a229cf67ab for-6.6-rc2-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmULIZUACgkQxWXV+ddt
 WDv77Q//ZiKpmevPQmQfUtmV8WwMfD2a9zRlKBpGggwtrD4mf3CYRLnOpTm81MPO
 vFIuYacBn+9UXqp2j/IbvNWfQAPQNVDxSPXx66uba93RJc+bB1J3TydxcEyJ7fr4
 dwhLLk01jttfk0+rnjF34fmXiHSTtI6D2WeaLCzUbaPLw4SZ+ul+GAdeF3P174iO
 OMNBUln7hK00Q7j8kFf4j6SW1yIIKMTl6MfOFJYanIqzx51PYFFVtKwoCr0Vt53v
 ZHbgrK582ZJO6pKF9kJF/1tqrY9/Df8jzgSypK8pew/SukMOrf7iVwrmhietuhKA
 92j5sxKhCRyq6Qg6ZwC0jyk+oMqrT8r+q3r38a5qDJx/9Q279vkXBqQnACfLjmnH
 6+sNdkY5/uBWnDMh/+d6yBtfbdW5DtuET4McYpJt1Nk2St/f3UzPaL4LcNkDXNPk
 t1Q4W4v0KS1V8TbsLfdD629CMghxQNKVs1XqyCAbUq9ub4LE2CtL3lDm730qZoZt
 +LM7+sAxEOJC6yqYfdEbcIc8l27Hl5nZEzamcvMrRz61N85/8Jx4Sq2b6VSE9TCE
 hNEWAL5sOjhuhmUPhatYC+KO1P6NDP+Yg99yZCZIT9s/P1oK5H+aETshWX+lvJ+Q
 Ai+qzKvp2ERHFcE+R5qIXs/uX7azpzjqsRZxY2/zdp70ugQDSXE=
 =0eEg
 -----END PGP SIGNATURE-----

Merge tag 'for-6.6-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "A few more followup fixes to the directory listing.

  People have noticed different behaviour compared to other filesystems
  after changes in 6.5. This is now unified to more "logical" and
  expected behaviour while still within POSIX. And a few more fixes for
  stable.

   - change behaviour of readdir()/rewinddir() when new directory
     entries are created after opendir(), properly tracking the last
     entry

   - fix race in readdir when multiple threads can set the last entry
     index for a directory

  Additionally:

   - use exclusive lock when direct io might need to drop privs and call
     notify_change()

   - don't clear uptodate bit on page after an error, this may lead to a
     deadlock in subpage mode

   - fix waiting pattern when multiple readers block on Merkle tree
     data, switch to folios"

* tag 'for-6.6-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix race between reading a directory and adding entries to it
  btrfs: refresh dir last index during a rewinddir(3) call
  btrfs: set last dir index to the current last index when opening dir
  btrfs: don't clear uptodate on write errors
  btrfs: file_remove_privs needs an exclusive lock in direct io write
  btrfs: convert btrfs_read_merkle_tree_page() to use a folio
2023-09-20 11:03:45 -07:00
Filipe Manana
8e7f82deb0 btrfs: fix race between reading a directory and adding entries to it
When opening a directory (opendir(3)) or rewinding it (rewinddir(3)), we
are not holding the directory's inode locked, and this can result in later
attempting to add two entries to the directory with the same index number,
resulting in a transaction abort, with -EEXIST (-17), when inserting the
second delayed dir index. This results in a trace like the following:

  Sep 11 22:34:59 myhostname kernel: BTRFS error (device dm-3): err add delayed dir index item(name: cockroach-stderr.log) into the insertion tree of the delayed node(root id: 5, inode id: 4539217, errno: -17)
  Sep 11 22:34:59 myhostname kernel: ------------[ cut here ]------------
  Sep 11 22:34:59 myhostname kernel: kernel BUG at fs/btrfs/delayed-inode.c:1504!
  Sep 11 22:34:59 myhostname kernel: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
  Sep 11 22:34:59 myhostname kernel: CPU: 0 PID: 7159 Comm: cockroach Not tainted 6.4.15-200.fc38.x86_64 #1
  Sep 11 22:34:59 myhostname kernel: Hardware name: ASUS ESC500 G3/P9D WS, BIOS 2402 06/27/2018
  Sep 11 22:34:59 myhostname kernel: RIP: 0010:btrfs_insert_delayed_dir_index+0x1da/0x260
  Sep 11 22:34:59 myhostname kernel: Code: eb dd 48 (...)
  Sep 11 22:34:59 myhostname kernel: RSP: 0000:ffffa9980e0fbb28 EFLAGS: 00010282
  Sep 11 22:34:59 myhostname kernel: RAX: 0000000000000000 RBX: ffff8b10b8f4a3c0 RCX: 0000000000000000
  Sep 11 22:34:59 myhostname kernel: RDX: 0000000000000000 RSI: ffff8b177ec21540 RDI: ffff8b177ec21540
  Sep 11 22:34:59 myhostname kernel: RBP: ffff8b110cf80888 R08: 0000000000000000 R09: ffffa9980e0fb938
  Sep 11 22:34:59 myhostname kernel: R10: 0000000000000003 R11: ffffffff86146508 R12: 0000000000000014
  Sep 11 22:34:59 myhostname kernel: R13: ffff8b1131ae5b40 R14: ffff8b10b8f4a418 R15: 00000000ffffffef
  Sep 11 22:34:59 myhostname kernel: FS:  00007fb14a7fe6c0(0000) GS:ffff8b177ec00000(0000) knlGS:0000000000000000
  Sep 11 22:34:59 myhostname kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  Sep 11 22:34:59 myhostname kernel: CR2: 000000c00143d000 CR3: 00000001b3b4e002 CR4: 00000000001706f0
  Sep 11 22:34:59 myhostname kernel: Call Trace:
  Sep 11 22:34:59 myhostname kernel:  <TASK>
  Sep 11 22:34:59 myhostname kernel:  ? die+0x36/0x90
  Sep 11 22:34:59 myhostname kernel:  ? do_trap+0xda/0x100
  Sep 11 22:34:59 myhostname kernel:  ? btrfs_insert_delayed_dir_index+0x1da/0x260
  Sep 11 22:34:59 myhostname kernel:  ? do_error_trap+0x6a/0x90
  Sep 11 22:34:59 myhostname kernel:  ? btrfs_insert_delayed_dir_index+0x1da/0x260
  Sep 11 22:34:59 myhostname kernel:  ? exc_invalid_op+0x50/0x70
  Sep 11 22:34:59 myhostname kernel:  ? btrfs_insert_delayed_dir_index+0x1da/0x260
  Sep 11 22:34:59 myhostname kernel:  ? asm_exc_invalid_op+0x1a/0x20
  Sep 11 22:34:59 myhostname kernel:  ? btrfs_insert_delayed_dir_index+0x1da/0x260
  Sep 11 22:34:59 myhostname kernel:  ? btrfs_insert_delayed_dir_index+0x1da/0x260
  Sep 11 22:34:59 myhostname kernel:  btrfs_insert_dir_item+0x200/0x280
  Sep 11 22:34:59 myhostname kernel:  btrfs_add_link+0xab/0x4f0
  Sep 11 22:34:59 myhostname kernel:  ? ktime_get_real_ts64+0x47/0xe0
  Sep 11 22:34:59 myhostname kernel:  btrfs_create_new_inode+0x7cd/0xa80
  Sep 11 22:34:59 myhostname kernel:  btrfs_symlink+0x190/0x4d0
  Sep 11 22:34:59 myhostname kernel:  ? schedule+0x5e/0xd0
  Sep 11 22:34:59 myhostname kernel:  ? __d_lookup+0x7e/0xc0
  Sep 11 22:34:59 myhostname kernel:  vfs_symlink+0x148/0x1e0
  Sep 11 22:34:59 myhostname kernel:  do_symlinkat+0x130/0x140
  Sep 11 22:34:59 myhostname kernel:  __x64_sys_symlinkat+0x3d/0x50
  Sep 11 22:34:59 myhostname kernel:  do_syscall_64+0x5d/0x90
  Sep 11 22:34:59 myhostname kernel:  ? syscall_exit_to_user_mode+0x2b/0x40
  Sep 11 22:34:59 myhostname kernel:  ? do_syscall_64+0x6c/0x90
  Sep 11 22:34:59 myhostname kernel:  entry_SYSCALL_64_after_hwframe+0x72/0xdc

The race leading to the problem happens like this:

1) Directory inode X is loaded into memory, its ->index_cnt field is
   initialized to (u64)-1 (at btrfs_alloc_inode());

2) Task A is adding a new file to directory X, holding its vfs inode lock,
   and calls btrfs_set_inode_index() to get an index number for the entry.

   Because the inode's index_cnt field is set to (u64)-1 it calls
   btrfs_inode_delayed_dir_index_count() which fails because no dir index
   entries were added yet to the delayed inode and then it calls
   btrfs_set_inode_index_count(). This functions finds the last dir index
   key and then sets index_cnt to that index value + 1. It found that the
   last index key has an offset of 100. However before it assigns a value
   of 101 to index_cnt...

3) Task B calls opendir(3), ending up at btrfs_opendir(), where the VFS
   lock for inode X is not taken, so it calls btrfs_get_dir_last_index()
   and sees index_cnt still with a value of (u64)-1. Because of that it
   calls btrfs_inode_delayed_dir_index_count() which fails since no dir
   index entries were added to the delayed inode yet, and then it also
   calls btrfs_set_inode_index_count(). This also finds that the last
   index key has an offset of 100, and before it assigns the value 101
   to the index_cnt field of inode X...

4) Task A assigns a value of 101 to index_cnt. And then the code flow
   goes to btrfs_set_inode_index() where it increments index_cnt from
   101 to 102. Task A then creates a delayed dir index entry with a
   sequence number of 101 and adds it to the delayed inode;

5) Task B assigns 101 to the index_cnt field of inode X;

6) At some later point when someone tries to add a new entry to the
   directory, btrfs_set_inode_index() will return 101 again and shortly
   after an attempt to add another delayed dir index key with index
   number 101 will fail with -EEXIST resulting in a transaction abort.

Fix this by locking the inode at btrfs_get_dir_last_index(), which is only
only used when opening a directory or attempting to lseek on it.

Reported-by: ken <ken@bllue.org>
Link: https://lore.kernel.org/linux-btrfs/CAE6xmH+Lp=Q=E61bU+v9eWX8gYfLvu6jLYxjxjFpo3zHVPR0EQ@mail.gmail.com/
Reported-by: syzbot+d13490c82ad5353c779d@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/00000000000036e1290603e097e0@google.com/
Fixes: 9b378f6ad4 ("btrfs: fix infinite directory reads")
CC: stable@vger.kernel.org # 6.5+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-09-14 23:24:42 +02:00
Filipe Manana
e60aa5da14 btrfs: refresh dir last index during a rewinddir(3) call
When opening a directory we find what's the index of its last entry and
then store it in the directory's file handle private data (struct
btrfs_file_private::last_index), so that in the case new directory entries
are added to a directory after an opendir(3) call we don't end up in an
infinite loop (see commit 9b378f6ad4 ("btrfs: fix infinite directory
reads")) when calling readdir(3).

However once rewinddir(3) is called, POSIX states [1] that any new
directory entries added after the previous opendir(3) call, must be
returned by subsequent calls to readdir(3):

  "The rewinddir() function shall reset the position of the directory
   stream to which dirp refers to the beginning of the directory.
   It shall also cause the directory stream to refer to the current
   state of the corresponding directory, as a call to opendir() would
   have done."

We currently don't refresh the last_index field of the struct
btrfs_file_private associated to the directory, so after a rewinddir(3)
we are not returning any new entries added after the opendir(3) call.

Fix this by finding the current last index of the directory when llseek
is called against the directory.

This can be reproduced by the following C program provided by Ian Johnson:

   #include <dirent.h>
   #include <stdio.h>

   int main(void) {
     DIR *dir = opendir("test");

     FILE *file;
     file = fopen("test/1", "w");
     fwrite("1", 1, 1, file);
     fclose(file);

     file = fopen("test/2", "w");
     fwrite("2", 1, 1, file);
     fclose(file);

     rewinddir(dir);

     struct dirent *entry;
     while ((entry = readdir(dir))) {
        printf("%s\n", entry->d_name);
     }
     closedir(dir);
     return 0;
   }

Reported-by: Ian Johnson <ian@ianjohnson.dev>
Link: https://lore.kernel.org/linux-btrfs/YR1P0S.NGASEG570GJ8@ianjohnson.dev/
Fixes: 9b378f6ad4 ("btrfs: fix infinite directory reads")
CC: stable@vger.kernel.org # 6.5+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-09-14 23:24:42 +02:00
Filipe Manana
357950361c btrfs: set last dir index to the current last index when opening dir
When opening a directory for reading it, we set the last index where we
stop iteration to the value in struct btrfs_inode::index_cnt. That value
does not match the index of the most recently added directory entry but
it's instead the index number that will be assigned the next directory
entry.

This means that if after the call to opendir(3) new directory entries are
added, a readdir(3) call will return the first new directory entry. This
is fine because POSIX says the following [1]:

  "If a file is removed from or added to the directory after the most
   recent call to opendir() or rewinddir(), whether a subsequent call to
   readdir() returns an entry for that file is unspecified."

For example for the test script from commit 9b378f6ad4 ("btrfs: fix
infinite directory reads"), where we have 2000 files in a directory, ext4
doesn't return any new directory entry after opendir(3), while xfs returns
the first 13 new directory entries added after the opendir(3) call.

If we move to a shorter example with an empty directory when opendir(3) is
called, and 2 files added to the directory after the opendir(3) call, then
readdir(3) on btrfs will return the first file, ext4 and xfs return the 2
files (but in a different order). A test program for this, reported by
Ian Johnson, is the following:

   #include <dirent.h>
   #include <stdio.h>

   int main(void) {
     DIR *dir = opendir("test");

     FILE *file;
     file = fopen("test/1", "w");
     fwrite("1", 1, 1, file);
     fclose(file);

     file = fopen("test/2", "w");
     fwrite("2", 1, 1, file);
     fclose(file);

     struct dirent *entry;
     while ((entry = readdir(dir))) {
        printf("%s\n", entry->d_name);
     }
     closedir(dir);
     return 0;
   }

To make this less odd, change the behaviour to never return new entries
that were added after the opendir(3) call. This is done by setting the
last_index field of the struct btrfs_file_private attached to the
directory's file handle with a value matching btrfs_inode::index_cnt
minus 1, since that value always matches the index of the next new
directory entry and not the index of the most recently added entry.

[1] https://pubs.opengroup.org/onlinepubs/007904875/functions/readdir_r.html

Link: https://lore.kernel.org/linux-btrfs/YR1P0S.NGASEG570GJ8@ianjohnson.dev/
CC: stable@vger.kernel.org # 6.5+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-09-14 23:24:42 +02:00
Josef Bacik
b595d25996 btrfs: don't clear uptodate on write errors
We have been consistently seeing hangs with generic/648 in our subpage
GitHub CI setup.  This is a classic deadlock, we are calling
btrfs_read_folio() on a folio, which requires holding the folio lock on
the folio, and then finding a ordered extent that overlaps that range
and calling btrfs_start_ordered_extent(), which then tries to write out
the dirty page, which requires taking the folio lock and then we
deadlock.

The hang happens because we're writing to range [1271750656, 1271767040),
page index [77621, 77622], and page 77621 is !Uptodate.  It is also Dirty,
so we call btrfs_read_folio() for 77621 and which does
btrfs_lock_and_flush_ordered_range() for that range, and we find an ordered
extent which is [1271644160, 1271746560), page index [77615, 77621].
The page indexes overlap, but the actual bytes don't overlap.  We're
holding the page lock for 77621, then call
btrfs_lock_and_flush_ordered_range() which tries to flush the dirty
page, and tries to lock 77621 again and then we deadlock.

The byte ranges do not overlap, but with subpage support if we clear
uptodate on any portion of the page we mark the entire thing as not
uptodate.

We have been clearing page uptodate on write errors, but no other file
system does this, and is in fact incorrect.  This doesn't hurt us in the
!subpage case because we can't end up with overlapped ranges that don't
also overlap on the page.

Fix this by not clearing uptodate when we have a write error.  The only
thing we should be doing in this case is setting the mapping error and
carrying on.  This makes it so we would no longer call
btrfs_read_folio() on the page as it's uptodate and eliminates the
deadlock.

With this patch we're now able to make it through a full fstests run on
our subpage blocksize VMs.

Note for stable backports: this probably goes beyond 6.1 but the code
has been cleaned up and clearing the uptodate bit must be verified on
each version independently.

CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-09-13 18:41:07 +02:00
Linus Torvalds
547635c6ac for-6.6-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmTskOwACgkQxWXV+ddt
 WDsNJw/8CCi41Z7e3LdJsQd2iy3/+oJZUvIGuT5YvshYxTLCbV7AL+diBPnSQs4Q
 /KFMGL7RZBgJzwVoSQtXnESXXgX8VOVfN1zY//k5g6z7BscCEQd73H/M0B8ciZy/
 aBygm9tJ7EtWbGZWNR8yad8YtOgl6xoClrPnJK/DCLwMGPy2o+fnKP3Y9FOKY5KM
 1Sl0Y4FlJ9dTJpxIwYbx4xmuyHrh2OivjU/KnS9SzQlHu0nl6zsIAE45eKem2/EG
 1figY5aFBYPpPYfopbLDalEBR3bQGiViZVJuNEop3AimdcMOXw9jBF3EZYUb5Tgn
 MleMDgmmjLGOE/txGhvTxKj9kci2aGX+fJn3jXbcIMksAA0OQFLPqzGvEQcrs6Ok
 HA0RsmAkS5fWNDCuuo4ZPXEyUPvluTQizkwyoulOfnK+UPJCWaRqbEBMTsvm6M6X
 wFT2czwLpaEU/W6loIZkISUhfbRqVoA3DfHy398QXNzRhSrg8fQJjma1f7mrHvTi
 CzU+OD5YSC2nXktVOnklyTr0XT+7HF69cumlDbr8TS8u1qu8n1keU/7M3MBB4xZk
 BZFJDz8pnsAqpwVA4T434E/w45MDnYlwBw5r+U8Xjyso8xlau+sYXKcim85vT2Q0
 yx/L91P6tdekR1y97p4aDdxw/PgTzdkNGMnsTBMVzgtCj+5pMmE=
 =N7Yn
 -----END PGP SIGNATURE-----

Merge tag 'for-6.6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs updates from David Sterba:
 "No new features, the bulk of the changes are fixes, refactoring and
  cleanups. The notable fix is the scrub performance restoration after
  rewrite in 6.4, though still only partial.

  Fixes:

   - scrub performance drop due to rewrite in 6.4 partially restored:
      - do IO grouping by blg_plug/blk_unplug again
      - avoid unnecessary tree searches when processing stripes, in
        extent and checksum trees
      - the drop is noticeable on fast PCIe devices, -66% and restored
        to -33% of the original
      - backports to 6.4 planned

   - handle more corner cases of transaction commit during orphan
     cleanup or delayed ref processing

   - use correct fsid/metadata_uuid when validating super block

   - copy directory permissions and time when creating a stub subvolume

  Core:

   - debugging feature integrity checker deprecated, to be removed in
     6.7

   - in zoned mode, zones are activated just before the write, making
     error handling easier, now the overcommit mechanism can be enabled
     again which improves performance by avoiding more frequent flushing

   - v0 extent handling completely removed, deprecated long time ago

   - error handling improvements

   - tests:
      - extent buffer bitmap tests
      - pinned extent splitting tests

   - cleanups and refactoring:
      - compression writeback
      - extent buffer bitmap
      - space flushing, ENOSPC handling"

* tag 'for-6.6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (110 commits)
  btrfs: zoned: skip splitting and logical rewriting on pre-alloc write
  btrfs: tests: test invalid splitting when skipping pinned drop extent_map
  btrfs: tests: add a test for btrfs_add_extent_mapping
  btrfs: tests: add extent_map tests for dropping with odd layouts
  btrfs: scrub: move write back of repaired sectors to scrub_stripe_read_repair_worker()
  btrfs: scrub: don't go ordered workqueue for dev-replace
  btrfs: scrub: fix grouping of read IO
  btrfs: scrub: avoid unnecessary csum tree search preparing stripes
  btrfs: scrub: avoid unnecessary extent tree search preparing stripes
  btrfs: copy dir permission and time when creating a stub subvolume
  btrfs: remove pointless empty list check when reading delayed dir indexes
  btrfs: drop redundant check to use fs_devices::metadata_uuid
  btrfs: compare the correct fsid/metadata_uuid in btrfs_validate_super
  btrfs: use the correct superblock to compare fsid in btrfs_validate_super
  btrfs: simplify memcpy either of metadata_uuid or fsid
  btrfs: add a helper to read the superblock metadata_uuid
  btrfs: remove v0 extent handling
  btrfs: output extra debug info if we failed to find an inline backref
  btrfs: move the !zoned assert into run_delalloc_cow
  btrfs: consolidate the error handling in run_delalloc_nocow
  ...
2023-08-28 12:26:57 -07:00
Linus Torvalds
615e95831e v6.6-vfs.ctime
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZOXTKAAKCRCRxhvAZXjc
 oifJAQCzi/p+AdQu8LA/0XvR7fTwaq64ZDCibU4BISuLGT2kEgEAuGbuoFZa0rs2
 XYD/s4+gi64p9Z01MmXm2XO1pu3GPg0=
 =eJz5
 -----END PGP SIGNATURE-----

Merge tag 'v6.6-vfs.ctime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs timestamp updates from Christian Brauner:
 "This adds VFS support for multi-grain timestamps and converts tmpfs,
  xfs, ext4, and btrfs to use them. This carries acks from all relevant
  filesystems.

  The VFS always uses coarse-grained timestamps when updating the ctime
  and mtime after a change. This has the benefit of allowing filesystems
  to optimize away a lot of metadata updates, down to around 1 per
  jiffy, even when a file is under heavy writes.

  Unfortunately, this has always been an issue when we're exporting via
  NFSv3, which relies on timestamps to validate caches. A lot of changes
  can happen in a jiffy, so timestamps aren't sufficient to help the
  client decide to invalidate the cache.

  Even with NFSv4, a lot of exported filesystems don't properly support
  a change attribute and are subject to the same problems with timestamp
  granularity. Other applications have similar issues with timestamps
  (e.g., backup applications).

  If we were to always use fine-grained timestamps, that would improve
  the situation, but that becomes rather expensive, as the underlying
  filesystem would have to log a lot more metadata updates.

  This introduces fine-grained timestamps that are used when they are
  actively queried.

  This uses the 31st bit of the ctime tv_nsec field to indicate that
  something has queried the inode for the mtime or ctime. When this flag
  is set, on the next mtime or ctime update, the kernel will fetch a
  fine-grained timestamp instead of the usual coarse-grained one.

  As POSIX generally mandates that when the mtime changes, the ctime
  must also change the kernel always stores normalized ctime values, so
  only the first 30 bits of the tv_nsec field are ever used.

  Filesytems can opt into this behavior by setting the FS_MGTIME flag in
  the fstype. Filesystems that don't set this flag will continue to use
  coarse-grained timestamps.

  Various preparatory changes, fixes and cleanups are included:

   - Fixup all relevant places where POSIX requires updating ctime
     together with mtime. This is a wide-range of places and all
     maintainers provided necessary Acks.

   - Add new accessors for inode->i_ctime directly and change all
     callers to rely on them. Plain accesses to inode->i_ctime are now
     gone and it is accordingly rename to inode->__i_ctime and commented
     as requiring accessors.

   - Extend generic_fillattr() to pass in a request mask mirroring in a
     sense the statx() uapi. This allows callers to pass in a request
     mask to only get a subset of attributes filled in.

   - Rework timestamp updates so it's possible to drop the @now
     parameter the update_time() inode operation and associated helpers.

   - Add inode_update_timestamps() and convert all filesystems to it
     removing a bunch of open-coding"

* tag 'v6.6-vfs.ctime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (107 commits)
  btrfs: convert to multigrain timestamps
  ext4: switch to multigrain timestamps
  xfs: switch to multigrain timestamps
  tmpfs: add support for multigrain timestamps
  fs: add infrastructure for multigrain timestamps
  fs: drop the timespec64 argument from update_time
  xfs: have xfs_vn_update_time gets its own timestamp
  fat: make fat_update_time get its own timestamp
  fat: remove i_version handling from fat_update_time
  ubifs: have ubifs_update_time use inode_update_timestamps
  btrfs: have it use inode_update_timestamps
  fs: drop the timespec64 arg from generic_update_time
  fs: pass the request_mask to generic_fillattr
  fs: remove silly warning from current_time
  gfs2: fix timestamp handling on quota inodes
  fs: rename i_ctime field to __i_ctime
  selinux: convert to ctime accessor functions
  security: convert to ctime accessor functions
  apparmor: convert to ctime accessor functions
  sunrpc: convert to ctime accessor functions
  ...
2023-08-28 09:31:32 -07:00
Lee Trager
94628ad944 btrfs: copy dir permission and time when creating a stub subvolume
btrfs supports creating nested subvolumes however snapshots are not
recursive.  When a snapshot is taken of a volume which contains a
subvolume the subvolume is replaced with a stub subvolume which has the
same name and uses inode number 2[1]. The stub subvolume kept the
directory name but did not set the time or permissions of the stub
subvolume. This resulted in all time information being the current time
and ownership defaulting to root. When subvolumes and snapshots are
created using unshare this results in a snapshot directory the user
created but has no permissions for.

Test case:

  [vmuser@archvm ~]# sudo -i
  [root@archvm ~]# mkdir -p /mnt/btrfs/test
  [root@archvm ~]# chown vmuser:users /mnt/btrfs/test/
  [root@archvm ~]# exit
  logout
  [vmuser@archvm ~]$ cd /mnt/btrfs/test
  [vmuser@archvm test]$ unshare --user --keep-caps --map-auto --map-root-user
  [root@archvm test]# btrfs subvolume create subvolume
  Create subvolume './subvolume'
  [root@archvm test]# btrfs subvolume create subvolume/subsubvolume
  Create subvolume 'subvolume/subsubvolume'
  [root@archvm test]# btrfs subvolume snapshot subvolume snapshot
  Create a snapshot of 'subvolume' in './snapshot'
  [root@archvm test]# exit
  logout
  [vmuser@archvm test]$ tree -ug
  [vmuser   users   ]  .
  ├── [vmuser   users   ]  snapshot
  │   └── [vmuser   users   ]  subsubvolume  <-- Without patch perm is root:root
  └── [vmuser   users   ]  subvolume
      └── [vmuser   users   ]  subsubvolume

  5 directories, 0 files

[1] https://btrfs.readthedocs.io/en/latest/btrfs-subvolume.html#nested-subvolumes

Signed-off-by: Lee Trager <lee@trager.us>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21 14:54:48 +02:00
Christoph Hellwig
76c5126e76 btrfs: move the !zoned assert into run_delalloc_cow
Having the assert in the actual helper documents the pre-conditions
much better than having it in the caller, so move it.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21 14:54:47 +02:00
Christoph Hellwig
38dc88890d btrfs: consolidate the error handling in run_delalloc_nocow
Share the calls to extent_clear_unlock_delalloc for btrfs_path allocation
failure handling and the normal exit path.

This relies on btrfs_free_path ignoring a NULL pointer, and the
initialization of cur_offset to start at the beginning of the function.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21 14:54:47 +02:00
Christoph Hellwig
18f62b86c4 btrfs: cleanup the COW fallback logic in run_delalloc_nocow
Use the block group pointer used to track the outstanding NOCOW writes as
a boolean to remove the duplicate nocow variable, and keep it contained
in the main loop to simplify the logic.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21 14:54:47 +02:00
Christoph Hellwig
953fa5ced5 btrfs: fix error handling when in a COW window in run_delalloc_nocow
When run_delalloc_nocow has cow_start set to a value other than (u64)-1,
it has delayed COW writeback pending behind cur_offset.  When an error
occurs in such a window, the range going back to cow_start and not just
cur_offset needs to be unlocked, but only two error cases handle this
correctly  Move the code to handle unlock the COW range to the common
error handling label and document the logic.

To make things even more complicated, cow_file_range as called by
fallback_to_cow will unlock the range it is operating on when it fails as
well, so we need to reset cow_start right after caling fallback_to_cow
instead of only when it succeeded.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21 14:54:47 +02:00
Ruan Jinjie
84af994b85 btrfs: use LIST_HEAD() to initialize the list_head
Use LIST_HEAD() to initialize the list_head instead of open-coding it.

Signed-off-by: Ruan Jinjie <ruanjinjie@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21 14:54:46 +02:00
Filipe Manana
a7f8de500e btrfs: return real error when orphan cleanup fails due to a transaction abort
During mount we will call btrfs_orphan_cleanup() to remove any inodes that
were previously deleted (have a link count of 0) but for which we were not
able before to remove their items from the subvolume tree. The removal of
the items will happen by triggering eviction, when we do the final iput()
on them at btrfs_orphan_cleanup(), which will end in the loop at
btrfs_evict_inode() that truncates inode items.

In a dire situation we may have a transaction abort due to -ENOSPC when
attempting to truncate the inode items, and in that case the orphan item
(key type BTRFS_ORPHAN_ITEM_KEY) will remain in the subvolume tree and
when we hit the next iteration of the while loop at btrfs_orphan_cleanup()
we will find the same orphan item as before, and then we will return
-EINVAL from btrfs_orphan_cleanup() through the following if statement:

    if (found_key.offset == last_objectid) {
       btrfs_err(fs_info,
                 "Error removing orphan entry, stopping orphan cleanup");
       ret = -EINVAL;
       goto out;
    }

This makes the mount operation fail with -EINVAL, when it should have been
-ENOSPC. This is confusing because -EINVAL might lead a user into thinking
it provided invalid mount options for example.

An example where this happens:

   $ mount test.img /mnt
   mount: /mnt: wrong fs type, bad option, bad superblock on /dev/loop0, missing codepage or helper program, or other error.

   $ dmesg
   [ 2542.356934] BTRFS: device fsid 977fff75-1181-4d2b-a739-384fa710d16e devid 1 transid 47409973 /dev/loop0 scanned by mount (4459)
   [ 2542.357451] BTRFS info (device loop0): using crc32c (crc32c-intel) checksum algorithm
   [ 2542.357461] BTRFS info (device loop0): disk space caching is enabled
   [ 2542.742287] BTRFS info (device loop0): auto enabling async discard
   [ 2542.764554] BTRFS info (device loop0): checking UUID tree
   [ 2551.743065] ------------[ cut here ]------------
   [ 2551.743068] BTRFS: Transaction aborted (error -28)
   [ 2551.743149] WARNING: CPU: 7 PID: 215 at fs/btrfs/block-group.c:3494 btrfs_write_dirty_block_groups+0x397/0x3d0 [btrfs]
   [ 2551.743311] Modules linked in: btrfs blake2b_generic (...)
   [ 2551.743353] CPU: 7 PID: 215 Comm: kworker/u24:5 Not tainted 6.4.0-rc6-btrfs-next-134+ #1
   [ 2551.743356] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
   [ 2551.743357] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
   [ 2551.743405] RIP: 0010:btrfs_write_dirty_block_groups+0x397/0x3d0 [btrfs]
   [ 2551.743449] Code: 8b 43 0c (...)
   [ 2551.743451] RSP: 0018:ffff982c005a7c40 EFLAGS: 00010286
   [ 2551.743452] RAX: 0000000000000000 RBX: ffff88fc6e44b400 RCX: 0000000000000000
   [ 2551.743453] RDX: 0000000000000002 RSI: ffffffff8dff0878 RDI: 00000000ffffffff
   [ 2551.743454] RBP: ffff88fc51817208 R08: 0000000000000000 R09: ffff982c005a7ae0
   [ 2551.743455] R10: 0000000000000001 R11: 0000000000000001 R12: ffff88fc43d2e570
   [ 2551.743456] R13: ffff88fc43d2e400 R14: ffff88fc8fb08ee0 R15: ffff88fc6e44b530
   [ 2551.743457] FS:  0000000000000000(0000) GS:ffff89035fbc0000(0000) knlGS:0000000000000000
   [ 2551.743458] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
   [ 2551.743459] CR2: 00007fa8cdf2f6f4 CR3: 0000000124850003 CR4: 0000000000370ee0
   [ 2551.743462] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
   [ 2551.743463] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
   [ 2551.743464] Call Trace:
   [ 2551.743472]  <TASK>
   [ 2551.743474]  ? __warn+0x80/0x130
   [ 2551.743478]  ? btrfs_write_dirty_block_groups+0x397/0x3d0 [btrfs]
   [ 2551.743520]  ? report_bug+0x1f4/0x200
   [ 2551.743523]  ? handle_bug+0x42/0x70
   [ 2551.743526]  ? exc_invalid_op+0x14/0x70
   [ 2551.743528]  ? asm_exc_invalid_op+0x16/0x20
   [ 2551.743532]  ? btrfs_write_dirty_block_groups+0x397/0x3d0 [btrfs]
   [ 2551.743574]  ? _raw_spin_unlock+0x15/0x30
   [ 2551.743576]  ? btrfs_run_delayed_refs+0x1bd/0x200 [btrfs]
   [ 2551.743609]  commit_cowonly_roots+0x1e9/0x260 [btrfs]
   [ 2551.743652]  btrfs_commit_transaction+0x42e/0xfa0 [btrfs]
   [ 2551.743693]  ? __pfx_autoremove_wake_function+0x10/0x10
   [ 2551.743697]  flush_space+0xf1/0x5d0 [btrfs]
   [ 2551.743743]  ? _raw_spin_unlock+0x15/0x30
   [ 2551.743745]  ? finish_task_switch+0x91/0x2a0
   [ 2551.743748]  ? _raw_spin_unlock+0x15/0x30
   [ 2551.743750]  ? btrfs_get_alloc_profile+0xc9/0x1f0 [btrfs]
   [ 2551.743793]  btrfs_async_reclaim_metadata_space+0xe1/0x230 [btrfs]
   [ 2551.743837]  process_one_work+0x1d9/0x3e0
   [ 2551.743844]  worker_thread+0x4a/0x3b0
   [ 2551.743847]  ? __pfx_worker_thread+0x10/0x10
   [ 2551.743849]  kthread+0xee/0x120
   [ 2551.743852]  ? __pfx_kthread+0x10/0x10
   [ 2551.743854]  ret_from_fork+0x29/0x50
   [ 2551.743860]  </TASK>
   [ 2551.743861] ---[ end trace 0000000000000000 ]---
   [ 2551.743863] BTRFS info (device loop0: state A): dumping space info:
   [ 2551.743866] BTRFS info (device loop0: state A): space_info DATA has 126976 free, is full
   [ 2551.743868] BTRFS info (device loop0: state A): space_info total=13458472960, used=13458137088, pinned=143360, reserved=0, may_use=0, readonly=65536 zone_unusable=0
   [ 2551.743870] BTRFS info (device loop0: state A): space_info METADATA has -51625984 free, is full
   [ 2551.743872] BTRFS info (device loop0: state A): space_info total=771751936, used=770146304, pinned=1605632, reserved=0, may_use=51625984, readonly=0 zone_unusable=0
   [ 2551.743874] BTRFS info (device loop0: state A): space_info SYSTEM has 14663680 free, is not full
   [ 2551.743875] BTRFS info (device loop0: state A): space_info total=14680064, used=16384, pinned=0, reserved=0, may_use=0, readonly=0 zone_unusable=0
   [ 2551.743877] BTRFS info (device loop0: state A): global_block_rsv: size 53231616 reserved 51544064
   [ 2551.743878] BTRFS info (device loop0: state A): trans_block_rsv: size 0 reserved 0
   [ 2551.743879] BTRFS info (device loop0: state A): chunk_block_rsv: size 0 reserved 0
   [ 2551.743880] BTRFS info (device loop0: state A): delayed_block_rsv: size 0 reserved 0
   [ 2551.743881] BTRFS info (device loop0: state A): delayed_refs_rsv: size 786432 reserved 0
   [ 2551.743886] BTRFS: error (device loop0: state A) in btrfs_write_dirty_block_groups:3494: errno=-28 No space left
   [ 2551.743911] BTRFS info (device loop0: state EA): forced readonly
   [ 2551.743951] BTRFS warning (device loop0: state EA): could not allocate space for delete; will truncate on mount
   [ 2551.743962] BTRFS error (device loop0: state EA): Error removing orphan entry, stopping orphan cleanup
   [ 2551.743973] BTRFS warning (device loop0: state EA): Skipping commit of aborted transaction.
   [ 2551.743989] BTRFS error (device loop0: state EA): could not do orphan cleanup -22

So make the btrfs_orphan_cleanup() return the value of BTRFS_FS_ERROR(),
if it's set, and -EINVAL otherwise.

For that same example, after this change, the mount operation fails with
-ENOSPC:

   $ mount test.img /mnt
   mount: /mnt: mount(2) system call failed: No space left on device.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21 14:52:18 +02:00
Christoph Hellwig
256b0cf90d btrfs: fix zoned handling in submit_uncompressed_range
For zoned file systems we need to use run_delalloc_zoned to submit
writeback, as we need to write out partial allocations when running into
zone active limits.

submit_uncompressed_range currently always calls cow_file_range to
allocate blocks and thus misses the active zone limits handling.  Fix
this by passing the pages_dirty argument to run_delalloc_zoned and always
using it from submit_uncompressed_range as it does the right thing for
zoned and non-zoned file systems.

To account for the fact that run_delalloc_zoned is now also used for
non-zoned file systems rename it to run_delalloc_cow, and add comment
describing it.

Fixes: 42c0110009 ("btrfs: zoned: introduce dedicated data write path for zoned filesystems")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21 14:52:16 +02:00
Christoph Hellwig
778b878543 btrfs: don't redirty locked_page in run_delalloc_zoned
extent_write_locked_range currently expects that either all or no
pages are dirty when it is called.  Bur run_delalloc_zoned is called
directly in the writepages path, and has the dirty bit cleared only
for locked_page and which the extent_write_cache_pages currently
operates.  It currently works around this by redirtying locked_page,
but that is a bit inefficient and cumbersome.  Pass a locked_page
argument to run_delalloc_zoned so that clearing the dirty bit can
be skipped on just that page.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21 14:52:16 +02:00
Christoph Hellwig
6e144bf16b btrfs: refactor the zoned device handling in cow_file_range
Handling of the done_offset to cow_file_range is a bit confusing, as
it is not updated at all when the function succeeds, and the -EAGAIN
status is used bother for the case where we need to wait for a zone
finish and the one where the allocation was partially successful.

Change the calling convention so that done_offset is always updated,
and 0 is returned if some allocation was successful (partial allocation
can still only happen for zoned devices), and waiting for a zone
finish is done internally in cow_file_range instead of the caller.

Also write a comment explaining the logic.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21 14:52:16 +02:00
Christoph Hellwig
44962ca37c btrfs: don't redirty pages in compress_file_range
compress_file_range needs to clear the dirty bit before handing off work
to the compression worker threads to prevent processes coming in through
mmap and changing the file contents while the compression is accessing
the data (See commit 4adaa61102 ("Btrfs: fix race between mmap writes
and compression").

But when compress_file_range decides to not compress the data, it falls
back to submit_uncompressed_range which uses extent_write_locked_range
to write the uncompressed data.  extent_write_locked_range currently
expects all pages to be marked dirty so that it can clear the dirty
bit itself, and thus compress_file_range has to redirty the page range.

Redirtying the page range is rather inefficient and also pointless,
so instead pass a pages_dirty parameter to extent_write_locked_range
and skip the redirty game entirely.

Note that compress_file_range was even redirtying the locked_page twice
given that extent_range_clear_dirty_for_io already redirties all pages
in the range, which must include locked_page if there is one.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21 14:52:15 +02:00
Christoph Hellwig
f778b6b8e0 btrfs: share the code to free the page array in compress_file_range
compress_file_range has two code blocks to free the page array for the
compressed data.  Share the code using a goto label.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21 14:52:15 +02:00
Christoph Hellwig
184aa1ffa5 btrfs: use a separate label for the incompressible case in compress_file_range
compress_file_range can fail to compress either because of resource or
alignment constraints or because the data is incompressible.  In the latter
case the inode is marked so that compression isn't tried again.  Currently
that check is based on the condition that the pages array has been allocated
which is rather cryptic.  Use a separate label to clearly distinguish this
case.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21 14:52:15 +02:00
Christoph Hellwig
6a7167bf9c btrfs: further simplify the compress or not logic in compress_file_range
Currently the logic whether to compress or not in compress_file_range is
a bit convoluted because it tries to share code for creating inline
extents for the compressible [1] path and the bail to uncompressed path.

But the latter isn't needed at all, because cow_file_range as called by
submit_uncompressed_range will already create inline extents as needed,
so there is no need to have special handling for it if we can live with
the fact that it will be called a bit later in the ->ordered_func of the
workqueue instead of right now.

[1] there is undocumented logic that creates an uncompressed inline
extent outside of the shall not compress logic if total_in is too small.
This logic isn't explained in comments or any commit log I could find,
so I've preserved it.  Documentation explaining it would be appreciated
if anyone understands this code.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21 14:52:15 +02:00
Christoph Hellwig
e94e54e89b btrfs: streamline compress_file_range
Reorder compress_file_range so that the main compression flow happens
straight line and not in branches.  To do this ensure that pages is
always zeroed before a page allocation happens, which allows the
cleanup_and_bail_uncompressed label to clean up the page allocations
as needed.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21 14:52:15 +02:00
Christoph Hellwig
00d31d1766 btrfs: merge submit_compressed_extents and async_cow_submit
The code in submit_compressed_extents just loops over the async_extents,
and doesn't need to be conditional on an inode being present, as there
won't be any async_extent in the list if we created and inline extent.
Merge the two functions to simplify the logic.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21 14:52:15 +02:00
Christoph Hellwig
c15d8cf295 btrfs: merge async_cow_start and compress_file_range
There is no good reason to have the simple async_cow_start wrapper,
merge the argument conversion into the main compress_file_range function.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21 14:52:15 +02:00
Christoph Hellwig
3134508e47 btrfs: don't clear async_chunk->inode in async_cow_start
Now that the ->inode check isn't needed in submit_compressed_extents
any more, there is no reason to clear the field early.  Always keep
the inode around until the work item is finished and remove the special
casing, and the counting of compressed extents in compress_file_range.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21 14:52:15 +02:00
Christoph Hellwig
6758346808 btrfs: clean up the check for uncompressed ranges in submit_one_async_extent
Instead of checking for a NULL !pages and explaining this with a cryptic
comment, just check the compression type for BTRFS_COMPRESS_NONE to make
the check self-explanatory.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21 14:52:15 +02:00
Christoph Hellwig
c56cbe9059 btrfs: reduce the number of arguments to btrfs_run_delalloc_range
Instead of a separate page_started argument that tells the callers that
btrfs_run_delalloc_range already started writeback by itself, overload
the return value with a positive 1 in additio to 0 and a negative error
code to indicate that is has already started writeback, and remove the
nr_written argument as that caller can calculate it directly based on
the range, and in fact already does so for the case where writeback
wasn't started yet.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21 14:52:14 +02:00
Christoph Hellwig
ff20d6a4a9 btrfs: remove the return value from submit_uncompressed_range
The return value from submit_uncompressed_range is ignored, and that's
fine because the error reporting happens through the mapping and
ordered_extent.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21 14:52:14 +02:00
Christoph Hellwig
84f262f009 btrfs: reduce debug spam from submit_compressed_extents
Move the printk that is supposed to help to debug failures in
submit_one_async_extent into submit_one_async_extent and make it
coniditonal on actually having an error condition instead of spamming
the log unconditionally.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21 14:52:14 +02:00
Christoph Hellwig
9783e4deed btrfs: remove end_extent_writepage
end_extent_writepage is a small helper that combines a call to
btrfs_mark_ordered_io_finished with conditional error-only calls to
btrfs_page_clear_uptodate and mapping_set_error with a somewhat
unfortunate calling convention that passes and inclusive end instead
of the len expected by the underlying functions.

Remove end_extent_writepage and open code it in the 4 callers. Out
of those two already are error-only and thus don't need the extra
conditional, and one already has the mapping_set_error, so a duplicate
call can be avoided.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21 14:52:14 +02:00
Christoph Hellwig
6648cedd86 btrfs: remove btrfs_writepage_endio_finish_ordered
btrfs_writepage_endio_finish_ordered is a small wrapper around
btrfs_mark_ordered_io_finished that just changs the argument passing
slightly, and adds a tracepoint.

Move the tracpoint to btrfs_mark_ordered_io_finished, which means
it now also covers the error handling in btrfs_cleanup_ordered_extent
and switch all callers to just call btrfs_mark_ordered_io_finished
directly.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21 14:52:14 +02:00
Christoph Hellwig
53ffb30a78 btrfs: don't create inline extents in fallback_to_cow
For NOCOW files, run_delalloc_nocow can still fall back to COW
allocations when required and calls to fallback_to_cow helper for
that.  For such an allocation we can have multiple ordered_extents
for existing extents that NOCOW overwrites and new allocations that
fallback_to_cow creates.  If one of the new extents is an inline
extent, the writepages could would have to avoid normal page writeback
for them as indicated by the page_started return argument, which
run_delalloc_nocow can't return.   Fix this by never creating inline
extents from fallback_to_cow.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21 14:52:14 +02:00
Christoph Hellwig
ba9145add5 btrfs: pass a flags argument to cow_file_range
The int used as bool unlock is not a very good way to describe the
behavior, and the next patch will have to add another behavior modifier.
We'll do that by two bool parameters instead of adding bit flags.  Now
specifies that the pages should always be kept locked.  This is the
inverse of the old unlock argument.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
[ switch flags to bool ]
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21 14:52:14 +02:00
Boris Burkov
e28b02118b btrfs: free qgroup rsv on io failure
If we do a write whose bio suffers an error, we will never reclaim the
qgroup reserved space for it. We allocate the space in the write_iter
codepath, then release the reservation as we allocate the ordered
extent, but we only create a delayed ref if the ordered extent finishes.
If it has an error, we simply leak the rsv. This is apparent in running
any error injecting (dmerror) fstests like btrfs/146 or btrfs/160. Such
tests fail due to dmesg on umount complaining about the leaked qgroup
data space.

When we clean up other aspects of space on failed ordered_extents, also
free the qgroup rsv.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
CC: stable@vger.kernel.org # 5.10+
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21 14:52:13 +02:00
Goldwyn Rodrigues
75d305c55b btrfs: remove duplicate free_async_extent_pages() on reservation error
While performing compressed writes, if the extent reservation fails, the
async extent pages are first freed in the error check for return value
ret, and then again at out_free label.

Remove the first call to free_async_extent_pages().

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21 14:52:13 +02:00
Filipe Manana
9b378f6ad4 btrfs: fix infinite directory reads
The readdir implementation currently processes always up to the last index
it finds. This however can result in an infinite loop if the directory has
a large number of entries such that they won't all fit in the given buffer
passed to the readdir callback, that is, dir_emit() returns a non-zero
value. Because in that case readdir() will be called again and if in the
meanwhile new directory entries were added and we still can't put all the
remaining entries in the buffer, we keep repeating this over and over.

The following C program and test script reproduce the problem:

  $ cat /mnt/readdir_prog.c
  #include <sys/types.h>
  #include <dirent.h>
  #include <stdio.h>

  int main(int argc, char *argv[])
  {
    DIR *dir = opendir(".");
    struct dirent *dd;

    while ((dd = readdir(dir))) {
      printf("%s\n", dd->d_name);
      rename(dd->d_name, "TEMPFILE");
      rename("TEMPFILE", dd->d_name);
    }
    closedir(dir);
  }

  $ gcc -o /mnt/readdir_prog /mnt/readdir_prog.c

  $ cat test.sh
  #!/bin/bash

  DEV=/dev/sdi
  MNT=/mnt/sdi

  mkfs.btrfs -f $DEV &> /dev/null
  #mkfs.xfs -f $DEV &> /dev/null
  #mkfs.ext4 -F $DEV &> /dev/null

  mount $DEV $MNT

  mkdir $MNT/testdir
  for ((i = 1; i <= 2000; i++)); do
      echo -n > $MNT/testdir/file_$i
  done

  cd $MNT/testdir
  /mnt/readdir_prog

  cd /mnt

  umount $MNT

This behaviour is surprising to applications and it's unlike ext4, xfs,
tmpfs, vfat and other filesystems, which always finish. In this case where
new entries were added due to renames, some file names may be reported
more than once, but this varies according to each filesystem - for example
ext4 never reported the same file more than once while xfs reports the
first 13 file names twice.

So change our readdir implementation to track the last index number when
opendir() is called and then make readdir() never process beyond that
index number. This gives the same behaviour as ext4.

Reported-by: Rob Landley <rob@landley.net>
Link: https://lore.kernel.org/linux-btrfs/2c8c55ec-04c6-e0dc-9c5c-8c7924778c35@landley.net/
Link: https://bugzilla.kernel.org/show_bug.cgi?id=217681
CC: stable@vger.kernel.org # 6.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-14 16:17:37 +02:00
Jeff Layton
913e99287b
fs: drop the timespec64 argument from update_time
Now that all of the update_time operations are prepared for it, we can
drop the timespec64 argument from the update_time operation. Do that and
remove it from some associated functions like inode_update_time and
inode_needs_update_time.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Message-Id: <20230807-mgctime-v7-8-d1dec143a704@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-11 09:04:57 +02:00
Christoph Hellwig
12b2d64e59 btrfs: properly clear end of the unreserved range in cow_file_range
When the call to btrfs_reloc_clone_csums in cow_file_range returns an
error, we jump to the out_unlock label with the extent_reserved variable
set to false.   The cleanup at the label will then call
extent_clear_unlock_delalloc on the range from start to end.  But we've
already added cur_alloc_size to start before the jump, so there might no
range be left from the newly incremented start to end.  Move the check for
'start < end' so that it is reached by also for the !extent_reserved case.

CC: stable@vger.kernel.org # 6.1+
Fixes: a315e68f6e ("Btrfs: fix invalid attempt to free reserved space on failure to cow range")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-10 17:07:10 +02:00
Jeff Layton
bb7cc0a62e btrfs: have it use inode_update_timestamps
In later patches, we're going to drop the "now" argument from the
update_time operation. Have btrfs_update_time use the new
inode_update_timestamps helper to fetch a new timestamp and update it
properly.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Message-Id: <20230807-mgctime-v7-4-d1dec143a704@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-09 08:56:37 +02:00
Jeff Layton
0d72b92883 fs: pass the request_mask to generic_fillattr
generic_fillattr just fills in the entire stat struct indiscriminately
today, copying data from the inode. There is at least one attribute
(STATX_CHANGE_COOKIE) that can have side effects when it is reported,
and we're looking at adding more with the addition of multigrain
timestamps.

Add a request_mask argument to generic_fillattr and have most callers
just pass in the value that is passed to getattr. Have other callers
(e.g. ksmbd) just pass in STATX_BASIC_STATS. Also move the setting of
STATX_CHANGE_COOKIE into generic_fillattr.

Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: "Paulo Alcantara (SUSE)" <pc@manguebit.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Message-Id: <20230807-mgctime-v7-2-d1dec143a704@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-09 08:56:36 +02:00
Christoph Hellwig
7cad645ebf btrfs: fix ordered extent split error handling in btrfs_dio_submit_io
When the call to btrfs_extract_ordered_extent in btrfs_dio_submit_io
fails to allocate memory for a new ordered_extent, it calls into the
btrfs_dio_end_io for error handling.  btrfs_dio_end_io then assumes that
bbio->ordered is set because it is supposed to be at this point, except
for this error handling corner case.  Try to not overload the
btrfs_dio_end_io with error handling of a bio in a non-canonical state,
and instead call btrfs_finish_ordered_extent and iomap_dio_bio_end_io
directly for this error case.

Reported-by: syzbot <syzbot+5b82f0e951f8c2bcdb8f@syzkaller.appspotmail.com>
Fixes: b41b6f6937 ("btrfs: use btrfs_finish_ordered_extent to complete direct writes")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Tested-by: syzbot <syzbot+5b82f0e951f8c2bcdb8f@syzkaller.appspotmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-07-18 03:14:01 +02:00
Josef Bacik
17b17fcd6d btrfs: set_page_extent_mapped after read_folio in btrfs_cont_expand
While trying to get the subpage blocksize tests running, I hit the
following panic on generic/476

  assertion failed: PagePrivate(page) && page->private, in fs/btrfs/subpage.c:229
  kernel BUG at fs/btrfs/subpage.c:229!
  Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
  CPU: 1 PID: 1453 Comm: fsstress Not tainted 6.4.0-rc7+ #12
  Hardware name: QEMU KVM Virtual Machine, BIOS edk2-20230301gitf80f052277c8-26.fc38 03/01/2023
  pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
  pc : btrfs_subpage_assert+0xbc/0xf0
  lr : btrfs_subpage_assert+0xbc/0xf0
  Call trace:
   btrfs_subpage_assert+0xbc/0xf0
   btrfs_subpage_clear_checked+0x38/0xc0
   btrfs_page_clear_checked+0x48/0x98
   btrfs_truncate_block+0x5d0/0x6a8
   btrfs_cont_expand+0x5c/0x528
   btrfs_write_check.isra.0+0xf8/0x150
   btrfs_buffered_write+0xb4/0x760
   btrfs_do_write_iter+0x2f8/0x4b0
   btrfs_file_write_iter+0x1c/0x30
   do_iter_readv_writev+0xc8/0x158
   do_iter_write+0x9c/0x210
   vfs_iter_write+0x24/0x40
   iter_file_splice_write+0x224/0x390
   direct_splice_actor+0x38/0x68
   splice_direct_to_actor+0x12c/0x260
   do_splice_direct+0x90/0xe8
   generic_copy_file_range+0x50/0x90
   vfs_copy_file_range+0x29c/0x470
   __arm64_sys_copy_file_range+0xcc/0x498
   invoke_syscall.constprop.0+0x80/0xd8
   do_el0_svc+0x6c/0x168
   el0_svc+0x50/0x1b0
   el0t_64_sync_handler+0x114/0x120
   el0t_64_sync+0x194/0x198

This happens because during btrfs_cont_expand we'll get a page, set it
as mapped, and if it's not Uptodate we'll read it.  However between the
read and re-locking the page we could have called release_folio() on the
page, but left the page in the file mapping.  release_folio() can clear
the page private, and thus further down we blow up when we go to modify
the subpage bits.

Fix this by putting the set_page_extent_mapped() after the read.  This
is safe because read_folio() will call set_page_extent_mapped() before
it does the read, and then if we clear page private but leave it on the
mapping we're completely safe re-setting set_page_extent_mapped().  With
this patch I can now run generic/476 without panicing.

CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-07-18 03:13:58 +02:00
Filipe Manana
866e98a4d9 btrfs: use irq safe locking when running and adding delayed iputs
Running delayed iputs, which never happens in an irq context, needs to
lock the spinlock fs_info->delayed_iput_lock. When finishing bios for
data writes (irq context, bio.c) we call btrfs_put_ordered_extent() which
needs to add a delayed iput and for that it needs to acquire the spinlock
fs_info->delayed_iput_lock. Without disabling irqs when running delayed
iputs we can therefore deadlock on that spinlock. The same deadlock can
also happen when adding an inode to the delayed iputs list, since this
can be done outside an irq context as well.

Syzbot recently reported this, which results in the following trace:

  ================================
  WARNING: inconsistent lock state
  6.4.0-syzkaller-09904-ga507db1d8fdc #0 Not tainted
  --------------------------------
  inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
  btrfs-cleaner/16079 [HC0[0]:SC0[0]:HE1:SE1] takes:
  ffff888107804d20 (&fs_info->delayed_iput_lock){+.?.}-{2:2}, at: spin_lock include/linux/spinlock.h:350 [inline]
  ffff888107804d20 (&fs_info->delayed_iput_lock){+.?.}-{2:2}, at: btrfs_run_delayed_iputs+0x28/0xe0 fs/btrfs/inode.c:3523
  {IN-SOFTIRQ-W} state was registered at:
    lock_acquire kernel/locking/lockdep.c:5761 [inline]
    lock_acquire+0x1b1/0x520 kernel/locking/lockdep.c:5726
    __raw_spin_lock include/linux/spinlock_api_smp.h:133 [inline]
    _raw_spin_lock+0x2e/0x40 kernel/locking/spinlock.c:154
    spin_lock include/linux/spinlock.h:350 [inline]
    btrfs_add_delayed_iput+0x128/0x390 fs/btrfs/inode.c:3490
    btrfs_put_ordered_extent fs/btrfs/ordered-data.c:559 [inline]
    btrfs_put_ordered_extent+0x2f6/0x610 fs/btrfs/ordered-data.c:547
    __btrfs_bio_end_io fs/btrfs/bio.c:118 [inline]
    __btrfs_bio_end_io+0x136/0x180 fs/btrfs/bio.c:112
    btrfs_orig_bbio_end_io+0x86/0x2b0 fs/btrfs/bio.c:163
    btrfs_simple_end_io+0x105/0x380 fs/btrfs/bio.c:378
    bio_endio+0x589/0x690 block/bio.c:1617
    req_bio_endio block/blk-mq.c:766 [inline]
    blk_update_request+0x5c5/0x1620 block/blk-mq.c:911
    blk_mq_end_request+0x59/0x680 block/blk-mq.c:1032
    lo_complete_rq+0x1c6/0x280 drivers/block/loop.c:370
    blk_complete_reqs+0xb3/0xf0 block/blk-mq.c:1110
    __do_softirq+0x1d4/0x905 kernel/softirq.c:553
    run_ksoftirqd kernel/softirq.c:921 [inline]
    run_ksoftirqd+0x31/0x60 kernel/softirq.c:913
    smpboot_thread_fn+0x659/0x9e0 kernel/smpboot.c:164
    kthread+0x344/0x440 kernel/kthread.c:389
    ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308
  irq event stamp: 39
  hardirqs last  enabled at (39): [<ffffffff81d5ebc4>] __do_kmem_cache_free mm/slab.c:3558 [inline]
  hardirqs last  enabled at (39): [<ffffffff81d5ebc4>] kmem_cache_free mm/slab.c:3582 [inline]
  hardirqs last  enabled at (39): [<ffffffff81d5ebc4>] kmem_cache_free+0x244/0x370 mm/slab.c:3575
  hardirqs last disabled at (38): [<ffffffff81d5eb5e>] __do_kmem_cache_free mm/slab.c:3553 [inline]
  hardirqs last disabled at (38): [<ffffffff81d5eb5e>] kmem_cache_free mm/slab.c:3582 [inline]
  hardirqs last disabled at (38): [<ffffffff81d5eb5e>] kmem_cache_free+0x1de/0x370 mm/slab.c:3575
  softirqs last  enabled at (0): [<ffffffff814ac99f>] copy_process+0x227f/0x75c0 kernel/fork.c:2448
  softirqs last disabled at (0): [<0000000000000000>] 0x0

  other info that might help us debug this:
   Possible unsafe locking scenario:

         CPU0
         ----
    lock(&fs_info->delayed_iput_lock);
    <Interrupt>
      lock(&fs_info->delayed_iput_lock);

   *** DEADLOCK ***

  1 lock held by btrfs-cleaner/16079:
   #0: ffff888107804860 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: cleaner_kthread+0x103/0x4b0 fs/btrfs/disk-io.c:1463

  stack backtrace:
  CPU: 3 PID: 16079 Comm: btrfs-cleaner Not tainted 6.4.0-syzkaller-09904-ga507db1d8fdc #0
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
  Call Trace:
   <TASK>
   __dump_stack lib/dump_stack.c:88 [inline]
   dump_stack_lvl+0xd9/0x150 lib/dump_stack.c:106
   print_usage_bug kernel/locking/lockdep.c:3978 [inline]
   valid_state kernel/locking/lockdep.c:4020 [inline]
   mark_lock_irq kernel/locking/lockdep.c:4223 [inline]
   mark_lock.part.0+0x1102/0x1960 kernel/locking/lockdep.c:4685
   mark_lock kernel/locking/lockdep.c:4649 [inline]
   mark_usage kernel/locking/lockdep.c:4598 [inline]
   __lock_acquire+0x8e4/0x5e20 kernel/locking/lockdep.c:5098
   lock_acquire kernel/locking/lockdep.c:5761 [inline]
   lock_acquire+0x1b1/0x520 kernel/locking/lockdep.c:5726
   __raw_spin_lock include/linux/spinlock_api_smp.h:133 [inline]
   _raw_spin_lock+0x2e/0x40 kernel/locking/spinlock.c:154
   spin_lock include/linux/spinlock.h:350 [inline]
   btrfs_run_delayed_iputs+0x28/0xe0 fs/btrfs/inode.c:3523
   cleaner_kthread+0x2e5/0x4b0 fs/btrfs/disk-io.c:1478
   kthread+0x344/0x440 kernel/kthread.c:389
   ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308
   </TASK>

So fix this by using spin_lock_irq() and spin_unlock_irq() when running
delayed iputs, and using spin_lock_irqsave() and spin_unlock_irqrestore()
when adding a delayed iput().

Reported-by: syzbot+da501a04be5ff533b102@syzkaller.appspotmail.com
Fixes: ec63b84d46 ("btrfs: add an ordered_extent pointer to struct btrfs_bio")
Link: https://lore.kernel.org/linux-btrfs/000000000000d5c89a05ffbd39dd@google.com/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-07-18 03:13:10 +02:00
Filipe Manana
cbaee87f2e btrfs: fix iput() on error pointer after error during orphan cleanup
At btrfs_orphan_cleanup(), if we can't find an inode (btrfs_iget() returns
an -ENOENT error pointer), we proceed with 'ret' set to -ENOENT and the
inode pointer set to ERR_PTR(-ENOENT). Later when we proceed to the body
of the following if statement:

    if (ret == -ENOENT || inode->i_nlink) {
        (...)
        trans = btrfs_start_transaction(root, 1);
        if (IS_ERR(trans)) {
            ret = PTR_ERR(trans);
            iput(inode);
            goto out;
        }
        (...)
        ret = btrfs_del_orphan_item(trans, root,
                                    found_key.objectid);
        btrfs_end_transaction(trans);
        if (ret) {
            iput(inode);
            goto out;
        }
        continue;
    }

If we get an error from btrfs_start_transaction() or from the call to
btrfs_del_orphan_item() we end calling iput() against an inode pointer
that has a value of ERR_PTR(-ENOENT), resulting in a crash with the
following trace:

  [876.667] BUG: kernel NULL pointer dereference, address: 0000000000000096
  [876.667] #PF: supervisor read access in kernel mode
  [876.667] #PF: error_code(0x0000) - not-present page
  [876.667] PGD 0 P4D 0
  [876.668] Oops: 0000 [#1] PREEMPT SMP PTI
  [876.668] CPU: 0 PID: 2356187 Comm: mount Tainted: G        W          6.4.0-rc6-btrfs-next-134+ #1
  [876.668] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
  [876.668] RIP: 0010:iput+0xa/0x20
  [876.668] Code: ff ff ff 66 (...)
  [876.669] RSP: 0018:ffffafa9c0c9f9d0 EFLAGS: 00010282
  [876.669] RAX: ffffffffffffffe4 RBX: 000000000009453b RCX: 0000000000000000
  [876.669] RDX: 0000000000000001 RSI: ffffafa9c0c9f930 RDI: fffffffffffffffe
  [876.669] RBP: ffff95c612f3b800 R08: 0000000000000001 R09: ffffffffffffffe4
  [876.670] R10: 00018f2a71010000 R11: 000000000ead96e3 R12: ffff95cb7d6909a0
  [876.670] R13: fffffffffffffffe R14: ffff95c60f477000 R15: 00000000ffffffe4
  [876.670] FS:  00007f5fbe30a840(0000) GS:ffff95ccdfa00000(0000) knlGS:0000000000000000
  [876.670] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [876.671] CR2: 0000000000000096 CR3: 000000055e9f6004 CR4: 0000000000370ef0
  [876.671] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  [876.671] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  [876.672] Call Trace:
  [876.744]  <TASK>
  [876.744]  ? __die_body+0x1b/0x60
  [876.744]  ? page_fault_oops+0x15d/0x450
  [876.745]  ? __kmem_cache_alloc_node+0x47/0x410
  [876.745]  ? do_user_addr_fault+0x65/0x8a0
  [876.745]  ? exc_page_fault+0x74/0x170
  [876.746]  ? asm_exc_page_fault+0x22/0x30
  [876.746]  ? iput+0xa/0x20
  [876.746]  btrfs_orphan_cleanup+0x221/0x330 [btrfs]
  [876.746]  btrfs_lookup_dentry+0x58f/0x5f0 [btrfs]
  [876.747]  btrfs_lookup+0xe/0x30 [btrfs]
  [876.747]  __lookup_slow+0x82/0x130
  [876.785]  walk_component+0xe5/0x160
  [876.786]  path_lookupat.isra.0+0x6e/0x150
  [876.786]  filename_lookup+0xcf/0x1a0
  [876.786]  ? mod_objcg_state+0xd2/0x360
  [876.786]  ? obj_cgroup_charge+0xf5/0x110
  [876.787]  ? should_failslab+0xa/0x20
  [876.787]  ? kmem_cache_alloc+0x47/0x450
  [876.787]  vfs_path_lookup+0x51/0x90
  [876.788]  mount_subtree+0x8d/0x130
  [876.788]  btrfs_mount+0x149/0x410 [btrfs]
  [876.788]  ? __kmem_cache_alloc_node+0x47/0x410
  [876.788]  ? vfs_parse_fs_param+0xc0/0x110
  [876.789]  legacy_get_tree+0x24/0x50
  [876.834]  vfs_get_tree+0x22/0xd0
  [876.852]  path_mount+0x2d8/0x9c0
  [876.852]  do_mount+0x79/0x90
  [876.852]  __x64_sys_mount+0x8e/0xd0
  [876.853]  do_syscall_64+0x38/0x90
  [876.899]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
  [876.958] RIP: 0033:0x7f5fbe50b76a
  [876.959] Code: 48 8b 0d a9 (...)
  [876.959] RSP: 002b:00007fff01925798 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
  [876.959] RAX: ffffffffffffffda RBX: 00007f5fbe694264 RCX: 00007f5fbe50b76a
  [876.960] RDX: 0000561bde6c8720 RSI: 0000561bde6bdec0 RDI: 0000561bde6c31a0
  [876.960] RBP: 0000561bde6bdc70 R08: 0000000000000000 R09: 0000000000000001
  [876.960] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
  [876.960] R13: 0000561bde6c31a0 R14: 0000561bde6c8720 R15: 0000561bde6bdc70
  [876.960]  </TASK>

So fix this by setting 'inode' to NULL whenever we get an error from
btrfs_iget(), and to make the code simpler, stop testing for 'ret' being
-ENOENT to check if we have an inode - instead test for 'inode' being NULL
or not. Having a NULL 'inode' prevents any iput() call from crashing, as
iput() ignores NULL inode pointers. Also, stop testing for a NULL return
value from btrfs_iget() with PTR_ERR_OR_ZERO(), because btrfs_iget() never
returns NULL - in case an inode is not found, it returns ERR_PTR(-ENOENT),
and in case of memory allocation failure, it returns ERR_PTR(-ENOMEM).
We also don't need the extra iput() calls on the error branches for the
btrfs_start_transaction() and btrfs_del_orphan_item() calls, as we have
already called iput() before, so remove them.

Fixes: a13bb2c038 ("btrfs: add missing iputs on orphan cleanup failure")
CC: stable@vger.kernel.org # 6.4
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-07-18 03:13:01 +02:00
Filipe Manana
b777d279ff btrfs: fix double iput() on inode after an error during orphan cleanup
At btrfs_orphan_cleanup(), if we were able to find the inode, we do an
iput() on the inode, then if btrfs_drop_verity_items() succeeds and then
either btrfs_start_transaction() or btrfs_del_orphan_item() fail, we do
another iput() in the respective error paths, resulting in an extra iput()
on the inode.

Fix this by setting inode to NULL after the first iput(), as iput()
ignores a NULL inode pointer argument.

Fixes: a13bb2c038 ("btrfs: add missing iputs on orphan cleanup failure")
CC: stable@vger.kernel.org # 6.4
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-07-18 03:12:59 +02:00
Jeff Layton
2a9462de43 btrfs: convert to ctime accessor functions
In later patches, we're going to change how the inode's ctime field is
used. Switch to using accessor functions instead of raw accesses of
inode->i_ctime.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Message-Id: <20230705190309.579783-27-jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-07-13 10:28:04 +02:00
Jeff Layton
130f1eca98 btrfs: convert to simple_rename_timestamp
A rename potentially involves updating 4 different inode timestamps.
Convert to the new simple_rename_timestamp helper function.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Message-Id: <20230705190309.579783-7-jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-07-10 10:08:16 +02:00
Filipe Manana
6822b3f708 btrfs: do not BUG_ON after failure to migrate space during truncation
During truncation we reserve 2 metadata units when starting a transaction
(reserved space goes to fs_info->trans_block_rsv) and then attempt to
migrate 1 unit (min_size bytes) from fs_info->trans_block_rsv into the
local block reserve. If we ever fail we trigger a BUG_ON(), which should
never happen, because we reserved 2 units. However if we happen to fail
for some reason, we don't need to be so dire and hit a BUG_ON(), we can
just error out the truncation and, since this is highly unexpected,
surround the error check with WARN_ON(), to get an informative stack
trace and tag the branh as 'unlikely'. Also make the 'min_size' variable
const, since it's not supposed to ever change and any accidental change
could possibly make the space migration not so unlikely to fail.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:39 +02:00
Christoph Hellwig
f02c75e630 btrfs: set FMODE_CAN_ODIRECT instead of a dummy direct_IO method
Since commit a2ad63daa8 ("VFS: add FMODE_CAN_ODIRECT file flag") file
systems can just set the FMODE_CAN_ODIRECT flag at open time instead of
wiring up a dummy direct_IO method to indicate support for direct I/O.
Do that for btrfs so that noop_direct_IO can eventually be removed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:38 +02:00
Christoph Hellwig
b41b6f6937 btrfs: use btrfs_finish_ordered_extent to complete direct writes
Use the btrfs_finish_ordered_extent helper to complete compressed writes
using the bbio->ordered pointer instead of requiring an rbtree lookup
in the otherwise equivalent btrfs_mark_ordered_io_finished called from
btrfs_writepage_endio_finish_ordered.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:37 +02:00
Christoph Hellwig
ec63b84d46 btrfs: add an ordered_extent pointer to struct btrfs_bio
Add a pointer to the ordered_extent to the existing union in struct
btrfs_bio, so all code dealing with data write bios can just use a
pointer dereference to retrieve the ordered_extent instead of doing
multiple rbtree lookups per I/O.

The reference to this ordered_extent is dropped at end I/O time,
which implies that an extra one must be acquired when the bio is split.
This also requires moving the btrfs_extract_ordered_extent call into
btrfs_split_bio so that the invariant of always having a valid
ordered_extent reference for the btrfs_bio is kept.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:36 +02:00
Christoph Hellwig
112397acc3 btrfs: open code btrfs_bio_end_io in btrfs_dio_submit_io
btrfs_dio_submit_io is the only place that uses btrfs_bio_end_io to end a
bio that hasn't been submitted using btrfs_submit_bio yet, and this
invariant will become a problem with upcoming changes to the btrfs bio
layer.  Just open code the assignment of bi_status and the call to
btrfs_dio_end_io.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:36 +02:00
Christoph Hellwig
d611935b5d btrfs: pass an ordered_extent to btrfs_submit_compressed_write
btrfs_submit_compressed_write always operates on a single ordered_extent.
Make that explicit by using btrfs_alloc_ordered_extent in the callers
and passing the ordered_extent to btrfs_submit_compressed_write.  This
will help with storing and ordered_extent pointer in the btrfs_bio in
subsequent patches.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:36 +02:00
Christoph Hellwig
34bfaf1530 btrfs: pass an ordered_extent to btrfs_reloc_clone_csums
Both callers of btrfs_reloc_clone_csums allocate the ordered_extent that
btrfs_reloc_clone_csums operates on.  Switch them to use
btrfs_alloc_ordered_extent instead of btrfs_add_ordered_extent and
pass the ordered_extent to btrfs_reloc_clone_csums instead of doing an
extra lookup.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:36 +02:00
Christoph Hellwig
3daea5fda1 btrfs: merge the two calls to btrfs_add_ordered_extent in run_delalloc_nocow
Refactor run_delalloc_nocow a little bit so that there is only a single
call to btrfs_add_ordered_extent instead of two.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:36 +02:00
Christoph Hellwig
7027f87108 btrfs: don't treat zoned writeback as being from an async helper thread
When extent_write_locked_range was originally added, it was only used
writing back compressed pages from an async helper thread.  But it is
now also used for writing back pages on zoned devices, where it is
called directly from the ->writepage context.  In this case we want to
be able to pass on the writeback_control instead of creating a new one,
and more importantly want to use all the normal cgroup interaction
instead of potentially deferring writeback to another helper.

Fixes: 898793d992 ("btrfs: zoned: write out partially allocated region")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:35 +02:00
Christoph Hellwig
a994310aa2 btrfs: remove PAGE_SET_ERROR
Now that the btrfs writeback code has stopped using PageError, using
PAGE_SET_ERROR to just set the per-address_space error flag is confusing.
Open code the mapping_set_error calls in the callers and remove
the PAGE_SET_ERROR flag.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:35 +02:00
Christoph Hellwig
2b2553f123 btrfs: stop setting PageError in the data I/O path
PageError is not used by the VFS/MM and deprecated because it uses up a
page bit and has no coherent rules.  Instead read errors are usually
propagated by not setting or clearing the uptodate bit, and write errors
are propagated through the address_space.  Btrfs now only sets the flag
and never clears it for data pages, so just remove all places setting it,
and the subpage error bit.

Note that the error propagation for superblock writes that work on the
block device mapping still uses PageError for now, but that will be
addressed in a separate series.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:35 +02:00
Christoph Hellwig
bb7b05fe1b btrfs: rename cow_file_range_async to run_delalloc_compressed
cow_file_range_async is only used for compressed writeback.  Rename it
to run_delalloc_compressed, which also fits in with run_delalloc_nocow
and run_delalloc_zoned.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:35 +02:00
Christoph Hellwig
973fb26e81 btrfs: don't fail writeback when allocating the compression context fails
If cow_file_range_async fails to allocate the asynchronous writeback
context, it currently returns an error and entirely fails the writeback.
This is not a good idea as a writeback failure is a non-temporary error
condition that will make the file system unusable.  Just fall back to
synchronous uncompressed writeback instead.  This requires us to delay
setting the BTRFS_INODE_HAS_ASYNC_EXTENT flag until we've committed to
the async writeback.

The compression checks INODE_NOCOMPRESS and FORCE_COMPRESS are moved
from cow_file_range_async to the preceding checks btrfs_run_delalloc_range().

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:35 +02:00
Christoph Hellwig
f000bc6fe4 btrfs: pass the new logical address to split_extent_map
split_extent_map splits off the first chunk of an extent map into a new
one.  One of the two users is the zoned I/O completion code that wants to
rewrite the logical block start address right after this split.  Pass in
the logical address to be set in the split off first extent_map as an
argument to avoid an extra extent tree lookup for this case.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:33 +02:00
Christoph Hellwig
71df088c1c btrfs: defer splitting of ordered extents until I/O completion
The btrfs zoned completion code currently needs an ordered_extent and
extent_map per bio so that it can account for the non-predictable
write location from Zone Append.  To archive that it currently splits
the ordered_extent and extent_map at I/O submission time, and then
records the actual physical address in the ->physical field of the
ordered_extent.

This patch instead switches to record the "original" physical address
that the btrfs allocator assigned in spare space in the btrfs_bio,
and then rewrites the logical address in the btrfs_ordered_sum
structure at I/O completion time.  This allows the ordered extent
completion handler to simply walk the list of ordered csums and
split the ordered extent as needed.  This removes an extra ordered
extent and extent_map lookup and manipulation during the I/O
submission path, and instead batches it in the I/O completion path
where we need to touch these anyway.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:33 +02:00
Christoph Hellwig
b0307e2864 btrfs: return the new ordered_extent from btrfs_split_ordered_extent
Return the ordered_extent split from the passed in one.  This will be
needed to be able to store an ordered_extent in the btrfs_bio.

Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:33 +02:00
Christoph Hellwig
ebdb44a00e btrfs: reorder conditions in btrfs_extract_ordered_extent
There is no good reason for doing one before the other in terms of
failure implications, but doing the extent_map split first will
simplify some upcoming refactoring.

Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:33 +02:00
Christoph Hellwig
a6f3e205e4 btrfs: move split_extent_map to extent_map.c
split_extent_map doesn't have anything to do with the other code in
inode.c, so move it to extent_map.c.

This also allows marking replace_extent_mapping static.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:33 +02:00
Christoph Hellwig
cbfce4c7fb btrfs: optimize the logical to physical mapping for zoned writes
The current code to store the final logical to physical mapping for a
zone append write in the extent tree is rather inefficient.  It first has
to split the ordered extent so that there is one ordered extent per bio,
so that it can look up the ordered extent on I/O completion in
btrfs_record_physical_zoned and store the physical LBA returned by the
block driver in the ordered extent.

btrfs_rewrite_logical_zoned then has to do a lookup in the chunk tree to
see what physical address the logical address for this bio / ordered
extent is mapped to, and then rewrite it in the extent tree.

To optimize this process, we can store the physical address assigned in
the chunk tree to the original logical address and a pointer to
btrfs_ordered_sum structure the in the btrfs_bio structure, and then use
this information to rewrite the logical address in the btrfs_ordered_sum
structure directly at I/O completion time in btrfs_record_physical_zoned.
btrfs_rewrite_logical_zoned then simply updates the logical address in
the extent tree and the ordered_extent itself.

The code in btrfs_rewrite_logical_zoned now runs for all data I/O
completions in zoned file systems, which is fine as there is no remapping
to do for non-append writes to conventional zones or for relocation, and
the overhead for quickly breaking out of the loop is very low.

Because zoned file systems now need the ordered_sums structure to
record the actual write location returned by zone append, allocate dummy
structures without the csum array for them when the I/O doesn't use
checksums, and free them when completing the ordered_extent.

Note that the btrfs_bio doesn't grow as the new field are places into
a union that is so far not used for data writes and has plenty of space
left in it.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:32 +02:00
Christoph Hellwig
5cfe76f846 btrfs: rename the bytenr field in struct btrfs_ordered_sum to logical
btrfs_ordered_sum::bytendr stores a logical address.  Make that clear by
renaming it to ->logical.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:32 +02:00
David Sterba
1d12680044 btrfs: drop gfp from parameter extent state helpers
Now that all extent state bit helpers effectively take the GFP_NOFS mask
(and GFP_NOWAIT is encoded in the bits) we can remove the parameter.
This reduces stack consumption in many functions and simplifies a lot of
code.

Net effect on module on a release build:

   text    data     bss     dec     hex filename
1250432   20985   16088 1287505  13a551 pre/btrfs.ko
1247074   20985   16088 1284147  139833 post/btrfs.ko

DELTA: -3358

Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:30 +02:00
David Sterba
66240ab115 btrfs: open code set_extent_delalloc
The helper is used once in fs code and a few times in the self test
code.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:30 +02:00
Filipe Manana
59fcf38817 btrfs: change for_rename argument of btrfs_record_unlink_dir() to bool
The for_rename argument of btrfs_record_unlink_dir() is defined as an
integer, but the argument is in fact used as a boolean. So change it to
a boolean to make its use more clear.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:26 +02:00
Qu Wenruo
b7f9945a14 btrfs: handle tree backref walk error properly
[BUG]
Smatch reports the following errors related to commit ("btrfs: output
affected files when relocation fails"):

	fs/btrfs/inode.c:283 print_data_reloc_error()
	error: uninitialized symbol 'ref_level'.

[CAUSE]
That part of code is mostly copied from scrub, but unfortunately scrub
code from the beginning is not doing the error handling properly.

The offending code looks like this:

	do {
		ret = tree_backref_for_extent();
		btrfs_warn_rl();
	} while (ret != 1);

There are several problems involved:

- No error handling
  If that tree_backref_for_extent() failed, we would output the same
  error again and again, never really exit as it requires ret == 1 to
  exit.

- Always do one extra output
  As tree_backref_for_extent() only return > 0 if there is no more
  backref item.
  This means after the last item we hit, we would output an invalid
  error message for ret > 0 case.

[FIX]
Fix the old code by:

- Move @ref_root and @ref_level into the if branch
  And do not initialize them, so we can catch such uninitialized values
  just like what we do in the inode.c

- Explicitly check the return value of tree_backref_for_extent()
  And handle ret < 0 and ret > 0 cases properly.

- No more do {} while () loop
  Instead go while (true) {} loop since we will handle @ret manually.

Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:26 +02:00
Qu Wenruo
b9a9a85059 btrfs: output affected files when relocation fails
[PROBLEM]
When relocation fails (mostly due to checksum mismatch), we only got
very cryptic error messages like:

  BTRFS info (device dm-4): relocating block group 13631488 flags data
  BTRFS warning (device dm-4): csum failed root -9 ino 257 off 0 csum 0x373e1ae3 expected csum 0x98757625 mirror 1
  BTRFS error (device dm-4): bdev /dev/mapper/test-scratch1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
  BTRFS info (device dm-4): balance: ended with status: -5

The end user has to decipher the above messages and use various tools to
locate the affected files and find a way to fix the problem (mostly
deleting the file).  This is not an easy work even for experienced
developer, not to mention the end users.

[SCRUB IS DOING BETTER]
By contrast, scrub is providing much better error messages:

  BTRFS error (device dm-4): unable to fixup (regular) error at logical 13631488 on dev /dev/mapper/test-scratch1 physical 13631488
  BTRFS warning (device dm-4): checksum error at logical 13631488 on dev /dev/mapper/test-scratch1, physical 13631488, root 5, inode 257, offset 0, length 4096, links 1 (path: file)
  BTRFS info (device dm-4): scrub: finished on devid 1 with status: 0

Which provides the affected files directly to the end user.

[IMPROVEMENT]
Instead of the generic data checksum error messages, which is not doing
a good job for data reloc inodes, this patch introduce a scrub like
backref walking based solution.

When a sector fails its checksum for data reloc inode, we go the
following workflow:

- Get the real logical bytenr
  For data reloc inode, the file offset is the offset inside the block
  group.
  Thus the real logical bytenr is @file_off + @block_group->start.

- Do an extent type check
  If it's tree blocks it's much easier to handle, just go through
  all the tree block backref.

- Do a backref walk and inode path resolution for data extents
  This is mostly the same as scrub.
  But unfortunately we can not reuse the same function as the output
  format is different.

Now the new output would be more user friendly:

  BTRFS info (device dm-4): relocating block group 13631488 flags data
  BTRFS warning (device dm-4): csum failed root -9 ino 257 off 0 logical 13631488 csum 0x373e1ae3 expected csum 0x98757625 mirror 1
  BTRFS warning (device dm-4): checksum error at logical 13631488 mirror 1 root 5 inode 257 offset 0 length 4096 links 1 (path: file)
  BTRFS error (device dm-4): bdev /dev/mapper/test-scratch1 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
  BTRFS info (device dm-4): balance: ended with status: -5

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:23 +02:00
Christoph Hellwig
e917ff56c8 btrfs: determine synchronous writers from bio or writeback control
The writeback_control structure already passes down the information about
a writeback being synchronous from the core VM code, and thus information
is propagated into the bio REQ_SYNC flag through the wbc_to_write_flags
helper.

Use that information to decide if checksums calculation is offloaded to
a workqueue instead of btrfs_inode::sync_writers field that not only
bloats the inode but also has too wide scope, being inode wide instead
of limited to the actual writeback request.

The sync writes were set in:

- btrfs_do_write_iter - regular IO, sync status is set
- start_ordered_ops - ordered write start, writeback with WB_SYNC_ALL
  mode
- btrfs_write_marked_extents - write marked extents, writeback with
  WB_SYNC_ALL mode

Reviewed-by: Chris Mason <clm@fb.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:23 +02:00
Anand Jain
29e70be261 btrfs: use SECTOR_SHIFT to convert physical offset to LBA
Use SECTOR_SHIFT while converting a physical address to an LBA, makes
it more readable.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:23 +02:00
Chris Mason
deccae40e4 btrfs: can_nocow_file_extent should pass down args->strict from callers
Commit 619104ba45 ("btrfs: move common NOCOW checks against a file
extent into a helper") changed our call to btrfs_cross_ref_exist() to
always pass false for the 'strict' parameter.  We're passing this down
through the stack so that we can do a full check for cross references
during swapfile activation.

With strict always false, this test fails:

  btrfs subvol create swappy
  chattr +C swappy
  fallocate -l1G swappy/swapfile
  chmod 600 swappy/swapfile
  mkswap swappy/swapfile

  btrfs subvol snap swappy swapsnap
  btrfs subvol del -C swapsnap

  btrfs fi sync /
  sync;sync;sync

  swapon swappy/swapfile

The fix is to just use args->strict, and everyone except swapfile
activation is passing false.

Fixes: 619104ba45 ("btrfs: move common NOCOW checks against a file extent into a helper")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-13 00:01:08 +02:00
Christoph Hellwig
7833b86595 btrfs: fix iomap_begin length for nocow writes
can_nocow_extent can reduce the len passed in, which needs to be
propagated to btrfs_dio_iomap_begin so that iomap does not submit
more data then is mapped.

This problems exists since the btrfs_get_blocks_direct helper was added
in commit c5794e5178 ("btrfs: Factor out write portion of
btrfs_get_blocks_direct"), but the ordered_extent splitting added in
commit b73a6fd1b1 ("btrfs: split partial dio bios before submit")
added a WARN_ON that made a syzkaller test fail.

Reported-by: syzbot+ee90502d5c8fd1d0dd93@syzkaller.appspotmail.com
Fixes: c5794e5178 ("btrfs: Factor out write portion of btrfs_get_blocks_direct")
CC: stable@vger.kernel.org # 6.1+
Tested-by: syzbot+ee90502d5c8fd1d0dd93@syzkaller.appspotmail.com
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-13 00:01:00 +02:00
Naohiro Aota
f84353c7c2 btrfs: zoned: zone finish data relocation BG with last IO
For data block groups, we zone finish a zone (or, just deactivate it) when
seeing the last IO in btrfs_finish_ordered_io(). That is only called for
IOs using ZONE_APPEND, but we use a regular WRITE command for data
relocation IOs. Detect it and call btrfs_zone_finish_endio() properly.

Fixes: be1a1d7a5d ("btrfs: zoned: finish fully written block group")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-05-10 14:50:12 +02:00
Qu Wenruo
4317ff0056 btrfs: introduce btrfs_bio::fs_info member
Currently we're doing a lot of work for btrfs_bio:

- Checksum verification for data read bios
- Bio splits if it crosses stripe boundary
- Read repair for data read bios

However for the incoming scrub patches, we don't want this extra
functionality at all, just plain logical + mirror -> physical mapping
ability.

Thus here we do the following changes:

- Introduce btrfs_bio::fs_info
  This is for the new scrub specific btrfs_bio, which would not populate
  btrfs_bio::inode.
  Thus we need such new member to grab a fs_info

  This new member will always be populated.

- Replace @inode argument with @fs_info for btrfs_bio_init() and its
  caller
  Since @inode is no longer a mandatory member, replace it with
  @fs_info, and let involved users populate @inode.

- Skip checksum verification and generation if @bbio->inode is NULL

- Add extra ASSERT()s
  To make sure:

  * bbio->inode is properly set for involved read repair path
  * if @file_offset is set, bbio->inode is also populated

- Grab @fs_info from @bbio directly
  We can no longer go @bbio->inode->root->fs_info, as bbio->inode can be
  NULL. This involves:

  * btrfs_simple_end_io()
  * should_async_write()
  * btrfs_wq_submit_bio()
  * btrfs_use_zone_append()

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:23 +02:00
Christoph Hellwig
3480373ebd btrfs, block: move REQ_CGROUP_PUNT to btrfs
REQ_CGROUP_PUNT is a bit annoying as it is hard to follow and adds
a branch to the bio submission hot path.  To fix this, export
blkcg_punt_bio_submit and let btrfs call it directly.  Add a new
REQ_FS_PRIVATE flag for btrfs to indicate to it's own low-level
bio submission code that a punt to the cgroup submission helper
is required.

Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:22 +02:00
Christoph Hellwig
896d7c1a90 btrfs: also use kthread_associate_blkcg for uncompressible ranges
submit_one_async_extent needs to use submit_one_async_extent no matter
if the range it handles ends up beeing compressed or not as the deadlock
risk due to cgroup thottling is the same.  Call kthread_associate_blkcg
earlier to cover submit_uncompressed_range case as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:22 +02:00
Christoph Hellwig
e43a6210b7 btrfs: don't free the async_extent in submit_uncompressed_range
Let submit_one_async_extent, which is the only caller of
submit_uncompressed_range handle freeing of the async_extent in one
central place.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:22 +02:00
Christoph Hellwig
05d06a5c9d btrfs: move kthread_associate_blkcg out of btrfs_submit_compressed_write
btrfs_submit_compressed_write should not have to care if it is called
from a helper thread or not.  Move the kthread_associate_blkcg handling
into submit_one_async_extent, as that is the one caller that needs it.
Also move the assignment of REQ_CGROUP_PUNT into cow_file_range_async,
as that is the routine that sets up the helper thread offload.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:22 +02:00
Boris Burkov
b73a6fd1b1 btrfs: split partial dio bios before submit
If an application is doing direct io to a btrfs file and experiences a
page fault reading from the write buffer, iomap will issue a partial
bio, and allow the fs to keep going. However, there was a subtle bug in
this code path in the btrfs dio iomap implementation that led to the
partial write ending up as a gap in the file's extents and to be read
back as zeros.

The sequence of events in a partial write, lightly summarized and
trimmed down for brevity is as follows:

==== WRITING TASK ====
 btrfs_direct_write
 __iomap_dio_write
 iomap_iter
 btrfs_dio_iomap_begin # create full ordered extent
 iomap_dio_bio_iter
 bio_iov_iter_get_pages # page fault; partial read
 submit_bio # partial bio
 iomap_iter
 btrfs_dio_iomap_end
 btrfs_mark_ordered_io_finished # sets BTRFS_ORDERED_IOERR;
				# submit to finish_ordered_fn wq
 fault_in_iov_iter_readable # btrfs_direct_write detects partial write
 __iomap_dio_write
 iomap_iter
 btrfs_dio_iomap_begin # create second partial ordered extent
 iomap_dio_bio_iter
 bio_iov_iter_get_pages # read all of remainder
 submit_bio # partial bio with all of remainder
 iomap_iter
 btrfs_dio_iomap_end # nothing exciting to do with ordered io

==== DIO ENDIO ====
== FIRST PARTIAL BIO ==
 btrfs_dio_end_io
 btrfs_mark_ordered_io_finished # bytes_left > 0
			        # don't submit to finish_ordered_fn wq
== SECOND PARTIAL BIO ==
 btrfs_dio_end_io
 btrfs_mark_ordered_io_finished # bytes_left == 0
			        # submit to finish_ordered_fn wq

==== BTRFS FINISH ORDERED WQ ====
== FIRST PARTIAL BIO ==
 btrfs_finish_ordered_io # called by dio_iomap_end_io, sees
		         # BTRFS_ORDERED_IOERR, just drops the
		         # ordered_extent
==SECOND PARTIAL BIO==
 btrfs_finish_ordered_io # called by btrfs_dio_end_io, writes out file
		         # extents, csums, etc...

The essence of the problem is that while btrfs_direct_write and iomap
properly interact to submit all the correct bios, there is insufficient
logic in the btrfs dio functions (btrfs_dio_iomap_begin,
btrfs_dio_submit_io, btrfs_dio_end_io, and btrfs_dio_iomap_end) to
ensure that every bio is at least a part of a completed ordered_extent.
And it is completing an ordered_extent that results in crucial
functionality like writing out a file extent for the range.

More specifically, btrfs_dio_end_io treats the ordered extent as
unfinished but btrfs_dio_iomap_end sets BTRFS_ORDERED_IOERR on it.
Thus, the finish io work doesn't result in file extents, csums, etc.
In the aftermath, such a file behaves as though it has a hole in it,
instead of the purportedly written data.

We considered a few options for fixing the bug:

  1. treat the partial bio as if we had truncated the file, which would
     result in properly finishing it.
  2. split the ordered extent when submitting a partial bio.
  3. cache the ordered extent across calls to __iomap_dio_rw in
     iter->private, so that we could reuse it and correctly apply
     several bios to it.

I had trouble with 1, and it felt the most like a hack, so I tried 2
and 3. Since 3 has the benefit of also not creating an extra file
extent, and avoids an ordered extent lookup during bio submission, it
felt like the best option. However, that turned out to re-introduce a
deadlock which this code discarding the ordered_extent between faults
was meant to fix in the first place. (Link to an explanation of the
deadlock below.)

Therefore, go with fix 2, which requires a bit more setup work but fixes
the corruption without introducing the deadlock, which is fundamentally
caused by the ordered extent existing when we attempt to fault in a
range that overlaps with it.

Put succinctly, what this patch does is: when we submit a dio bio, check
if it is partial against the ordered extent stored in dio_data, and if it
is, extract the ordered_extent that matches the bio exactly out of the
larger ordered_extent. Keep the remaining ordered_extent around in dio_data
for cancellation in iomap_end.

Thanks to Josef, Christoph, and Filipe with their help figuring out the
bug and the fix.

Fixes: 51bd9563b6 ("btrfs: fix deadlock due to page faults during direct IO reads and writes")
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2169947
Link: https://lore.kernel.org/linux-btrfs/aa1fb69e-b613-47aa-a99e-a0a2c9ed273f@app.fastmail.com/
Link: https://pastebin.com/3SDaH8C6
Link: https://lore.kernel.org/linux-btrfs/20230315195231.GW10580@twin.jikos.cz/T/#t
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Boris Burkov <boris@bur.io>
[ hch: refactored the ordered_extent extraction ]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:21 +02:00
Boris Burkov
f0f5329a00 btrfs: don't split NOCOW extent_maps in btrfs_extract_ordered_extent
NOCOW writes just overwrite an existing extent map, which thus should
not be split in btrfs_extract_ordered_extent.  The NOCOW case can't
currently happen as btrfs_extract_ordered_extent is only used on zoned
devices that do not support NOCOW writes, but this will change soon.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Boris Burkov <boris@bur.io>
[ hch: split from a larger patch, wrote a commit log ]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:21 +02:00
Christoph Hellwig
7edd339c8a btrfs: pass an ordered_extent to btrfs_extract_ordered_extent
To prepare for a new caller that already has the ordered_extent
available, change btrfs_extract_ordered_extent to take an argument
for it.  Add a wrapper for the bio case that still has to do the
lookup (for now).

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:21 +02:00
Christoph Hellwig
2e38a84bc6 btrfs: simplify extent map splitting and rename split_zoned_em
split_zoned_em is only ever asked to split out the beginning of an extent
map.  Change it to only take a len to split out instead of a pre and post
region.

Also rename the function to split_extent_map as there is nothing zoned
device specific about it.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:21 +02:00
Christoph Hellwig
8f4af4b8e1 btrfs: sink parameter len to btrfs_split_ordered_extent
btrfs_split_ordered_extent is only ever asked to split out the beginning
of an ordered_extent (i.e. post == 0).  Change it to only take a len to
split out, and switch it to allocate the new extent for the beginning,
as that helps with callers that want to keep a pointer to the
ordered_extent that it is stealing from.

Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:21 +02:00
Christoph Hellwig
11d33ab6c1 btrfs: simplify splitting logic in btrfs_extract_ordered_extent
btrfs_extract_ordered_extent is always used to split an ordered_extent
and extent_map into two parts, so it doesn't need to deal with a three
way split.

Simplify it by only allowing for a single split point, and always split
out the beginning of the extent, as that is what we'll later need to
be able to hold on to a reference to the original ordered_extent that
the first part is split off for submission.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:21 +02:00
Christoph Hellwig
e44ca71cfe btrfs: move ordered_extent internal sanity checks into btrfs_split_ordered_extent
Move the three checks that are about ordered extent internal sanity
checking into btrfs_split_ordered_extent instead of doing them in the
higher level btrfs_extract_ordered_extent routine.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:21 +02:00
Boris Burkov
53f2c20687 btrfs: stash ordered extent in dio_data during iomap dio
While it is not feasible for an ordered extent to survive across the
calls btrfs_direct_write makes into __iomap_dio_rw, it is still helpful
to stash it on the dio_data in between creating it in iomap_begin and
finishing it in either end_io or iomap_end.

The specific use I have in mind is that we can check if a particular bio
is partial in submit_io without unconditionally looking up the ordered
extent. This is a preparatory patch for a later patch which does just
that.

Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:21 +02:00
Filipe Manana
b7b1167c36 btrfs: simplify exit paths of btrfs_evict_inode()
Instead of using two labels at btrfs_evict_inode() for exiting depending
on whether we need to delete the inode items and orphan or some error
happened, we can use a single exit label if we initialize the block
reserve to NULL, since btrfs_free_block_rsv() ignores a NULL block reserve
pointer. So just do that. It will also make an upcoming change simpler by
avoiding one extra error label.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:20 +02:00
Filipe Manana
5630e2bcfe btrfs: use a constant for the number of metadata units needed for an unlink
Instead of hard coding the number of metadata units for an unlink operation
in a couple places, define a macro and use it instead. This eliminates the
problem of one place getting out of sync with the other, such as recently
fixed by the previous patch in the series ("btrfs: fix calculation of the
global block reserve's size").

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:20 +02:00
Filipe Manana
b13d57db90 btrfs: calculate correct amount of space for delayed reference when evicting
When evicting an inode, we are incorrectly calculating the amount of space
required for a single delayed reference in case the free space tree is
enabled. We have to multiply by 2 the result of
btrfs_calc_insert_metadata_size(). We should be calculating according to
the size update and space release of the delayed block reserve logic at
btrfs_update_delayed_refs_rsv() and btrfs_delayed_refs_rsv_release().

Fix this by using the btrfs_calc_delayed_ref_bytes() helper at
evict_refill_and_join() instead of btrfs_calc_insert_metadata_size().

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:20 +02:00
Filipe Manana
afa4b0afee btrfs: don't throttle on delayed items when evicting deleted inode
During inode eviction, if we are truncating a deleted inode, we don't add
delayed items for our inode, so there's no need to throttle on delayed
items on each iteration of the loop that truncates inode items from its
subvolume tree. But we dirty extent buffers from its subvolume tree, so
we only need to throttle on btree inode dirty pages.

So use btrfs_btree_balance_dirty_nodelay() in the loop that truncates
inode items.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:19 +02:00
Filipe Manana
4e0527deb3 btrfs: pass a bool to btrfs_block_rsv_migrate() at evict_refill_and_join()
The last argument of btrfs_block_rsv_migrate() is a boolean, but we are
passing an integer, with a value of 1, to it at evict_refill_and_join().
While this is not a bug, due to type conversion, it's a lot more clear to
simply pass the boolean true value instead. So just do that.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:18 +02:00
Christoph Hellwig
b41bbd293e btrfs: return a btrfs_bio from btrfs_bio_alloc
Return the containing struct btrfs_bio instead of the less type safe
struct bio from btrfs_bio_alloc.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:17 +02:00
Christoph Hellwig
ae42a154ca btrfs: pass a btrfs_bio to btrfs_submit_bio
btrfs_submit_bio expects the bio passed to it to be embedded into a
btrfs_bio structure.  Pass the btrfs_bio directly to increase type
safety and make the code self-documenting.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:17 +02:00
Christoph Hellwig
34f888ce3a btrfs: cleanup main loop in btrfs_encoded_read_regular_fill_pages
btrfs_encoded_read_regular_fill_pages has a pretty odd control flow.
Unwind it so that there is a single loop over the pages array.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:17 +02:00
Christoph Hellwig
b665affe93 btrfs: remove unused members from struct btrfs_encoded_read_private
The inode and file_offset members in struct btrfs_encoded_read_private
are unused, so remove them.

Last used in commit 7959bd4411 ("btrfs: remove the start argument to
check_data_csum and export") and commit 7609afac67 ("btrfs: handle
checksum validation and repair at the storage layer").

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:17 +02:00
Christoph Hellwig
544fe4a903 btrfs: embed a btrfs_bio into struct compressed_bio
Embed a btrfs_bio into struct compressed_bio.  This avoids potential
(so far theoretical) deadlocks due to nesting of btrfs_bioset allocations
for the original read bio and the compressed bio, and avoids an extra
memory allocation in the I/O path.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:14 +02:00
Johannes Thumshirn
fa13661c48 btrfs: open code btrfs_csum_ptr
Remove btrfs_csum_ptr() and fold it into it's only caller.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:13 +02:00
Josef Bacik
a13bb2c038 btrfs: add missing iputs on orphan cleanup failure
We missed a couple of iput()s in the orphan cleanup failure paths, add
them so we don't get refcount errors. The iput needs to be done in the
check and not under a common label due to the way the code is
structured.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:13 +02:00
Linus Torvalds
285063049a for-6.3-rc3-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmQc0bUACgkQxWXV+ddt
 WDspCQ//TZRZxwvtgHuJO04vk/CyGrB/2FPytweM3QIjUkq7WaWxoDbgkXfJVuej
 qvdlNlugtXuuTZ87j7dTC2tP2agi0BWhJSO9C0S5z8GTYF2uewKknUD01uOZnKz0
 j++9ki5HfcAYbH80xpM2S4GqOz4FBsfRx/10WIdKOfHrB5jhbfMvN6rBE+UGged0
 Of9TZ9u4i5FMlY36G5+Rek/mhQrK2eFIn45IDwzQptUKnK+0OZ1qqk8ZUmAeT+hn
 6EY3ZXXJIhx6fMxqoeo2TelUWwknARgBQvPSY8YbwZc6T+ObZF0jxZx6n9ESVB8R
 AXOXoovn6+pnm3qi/8j8d0z88LYBrGOXPNp4vtXkKToW+6VWbrvM4zHnUSKCXMDy
 1eaxVcv3MDZ07+Y98XbUMJDKjQ4yHXKBMv/wPCTnvRl0ZZ9r4zFKpcFUSFyEM0rR
 rtwsWY8M2UDiF4ypouc9ep+xmxFxun9XQVmxGYprP/OduGwslex6xbrhrFJhlGja
 acbtA/1P5bZCcseeWcZRHqqwtfEH+ZOdG9+nBzxn7yKGcY0DDCQvbiH4HwlAts1R
 GhEQOtqP1szWKENSELluWwbuUdpaYrF3dcsUxtnJOLHsg0dwABm7buM0kiUPEUqK
 nZhAP4wXks6dGFB9V4BUybGtl0Vcr+5nhWCo8Wc/dLN5GMVzPvM=
 =XuDt
 -----END PGP SIGNATURE-----

Merge tag 'for-6.3-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "A few more fixes, the zoned accounting fix is spread across a few
  patches, preparatory and the actual fixes:

   - zoned mode:
      - fix accounting of unusable zone space
      - fix zone activation condition for DUP profile
      - preparatory patches

   - improved error handling of missing chunks

   - fix compiler warning"

* tag 'for-6.3-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: zoned: drop space_info->active_total_bytes
  btrfs: zoned: count fresh BG region as zone unusable
  btrfs: use temporary variable for space_info in btrfs_update_block_group
  btrfs: rename BTRFS_FS_NO_OVERCOMMIT to BTRFS_FS_ACTIVE_ZONE_TRACKING
  btrfs: zoned: fix btrfs_can_activate_zone() to support DUP profile
  btrfs: fix compiler warning on SPARC/PA-RISC handling fscrypt_setup_filename
  btrfs: handle missing chunk mapping more gracefully
2023-03-24 08:32:10 -07:00
Sweet Tea Dorminy
10a8857a1b btrfs: fix compiler warning on SPARC/PA-RISC handling fscrypt_setup_filename
Commit 1ec49744ba ("btrfs: turn on -Wmaybe-uninitialized") exposed
that on SPARC and PA-RISC, gcc is unaware that fscrypt_setup_filename()
only returns negative error values or 0. This ultimately results in a
maybe-uninitialized warning in btrfs_lookup_dentry().

Change to only return negative error values or 0 from
fscrypt_setup_filename() at the relevant call site, and assert that no
positive error codes are returned (which would have wider implications
involving other users).

Reported-by: Guenter Roeck <linux@roeck-us.net>
Link: https://lore.kernel.org/all/481b19b5-83a0-4793-b4fd-194ad7b978c3@roeck-us.net/
Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-03-15 20:51:06 +01:00
Linus Torvalds
885ce48739 for-6.3-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmPzxWcACgkQxWXV+ddt
 WDt+fRAAg5pz7gWNMtIK30gp/uojjAkCWXymxRtK2tZU3naI+6IYSAKxuKq8Iz1Y
 drdlpSvTX/Gv3XlGB9QuoH6digTjQzeVzjAm0eP6w8t8354KGSRUYdtoFp8I8E5Z
 q0JUuZ6w/KvpZfOIsmcgpOScgcl+8+UlOxs2iuSrOvAqP8Dg1VCt5vBm7htIb0tm
 5ClbgmIacxWrOII55XGuY0mWuZSlS4hdyWdYMelvtM8aPPG+e8eEzKjscVOOueLz
 Smi1kN5QU3o+m4oKjN1OJlKfeURdbcZUwva9zOsegSbPHUzNwIao44cQ5cQhMR0r
 kI3nCpJwGKdUd6IblEdcqBN5F4V64edLSruOLuGYzxySnEWhFE2YU2xW/v5b1eQW
 GHurI52FGrPqcX9FgQNzfTjQzk341iQ0QIs5exycJH7xeohEZnlaK2yNUngKSo1C
 naqczEMMMcxNjQaooUuxRkL/zz36D/Dkyo2YOCODtWyu61XY9LqvaxMvClFI20lL
 40dzzYnnMQwkXJrQ/MVQhz1BBaPVqizt8+ErL7GQp2CWr9miD6mcA5b2pyZm5Q3r
 hHadzeTXXS7P9g9UnuDxpZqkhvadGC2Sy4l/D6jURyKFzr8mtplaRRwUS2gSuP3z
 zxavvP4UukwNWXxDz755NAhiGbA+xpSMATKCrZ/Sdogvxe8IhRg=
 =NCpw
 -----END PGP SIGNATURE-----

Merge tag 'for-6.3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs updates from David Sterba:
 "The usual mix of performance improvements and new features.

  The core change is reworking how checksums are processed, with
  followup cleanups and simplifications. There are two minor changes in
  block layer and iomap code.

  Features:

   - block group allocation class heuristics:
      - pack files by size (up to 128k, up to 8M, more) to avoid
        fragmentation in block groups, assuming that file size and life
        time is correlated, in particular this may help during balance
      - with tracepoints and extensible in the future

  Performance:

   - send: cache directory utimes and only emit the command when
     necessary
      - speedup up to 10x
      - smaller final stream produced (no redundant utimes commands
        issued)
      - compatibility not affected

   - fiemap: skip backref checks for shared leaves
      - speedup 3x on sample filesystem with all leaves shared (e.g. on
        snapshots)

   - micro optimized b-tree key lookup, speedup in metadata operations
     (sample benchmark: fs_mark +10% of files/sec)

  Core changes:

   - change where checksumming is done in the io path:
      - checksum and read repair does verification at lower layer
      - cascaded cleanups and simplifications

   - raid56 refactoring and cleanups

  Fixes:

   - sysfs: make sure that a run-time change of a feature is correctly
     tracked by the feature files

   - scrub: better reporting of tree block errors

  Other:

   - locally enable -Wmaybe-uninitialized after fixing all warnings

   - misc cleanups, spelling fixes

  Other code:

   - block: export bio_split_rw

   - iomap: remove IOMAP_F_ZONE_APPEND"

* tag 'for-6.3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (109 commits)
  btrfs: make kobj_type structures constant
  btrfs: remove the bdev argument to btrfs_rmap_block
  btrfs: don't rely on unchanging ->bi_bdev for zone append remaps
  btrfs: never return true for reads in btrfs_use_zone_append
  btrfs: pass a btrfs_bio to btrfs_use_append
  btrfs: set bbio->file_offset in alloc_new_bio
  btrfs: use file_offset to limit bios size in calc_bio_boundaries
  btrfs: do unsigned integer division in the extent buffer binary search loop
  btrfs: eliminate extra call when doing binary search on extent buffer
  btrfs: raid56: handle endio in scrub_rbio
  btrfs: raid56: handle endio in recover_rbio
  btrfs: raid56: handle endio in rmw_rbio
  btrfs: raid56: submit the read bios from scrub_assemble_read_bios
  btrfs: raid56: fold rmw_read_wait_recover into rmw_read_bios
  btrfs: raid56: fold recover_assemble_read_bios into recover_rbio
  btrfs: raid56: add a bio_list_put helper
  btrfs: raid56: wait for I/O completion in submit_read_bios
  btrfs: raid56: simplify code flow in rmw_rbio
  btrfs: raid56: simplify error handling and code flow in raid56_parity_write
  btrfs: replace btrfs_wait_tree_block_writeback by wait_on_extent_buffer_writeback
  ...
2023-02-20 12:54:27 -08:00
Linus Torvalds
05e6295f7b fs.idmapped.v6.3
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCY+5NlQAKCRCRxhvAZXjc
 orOaAP9i2h3OJy95nO2Fpde0Bt2UT+oulKCCcGlvXJ8/+TQpyQD/ZQq47gFQ0EAz
 Br5NxeyGeecAb0lHpFz+CpLGsxMrMwQ=
 =+BG5
 -----END PGP SIGNATURE-----

Merge tag 'fs.idmapped.v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping

Pull vfs idmapping updates from Christian Brauner:

 - Last cycle we introduced the dedicated struct mnt_idmap type for
   mount idmapping and the required infrastucture in 256c8aed2b ("fs:
   introduce dedicated idmap type for mounts"). As promised in last
   cycle's pull request message this converts everything to rely on
   struct mnt_idmap.

   Currently we still pass around the plain namespace that was attached
   to a mount. This is in general pretty convenient but it makes it easy
   to conflate namespaces that are relevant on the filesystem with
   namespaces that are relevant on the mount level. Especially for
   non-vfs developers without detailed knowledge in this area this was a
   potential source for bugs.

   This finishes the conversion. Instead of passing the plain namespace
   around this updates all places that currently take a pointer to a
   mnt_userns with a pointer to struct mnt_idmap.

   Now that the conversion is done all helpers down to the really
   low-level helpers only accept a struct mnt_idmap argument instead of
   two namespace arguments.

   Conflating mount and other idmappings will now cause the compiler to
   complain loudly thus eliminating the possibility of any bugs. This
   makes it impossible for filesystem developers to mix up mount and
   filesystem idmappings as they are two distinct types and require
   distinct helpers that cannot be used interchangeably.

   Everything associated with struct mnt_idmap is moved into a single
   separate file. With that change no code can poke around in struct
   mnt_idmap. It can only be interacted with through dedicated helpers.
   That means all filesystems are and all of the vfs is completely
   oblivious to the actual implementation of idmappings.

   We are now also able to extend struct mnt_idmap as we see fit. For
   example, we can decouple it completely from namespaces for users that
   don't require or don't want to use them at all. We can also extend
   the concept of idmappings so we can cover filesystem specific
   requirements.

   In combination with the vfs{g,u}id_t work we finished in v6.2 this
   makes this feature substantially more robust and thus difficult to
   implement wrong by a given filesystem and also protects the vfs.

 - Enable idmapped mounts for tmpfs and fulfill a longstanding request.

   A long-standing request from users had been to make it possible to
   create idmapped mounts for tmpfs. For example, to share the host's
   tmpfs mount between multiple sandboxes. This is a prerequisite for
   some advanced Kubernetes cases. Systemd also has a range of use-cases
   to increase service isolation. And there are more users of this.

   However, with all of the other work going on this was way down on the
   priority list but luckily someone other than ourselves picked this
   up.

   As usual the patch is tiny as all the infrastructure work had been
   done multiple kernel releases ago. In addition to all the tests that
   we already have I requested that Rodrigo add a dedicated tmpfs
   testsuite for idmapped mounts to xfstests. It is to be included into
   xfstests during the v6.3 development cycle. This should add a slew of
   additional tests.

* tag 'fs.idmapped.v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping: (26 commits)
  shmem: support idmapped mounts for tmpfs
  fs: move mnt_idmap
  fs: port vfs{g,u}id helpers to mnt_idmap
  fs: port fs{g,u}id helpers to mnt_idmap
  fs: port i_{g,u}id_into_vfs{g,u}id() to mnt_idmap
  fs: port i_{g,u}id_{needs_}update() to mnt_idmap
  quota: port to mnt_idmap
  fs: port privilege checking helpers to mnt_idmap
  fs: port inode_owner_or_capable() to mnt_idmap
  fs: port inode_init_owner() to mnt_idmap
  fs: port acl to mnt_idmap
  fs: port xattr to mnt_idmap
  fs: port ->permission() to pass mnt_idmap
  fs: port ->fileattr_set() to pass mnt_idmap
  fs: port ->set_acl() to pass mnt_idmap
  fs: port ->get_acl() to pass mnt_idmap
  fs: port ->tmpfile() to pass mnt_idmap
  fs: port ->rename() to pass mnt_idmap
  fs: port ->mknod() to pass mnt_idmap
  fs: port ->mkdir() to pass mnt_idmap
  ...
2023-02-20 11:53:11 -08:00
Christoph Hellwig
04f0847c45 btrfs: don't rely on unchanging ->bi_bdev for zone append remaps
btrfs_record_physical_zoned relies on a bio->bi_bdev samples in the
bio_end_io handler to find the reverse map for remapping the zone append
write, but stacked block device drivers can and usually do change bi_bdev
when sending on the bio to a lower device.  This can happen e.g. with the
nvme-multipath driver when a NVMe SSD sets the shared namespace bit.

But there is no real need for the bdev in btrfs_record_physical_zoned,
as it is only passed to btrfs_rmap_block, which uses it to pick the
mapping to report if there are multiple reverse mappings.  As zone
writes can only do simple non-mirror writes right now, and anything
more complex will use the stripe tree there is no chance of the multiple
mappings case actually happening.

Instead open code the subset of btrfs_rmap_block in
btrfs_record_physical_zoned, which also removes a memory allocation and
remove the bdev field in the ordered extent.

Fixes: d8e3fb106f ("btrfs: zoned: use ZONE_APPEND write for zoned mode")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:55 +01:00
Christoph Hellwig
d5e4377d50 btrfs: split zone append bios in btrfs_submit_bio
The current btrfs zoned device support is a little cumbersome in the data
I/O path as it requires the callers to not issue I/O larger than the
supported ZONE_APPEND size of the underlying device.  This leads to a lot
of extra accounting.  Instead change btrfs_submit_bio so that it can take
write bios of arbitrary size and form from the upper layers, and just
split them internally to the ZONE_APPEND queue limits.  Then remove all
the upper layer warts catering to limited write sized on zoned devices,
including the extra refcount in the compressed_bio.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:53 +01:00
Christoph Hellwig
35a8d7da3c btrfs: remove now spurious bio submission helpers
Call btrfs_submit_bio and btrfs_submit_compressed_read directly from
submit_one_bio now that all additional functionality has moved into
btrfs_submit_bio.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:53 +01:00
Christoph Hellwig
285599b6fe btrfs: remove the fs_info argument to btrfs_submit_bio
btrfs_submit_bio can derive it trivially from bbio->inode, so stop
bothering in the callers.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:53 +01:00
Christoph Hellwig
48253076c3 btrfs: open code submit_encoded_read_bio
Open code the functionality in the only caller and remove the now
superfluous error handling there.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:53 +01:00
Qu Wenruo
a34e4c3f88 btrfs: remove stripe boundary calculation for encoded I/O
Stop looking at the stripe boundary in
btrfs_encoded_read_regular_fill_pages() now that btrfs_submit_bio can
split bios.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:52 +01:00
Christoph Hellwig
67d6698250 btrfs: pass the iomap bio to btrfs_submit_bio
Now that btrfs_submit_bio splits the bio when crossing stripe boundaries,
there is no need for the higher level code to do that manually.

For direct I/O this is really helpful, as btrfs_submit_io can now simply
take the bio allocated by iomap and send it on to btrfs_submit_bio
instead of allocating clones.

For that to work, the bio embedded into struct btrfs_dio_private needs to
become a full btrfs_bio as expected by btrfs_submit_bio.

With this change there is a single work item to offload the entire iomap
bio so the heuristics to skip async processing for bios that were split
isn't needed anymore either.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:52 +01:00
Christoph Hellwig
69ccf3f424 btrfs: handle recording of zoned writes in the storage layer
Move the code that splits the ordered extents and records the physical
location for them to the storage layer so that the higher level consumers
don't have to care about physical block numbers at all.  This will also
allow to eventually remove accounting for the zone append write sizes in
the upper layer with a little bit more block layer work.

Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:52 +01:00
Christoph Hellwig
f8a53bb58e btrfs: handle checksum generation in the storage layer
Instead of letting the callers of btrfs_submit_bio deal with checksumming
the (meta)data in the bio and making decisions on when to offload the
checksumming to the bio, leave that to btrfs_submit_bio.  Do do so the
existing btrfs_submit_bio function is split into an upper and a lower
half, so that the lower half can be offloaded to a workqueue.

Note that this changes the behavior for direct writes to raid56 volumes so
that async checksum offloading is not skipped when more I/O is expected.
This runs counter to the argument explaining why it was done, although I
can't measure any affects of the change.  Commits later in this series
will make sure the entire direct writes is offloaded to the workqueue
at once and thus make sure it is sent to the raid56 code from a single
thread.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:52 +01:00
Christoph Hellwig
f8c44673e5 btrfs: simplify the btrfs_csum_one_bio calling convention
To prepare for further bio submission changes btrfs_csum_one_bio
should be able to take all it's arguments from the btrfs_bio structure.
It can always use the bbio->inode already, and once the compression code
is updated to set ->file_offset that one can be used unconditionally
as well instead of looking at the page mapping now that btrfs doesn't
allow ordered extents to span discontiguous data ranges.

The only slightly tricky bit is the one_ordered flag set by the
compressed writes.  Replace that one with the driver private bio
flag, which gets cleared before the bio is handed off to the block layer
so that we don't get in the way of driver use.

Note: this leaves an argument and a flag to btrfs_wq_submit_bio unused.
But that whole mechanism will be removed in its current form in the
next patch.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:52 +01:00
Christoph Hellwig
deb6216fa0 btrfs: open code the submit_bio_start helpers
The submit helpers are now trivial and can be called directly.  Note
that btree_csum_one_bio has to be moved up in the file a bit to avoid a
forward declaration.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:52 +01:00
Christoph Hellwig
0571b6357c btrfs: remove the io_failure_record infrastructure
struct io_failure_record and the io_failure_tree tree are unused now,
so remove them. This in turn makes struct btrfs_inode smaller by 16
bytes.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:51 +01:00
Christoph Hellwig
3d49d0d312 btrfs: remove now unused checksumming helpers
Remove the unused btrfs_verify_data_csum helper, and fold
btrfs_check_data_csum into its only caller.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:51 +01:00
Christoph Hellwig
7609afac67 btrfs: handle checksum validation and repair at the storage layer
Currently btrfs handles checksum validation and repair in the end I/O
handler for the btrfs_bio.  This leads to a lot of duplicate code
plus issues with varying semantics or bugs, e.g.

- the until recently broken repair for compressed extents
- the fact that encoded reads validate the checksums but do not kick
  of read repair
- the inconsistent checking of the BTRFS_FS_STATE_NO_CSUMS flag

This commit revamps the checksum validation and repair code to instead
work below the btrfs_submit_bio interfaces.

In case of a checksum failure (or a plain old I/O error), the repair
is now kicked off before the upper level ->end_io handler is invoked.

Progress of an in-progress repair is tracked by a small structure
that is allocated using a mempool for each original bio with failed
sectors, which holds a reference to the original bio.   This new
structure is allocated using a mempool to guarantee forward progress
even under memory pressure.  The mempool will be replenished when
the repair completes, just as the mempools backing the bios.

There is one significant behavior change here:  If repair fails or
is impossible to start with, the whole bio will be failed to the
upper layer.  This is the behavior that all I/O submitters except
for buffered I/O already emulated in their end_io handler.  For
buffered I/O this now means that a large readahead request can
fail due to a single bad sector, but as readahead errors are ignored
the following readpage if the sector is actually accessed will
still be able to read.  This also matches the I/O failure handling
in other file systems.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:51 +01:00
Christoph Hellwig
e52190441b btrfs: add a btrfs_data_csum_ok helper
Add a new checksumming helper that wraps btrfs_check_data_csum and
does all the checks to if we're dealing with some form of nodatacsum
I/O.  This helper will be used by the new storage layer checksum
validation and repair code.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:51 +01:00
Christoph Hellwig
1c2b3ee3b0 btrfs: pre-load data checksum for reads in btrfs_submit_bio
Instead of calling btrfs_lookup_bio_sums in every caller of
btrfs_submit_bio that reads data, do the call once in btrfs_submit_bio.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:51 +01:00
Christoph Hellwig
7276aa7d38 btrfs: save the bio iter for checksum validation in common code
All callers of btrfs_submit_bio that want to validate checksums
currently have to store a copy of the iter in the btrfs_bio.  Move
the assignment into common code.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:51 +01:00
Christoph Hellwig
4ae2edf12d btrfs: simplify parameters of btrfs_lookup_bio_sums
The csums argument is always NULL now, so remove it and always allocate
the csums array in the btrfs_bio.  Also pass the btrfs_bio instead of
inode + bio to document that this function requires a btrfs_bio and
not just any bio.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:50 +01:00
Christoph Hellwig
5fa356531e btrfs: remove the direct I/O read checksum lookup optimization
To prepare for pending changes drop the optimization to only look up
csums once per bio that is submitted from the iomap layer.  In the
short run this does cause additional lookups for fragmented direct
reads, but later in the series, the bio based lookup will be used on
the entire bio submitted from iomap, restoring the old behavior
in common code.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:50 +01:00
Christoph Hellwig
d0e5cb2be7 btrfs: add a btrfs_inode pointer to struct btrfs_bio
All btrfs_bio I/Os are associated with an inode.  Add a pointer to that
inode, which will allow to simplify a lot of calling conventions, and
which will be needed in the I/O completion path in the future.

This grow the btrfs_bio structure by a pointer, but that grows will
be offset by the removal of the device pointer soon.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:50 +01:00
Christoph Hellwig
36d4556745 btrfs: remove the wait argument to btrfs_start_ordered_extent
Given that wait is always set to 1, so remove the argument.
Last use of wait with 0 was in 0c304304fe ("Btrfs: remove
csum_bytes_left").

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-13 17:50:34 +01:00
Yushan Zhou
ce394a7f39 btrfs: use PAGE_{ALIGN, ALIGNED, ALIGN_DOWN} macro
The header file linux/mm.h provides PAGE_ALIGN, PAGE_ALIGNED,
PAGE_ALIGN_DOWN macros. Use these macros to make code more
concise.

Signed-off-by: Yushan Zhou <katrinzhou@tencent.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-13 17:50:34 +01:00
Peng Hao
d31de37850 btrfs: go to matching label when cleaning em in btrfs_submit_direct
When btrfs_get_chunk_map fails to allocate a new em the cleanup does not
need to be done so the goto target is out_err, which is consistent with
current coding style.

Signed-off-by: Peng Hao <flyingpeng@tencent.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-13 17:50:33 +01:00
Josef Bacik
0e47b25caf btrfs: fix uninitialized variable warning in btrfs_cleanup_ordered_extents
We can conditionally pass in a locked page, and then we'll use that page
range to skip marking errors as that will happen in another layer.
However this causes the compiler to complain because it doesn't
understand we only use these values when we have the page.  Make the
compiler stop complaining by setting these values to 0.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-13 17:50:33 +01:00
Christian Brauner
f2d40141d5
fs: port inode_init_owner() to mnt_idmap
Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19 09:24:28 +01:00
Christian Brauner
4609e1f18e
fs: port ->permission() to pass mnt_idmap
Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19 09:24:28 +01:00
Christian Brauner
13e83a4923
fs: port ->set_acl() to pass mnt_idmap
Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19 09:24:27 +01:00
Christian Brauner
011e2b717b
fs: port ->tmpfile() to pass mnt_idmap
Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19 09:24:27 +01:00
Christian Brauner
e18275ae55
fs: port ->rename() to pass mnt_idmap
Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19 09:24:26 +01:00
Christian Brauner
5ebb29bee8
fs: port ->mknod() to pass mnt_idmap
Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19 09:24:26 +01:00
Christian Brauner
c54bd91e9e
fs: port ->mkdir() to pass mnt_idmap
Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19 09:24:26 +01:00
Christian Brauner
7a77db9551
fs: port ->symlink() to pass mnt_idmap
Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19 09:24:25 +01:00
Christian Brauner
6c960e68aa
fs: port ->create() to pass mnt_idmap
Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19 09:24:25 +01:00
Christian Brauner
b74d24f7a7
fs: port ->getattr() to pass mnt_idmap
Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19 09:24:25 +01:00
Christian Brauner
c1632a0f11
fs: port ->setattr() to pass mnt_idmap
Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19 09:24:02 +01:00
Linus Torvalds
fc7b76c4a4 for-6.2-rc2-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmO3GgQACgkQxWXV+ddt
 WDt23w/+M7YshE37i5NVRFsFQ4E2/kNQAnbUvSDg5xTmaWkQo/XOMbO9EGUoTLQW
 vT5LmUxn3ynfLu65jnbBREyqjT1JoFN47gTFud+Y7XayBZvq/EVwkkBu5vd/Xwu+
 bE/ms/mWvDNuBnNjBjjKCvMebUZFs2Yn4BGGGCor2zs+u2SL9yd8gHzaBABPr0jd
 Jt1XcmdlYzIJ/59oWZI9B9yP//3z/ad2cgI6aCcbALocWW3LtUATRgJt5O72IFdO
 HweiMw/Cvd2EFBmiur3NTsAi80vyV1VUImxMKD8yrWp5vdR4ZSAeMFd7vFQpfCco
 u/8LHE1xzq3Ael0yGSQIB+UhBTHxFp1lCKTtA1vC9Iv0APVjd2zJlqf18z+hdgr9
 ULU3wxVaN9rtHd2vttt+u/YikJYwFnYw+iNK2FNYIKU2q3pidoQHgEKOCJF7s1pY
 Yrpk6kYJNaS9nT71/sX57aLA/WmIx1KFkA16Yvi+RqnMQVYJtuEleRRp95ZdXAg/
 CzjkugN3gmQvsv43FQLiKHFd/8bDnhcft48tIVjikCpSar3VwFoV7A5mgWs18ULO
 g+vyjWm1P2UagXhjLl/rsULWNLVAYOKsKXEDnRV3993lCA+EXiQbFY8gA16dfKMJ
 ho1yspX+N2ItORT7lo6ZPmDIWZ37hUyo8Bfhk5RaUKpE/adEBwM=
 =xM0t
 -----END PGP SIGNATURE-----

Merge tag 'for-6.2-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "A few more regression and regular fixes:

   - regressions:
       - fix assertion condition using = instead of ==
       - fix false alert on bad tree level check
       - fix off-by-one error in delalloc search during lseek

   - fix compat ro feature check at read-write remount

   - handle case when read-repair happens with ongoing device replace

   - updated error messages"

* tag 'for-6.2-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix compat_ro checks against remount
  btrfs: always report error in run_one_delayed_ref()
  btrfs: handle case when repair happens with dev-replace
  btrfs: fix off-by-one in delalloc search during lseek
  btrfs: fix false alert on bad tree level check
  btrfs: add error message for metadata level mismatch
  btrfs: fix ASSERT em->len condition in btrfs_get_extent
2023-01-06 12:01:49 -08:00
Tanmay Bhushan
946c2923e7 btrfs: fix ASSERT em->len condition in btrfs_get_extent
The em->len value is supposed to be verified in the assertion condition
though we expect it to be same as the sectorsize.

Fixes: a196a8944f ("btrfs: do not reset extent map members for inline extents read")
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Tanmay Bhushan <007047221b@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-01-03 15:53:18 +01:00
Linus Torvalds
69b41ac87e for-6.2-rc2-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmOyzdUACgkQxWXV+ddt
 WDt4qhAAqZZ7Tldx3kVKN6ExBfcDoimeQPPZmmMnL7A7POQyATtyBHCcu9ymj6Z6
 tuUqYcj7h4ydeHjL0AvaskpV1ALkfopkOA9KWAE2m1lyu4qclF6tSEJl7AKyCft7
 g4UyBpCFcnml/by0JeErHMJoxUz/AADYfW/wbyM/XvH2IiODJWf4mMWzJaL+t+GP
 rkJe9OgtmKEVZ2h5Gvdfnw4CrYm/Ds7CfG0UntpwIHvQBLHcms+OvFDSxRKZHxGs
 kt4u/b589AgL+8xNQrpfWfUQf9Zev2c+ekatU3ibi+c67XRtv45kHwsJvqaX+gmV
 +AaBI0GrQDdHXPNU22nmXeIi7tb3JnI/Vy6GHNkopIzdWkIiEtRu8hkVARhRxle7
 Z1WEAWgzPj2QerwmWrgk2TedxF1KD5J0jEJlNaNN7Dh3T8Fu5YjediQVf6mbKhkM
 yFUd0OBAlGNhEqq42ObH6TUYsqbzGk58EYaHGzBDa6QbA/yEfHaFwSqRstg/X3gv
 7WxImSq67KN0SkZZDMszZxzfEehXK9nmxoIfgo0/WGaYMSCxzBs6Xh17SJl9bhiE
 7Cee5dfiHamrYZF6oGpolP/FoZx68yPJXRmfEUQARTrMvF7cE62hjLLUjU7OgW9m
 GeLoFDq9bAh3OC4aEPdqyyu3Bh2yOfMPwpCO1wMk9I/tsIvR8mY=
 =+EpE
 -----END PGP SIGNATURE-----

Merge tag 'for-6.2-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "First batch of regression and regular fixes:

   - regressions:
       - fix error handling after conversion to qstr for paths
       - fix raid56/scrub recovery caused by uninitialized variable
         after conversion to error bitmaps
       - restore qgroup backref lookup behaviour after recent
         refactoring
       - fix leak of device lists at module exit time

   - fix resolving backrefs for inline extent followed by prealloc

   - reset defrag ioctl buffer on memory allocation error"

* tag 'for-6.2-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix fscrypt name leak after failure to join log transaction
  btrfs: scrub: fix uninitialized return value in recover_scrub_rbio
  btrfs: fix resolving backrefs for inline extent followed by prealloc
  btrfs: fix trace event name typo for FLUSH_DELAYED_REFS
  btrfs: restore BTRFS_SEQ_LAST when looking up qgroup backref lookup
  btrfs: fix leak of fs devices after removing btrfs module
  btrfs: fix an error handling path in btrfs_defrag_leaves()
  btrfs: fix an error handling path in btrfs_rename()
2023-01-02 11:06:18 -08:00
Christophe JAILLET
abe3bf7425 btrfs: fix an error handling path in btrfs_rename()
If new_whiteout_inode() fails, some resources need to be freed.
Add the missing goto to the error handling path.

Fixes: ab3c5c18e8 ("btrfs: setup qstr from dentrys using fscrypt helper")
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-15 19:16:47 +01:00
Linus Torvalds
149c51f876 for-6.2-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmOSLtIACgkQxWXV+ddt
 WDvpQA//dQ3Wosz5puFNiZvoSUn/BnYJueZHjwF0bWY8OYINkF1PvDenu/WotyFz
 Ozf4Yl4Afxncz+FjDnOtlpr6KsSU5NqdGM3NrY0eNsxd2t1KrTsN0LgkA4m24p8b
 YsYp7pygbMm7c+h0X4uFpebY4lABkEPCBXnI//ktsls0xG5sOvGfZA3rdUP0bou2
 JTn6hk+s0cLTNoTiOCGNHRJbeTzHLR0viZj/E4LCJfCeJvAmOLZamUjqe9sBNYAg
 YtsrZTpUIL3JgmRi5B6jG4fHSXOnE14mKmRIR3xPME6J6eoYyNOeuSh1oNmJEuoE
 B7nD5We+x5+isjXNw/V5CQrs7FF09UbdpbNb9NF5CYQWv40OCeefuai1opGtBUxX
 dvbfmf1blYpWW/wfFOKQwMOsl8kZIZYx68FW2OBUNglB6yRpX/3QgFSGb8kPCr83
 DW2ttqwkpSNPMKk92I/owIc4BRvZ+LMR/PimEHB/Sa2apZA2/L+7RGwoaaei1QNX
 1tJxHWeJFLDZ+YRxjO1eKqhWdGQPn1kkq8LoXLi3tGaNF4kYQfhWOSM3WRowvx1q
 f99XRgA8JQnqZS83zqRIspWlpFK0CFdvzG1Zlqx+eoxERfeaMNA2fHxv1YCyFV4+
 TiXgsnCo+PIBwlvL/HjUWZgYE9+AD+NN5vyoE2UDYff4AgBFTE8=
 =Nqg9
 -----END PGP SIGNATURE-----

Merge tag 'for-6.2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs updates from David Sterba:
 "This round there are a lot of cleanups and moved code so the diffstat
  looks huge, otherwise there are some nice performance improvements and
  an update to raid56 reliability.

  User visible features:

   - raid56 reliability vs performance trade off:
      - fix destructive RMW for raid5 data (raid6 still needs work): do
        full checksum verification for all data during RMW cycle, this
        should prevent rewriting potentially corrupted data without
        notice
      - stripes are cached in memory which should reduce the performance
        impact but still can hurt some workloads
      - checksums are verified after repair again
      - this is the last option without introducing additional features
        (write intent bitmap, journal, another tree), the extra checksum
        read/verification was supposed to be avoided by the original
        implementation exactly for performance reasons but that caused
        all the reliability problems

   - discard=async by default for devices that support it

   - implement emergency flush reserve to avoid almost all unnecessary
     transaction aborts due to ENOSPC in cases where there are too many
     delayed refs or delayed allocation

   - skip block group synchronization if there's no change in used
     bytes, can reduce transaction commit count for some workloads

  Performance improvements:

   - fiemap and lseek:
      - overall speedup due to skipping unnecessary or duplicate
        searches (-40% run time)
      - cache some data structures and sharedness of extents (-30% run
        time)

   - send:
      - faster backref resolution when finding clones
      - cached leaf to root mapping for faster backref walking
      - improved clone/sharing detection
      - overall run time improvements (-70%)

  Core:

   - module initialization converted to a table of function pointers run
     in a sequence

   - preparation for fscrypt, extend passing file names across calls,
     dir item can store encryption status

   - raid56 updates:
      - more accurate error tracking of sectors within stripe
      - simplify recovery path and remove dedicated endio worker kthread
      - simplify scrub call paths
      - refactoring to support the extra data checksum verification
        during RMW cycle

   - tree block parentness checks consolidated and done at metadata read
     time

   - improved error handling

   - cleanups:
      - move a lot of code for better synchronization between kernel and
        user space sources, split big files
      - enum cleanups
      - GFP flag cleanups
      - header file cleanups, prototypes, dependencies
      - redundant parameter cleanups
      - inline extent handling simplifications
      - inode parameter conversion
      - data structure cleanups, reductions, renames, merges"

* tag 'for-6.2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (249 commits)
  btrfs: print transaction aborted messages with an error level
  btrfs: sync some cleanups from progs into uapi/btrfs.h
  btrfs: do not BUG_ON() on ENOMEM when dropping extent items for a range
  btrfs: fix extent map use-after-free when handling missing device in read_one_chunk
  btrfs: remove outdated logic from overwrite_item() and add assertion
  btrfs: unify overwrite_item() and do_overwrite_item()
  btrfs: replace strncpy() with strscpy()
  btrfs: fix uninitialized variable in find_first_clear_extent_bit
  btrfs: fix uninitialized parent in insert_state
  btrfs: add might_sleep() annotations
  btrfs: add stack helpers for a few btrfs items
  btrfs: add nr_global_roots to the super block definition
  btrfs: remove BTRFS_LEAF_DATA_OFFSET
  btrfs: add helpers for manipulating leaf items and data
  btrfs: add eb to btrfs_node_key_ptr_offset
  btrfs: pass the extent buffer for the btrfs_item_nr helpers
  btrfs: move the csum helpers into ctree.h
  btrfs: move eb offset helpers into extent_io.h
  btrfs: move file_extent_item helpers into file-item.h
  btrfs: move leaf_data_end into ctree.c
  ...
2022-12-12 20:47:51 -08:00
Linus Torvalds
6a518afcc2 fs.acl.rework.v6.2
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCY5bwTgAKCRCRxhvAZXjc
 ovd2AQCK00NAtGjQCjQPQGyTa4GAPqvWgq1ef0lnhv+TL5US5gD9FncQ8UofeMXt
 pBfjtAD6ettTPCTxUQfnTwWEU4rc7Qg=
 =27Wm
 -----END PGP SIGNATURE-----

Merge tag 'fs.acl.rework.v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping

Pull VFS acl updates from Christian Brauner:
 "This contains the work that builds a dedicated vfs posix acl api.

  The origins of this work trace back to v5.19 but it took quite a while
  to understand the various filesystem specific implementations in
  sufficient detail and also come up with an acceptable solution.

  As we discussed and seen multiple times the current state of how posix
  acls are handled isn't nice and comes with a lot of problems: The
  current way of handling posix acls via the generic xattr api is error
  prone, hard to maintain, and type unsafe for the vfs until we call
  into the filesystem's dedicated get and set inode operations.

  It is already the case that posix acls are special-cased to death all
  the way through the vfs. There are an uncounted number of hacks that
  operate on the uapi posix acl struct instead of the dedicated vfs
  struct posix_acl. And the vfs must be involved in order to interpret
  and fixup posix acls before storing them to the backing store, caching
  them, reporting them to userspace, or for permission checking.

  Currently a range of hacks and duct tape exist to make this work. As
  with most things this is really no ones fault it's just something that
  happened over time. But the code is hard to understand and difficult
  to maintain and one is constantly at risk of introducing bugs and
  regressions when having to touch it.

  Instead of continuing to hack posix acls through the xattr handlers
  this series builds a dedicated posix acl api solely around the get and
  set inode operations.

  Going forward, the vfs_get_acl(), vfs_remove_acl(), and vfs_set_acl()
  helpers must be used in order to interact with posix acls. They
  operate directly on the vfs internal struct posix_acl instead of
  abusing the uapi posix acl struct as we currently do. In the end this
  removes all of the hackiness, makes the codepaths easier to maintain,
  and gets us type safety.

  This series passes the LTP and xfstests suites without any
  regressions. For xfstests the following combinations were tested:
   - xfs
   - ext4
   - btrfs
   - overlayfs
   - overlayfs on top of idmapped mounts
   - orangefs
   - (limited) cifs

  There's more simplifications for posix acls that we can make in the
  future if the basic api has made it.

  A few implementation details:

   - The series makes sure to retain exactly the same security and
     integrity module permission checks. Especially for the integrity
     modules this api is a win because right now they convert the uapi
     posix acl struct passed to them via a void pointer into the vfs
     struct posix_acl format to perform permission checking on the mode.

     There's a new dedicated security hook for setting posix acls which
     passes the vfs struct posix_acl not a void pointer. Basing checking
     on the posix acl stored in the uapi format is really unreliable.
     The vfs currently hacks around directly in the uapi struct storing
     values that frankly the security and integrity modules can't
     correctly interpret as evidenced by bugs we reported and fixed in
     this area. It's not necessarily even their fault it's just that the
     format we provide to them is sub optimal.

   - Some filesystems like 9p and cifs need access to the dentry in
     order to get and set posix acls which is why they either only
     partially or not even at all implement get and set inode
     operations. For example, cifs allows setxattr() and getxattr()
     operations but doesn't allow permission checking based on posix
     acls because it can't implement a get acl inode operation.

     Thus, this patch series updates the set acl inode operation to take
     a dentry instead of an inode argument. However, for the get acl
     inode operation we can't do this as the old get acl method is
     called in e.g., generic_permission() and inode_permission(). These
     helpers in turn are called in various filesystem's permission inode
     operation. So passing a dentry argument to the old get acl inode
     operation would amount to passing a dentry to the permission inode
     operation which we shouldn't and probably can't do.

     So instead of extending the existing inode operation Christoph
     suggested to add a new one. He also requested to ensure that the
     get and set acl inode operation taking a dentry are consistently
     named. So for this version the old get acl operation is renamed to
     ->get_inode_acl() and a new ->get_acl() inode operation taking a
     dentry is added. With this we can give both 9p and cifs get and set
     acl inode operations and in turn remove their complex custom posix
     xattr handlers.

     In the future I hope to get rid of the inode method duplication but
     it isn't like we have never had this situation. Readdir is just one
     example. And frankly, the overall gain in type safety and the more
     pleasant api wise are simply too big of a benefit to not accept
     this duplication for a while.

   - We've done a full audit of every codepaths using variant of the
     current generic xattr api to get and set posix acls and
     surprisingly it isn't that many places. There's of course always a
     chance that we might have missed some and if so I'm sure we'll find
     them soon enough.

     The crucial codepaths to be converted are obviously stacking
     filesystems such as ecryptfs and overlayfs.

     For a list of all callers currently using generic xattr api helpers
     see [2] including comments whether they support posix acls or not.

   - The old vfs generic posix acl infrastructure doesn't obey the
     create and replace semantics promised on the setxattr(2) manpage.
     This patch series doesn't address this. It really is something we
     should revisit later though.

  The patches are roughly organized as follows:

   (1) Change existing set acl inode operation to take a dentry
       argument (Intended to be a non-functional change)

   (2) Rename existing get acl method (Intended to be a non-functional
       change)

   (3) Implement get and set acl inode operations for filesystems that
       couldn't implement one before because of the missing dentry.
       That's mostly 9p and cifs (Intended to be a non-functional
       change)

   (4) Build posix acl api, i.e., add vfs_get_acl(), vfs_remove_acl(),
       and vfs_set_acl() including security and integrity hooks
       (Intended to be a non-functional change)

   (5) Implement get and set acl inode operations for stacking
       filesystems (Intended to be a non-functional change)

   (6) Switch posix acl handling in stacking filesystems to new posix
       acl api now that all filesystems it can stack upon support it.

   (7) Switch vfs to new posix acl api (semantical change)

   (8) Remove all now unused helpers

   (9) Additional regression fixes reported after we merged this into
       linux-next

  Thanks to Seth for a lot of good discussion around this and
  encouragement and input from Christoph"

* tag 'fs.acl.rework.v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping: (36 commits)
  posix_acl: Fix the type of sentinel in get_acl
  orangefs: fix mode handling
  ovl: call posix_acl_release() after error checking
  evm: remove dead code in evm_inode_set_acl()
  cifs: check whether acl is valid early
  acl: make vfs_posix_acl_to_xattr() static
  acl: remove a slew of now unused helpers
  9p: use stub posix acl handlers
  cifs: use stub posix acl handlers
  ovl: use stub posix acl handlers
  ecryptfs: use stub posix acl handlers
  evm: remove evm_xattr_acl_change()
  xattr: use posix acl api
  ovl: use posix acl api
  ovl: implement set acl method
  ovl: implement get acl method
  ecryptfs: implement set acl method
  ecryptfs: implement get acl method
  ksmbd: use vfs_remove_acl()
  acl: add vfs_remove_acl()
  ...
2022-12-12 18:46:39 -08:00
Christoph Hellwig
103c19723c btrfs: split the bio submission path into a separate file
The code used by btrfs_submit_bio only interacts with the rest of
volumes.c through __btrfs_map_block (which itself is a more generic
version of two exported helpers) and does not really have anything
to do with volumes.c.  Create a new bio.c file and a bio.h header
going along with it for the btrfs_bio-based storage layer, which
will grow even more going forward.

Also update the file with my copyright notice given that a large
part of the moved code was written or rewritten by me.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:57 +01:00
Qu Wenruo
97e3823933 btrfs: introduce a bitmap based csum range search function
Although we have an existing function, btrfs_lookup_csums_range(), to
find all data checksums for a range, it's based on a btrfs_ordered_sum
list.

For the incoming RAID56 data checksum verification at RMW time, we don't
want to waste time by allocating temporary memory.

So this patch will introduce a new helper, btrfs_lookup_csums_bitmap().
It will use bitmap based result, which will be a perfect fit for later
RAID56 usage.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:57 +01:00
Filipe Manana
8c6e53a79d btrfs: allow passing a cached state record to count_range_bits()
An inode's io_tree can be quite large and there are cases where due to
delalloc it can have thousands of extent state records, which makes the
red black tree have a depth of 10 or more, making the operation of
count_range_bits() slow if we repeatedly call it for a range that starts
where, or after, the previous one we called it for. Such use cases are
when searching for delalloc in a file range that corresponds to a hole or
a prealloc extent, which is done during lseek SEEK_HOLE/DATA and fiemap.

So introduce a cached state parameter to count_range_bits() which we use
to store the last extent state record we visited, and then allow the
caller to pass it again on its next call to count_range_bits(). The next
patches in the series will make fiemap and lseek use the new parameter.

This change is part of a patchset that has the goal to make performance
better for applications that use lseek's SEEK_HOLE and SEEK_DATA modes to
iterate over the extents of a file. Two examples are the cp program from
coreutils 9.0+ and the tar program (when using its --sparse / -S option).
A sample test and results are listed in the changelog of the last patch
in the series:

  1/9 btrfs: remove leftover setting of EXTENT_UPTODATE state in an inode's io_tree
  2/9 btrfs: add an early exit when searching for delalloc range for lseek/fiemap
  3/9 btrfs: skip unnecessary delalloc searches during lseek/fiemap
  4/9 btrfs: search for delalloc more efficiently during lseek/fiemap
  5/9 btrfs: remove no longer used btrfs_next_extent_map()
  6/9 btrfs: allow passing a cached state record to count_range_bits()
  7/9 btrfs: update stale comment for count_range_bits()
  8/9 btrfs: use cached state when looking for delalloc ranges with fiemap
  9/9 btrfs: use cached state when looking for delalloc ranges with lseek

Reported-by: Wang Yugui <wangyugui@e16-tech.com>
Link: https://lore.kernel.org/linux-btrfs/20221106073028.71F9.409509F4@e16-tech.com/
Link: https://lore.kernel.org/linux-btrfs/CAL3q7H5NSVicm7nYBJ7x8fFkDpno8z3PYt5aPU43Bajc1H0h1Q@mail.gmail.com/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:56 +01:00
David Sterba
e55cf7ca85 btrfs: pass btrfs_inode to btrfs_add_delayed_iput
The function is for internal interfaces so we should use the
btrfs_inode.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:55 +01:00
David Sterba
5fc24314c8 btrfs: use btrfs_inode inside btrfs_verify_data_csum
The function is mostly using internal interfaces so we should use the
btrfs_inode.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:55 +01:00
David Sterba
99a01bd638 btrfs: use btrfs_inode inside compress_file_range
The function is mostly using internal interfaces so we should use the
btrfs_inode.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:55 +01:00
David Sterba
99a81a4444 btrfs: switch async_chunk::inode to btrfs_inode
The async_chunk::inode structure is for internal interfaces so we should
use the btrfs_inode.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:55 +01:00
David Sterba
7a0443f031 btrfs: pass btrfs_inode to btrfs_inherit_iflags
The function is for internal interfaces so we should use the
btrfs_inode.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:54 +01:00
David Sterba
4c45a4f4de btrfs: pass btrfs_inode to inode_tree_add
The function is for internal interfaces so we should use the
btrfs_inode.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:54 +01:00
David Sterba
3c1b1c4c0e btrfs: pass btrfs_inode to fixup_tree_root_location
The function is for internal interfaces so we should use the
btrfs_inode.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:54 +01:00
David Sterba
d1de429bce btrfs: pass btrfs_inode to btrfs_inode_by_name
The function is for internal interfaces so we should use the
btrfs_inode.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:54 +01:00
David Sterba
5b7544cb06 btrfs: pass btrfs_inode to btrfs_unlink_subvol
The function is for internal interfaces so we should use the
btrfs_inode.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:54 +01:00
David Sterba
bd54766e40 btrfs: pass btrfs_inode to btrfs_clear_delalloc_extent
The function is for internal interfaces so we should use the
btrfs_inode.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:54 +01:00
David Sterba
62798a4915 btrfs: pass btrfs_inode to btrfs_split_delalloc_extent
The function is for internal interfaces so we should use the
btrfs_inode.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:54 +01:00
David Sterba
4c5d166f6b btrfs: pass btrfs_inode to btrfs_set_delalloc_extent
The function is for internal interfaces so we should use the
btrfs_inode.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:54 +01:00
David Sterba
2454151cde btrfs: pass btrfs_inode to btrfs_merge_delalloc_extent
The function is for internal interfaces so we should use the
btrfs_inode.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:54 +01:00
David Sterba
0988fc7bda btrfs: switch extent_io_tree::private_data to btrfs_inode and rename
The extent_io_tree::private_data was meant to be a preparatory work for
the metadata inode rework but that never materialized. Now it's used
only for an inode so it's better to change the appropriate type and
rename it.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:54 +01:00
David Sterba
35da5a7ede btrfs: drop private_data parameter from extent_io_tree_init
All callers except one pass NULL, so the parameter can be dropped and
the inode::io_tree initialization can be open coded.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:54 +01:00
David Sterba
3c4f91e23a btrfs: pass btrfs_inode to btrfs_delete_subvolume
The function is for internal interfaces so we should use the
btrfs_inode.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:53 +01:00
David Sterba
e569b1d545 btrfs: pass btrfs_inode to __unlink_start_trans
The function is for internal interfaces so we should use the
btrfs_inode.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:53 +01:00
David Sterba
621af94af3 btrfs: pass btrfs_inode to btrfs_check_data_csum
The function is for internal interfaces so we should use the
btrfs_inode.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:53 +01:00
David Sterba
36eeaef559 btrfs: switch btrfs_writepage_fixup::inode to btrfs_inode
The btrfs_writepage_fixup structure is for internal interfaces so we
should use the btrfs_inode.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:53 +01:00
David Sterba
82ca5a04f0 btrfs: pass btrfs_inode to btrfs_add_delalloc_inodes
The function is for internal interfaces so we should use the
btrfs_inode.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:53 +01:00
David Sterba
7152b425da btrfs: pass btrfs_inode to btrfs_dirty_inode
The function is for internal interfaces so we should use the
btrfs_inode.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:53 +01:00
David Sterba
e5d4d75bd3 btrfs: pass btrfs_inode to btrfs_inode_unlock
The function is for internal interfaces so we should use the
btrfs_inode.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:53 +01:00
David Sterba
29b6352b14 btrfs: pass btrfs_inode to btrfs_inode_lock
The function is for internal interfaces so we should use the
btrfs_inode.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:53 +01:00
David Sterba
d9dcae67b7 btrfs: pass btrfs_inode to btrfs_truncate
The function is for internal interfaces so we should use the
btrfs_inode.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:53 +01:00
David Sterba
bb41632ea7 btrfs: pass btrfs_inode to btrfs_submit_dio_bio
The function is for internal interfaces so we should use the
btrfs_inode.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:53 +01:00
David Sterba
e2884c3d44 btrfs: switch btrfs_dio_private::inode to btrfs_inode
The btrfs_dio_private structure is for internal interfaces so we should
use the btrfs_inode.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:52 +01:00
David Sterba
d8f9268ece btrfs: pass btrfs_inode to btrfs_repair_one_sector
The function is for internal interfaces so we should use the
btrfs_inode.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:52 +01:00
David Sterba
d781c1c315 btrfs: pass btrfs_inode to btrfs_submit_dio_repair_bio
The function is for internal interfaces so we should use the
btrfs_inode.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:52 +01:00
David Sterba
b762041629 btrfs: pass btrfs_inode to btrfs_submit_data_read_bio
The function is for internal interfaces so we should use the
btrfs_inode.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:52 +01:00
David Sterba
535a7e5d6b btrfs: pass btrfs_inode to btrfs_submit_data_write_bio
The function is for internal interfaces so we should use the
btrfs_inode.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:52 +01:00
David Sterba
5fcdadc270 btrfs: pass btrfs_inode to btrfs_wq_submit_bio
The function is for internal interfaces so we should use the
btrfs_inode.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:52 +01:00
David Sterba
bfa1706682 btrfs: pass btrfs_inode to btrfs_submit_bio_start_direct_io
The function is for internal interfaces so we should use the
btrfs_inode.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:52 +01:00
David Sterba
882681ac98 btrfs: pass btrfs_inode to btrfs_submit_bio_start
The function is for internal interfaces so we should use the
btrfs_inode.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:52 +01:00
David Sterba
ad65ecf30b btrfs: simplify btree_submit_bio_start and btrfs_submit_bio_start parameters
After previous patches the unused parameters can be removed from
btree_submit_bio_start and btrfs_submit_bio_start as they don't need to
conform to the extent_submit_bio_start_t typedef.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:51 +01:00
David Sterba
ab2072b292 btrfs: change how submit bio callback is passed to btrfs_wq_submit_bio
There's a callback function parameter for btrfs_wq_submit_bio that can
be one of: metadata, buffered data, direct io data. The callback
abstraction is unnecessary as we have all functions available.

Replace the parameter with a command that leads to a direct call in
run_one_async_start. The called functions can be then simplified and we
can also remove the extent_submit_bio_start_t typedef.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:51 +01:00
David Sterba
7920b773bd btrfs: drop parameter compression_type from btrfs_submit_dio_repair_bio
Compression and direct io don't work together so the compression
parameter can be dropped after previous patch that changed the call
to direct.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:51 +01:00
David Sterba
19af6a7d34 btrfs: change how repair action is passed to btrfs_repair_one_sector
There's a function pointer passed to btrfs_repair_one_sector that will
submit the right bio for repair. However there are only two callbacks,
for buffered and for direct IO. This can be simplified to a bool-based
switch and call either function, indirect calls in this case is an
unnecessary abstraction. This allows to remove the submit_bio_hook_t
typedef.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:51 +01:00
Qu Wenruo
a982fc8220 btrfs: extract the inline extent read code into its own function
Currently we have inline extent read code behind two levels of
indentation, factor them them out into a new function,
read_inline_extent(), to make it a little easier to read.

Since we're here, also remove @extent_offset and @pg_offset arguments
from uncompress_inline() function, as it's not possible to have inline
extents at non-inline file offset.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:48 +01:00
Qu Wenruo
280f15cb96 btrfs: remove new_inline argument from btrfs_extent_item_to_extent_map()
The argument @new_inline changes the following members of extent_map:

- em->compress_type
- EXTENT_FLAG_COMPRESSED of em->flags

However neither members makes a difference for inline extents:

- Inline extent read never use above em members

  As inside btrfs_get_extent() we directly use the file extent item to
  do the read.

- Inline extents are never to be split

  Thus code really needs em->compress_type or that flag will never be
  executed on inlined extents.
  (btrfs_drop_extent_cache() would be one example)

- Fiemap no longer relies on extent maps

  Recent fiemap optimization makes fiemap to search subvolume tree
  directly, without using any extent map at all.

  Thus those members make no difference for inline extents any more.

Furthermore such exception without much explanation is really a source
of confusion.

Thus this patch will completely remove the argument, and always set the
involved members, unifying the behavior.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:48 +01:00
Qu Wenruo
a196a8944f btrfs: do not reset extent map members for inline extents read
Currently for inline extents read inside btrfs_get_extent(), we will
reset several extent map members:

- em->start

  Reset to extent_start, which is completely unnecessary.
  The extent_start and em->start should have already be zero, ensured by
  tree-checker already.

- em->len

  Reset the round_up(copy_size, fs_info->sectorsize), which is again
  unnecessary.

- em->orig_block_len

  Reset to em->len (sectorsize), while it is originally unset from
  btrfs_extent_item_to_extent_map().

  This makes no difference, as all extent map handling paths will
  ignore the orig_block_len if they found it's an inlined extent.

  Such inline extent orig_block_len ignoring examples can be found in
  btrfs_drop_extent_cache().

- em->orig_start

  Reset to em->start (0), while it is originally set to EXTENT_MAP_HOLE.

  This makes no difference either, as all extent map handling paths will
  ignore the em->orig_start if they found it's an inline extent.

Thus all these em members resetting are unnecessary.

Replace them with ASSERT()s checking the only two members (block_start
and length) that make sense.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:47 +01:00
Qu Wenruo
affc542433 btrfs: make inline extent read calculation much simpler
Currently we calculate inline extent read in a way that inline extent
can start at non-zero offset.

This is consistent with the inode selftests, which puts an inline extent
at file offset 5.

Meanwhile the inline extent creation code will only create inline extent
at file offset 0.

Furthermore with the introduction of tree-checker on file extents, we are
actively rejecting inline extent which starts at non-zero file offset.
And so far we haven't yet seen any report of rejected inline extents at
non-zero file offset.

This all means, the extra calculation to support inline extents at
non-zero file offset is mostly paper weight, and damaging the
readability of the code.

Thus this patch will:

- Add extra ASSERT()s to make sure involved file offset are all 0

- Remove @extent_offset calculation

- Simplify the involved code
  As several variables are now single-use, no need to declare them as
  a variable anymore.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:47 +01:00
Josef Bacik
aa5d3003dd btrfs: move orphan prototypes into orphan.h
Move these out of ctree.h into orphan.h to cut down on code in ctree.h.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:47 +01:00
Josef Bacik
7f0add250f btrfs: move super_block specific helpers into super.h
This will make syncing fs.h to user space a little easier if we can pull
the super block specific helpers out of fs.h and put them in super.h.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:47 +01:00
Josef Bacik
5c11adcc38 btrfs: move verity prototypes into verity.h
Move these out of ctree.h into verity.h to cut down on code in ctree.h.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:47 +01:00
Josef Bacik
677074792a btrfs: move relocation prototypes into relocation.h
Move these out of ctree.h into relocation.h to cut down on code in
ctree.h

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:47 +01:00
Josef Bacik
33cf97a7b6 btrfs: move acl prototypes into acl.h
Move these out of ctree.h into acl.h to cut down on code in ctree.h.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:46 +01:00
Josef Bacik
af142b6f44 btrfs: move file prototypes to file.h
Move these out of ctree.h into file.h to cut down on code in ctree.h.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:46 +01:00
Josef Bacik
7572dec8f5 btrfs: move ioctl prototypes into ioctl.h
Move these out of ctree.h into ioctl.h to cut down on code in ctree.h.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:46 +01:00
Josef Bacik
c7a03b524d btrfs: move uuid tree prototypes to uuid-tree.h
Move these out of ctree.h into uuid-tree.h to cut down on the code in
ctree.h.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:46 +01:00
Josef Bacik
7c8ede1628 btrfs: move file-item prototypes into their own header
Move these prototypes out of ctree.h and into file-item.h.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:46 +01:00
Josef Bacik
f2b39277b8 btrfs: move dir-item prototypes into dir-item.h
Move these prototypes out of ctree.h and into their own header file.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:46 +01:00
Josef Bacik
59b818e064 btrfs: move defrag related prototypes to their own header
Now that the defrag code is all in one file, create a defrag.h and move
all the defrag related prototypes and helper out of ctree.h and into
defrag.h.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:46 +01:00
David Sterba
43dd529abe btrfs: update function comments
Update, reformat or reword function comments. This also removes the kdoc
marker so we don't get reports when the function name is missing.

Changes made:

- remove kdoc markers
- reformat the brief description to be a proper sentence
- reword to imperative voice
- align parameter list
- fix typos

Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:45 +01:00
Josef Bacik
45c40c8f95 btrfs: move root tree prototypes to their own header
Move all the root-tree.c prototypes to root-tree.h, and then update all
the necessary files to include the new header.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:44 +01:00
Josef Bacik
a0231804af btrfs: move extent-tree helpers into their own header file
Move all the extent tree related prototypes to extent-tree.h out of
ctree.h, and then go include it everywhere needed so everything
compiles.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:44 +01:00
Omar Sandoval
94a48aef49 btrfs: extend btrfs_dir_item type to store encryption status
For directories with encrypted files/filenames, we need to store a flag
indicating this fact. There's no room in other fields, so we'll need to
borrow a bit from dir_type. Since it's now a combination of type and
flags, we rename it to dir_flags to reflect its new usage.

The new flag, FT_ENCRYPTED, indicates a directory containing encrypted
data, which is orthogonal to file type; therefore, add the new
flag, and make conversion from directory type to file type strip the
flag.

As the file types almost never change we can afford to use the bits.
Actual usage will be guarded behind an incompat bit, this patch only
adds the support for later use by fscrypt.

Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:43 +01:00
Sweet Tea Dorminy
6db7531882 btrfs: use struct fscrypt_str instead of struct qstr
While struct qstr is more natural without fscrypt, since it's provided
by dentries, struct fscrypt_str is provided by the fscrypt handlers
processing dentries, and is thus more natural in the fscrypt world.
Replace all of the struct qstr uses with struct fscrypt_str.

Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:43 +01:00
Sweet Tea Dorminy
ab3c5c18e8 btrfs: setup qstr from dentrys using fscrypt helper
Most places where we get a struct qstr, we are doing so from a dentry.
With fscrypt, the dentry's name may be encrypted on-disk, so fscrypt
provides a helper to convert a dentry name to the appropriate disk name
if necessary. Convert each of the dentry name accesses to use
fscrypt_setup_filename(), then convert the resulting fscrypt_name back
to an unencrypted qstr. This does not work for nokey names, but the
specific locations that could spawn nokey names are noted.

At present, since there are no encrypted directories, nothing goes down
the filename encryption paths.

Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:43 +01:00
Sweet Tea Dorminy
e43eec81c5 btrfs: use struct qstr instead of name and namelen pairs
Many functions throughout btrfs take name buffer and name length
arguments. Most of these functions at the highest level are usually
called with these arguments extracted from a supplied dentry's name.
But the entire name can be passed instead, making each function a little
more elegant.

Each function whose arguments are currently the name and length
extracted from a dentry is herein converted to instead take a pointer to
the name in the dentry. The couple of calls to these calls without a
struct dentry are converted to create an appropriate qstr to pass in.
Additionally, every function which is only called with a name/len
extracted directly from a qstr is also converted.

This change has positive effect on stack consumption, frame of many
functions is reduced but this will be used in the future for fscrypt
related structures.

Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:43 +01:00
Josef Bacik
ad1ac5012c btrfs: move btrfs_map_token to accessors
This is specific to the item-accessor code, move it out of ctree.h into
accessor.h/.c and then update the users to include the new header file.
This un-inlines btrfs_init_map_token, however this is only called once
per function so it's not critical to be inlined.  This also saves 904
bytes of code on a release build.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:42 +01:00
Josef Bacik
c7f13d428e btrfs: move fs wide helpers out of ctree.h
We have several fs wide related helpers in ctree.h.  The bulk of these
are the incompat flag test helpers, but there are things such as
btrfs_fs_closing() and the read only helpers that also aren't directly
related to the ctree code.  Move these into a fs.h header, which will
serve as the location for file system wide related helpers.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:41 +01:00
Josef Bacik
eda517fd0c btrfs: move free space cachep's out of ctree.h
This is local to the free-space-cache.c code, remove it from ctree.h and
inode.c, create new init/exit functions for the cachep, and move it
locally to free-space-cache.c.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:37 +01:00
Josef Bacik
226463d7b1 btrfs: move btrfs_path_cachep out of ctree.h
This is local to the ctree code, remove it from ctree.h and inode.c,
create new init/exit functions for the cachep, and move it locally to
ctree.c.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:37 +01:00
Josef Bacik
956504a331 btrfs: move trans_handle_cachep out of ctree.h
This is local to the transaction code, remove it from ctree.h and
inode.c, create new helpers in the transaction to handle the init work
and move the cachep locally to transaction.c.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:37 +01:00
Josef Bacik
f60acad355 btrfs: move btrfs_print_data_csum_error into inode.c
This isn't used outside of inode.c, there's no reason to define it in
btrfs_inode.h. Drop the inline and add __cold as it's for errors that
are not in any hot path.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:37 +01:00
Josef Bacik
83ae4133ac btrfs: add a cached_state to try_lock_extent
With nowait becoming more pervasive throughout our codebase go ahead and
add a cached_state to try_lock_extent().  This allows us to be faster
about clearing the locked area if we have contention, and then gives us
the same optimization for unlock if we are able to lock the range.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:35 +01:00
Linus Torvalds
f2f32f8af2 for-6.1-rc3-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmNj2yMACgkQxWXV+ddt
 WDsRPg/+Mgp4lLF6WCUhWNbO7K7EdJ+YEikDr7/35TTUcnpqZ6oBrWiHwwcG4d2S
 V7eQLf/yId5zVfSD+aZEOSz8gC6Mh+0CujVdj09BYuDl7fDIEjFaoH38JsAhANFO
 uUaqxzgZw2feWpwiEF9P2iwZD8VqUMAELjASjBBZVMs6WCpM6SDQRPDj/IkfI2BN
 qgtKB7Im9VYBN92eIKlg6+MQCwuMMXKZRQH3dkPfYGJYQMDRyYrDxoeVWSAf9pGX
 Xvb3mEUZEcPQmE6ue78Ny0OGXX2sh7Mvz4cEFBJvFUPi99Iu6TluVBgN0akuMTwZ
 oZbV/1Abs+KV+yOICAhE/u7mKkLPsfRZeR4Ly8qjIlMUN12r1MR1BuGOJj750nsi
 LLBohtfQ+BQYpEOrJ32MbdxXy6/CBinC6Xqz+J3M+F/AMYREPLaND7Co5YkgWyT4
 pViRpgxLV+plP5bizbiXtnXI1h4OMBRx7idAZmeBNFtquHSzgf9psUz+sHI8Wvr2
 tAI+6n7RSnUDG/N+p0cJSqZf4RZWevjVJrUS4pko56t9ixK/xPkyVFbYLIdcd3bC
 N83tDgNtdBuyuFw3f2Ye+f0BxBhpZx6getQW2W9mb+6ylN5nyHFWmQpDGO5sDec0
 KJRR3w8vQ/0+64P2JhjFbYW55CzpmB279qGxemsnGakDweEcs+o=
 =Ltzp
 -----END PGP SIGNATURE-----

Merge tag 'for-6.1-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "A batch of error handling fixes for resource leaks, fixes for nowait
  mode in combination with direct and buffered IO:

   - direct IO + dsync + nowait could miss a sync of the file after
     write, add handling for this combination

   - buffered IO + nowait should not fail with ENOSPC, only blocking IO
     could determine that

   - error handling fixes:
      - fix inode reserve space leak due to nowait buffered write
      - check the correct variable after allocation (direct IO submit)
      - fix inode list leak during backref walking
      - fix ulist freeing in self tests"

* tag 'for-6.1-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix inode reserve space leak due to nowait buffered write
  btrfs: fix nowait buffered write returning -ENOSPC
  btrfs: remove pointless and double ulist frees in error paths of qgroup tests
  btrfs: fix ulist leaks in error paths of qgroup self tests
  btrfs: fix inode list leak during backref walking at find_parent_nodes()
  btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
  btrfs: fix lost file sync on direct IO write with nowait and dsync iocb
  btrfs: fix a memory allocation failure test in btrfs_submit_direct
2022-11-03 11:12:48 -07:00
Filipe Manana
8184620ae2 btrfs: fix lost file sync on direct IO write with nowait and dsync iocb
When doing a direct IO write using a iocb with nowait and dsync set, we
end up not syncing the file once the write completes.

This is because we tell iomap to not call generic_write_sync(), which
would result in calling btrfs_sync_file(), in order to avoid a deadlock
since iomap can call it while we are holding the inode's lock and
btrfs_sync_file() needs to acquire the inode's lock. The deadlock happens
only if the write happens synchronously, when iomap_dio_rw() calls
iomap_dio_complete() before it returns. Instead we do the sync ourselves
at btrfs_do_write_iter().

For a nowait write however we can end up not doing the sync ourselves at
at btrfs_do_write_iter() because the write could have been queued, and
therefore we get -EIOCBQUEUED returned from iomap in such case. That makes
us skip the sync call at btrfs_do_write_iter(), as we don't do it for
any error returned from btrfs_direct_write(). We can't simply do the call
even if -EIOCBQUEUED is returned, since that would block the task waiting
for IO, both for the data since there are bios still in progress as well
as potentially blocking when joining a log transaction and when syncing
the log (writing log trees, super blocks, etc).

So let iomap do the sync call itself and in order to avoid deadlocks for
the case of synchronous writes (without nowait), use __iomap_dio_rw() and
have ourselves call iomap_dio_complete() after unlocking the inode.

A test case will later be sent for fstests, after this is fixed in Linus'
tree.

Fixes: 51bd9563b6 ("btrfs: fix deadlock due to page faults during direct IO reads and writes")
Reported-by: Марк Коренберг <socketpair@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CAEmTpZGRKbzc16fWPvxbr6AfFsQoLmz-Lcg-7OgJOZDboJ+SGQ@mail.gmail.com/
CC: stable@vger.kernel.org # 6.0+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-10-31 16:52:56 +01:00
Christophe JAILLET
063b1f21cc btrfs: fix a memory allocation failure test in btrfs_submit_direct
After allocation 'dip' is tested instead of 'dip->csums'.  Fix it.

Fixes: 642c5d34da ("btrfs: allocate the btrfs_dio_private as part of the iomap dio bio")
CC: stable@vger.kernel.org # 5.19+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-10-31 16:50:15 +01:00
Christian Brauner
cac2f8b8d8
fs: rename current get acl method
The current way of setting and getting posix acls through the generic
xattr interface is error prone and type unsafe. The vfs needs to
interpret and fixup posix acls before storing or reporting it to
userspace. Various hacks exist to make this work. The code is hard to
understand and difficult to maintain in it's current form. Instead of
making this work by hacking posix acls through xattr handlers we are
building a dedicated posix acl api around the get and set inode
operations. This removes a lot of hackiness and makes the codepaths
easier to maintain. A lot of background can be found in [1].

The current inode operation for getting posix acls takes an inode
argument but various filesystems (e.g., 9p, cifs, overlayfs) need access
to the dentry. In contrast to the ->set_acl() inode operation we cannot
simply extend ->get_acl() to take a dentry argument. The ->get_acl()
inode operation is called from:

acl_permission_check()
-> check_acl()
   -> get_acl()

which is part of generic_permission() which in turn is part of
inode_permission(). Both generic_permission() and inode_permission() are
called in the ->permission() handler of various filesystems (e.g.,
overlayfs). So simply passing a dentry argument to ->get_acl() would
amount to also having to pass a dentry argument to ->permission(). We
should avoid this unnecessary change.

So instead of extending the existing inode operation rename it from
->get_acl() to ->get_inode_acl() and add a ->get_acl() method later that
passes a dentry argument and which filesystems that need access to the
dentry can implement instead of ->get_inode_acl(). Filesystems like cifs
which allow setting and getting posix acls but not using them for
permission checking during lookup can simply not implement
->get_inode_acl().

This is intended to be a non-functional change.

Link: https://lore.kernel.org/all/20220801145520.1532837-1-brauner@kernel.org [1]
Suggested-by/Inspired-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2022-10-20 10:13:27 +02:00
Christian Brauner
138060ba92
fs: pass dentry to set acl method
The current way of setting and getting posix acls through the generic
xattr interface is error prone and type unsafe. The vfs needs to
interpret and fixup posix acls before storing or reporting it to
userspace. Various hacks exist to make this work. The code is hard to
understand and difficult to maintain in it's current form. Instead of
making this work by hacking posix acls through xattr handlers we are
building a dedicated posix acl api around the get and set inode
operations. This removes a lot of hackiness and makes the codepaths
easier to maintain. A lot of background can be found in [1].

Since some filesystem rely on the dentry being available to them when
setting posix acls (e.g., 9p and cifs) they cannot rely on set acl inode
operation. But since ->set_acl() is required in order to use the generic
posix acl xattr handlers filesystems that do not implement this inode
operation cannot use the handler and need to implement their own
dedicated posix acl handlers.

Update the ->set_acl() inode method to take a dentry argument. This
allows all filesystems to rely on ->set_acl().

As far as I can tell all codepaths can be switched to rely on the dentry
instead of just the inode. Note that the original motivation for passing
the dentry separate from the inode instead of just the dentry in the
xattr handlers was because of security modules that call
security_d_instantiate(). This hook is called during
d_instantiate_new(), d_add(), __d_instantiate_anon(), and
d_splice_alias() to initialize the inode's security context and possibly
to set security.* xattrs. Since this only affects security.* xattrs this
is completely irrelevant for posix acls.

Link: https://lore.kernel.org/all/20220801145520.1532837-1-brauner@kernel.org [1]
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2022-10-19 12:55:42 +02:00
Linus Torvalds
f721d24e5d tmpfile API change
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCY0DP2AAKCRBZ7Krx/gZQ
 6/+qAQCEGQWpcC5MB17zylaX7gqzhgAsDrwtpevlno3aIv/1pQD/YWr/E8tf7WTW
 ERXRXMRx1cAzBJhUhVgIY+3ANfU2Rg4=
 =cko4
 -----END PGP SIGNATURE-----

Merge tag 'pull-tmpfile' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull vfs tmpfile updates from Al Viro:
 "Miklos' ->tmpfile() signature change; pass an unopened struct file to
  it, let it open the damn thing. Allows to add tmpfile support to FUSE"

* tag 'pull-tmpfile' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fuse: implement ->tmpfile()
  vfs: open inside ->tmpfile()
  vfs: move open right after ->tmpfile()
  vfs: make vfs_tmpfile() static
  ovl: use vfs_tmpfile_open() helper
  cachefiles: use vfs_tmpfile_open() helper
  cachefiles: only pass inode to *mark_inode_inuse() helpers
  cachefiles: tmpfile error handling cleanup
  hugetlbfs: cleanup mknod and tmpfile
  vfs: add vfs_tmpfile_open() helper
2022-10-10 19:45:17 -07:00
Filipe Manana
b54bb86556 btrfs: avoid pointless extent map tree search when flushing delalloc
When flushing delalloc, in COW mode at cow_file_range(), before entering
the loop that allocates extents and creates ordered extents, we do a call
to btrfs_drop_extent_map_range() for the whole range. This is pointless
because in the loop we call create_io_em(), which will also call
btrfs_drop_extent_map_range() before inserting the new extent map.

So remove that call at cow_file_range() not only because it is not needed,
but also because it will make the btrfs_drop_extent_map_range() calls made
from create_io_em() waste time searching the extent map tree, and that
tree can be large for files with many extents. It also makes us waste time
at btrfs_drop_extent_map_range() allocating and freeing the split extent
maps for nothing.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-29 17:08:31 +02:00
Filipe Manana
a1ba4c080b btrfs: add helper to replace extent map range with a new extent map
We have several places that need to drop all the extent maps in a given
file range and then add a new extent map for that range. Currently they
call btrfs_drop_extent_map_range() to delete all extent maps in the range
and then keep trying to add the new extent map in a loop that keeps
retrying while the insertion of the new extent map fails with -EEXIST.

So instead of repeating this logic, add a helper to extent_map.c that
does these steps and name it btrfs_replace_extent_map_range(). Also add
a comment about why the retry loop is necessary.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-29 17:08:30 +02:00
Filipe Manana
9c9d1b4f74 btrfs: move open coded extent map tree deletion out of inode eviction
Move the loop that removes all the extent maps from the inode's extent
map tree during inode eviction out of inode.c and into extent_map.c, to
btrfs_drop_extent_map_range(). Anything manipulating extent maps or the
extent map tree should be in extent_map.c.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-29 17:08:30 +02:00
Filipe Manana
99ba0c8150 btrfs: use cond_resched_rwlock_write() during inode eviction
At evict_inode_truncate_pages(), instead of manually checking if
rescheduling is needed, then unlock the extent map tree, reschedule and
then write lock again the tree, use the helper cond_resched_rwlock_write()
which does all that.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-29 17:08:30 +02:00
Filipe Manana
4c0c8cfc84 btrfs: move btrfs_drop_extent_cache() to extent_map.c
The function btrfs_drop_extent_cache() doesn't really belong at file.c
because what it does is drop a range of extent maps for a file range.
It directly allocates and manipulates extent maps, by dropping,
splitting and replacing them in an extent map tree, so it should be
located at extent_map.c, where all manipulations of an extent map tree
and its extent maps are supposed to be done.

So move it out of file.c and into extent_map.c. Additionally do the
following changes:

1) Rename it into btrfs_drop_extent_map_range(), as this makes it more
   clear about what it does. The term "cache" is a bit confusing as it's
   not widely used, "extent maps" or "extent mapping" is much more common;

2) Change its 'skip_pinned' argument from int to bool;

3) Turn several of its local variables from int to bool, since they are
   used as booleans;

4) Move the declaration of some variables out of the function's main
   scope and into the scopes where they are used;

5) Remove pointless assignment of false to 'modified' early in the while
   loop, as later that variable is set and it's not used before that
   second assignment;

6) Remove checks for NULL before calling free_extent_map().

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-29 17:08:30 +02:00
Josef Bacik
80f9d24130 btrfs: make btrfs_check_nocow_lock nowait compatible
Now all the helpers that btrfs_check_nocow_lock uses handle nowait, add
a nowait flag to btrfs_check_nocow_lock so it can be used by the write
path.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Stefan Roesch <shr@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-29 17:08:28 +02:00
Josef Bacik
1daedb1d6b btrfs: add the ability to use NO_FLUSH for data reservations
In order to accommodate NOWAIT IOCB's we need to be able to do NO_FLUSH
data reservations, so plumb this through the delalloc reservation
system.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Stefan Roesch <shr@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-29 17:08:28 +02:00
Josef Bacik
26ce911446 btrfs: make can_nocow_extent nowait compatible
If we have NOWAIT specified on our IOCB and we're writing into a
PREALLOC or NOCOW extent then we need to be able to tell
can_nocow_extent that we don't want to wait on any locks or metadata IO.
Fix can_nocow_extent to allow for NOWAIT.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Stefan Roesch <shr@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-29 17:08:26 +02:00
Josef Bacik
9b9b885465 btrfs: use a runtime flag to indicate an inode is a free space inode
We always check the root of an inode as well as it's inode number to
determine if it's a free space inode.  This is problematic as the helper
is in a header file where it doesn't have the fs_info definition.  To
avoid this and make the check a little cleaner simply add a flag to the
runtime_flags to indicate that the inode is a free space inode, set that
when we create the inode, and then change the helper to check for this
flag.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:28:07 +02:00
Josef Bacik
f119553fd3 btrfs: move btrfs_csum_ptr to inode.c
This helper is only used in inode.c, move it locally to that file
instead of defining it in ctree.h.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:28:06 +02:00
Josef Bacik
efb0645bd9 btrfs: don't init io tree with private data for non-inodes
We only use this for normal inodes, so don't set it if we're not a
normal inode.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:28:05 +02:00
Josef Bacik
bd015294af btrfs: replace delete argument with EXTENT_CLEAR_ALL_BITS
Instead of taking up a whole argument to indicate we're clearing
everything in a range, simply add another EXTENT bit to control this,
and then update all the callers to drop this argument from the
clear_extent_bit variants.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:28:05 +02:00
Josef Bacik
4374d03d21 btrfs: remove extent_io_tree::track_uptodate
Since commit 78361f64ff42 ("btrfs: remove unnecessary EXTENT_UPTODATE
state in buffered I/O path") we no longer check ->track_uptodate, remove
it.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:28:05 +02:00
Josef Bacik
570eb97bac btrfs: unify the lock/unlock extent variants
We have two variants of lock/unlock extent, one set that takes a cached
state, another that does not.  This is slightly annoying, and generally
speaking there are only a few places where we don't have a cached state.
Simplify this by making lock_extent/unlock_extent the only variant and
make it take a cached state, then convert all the callers appropriately.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:28:05 +02:00
Josef Bacik
291bbb1e7e btrfs: drop extent_changeset from set_extent_bit
The only places that set extent_changeset is set_record_extent_bits,
everywhere else sets it to NULL.  Drop this argument from
set_extent_bit.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:28:05 +02:00
Josef Bacik
994bcd1eae btrfs: remove failed_start argument from set_extent_bit
This is only used for internal locking related helpers, everybody else
just passes in NULL.  I've changed set_extent_bit to __set_extent_bit
and made it static, removed failed_start from set_extent_bit and have it
call __set_extent_bit with a NULL failed_start, and I've moved some code
down below the now static __set_extent_bit.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:28:05 +02:00
Josef Bacik
dbbf49928f btrfs: remove the wake argument from clear_extent_bits
This is only used in the case that we are clearing EXTENT_LOCKED, so
infer this value from the bits passed in instead of taking it as an
argument.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:28:04 +02:00
Josef Bacik
c07d1004c5 btrfs: drop exclusive_bits from set_extent_bit
This is only ever set if we have EXTENT_LOCKED set, so simply push this
into the function itself and remove the function argument.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:28:04 +02:00
Josef Bacik
87c11705cc btrfs: convert the io_failure_tree to a plain rb_tree
We still have this oddity of stashing the io_failure_record in the
extent state for the io_failure_tree, which is leftover from when we
used to stuff private pointers in extent_io_trees.

However this doesn't make a lot of sense for the io failure records, we
can simply use a normal rb_tree for this.  This will allow us to further
simplify the extent_io_tree code by removing the io_failure_rec pointer
from the extent state.

Convert the io_failure_tree to an rb tree + spinlock in the inode, and
then use our rb tree simple helpers to insert and find failed records.
This greatly cleans up this code and makes it easier to separate out the
extent_io_tree code.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:28:02 +02:00
Josef Bacik
0d0a762c41 btrfs: rename clean_io_failure and remove extraneous args
This is exported, so rename it to btrfs_clean_io_failure.  Additionally
we are passing in the io tree's and such from the inode, so instead of
doing all that simply pass in the inode itself and get all the
components we need directly inside of btrfs_clean_io_failure.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:28:02 +02:00
Filipe Manana
ac3c0d36a2 btrfs: make fiemap more efficient and accurate reporting extent sharedness
The current fiemap implementation does not scale very well with the number
of extents a file has. This is both because the main algorithm to find out
the extents has a high algorithmic complexity and because for each extent
we have to check if it's shared. This second part, checking if an extent
is shared, is significantly improved by the two previous patches in this
patchset, while the first part is improved by this specific patch. Every
now and then we get reports from users mentioning fiemap is too slow or
even unusable for files with a very large number of extents, such as the
two recent reports referred to by the Link tags at the bottom of this
change log.

To understand why the part of finding which extents a file has is very
inefficient, consider the example of doing a full ranged fiemap against
a file that has over 100K extents (normal for example for a file with
more than 10G of data and using compression, which limits the extent size
to 128K). When we enter fiemap at extent_fiemap(), the following happens:

1) Before entering the main loop, we call get_extent_skip_holes() to get
   the first extent map. This leads us to btrfs_get_extent_fiemap(), which
   in turn calls btrfs_get_extent(), to find the first extent map that
   covers the file range [0, LLONG_MAX).

   btrfs_get_extent() will first search the inode's extent map tree, to
   see if we have an extent map there that covers the range. If it does
   not find one, then it will search the inode's subvolume b+tree for a
   fitting file extent item. After finding the file extent item, it will
   allocate an extent map, fill it in with information extracted from the
   file extent item, and add it to the inode's extent map tree (which
   requires a search for insertion in the tree).

2) Then we enter the main loop at extent_fiemap(), emit the details of
   the extent, and call again get_extent_skip_holes(), with a start
   offset matching the end of the extent map we previously processed.

   We end up at btrfs_get_extent() again, will search the extent map tree
   and then search the subvolume b+tree for a file extent item if we could
   not find an extent map in the extent tree. We allocate an extent map,
   fill it in with the details in the file extent item, and then insert
   it into the extent map tree (yet another search in this tree).

3) The second step is repeated over and over, until we have processed the
   whole file range. Each iteration ends at btrfs_get_extent(), which
   does a red black tree search on the extent map tree, then searches the
   subvolume b+tree, allocates an extent map and then does another search
   in the extent map tree in order to insert the extent map.

   In the best scenario we have all the extent maps already in the extent
   tree, and so for each extent we do a single search on a red black tree,
   so we have a complexity of O(n log n).

   In the worst scenario we don't have any extent map already loaded in
   the extent map tree, or have very few already there. In this case the
   complexity is much higher since we do:

   - A red black tree search on the extent map tree, which has O(log n)
     complexity, initially very fast since the tree is empty or very
     small, but as we end up allocating extent maps and adding them to
     the tree when we don't find them there, each subsequent search on
     the tree gets slower, since it's getting bigger and bigger after
     each iteration.

   - A search on the subvolume b+tree, also O(log n) complexity, but it
     has items for all inodes in the subvolume, not just items for our
     inode. Plus on a filesystem with concurrent operations on other
     inodes, we can block doing the search due to lock contention on
     b+tree nodes/leaves.

   - Allocate an extent map - this can block, and can also fail if we
     are under serious memory pressure.

   - Do another search on the extent maps red black tree, with the goal
     of inserting the extent map we just allocated. Again, after every
     iteration this tree is getting bigger by 1 element, so after many
     iterations the searches are slower and slower.

   - We will not need the allocated extent map anymore, so it's pointless
     to add it to the extent map tree. It's just wasting time and memory.

   In short we end up searching the extent map tree multiple times, on a
   tree that is growing bigger and bigger after each iteration. And
   besides that we visit the same leaf of the subvolume b+tree many times,
   since a leaf with the default size of 16K can easily have more than 200
   file extent items.

This is very inefficient overall. This patch changes the algorithm to
instead iterate over the subvolume b+tree, visiting each leaf only once,
and only searching in the extent map tree for file ranges that have holes
or prealloc extents, in order to figure out if we have delalloc there.
It will never allocate an extent map and add it to the extent map tree.
This is very similar to what was previously done for the lseek's hole and
data seeking features.

Also, the current implementation relying on extent maps for figuring out
which extents we have is not correct. This is because extent maps can be
merged even if they represent different extents - we do this to minimize
memory utilization and keep extent map trees smaller. For example if we
have two extents that are contiguous on disk, once we load the two extent
maps, they get merged into a single one - however if only one of the
extents is shared, we end up reporting both as shared or both as not
shared, which is incorrect.

This reproducer triggers that bug:

    $ cat fiemap-bug.sh
    #!/bin/bash

    DEV=/dev/sdj
    MNT=/mnt/sdj

    mkfs.btrfs -f $DEV
    mount $DEV $MNT

    # Create a file with two 256K extents.
    # Since there is no other write activity, they will be contiguous,
    # and their extent maps merged, despite having two distinct extents.
    xfs_io -f -c "pwrite -S 0xab 0 256K" \
              -c "fsync" \
              -c "pwrite -S 0xcd 256K 256K" \
              -c "fsync" \
              $MNT/foo

    # Now clone only the second extent into another file.
    xfs_io -f -c "reflink $MNT/foo 256K 0 256K" $MNT/bar

    # Filefrag will report a single 512K extent, and say it's not shared.
    echo
    filefrag -v $MNT/foo

    umount $MNT

Running the reproducer:

    $ ./fiemap-bug.sh
    wrote 262144/262144 bytes at offset 0
    256 KiB, 64 ops; 0.0038 sec (65.479 MiB/sec and 16762.7030 ops/sec)
    wrote 262144/262144 bytes at offset 262144
    256 KiB, 64 ops; 0.0040 sec (61.125 MiB/sec and 15647.9218 ops/sec)
    linked 262144/262144 bytes at offset 0
    256 KiB, 1 ops; 0.0002 sec (1.034 GiB/sec and 4237.2881 ops/sec)

    Filesystem type is: 9123683e
    File size of /mnt/sdj/foo is 524288 (128 blocks of 4096 bytes)
     ext:     logical_offset:        physical_offset: length:   expected: flags:
       0:        0..     127:       3328..      3455:    128:             last,eof
    /mnt/sdj/foo: 1 extent found

We end up reporting that we have a single 512K that is not shared, however
we have two 256K extents, and the second one is shared. Changing the
reproducer to clone instead the first extent into file 'bar', makes us
report a single 512K extent that is shared, which is algo incorrect since
we have two 256K extents and only the first one is shared.

This patch is part of a larger patchset that is comprised of the following
patches:

    btrfs: allow hole and data seeking to be interruptible
    btrfs: make hole and data seeking a lot more efficient
    btrfs: remove check for impossible block start for an extent map at fiemap
    btrfs: remove zero length check when entering fiemap
    btrfs: properly flush delalloc when entering fiemap
    btrfs: allow fiemap to be interruptible
    btrfs: rename btrfs_check_shared() to a more descriptive name
    btrfs: speedup checking for extent sharedness during fiemap
    btrfs: skip unnecessary extent buffer sharedness checks during fiemap
    btrfs: make fiemap more efficient and accurate reporting extent sharedness

The patchset was tested on a machine running a non-debug kernel (Debian's
default config) and compared the tests below on a branch without the
patchset versus the same branch with the whole patchset applied.

The following test for a large compressed file without holes:

    $ cat fiemap-perf-test.sh
    #!/bin/bash

    DEV=/dev/sdi
    MNT=/mnt/sdi

    mkfs.btrfs -f $DEV
    mount -o compress=lzo $DEV $MNT

    # 40G gives 327680 128K file extents (due to compression).
    xfs_io -f -c "pwrite -S 0xab -b 1M 0 20G" $MNT/foobar

    umount $MNT
    mount -o compress=lzo $DEV $MNT

    start=$(date +%s%N)
    filefrag $MNT/foobar
    end=$(date +%s%N)
    dur=$(( (end - start) / 1000000 ))
    echo "fiemap took $dur milliseconds (metadata not cached)"

    start=$(date +%s%N)
    filefrag $MNT/foobar
    end=$(date +%s%N)
    dur=$(( (end - start) / 1000000 ))
    echo "fiemap took $dur milliseconds (metadata cached)"

    umount $MNT

Before patchset:

    $ ./fiemap-perf-test.sh
    (...)
    /mnt/sdi/foobar: 327680 extents found
    fiemap took 3597 milliseconds (metadata not cached)
    /mnt/sdi/foobar: 327680 extents found
    fiemap took 2107 milliseconds (metadata cached)

After patchset:

    $ ./fiemap-perf-test.sh
    (...)
    /mnt/sdi/foobar: 327680 extents found
    fiemap took 1214 milliseconds (metadata not cached)
    /mnt/sdi/foobar: 327680 extents found
    fiemap took 684 milliseconds (metadata cached)

That's a speedup of about 3x for both cases (no metadata cached and all
metadata cached).

The test provided by Pavel (first Link tag at the bottom), which uses
files with a large number of holes, was also used to measure the gains,
and it consists on a small C program and a shell script to invoke it.
The C program is the following:

    $ cat pavels-test.c
    #include <stdio.h>
    #include <unistd.h>
    #include <stdlib.h>
    #include <fcntl.h>

    #include <sys/stat.h>
    #include <sys/time.h>
    #include <sys/ioctl.h>

    #include <linux/fs.h>
    #include <linux/fiemap.h>

    #define FILE_INTERVAL (1<<13) /* 8Kb */

    long long interval(struct timeval t1, struct timeval t2)
    {
        long long val = 0;
        val += (t2.tv_usec - t1.tv_usec);
        val += (t2.tv_sec - t1.tv_sec) * 1000 * 1000;
        return val;
    }

    int main(int argc, char **argv)
    {
        struct fiemap fiemap = {};
        struct timeval t1, t2;
        char data = 'a';
        struct stat st;
        int fd, off, file_size = FILE_INTERVAL;

        if (argc != 3 && argc != 2) {
                printf("usage: %s <path> [size]\n", argv[0]);
                return 1;
        }

        if (argc == 3)
                file_size = atoi(argv[2]);
        if (file_size < FILE_INTERVAL)
                file_size = FILE_INTERVAL;
        file_size -= file_size % FILE_INTERVAL;

        fd = open(argv[1], O_RDWR | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        for (off = 0; off < file_size; off += FILE_INTERVAL) {
            if (pwrite(fd, &data, 1, off) != 1) {
                perror("pwrite");
                close(fd);
                return 1;
            }
        }

        if (ftruncate(fd, file_size)) {
            perror("ftruncate");
            close(fd);
            return 1;
        }

        if (fstat(fd, &st) < 0) {
            perror("fstat");
            close(fd);
            return 1;
        }

        printf("size: %ld\n", st.st_size);
        printf("actual size: %ld\n", st.st_blocks * 512);

        fiemap.fm_length = FIEMAP_MAX_OFFSET;
        gettimeofday(&t1, NULL);
        if (ioctl(fd, FS_IOC_FIEMAP, &fiemap) < 0) {
            perror("fiemap");
            close(fd);
            return 1;
        }
        gettimeofday(&t2, NULL);

        printf("fiemap: fm_mapped_extents = %d\n",
               fiemap.fm_mapped_extents);
        printf("time = %lld us\n", interval(t1, t2));

        close(fd);
        return 0;
    }

    $ gcc -o pavels_test pavels_test.c

And the wrapper shell script:

    $ cat fiemap-pavels-test.sh

    #!/bin/bash

    DEV=/dev/sdi
    MNT=/mnt/sdi

    mkfs.btrfs -f -O no-holes $DEV
    mount $DEV $MNT

    echo
    echo "*********** 256M ***********"
    echo

    ./pavels-test $MNT/testfile $((1 << 28))
    echo
    ./pavels-test $MNT/testfile $((1 << 28))

    echo
    echo "*********** 512M ***********"
    echo

    ./pavels-test $MNT/testfile $((1 << 29))
    echo
    ./pavels-test $MNT/testfile $((1 << 29))

    echo
    echo "*********** 1G ***********"
    echo

    ./pavels-test $MNT/testfile $((1 << 30))
    echo
    ./pavels-test $MNT/testfile $((1 << 30))

    umount $MNT

Running his reproducer before applying the patchset:

    *********** 256M ***********

    size: 268435456
    actual size: 134217728
    fiemap: fm_mapped_extents = 32768
    time = 4003133 us

    size: 268435456
    actual size: 134217728
    fiemap: fm_mapped_extents = 32768
    time = 4895330 us

    *********** 512M ***********

    size: 536870912
    actual size: 268435456
    fiemap: fm_mapped_extents = 65536
    time = 30123675 us

    size: 536870912
    actual size: 268435456
    fiemap: fm_mapped_extents = 65536
    time = 33450934 us

    *********** 1G ***********

    size: 1073741824
    actual size: 536870912
    fiemap: fm_mapped_extents = 131072
    time = 224924074 us

    size: 1073741824
    actual size: 536870912
    fiemap: fm_mapped_extents = 131072
    time = 217239242 us

Running it after applying the patchset:

    *********** 256M ***********

    size: 268435456
    actual size: 134217728
    fiemap: fm_mapped_extents = 32768
    time = 29475 us

    size: 268435456
    actual size: 134217728
    fiemap: fm_mapped_extents = 32768
    time = 29307 us

    *********** 512M ***********

    size: 536870912
    actual size: 268435456
    fiemap: fm_mapped_extents = 65536
    time = 58996 us

    size: 536870912
    actual size: 268435456
    fiemap: fm_mapped_extents = 65536
    time = 59115 us

    *********** 1G ***********

    size: 1073741824
    actual size: 536870912
    fiemap: fm_mapped_extents = 116251
    time = 124141 us

    size: 1073741824
    actual size: 536870912
    fiemap: fm_mapped_extents = 131072
    time = 119387 us

The speedup is massive, both on the first fiemap call and on the second
one as well, as his test creates files with many holes and small extents
(every extent follows a hole and precedes another hole).

For the 256M file we go from 4 seconds down to 29 milliseconds in the
first run, and then from 4.9 seconds down to 29 milliseconds again in the
second run, a speedup of 138x and 169x, respectively.

For the 512M file we go from 30.1 seconds down to 59 milliseconds in the
first run, and then from 33.5 seconds down to 59 milliseconds again in the
second run, a speedup of 510x and 568x, respectively.

For the 1G file, we go from 225 seconds down to 124 milliseconds in the
first run, and then from 217 seconds down to 119 milliseconds in the
second run, a speedup of 1815x and 1824x, respectively.

Reported-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Link: https://lore.kernel.org/linux-btrfs/21dd32c6-f1f9-f44a-466a-e18fdc6788a7@virtuozzo.com/
Reported-by: Dominique MARTINET <dominique.martinet@atmark-techno.com>
Link: https://lore.kernel.org/linux-btrfs/Ysace25wh5BbLd5f@atmark-techno.com/
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:28:01 +02:00
Filipe Manana
33a86cfa17 btrfs: properly flush delalloc when entering fiemap
If the flag FIEMAP_FLAG_SYNC is passed to fiemap, it means all delalloc
should be flushed and writeback complete. We call the generic helper
fiemap_prep() which does a filemap_write_and_wait() in case that flag is
given, however that is not enough if we have compression. Because a
single filemap_fdatawrite_range() only starts compression (in an async
thread) and therefore returns before the compression is done and writeback
is started.

So make btrfs_fiemap(), actually wait for all writeback to start and
complete if FIEMAP_FLAG_SYNC is set. We start and wait for writeback
on the whole possible file range, from 0 to LLONG_MAX, because that is
what the generic code at fiemap_prep() does.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:28:00 +02:00
Christoph Hellwig
917f32a235 btrfs: give struct btrfs_bio a real end_io handler
Currently btrfs_bio end I/O handling is a bit of a mess.  The bi_end_io
handler and bi_private pointer of the embedded struct bio are both used
to handle the completion of the high-level btrfs_bio and for the I/O
completion for the low-level device that the embedded bio ends up being
sent to.

To support this bi_end_io and bi_private are saved into the
btrfs_io_context structure and then restored after the bio sent to the
underlying device has completed the actual I/O.

Untangle this by adding an end I/O handler and private data to struct
btrfs_bio for the high-level btrfs_bio based completions, and leave the
actual bio bi_end_io handler and bi_private pointer entirely to the
low-level device I/O.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:27:59 +02:00
Christoph Hellwig
6b42f5e343 btrfs: pass the operation to btrfs_bio_alloc
Pass the operation to btrfs_bio_alloc, matching what bio_alloc_bioset
set does.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:27:58 +02:00
Ethan Lien
52b029f427 btrfs: remove unnecessary EXTENT_UPTODATE state in buffered I/O path
After we copied data to page cache in buffered I/O, we
1. Insert a EXTENT_UPTODATE state into inode's io_tree, by
   endio_readpage_release_extent(), set_extent_delalloc() or
   set_extent_defrag().
2. Set page uptodate before we unlock the page.

But the only place we check io_tree's EXTENT_UPTODATE state is in
btrfs_do_readpage(). We know we enter btrfs_do_readpage() only when we
have a non-uptodate page, so it is unnecessary to set EXTENT_UPTODATE.

For example, when performing a buffered random read:

	fio --rw=randread --ioengine=libaio --direct=0 --numjobs=4 \
		--filesize=32G --size=4G --bs=4k --name=job \
		--filename=/mnt/file --name=job

Then check how many extent_state in io_tree:

	cat /proc/slabinfo | grep btrfs_extent_state | awk '{print $2}'

w/o this patch, we got 640567 btrfs_extent_state.
w/  this patch, we got    204 btrfs_extent_state.

Maintaining such a big tree brings overhead since every I/O needs to insert
EXTENT_LOCKED, insert EXTENT_UPTODATE, then remove EXTENT_LOCKED. And in
every insert or remove, we need to lock io_tree, do tree search, alloc or
dealloc extent states. By removing unnecessary EXTENT_UPTODATE, we keep
io_tree in a minimal size and reduce overhead when performing buffered I/O.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Ethan Lien <ethanlien@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:27:57 +02:00
Omar Sandoval
d1f68ba069 btrfs: rename btrfs_insert_file_extent() to btrfs_insert_hole_extent()
btrfs_insert_file_extent() is only ever used to insert holes, so rename
it and remove the redundant parameters.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:27:54 +02:00
Ioannis Angelakopoulos
5f4403e10f btrfs: add lockdep annotations for the ordered extents wait event
This wait event is very similar to the pending ordered wait event in the
sense that it occurs in a different context than the condition signaling
for the event. The signaling occurs in btrfs_remove_ordered_extent()
while the wait event is implemented in btrfs_start_ordered_extent() in
fs/btrfs/ordered-data.c

However, in this case a thread must not acquire the lockdep map for the
ordered extents wait event when the ordered extent is related to a free
space inode. That is because lockdep creates dependencies between locks
acquired both in execution paths related to normal inodes and paths
related to free space inodes, thus leading to false positives.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:27:53 +02:00
Miklos Szeredi
863f144f12 vfs: open inside ->tmpfile()
This is in preparation for adding tmpfile support to fuse, which requires
that the tmpfile creation and opening are done as a single operation.

Replace the 'struct dentry *' argument of i_op->tmpfile with
'struct file *'.

Call finish_open_simple() as the last thing in ->tmpfile() instances (may
be omitted in the error case).

Change d_tmpfile() argument to 'struct file *' as well to make callers more
readable.

Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-09-24 07:00:00 +02:00
Linus Torvalds
9b45094954 for-6.0-rc4-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmMaPukACgkQxWXV+ddt
 WDsWKw/+IcpMsb08sjudn4dtFQ3HSA1E+dOYDzXwUJTS7ZpZhLRniLe1XQwHxe4D
 7DUQA+e1RKGq4+TiznoLhaG/YCCcrLPZL/1aWhwO0M5Wj6BCIxSUa00BJNpxyBMw
 kWb9vQltc5w5zJXHeIr7m2ByzT+YIl0v1lf2GQrJVieHhGiKslfkJHLoJt49oJ0L
 9ka183VR/OCi/3uxUw6NMAjfv+0OGEsFZX/CF8Vo64IKg0I0Q248H4enZt43aDHA
 dQDapAyAr4f6RLDs6ULS2GSzKfZIKMLHlvSeg1BSPyUt/NZFVlC0VwVX0NmwP62a
 5NECYdimlQOGSlaahNEQpLIiyNYboi3Mq7m63BofWduDQanpnM1FByln9JVEizlm
 VuUs3+O0CMp81HecSk3VbSe3ukO2fqAdQjM5cdpRx30TYu7WRiYNE3aHchgLmXLP
 0zw9JV6ePg04Mstx+/3lo8D/X/7fMAT3NrqYmuImoekFWbdJfsiUtgdXNOglT9dt
 6lb1/0jBEbdiXnQ/jT1OreGwSdGZqkEKF4OE26kPRxURyTDESzglNVyhXmshIANC
 qnNuUFGea5d7LbyozYyfdcsQS7rEqLVKmUWrOb/3O/K1947/DegYodnhRwjCUSS7
 iUaetkYUWxHa7U9303KneCUAyLEf1S8NXRPIObL6YIw7D09wato=
 =WD7B
 -----END PGP SIGNATURE-----

Merge tag 'for-6.0-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "A few more fixes to zoned mode and one regression fix for chunk limit:

    - Zoned mode fixes:
        - fix how wait/wake up is done when finishing zone
        - fix zone append limit in emulated mode
        - fix mount on devices with conventional zones

   - fix regression, user settable data chunk limit got accidentally
     lowered and causes allocation problems on some profiles (raid0,
     raid1)"

* tag 'for-6.0-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix the max chunk size and stripe length calculation
  btrfs: zoned: fix mounting with conventional zones
  btrfs: zoned: set pseudo max append zone limit in zone emulation mode
  btrfs: zoned: fix API misuse of zone finish waiting
2022-09-09 07:54:19 -04:00
Naohiro Aota
d5b81ced74 btrfs: zoned: fix API misuse of zone finish waiting
The commit 2ce543f478 ("btrfs: zoned: wait until zone is finished when
allocation didn't progress") implemented a zone finish waiting mechanism
to the write path of zoned mode. However, using
wait_var_event()/wake_up_all() on fs_info->zone_finish_wait is wrong and
wait_var_event() just hangs because no one ever wakes it up once it goes
into sleep.

Instead, we can simply use wait_on_bit_io() and clear_and_wake_up_bit()
on fs_info->flags with a proper barrier installed.

Fixes: 2ce543f478 ("btrfs: zoned: wait until zone is finished when allocation didn't progress")
CC: stable@vger.kernel.org # 5.16+
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-05 15:32:21 +02:00
Linus Torvalds
8379c0b31f for-6.0-rc3-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmMLY9oACgkQxWXV+ddt
 WDue/w/8C3ZF8nLAI/sMrUpef2vSD62bvkKRRS45wzR2uod6yc0Fle9upzBssJQZ
 qO3mQ53+QV+imCq7dY5mmtmwCUJNmbV5gbiMoF1OoV9TYtpZb/NIDklSX8se2eJX
 drdAWQr2pYwU2M4duA4IEW08TvQ2TFh0JiqMi0aYM5apyL80uv3WniOu+xpRipA3
 CMFAnDqayIgQ5OIsedqNy2MBLBopodUL5PZv/H7/g6KSKIuAZP9zgg1eKPfaz2t3
 HO183ubmMbVtxgxeu+EnvCkg/iQ5hQiuGmyi0FLYMs/A6/NglwBnIJU5jCMQhcp6
 HO5+FSUn6lHQetVzt2uHb9Lo+gX4FtCaHqVv1bXT62lnmDsZO1D7RVSg1Fra+CY+
 jJmi8vvIbfbYlSZPZlJANoWe8ODOMVPk+pM4SFHlxOWGAY6HViX2RfHnIjNj5x9O
 iDSTGvH6++nBF1Wu2/Xja/VKZ1avxRyTu2srW8JOF62j/tTU/EoPJcO9rxXOBBmC
 Hi4UmJ690p3h5xZeeiyE8CmaSlPtfdCcnc/97FnusEjBao9O7THX0PCDVJX6VBkm
 hVk01Z6+az1UNcD18KecvCpKYF/At4WpjaUGgf7q+LBfJXuXA6jfzOVDJMKV3TFd
 n1yMFg+duGj90l8gT0aa/VQiBlUlnzQKz6ceqyKkPccwveNis6I=
 =p8YV
 -----END PGP SIGNATURE-----

Merge tag 'for-6.0-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "Fixes:

   - check that subvolume is writable when changing xattrs from security
     namespace

   - fix memory leak in device lookup helper

   - update generation of hole file extent item when merging holes

   - fix space cache corruption and potential double allocations; this
     is a rare bug but can be serious once it happens, stable backports
     and analysis tool will be provided

   - fix error handling when deleting root references

   - fix crash due to assert when attempting to cancel suspended device
     replace, add message what to do if mount fails due to missing
     replace item

  Regressions:

   - don't merge pages into bio if their page offset is not contiguous

   - don't allow large NOWAIT direct reads, this could lead to short
     reads eg. in io_uring"

* tag 'for-6.0-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: add info when mount fails due to stale replace target
  btrfs: replace: drop assert for suspended replace
  btrfs: fix silent failure when deleting root reference
  btrfs: fix space cache corruption and potential double allocations
  btrfs: don't allow large NOWAIT direct reads
  btrfs: don't merge pages into bio if their page offset is not contiguous
  btrfs: update generation of hole file extent item when merging holes
  btrfs: fix possible memory leak in btrfs_get_dev_args_from_path()
  btrfs: check if root is readonly while setting security xattr
2022-08-28 10:44:04 -07:00
Josef Bacik
79d3d1d12e btrfs: don't allow large NOWAIT direct reads
Dylan and Jens reported a problem where they had an io_uring test that
was returning short reads, and bisected it to ee5b46a353 ("btrfs:
increase direct io read size limit to 256 sectors").

The root cause is their test was doing larger reads via io_uring with
NOWAIT and async.  This was triggering a page fault during the direct
read, however the first page was able to work just fine and thus we
submitted a 4k read for a larger iocb.

Btrfs allows for partial IO's in this case specifically because we don't
allow page faults, and thus we'll attempt to do any io that we can,
submit what we could, come back and fault in the rest of the range and
try to do the remaining IO.

However for !is_sync_kiocb() we'll call ->ki_complete() as soon as the
partial dio is done, which is incorrect.  In the sync case we can exit
the iomap code, submit more io's, and return with the amount of IO we
were able to complete successfully.

We were always doing short reads in this case, but for NOWAIT we were
getting saved by the fact that we were limiting direct reads to
sectorsize, and if we were larger than that we would return EAGAIN.

Fix the regression by simply returning EAGAIN in the NOWAIT case with
larger reads, that way io_uring can retry and get the larger IO and have
the fault logic handle everything properly.

This still leaves the AIO short read case, but that existed before this
change.  The way to properly fix this would be to handle partial iocb
completions, but that's a lot of work, for now deal with the regression
in the most straightforward way possible.

Reported-by: Dylan Yudaken <dylany@fb.com>
Fixes: ee5b46a353 ("btrfs: increase direct io read size limit to 256 sectors")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-08-22 18:08:07 +02:00
Linus Torvalds
353767e4aa for-5.20-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmLnyNUACgkQxWXV+ddt
 WDt9vA/9HcF+v5EkknyW07tatTap/Hm/ZB86Z5OZi6ikwIEcHsWhp3rUICejm88e
 GecDPIluDtCtyD6x4stuqkwOm22aDP5q2T9H6+gyw92ozyb436OV1Z8IrmftzXKY
 EpZO70PHZT+E6E/WYvyoTmmoCrjib7YlqCWZZhSLUFpsqqlOInmHEH49PW6KvM4r
 acUZ/RxHurKdmI3kNY6ECbAQl6CASvtTdYcVCx8fT2zN0azoLIQxpYa7n/9ca1R6
 8WnYilCbLbNGtcUXvO2M3tMZ4/5kvxrwQsUn93ccCJYuiN0ASiDXbLZ2g4LZ+n56
 JGu+y5v5oBwjpVf+46cuvnENP5BQ61594WPseiVjrqODWnPjN28XkcVC0XmPsiiZ
 lszeHO2cuIrIFoCah8ELMl8usu8+qxfXmPxIXtPu9rEyKsDtOjxVYc8SMXqLp0qQ
 qYtBoFm0JcZHqtZRpB+dhQ37/xXtH4ljUi/mI6x8iALVujeR273URs7yO9zgIdeW
 uZoFtbwpHFLUk+TL7Ku82/zOXp3fCwtDpNmlYbxeMbea/be3ShjncM4+mYzvHYri
 dYON2LFrq+mnRDqtIXTCaAYwX7zU8Y18Ev9QwlNll8dKlKwS89+jpqLoa+eVYy3c
 /HitHFza70KxmOj4dvDVZlzDpPvl7kW1UBkmskg4u3jnNWzedkM=
 =sS1q
 -----END PGP SIGNATURE-----

Merge tag 'for-5.20-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs updates from David Sterba:
 "This brings some long awaited changes, the send protocol bump,
  otherwise lots of small improvements and fixes. The main core part is
  reworking bio handling, cleaning up the submission and endio and
  improving error handling.

  There are some changes outside of btrfs adding helpers or updating
  API, listed at the end of the changelog.

  Features:

   - sysfs:
      - export chunk size, in debug mode add tunable for setting its size
      - show zoned among features (was only in debug mode)
      - show commit stats (number, last/max/total duration)

   - send protocol updated to 2
      - new commands:
         - ability write larger data chunks than 64K
         - send raw compressed extents (uses the encoded data ioctls),
           ie. no decompression on send side, no compression needed on
           receive side if supported
         - send 'otime' (inode creation time) among other timestamps
         - send file attributes (a.k.a file flags and xflags)
      - this is first version bump, backward compatibility on send and
        receive side is provided
      - there are still some known and wanted commands that will be
        implemented in the near future, another version bump will be
        needed, however we want to minimize that to avoid causing
        usability issues

   - print checksum type and implementation at mount time

   - don't print some messages at mount (mentioned as people asked about
     it), we want to print messages namely for new features so let's
     make some space for that
      - big metadata - this has been supported for a long time and is
        not a feature that's worth mentioning
      - skinny metadata - same reason, set by default by mkfs

  Performance improvements:

   - reduced amount of reserved metadata for delayed items
      - when inserted items can be batched into one leaf
      - when deleting batched directory index items
      - when deleting delayed items used for deletion
      - overall improved count of files/sec, decreased subvolume lock
        contention

   - metadata item access bounds checker micro-optimized, with a few
     percent of improved runtime for metadata-heavy operations

   - increase direct io limit for read to 256 sectors, improved
     throughput by 3x on sample workload

  Notable fixes:

   - raid56
      - reduce parity writes, skip sectors of stripe when there are no
        data updates
      - restore reading from on-disk data instead of using stripe cache,
        this reduces chances to damage correct data due to RMW cycle

   - refuse to replay log with unknown incompat read-only feature bit
     set

   - zoned
      - fix page locking when COW fails in the middle of allocation
      - improved tracking of active zones, ZNS drives may limit the
        number and there are ENOSPC errors due to that limit and not
        actual lack of space
      - adjust maximum extent size for zone append so it does not cause
        late ENOSPC due to underreservation

   - mirror reading error messages show the mirror number

   - don't fallback to buffered IO for NOWAIT direct IO writes, we don't
     have the NOWAIT semantics for buffered io yet

   - send, fix sending link commands for existing file paths when there
     are deleted and created hardlinks for same files

   - repair all mirrors for profiles with more than 1 copy (raid1c34)

   - fix repair of compressed extents, unify where error detection and
     repair happen

  Core changes:

   - bio completion cleanups
      - don't double defer compression bios
      - simplify endio workqueues
      - add more data to btrfs_bio to avoid allocation for read requests
      - rework bio error handling so it's same what block layer does,
        the submission works and errors are consumed in endio
      - when asynchronous bio offload fails fall back to synchronous
        checksum calculation to avoid errors under writeback or memory
        pressure

   - new trace points
      - raid56 events
      - ordered extent operations

   - super block log_root_transid deprecated (never used)

   - mixed_backref and big_metadata sysfs feature files removed, they've
     been default for sufficiently long time, there are no known users
     and mixed_backref could be confused with mixed_groups

  Non-btrfs changes, API updates:

   - minor highmem API update to cover const arguments

   - switch all kmap/kmap_atomic to kmap_local

   - remove redundant flush_dcache_page()

   - address_space_operations::writepage callback removed

   - add bdev_max_segments() helper"

* tag 'for-5.20-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (163 commits)
  btrfs: don't call btrfs_page_set_checked in finish_compressed_bio_read
  btrfs: fix repair of compressed extents
  btrfs: remove the start argument to check_data_csum and export
  btrfs: pass a btrfs_bio to btrfs_repair_one_sector
  btrfs: simplify the pending I/O counting in struct compressed_bio
  btrfs: repair all known bad mirrors
  btrfs: merge btrfs_dev_stat_print_on_error with its only caller
  btrfs: join running log transaction when logging new name
  btrfs: simplify error handling in btrfs_lookup_dentry
  btrfs: send: always use the rbtree based inode ref management infrastructure
  btrfs: send: fix sending link commands for existing file paths
  btrfs: send: introduce recorded_ref_alloc and recorded_ref_free
  btrfs: zoned: wait until zone is finished when allocation didn't progress
  btrfs: zoned: write out partially allocated region
  btrfs: zoned: activate necessary block group
  btrfs: zoned: activate metadata block group on flush_space
  btrfs: zoned: disable metadata overcommit for zoned
  btrfs: zoned: introduce space_info->active_total_bytes
  btrfs: zoned: finish least available block group on data bg allocation
  btrfs: let can_allocate_chunk return error
  ...
2022-08-03 14:54:52 -07:00
Linus Torvalds
5264406cdb iov_iter work, part 1 - isolated cleanups and optimizations.
One of the goals is to reduce the overhead of using ->read_iter()
 and ->write_iter() instead of ->read()/->write(); new_sync_{read,write}()
 has a surprising amount of overhead, in particular inside iocb_flags().
 That's why the beginning of the series is in this pile; it's not directly
 iov_iter-related, but it's a part of the same work...
 
 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCYurGOQAKCRBZ7Krx/gZQ
 6ysyAP91lvBfMRepcxpd9kvtuzWkU8A3rfSziZZteEHANB9Q7QEAiPn2a2OjWkcZ
 uAyUWfCkHCNx+dSMkEvUgR5okQ0exAM=
 =9UCV
 -----END PGP SIGNATURE-----

Merge tag 'pull-work.iov_iter-base' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull vfs iov_iter updates from Al Viro:
 "Part 1 - isolated cleanups and optimizations.

  One of the goals is to reduce the overhead of using ->read_iter() and
  ->write_iter() instead of ->read()/->write().

  new_sync_{read,write}() has a surprising amount of overhead, in
  particular inside iocb_flags(). That's the explanation for the
  beginning of the series is in this pile; it's not directly
  iov_iter-related, but it's a part of the same work..."

* tag 'pull-work.iov_iter-base' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  first_iovec_segment(): just return address
  iov_iter: massage calling conventions for first_{iovec,bvec}_segment()
  iov_iter: first_{iovec,bvec}_segment() - simplify a bit
  iov_iter: lift dealing with maxpages out of first_{iovec,bvec}_segment()
  iov_iter_get_pages{,_alloc}(): cap the maxsize with MAX_RW_COUNT
  iov_iter_bvec_advance(): don't bother with bvec_iter
  copy_page_{to,from}_iter(): switch iovec variants to generic
  keep iocb_flags() result cached in struct file
  iocb: delay evaluation of IS_SYNC(...) until we want to check IOCB_DSYNC
  struct file: use anonymous union member for rcuhead and llist
  btrfs: use IOMAP_DIO_NOSYNC
  teach iomap_dio_rw() to suppress dsync
  No need of likely/unlikely on calls of check_copy_size()
2022-08-03 13:50:22 -07:00
Linus Torvalds
f00654007f Folio changes for 6.0
- Fix an accounting bug that made NR_FILE_DIRTY grow without limit
    when running xfstests
 
  - Convert more of mpage to use folios
 
  - Remove add_to_page_cache() and add_to_page_cache_locked()
 
  - Convert find_get_pages_range() to filemap_get_folios()
 
  - Improvements to the read_cache_page() family of functions
 
  - Remove a few unnecessary checks of PageError
 
  - Some straightforward filesystem conversions to use folios
 
  - Split PageMovable users out from address_space_operations into their
    own movable_operations
 
  - Convert aops->migratepage to aops->migrate_folio
 
  - Remove nobh support (Christoph Hellwig)
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCgAdFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAmLpViQACgkQDpNsjXcp
 gj5pBgf/f3+K7Hi3qw7aYQCYJQ7IA/bLyE/DLWI59kuiao6wDSve40B9YH9X++Ha
 mRLp55bkQS+bwS2xa4jlqrIDJzAfNoWlXaXZHUXGL1C/52ChTF6jaH2cvO9PVlDS
 7fLv1hy2LwiIdzpKJkUW7T+kcQGj3QLKqtQ4x8zD0LGMg055yvt/qndHSUi41nWT
 /58+6W8Sk4vvRgkpeChFzF1lGLy00+FGT8y5V2kM9uRliFQ7XPCwqB2a3e5jbW6z
 C1NXQmRnopCrnOT1TFIhK3DyX6MDIWV5qcikNAmCKFb9fQFPmjDLPt9iSoMGjw2M
 Z+UVhJCaU3ISccd0DG5Ra/vzs9/O9Q==
 =DgUi
 -----END PGP SIGNATURE-----

Merge tag 'folio-6.0' of git://git.infradead.org/users/willy/pagecache

Pull folio updates from Matthew Wilcox:

 - Fix an accounting bug that made NR_FILE_DIRTY grow without limit
   when running xfstests

 - Convert more of mpage to use folios

 - Remove add_to_page_cache() and add_to_page_cache_locked()

 - Convert find_get_pages_range() to filemap_get_folios()

 - Improvements to the read_cache_page() family of functions

 - Remove a few unnecessary checks of PageError

 - Some straightforward filesystem conversions to use folios

 - Split PageMovable users out from address_space_operations into
   their own movable_operations

 - Convert aops->migratepage to aops->migrate_folio

 - Remove nobh support (Christoph Hellwig)

* tag 'folio-6.0' of git://git.infradead.org/users/willy/pagecache: (78 commits)
  fs: remove the NULL get_block case in mpage_writepages
  fs: don't call ->writepage from __mpage_writepage
  fs: remove the nobh helpers
  jfs: stop using the nobh helper
  ext2: remove nobh support
  ntfs3: refactor ntfs_writepages
  mm/folio-compat: Remove migration compatibility functions
  fs: Remove aops->migratepage()
  secretmem: Convert to migrate_folio
  hugetlb: Convert to migrate_folio
  aio: Convert to migrate_folio
  f2fs: Convert to filemap_migrate_folio()
  ubifs: Convert to filemap_migrate_folio()
  btrfs: Convert btrfs_migratepage to migrate_folio
  mm/migrate: Add filemap_migrate_folio()
  mm/migrate: Convert migrate_page() to migrate_folio()
  nfs: Convert to migrate_folio
  btrfs: Convert btree_migratepage to migrate_folio
  mm/migrate: Convert expected_page_refs() to folio_expected_refs()
  mm/migrate: Convert buffer_migrate_page() to buffer_migrate_folio()
  ...
2022-08-03 10:35:43 -07:00
Linus Torvalds
c013d0af81 for-5.20/block-2022-07-29
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmLko3gQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpmQaD/90NKFj4v8I456TUQyg1jimXEsL+e84E6o2
 ALWVb6JzQvlPVQXNLnK5YKIunMWOTtTMz0nyB8sVRwVJVJO0P5d7QopAkZM8fkyU
 MK5OCzoryENw4DTc2wJS4in6cSbGylIuN74wMzlf7+M67JTImfoZQhbTMcjwzZfn
 b3OlL6sID7zMXwGcuOJPZyUJICCpDhzdSF9JXqKma5PQuG2SBmQyvFxJAcsoFBPc
 YetnoRIOIN6yBvsIZaPaYq7XI9MIvF0e67EQtyCEHj4tHpyVnyDWkeObVFULsISU
 gGEKbkYPvNUzRAU5Q1NBBHh1tTfkf/MaUxTuZwoEwZ/s04IGBGMmrZGyfvdfzYo6
 M7NwSEg/TrUSNfTwn65mQi7uOXu1pGkJrqz84Flm8u9Qid9Vd7LExLG5p/ggnWdH
 5th93MDEmtEg29e9DXpEAuS5d0t3TtSvosflaKpyfNNfr+P0rWCN6GM/uW62VUTK
 ls69SQh/AQJRbg64jU4xper6WhaYtSXK7TKEnxJycoEn9gYNyCcdot2uekth0xRH
 ChHGmRlteiqe/y4uFWn/2dcxWjoleiHbFjTaiRL75WVl8wIDEjw02LGuoZ61Ss9H
 WOV+MT7KqNjBGe6lreUY+O/PO02dzmoR6heJXN19p8zr/pBuLCTGX7UpO7rzgaBR
 4N1HEozvIw==
 =celk
 -----END PGP SIGNATURE-----

Merge tag 'for-5.20/block-2022-07-29' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:

 - Improve the type checking of request flags (Bart)

 - Ensure queue mapping for a single queues always picks the right queue
   (Bart)

 - Sanitize the io priority handling (Jan)

 - rq-qos race fix (Jinke)

 - Reserved tags handling improvements (John)

 - Separate memory alignment from file/disk offset aligment for O_DIRECT
   (Keith)

 - Add new ublk driver, userspace block driver using io_uring for
   communication with the userspace backend (Ming)

 - Use try_cmpxchg() to cleanup the code in various spots (Uros)

 - Finally remove bdevname() (Christoph)

 - Clean up the zoned device handling (Christoph)

 - Clean up independent access range support (Christoph)

 - Clean up and improve block sysfs handling (Christoph)

 - Clean up and improve teardown of block devices.

   This turns the usual two step process into something that is simpler
   to implement and handle in block drivers (Christoph)

 - Clean up chunk size handling (Christoph)

 - Misc cleanups and fixes (Bart, Bo, Dan, GuoYong, Jason, Keith, Liu,
   Ming, Sebastian, Yang, Ying)

* tag 'for-5.20/block-2022-07-29' of git://git.kernel.dk/linux-block: (178 commits)
  ublk_drv: fix double shift bug
  ublk_drv: make sure that correct flags(features) returned to userspace
  ublk_drv: fix error handling of ublk_add_dev
  ublk_drv: fix lockdep warning
  block: remove __blk_get_queue
  block: call blk_mq_exit_queue from disk_release for never added disks
  blk-mq: fix error handling in __blk_mq_alloc_disk
  ublk: defer disk allocation
  ublk: rewrite ublk_ctrl_get_queue_affinity to not rely on hctx->cpumask
  ublk: fold __ublk_create_dev into ublk_ctrl_add_dev
  ublk: cleanup ublk_ctrl_uring_cmd
  ublk: simplify ublk_ch_open and ublk_ch_release
  ublk: remove the empty open and release block device operations
  ublk: remove UBLK_IO_F_PREFLUSH
  ublk: add a MAINTAINERS entry
  block: don't allow the same type rq_qos add more than once
  mmc: fix disk/queue leak in case of adding disk failure
  ublk_drv: fix an IS_ERR() vs NULL check
  ublk: remove UBLK_IO_F_INTEGRITY
  ublk_drv: remove unneeded semicolon
  ...
2022-08-02 13:46:35 -07:00
Matthew Wilcox (Oracle)
e7a60a1787 btrfs: Convert btrfs_migratepage to migrate_folio
Use filemap_migrate_folio() to do the bulk of the work, and then copy
the ordered flag across if needed.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: David Sterba <dsterba@suse.com>
2022-08-02 12:34:04 -04:00
Christoph Hellwig
0b078d9db8 btrfs: don't call btrfs_page_set_checked in finish_compressed_bio_read
This flag was used to communicate that the low-level compression code
already did verify the checksum to the high-level I/O completion code.

But it has been unused for a long time as the upper btrfs_bio for the
decompressed data had a NULL csum pointer basically since that pointer
existed and the code already checks for that a little later.

Note that this does not affect the other use of the checked flag, which
is only used for the COW fixup worker.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 19:56:16 +02:00
Christoph Hellwig
81bd9328ab btrfs: fix repair of compressed extents
Currently the checksum of compressed extents is verified based on the
compressed data and the lower btrfs_bio, but the actual repair process
is driven by end_bio_extent_readpage on the upper btrfs_bio for the
decompressed data.

This has a bunch of issues, including not being able to properly
communicate the failed mirror up in case that the I/O submission got
preempted, a general loss of if an error was an I/O error or a checksum
verification failure, but most importantly that this design causes
btrfs_clean_io_failure to eventually write back the uncompressed good
data onto the disk sectors that are supposed to contain compressed data.

Fix this by moving the repair to the lower btrfs_bio.  To do so, a fair
amount of code has to be reshuffled:

 a) the lower btrfs_bio now needs a valid csum pointer.  The easiest way
    to achieve that is to pass NULL btrfs_lookup_bio_sums and just use
    the btrfs_bio management of csums.  For a compressed_bio that is
    split into multiple btrfs_bios this means additional memory
    allocations, but the code becomes a lot more regular.
 b) checksum verification now runs directly on the lower btrfs_bio instead
    of the compressed_bio.  This actually nicely simplifies the end I/O
    processing.
 c) btrfs_repair_one_sector can't just look up the logical address for
    the file offset any more, as there is no corresponding relative
    offsets that apply to the file offset and the logic address for
    compressed extents.  Instead require that the saved bvec_iter in the
    btrfs_bio is filled out for all read bios and use that, which again
    removes a fair amount of code.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 19:56:16 +02:00
Christoph Hellwig
7959bd4411 btrfs: remove the start argument to check_data_csum and export
Derive the value of start from the btrfs_bio now that ->file_offset is
always valid.  Also export and rename the function so it's available
outside of inode.c as we'll need that soon.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 19:55:32 +02:00
Christoph Hellwig
7aa51232e2 btrfs: pass a btrfs_bio to btrfs_repair_one_sector
Pass the btrfs_bio instead of the plain bio to btrfs_repair_one_sector,
and remove the start and failed_mirror arguments in favor of deriving
them from the btrfs_bio.  For this to work ensure that the file_offset
field is also initialized for buffered I/O.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 19:55:19 +02:00
Nikolay Borisov
fc8b235fdc btrfs: simplify error handling in btrfs_lookup_dentry
In btrfs_lookup_dentry releasing the reference of the sub_root and the
running orphan cleanup should only happen if the dentry found actually
represents a subvolume. This can only be true in the 'else' branch as
otherwise either fixup_tree_root_location returned an ENOENT error, in
which case sub_root wouldn't have been changed or if we got a different
errno this means btrfs_get_fs_root couldn't have executed successfully
again meaning sub_root will equal to root. So simplify all the branches
by moving the code into the 'else'.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 17:45:42 +02:00
Naohiro Aota
2ce543f478 btrfs: zoned: wait until zone is finished when allocation didn't progress
When the allocated position doesn't progress, we cannot submit IOs to
finish a block group, but there should be ongoing IOs that will finish a
block group. So, in that case, we wait for a zone to be finished and retry
the allocation after that.

Introduce a new flag BTRFS_FS_NEED_ZONE_FINISH for fs_info->flags to
indicate we need a zone finish to have proceeded. The flag is set when the
allocator detected it cannot activate a new block group. And, it is cleared
once a zone is finished.

CC: stable@vger.kernel.org # 5.16+
Fixes: afba2bc036 ("btrfs: zoned: implement active zone tracking")
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 17:45:42 +02:00
Naohiro Aota
898793d992 btrfs: zoned: write out partially allocated region
cow_file_range() works in an all-or-nothing way: if it fails to allocate an
extent for a part of the given region, it gives up all the region including
the successfully allocated parts. On cow_file_range(), run_delalloc_zoned()
writes data for the region only when it successfully allocate all the
region.

This all-or-nothing allocation and write-out are problematic when available
space in all the block groups are get tight with the active zone
restriction. btrfs_reserve_extent() try hard to utilize the left space in
the active block groups and gives up finally and fails with
-ENOSPC. However, if we send IOs for the successfully allocated region, we
can finish a zone and can continue on the rest of the allocation on a newly
allocated block group.

This patch implements the partial write-out for run_delalloc_zoned(). With
this patch applied, cow_file_range() returns -EAGAIN to tell the caller to
do something to progress the further allocation, and tells the successfully
allocated region with done_offset. Furthermore, the zoned extent allocator
returns -EAGAIN to tell cow_file_range() going back to the caller side.

Actually, we still need to wait for an IO to complete to continue the
allocation. The next patch implements that part.

CC: stable@vger.kernel.org # 5.16+
Fixes: afba2bc036 ("btrfs: zoned: implement active zone tracking")
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 17:45:42 +02:00
Naohiro Aota
7d7672bc5d btrfs: convert count_max_extents() to use fs_info->max_extent_size
If count_max_extents() uses BTRFS_MAX_EXTENT_SIZE to calculate the number
of extents needed, btrfs release the metadata reservation too much on its
way to write out the data.

Now that BTRFS_MAX_EXTENT_SIZE is replaced with fs_info->max_extent_size,
convert count_max_extents() to use it instead, and fix the calculation of
the metadata reservation.

CC: stable@vger.kernel.org # 5.12+
Fixes: d8e3fb106f ("btrfs: zoned: use ZONE_APPEND write for zoned mode")
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 17:45:41 +02:00
Naohiro Aota
f7b12a62f0 btrfs: replace BTRFS_MAX_EXTENT_SIZE with fs_info->max_extent_size
On zoned filesystem, data write out is limited by max_zone_append_size,
and a large ordered extent is split according the size of a bio. OTOH,
the number of extents to be written is calculated using
BTRFS_MAX_EXTENT_SIZE, and that estimated number is used to reserve the
metadata bytes to update and/or create the metadata items.

The metadata reservation is done at e.g, btrfs_buffered_write() and then
released according to the estimation changes. Thus, if the number of extent
increases massively, the reserved metadata can run out.

The increase of the number of extents easily occurs on zoned filesystem
if BTRFS_MAX_EXTENT_SIZE > max_zone_append_size. And, it causes the
following warning on a small RAM environment with disabling metadata
over-commit (in the following patch).

[75721.498492] ------------[ cut here ]------------
[75721.505624] BTRFS: block rsv 1 returned -28
[75721.512230] WARNING: CPU: 24 PID: 2327559 at fs/btrfs/block-rsv.c:537 btrfs_use_block_rsv+0x560/0x760 [btrfs]
[75721.581854] CPU: 24 PID: 2327559 Comm: kworker/u64:10 Kdump: loaded Tainted: G        W         5.18.0-rc2-BTRFS-ZNS+ #109
[75721.597200] Hardware name: Supermicro Super Server/H12SSL-NT, BIOS 2.0 02/22/2021
[75721.607310] Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
[75721.616209] RIP: 0010:btrfs_use_block_rsv+0x560/0x760 [btrfs]
[75721.646649] RSP: 0018:ffffc9000fbdf3e0 EFLAGS: 00010286
[75721.654126] RAX: 0000000000000000 RBX: 0000000000004000 RCX: 0000000000000000
[75721.663524] RDX: 0000000000000004 RSI: 0000000000000008 RDI: fffff52001f7be6e
[75721.672921] RBP: ffffc9000fbdf420 R08: 0000000000000001 R09: ffff889f8d1fc6c7
[75721.682493] R10: ffffed13f1a3f8d8 R11: 0000000000000001 R12: ffff88980a3c0e28
[75721.692284] R13: ffff889b66590000 R14: ffff88980a3c0e40 R15: ffff88980a3c0e8a
[75721.701878] FS:  0000000000000000(0000) GS:ffff889f8d000000(0000) knlGS:0000000000000000
[75721.712601] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[75721.720726] CR2: 000055d12e05c018 CR3: 0000800193594000 CR4: 0000000000350ee0
[75721.730499] Call Trace:
[75721.735166]  <TASK>
[75721.739886]  btrfs_alloc_tree_block+0x1e1/0x1100 [btrfs]
[75721.747545]  ? btrfs_alloc_logged_file_extent+0x550/0x550 [btrfs]
[75721.756145]  ? btrfs_get_32+0xea/0x2d0 [btrfs]
[75721.762852]  ? btrfs_get_32+0xea/0x2d0 [btrfs]
[75721.769520]  ? push_leaf_left+0x420/0x620 [btrfs]
[75721.776431]  ? memcpy+0x4e/0x60
[75721.781931]  split_leaf+0x433/0x12d0 [btrfs]
[75721.788392]  ? btrfs_get_token_32+0x580/0x580 [btrfs]
[75721.795636]  ? push_for_double_split.isra.0+0x420/0x420 [btrfs]
[75721.803759]  ? leaf_space_used+0x15d/0x1a0 [btrfs]
[75721.811156]  btrfs_search_slot+0x1bc3/0x2790 [btrfs]
[75721.818300]  ? lock_downgrade+0x7c0/0x7c0
[75721.824411]  ? free_extent_buffer.part.0+0x107/0x200 [btrfs]
[75721.832456]  ? split_leaf+0x12d0/0x12d0 [btrfs]
[75721.839149]  ? free_extent_buffer.part.0+0x14f/0x200 [btrfs]
[75721.846945]  ? free_extent_buffer+0x13/0x20 [btrfs]
[75721.853960]  ? btrfs_release_path+0x4b/0x190 [btrfs]
[75721.861429]  btrfs_csum_file_blocks+0x85c/0x1500 [btrfs]
[75721.869313]  ? rcu_read_lock_sched_held+0x16/0x80
[75721.876085]  ? lock_release+0x552/0xf80
[75721.881957]  ? btrfs_del_csums+0x8c0/0x8c0 [btrfs]
[75721.888886]  ? __kasan_check_write+0x14/0x20
[75721.895152]  ? do_raw_read_unlock+0x44/0x80
[75721.901323]  ? _raw_write_lock_irq+0x60/0x80
[75721.907983]  ? btrfs_global_root+0xb9/0xe0 [btrfs]
[75721.915166]  ? btrfs_csum_root+0x12b/0x180 [btrfs]
[75721.921918]  ? btrfs_get_global_root+0x820/0x820 [btrfs]
[75721.929166]  ? _raw_write_unlock+0x23/0x40
[75721.935116]  ? unpin_extent_cache+0x1e3/0x390 [btrfs]
[75721.942041]  btrfs_finish_ordered_io.isra.0+0xa0c/0x1dc0 [btrfs]
[75721.949906]  ? try_to_wake_up+0x30/0x14a0
[75721.955700]  ? btrfs_unlink_subvol+0xda0/0xda0 [btrfs]
[75721.962661]  ? rcu_read_lock_sched_held+0x16/0x80
[75721.969111]  ? lock_acquire+0x41b/0x4c0
[75721.974982]  finish_ordered_fn+0x15/0x20 [btrfs]
[75721.981639]  btrfs_work_helper+0x1af/0xa80 [btrfs]
[75721.988184]  ? _raw_spin_unlock_irq+0x28/0x50
[75721.994643]  process_one_work+0x815/0x1460
[75722.000444]  ? pwq_dec_nr_in_flight+0x250/0x250
[75722.006643]  ? do_raw_spin_trylock+0xbb/0x190
[75722.013086]  worker_thread+0x59a/0xeb0
[75722.018511]  kthread+0x2ac/0x360
[75722.023428]  ? process_one_work+0x1460/0x1460
[75722.029431]  ? kthread_complete_and_exit+0x30/0x30
[75722.036044]  ret_from_fork+0x22/0x30
[75722.041255]  </TASK>
[75722.045047] irq event stamp: 0
[75722.049703] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
[75722.057610] hardirqs last disabled at (0): [<ffffffff8118a94a>] copy_process+0x1c1a/0x66b0
[75722.067533] softirqs last  enabled at (0): [<ffffffff8118a989>] copy_process+0x1c59/0x66b0
[75722.077423] softirqs last disabled at (0): [<0000000000000000>] 0x0
[75722.085335] ---[ end trace 0000000000000000 ]---

To fix the estimation, we need to introduce fs_info->max_extent_size to
replace BTRFS_MAX_EXTENT_SIZE, which allow setting the different size for
regular vs zoned filesystem.

Set fs_info->max_extent_size to BTRFS_MAX_EXTENT_SIZE by default. On zoned
filesystem, it is set to fs_info->max_zone_append_size.

CC: stable@vger.kernel.org # 5.12+
Fixes: d8e3fb106f ("btrfs: zoned: use ZONE_APPEND write for zoned mode")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 17:45:41 +02:00
Fabio M. De Francesco
4cb2e5e8fe btrfs: replace kmap_atomic() with kmap_local_page()
kmap_atomic() is being deprecated in favor of kmap_local_page() where it
is feasible. With kmap_local_page() mappings are per thread, CPU local,
and not globally visible.

The last use of kmap_atomic is in inode.c where the context is atomic [1]
and can be safely replaced by kmap_local_page.

Tested with xfstests on a QEMU + KVM 32-bits VM with 4GB RAM and booting a
kernel with HIGHMEM64GB enabled.

[1] https://lore.kernel.org/linux-btrfs/20220601132545.GM20633@twin.jikos.cz/

Suggested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 17:45:41 +02:00
David Sterba
710d5921d1 btrfs: switch btrfs_block_rsv::failfast to bool
Use simple bool type for the block reserve failfast status, there's
short to save space as there used to be int but there's no reason for
that.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 17:45:40 +02:00
Christoph Hellwig
37899117e8 btrfs: do not return errors from btrfs_submit_dio_bio
Always consume the bio and call the end_io handler on error instead of
returning an error and letting the caller handle it.  This matches what
the block layer submission and the other btrfs bio submission handlers do
and avoids any confusion on who needs to handle errors.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 17:45:40 +02:00
Christoph Hellwig
ea1f0cedef btrfs: handle allocation failure in btrfs_wq_submit_bio gracefully
btrfs_wq_submit_bio is used for writeback under memory pressure.
Instead of failing the I/O when we can't allocate the async_submit_bio,
just punt back to the synchronous submission path.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 17:45:40 +02:00
Christoph Hellwig
82443fd55c btrfs: simplify sync/async submission in btrfs_submit_data_write_bio
btrfs_submit_data_write_bio special cases the reloc root because the
checksums are preloaded, but only does so for the !sync case.  The sync
case can't happen for data relocation, but just handling it more generally
significantly simplifies the logic.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 17:45:40 +02:00
Christoph Hellwig
1a722d8f5b btrfs: do not return errors from btrfs_map_bio
Always consume the bio and call the end_io handler on error instead of
returning an error and letting the caller handle it.  This matches
what the block layer submission does and avoids any confusion on who
needs to handle errors.

As this requires touching all the callers, rename the function to
btrfs_submit_bio, which describes the functionality much better.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 17:45:39 +02:00
David Sterba
c1867eb33e btrfs: clean up chained assignments
The chained assignments may be convenient to write, but make readability
a bit worse as it's too easy to overlook that there are several values
set on the same line while this is rather an exception.  Making it
consistent everywhere avoids surprises.

The pattern where inode times are initialized reuses the first value and
the order is mtime, ctime. In other blocks the assignments are expanded
so the order of variables is similar to the neighboring code.

Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 17:45:39 +02:00
Naohiro Aota
aaafa1ebd6 btrfs: replace unnecessary goto with direct return at cow_file_range()
The 'goto out' in cow_file_range() in the exit block are not necessary
and jump back. Replace them with return, while still keeping 'goto out'
in the main code.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ keep goto in the main code, update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 17:45:38 +02:00
Naohiro Aota
71aa147b4d btrfs: fix error handling of fallback uncompress write
When cow_file_range() fails in the middle of the allocation loop, it
unlocks the pages but leaves the ordered extents intact. Thus, we need
to call btrfs_cleanup_ordered_extents() to finish the created ordered
extents.

Also, we need to call end_extent_writepage() if locked_page is available
because btrfs_cleanup_ordered_extents() never processes the region on
the locked_page.

Furthermore, we need to set the mapping as error if locked_page is
unavailable before unlocking the pages, so that the errno is properly
propagated to the user space.

CC: stable@vger.kernel.org # 5.18+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 17:45:38 +02:00
Naohiro Aota
99826e4cab btrfs: extend btrfs_cleanup_ordered_extents for NULL locked_page
btrfs_cleanup_ordered_extents() assumes locked_page to be non-NULL, so it
is not usable for submit_uncompressed_range() which can have NULL
locked_page.

Add support supports locked_page == NULL case. Also, it rewrites
redundant "page_offset(locked_page)".

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 17:45:38 +02:00
Naohiro Aota
9ce7466f37 btrfs: ensure pages are unlocked on cow_file_range() failure
There is a hung_task report on zoned btrfs like below.

https://github.com/naota/linux/issues/59

  [726.328648] INFO: task rocksdb:high0:11085 blocked for more than 241 seconds.
  [726.329839]       Not tainted 5.16.0-rc1+ #1
  [726.330484] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [726.331603] task:rocksdb:high0   state:D stack:    0 pid:11085 ppid: 11082 flags:0x00000000
  [726.331608] Call Trace:
  [726.331611]  <TASK>
  [726.331614]  __schedule+0x2e5/0x9d0
  [726.331622]  schedule+0x58/0xd0
  [726.331626]  io_schedule+0x3f/0x70
  [726.331629]  __folio_lock+0x125/0x200
  [726.331634]  ? find_get_entries+0x1bc/0x240
  [726.331638]  ? filemap_invalidate_unlock_two+0x40/0x40
  [726.331642]  truncate_inode_pages_range+0x5b2/0x770
  [726.331649]  truncate_inode_pages_final+0x44/0x50
  [726.331653]  btrfs_evict_inode+0x67/0x480
  [726.331658]  evict+0xd0/0x180
  [726.331661]  iput+0x13f/0x200
  [726.331664]  do_unlinkat+0x1c0/0x2b0
  [726.331668]  __x64_sys_unlink+0x23/0x30
  [726.331670]  do_syscall_64+0x3b/0xc0
  [726.331674]  entry_SYSCALL_64_after_hwframe+0x44/0xae
  [726.331677] RIP: 0033:0x7fb9490a171b
  [726.331681] RSP: 002b:00007fb943ffac68 EFLAGS: 00000246 ORIG_RAX: 0000000000000057
  [726.331684] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb9490a171b
  [726.331686] RDX: 00007fb943ffb040 RSI: 000055a6bbe6ec20 RDI: 00007fb94400d300
  [726.331687] RBP: 00007fb943ffad00 R08: 0000000000000000 R09: 0000000000000000
  [726.331688] R10: 0000000000000031 R11: 0000000000000246 R12: 00007fb943ffb000
  [726.331690] R13: 00007fb943ffb040 R14: 0000000000000000 R15: 00007fb943ffd260
  [726.331693]  </TASK>

While we debug the issue, we found running fstests generic/551 on 5GB
non-zoned null_blk device in the emulated zoned mode also had a
similar hung issue.

Also, we can reproduce the same symptom with an error injected
cow_file_range() setup.

The hang occurs when cow_file_range() fails in the middle of
allocation. cow_file_range() called from do_allocation_zoned() can
split the give region ([start, end]) for allocation depending on
current block group usages. When btrfs can allocate bytes for one part
of the split regions but fails for the other region (e.g. because of
-ENOSPC), we return the error leaving the pages in the succeeded regions
locked. Technically, this occurs only when @unlock == 0. Otherwise, we
unlock the pages in an allocated region after creating an ordered
extent.

Considering the callers of cow_file_range(unlock=0) won't write out
the pages, we can unlock the pages on error exit from
cow_file_range(). So, we can ensure all the pages except @locked_page
are unlocked on error case.

In summary, cow_file_range now behaves like this:

- page_started == 1 (return value)
  - All the pages are unlocked. IO is started.
- unlock == 1
  - All the pages except @locked_page are unlocked in any case
- unlock == 0
  - On success, all the pages are locked for writing out them
  - On failure, all the pages except @locked_page are unlocked

Fixes: 42c0110009 ("btrfs: zoned: introduce dedicated data write path for zoned filesystems")
CC: stable@vger.kernel.org # 5.12+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 17:45:38 +02:00
Christoph Hellwig
f3e90c1ca9 btrfs: remove extent writepage address space operation
Same as in commit 21b4ee7029 ("xfs: drop ->writepage completely"): we
can remove the callback as it's only used in one place - single page
writeback from memory reclaim and is not called for cgroup writeback at
all.

We only allow such writeback from kswapd, not from direct memory
reclaim, and so it is rarely used. When it comes from kswapd, it is
effectively random dirty page shoot-down, which is horrible for IO
patterns. We can rely on background writeback to clean all dirty pages
in an efficient way and not let it be interrupted by kswapd.

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 17:45:37 +02:00
Christoph Hellwig
ee5b46a353 btrfs: increase direct io read size limit to 256 sectors
Btrfs currently limits direct I/O reads to a single sector, which goes
back to commit c329861da4 ("Btrfs: don't allocate a separate csums
array for direct reads") from Josef.  That commit changes the direct I/O
code to ".. use the private part of the io_tree for our csums.", but ten
years later that isn't how checksums for direct reads work, instead they
use a csums allocation on a per-btrfs_dio_private basis (which have their
own performance problem for small I/O, but that will be addressed later).

There is no fundamental limit in btrfs itself to limit the I/O size
except for the size of the checksum array that scales linearly with
the number of sectors in an I/O.  Pick a somewhat arbitrary limit of
256 limits, which matches what the buffered reads typically see as
the upper limit as the limit for direct I/O as well.

This significantly improves direct read performance.  For example a fio
run doing 1 MiB aio reads with a queue depth of 1 roughly triples the
throughput:

Baseline:

READ: bw=65.3MiB/s (68.5MB/s), 65.3MiB/s-65.3MiB/s (68.5MB/s-68.5MB/s), io=19.1GiB (20.6GB), run=300013-300013msec

With this patch:

READ: bw=196MiB/s (206MB/s), 196MiB/s-196MiB/s (206MB/s-206MB/s), io=57.5GiB (61.7GB), run=300006-300006msc

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 17:45:37 +02:00
Christoph Hellwig
711f447b4f btrfs: remove the finish_func argument to btrfs_mark_ordered_io_finished
finish_func is always set to finish_ordered_fn, so remove it and also
the now pointless and somewhat confusingly named
__endio_write_update_ordered wrapper.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 17:45:37 +02:00
David Sterba
6d92b304ec btrfs: pass bits by value not by pointer for extent_state helpers
The bits are passed to all extent state helpers for no apparent reason,
the value only read and never updated so remove the indirection and pass
it directly. Also unify the type to u32 where needed.

Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 17:45:35 +02:00
Fabio M. De Francesco
70826b6bd5 btrfs: replace kmap() with kmap_local_page() in inode.c
The use of kmap() is being deprecated in favor of kmap_local_page() where
it is feasible. With kmap_local_page(), the mapping is per thread, CPU
local and not globally visible.

Therefore, use kmap_local_page() / kunmap_local() in inode.c wherever the
mappings are per thread and not globally visible.

Tested on QEMU + KVM 32 bits VM with 4GB of RAM and HIGHMEM64G enabled.

Suggested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 17:45:33 +02:00
Christoph Hellwig
d7b9416fe5 btrfs: remove btrfs_end_io_wq
All reads bio that go through btrfs_map_bio need to be completed in
user context.  And read I/Os are the most common and timing critical
in almost any file system workloads.

Embed a work_struct into struct btrfs_bio and use it to complete all
read bios submitted through btrfs_map, using the REQ_META flag to decide
which workqueue they are placed on.

This removes the need for a separate 128 byte allocation (typically
rounded up to 192 bytes by slab) for all reads with a size increase
of 24 bytes for struct btrfs_bio.  Future patches will reorganize
struct btrfs_bio to make use of this extra space for writes as well.

(All sizes are based a on typical 64-bit non-debug build)

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 17:45:33 +02:00
Christoph Hellwig
02bb5b7247 btrfs: don't double-defer bio completions for compressed reads
The bio completion handler of the bio used for the compressed data is
already run in a workqueue using btrfs_bio_wq_end_io, so don't schedule
the completion of the original bio to the same workqueue again but just
execute it directly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 17:45:33 +02:00
Christoph Hellwig
c93104e758 btrfs: split btrfs_submit_data_bio to read and write parts
Split btrfs_submit_data_bio into one helper for reads and one for writes.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 17:45:33 +02:00
Christoph Hellwig
e6484bd488 btrfs: simplify code flow in btrfs_submit_dio_bio
There is no exit block and cleanup and the function is reasonably short
so we can use inline return and not the goto. This makes the function
more straight forward.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 17:45:33 +02:00
Omar Sandoval
3ea4dc5bf0 btrfs: send: send compressed extents with encoded writes
Now that all of the pieces are in place, we can use the ENCODED_WRITE
command to send compressed extents when appropriate.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 17:45:32 +02:00
Filipe Manana
814e77182b btrfs: free the path earlier when creating a new inode
When creating an inode, through btrfs_create_new_inode(), we release the
path we allocated before once we don't need it anymore. But we keep it
allocated until we return from that function, which is wasteful because
after we release the path we do several things that can allocate yet
another path: inheriting properties, setting the xattrs used by ACLs and
secutiry modules, adding an orphan item (O_TMPFILE case) or adding a
dir item (for the non-O_TMPFILE case).

So instead of releasing the path once we don't need it anymore, free it
instead. This way we avoid having two paths allocated until we return
from btrfs_create_new_inode().

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 17:44:35 +02:00
Filipe Manana
ca6dee6b79 btrfs: balance btree dirty pages and delayed items after a rename
A rename operation modifies a subvolume's btree, to remove the old dir
item, add the new dir item, remove an inode ref and add a new inode ref.
It can also create the delayed inode for the inodes involved in the
operation, and it creates two delayed dir index items, one to delete
the old name and another one to add the new name.

However we are neither balancing the btree dirty pages nor the delayed
items after a rename, which can result in accumulation of too many
btree dirty pages and delayed items, specially if a task is doing a
series of rename operations (for example it can happen for package
installations/upgrades through the zypper tool).

So just call btrfs_btree_balance_dirty() after a rename, just like we
do for every other system call that results on modifying a btree and
adding delayed items.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 17:44:35 +02:00
David Sterba
21a8935ead btrfs: remove redundant calls to flush_dcache_page
Both memzero_page and memcpy_to_page already call flush_dcache_page so
we can remove the calls from btrfs code.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 17:44:34 +02:00
Christoph Hellwig
1e87770cb3 btrfs: use btrfs_bio_for_each_sector in btrfs_check_read_dio_bio
Use the new btrfs_bio_for_each_sector iterator to simplify
btrfs_check_read_dio_bio.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 17:44:34 +02:00
Christoph Hellwig
a89ce08ce6 btrfs: factor out a btrfs_csum_ptr helper
Add a helper to find the csum for a byte offset into the csum buffer.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 17:44:34 +02:00
Qu Wenruo
ae643a74eb btrfs: introduce a data checksum checking helper
Although we have several data csum verification code, we never have a
function really just to verify checksum for one sector.

Function check_data_csum() do extra work for error reporting, thus it
requires a lot of extra things like file offset, bio_offset etc.

Function btrfs_verify_data_csum() is even worse, it will utilize page
checked flag, which means it can not be utilized for direct IO pages.

Here we introduce a new helper, btrfs_check_sector_csum(), which really
only accept a sector in page, and expected checksum pointer.

We use this function to implement check_data_csum(), and export it for
incoming patch.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
[hch: keep passing the csum array as an arguments, as the callers want
      to print it, rename per request]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 17:44:33 +02:00
Fanjun Kong
1280d2d165 btrfs: use PAGE_ALIGNED instead of IS_ALIGNED
The <linux/mm.h> already provides the PAGE_ALIGNED macro. Let's
use it instead of IS_ALIGNED and passing PAGE_SIZE directly.

Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Fanjun Kong <bh1scw@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 17:44:33 +02:00
David Sterba
143823cf4d btrfs: fix typos in comments
Codespell has found a few typos.

Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 17:44:33 +02:00
Linus Torvalds
972a278fe6 for-5.19-rc7-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmLRpPgACgkQxWXV+ddt
 WDtu/BAAnfx7CXKIfWKpz6FZEio9Qb3mUHVOglyKzqR0qB72OdrC1dQMvEWPJc6h
 N65di6+8tTNmRIlaFBMU0MDHODR2aDRpDtlR9eUzUuidTc4iOp1fi31uBwl31r7b
 k8mCZBc/IAdfH13lBtcfkb2HGid7rik5ZC6Kx/glMcqh647QkSMAleupUsIYHKsK
 IgcUWuN3wFIUK2WVgsja7+ljlwIHBHKRp9yrEYw+ef/B0NCNKvOnrIOPJzO7nxMP
 1FbqJ6F7u7HjoMFcMwn5rbV/BoIwSSvXyKRqOW+EhGeQR/imVmkH9jXJ7wXdblSz
 IvSqaZ0DaWWSvivdMpwbr8Z0Cu4iIYhVY6PSA0hukR63qB5GwKKJ6j1L0zoYoz8C
 IDWJPW03FNRIu5ZOduvUQ3qG7jcJQZ3WPCCfrDST1cO2xHT/7f65Tjz4k0hvp4za
 edITetC1mEv310CHeGsJaLxGYPNrRe38VZYPxgJ7yFpteGYjh0ZwsuyUHb4MH1no
 JWwgElNW+m1BatdWSUBYk6xhqod1s2LOFPNqo7jNlv8I27hPViqCbBA2i9FlkXf+
 FwL5kWyJXs69gfjUIj59381Z0U1VdA1tvU8GP2m2+JvIDS6ooAcZj7yEQ69mCZxi
 2RFJIU0NFbnc/5j2ARSzOTGs9glDD0yffgXJM+cK+TWsQ3AC31I=
 =/47A
 -----END PGP SIGNATURE-----

Merge tag 'for-5.19-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs reverts from David Sterba:
 "Due to a recent report [1] we need to revert the radix tree to xarray
  conversion patches.

  There's a problem with sleeping under spinlock, when xa_insert could
  allocate memory under pressure. We use GFP_NOFS so this is a real
  problem that we unfortunately did not discover during review.

  I'm sorry to do such change at rc6 time but the revert is IMO the
  safer option, there are patches to use mutex instead of the spin locks
  but that would need more testing. The revert branch has been tested on
  a few setups, all seem ok.

  The conversion to xarray will be revisited in the future"

Link: https://lore.kernel.org/linux-btrfs/cover.1657097693.git.fdmanana@suse.com/ [1]

* tag 'for-5.19-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  Revert "btrfs: turn delayed_nodes_tree into an XArray"
  Revert "btrfs: turn name_cache radix tree into XArray in send_ctx"
  Revert "btrfs: turn fs_info member buffer_radix into XArray"
  Revert "btrfs: turn fs_roots_radix in btrfs_fs_info into an XArray"
2022-07-16 13:48:55 -07:00
David Sterba
088aea3b97 Revert "btrfs: turn delayed_nodes_tree into an XArray"
This reverts commit 253bf57555.

Revert the xarray conversion, there's a problem with potential
sleep-inside-spinlock [1] when calling xa_insert that triggers GFP_NOFS
allocation. The radix tree used the preloading mechanism to avoid
sleeping but this is not available in xarray.

Conversion from spin lock to mutex is possible but at time of rc6 is
riskier than a clean revert.

[1] https://lore.kernel.org/linux-btrfs/cover.1657097693.git.fdmanana@suse.com/

Reported-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-15 19:15:19 +02:00
David Sterba
fc7cbcd489 Revert "btrfs: turn fs_roots_radix in btrfs_fs_info into an XArray"
This reverts commit 48b36a602a.

Revert the xarray conversion, there's a problem with potential
sleep-inside-spinlock [1] when calling xa_insert that triggers GFP_NOFS
allocation. The radix tree used the preloading mechanism to avoid
sleeping but this is not available in xarray.

Conversion from spin lock to mutex is possible but at time of rc6 is
riskier than a clean revert.

[1] https://lore.kernel.org/linux-btrfs/cover.1657097693.git.fdmanana@suse.com/

Reported-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-15 19:14:28 +02:00
Bart Van Assche
bf9486d6dd fs/btrfs: Use the enum req_op and blk_opf_t types
Improve static type checking by using the enum req_op type for variables
that represent a request operation and the new blk_opf_t type for
variables that represent request flags.

Acked-by: David Sterba <dsterba@suse.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220714180729.1065367-51-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-14 12:14:32 -06:00