[REGRESSION]
There are several regression reports about the scrub performance with
v6.4 kernel.
On a PCIe 3.0 device, the old v6.3 kernel can go 3GB/s scrub speed, but
v6.4 can only go 1GB/s, an obvious 66% performance drop.
[CAUSE]
Iostat shows a very different behavior between v6.3 and v6.4 kernel:
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz aqu-sz %util
nvme0n1p3 9731.00 3425544.00 17237.00 63.92 2.18 352.02 21.18 100.00
nvme0n1p3 15578.00 993616.00 5.00 0.03 0.09 63.78 1.32 100.00
The upper one is v6.3 while the lower one is v6.4.
There are several obvious differences:
- Very few read merges
This turns out to be a behavior change that we no longer do bio
plug/unplug.
- Very low aqu-sz
This is due to the submit-and-wait behavior of flush_scrub_stripes(),
and extra extent/csum tree search.
Both behaviors are not that obvious on SATA SSDs, as SATA SSDs have NCQ
to merge the reads, while SATA SSDs can not handle high queue depth well
either.
[FIX]
For now this patch focuses on the read speed fix. Dev-replace replace
speed needs more work.
For the read part, we go two directions to fix the problems:
- Re-introduce blk plug/unplug to merge read requests
This is pretty simple, and the behavior is pretty easy to observe.
This would enlarge the average read request size to 512K.
- Introduce multi-group reads and no longer wait for each group
Instead of the old behavior, which submits 8 stripes and waits for
them, here we would enlarge the total number of stripes to 16 * 8.
Which is 8M per device, the same limit as the old scrub in-flight
bios size limit.
Now every time we fill a group (8 stripes), we submit them and
continue to next stripes.
Only when the full 16 * 8 stripes are all filled, we submit the
remaining ones (the last group), and wait for all groups to finish.
Then submit the repair writes and dev-replace writes.
This should enlarge the queue depth.
This would greatly improve the merge rate (thus read block size) and
queue depth:
Before (with regression, and cached extent/csum path):
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz aqu-sz %util
nvme0n1p3 20666.00 1318240.00 10.00 0.05 0.08 63.79 1.63 100.00
After (with all patches applied):
nvme0n1p3 5165.00 2278304.00 30557.00 85.54 0.55 441.10 2.81 100.00
i.e. 1287 to 2224 MB/s.
CC: stable@vger.kernel.org # 6.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
One of the bottleneck of the new scrub code is the extra csum tree
search.
The old code would only do the csum tree search for each scrub bio,
which can be as large as 512KiB, thus they can afford to allocate a new
path each time.
But the new scrub code is doing csum tree search for each stripe, which
is only 64KiB, this means we'd better re-use the same csum path during
each search.
This patch would introduce a per-sctx path for csum tree search, as we
don't need to re-allocate the path every time we need to do a csum tree
search.
With this change we can further improve the queue depth and improve the
scrub read performance:
Before (with regression and cached extent tree path):
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz aqu-sz %util
nvme0n1p3 15875.00 1013328.00 12.00 0.08 0.08 63.83 1.35 100.00
After (with both cached extent/csum tree path):
nvme0n1p3 17759.00 1133280.00 10.00 0.06 0.08 63.81 1.50 100.00
Fixes: e02ee89baa ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure")
CC: stable@vger.kernel.org # 6.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since commit e02ee89baa ("btrfs: scrub: switch scrub_simple_mirror()
to scrub_stripe infrastructure"), scrub no longer re-use the same path
for extent tree search.
This can lead to unnecessary extent tree search, especially for the new
stripe based scrub, as we have way more stripes to prepare.
This patch would re-introduce a shared path for extent tree search, and
properly release it when the block group is scrubbed.
This change alone can improve scrub performance slightly by reducing the
time spend preparing the stripe thus improving the queue depth.
Before (with regression):
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz aqu-sz %util
nvme0n1p3 15578.00 993616.00 5.00 0.03 0.09 63.78 1.32 100.00
After (with this patch):
nvme0n1p3 15875.00 1013328.00 12.00 0.08 0.08 63.83 1.35 100.00
Fixes: e02ee89baa ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure")
CC: stable@vger.kernel.org # 6.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The @path in scrub_simple_mirror() is no longer utilized after commit
e02ee89baa ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe
infrastructure").
Before that commit, we call find_first_extent_item() directly, which
needs a path and that path can be reused. But after that switch commit,
the extent search is done inside queue_scrub_stripe(), which will no
longer accept a path from outside.
So the @path variable can be safely removed.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ remove the stale comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
Variable stripe_nr is being divided by map->num_stripes however the
result is never read. The division and assignment are redundant and
can be removed. Cleans up clang scan build warning:
fs/btrfs/scrub.c:1264:3: warning: Value stored to 'stripe_nr' is
never read [deadcode.DeadStores]
The code is a leftover from 6ded22c1bf ("btrfs: reduce div64 calls by
limiting the number of stripes of a chunk to u32") that converted div64
to normal division, it's the same but previous version did not trigger a
warning.
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmTgyQQACgkQxWXV+ddt
WDvqSQ/+PFg0GwssGuiqWTGbfHV2bJCJWeuXUJNuKFo8PtEnpN0zf28ihsaRXAHF
ZDFKrRjEmb62n+EWJFDpC7wmnz6UJEoEtQteN2VBnLSIUQAKFI+g5flXrR85rk1D
d52JSXtaXSZeCtZH/wdYWdfkL19SJQqJrFDY1WmRLCylOsLHuG0a67fXNeL+5WM/
NgGUMk0bO/j2CKjiCwJT4EpsSP4tFj49TciuDESyXnS8aDbPLbAQkGpYlE+99HSj
D3vjZeqdVfmVhSjdIrK2eTlndzCl+HU+J1DXHzRE6I5XkXhzofJFtrlsvl++C9pv
UZL9bFyMFzybKME33RWvzXBhiRguZ4hfGBoh5FQbJl4yErU4I5RVZcd3/S/2V6n+
AzWemwkOdLEiiPD+aLV28EYdKpnd4GFweVTxeXjdXrJrSx/e4Vn/kPNq1aZJi6Qi
ex3hZWr0oN7JG/StN6i3ix09fEB8cyDzn/jaEwk5zb6uHVN8fw7whkVwZOvFkXx5
VcPxZOyxBFxwmN+L6JlxkIGEpu8UQC2RHa1JJzDTXJPqpz6W68d2wJ8jlDFJYUaf
fahDd8FoG/e/EYh8sPsOnp3gMY53UxxWLF8fuZXVScq9+g5zA3jfftF+a3TaA5bh
e119g0ml+KIGtTB7Q8nLob4PA12NNhNtHbKfdSPDhOfvz8heg9A=
=eFDQ
-----END PGP SIGNATURE-----
Merge tag 'for-6.5-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix infinite loop in readdir(), could happen in a big directory when
files get renamed during enumeration
- fix extent map handling of skipped pinned ranges
- fix a corner case when handling ordered extent length
- fix a potential crash when balance cancel races with pause
- verify correct uuid when starting scrub or device replace
* tag 'for-6.5-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix incorrect splitting in btrfs_drop_extent_map_range
btrfs: fix BUG_ON condition in btrfs_cancel_balance
btrfs: only subtract from len_to_oe_boundary when it is tracking an extent
btrfs: fix replace/scrub failure with metadata_uuid
btrfs: fix infinite directory reads
Fstests with POST_MKFS_CMD="btrfstune -m" (as in the mailing list)
reported a few of the test cases failing.
The failure scenario can be summarized and simplified as follows:
$ mkfs.btrfs -fq -draid1 -mraid1 /dev/sdb1 /dev/sdb2 :0
$ btrfstune -m /dev/sdb1 :0
$ wipefs -a /dev/sdb1 :0
$ mount -o degraded /dev/sdb2 /btrfs :0
$ btrfs replace start -B -f -r 1 /dev/sdb1 /btrfs :1
STDERR:
ERROR: ioctl(DEV_REPLACE_START) failed on "/btrfs": Input/output error
[11290.583502] BTRFS warning (device sdb2): tree block 22036480 mirror 2 has bad fsid, has 99835c32-49f0-4668-9e66-dc277a96b4a6 want da40350c-33ac-4872-92a8-4948ed8c04d0
[11290.586580] BTRFS error (device sdb2): unable to fix up (regular) error at logical 22020096 on dev /dev/sdb8 physical 1048576
As above, the replace is failing because we are verifying the header with
fs_devices::fsid instead of fs_devices::metadata_uuid, despite the
metadata_uuid actually being present.
To fix this, use fs_devices::metadata_uuid. We copy fsid into
fs_devices::metadata_uuid if there is no metadata_uuid, so its fine.
Fixes: a3ddbaebc7 ("btrfs: scrub: introduce a helper to verify one metadata block")
CC: stable@vger.kernel.org # 6.4+
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmSZ0YUACgkQxWXV+ddt
WDuF9g/+OsEZflGYIZj11trG0l5HWApnINKqhZ524J0TNy9KxY0KOTqPCOg0O41E
Vt7uJCMG06+ifvVEby8srjzUZxUutIuMeGIP91VyXUbt+CleAWunxnL8aKcTPS3T
QxyGx0VmukO2UHYCXhXQLRXo/+zPcBxnk6UgVzcBCIOecOMTB1KCeblBmqd3q86f
NVFse+BTkmtm86u/1rzEDqgY6lMHQ+jNoHhpRVGpzmSnSX8GELyOW3QnixS0LCo2
no0vvW0QXkRJ+S68V5zBWlqa3xr21jOYcmON2Ra2G8Etsjx7W5XKS3I9k/4uonxb
LbITmBwEZWt/aTzmLFT16S5M9BlRqH5Ffmsw7Ls+NDmdvH/f1zM8XeNAb7kpFTrn
T3aALjkcd65/JFmgyVmzdt4BSmrUkYm0EEmLirQec86HJ4NlQJpJ2B7cfMWKPyal
+VgaT4S+fLTc/HJD3nObMXTCxZrMf0sBUhU4/QXqL7TTjqqosSn26mlGNUocw7Ty
HaESk7j2L9TMPt640r1G98j9ND7sWmyBmiYsah8F3MKZCIS892qhtFs0m5g2tA1F
sjPv9u6M5Pi6ie5Eo8xs+SqKa7TPPVsbZ9XcMRBuzDc5AtUPAm6ii9QVwref8wTq
qO379jDepgPj4HZkXMzQKxd6rw6wrF854304XhjHZefk+ChhIc4=
=nN4X
-----END PGP SIGNATURE-----
Merge tag 'for-6.5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"Mainly core changes, refactoring and optimizations.
Performance is improved in some areas, overall there may be a
cumulative improvement due to refactoring that removed lookups in the
IO path or simplified IO submission tracking.
Core:
- submit IO synchronously for fast checksums (crc32c and xxhash),
remove high priority worker kthread
- read extent buffer in one go, simplify IO tracking, bio submission
and locking
- remove additional tracking of redirtied extent buffers, originally
added for zoned mode but actually not needed
- track ordered extent pointer in bio to avoid rbtree lookups during
IO
- scrub, use recovered data stripes as cache to avoid unnecessary
read
- in zoned mode, optimize logical to physical mappings of extents
- remove PageError handling, not set by VFS nor writeback
- cleanups, refactoring, better structure packing
- lots of error handling improvements
- more assertions, lockdep annotations
- print assertion failure with the exact line where it happens
- tracepoint updates
- more debugging prints
Performance:
- speedup in fsync(), better tracking of inode logged status can
avoid transaction commit
- IO path structures track logical offsets in data structures and
does not need to look it up
User visible changes:
- don't commit transaction for every created subvolume, this can
reduce time when many subvolumes are created in a batch
- print affected files when relocation fails
- trigger orphan file cleanup during START_SYNC ioctl
Notable fixes:
- fix crash when disabling quota and relocation
- fix crashes when removing roots from drity list
- fix transacion abort during relocation when converting from newer
profiles not covered by fallback
- in zoned mode, stop reclaiming block groups if filesystem becomes
read-only
- fix rare race condition in tree mod log rewind that can miss some
btree node slots
- with enabled fsverity, drop up-to-date page bit in case the
verification fails"
* tag 'for-6.5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (194 commits)
btrfs: fix race between quota disable and relocation
btrfs: add comment to struct btrfs_fs_info::dirty_cowonly_roots
btrfs: fix race when deleting free space root from the dirty cow roots list
btrfs: fix race when deleting quota root from the dirty cow roots list
btrfs: tracepoints: also show actual number of the outstanding extents
btrfs: update i_version in update_dev_time
btrfs: make btrfs_compressed_bioset static
btrfs: add handling for RAID1C23/DUP to btrfs_reduce_alloc_profile
btrfs: scrub: remove btrfs_fs_info::scrub_wr_completion_workers
btrfs: scrub: remove scrub_ctx::csum_list member
btrfs: do not BUG_ON after failure to migrate space during truncation
btrfs: do not BUG_ON on failure to get dir index for new snapshot
btrfs: send: do not BUG_ON() on unexpected symlink data extent
btrfs: do not BUG_ON() when dropping inode items from log root
btrfs: replace BUG_ON() at split_item() with proper error handling
btrfs: do not BUG_ON() on tree mod log failures at btrfs_del_ptr()
btrfs: do not BUG_ON() on tree mod log failures at insert_ptr()
btrfs: do not BUG_ON() on tree mod log failure at insert_new_root()
btrfs: do not BUG_ON() on tree mod log failures at push_nodes_for_insert()
btrfs: abort transaction at update_ref_for_cow() when ref count is zero
...
There was regression caused by a97699d1d6 ("btrfs: replace
map_lookup->stripe_len by BTRFS_STRIPE_LEN") and supposedly fixed by
a7299a18a1 ("btrfs: fix u32 overflows when left shifting stripe_nr").
To avoid code churn the fix was open coding the type casts but
unfortunately missed one which was still possible to hit [1].
The missing place was assignment of bioc->full_stripe_logical inside
btrfs_map_block().
Fix it by adding a helper that does the safe calculation of the offset
and use it everywhere even though it may not be strictly necessary due
to already using u64 types. This replaces all remaining
"<< BTRFS_STRIPE_LEN_SHIFT" calls.
[1] https://lore.kernel.org/linux-btrfs/20230622065438.86402-1-wqu@suse.com/
Fixes: a7299a18a1 ("btrfs: fix u32 overflows when left shifting stripe_nr")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
Since the scrub rework introduced by commit 2af2aaf982 ("btrfs:
scrub: introduce structure for new BTRFS_STRIPE_LEN based interface")
and later commits, scrub only needs one single workqueue,
fs_info::scrub_worker.
That scrub_wr_completion_workers is initially to handle the delay work
after write bios finished. But the new scrub code goes submit-and-wait
for write bios, thus all the work are done inside the scrub_worker.
The last user of fs_info::scrub_wr_completion_workers is removed in
commit 16f9399349 ("btrfs: scrub: remove the old writeback
infrastructure"), so we can safely remove the workqueue.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since the rework of scrub introduced by commit 2af2aaf982 ("btrfs:
scrub: introduce structure for new BTRFS_STRIPE_LEN based interface")
and later commits, scrub no longer keeps its data checksum inside sctx.
Instead we have scrub_stripe::csums for the checksum of the stripe.
Thus we can remove the unused scrub_ctx::csum_list member.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_map_sblock just hard codes three arguments and calls
btrfs_map_sblock. Remove it as it doesn't provide any real value, but
makes following the btrfs_map_block call chains harder.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
BACKGROUND
==========
When multiple work items are queued to a workqueue, their execution order
doesn't match the queueing order. They may get executed in any order and
simultaneously. When fully serialized execution - one by one in the queueing
order - is needed, an ordered workqueue should be used which can be created
with alloc_ordered_workqueue().
However, alloc_ordered_workqueue() was a later addition. Before it, an
ordered workqueue could be obtained by creating an UNBOUND workqueue with
@max_active==1. This originally was an implementation side-effect which was
broken by 4c16bd327c ("workqueue: restore WQ_UNBOUND/max_active==1 to be
ordered"). Because there were users that depended on the ordered execution,
5c0338c687 ("workqueue: restore WQ_UNBOUND/max_active==1 to be ordered")
made workqueue allocation path to implicitly promote UNBOUND workqueues w/
@max_active==1 to ordered workqueues.
While this has worked okay, overloading the UNBOUND allocation interface
this way creates other issues. It's difficult to tell whether a given
workqueue actually needs to be ordered and users that legitimately want a
min concurrency level wq unexpectedly gets an ordered one instead. With
planned UNBOUND workqueue updates to improve execution locality and more
prevalence of chiplet designs which can benefit from such improvements, this
isn't a state we wanna be in forever.
This patch series audits all call sites that create an UNBOUND workqueue w/
@max_active==1 and converts them to alloc_ordered_workqueue() as necessary.
BTRFS
=====
* fs_info->scrub_workers initialized in scrub_workers_get() was setting
@max_active to 1 when @is_dev_replace is set and it seems that the
workqueue actually needs to be ordered if @is_dev_replace. Update the code
so that alloc_ordered_workqueue() is used if @is_dev_replace.
* fs_info->discard_ctl.discard_workers initialized in
btrfs_init_workqueues() was directly using alloc_workqueue() w/
@max_active==1. Converted to alloc_ordered_workqueue().
* fs_info->fixup_workers and fs_info->qgroup_rescan_workers initialized in
btrfs_queue_work() use the btrfs's workqueue wrapper, btrfs_workqueue,
which are allocated with btrfs_alloc_workqueue().
btrfs_workqueue implements automatic @max_active adjustment which is
disabled when the specified max limit is below a certain threshold, so
calling btrfs_alloc_workqueue() with @limit_active==1 yields an ordered
workqueue whose @max_active won't be changed as the auto-tuning is
disabled.
This is rather brittle in that nothing clearly indicates that the two
workqueues should be ordered or btrfs_alloc_workqueue() must disable
auto-tuning when @limit_active==1.
This patch factors out the common btrfs_workqueue init code into
btrfs_init_workqueue() and add explicit btrfs_alloc_ordered_workqueue().
The two workqueues are converted to use the new ordered allocation
interface.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: David Sterba <dsterba@suse.com>
These functions are defined in the scrub.c file, but last callers were
removed in e9255d6c40 ("btrfs: scrub: remove the old scrub recheck
code").
fs/btrfs/scrub.c:553:20: warning: unused function 'scrub_stripe_index_and_offset'.
fs/btrfs/scrub.c:543:19: warning: unused function 'scrub_nr_raid_mirrors'.
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=4937
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
Smatch reports the following errors related to commit ("btrfs: output
affected files when relocation fails"):
fs/btrfs/inode.c:283 print_data_reloc_error()
error: uninitialized symbol 'ref_level'.
[CAUSE]
That part of code is mostly copied from scrub, but unfortunately scrub
code from the beginning is not doing the error handling properly.
The offending code looks like this:
do {
ret = tree_backref_for_extent();
btrfs_warn_rl();
} while (ret != 1);
There are several problems involved:
- No error handling
If that tree_backref_for_extent() failed, we would output the same
error again and again, never really exit as it requires ret == 1 to
exit.
- Always do one extra output
As tree_backref_for_extent() only return > 0 if there is no more
backref item.
This means after the last item we hit, we would output an invalid
error message for ret > 0 case.
[FIX]
Fix the old code by:
- Move @ref_root and @ref_level into the if branch
And do not initialize them, so we can catch such uninitialized values
just like what we do in the inode.c
- Explicitly check the return value of tree_backref_for_extent()
And handle ret < 0 and ret > 0 cases properly.
- No more do {} while () loop
Instead go while (true) {} loop since we will handle @ret manually.
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For P/Q stripe scrub, we have quite some duplicated read IO:
- Data stripes read for verification
This is triggered by the scrub_submit_initial_read() inside
scrub_raid56_parity_stripe().
- Data stripes read (again) for P/Q stripe verification
This is triggered by scrub_assemble_read_bios() from scrub_rbio().
Although we can have hit rbio cache and avoid unnecessary read, the
chance is very low, as scrub would easily flush the whole rbio cache.
This means, even we're just scrubbing a single P/Q stripe, we would read
the data stripes twice for the best case scenario. If we need to
recover some data stripes, it would cause more reads on the same data
stripes, again and again.
However before we call raid56_parity_submit_scrub_rbio() we already
have all data stripes repaired and their contents ready to use.
But RAID56 cache is unaware about the scrub cache, thus RAID56 layer
itself still needs to re-read the data stripes.
To avoid such cache miss, this patch would:
- Introduce a new helper, raid56_parity_cache_data_pages()
This function would grab the pages from an array, and copy the content
to the rbio, marking all the involved sectors uptodate.
The page copy is unavoidable because of the cache pages of rbio are all
self managed, thus can not utilize outside pages without screwing up
the lifespan.
- Use the repaired data stripes as cache inside
scrub_raid56_parity_stripe()
By this, we ensure all the data sectors of the scrub rbio are already
uptodate, and no need to read them again from disk.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[RETURN VALUE OVERWRITE]
Inside scrub_stripe(), we would submit all the remaining stripes after
iterating all extents.
But since flush_scrub_stripes() can return error, we need to avoid
overwriting the existing @ret if there is any error.
However the existing check is doing the wrong check:
ret2 = flush_scrub_stripes();
if (!ret2)
ret = ret2;
This would overwrite the existing @ret to 0 as long as the final flush
detects no critical errors.
[FIX]
We should check @ret other than @ret2 in that case.
Fixes: 8eb3dd17ea ("btrfs: dev-replace: error out if we have unrepaired metadata error during")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
After the recent scrub rework introduced in commit e02ee89baa ("btrfs:
scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure"),
btrfs scrub no longer reports repaired errors any more:
# mkfs.btrfs -f $dev -d DUP
# mount $dev $mnt
# xfs_io -f -d -c "pwrite -b 64K -S 0xaa 0 64" $mnt/file
# umount $dev
# xfs_io -f -c "pwrite -S 0xff $phy1 64K" $dev # Corrupt the first mirror
# mount $dev $mnt
# btrfs scrub start -BR $mnt
scrub done for 725e7cb7-8a4a-4c77-9f2a-86943619e218
Scrub started: Tue Jun 6 14:56:50 2023
Status: finished
Duration: 0:00:00
data_extents_scrubbed: 2
tree_extents_scrubbed: 18
data_bytes_scrubbed: 131072
tree_bytes_scrubbed: 294912
read_errors: 0
csum_errors: 0 <<< No errors here
verify_errors: 0
[...]
uncorrectable_errors: 0
unverified_errors: 0
corrected_errors: 16 <<< Only corrected errors
last_physical: 2723151872
This can confuse btrfs-progs, as it relies on the csum_errors to
determine if there is anything wrong.
While on v6.3.x kernels, the report is different:
csum_errors: 16 <<<
verify_errors: 0
[...]
uncorrectable_errors: 0
unverified_errors: 0
corrected_errors: 16 <<<
[CAUSE]
In the reworked scrub, we update the scrub progress inside
scrub_stripe_report_errors(), using various bitmaps to update the
result.
For example for csum_errors, we use bitmap_weight() of
stripe->csum_error_bitmap.
Unfortunately at that stage, all error bitmaps (except
init_error_bitmap) are the result of the latest repair attempt, thus if
the stripe is fully repaired, those error bitmaps will all be empty,
resulting the above output mismatch.
To fix this, record the number of errors into stripe->init_nr_*_errors.
Since we don't really care about where those errors are, we only need to
record the number of errors.
Then in scrub_stripe_report_errors(), use those initial numbers to
update the progress other than using the latest error bitmaps.
Fixes: e02ee89baa ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
With recent scrub rework, the scrub operation no longer respects the
read-only flag passed by "-r" option of "btrfs scrub start" command.
# mkfs.btrfs -f -d raid1 $dev1 $dev2
# mount $dev1 $mnt
# xfs_io -f -d -c "pwrite -b 128K -S 0xaa 0 128k" $mnt/file
# sync
# xfs_io -c "pwrite -S 0xff $phy1 64k" $dev1
# xfs_io -c "pwrite -S 0xff $((phy2 + 65536)) 64k" $dev2
# mount $dev1 $mnt -o ro
# btrfs scrub start -BrRd $mnt
Scrub device $dev1 (id 1) done
Scrub started: Tue Jun 6 09:59:14 2023
Status: finished
Duration: 0:00:00
[...]
corrected_errors: 16 <<< Still has corrupted sectors
last_physical: 1372585984
Scrub device $dev2 (id 2) done
Scrub started: Tue Jun 6 09:59:14 2023
Status: finished
Duration: 0:00:00
[...]
corrected_errors: 16 <<< Still has corrupted sectors
last_physical: 1351614464
# btrfs scrub start -BrRd $mnt
Scrub device $dev1 (id 1) done
Scrub started: Tue Jun 6 10:00:17 2023
Status: finished
Duration: 0:00:00
[...]
corrected_errors: 0 <<< No more errors
last_physical: 1372585984
Scrub device $dev2 (id 2) done
[...]
corrected_errors: 0 <<< No more errors
last_physical: 1372585984
[CAUSE]
In the newly reworked scrub code, repair is always submitted no matter
if we're doing a read-only scrub.
[FIX]
Fix it by skipping the write submission if the scrub is a read-only one.
Unfortunately for the report part, even for a read-only scrub we will
still report it as corrected errors, as we know it's repairable, even we
won't really submit the write.
Fixes: e02ee89baa ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
After commit e02ee89baa ("btrfs: scrub: switch scrub_simple_mirror()
to scrub_stripe infrastructure"), scrub no longer works for zoned device
at all.
Even an empty zoned btrfs cannot be replaced:
# mkfs.btrfs -f /dev/nvme0n1
# mount /dev/nvme0n1 /mnt/btrfs
# btrfs replace start -Bf 1 /dev/nvme0n2 /mnt/btrfs
Resetting device zones /dev/nvme1n1 (160 zones) ...
ERROR: ioctl(DEV_REPLACE_START) failed on "/mnt/btrfs/": Input/output error
And we can hit kernel crash related to that:
BTRFS info (device nvme1n1): host-managed zoned block device /dev/nvme3n1, 160 zones of 134217728 bytes
BTRFS info (device nvme1n1): dev_replace from /dev/nvme2n1 (devid 2) to /dev/nvme3n1 started
nvme3n1: Zone Management Append(0x7d) @ LBA 65536, 4 blocks, Zone Is Full (sct 0x1 / sc 0xb9) DNR
I/O error, dev nvme3n1, sector 786432 op 0xd:(ZONE_APPEND) flags 0x4000 phys_seg 3 prio class 2
BTRFS error (device nvme1n1): bdev /dev/nvme3n1 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
BUG: kernel NULL pointer dereference, address: 00000000000000a8
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
RIP: 0010:_raw_spin_lock_irqsave+0x1e/0x40
Call Trace:
<IRQ>
btrfs_lookup_ordered_extent+0x31/0x190
btrfs_record_physical_zoned+0x18/0x40
btrfs_simple_end_io+0xaf/0xc0
blk_update_request+0x153/0x4c0
blk_mq_end_request+0x15/0xd0
nvme_poll_cq+0x1d3/0x360
nvme_irq+0x39/0x80
__handle_irq_event_percpu+0x3b/0x190
handle_irq_event+0x2f/0x70
handle_edge_irq+0x7c/0x210
__common_interrupt+0x34/0xa0
common_interrupt+0x7d/0xa0
</IRQ>
<TASK>
asm_common_interrupt+0x22/0x40
[CAUSE]
Dev-replace reuses scrub code to iterate all extents and write the
existing content back to the new device.
And for zoned devices, we call fill_writer_pointer_gap() to make sure
all the writes into the zoned device is sequential, even if there may be
some gaps between the writes.
However we have several different bugs all related to zoned dev-replace:
- We are using ZONE_APPEND operation for metadata style write back
For zoned devices, btrfs has two ways to write data:
* ZONE_APPEND for data
This allows higher queue depth, but will not be able to know where
the write would land.
Thus needs to grab the real on-disk physical location in it's endio.
* WRITE for metadata
This requires single queue depth (new writes can only be submitted
after previous one finished), and all writes must be sequential.
For scrub, we go single queue depth, but still goes with ZONE_APPEND,
which requires btrfs_bio::inode being populated.
This is the cause of that crash.
- No correct tracing of write_pointer
After a write finished, we should forward sctx->write_pointer, or
fill_writer_pointer_gap() would not work properly and cause more
than necessary zero out, and fill the whole zone prematurely.
- Incorrect physical bytenr passed to fill_writer_pointer_gap()
In scrub_write_sectors(), one call site passes logical address, which
is completely wrong.
The other call site passes physical address of current sector, but
we should pass the physical address of the btrfs_bio we're submitting.
This is the cause of the -EIO errors.
[FIX]
- Do not use ZONE_APPEND for btrfs_submit_repair_write().
- Manually forward sctx->write_pointer after successful writeback
- Use the physical address of the to-be-submitted btrfs_bio for
fill_writer_pointer_gap()
Now zoned device replace would work as expected.
Reported-by: Christoph Hellwig <hch@lst.de>
Fixes: e02ee89baa ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we allow a block group not to be marked read-only for scrub.
But for RAID56 block groups if we require the block group to be
read-only, then we're allowed to use cached content from scrub stripe to
reduce unnecessary RAID56 reads.
So this patch would:
- Make btrfs_inc_block_group_ro() try harder
During my tests, for cases like btrfs/061 and btrfs/064, we can hit
ENOSPC from btrfs_inc_block_group_ro() calls during scrub.
The reason is if we only have one single data chunk, and trying to
scrub it, we won't have any space left for any newer data writes.
But this check should be done by the caller, especially for scrub
cases we only temporarily mark the chunk read-only.
And newer data writes would always try to allocate a new data chunk
when needed.
- Return error for scrub if we failed to mark a RAID56 chunk read-only
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
Even before the scrub rework, if we have some corrupted metadata failed
to be repaired during replace, we still continue replacing and let it
finish just as there is nothing wrong:
BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 started
BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
BTRFS warning (device dm-4): tree block 5578752 mirror 0 has bad csum, has 0x00000000 want 0xade80ca1
BTRFS warning (device dm-4): checksum error at logical 5578752 on dev /dev/mapper/test-scratch1, physical 5578752: metadata leaf (level 0) in tree 5
BTRFS warning (device dm-4): checksum error at logical 5578752 on dev /dev/mapper/test-scratch1, physical 5578752: metadata leaf (level 0) in tree 5
BTRFS error (device dm-4): bdev /dev/mapper/test-scratch1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad bytenr, has 0 want 5578752
BTRFS error (device dm-4): unable to fixup (regular) error at logical 5578752 on dev /dev/mapper/test-scratch1
BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 finished
This can lead to unexpected problems for the resulting filesystem.
[CAUSE]
Btrfs reuses scrub code path for dev-replace to iterate all dev extents.
But unlike scrub, dev-replace doesn't really bother to check the scrub
progress, which records all the errors found during replace.
And even if we check the progress, we cannot really determine which
errors are minor, which are critical just by the plain numbers.
(remember we don't treat metadata/data checksum error differently).
This behavior is there from the very beginning.
[FIX]
Instead of continuing the replace, just error out if we hit an
unrepaired metadata sector.
Now the dev-replace would be rejected with -EIO, to let the user know.
Although it also means, the filesystem has some metadata error which
cannot be repaired, the user would be upset anyway.
The new dmesg would look like this:
BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 started
BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
BTRFS error (device dm-4): unable to fixup (regular) error at logical 5570560 on dev /dev/mapper/test-scratch1 physical 5570560
BTRFS warning (device dm-4): header error at logical 5570560 on dev /dev/mapper/test-scratch1, physical 5570560: metadata leaf (level 0) in tree 5
BTRFS warning (device dm-4): header error at logical 5570560 on dev /dev/mapper/test-scratch1, physical 5570560: metadata leaf (level 0) in tree 5
BTRFS error (device dm-4): stripe 5570560 has unrepaired metadata sector at 5578752
BTRFS error (device dm-4): btrfs_scrub_dev(/dev/mapper/test-scratch1, 1, /dev/mapper/test-scratch2) failed -5
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since scrub path has been fully moved to scrub_stripe based facilities,
no more scrub_bio would be submitted.
Thus we can remove it completely, this involves:
- SCRUB_SECTORS_PER_BIO macro
- SCRUB_BIOS_PER_SCTX macro
- SCRUB_MAX_PAGES macro
- BTRFS_MAX_MIRRORS macro
- scrub_bio structure
- scrub_ctx::bios member
- scrub_ctx::curr member
- scrub_ctx::bios_in_flight member
- scrub_ctx::workers_pending member
- scrub_ctx::list_lock member
- scrub_ctx::list_wait member
- function scrub_bio_end_io_worker()
- function scrub_pending_bio_inc()
- function scrub_pending_bio_dec()
- function scrub_throttle()
- function scrub_submit()
- function scrub_find_csum()
- function drop_csum_range()
- Some unnecessary flush and scrub pauses
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Those two structures are used to represent a bunch of sectors for scrub,
but now they are fully replaced by scrub_stripe in one go, so we can
remove them. This involves:
- structure scrub_block
- structure scrub_sector
- structure scrub_page_private
- function attach_scrub_page_private()
- function detach_scrub_page_private()
Now we no longer need to use page::private to handle subpage.
- function alloc_scrub_block()
- function alloc_scrub_sector()
- function scrub_sector_get_page()
- function scrub_sector_get_page_offset()
- function scrub_sector_get_kaddr()
- function bio_add_scrub_sector()
- function scrub_checksum_data()
- function scrub_checksum_tree_block()
- function scrub_checksum_super()
- function scrub_check_fsid()
- function scrub_block_get()
- function scrub_block_put()
- function scrub_sector_get()
- function scrub_sector_put()
- function scrub_bio_end_io()
- function scrub_block_complete()
- function scrub_add_sector_to_rd_bio()
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The old scrub code has different entrance to verify the content, and
since we have removed the writeback path, now we can start removing the
re-check part, including:
- scrub_recover structure
- scrub_sector::recover member
- function scrub_setup_recheck_block()
- function scrub_recheck_block()
- function scrub_recheck_block_checksum()
- function scrub_repair_block_group_good_copy()
- function scrub_repair_sector_from_good_copy()
- function scrub_is_page_on_raid56()
- function full_stripe_lock()
- function search_full_stripe_lock()
- function get_full_stripe_logical()
- function insert_full_stripe_lock()
- function lock_full_stripe()
- function unlock_full_stripe()
- btrfs_block_group::full_stripe_locks_root member
- btrfs_full_stripe_locks_tree structure
This infrastructure is to ensure RAID56 scrub is properly handling
recovery and P/Q scrub correctly.
This is no longer needed, before P/Q scrub we will wait for all
the involved data stripes to be scrubbed first, and RAID56 code has
internal lock to ensure no race in the same full stripe.
- function scrub_print_warning()
- function scrub_get_recover()
- function scrub_put_recover()
- function scrub_handle_errored_block()
- function scrub_setup_recheck_block()
- function scrub_bio_wait_endio()
- function scrub_submit_raid56_bio_wait()
- function scrub_recheck_block_on_raid56()
- function scrub_recheck_block()
- function scrub_recheck_block_checksum()
- function scrub_repair_block_from_good_copy()
- function scrub_repair_sector_from_good_copy()
And two more functions exported temporarily for later cleanup:
- alloc_scrub_sector()
- alloc_scrub_block()
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since the whole scrub path has been switched to scrub_stripe based
solution, the old writeback path can be removed completely, which
involves:
- scrub_ctx::wr_curr_bio member
- scrub_ctx::flush_all_writes member
- function scrub_write_block_to_dev_replace()
- function scrub_write_sector_to_dev_replace()
- function scrub_add_sector_to_wr_bio()
- function scrub_wr_submit()
- function scrub_wr_bio_end_io()
- function scrub_wr_bio_end_io_worker()
And one more function needs to be exported temporarily:
- scrub_sector_get()
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The structure scrub_parity is used to indicate that some extents are
scrubbed for the purpose of RAID56 P/Q scrubbing.
Since the whole RAID56 P/Q scrubbing path has been replaced with new
scrub_stripe infrastructure, and we no longer need to use scrub_parity
to modify the behavior of data stripes, we can remove it completely.
This removal involves:
- scrub_parity_workers
Now only one worker would be utilized, scrub_workers, to do the read
and repair.
All writeback would happen at the main scrub thread.
- scrub_block::sparity member
- scrub_parity structure
- function scrub_parity_get()
- function scrub_parity_put()
- function scrub_free_parity()
- function __scrub_mark_bitmap()
- function scrub_parity_mark_sectors_error()
- function scrub_parity_mark_sectors_data()
These helpers are no longer needed, scrub_stripe has its bitmaps and
we can use bitmap helpers to get the error/data status.
- scrub_parity_bio_endio()
- scrub_parity_check_and_repair()
- function scrub_sectors_for_parity()
- function scrub_extent_for_parity()
- function scrub_raid56_data_stripe_for_parity()
- function scrub_raid56_parity()
The new code would reuse the scrub read-repair and writeback path.
Just skip the dev-replace phase.
And scrub_stripe infrastructure allows us to submit and wait for those
data stripes before scrubbing P/Q, without extra infrastructure.
The following two functions are temporarily exported for later cleanup:
- scrub_find_csum()
- scrub_add_sector_to_rd_bio()
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Implement the only missing part for scrub: RAID56 P/Q stripe scrub.
The workflow is pretty straightforward for the new function,
scrub_raid56_parity_stripe():
- Go through the regular scrub path for each data stripe
- Wait for the verification and repair to finish
- Writeback the repaired sectors to data stripes
- Make sure all stripes are properly repaired
If we have sectors unrepaired, we cannot continue, or we could further
corrupt the P/Q stripe.
- Submit the rbio for P/Q stripe
The dev-replace would be handled inside
raid56_parity_submit_scrub_rbio() path.
- Wait for the above bio to finish
Although the old code is no longer used, we still keep the declaration,
as the cleanup can be several times larger than this patch itself.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Switch scrub_simple_mirror() to the new scrub_stripe infrastructure.
Since scrub_simple_mirror() is the core part of scrub (only RAID56
P/Q stripes don't utilize it), we can get rid of a big chunk of code,
mostly scrub_extent(), scrub_sectors() and directly called functions.
There is a functionality change:
- Scrub speed throttle now only affects read on the scrubbing device
Writes (for repair and replace), and reads from other mirrors won't
be limited by the set limits.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The new helper, queue_scrub_stripe(), would try to queue a stripe for
scrub. If all stripes are already in use, we will submit all the
existing ones and wait for them to finish.
Currently we would queue up to 8 stripes, to enlarge the blocksize to
512KiB to improve the performance. Sectors repaired on zoned need to be
relocated instead of in-place fix.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The new helper, scrub_stripe_report_errors(), will report the result of
the scrub to system log.
The main reporting is done by introducing a new helper,
scrub_print_common_warning(), which is mostly the same content from
scrub_print_wanring(), but without the need for a scrub_block.
Since we're reporting the errors, it's the perfect time to update the
scrub stats too.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add a new helper, scrub_write_sectors(), to submit write bios for
specified sectors to the target disk.
There are several differences compared to read path:
- Utilize btrfs_submit_scrub_write()
Now we still rely on the @mirror_num based writeback, but the
requirement is also a little different than regular writeback or read,
thus we have to call btrfs_submit_scrub_write().
- We cannot write the full stripe back
We can only write the sectors we have. There will be two call sites
later, one for repaired sectors, one for all utilized sectors of
dev-replace.
Thus the callers should specify their own write_bitmap.
This function only submit the bios, will not wait for them unless for
zoned case.
Caller must explicitly wait for the IO to finish.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The new helper, scrub_stripe_read_repair_worker(), would handle the
read-repair part:
- Wait for the previous submitted read IO to finish
- Verify the contents of the stripe
- Go through the remaining mirrors, using as large blocksize as possible
At this stage, we just read out all the failed sectors from each
mirror and re-verify.
If no more failed sector, we can exit.
- Go through all mirrors again, sector-by-sector
This time, we read sector by sector, this is to address cases where
one bad sector mismatches the drive's internal checksum, and cause the
whole read range to fail.
We put this recovery method as the last resort, as sector-by-sector
reading is slow, and reading from other mirrors may have already fixed
the errors.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The new helper, scrub_verify_stripe(), shares the same main workflow of
the old scrub code.
The major differences are:
- How pages/page_offset is grabbed
Everything can be grabbed from scrub_stripe easily.
- When error report happens
Currently the helper only verifies the sectors, not really doing any
error reporting.
The error reporting would be done after we have done the repair.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The new helper, scrub_verify_one_metadata(), is almost the same as
scrub_checksum_tree_block().
The difference is in how we grab the pages from other structures.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The new helper will search the extent tree to find the first extent of a
logical range, then fill the sectors array by two loops:
- Loop 1 to fill common bits and metadata generation
- Loop 2 to fill csum data (only for data bgs)
This loop will use the new btrfs_lookup_csums_bitmap() to fill
the full csum buffer, and set scrub_sector_verification::csum.
With all the needed info filled by this function, later we only need to
submit and verify the stripe.
Here we temporarily export the helper to avoid warning on unused static
function.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This patch introduces the following structures:
- scrub_sector_verification
Contains all the needed info to verify one sector (data or metadata).
- scrub_stripe
Contains all needed members (mostly bitmap based) to scrub one stripe
(with a length of BTRFS_STRIPE_LEN).
The basic idea is, we keep the existing per-device scrub behavior, but
merge all the scrub_bio/scrub_bio into one generic structure, and read
the full BTRFS_STRIPE_LEN stripe on the first try.
This means we will read some sectors which are not scrub target, but
that's fine. At dev-replace time we only writeback the utilized and good
sectors, and for read-repair we only writeback the repaired sectors.
With every read submitted in BTRFS_STRIPE_LEN, the need for complex bio
form shaping would be gone.
Although to get the same performance of the old scrub behavior, we would
need to submit the initial read for two stripes at once.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is really no need to go through the super complex scrub_sectors()
to just handle super blocks. Introduce a dedicated function to handle
super block scrubbing.
This new function will introduce a behavior change, instead of using the
complex but concurrent scrub_bio system, here we just go submit-and-wait.
There is really not much sense to care the performance of super block
scrubbing. It only has 3 super blocks at most, and they are all
scattered around the devices already.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We don't need to pass the roots as arguments, reading them from the
rb-tree is cheap. Thus there is really not much need to pre-fetch it
and pass it all the way from scrub_stripe().
And we already have more than enough arguments in scrub_simple_mirror()
and scrub_simple_stripe(), it's better to remove them and only grab
those roots in scrub_simple_mirror().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The variable @path is no longer passed into any call sites after commit
18d30ab961 ("btrfs: scrub: use scrub_simple_mirror() to handle RAID56
data stripe scrub"), thus we can remove the variable completely.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
Although dev replace ioctl has a way to specify the mode on whether we
should read from the source device, it's not properly followed.
# mkfs.btrfs -f -d raid1 -m raid1 $dev1 $dev2
# mount $dev1 $mnt
# xfs_io -f -c "pwrite 0 32M" $mnt/file
# sync
# btrfs replace start -r -f 1 $dev3 $mnt
And one extra trace is added to scrub_submit(), showing the detail about
the bio:
btrfs-11569 [005] ... 37.0270: scrub_submit.part.0: devid=1 logical=22036480 phy=22036480 len=16384
btrfs-11569 [005] ... 37.0273: scrub_submit.part.0: devid=1 logical=30457856 phy=30457856 len=32768
btrfs-11569 [005] ... 37.0274: scrub_submit.part.0: devid=1 logical=30507008 phy=30507008 len=49152
btrfs-11569 [005] ... 37.0274: scrub_submit.part.0: devid=1 logical=30605312 phy=30605312 len=32768
btrfs-11569 [005] ... 37.0275: scrub_submit.part.0: devid=1 logical=30703616 phy=30703616 len=65536
btrfs-11569 [005] ... 37.0281: scrub_submit.part.0: devid=1 logical=298844160 phy=298844160 len=131072
...
btrfs-11569 [005] ... 37.0762: scrub_submit.part.0: devid=1 logical=322961408 phy=322961408 len=131072
btrfs-11569 [005] ... 37.0762: scrub_submit.part.0: devid=1 logical=323092480 phy=323092480 len=131072
One can see that all the reads are submitted to devid 1, even if we have
specified "-r" option to avoid reading from the source device.
[CAUSE]
The dev-replace read mode is only set but not followed by scrub code at
all. In fact, only common read path is properly following the read
mode, but scrub itself has its own read path, thus not following the
mode.
[FIX]
Here we enhance scrub_find_good_copy() to also follow the read mode.
The idea is pretty simple, in the first loop, we avoid the following
devices:
- Missing devices
This is the existing condition
- The source device if the replace wants to avoid it.
And if above loop found no candidate (e.g. replace a single device),
then we discard the 2nd condition, and try again.
Since we're here, also enhance the function scrub_find_good_copy() by:
- Remove the forward declaration
- Makes it return int
To indicates errors, e.g. no good mirror found.
- Add extra error messages
Now with the same trace, "btrfs replace start -r" works as expected:
btrfs-1213 [000] ... 991.9059: scrub_submit.part.0: devid=2 logical=22036480 phy=1064960 len=16384
btrfs-1213 [000] ... 991.9062: scrub_submit.part.0: devid=2 logical=30457856 phy=9486336 len=32768
btrfs-1213 [000] ... 991.9063: scrub_submit.part.0: devid=2 logical=30507008 phy=9535488 len=49152
btrfs-1213 [000] ... 991.9064: scrub_submit.part.0: devid=2 logical=30605312 phy=9633792 len=32768
btrfs-1213 [000] ... 991.9065: scrub_submit.part.0: devid=2 logical=30703616 phy=9732096 len=65536
btrfs-1213 [000] ... 991.9073: scrub_submit.part.0: devid=2 logical=298844160 phy=277872640 len=131072
btrfs-1213 [000] ... 991.9075: scrub_submit.part.0: devid=2 logical=298975232 phy=278003712 len=131072
btrfs-1213 [000] ... 991.9078: scrub_submit.part.0: devid=2 logical=299106304 phy=278134784 len=131072
...
btrfs-1213 [000] ... 991.9474: scrub_submit.part.0: devid=2 logical=318504960 phy=297533440 len=131072
btrfs-1213 [000] ... 991.9476: scrub_submit.part.0: devid=2 logical=318636032 phy=297664512 len=131072
btrfs-1213 [000] ... 991.9479: scrub_submit.part.0: devid=2 logical=318767104 phy=297795584 len=131072
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In btrfs_io_context structure, we have a pointer raid_map, which
indicates the logical bytenr for each stripe.
But considering we always call sort_parity_stripes(), the result
raid_map[] is always sorted, thus raid_map[0] is always the logical
bytenr of the full stripe.
So why we waste the space and time (for sorting) for raid_map?
This patch will replace btrfs_io_context::raid_map with a single u64
number, full_stripe_start, by:
- Replace btrfs_io_context::raid_map with full_stripe_start
- Replace call sites using raid_map[0] to use full_stripe_start
- Replace call sites using raid_map[i] to compare with nr_data_stripes.
The benefits are:
- Less memory wasted on raid_map
It's sizeof(u64) * num_stripes vs sizeof(u64).
It'll always save at least one u64, and the benefit grows larger with
num_stripes.
- No more weird alloc_btrfs_io_context() behavior
As there is only one fixed size + one variable length array.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For btrfs dev-replace, we have to duplicate writes to the source
device into the target device.
For non-RAID56, all writes into the same mapped ranges are sharing the
same content, thus they don't really need to bother anything.
(E.g. in btrfs_submit_bio() for non-RAID56 range we just submit the
same write to all involved devices).
But for RAID56, all stripes contain different content, thus we must
have a clear mapping of which stripe is duplicated from which original
stripe.
Currently we use a complex way using tgtdev_map[] array, e.g:
num_tgtdevs = 1
tgtdev_map[0] = 0 <- Means stripes[0] is not involved in replace.
tgtdev_map[1] = 3 <- Means stripes[1] is involved in replace,
and it's duplicated to stripes[3].
tgtdev_map[2] = 0 <- Means stripes[2] is not involved in replace.
But this is wasting some space, and ignores one important thing for
dev-replace, there is at most one running replace.
Thus we can change it to a fixed array to represent the mapping:
replace_nr_stripes = 1
replace_stripe_src = 1 <- Means stripes[1] is involved in replace.
thus the extra stripe is a copy of
stripes[1]
By this we can save some space for bioc on RAID56 chunks with many
devices. And we get rid of one variable sized array from bioc.
Thus the patch involves the following changes:
- Replace @num_tgtdevs and @tgtdev_map[] with @replace_nr_stripes
and @replace_stripe_src.
@num_tgtdevs is just renamed to @replace_nr_stripes.
While the mapping is completely changed.
- Add extra ASSERT()s for RAID56 code
- Only add two more extra stripes for dev-replace cases.
As we have an upper limit on how many dev-replace stripes we can have.
- Unify the behavior of handle_ops_on_dev_replace()
Previously handle_ops_on_dev_replace() go two different paths for
WRITE and GET_READ_MIRRORS.
Now unify them by always going the WRITE path first (with at most 2
replace stripes), then if we're doing GET_READ_MIRRORS and we have 2
extra stripes, just drop one stripe.
- Remove the @real_stripes argument from alloc_btrfs_io_context()
As we don't need the old variable length array any more.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are quite some div64 calls inside btrfs_map_block() and its
variants.
Such calls are for @stripe_nr, where @stripe_nr is the number of
stripes before our logical bytenr inside a chunk.
However we can eliminate such div64 calls by just reducing the width of
@stripe_nr from 64 to 32.
This can be done because our chunk size limit is already 10G, with fixed
stripe length 64K.
Thus a U32 is definitely enough to contain the number of stripes.
With such width reduction, we can get rid of slower div64, and extra
warning for certain 32bit arch.
This patch would do:
- Add a new tree-checker chunk validation on chunk length
Make sure no chunk can reach 256G, which can also act as a bitflip
checker.
- Reduce the width from u64 to u32 for @stripe_nr variables
- Replace unnecessary div64 calls with regular modulo and division
32bit division and modulo are much faster than 64bit operations, and
we are finally free of the div64 fear at least in those involved
functions.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently btrfs doesn't support stripe lengths other than 64KiB.
This is already set in the tree-checker.
There is really no meaning to record that fixed value in map_lookup for
now, and can all be replaced with BTRFS_STRIPE_LEN.
Furthermore we can use the fix stripe length to do the following
optimization:
- Use BTRFS_STRIPE_LEN_SHIFT to replace some 64bit division
Now we only need to do a right shift.
And the value of BTRFS_STRIPE_LEN itself is already too large for bit
shift, thus if we accidentally use BTRFS_STRIPE_LEN to do bit shift,
a compiler warning would be triggered.
Thus this bit shift optimization would be safe.
- Use BTRFS_STRIPE_LEN_MASK to calculate the offset inside a stripe
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There quite a few spelling mistakes as found using codespell. Fix them.
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
When debugging a scrub related metadata error, it turns out that our
metadata error reporting is not ideal.
The only 3 error messages are:
- BTRFS error (device dm-2): bdev /dev/mapper/test-scratch1 errs: wr 0, rd 0, flush 0, corrupt 0, gen 1
Showing we have metadata generation mismatch errors.
- BTRFS error (device dm-2): unable to fixup (regular) error at logical 7110656 on dev /dev/mapper/test-scratch1
Showing which tree blocks are corrupted.
- BTRFS warning (device dm-2): checksum/header error at logical 24772608 on dev /dev/mapper/test-scratch2, physical 3801088: metadata node (level 1) in tree 5
Showing which physical range the corrupted metadata is at.
We have to combine the above 3 to know we have a corrupted metadata with
generation mismatch.
And this is already the better case, if we have other problems, like
fsid mismatch, we can not even know the cause.
[CAUSE]
The problem is caused by the fact that, scrub_checksum_tree_block()
never outputs any error message.
It just return two bits for scrub: sblock->header_error, and
sblock->generation_error.
And later we report error in scrub_print_warning(), but unfortunately we
only have two bits, there is not really much thing we can done to print
any detailed errors.
[FIX]
This patch will do the following to enhance the error reporting of
metadata scrub:
- Add extra warning (ratelimited) for every error we hit
This can help us to distinguish the different types of errors.
Some errors can help us to know what's going wrong immediately,
like bytenr mismatch.
- Re-order the checks
Currently we check bytenr first, then immediately generation.
This can lead to false generation mismatch reports, while the fsid
mismatches.
Here is the new output for the bug I'm debugging (we forgot to
writeback tree blocks for commit roots):
BTRFS warning (device dm-2): tree block 24117248 mirror 1 has bad fsid, has b77cd862-f150-4c71-90ec-7baf0544d83f want 17df6abf-23cd-445f-b350-5b3e40bfd2fc
BTRFS warning (device dm-2): tree block 24117248 mirror 0 has bad fsid, has b77cd862-f150-4c71-90ec-7baf0544d83f want 17df6abf-23cd-445f-b350-5b3e40bfd2fc
Now we can immediately know it's some tree blocks didn't even get written
back, other than the original confusing generation mismatch.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Although we have an existing function, btrfs_lookup_csums_range(), to
find all data checksums for a range, it's based on a btrfs_ordered_sum
list.
For the incoming RAID56 data checksum verification at RMW time, we don't
want to waste time by allocating temporary memory.
So this patch will introduce a new helper, btrfs_lookup_csums_bitmap().
It will use bitmap based result, which will be a perfect fit for later
RAID56 usage.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
If dev-replace failed to re-construct its data/metadata, the kernel
message would be incorrect for the missing device:
BTRFS info (device dm-1): dev_replace from <missing disk> (devid 2) to /dev/mapper/test-scratch2 started
BTRFS error (device dm-1): failed to rebuild valid logical 38862848 for dev (efault)
Note the above "dev (efault)" of the second line.
While the first line is properly reporting "<missing disk>".
[CAUSE]
Although dev-replace is using btrfs_dev_name(), the heavy lifting work
is still done by scrub (scrub is reused by both dev-replace and regular
scrub).
Unfortunately scrub code never uses btrfs_dev_name() helper, as it's
only declared locally inside dev-replace.c.
[FIX]
Fix the output by:
- Move the btrfs_dev_name() helper to volumes.h
- Use btrfs_dev_name() to replace open-coded rcu_str_deref() calls
Only zoned code is not touched, as I'm not familiar with degraded
zoned code.
- Constify return value and parameter
Now the output looks pretty sane:
BTRFS info (device dm-1): dev_replace from <missing disk> (devid 2) to /dev/mapper/test-scratch2 started
BTRFS error (device dm-1): failed to rebuild valid logical 38862848 for dev <missing disk>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The public backref walking functions have quite a lot of arguments that
are passed down the call stack to find_parent_nodes(), the core function
of the backref walking code.
The next patches in series will need to add even arguments to these
functions that should be passed not only to find_parent_nodes(), but also
to other functions used by the later (directly or even lower in the call
stack).
So create a structure to hold all these arguments and state used by the
main backref walking function, find_parent_nodes(), and use it as the
argument for the public backref walking functions iterate_extent_inodes(),
btrfs_find_all_leafs() and btrfs_find_all_roots().
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The interface for find_parent_nodes() has two extent offset related
arguments:
1) One u64 pointer argument for the extent offset;
2) One boolean argument to tell if the extent offset should be ignored or
not.
These are confusing, becase the extent offset pointer can be NULL and in
some cases callers pass a NULL value as a way to tell the backref walking
code to ignore offsets in file extent items (and simply consider all file
extent items that point to the target data extent).
The boolean argument was added in commit c995ab3cda ("btrfs: add a flag
to iterate_inodes_from_logical to find all extent refs for uncompressed
extents"), but it was never really necessary, it was enough if it could
find a way to get a NULL value passed to the "extent_item_pos" argument of
find_parent_nodes(). The arguments are also passed to functions called
by find_parent_nodes() and respective helper functions, which further
makes everything more complicated than needed.
Then we have several backref walking related functions that end up calling
find_parent_nodes(), either directly or through some other function that
they call, and for many we have to use an "extent_item_pos" (u64) argument
and a boolean "ignore_offset" argument too.
This is confusing and not really necessary. So use a single argument to
specify the extent offset, as a simple u64 and not as a pointer, but
using a special value of (u64)-1, defined as a documented constant, to
indicate when the extent offset should be ignored.
This is also preparation work for the upcoming patches in the series that
add other arguments to find_parent_nodes() and other related functions
that use it.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently send does not do the best decisions when it comes to decide
between multiple clone sources, which results in clone operations for
partial extent ranges, which has the following disadvantages:
1) We get less shared extents at the destination;
2) We have to read more data during the send operation and emit more
write commands.
Besides not being optimal behaviour, it also breaks user expectations and
is often reported by users, with a recent example in the Link tag at the
bottom of this change log.
Part of the reason for this non-optimal behaviour is that the backref
walking code does not provide information about the length of the file
extent items that were found for each backref, so send is blind about
which backref is the best to chose as a cloning source.
The other existing reasons are just silliness, namely always prefering
the inode with the lowest number when multiple are found for the same
root and when we can clone from multiple roots, always prefer the send
root over any of the other clone roots. This does not make any sense
since any inode or root is fine and as good as any other inode/root.
Fix this by making backref walking pass information about the number of
bytes referenced by each file extent item and then have send's backref
callback pick the inode with the highest number of bytes for each root.
Finally select the root from which we can clone more bytes from.
Example reproducer:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdi
MNT=/mnt/sdi
mkfs.btrfs -f $DEV
mount $DEV $MNT
xfs_io -f -c "pwrite -S 0xab -b 2M 0 2M" $MNT/foo
cp --reflink=always $MNT/foo $MNT/bar
cp --reflink=always $MNT/foo $MNT/baz
sync
# Overwrite the second half of file foo.
xfs_io -c "pwrite -S 0xcd -b 1M 1M 1M" $MNT/foo
sync
echo
echo "*** fiemap in the original filesystem ***"
echo
xfs_io -c "fiemap -v" $MNT/foo
xfs_io -c "fiemap -v" $MNT/bar
xfs_io -c "fiemap -v" $MNT/baz
echo
btrfs filesystem du $MNT
btrfs subvolume snapshot -r $MNT $MNT/snap
btrfs send -f /tmp/send_stream $MNT/snap
umount $MNT
mkfs.btrfs -f $DEV &> /dev/null
mount $DEV $MNT
btrfs receive -f /tmp/send_stream $MNT
echo
echo "*** fiemap in the new filesystem ***"
echo
xfs_io -r -c "fiemap -v" $MNT/snap/foo
xfs_io -r -c "fiemap -v" $MNT/snap/bar
xfs_io -r -c "fiemap -v" $MNT/snap/baz
echo
btrfs filesystem du $MNT
rm -f /tmp/send_stream
rm -f /tmp/snap.fssum
umount $MNT
Before this change:
$ ./test.sh
(...)
*** fiemap in the original filesystem ***
/mnt/sdi/foo:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..2047]: 26624..28671 2048 0x2000
1: [2048..4095]: 30720..32767 2048 0x1
/mnt/sdi/bar:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..4095]: 26624..30719 4096 0x2001
/mnt/sdi/baz:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..4095]: 26624..30719 4096 0x2001
Total Exclusive Set shared Filename
2.00MiB 1.00MiB - /mnt/sdi/foo
2.00MiB 0.00B - /mnt/sdi/bar
2.00MiB 0.00B - /mnt/sdi/baz
6.00MiB 1.00MiB 2.00MiB /mnt/sdi
Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap'
At subvol /mnt/sdi/snap
At subvol snap
*** fiemap in the new filesystem ***
/mnt/sdi/snap/foo:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..4095]: 26624..30719 4096 0x2001
/mnt/sdi/snap/bar:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..2047]: 26624..28671 2048 0x2000
1: [2048..4095]: 30720..32767 2048 0x1
/mnt/sdi/snap/baz:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..2047]: 26624..28671 2048 0x2000
1: [2048..4095]: 32768..34815 2048 0x1
Total Exclusive Set shared Filename
2.00MiB 0.00B - /mnt/sdi/snap/foo
2.00MiB 1.00MiB - /mnt/sdi/snap/bar
2.00MiB 1.00MiB - /mnt/sdi/snap/baz
6.00MiB 2.00MiB - /mnt/sdi/snap
6.00MiB 2.00MiB 2.00MiB /mnt/sdi
We end up with two 1M extents that are not shared for files bar and baz.
After this change:
$ ./test.sh
(...)
*** fiemap in the original filesystem ***
/mnt/sdi/foo:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..2047]: 26624..28671 2048 0x2000
1: [2048..4095]: 30720..32767 2048 0x1
/mnt/sdi/bar:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..4095]: 26624..30719 4096 0x2001
/mnt/sdi/baz:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..4095]: 26624..30719 4096 0x2001
Total Exclusive Set shared Filename
2.00MiB 1.00MiB - /mnt/sdi/foo
2.00MiB 0.00B - /mnt/sdi/bar
2.00MiB 0.00B - /mnt/sdi/baz
6.00MiB 1.00MiB 2.00MiB /mnt/sdi
Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap'
At subvol /mnt/sdi/snap
At subvol snap
*** fiemap in the new filesystem ***
/mnt/sdi/snap/foo:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..4095]: 26624..30719 4096 0x2001
/mnt/sdi/snap/bar:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..2047]: 26624..28671 2048 0x2000
1: [2048..4095]: 30720..32767 2048 0x2001
/mnt/sdi/snap/baz:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..2047]: 26624..28671 2048 0x2000
1: [2048..4095]: 30720..32767 2048 0x2001
Total Exclusive Set shared Filename
2.00MiB 0.00B - /mnt/sdi/snap/foo
2.00MiB 0.00B - /mnt/sdi/snap/bar
2.00MiB 0.00B - /mnt/sdi/snap/baz
6.00MiB 0.00B - /mnt/sdi/snap
6.00MiB 0.00B 3.00MiB /mnt/sdi
Now there's a much better sharing, files bar and baz share 1M of the
extent of file foo and the second extent of files bar and baz is shared
between themselves.
This will later be turned into a test case for fstests.
Link: https://lore.kernel.org/linux-btrfs/20221008005704.795b44b0@crass-HP-ZBook-15-G2/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Move these out of ctree.h into scrub.h to cut down on code in ctree.h.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Move these prototypes out of ctree.h and into file-item.h.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All callers pas GFP_KERNEL as parameter so we can use it directly in
alloc_scrub_sector.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's only one caller that calls scrub_setup_recheck_block in the
memalloc_nofs_save/_restore protection so it's effectively already
GFP_NOFS and it's safe to use GFP_KERNEL.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is a large patch, but because they're all macros it's impossible to
split up. Simply copy all of the item accessors in ctree.h and paste
them in accessors.h, and then update any files to include the header so
everything compiles.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ reformat comments, style fixups ]
Signed-off-by: David Sterba <dsterba@suse.com>
We have several fs wide related helpers in ctree.h. The bulk of these
are the incompat flag test helpers, but there are things such as
btrfs_fs_closing() and the read only helpers that also aren't directly
related to the ctree code. Move these into a fs.h header, which will
serve as the location for file system wide related helpers.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is only used locally in scrub.c, move it out of ctree.h into
scrub.c.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we're doing device replace on a zoned filesystem and discover in
scrub_enumerate_chunks() that we don't have to copy the block group it is
unlocked before it gets skipped.
But as the block group hasn't yet been locked before it leads to a locking
imbalance. To fix this simply remove the unlock.
This was uncovered by fstests' testcase btrfs/163.
Fixes: 9283b9e09a ("btrfs: remove lock protection for BLOCK_GROUP_FLAG_TO_COPY")
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This reverts commit 786672e9e1.
[BUG]
Since commit 786672e9e1 ("btrfs: scrub: use larger block size for data
extent scrub"), btrfs scrub no longer reports errors if the corruption
is not in the first sector of a STRIPE_LEN.
The following script can expose the problem:
mkfs.btrfs -f $dev
mount $dev $mnt
xfs_io -f -c "pwrite -S 0xff 0 8k" $mnt/foobar
umount $mnt
# 13631488 is the logical bytenr of above 8K extent
btrfs-map-logical -l 13631488 -b 4096 $dev
mirror 1 logical 13631488 physical 13631488 device /dev/test/scratch1
# Corrupt the 2nd sector of that extent
xfs_io -f -c "pwrite -S 0x00 13635584 4k" $dev
mount $dev $mnt
btrfs scrub start -B $mnt
scrub done for 54e63f9f-0c30-4c84-a33b-5c56014629b7
Scrub started: Mon Nov 7 07:18:27 2022
Status: finished
Duration: 0:00:00
Total to scrub: 536.00MiB
Rate: 0.00B/s
Error summary: no errors found <<<
[CAUSE]
That offending commit enlarges the data extent scrub size from sector
size to BTRFS_STRIPE_LEN, to avoid extra scrub_block to be allocated.
But unfortunately the data extent scrub is still heavily relying on the
fact that there is only one scrub_sector per scrub_block.
Thus it will only check the first sector, and ignoring the remaining
sectors.
Furthermore the error reporting is not able to handle multiple sectors
either.
[FIX]
For now just revert the offending commit.
The consequence is just extra memory usage during scrub.
We will need a proper change to make the remaining data scrub path to
handle multiple sectors before we enlarging the data scrub size.
Reported-by: Li Zhang <zhanglikernel@gmail.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we have NOWAIT specified on our IOCB and we're writing into a
PREALLOC or NOCOW extent then we need to be able to tell
can_nocow_extent that we don't want to wait on any locks or metadata IO.
Fix can_nocow_extent to allow for NOWAIT.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Stefan Roesch <shr@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The parity raid write/recover functionality is currently not very well
abstracted from the bio submission and completion handling in volumes.c:
- the raid56 code directly completes the original btrfs_bio fed into
btrfs_submit_bio instead of dispatching back to volumes.c
- the raid56 code consumes the bioc and bio_counter references taken
by volumes.c, which also leads to special casing of the calls from
the scrub code into the raid56 code
To fix this up supply a bi_end_io handler that calls back into the
volumes.c machinery, which then puts the bioc, decrements the bio_counter
and completes the original bio, and updates the scrub code to also
take ownership of the bioc and bio_counter in all cases.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
[PROBLEM]
The existing scrub code for data extents always limit the block size to
sectorsize.
This causes quite some extra scrub_block being allocated:
(there is a data extent at logical bytenr 298844160, length 64KiB)
alloc_scrub_block: new block: logical=298844160 physical=298844160 mirror=1
alloc_scrub_block: new block: logical=298848256 physical=298848256 mirror=1
alloc_scrub_block: new block: logical=298852352 physical=298852352 mirror=1
alloc_scrub_block: new block: logical=298856448 physical=298856448 mirror=1
alloc_scrub_block: new block: logical=298860544 physical=298860544 mirror=1
alloc_scrub_block: new block: logical=298864640 physical=298864640 mirror=1
alloc_scrub_block: new block: logical=298868736 physical=298868736 mirror=1
alloc_scrub_block: new block: logical=298872832 physical=298872832 mirror=1
alloc_scrub_block: new block: logical=298876928 physical=298876928 mirror=1
alloc_scrub_block: new block: logical=298881024 physical=298881024 mirror=1
alloc_scrub_block: new block: logical=298885120 physical=298885120 mirror=1
alloc_scrub_block: new block: logical=298889216 physical=298889216 mirror=1
alloc_scrub_block: new block: logical=298893312 physical=298893312 mirror=1
alloc_scrub_block: new block: logical=298897408 physical=298897408 mirror=1
alloc_scrub_block: new block: logical=298901504 physical=298901504 mirror=1
alloc_scrub_block: new block: logical=298905600 physical=298905600 mirror=1
...
scrub_block_put: free block: logical=298844160 physical=298844160 len=4096 mirror=1
scrub_block_put: free block: logical=298848256 physical=298848256 len=4096 mirror=1
scrub_block_put: free block: logical=298852352 physical=298852352 len=4096 mirror=1
scrub_block_put: free block: logical=298856448 physical=298856448 len=4096 mirror=1
scrub_block_put: free block: logical=298860544 physical=298860544 len=4096 mirror=1
scrub_block_put: free block: logical=298864640 physical=298864640 len=4096 mirror=1
scrub_block_put: free block: logical=298868736 physical=298868736 len=4096 mirror=1
scrub_block_put: free block: logical=298872832 physical=298872832 len=4096 mirror=1
scrub_block_put: free block: logical=298876928 physical=298876928 len=4096 mirror=1
scrub_block_put: free block: logical=298881024 physical=298881024 len=4096 mirror=1
scrub_block_put: free block: logical=298885120 physical=298885120 len=4096 mirror=1
scrub_block_put: free block: logical=298889216 physical=298889216 len=4096 mirror=1
scrub_block_put: free block: logical=298893312 physical=298893312 len=4096 mirror=1
scrub_block_put: free block: logical=298897408 physical=298897408 len=4096 mirror=1
scrub_block_put: free block: logical=298901504 physical=298901504 len=4096 mirror=1
scrub_block_put: free block: logical=298905600 physical=298905600 len=4096 mirror=1
This behavior will waste a lot of memory, especially after we have moved
quite some members from scrub_sector to scrub_block.
[FIX]
To reduce the allocation of scrub_block, and to reduce memory usage, use
BTRFS_STRIPE_LEN instead of sectorsize as the block size to scrub data
extents.
This results only one scrub_block to be allocated for above data extent:
alloc_scrub_block: new block: logical=298844160 physical=298844160 mirror=1
scrub_block_put: free block: logical=298844160 physical=298844160 len=65536 mirror=1
This would greatly reduce the memory usage (even it's just transient)
for larger data extents scrub.
For above example, the memory usage would be:
Old: num_sectors * (sizeof(scrub_block) + sizeof(scrub_sector))
16 * (408 + 96) = 8065
New: sizeof(scrub_block) + num_sectors * sizeof(scrub_sector)
408 + 16 * 96 = 1944
A good reduction of 75.9%.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we store the following members in scrub_sector:
- logical
- physical
- physical_for_dev_replace
- dev
- mirror_num
However the current scrub code has ensured that scrub_blocks never cross
stripe boundary.
This is caused by the entry functions (scrub_simple_mirror,
scrub_simple_stripe), thus every scrub_block will not cross stripe
boundary.
Thus this makes it possible to move those members into scrub_block other
than putting them into scrub_sector.
This should save quite some memory, as a scrub_block can be as large as 64
sectors, even for metadata it's 16 sectors byte default.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Although scrub currently works for subpage (PAGE_SIZE > sectorsize) cases,
it will allocate one page for each scrub_sector, which can cause extra
unnecessary memory usage.
Utilize scrub_block::pages[] instead of allocating page for each
scrub_sector, this allows us to integrate larger extents while using
less memory.
For example, if our page size is 64K, sectorsize is 4K, and we got an
32K sized extent.
We will only allocate one page for scrub_block, and all 8 scrub sectors
will point to that page.
To do that properly, here we introduce several small helpers:
- scrub_page_get_logical()
Get the logical bytenr of a page.
We store the logical bytenr of the page range into page::private.
But for 32bit systems, their (void *) is not large enough to contain
a u64, so in that case we will need to allocate extra memory for it.
For 64bit systems, we can use page::private directly.
- scrub_block_get_logical()
Just get the logical bytenr of the first page.
- scrub_sector_get_page()
Return the page which the scrub_sector points to.
- scrub_sector_get_page_offset()
Return the offset inside the page which the scrub_sector points to.
- scrub_sector_get_kaddr()
Return the address which the scrub_sector points to.
Just a wrapper using scrub_sector_get_page() and
scrub_sector_get_page_offset()
- bio_add_scrub_sector()
Please note that, even with this patch, we're still allocating one page
for one sector for data extents.
This is because in scrub_extent() we split the data extent using
sectorsize.
The memory usage reduction will need extra work to make scrub to work
like data read to only use the correct sector(s).
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BACKGROUND]
Currently for scrub, we allocate one page for one sector, this is fine
for PAGE_SIZE == sectorsize support, but can waste extra memory for
subpage support.
[CODE CHANGE]
Make scrub_block contain all the pages, so if we're scrubbing an extent
sized 64K, and our page size is also 64K, we only need to allocate one
page.
[LIFESPAN CHANGE]
Since now scrub_sector no longer holds a page, but is using
scrub_block::pages[] instead, we have to ensure scrub_block has a longer
lifespan for write bio. The lifespan for read bio is already large
enough.
Now scrub_block will only be released after the write bio finished.
[COMING NEXT]
Currently we only added scrub_block::pages[] for this purpose, but
scrub_sector is still utilizing the old scrub_sector::page.
The switch will happen in the next patch.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The allocation and initialization is shared by 3 call sites, and we're
going to change the initialization of some members in the upcoming
patches.
So factor out the allocation and initialization of scrub_sector into a
helper, alloc_scrub_sector(), which will do the following work:
- Allocate the memory for scrub_sector
- Allocate a page for scrub_sector::page
- Initialize scrub_sector::refs to 1
- Attach the allocated scrub_sector to scrub_block
The attachment is bidirectional, which means scrub_block::sectorv[]
will be updated and scrub_sector::sblock will also be updated.
- Update scrub_block::sector_count and do extra sanity check on it
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Although there are only two callers, we are going to add some members
for scrub_block in the incoming patches. Factoring out the
initialization code will make later expansion easier.
One thing to note is, even scrub_handle_errored_block() doesn't utilize
scrub_block::refs, we still use alloc_scrub_block() to initialize
sblock::ref, allowing us to use scrub_block_put() to do cleanup.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In function scrub_handle_errored_block(), we use @sblocks_for_recheck
pointer to hold one scrub_block for each mirror, and uses kcalloc() to
allocate an array.
But this one pointer for an array is not readable due to the member
offsets done by addition and not [].
Change this pointer to struct scrub_block *[BTRFS_MAX_MIRRORS], this
will slightly increase the stack memory usage.
Since function scrub_handle_errored_block() won't get iterative calls,
this extra cost would completely be acceptable.
And since we're here, also set sblock->refs and use scrub_block_put() to
clean them up, as later we will add extra members in scrub_block, which
needs scrub_block_put() to clean them up.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are several sanity checks which are no longer possible to trigger
inside btrfs_scrub_dev().
Since we have mount time check against super block nodesize/sectorsize,
and our fixed macro is hardcoded to handle even the worst combination.
Thus those sanity checks are no longer needed, can be easily removed.
But this patch still uses some ASSERT()s as a safe net just in case we
change some features in the future to trigger those impossible
combinations.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We use this during device replace for zoned devices, we were simply
taking the lock because it was in a bit field and we needed the lock to
be safe with other modifications in the bitfield. With the bit helpers
we no longer require that locking.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We use a bit field in the btrfs_block_group for different flags, however
this is awkward because we have to hold the block_group->lock for any
modification of any of these fields, and makes the code clunky for a few
of these flags. Convert these to a properly flags setup so we can
utilize the bit helpers.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
The following script shows that, although scrub can detect super block
errors, it never tries to fix it:
mkfs.btrfs -f -d raid1 -m raid1 $dev1 $dev2
xfs_io -c "pwrite 67108864 4k" $dev2
mount $dev1 $mnt
btrfs scrub start -B $dev2
btrfs scrub start -Br $dev2
umount $mnt
The first scrub reports the super error correctly:
scrub done for f3289218-abd3-41ac-a630-202f766c0859
Scrub started: Tue Aug 2 14:44:11 2022
Status: finished
Duration: 0:00:00
Total to scrub: 1.26GiB
Rate: 0.00B/s
Error summary: super=1
Corrected: 0
Uncorrectable: 0
Unverified: 0
But the second read-only scrub still reports the same super error:
Scrub started: Tue Aug 2 14:44:11 2022
Status: finished
Duration: 0:00:00
Total to scrub: 1.26GiB
Rate: 0.00B/s
Error summary: super=1
Corrected: 0
Uncorrectable: 0
Unverified: 0
[CAUSE]
The comments already shows that super block can be easily fixed by
committing a transaction:
/*
* If we find an error in a super block, we just report it.
* They will get written with the next transaction commit
* anyway
*/
But the truth is, such assumption is not always true, and since scrub
should try to repair every error it found (except for read-only scrub),
we should really actively commit a transaction to fix this.
[FIX]
Just commit a transaction if we found any super block errors, after
everything else is done.
We cannot do this just after scrub_supers(), as
btrfs_commit_transaction() will try to pause and wait for the running
scrub, thus we can not call it with scrub_lock hold.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[PROBLEM]
Unlike data/metadata corruption, if scrub detected some error in the
super block, the only error message is from the updated device status:
BTRFS info (device dm-1): scrub: started on devid 2
BTRFS error (device dm-1): bdev /dev/mapper/test-scratch2 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
BTRFS info (device dm-1): scrub: finished on devid 2 with status: 0
This is not helpful at all.
[CAUSE]
Unlike data/metadata error reporting, there is no visible report in
kernel dmesg to report supper block errors.
In fact, return value of scrub_checksum_super() is intentionally
skipped, thus scrub_handle_errored_block() will never be called for
super blocks.
[FIX]
Make super block errors to output an error message, now the full
dmesg would looks like this:
BTRFS info (device dm-1): scrub: started on devid 2
BTRFS warning (device dm-1): super block error on device /dev/mapper/test-scratch2, physical 67108864
BTRFS error (device dm-1): bdev /dev/mapper/test-scratch2 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
BTRFS info (device dm-1): scrub: finished on devid 2 with status: 0
BTRFS info (device dm-1): scrub: started on devid 2
This fix involves:
- Move the super_errors reporting to scrub_handle_errored_block()
This allows the device status message to show after the super block
error message.
But now we no longer distinguish super block corruption and generation
mismatch, now all counted as corruption.
- Properly check the return value from scrub_checksum_super()
- Add extra super block error reporting for scrub_print_warning().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Always consume the bio and call the end_io handler on error instead of
returning an error and letting the caller handle it. This matches what
the block layer submission does and avoids any confusion on who
needs to handle errors.
Also use the proper bool type for the generic_io argument.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
The raid56 code assumes a fixed stripe length BTRFS_STRIPE_LEN but there
are functions passing it as arguments, this is not necessary. The fixed
value has been used for a long time and though the stripe length should
be configurable by super block member stripesize, this hasn't been
implemented and would require more changes so we don't need to keep this
code around until then.
Partially based on a patch from Qu Wenruo.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
[ update changelog ]
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For scrub_stripe() we can easily calculate the dev extent length as we
have the full info of the chunk.
Thus there is no need to pass @dev_extent_len from the caller, and we
introduce a helper, btrfs_calc_stripe_length(), to do the calculation
from extent_map structure.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Previously we use "unsigned long *" for those two bitmaps.
But since we only support fixed stripe length (64KiB, already checked in
tree-checker), "unsigned long *" is really a waste of memory, while we
can just use "unsigned long".
This saves us 8 bytes in total for scrub_parity.
To be extra safe, add an ASSERT() making sure calclulated @nsectors is
always smaller than BITS_PER_LONG.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[SUSPICIOUS CODE]
When refactoring scrub code, I noticed a very strange behavior around
scrub_remap_extent():
if (sctx->is_dev_replace)
scrub_remap_extent(fs_info, cur_logical, scrub_len,
&cur_physical, &target_dev, &cur_mirror);
As replace target is a 1:1 copy of the source device, thus physical
offset inside the target should be the same as physical inside source,
thus this remap call makes no sense to me.
[REAL FUNCTIONALITY]
After more investigation, the function name scrub_remap_extent()
doesn't tell anything of the truth, nor does its if () condition.
The real story behind this function is that, for scrub_pages() we never
expect missing device, even for replacing missing device.
What scrub_remap_extent() is really doing is to find a live mirror, and
make later scrub_pages() to read data from the good copy, other than
from the missing device and increase error counters unnecessarily.
[IMPROVEMENT]
We have no need to bother scrub_remap_extent() in scrub_simple_mirror()
at all, we only need to call it before we call scrub_pages().
And rename the function to scrub_find_live_copy(), add extra comments on
them.
By this we can remove one parameter from scrub_extent(), and reduce the
unnecessary calls to scrub_remap_extent() for regular replace.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since we have find_first_extent_item() to iterate the extent items of a
certain range, there is no need to use the open-coded version.
Replace the final scrub call site with find_first_extent_item().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently scrub_raid56_parity() has a large double loop, handling the
following things at the same time:
- Iterate each data stripe
- Iterate each extent item in one data stripe
Refactor this by:
- Introduce a new helper to handle data stripe iteration
The new helper is scrub_raid56_data_stripe_for_parity(), which
only has one while() loop handling the extent items inside the
data stripe.
The code is still mostly the same as the old code.
- Call cond_resched() for each extent
Previously we only call cond_resched() under a complex if () check.
I see no special reason to do that, and for other scrub functions,
like scrub_simple_mirror() we're already doing the same cond_resched()
after scrubbing one extent.
- Add more comments
Please note that, this patch is only to address the double loop, there
are incoming patches to do extra cleanup.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Although RAID56 has complex repair mechanism, which involves reading the
whole full stripe, but inside one data stripe, it's in fact no different
than SINGLE/RAID1.
The point here is, for data stripe we just check the csum for each
extent we hit. Only for csum mismatch case, our repair paths divide.
So we can still reuse scrub_simple_mirror() for RAID56 data stripes,
which saves quite some code.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since we have moved all other profiles handling into their own
functions, now the main body of scrub_stripe() is just handling RAID56
profiles.
There is no need to address other profiles in the main loop of
scrub_stripe(), so we can remove those dead branches.
Since we're here, also slightly change the timing of initialization of
variables like @offset, @increment and @logical.
Especially for @logical, we don't really need to initialize it for
btrfs_extent_root()/btrfs_csum_root(), we can use bg->start for that
purpose.
Now those variables are only initialize for RAID56 branches.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The new entrance will iterate through each data stripe which belongs to
the target device.
And since inside each data stripe, RAID0 is just SINGLE, while RAID10 is
just RAID1, we can reuse scrub_simple_mirror() to do the scrub properly.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The new helper, scrub_simple_mirror(), will scrub all extents inside a
range which only has simple mirror based duplication.
This covers every range of SINGLE/DUP/RAID1/RAID1C*, and inside each
data stripe for RAID0/RAID10.
Currently we will use this function to scrub SINGLE/DUP/RAID1/RAID1C*
profiles. As one can see, the new entrance for those simple-mirror
based profiles can be small enough (with comments, just reach 100
lines).
This function will be the basis for the incoming scrub refactor.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The new helper, find_first_extent_item(), will locate an extent item
(either EXTENT_ITEM or METADATA_ITEM) which covers any byte of the
search range.
This helper will later be used to refactor scrub code.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The variable @physical_end is the exclusive stripe end, currently it's
calculated using @physical + @dev_extent_len / map->stripe_len *
map->stripe_len.
And since at allocation time we ensured dev_extent_len is stripe_len
aligned, the result is the same as @physical + @dev_extent_len.
So this patch will just assign @physical and @physical_end early,
without using @nstripes.
This is especially helpful for any possible out: label user, as now we
only need to initialize @offset before going to out: label.
Since we're here, also make @physical_end constant.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All three scrub workqueues don't need ordered execution or thread
disabling threshold (as the thresh parameter is less than DFT_THRESHOLD).
Just switch to the normal workqueues that use a lot less resources,
especially in the work_struct vs btrfs_work structures.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This requires one extra parameter @pgoff for the function.
In the current code base, scrub is still one page per sector, thus the
new parameter will always be 0.
It needs the extra subpage scrub optimization code to fully take
advantage.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All the scrub bios go straight to the block device or the raid56 code,
none of which looks at the btrfs_bio.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The I/O in repair_io_failue is synchronous and doesn't need a btrfs_bio,
so just use an on-stack bio.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The I/O in repair_io_failue is synchronous and doesn't need a btrfs_bio,
so just use an on-stack bio.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Require a separate call to the integrity checking helpers from the
actual bio submission.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Explicit type casts are not necessary when it's void* to another pointer
type.
Signed-off-by: Yu Zhe <yuzhe@nfschina.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since the subpage support for scrub, one page no longer always represents
one sector, thus scrub_bio::pagev and scrub_bio::sector_count are no
longer accurate.
Rename them to scrub_bio::sectors and scrub_bio::sector_count respectively.
This also involves scrub_ctx::pages_per_bio and other macros involved.
Now the renaming of pages involved in scrub is be finished.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since the subpage support of scrub, scrub_sector is in fact just
representing one sector.
Thus the name scrub_page is no longer correct, rename it to
scrub_sector.
This also involves the following renames:
- spage -> sector
Normally we would just replace "page" with "sector" and result
something like "ssector".
But the repeating 's' is not really eye friendly.
So here we just simple use "sector", as there is nothing from MM layer
called "sector" to cause any confusion.
- scrub_parity::spages -> sectors_list
Normally we use plural to indicate an array, not a list.
Rename it to @sectors_list to be more explicit on the list part.
- Also reformat and update comments that get changed
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The following will be renamed in this patch:
- scrub_block::pagev -> sectors
- scrub_block::page_count -> sector_count
- SCRUB_MAX_PAGES_PER_BLOCK -> SCRUB_MAX_SECTORS_PER_BLOCK
- page_num -> sector_num to iterate scrub_block::sectors
For now scrub_page is not yet renamed to keep the patch reasonable and
it will be updated in a followup.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
During a scrub, or device replace, we can race with block group removal
and allocation and trigger the following assertion failure:
[7526.385524] assertion failed: cache->start == chunk_offset, in fs/btrfs/scrub.c:3817
[7526.387351] ------------[ cut here ]------------
[7526.387373] kernel BUG at fs/btrfs/ctree.h:3599!
[7526.388001] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
[7526.388970] CPU: 2 PID: 1158150 Comm: btrfs Not tainted 5.17.0-rc8-btrfs-next-114 #4
[7526.390279] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[7526.392430] RIP: 0010:assertfail.constprop.0+0x18/0x1a [btrfs]
[7526.393520] Code: f3 48 c7 c7 20 (...)
[7526.396926] RSP: 0018:ffffb9154176bc40 EFLAGS: 00010246
[7526.397690] RAX: 0000000000000048 RBX: ffffa0db8a910000 RCX: 0000000000000000
[7526.398732] RDX: 0000000000000000 RSI: ffffffff9d7239a2 RDI: 00000000ffffffff
[7526.399766] RBP: ffffa0db8a911e10 R08: ffffffffa71a3ca0 R09: 0000000000000001
[7526.400793] R10: 0000000000000001 R11: 0000000000000000 R12: ffffa0db4b170800
[7526.401839] R13: 00000003494b0000 R14: ffffa0db7c55b488 R15: ffffa0db8b19a000
[7526.402874] FS: 00007f6c99c40640(0000) GS:ffffa0de6d200000(0000) knlGS:0000000000000000
[7526.404038] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[7526.405040] CR2: 00007f31b0882160 CR3: 000000014b38c004 CR4: 0000000000370ee0
[7526.406112] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[7526.407148] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[7526.408169] Call Trace:
[7526.408529] <TASK>
[7526.408839] scrub_enumerate_chunks.cold+0x11/0x79 [btrfs]
[7526.409690] ? do_wait_intr_irq+0xb0/0xb0
[7526.410276] btrfs_scrub_dev+0x226/0x620 [btrfs]
[7526.410995] ? preempt_count_add+0x49/0xa0
[7526.411592] btrfs_ioctl+0x1ab5/0x36d0 [btrfs]
[7526.412278] ? __fget_files+0xc9/0x1b0
[7526.412825] ? kvm_sched_clock_read+0x14/0x40
[7526.413459] ? lock_release+0x155/0x4a0
[7526.414022] ? __x64_sys_ioctl+0x83/0xb0
[7526.414601] __x64_sys_ioctl+0x83/0xb0
[7526.415150] do_syscall_64+0x3b/0xc0
[7526.415675] entry_SYSCALL_64_after_hwframe+0x44/0xae
[7526.416408] RIP: 0033:0x7f6c99d34397
[7526.416931] Code: 3c 1c e8 1c ff (...)
[7526.419641] RSP: 002b:00007f6c99c3fca8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[7526.420735] RAX: ffffffffffffffda RBX: 00005624e1e007b0 RCX: 00007f6c99d34397
[7526.421779] RDX: 00005624e1e007b0 RSI: 00000000c400941b RDI: 0000000000000003
[7526.422820] RBP: 0000000000000000 R08: 00007f6c99c40640 R09: 0000000000000000
[7526.423906] R10: 00007f6c99c40640 R11: 0000000000000246 R12: 00007fff746755de
[7526.424924] R13: 00007fff746755df R14: 0000000000000000 R15: 00007f6c99c40640
[7526.425950] </TASK>
That assertion is relatively new, introduced with commit d04fbe19ae
("btrfs: scrub: cleanup the argument list of scrub_chunk()").
The block group we get at scrub_enumerate_chunks() can actually have a
start address that is smaller then the chunk offset we extracted from a
device extent item we got from the commit root of the device tree.
This is very rare, but it can happen due to a race with block group
removal and allocation. For example, the following steps show how this
can happen:
1) We are at transaction T, and we have the following blocks groups,
sorted by their logical start address:
[ bg A, start address A, length 1G (data) ]
[ bg B, start address B, length 1G (data) ]
(...)
[ bg W, start address W, length 1G (data) ]
--> logical address space hole of 256M,
there used to be a 256M metadata block group here
[ bg Y, start address Y, length 256M (metadata) ]
--> Y matches W's end offset + 256M
Block group Y is the block group with the highest logical address in
the whole filesystem;
2) Block group Y is deleted and its extent mapping is removed by the call
to remove_extent_mapping() made from btrfs_remove_block_group().
So after this point, the last element of the mapping red black tree,
its rightmost node, is the mapping for block group W;
3) While still at transaction T, a new data block group is allocated,
with a length of 1G. When creating the block group we do a call to
find_next_chunk(), which returns the logical start address for the
new block group. This calls returns X, which corresponds to the
end offset of the last block group, the rightmost node in the mapping
red black tree (fs_info->mapping_tree), plus one.
So we get a new block group that starts at logical address X and with
a length of 1G. It spans over the whole logical range of the old block
group Y, that was previously removed in the same transaction.
However the device extent allocated to block group X is not the same
device extent that was used by block group Y, and it also does not
overlap that extent, which must be always the case because we allocate
extents by searching through the commit root of the device tree
(otherwise it could corrupt a filesystem after a power failure or
an unclean shutdown in general), so the extent allocator is behaving
as expected;
4) We have a task running scrub, currently at scrub_enumerate_chunks().
There it searches for device extent items in the device tree, using
its commit root. It finds a device extent item that was used by
block group Y, and it extracts the value Y from that item into the
local variable 'chunk_offset', using btrfs_dev_extent_chunk_offset();
It then calls btrfs_lookup_block_group() to find block group for
the logical address Y - since there's currently no block group that
starts at that logical address, it returns block group X, because
its range contains Y.
This results in triggering the assertion:
ASSERT(cache->start == chunk_offset);
right before calling scrub_chunk(), as cache->start is X and
chunk_offset is Y.
This is more likely to happen of filesystems not larger than 50G, because
for these filesystems we use a 256M size for metadata block groups and
a 1G size for data block groups, while for filesystems larger than 50G,
we use a 1G size for both data and metadata block groups (except for
zoned filesystems). It could also happen on any filesystem size due to
the fact that system block groups are always smaller (32M) than both
data and metadata block groups, but these are not frequently deleted, so
much less likely to trigger the race.
So make scrub skip any block group with a start offset that is less than
the value we expect, as that means it's a new block group that was created
in the current transaction. It's pointless to continue and try to scrub
its extents, because scrub searches for extents using the commit root, so
it won't find any. For a device replace, skip it as well for the same
reasons, and we don't need to worry about the possibility of extents of
the new block group not being to the new device, because we have the write
duplication setup done through btrfs_map_block().
Fixes: d04fbe19ae ("btrfs: scrub: cleanup the argument list of scrub_chunk()")
CC: stable@vger.kernel.org # 5.17
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
increment is being initialized to map->stripe_len but this is never
read as increment is overwritten later on. Remove the redundant
initialization.
Cleans up the following clang-analyzer warning:
fs/btrfs/scrub.c:3193:6: warning: Value stored to 'increment' during its
initialization is never read [clang-analyzer-deadcode.DeadStores].
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>