With nowait becoming more pervasive throughout our codebase go ahead and
add a cached_state to try_lock_extent(). This allows us to be faster
about clearing the locked area if we have contention, and then gives us
the same optimization for unlock if we are able to lock the range.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
During a nowait buffered write, if we fail to balance dirty pages we exit
btrfs_buffered_write() without releasing the delalloc space reserved for
an extent, resulting in leaking space from the inode's block reserve.
So fix that by releasing the delalloc space for the extent when balancing
dirty pages fails.
Reported-by: kernel test robot <yujie.liu@intel.com>
Link: https://lore.kernel.org/all/202210111304.d369bc32-yujie.liu@intel.com
Fixes: 965f47aeb5 ("btrfs: make btrfs_buffered_write nowait compatible")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we are doing a buffered write in NOWAIT context and we can't reserve
metadata space due to -ENOSPC, then we should return -EAGAIN so that we
retry the write in a context allowed to block and do metadata reservation
with flushing, which might succeed this time due to the allowed flushing.
Returning -ENOSPC while in NOWAIT context simply makes some writes fail
with -ENOSPC when they would likely succeed after switching from NOWAIT
context to blocking context. That is unexpected behaviour and even fio
complains about it with a warning like this:
fio: io_u error on file /mnt/sdi/task_0.0.0: No space left on device: write offset=1535705088, buflen=65536
fio: pid=592630, err=28/file:io_u.c:1846, func=io_u error, error=No space left on device
The fio's job config is this:
[global]
bs=64K
ioengine=io_uring
iodepth=1
size=2236962133
nr_files=1
filesize=2236962133
direct=0
runtime=10
fallocate=posix
io_size=2236962133
group_reporting
time_based
[task_0]
rw=randwrite
directory=/mnt/sdi
numjobs=4
So fix this by returning -EAGAIN if we are in NOWAIT context and the
metadata reservation failed with -ENOSPC.
Fixes: 304e45acdb ("btrfs: plumb NOWAIT through the write path")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When doing a direct IO write using a iocb with nowait and dsync set, we
end up not syncing the file once the write completes.
This is because we tell iomap to not call generic_write_sync(), which
would result in calling btrfs_sync_file(), in order to avoid a deadlock
since iomap can call it while we are holding the inode's lock and
btrfs_sync_file() needs to acquire the inode's lock. The deadlock happens
only if the write happens synchronously, when iomap_dio_rw() calls
iomap_dio_complete() before it returns. Instead we do the sync ourselves
at btrfs_do_write_iter().
For a nowait write however we can end up not doing the sync ourselves at
at btrfs_do_write_iter() because the write could have been queued, and
therefore we get -EIOCBQUEUED returned from iomap in such case. That makes
us skip the sync call at btrfs_do_write_iter(), as we don't do it for
any error returned from btrfs_direct_write(). We can't simply do the call
even if -EIOCBQUEUED is returned, since that would block the task waiting
for IO, both for the data since there are bios still in progress as well
as potentially blocking when joining a log transaction and when syncing
the log (writing log trees, super blocks, etc).
So let iomap do the sync call itself and in order to avoid deadlocks for
the case of synchronous writes (without nowait), use __iomap_dio_rw() and
have ourselves call iomap_dio_complete() after unlocking the inode.
A test case will later be sent for fstests, after this is fixed in Linus'
tree.
Fixes: 51bd9563b6 ("btrfs: fix deadlock due to page faults during direct IO reads and writes")
Reported-by: Марк Коренберг <socketpair@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CAEmTpZGRKbzc16fWPvxbr6AfFsQoLmz-Lcg-7OgJOZDboJ+SGQ@mail.gmail.com/
CC: stable@vger.kernel.org # 6.0+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have several places that need to drop all the extent maps in a given
file range and then add a new extent map for that range. Currently they
call btrfs_drop_extent_map_range() to delete all extent maps in the range
and then keep trying to add the new extent map in a loop that keeps
retrying while the insertion of the new extent map fails with -EEXIST.
So instead of repeating this logic, add a helper to extent_map.c that
does these steps and name it btrfs_replace_extent_map_range(). Also add
a comment about why the retry loop is necessary.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function btrfs_drop_extent_cache() doesn't really belong at file.c
because what it does is drop a range of extent maps for a file range.
It directly allocates and manipulates extent maps, by dropping,
splitting and replacing them in an extent map tree, so it should be
located at extent_map.c, where all manipulations of an extent map tree
and its extent maps are supposed to be done.
So move it out of file.c and into extent_map.c. Additionally do the
following changes:
1) Rename it into btrfs_drop_extent_map_range(), as this makes it more
clear about what it does. The term "cache" is a bit confusing as it's
not widely used, "extent maps" or "extent mapping" is much more common;
2) Change its 'skip_pinned' argument from int to bool;
3) Turn several of its local variables from int to bool, since they are
used as booleans;
4) Move the declaration of some variables out of the function's main
scope and into the scopes where they are used;
5) Remove pointless assignment of false to 'modified' early in the while
loop, as later that variable is set and it's not used before that
second assignment;
6) Remove checks for NULL before calling free_extent_map().
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When dropping extent maps for a range, through btrfs_drop_extent_cache(),
if we find an extent map that starts before our target range and/or ends
before the target range, and we are not able to allocate extent maps for
splitting that extent map, then we don't fail and simply remove the entire
extent map from the inode's extent map tree.
This is generally fine, because in case anyone needs to access the extent
map, it can just load it again later from the respective file extent
item(s) in the subvolume btree. However, if that extent map is new and is
in the list of modified extents, then a fast fsync will miss the parts of
the extent that were outside our range (that needed to be split),
therefore not logging them. Fix that by marking the inode for a full
fsync. This issue was introduced after removing BUG_ON()s triggered when
the split extent map allocations failed, done by commit 7014cdb493
("Btrfs: btrfs_drop_extent_cache should never fail"), back in 2012, and
the fast fsync path already existed but was very recent.
Also, in the case where we could allocate extent maps for the split
operations but then fail to add a split extent map to the tree, mark the
inode for a full fsync as well. This is not supposed to ever fail, and we
assert that, but in case assertions are disabled (CONFIG_BTRFS_ASSERT is
not set), it's the correct thing to do to make sure a fast fsync will not
miss a new extent.
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Enable nowait async buffered writes in btrfs_do_write_iter() and
btrfs_file_open().
In this version encoded buffered writes have the optimization not
enabled. Encoded writes are enabled by using an ioctl. io_uring
currently does not support ioctls. This might be enabled in the future.
Performance results:
For fio the following results have been obtained with a queue depth of
1 and 4k block size (runtime 600 secs):
sequential writes:
without patch with patch libaio psync
iops: 55k 134k 117K 148K
bw: 221MB/s 538MB/s 469MB/s 592MB/s
clat: 15286ns 82ns 994ns 6340ns
For an io depth of 1, the new patch improves throughput by over two
times (compared to the existing behavior, where buffered writes are
processed by an io-worker process) and also the latency is considerably
reduced. To achieve the same or better performance with the existing
code an io depth of 4 is required. Increasing the iodepth further does
not lead to improvements.
The tests have been run like this:
./fio --name=seq-writers --ioengine=psync --iodepth=1 --rw=write \
--bs=4k --direct=0 --size=100000m --time_based --runtime=600 \
--numjobs=1 --filename=...
./fio --name=seq-writers --ioengine=io_uring --iodepth=1 --rw=write \
--bs=4k --direct=0 --size=100000m --time_based --runtime=600 \
--numjobs=1 --filename=...
./fio --name=seq-writers --ioengine=libaio --iodepth=1 --rw=write \
--bs=4k --direct=0 --size=100000m --time_based --runtime=600 \
--numjobs=1 --filename=...
Testing:
This patch has been tested with xfstests, fsx, fio. xfstests shows no new
diffs compared to running without the patch series.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Stefan Roesch <shr@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We need to avoid unconditionally calling balance_dirty_pages_ratelimited
as it could wait for some reason. Use balance_dirty_pages_ratelimited_flags
with the BDP_ASYNC in case the buffered write is nowait, returning
EAGAIN eventually.
It also moves the function after the again label. This can cause the
function to be called a bit later, but this should have no impact in the
real world.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Stefan Roesch <shr@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have everywhere setup for nowait, plumb NOWAIT through the write path.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Stefan Roesch <shr@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add the nowait parameter to lock_and_cleanup_extent_if_need(). If the
nowait parameter is specified we try to lock the extent in nowait mode.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Stefan Roesch <shr@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add nowait parameter to the prepare_pages function. In case nowait is
specified for an async buffered write request, do a nowait allocation or
return -EAGAIN.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Stefan Roesch <shr@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now all the helpers that btrfs_check_nocow_lock uses handle nowait, add
a nowait flag to btrfs_check_nocow_lock so it can be used by the write
path.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Stefan Roesch <shr@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In order to accommodate NOWAIT IOCB's we need to be able to do NO_FLUSH
data reservations, so plumb this through the delalloc reservation
system.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Stefan Roesch <shr@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we have NOWAIT specified on our IOCB and we're writing into a
PREALLOC or NOCOW extent then we need to be able to tell
can_nocow_extent that we don't want to wait on any locks or metadata IO.
Fix can_nocow_extent to allow for NOWAIT.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Stefan Roesch <shr@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is defined in btrfs_inode.h, and dereferences btrfs_root and
btrfs_fs_info, both of which aren't defined in btrfs_inode.h.
Additionally, in many places we already have root or fs_info, so this
helper often makes the code harder to read. So delete the helper and
simply open code it in the few places that we use it.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of taking up a whole argument to indicate we're clearing
everything in a range, simply add another EXTENT bit to control this,
and then update all the callers to drop this argument from the
clear_extent_bit variants.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have two variants of lock/unlock extent, one set that takes a cached
state, another that does not. This is slightly annoying, and generally
speaking there are only a few places where we don't have a cached state.
Simplify this by making lock_extent/unlock_extent the only variant and
make it take a cached state, then convert all the callers appropriately.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is only used in the case that we are clearing EXTENT_LOCKED, so
infer this value from the bits passed in instead of taking it as an
argument.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The current fiemap implementation does not scale very well with the number
of extents a file has. This is both because the main algorithm to find out
the extents has a high algorithmic complexity and because for each extent
we have to check if it's shared. This second part, checking if an extent
is shared, is significantly improved by the two previous patches in this
patchset, while the first part is improved by this specific patch. Every
now and then we get reports from users mentioning fiemap is too slow or
even unusable for files with a very large number of extents, such as the
two recent reports referred to by the Link tags at the bottom of this
change log.
To understand why the part of finding which extents a file has is very
inefficient, consider the example of doing a full ranged fiemap against
a file that has over 100K extents (normal for example for a file with
more than 10G of data and using compression, which limits the extent size
to 128K). When we enter fiemap at extent_fiemap(), the following happens:
1) Before entering the main loop, we call get_extent_skip_holes() to get
the first extent map. This leads us to btrfs_get_extent_fiemap(), which
in turn calls btrfs_get_extent(), to find the first extent map that
covers the file range [0, LLONG_MAX).
btrfs_get_extent() will first search the inode's extent map tree, to
see if we have an extent map there that covers the range. If it does
not find one, then it will search the inode's subvolume b+tree for a
fitting file extent item. After finding the file extent item, it will
allocate an extent map, fill it in with information extracted from the
file extent item, and add it to the inode's extent map tree (which
requires a search for insertion in the tree).
2) Then we enter the main loop at extent_fiemap(), emit the details of
the extent, and call again get_extent_skip_holes(), with a start
offset matching the end of the extent map we previously processed.
We end up at btrfs_get_extent() again, will search the extent map tree
and then search the subvolume b+tree for a file extent item if we could
not find an extent map in the extent tree. We allocate an extent map,
fill it in with the details in the file extent item, and then insert
it into the extent map tree (yet another search in this tree).
3) The second step is repeated over and over, until we have processed the
whole file range. Each iteration ends at btrfs_get_extent(), which
does a red black tree search on the extent map tree, then searches the
subvolume b+tree, allocates an extent map and then does another search
in the extent map tree in order to insert the extent map.
In the best scenario we have all the extent maps already in the extent
tree, and so for each extent we do a single search on a red black tree,
so we have a complexity of O(n log n).
In the worst scenario we don't have any extent map already loaded in
the extent map tree, or have very few already there. In this case the
complexity is much higher since we do:
- A red black tree search on the extent map tree, which has O(log n)
complexity, initially very fast since the tree is empty or very
small, but as we end up allocating extent maps and adding them to
the tree when we don't find them there, each subsequent search on
the tree gets slower, since it's getting bigger and bigger after
each iteration.
- A search on the subvolume b+tree, also O(log n) complexity, but it
has items for all inodes in the subvolume, not just items for our
inode. Plus on a filesystem with concurrent operations on other
inodes, we can block doing the search due to lock contention on
b+tree nodes/leaves.
- Allocate an extent map - this can block, and can also fail if we
are under serious memory pressure.
- Do another search on the extent maps red black tree, with the goal
of inserting the extent map we just allocated. Again, after every
iteration this tree is getting bigger by 1 element, so after many
iterations the searches are slower and slower.
- We will not need the allocated extent map anymore, so it's pointless
to add it to the extent map tree. It's just wasting time and memory.
In short we end up searching the extent map tree multiple times, on a
tree that is growing bigger and bigger after each iteration. And
besides that we visit the same leaf of the subvolume b+tree many times,
since a leaf with the default size of 16K can easily have more than 200
file extent items.
This is very inefficient overall. This patch changes the algorithm to
instead iterate over the subvolume b+tree, visiting each leaf only once,
and only searching in the extent map tree for file ranges that have holes
or prealloc extents, in order to figure out if we have delalloc there.
It will never allocate an extent map and add it to the extent map tree.
This is very similar to what was previously done for the lseek's hole and
data seeking features.
Also, the current implementation relying on extent maps for figuring out
which extents we have is not correct. This is because extent maps can be
merged even if they represent different extents - we do this to minimize
memory utilization and keep extent map trees smaller. For example if we
have two extents that are contiguous on disk, once we load the two extent
maps, they get merged into a single one - however if only one of the
extents is shared, we end up reporting both as shared or both as not
shared, which is incorrect.
This reproducer triggers that bug:
$ cat fiemap-bug.sh
#!/bin/bash
DEV=/dev/sdj
MNT=/mnt/sdj
mkfs.btrfs -f $DEV
mount $DEV $MNT
# Create a file with two 256K extents.
# Since there is no other write activity, they will be contiguous,
# and their extent maps merged, despite having two distinct extents.
xfs_io -f -c "pwrite -S 0xab 0 256K" \
-c "fsync" \
-c "pwrite -S 0xcd 256K 256K" \
-c "fsync" \
$MNT/foo
# Now clone only the second extent into another file.
xfs_io -f -c "reflink $MNT/foo 256K 0 256K" $MNT/bar
# Filefrag will report a single 512K extent, and say it's not shared.
echo
filefrag -v $MNT/foo
umount $MNT
Running the reproducer:
$ ./fiemap-bug.sh
wrote 262144/262144 bytes at offset 0
256 KiB, 64 ops; 0.0038 sec (65.479 MiB/sec and 16762.7030 ops/sec)
wrote 262144/262144 bytes at offset 262144
256 KiB, 64 ops; 0.0040 sec (61.125 MiB/sec and 15647.9218 ops/sec)
linked 262144/262144 bytes at offset 0
256 KiB, 1 ops; 0.0002 sec (1.034 GiB/sec and 4237.2881 ops/sec)
Filesystem type is: 9123683e
File size of /mnt/sdj/foo is 524288 (128 blocks of 4096 bytes)
ext: logical_offset: physical_offset: length: expected: flags:
0: 0.. 127: 3328.. 3455: 128: last,eof
/mnt/sdj/foo: 1 extent found
We end up reporting that we have a single 512K that is not shared, however
we have two 256K extents, and the second one is shared. Changing the
reproducer to clone instead the first extent into file 'bar', makes us
report a single 512K extent that is shared, which is algo incorrect since
we have two 256K extents and only the first one is shared.
This patch is part of a larger patchset that is comprised of the following
patches:
btrfs: allow hole and data seeking to be interruptible
btrfs: make hole and data seeking a lot more efficient
btrfs: remove check for impossible block start for an extent map at fiemap
btrfs: remove zero length check when entering fiemap
btrfs: properly flush delalloc when entering fiemap
btrfs: allow fiemap to be interruptible
btrfs: rename btrfs_check_shared() to a more descriptive name
btrfs: speedup checking for extent sharedness during fiemap
btrfs: skip unnecessary extent buffer sharedness checks during fiemap
btrfs: make fiemap more efficient and accurate reporting extent sharedness
The patchset was tested on a machine running a non-debug kernel (Debian's
default config) and compared the tests below on a branch without the
patchset versus the same branch with the whole patchset applied.
The following test for a large compressed file without holes:
$ cat fiemap-perf-test.sh
#!/bin/bash
DEV=/dev/sdi
MNT=/mnt/sdi
mkfs.btrfs -f $DEV
mount -o compress=lzo $DEV $MNT
# 40G gives 327680 128K file extents (due to compression).
xfs_io -f -c "pwrite -S 0xab -b 1M 0 20G" $MNT/foobar
umount $MNT
mount -o compress=lzo $DEV $MNT
start=$(date +%s%N)
filefrag $MNT/foobar
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "fiemap took $dur milliseconds (metadata not cached)"
start=$(date +%s%N)
filefrag $MNT/foobar
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "fiemap took $dur milliseconds (metadata cached)"
umount $MNT
Before patchset:
$ ./fiemap-perf-test.sh
(...)
/mnt/sdi/foobar: 327680 extents found
fiemap took 3597 milliseconds (metadata not cached)
/mnt/sdi/foobar: 327680 extents found
fiemap took 2107 milliseconds (metadata cached)
After patchset:
$ ./fiemap-perf-test.sh
(...)
/mnt/sdi/foobar: 327680 extents found
fiemap took 1214 milliseconds (metadata not cached)
/mnt/sdi/foobar: 327680 extents found
fiemap took 684 milliseconds (metadata cached)
That's a speedup of about 3x for both cases (no metadata cached and all
metadata cached).
The test provided by Pavel (first Link tag at the bottom), which uses
files with a large number of holes, was also used to measure the gains,
and it consists on a small C program and a shell script to invoke it.
The C program is the following:
$ cat pavels-test.c
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>
#define FILE_INTERVAL (1<<13) /* 8Kb */
long long interval(struct timeval t1, struct timeval t2)
{
long long val = 0;
val += (t2.tv_usec - t1.tv_usec);
val += (t2.tv_sec - t1.tv_sec) * 1000 * 1000;
return val;
}
int main(int argc, char **argv)
{
struct fiemap fiemap = {};
struct timeval t1, t2;
char data = 'a';
struct stat st;
int fd, off, file_size = FILE_INTERVAL;
if (argc != 3 && argc != 2) {
printf("usage: %s <path> [size]\n", argv[0]);
return 1;
}
if (argc == 3)
file_size = atoi(argv[2]);
if (file_size < FILE_INTERVAL)
file_size = FILE_INTERVAL;
file_size -= file_size % FILE_INTERVAL;
fd = open(argv[1], O_RDWR | O_CREAT | O_TRUNC, 0644);
if (fd < 0) {
perror("open");
return 1;
}
for (off = 0; off < file_size; off += FILE_INTERVAL) {
if (pwrite(fd, &data, 1, off) != 1) {
perror("pwrite");
close(fd);
return 1;
}
}
if (ftruncate(fd, file_size)) {
perror("ftruncate");
close(fd);
return 1;
}
if (fstat(fd, &st) < 0) {
perror("fstat");
close(fd);
return 1;
}
printf("size: %ld\n", st.st_size);
printf("actual size: %ld\n", st.st_blocks * 512);
fiemap.fm_length = FIEMAP_MAX_OFFSET;
gettimeofday(&t1, NULL);
if (ioctl(fd, FS_IOC_FIEMAP, &fiemap) < 0) {
perror("fiemap");
close(fd);
return 1;
}
gettimeofday(&t2, NULL);
printf("fiemap: fm_mapped_extents = %d\n",
fiemap.fm_mapped_extents);
printf("time = %lld us\n", interval(t1, t2));
close(fd);
return 0;
}
$ gcc -o pavels_test pavels_test.c
And the wrapper shell script:
$ cat fiemap-pavels-test.sh
#!/bin/bash
DEV=/dev/sdi
MNT=/mnt/sdi
mkfs.btrfs -f -O no-holes $DEV
mount $DEV $MNT
echo
echo "*********** 256M ***********"
echo
./pavels-test $MNT/testfile $((1 << 28))
echo
./pavels-test $MNT/testfile $((1 << 28))
echo
echo "*********** 512M ***********"
echo
./pavels-test $MNT/testfile $((1 << 29))
echo
./pavels-test $MNT/testfile $((1 << 29))
echo
echo "*********** 1G ***********"
echo
./pavels-test $MNT/testfile $((1 << 30))
echo
./pavels-test $MNT/testfile $((1 << 30))
umount $MNT
Running his reproducer before applying the patchset:
*********** 256M ***********
size: 268435456
actual size: 134217728
fiemap: fm_mapped_extents = 32768
time = 4003133 us
size: 268435456
actual size: 134217728
fiemap: fm_mapped_extents = 32768
time = 4895330 us
*********** 512M ***********
size: 536870912
actual size: 268435456
fiemap: fm_mapped_extents = 65536
time = 30123675 us
size: 536870912
actual size: 268435456
fiemap: fm_mapped_extents = 65536
time = 33450934 us
*********** 1G ***********
size: 1073741824
actual size: 536870912
fiemap: fm_mapped_extents = 131072
time = 224924074 us
size: 1073741824
actual size: 536870912
fiemap: fm_mapped_extents = 131072
time = 217239242 us
Running it after applying the patchset:
*********** 256M ***********
size: 268435456
actual size: 134217728
fiemap: fm_mapped_extents = 32768
time = 29475 us
size: 268435456
actual size: 134217728
fiemap: fm_mapped_extents = 32768
time = 29307 us
*********** 512M ***********
size: 536870912
actual size: 268435456
fiemap: fm_mapped_extents = 65536
time = 58996 us
size: 536870912
actual size: 268435456
fiemap: fm_mapped_extents = 65536
time = 59115 us
*********** 1G ***********
size: 1073741824
actual size: 536870912
fiemap: fm_mapped_extents = 116251
time = 124141 us
size: 1073741824
actual size: 536870912
fiemap: fm_mapped_extents = 131072
time = 119387 us
The speedup is massive, both on the first fiemap call and on the second
one as well, as his test creates files with many holes and small extents
(every extent follows a hole and precedes another hole).
For the 256M file we go from 4 seconds down to 29 milliseconds in the
first run, and then from 4.9 seconds down to 29 milliseconds again in the
second run, a speedup of 138x and 169x, respectively.
For the 512M file we go from 30.1 seconds down to 59 milliseconds in the
first run, and then from 33.5 seconds down to 59 milliseconds again in the
second run, a speedup of 510x and 568x, respectively.
For the 1G file, we go from 225 seconds down to 124 milliseconds in the
first run, and then from 217 seconds down to 119 milliseconds in the
second run, a speedup of 1815x and 1824x, respectively.
Reported-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Link: https://lore.kernel.org/linux-btrfs/21dd32c6-f1f9-f44a-466a-e18fdc6788a7@virtuozzo.com/
Reported-by: Dominique MARTINET <dominique.martinet@atmark-techno.com>
Link: https://lore.kernel.org/linux-btrfs/Ysace25wh5BbLd5f@atmark-techno.com/
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The current implementation of hole and data seeking for llseek does not
scale well in regards to the number of extents and the distance between
the start offset and the next hole or extent. This is due to a very high
algorithmic complexity. Often we also get reports of btrfs' hole and data
seeking (llseek) being too slow, such as at 2017's LSFMM (see the Link
tag at the bottom).
In order to better understand it, lets consider the case where the start
offset is 0, we are seeking for a hole and the file size is 16G. Between
file offset 0 and the first hole in the file there are 100K extents - this
is common for large files, specially if we have compression enabled, since
the maximum extent size is limited to 128K. The steps take by the main
loop of the current algorithm are the following:
1) We start by calling btrfs_get_extent_fiemap(), for file offset 0, which
calls btrfs_get_extent(). This will first lookup for an extent map in
the inode's extent map tree (a red black tree). If the extent map is
not loaded in memory, then it will do a lookup for the corresponding
file extent item in the subvolume's b+tree, create an extent map based
on the contents of the file extent item and then add the extent map to
the extent map tree of the inode;
2) The second iteration calls btrfs_get_extent_fiemap() again, this time
with a start offset matching the end offset of the previous extent.
Again, btrfs_get_extent() will first search the extent map tree, and
if it doesn't find an extent map there, it will again search in the
b+tree of the subvolume for a matching file extent item, build an
extent map based on the file extent item, and add the extent map to
to the extent map tree of the inode;
3) This repeats over and over until we find the first hole (when seeking
for holes) or until we find the first extent (when seeking for data).
If there no extent maps loaded in memory for each iteration, then on
each iteration we do 1 extent map tree search, 1 b+tree search, plus
1 more extent map tree traversal to insert an extent map - plus we
allocate memory for the extent map.
On each iteration we are growing the size of the extent map tree,
making each future search slower, and also visiting the same b+tree
leaves over and over again - taking into account with the default leaf
size of 16K we can fit more than 200 file extent items in a leaf - so
we can visit the same b+tree leaf 200+ times, on each visit walking
down a path from the root to the leaf.
So it's easy to see that what we have now doesn't scale well. Also, it
loads an extent map for every file extent item into memory, which is not
efficient - we should add extents maps only when doing IO (writing or
reading file data).
This change implements a new algorithm which scales much better, and
works like this:
1) We iterate over the subvolume's b+tree, visiting each leaf that has
file extent items once and only once;
2) For any file extent items found, that don't represent holes or prealloc
extents, it will not search the extent map tree - there's no need at
all for that - an extent map is just an in-memory representation of a
file extent item;
3) When a hole is found, or a prealloc extent, it will check if there's
delalloc for its range. For this it will search for EXTENT_DELALLOC
bits in the inode's io tree and check the extent map tree - this is
for accounting for unflushed delalloc and for flushed delalloc (the
period between running delalloc and ordered extent completion),
respectively. This is similar to what the current implementation does
when it finds a hole or prealloc extent, but without creating extent
maps and adding them to the extent map tree in case they are not
loaded in memory;
4) It never allocates extent maps, or adds extent maps to the inode's
extent map tree. This not only saves memory and time (from the tree
insertions and allocations), but also eliminates the possibility of
-ENOMEM due to allocating too many extent maps.
Part of this new code will also be used later for fiemap (which also
suffers similar scalability problems).
The following test example can be used to quickly measure the efficiency
before and after this patch:
$ cat test-seek-hole.sh
#!/bin/bash
DEV=/dev/sdi
MNT=/mnt/sdi
mkfs.btrfs -f $DEV
mount -o compress=lzo $DEV $MNT
# 16G file -> 131073 compressed extents.
xfs_io -f -c "pwrite -S 0xab -b 1M 0 16G" $MNT/foobar
# Leave a 1M hole at file offset 15G.
xfs_io -c "fpunch 15G 1M" $MNT/foobar
# Unmount and mount again, so that we can test when there's no
# metadata cached in memory.
umount $MNT
mount -o compress=lzo $DEV $MNT
# Test seeking for hole from offset 0 (hole is at offset 15G).
start=$(date +%s%N)
xfs_io -c "seek -h 0" $MNT/foobar
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "Took $dur milliseconds to seek first hole (metadata not cached)"
echo
start=$(date +%s%N)
xfs_io -c "seek -h 0" $MNT/foobar
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "Took $dur milliseconds to seek first hole (metadata cached)"
echo
umount $MNT
Before this change:
$ ./test-seek-hole.sh
(...)
Whence Result
HOLE 16106127360
Took 176 milliseconds to seek first hole (metadata not cached)
Whence Result
HOLE 16106127360
Took 17 milliseconds to seek first hole (metadata cached)
After this change:
$ ./test-seek-hole.sh
(...)
Whence Result
HOLE 16106127360
Took 43 milliseconds to seek first hole (metadata not cached)
Whence Result
HOLE 16106127360
Took 13 milliseconds to seek first hole (metadata cached)
That's about 4x faster when no metadata is cached and about 30% faster
when all metadata is cached.
In practice the differences may often be significantly higher, either due
to a higher number of extents in a file or because the subvolume's b+tree
is much bigger than in this example, where we only have one file.
Link: https://lwn.net/Articles/718805/
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Doing hole or data seeking on a file with a very large number of extents
can take a long time, and we have reports of it being too slow (such as
at LSFMM from 2017, see the Link below). So make it interruptible.
Link: https://lwn.net/Articles/718805/
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When logging an inode, if we detect the inode has a reference that
conflicts with some other inode that got renamed, we log that other inode
while holding the log mutex of the current inode. We then find out if
there are other inodes that conflict with the first conflicting inode,
and log them while under the log mutex of the original inode. This is
fine because the recursion can only happen once.
For the upcoming work where we directly log delayed items without flushing
them first to the subvolume tree, this recursion adds a lot of complexity
and it's hard to keep lockdep happy about it.
So collect a list of conflicting inodes and then log the inodes after
unlocking the log mutex of the inode we started with.
Also limit the maximum number of conflict inodes we log to 10, to avoid
spending too much time logging (and maybe allocating too many list
elements too), as typically we don't have more than 1 or 2 conflicting
inodes - if we go over the limit, simply fallback to a transaction commit.
It is possible to have a very long list of conflicting inodes to be
intentionally created by a user if he/she creates a very long succession
of renames like this:
(...)
rename E to F
rename D to E
rename C to D
rename B to C
rename A to B
touch A (create a new file named A)
fsync A
If that happened for a sequence of hundreds or thousands of renames, it
could massively slow down the logging and cause other secondary effects
like for example blocking other fsync operations and transaction commits
for a very long time (assuming it wouldn't run into -ENOSPC or -ENOMEM
first). However such cases are very uncommon to happen in practice,
nevertheless it's better to be prepared for them and avoid chaos.
Such long sequence of conflicting inodes could be created before this
change.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_insert_file_extent() is only ever used to insert holes, so rename
it and remove the redundant parameters.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With CONFIG_READ_ONLY_THP_FOR_FS, the Linux kernel supports using THPs for
read-only mmapped files, such as shared libraries. However, the kernel
makes no attempt to actually align those mappings on 2MB boundaries,
which makes it impossible to use those THPs most of the time. This issue
applies to general file mapping THP as well as existing setups using
CONFIG_READ_ONLY_THP_FOR_FS. This is easily fixed by using
thp_get_unmapped_area for the unmapped_area function in btrfs, which
is what ext2, ext4, fuse, and xfs all use.
Initially btrfs had been left out in commit 8c07fc452ac0 ("btrfs: fix
alignment of VMA for memory mapped files on THP") as btrfs does not support
DAX. However, commit 1854bc6e24 ("mm/readahead: Align file mappings
for non-DAX") removed the DAX requirement. We should now be able to call
thp_get_unmapped_area() for btrfs.
The problem can be seen in /proc/PID/smaps where THPeligible is set to 0
on mappings to eligible shared object files as shown below.
Before this patch:
7fc6a7e18000-7fc6a80cc000 r-xp 00000000 00:1e 199856
/usr/lib64/libcrypto.so.1.1.1k
Size: 2768 kB
THPeligible: 0
VmFlags: rd ex mr mw me
With this patch the library is mapped at a 2MB aligned address:
fbdfe200000-7fbdfe4b4000 r-xp 00000000 00:1e 199856
/usr/lib64/libcrypto.so.1.1.1k
Size: 2768 kB
THPeligible: 1
VmFlags: rd ex mr mw me
This fixes the alignment of VMAs for any mmap of a file that has the
rd and ex permissions and size >= 2MB. The VMA alignment and
THPeligible field for anonymous memory is handled separately and
is thus not effected by this change.
CC: stable@vger.kernel.org # 5.18+
Signed-off-by: Alexander Zhu <alexlzhu@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmMLY9oACgkQxWXV+ddt
WDue/w/8C3ZF8nLAI/sMrUpef2vSD62bvkKRRS45wzR2uod6yc0Fle9upzBssJQZ
qO3mQ53+QV+imCq7dY5mmtmwCUJNmbV5gbiMoF1OoV9TYtpZb/NIDklSX8se2eJX
drdAWQr2pYwU2M4duA4IEW08TvQ2TFh0JiqMi0aYM5apyL80uv3WniOu+xpRipA3
CMFAnDqayIgQ5OIsedqNy2MBLBopodUL5PZv/H7/g6KSKIuAZP9zgg1eKPfaz2t3
HO183ubmMbVtxgxeu+EnvCkg/iQ5hQiuGmyi0FLYMs/A6/NglwBnIJU5jCMQhcp6
HO5+FSUn6lHQetVzt2uHb9Lo+gX4FtCaHqVv1bXT62lnmDsZO1D7RVSg1Fra+CY+
jJmi8vvIbfbYlSZPZlJANoWe8ODOMVPk+pM4SFHlxOWGAY6HViX2RfHnIjNj5x9O
iDSTGvH6++nBF1Wu2/Xja/VKZ1avxRyTu2srW8JOF62j/tTU/EoPJcO9rxXOBBmC
Hi4UmJ690p3h5xZeeiyE8CmaSlPtfdCcnc/97FnusEjBao9O7THX0PCDVJX6VBkm
hVk01Z6+az1UNcD18KecvCpKYF/At4WpjaUGgf7q+LBfJXuXA6jfzOVDJMKV3TFd
n1yMFg+duGj90l8gT0aa/VQiBlUlnzQKz6ceqyKkPccwveNis6I=
=p8YV
-----END PGP SIGNATURE-----
Merge tag 'for-6.0-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Fixes:
- check that subvolume is writable when changing xattrs from security
namespace
- fix memory leak in device lookup helper
- update generation of hole file extent item when merging holes
- fix space cache corruption and potential double allocations; this
is a rare bug but can be serious once it happens, stable backports
and analysis tool will be provided
- fix error handling when deleting root references
- fix crash due to assert when attempting to cancel suspended device
replace, add message what to do if mount fails due to missing
replace item
Regressions:
- don't merge pages into bio if their page offset is not contiguous
- don't allow large NOWAIT direct reads, this could lead to short
reads eg. in io_uring"
* tag 'for-6.0-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: add info when mount fails due to stale replace target
btrfs: replace: drop assert for suspended replace
btrfs: fix silent failure when deleting root reference
btrfs: fix space cache corruption and potential double allocations
btrfs: don't allow large NOWAIT direct reads
btrfs: don't merge pages into bio if their page offset is not contiguous
btrfs: update generation of hole file extent item when merging holes
btrfs: fix possible memory leak in btrfs_get_dev_args_from_path()
btrfs: check if root is readonly while setting security xattr
When punching a hole into a file range that is adjacent with a hole and we
are not using the no-holes feature, we expand the range of the adjacent
file extent item that represents a hole, to save metadata space.
However we don't update the generation of hole file extent item, which
means a full fsync will not log that file extent item if the fsync happens
in a later transaction (since commit 7f30c07288 ("btrfs: stop copying
old file extents when doing a full fsync")).
For example, if we do this:
$ mkfs.btrfs -f -O ^no-holes /dev/sdb
$ mount /dev/sdb /mnt
$ xfs_io -f -c "pwrite -S 0xab 2M 2M" /mnt/foobar
$ sync
We end up with 2 file extent items in our file:
1) One that represents the hole for the file range [0, 2M), with a
generation of 7;
2) Another one that represents an extent covering the range [2M, 4M).
After that if we do the following:
$ xfs_io -c "fpunch 2M 2M" /mnt/foobar
We end up with a single file extent item in the file, which represents a
hole for the range [0, 4M) and with a generation of 7 - because we end
dropping the data extent for range [2M, 4M) and then update the file
extent item that represented the hole at [0, 2M), by increasing
length from 2M to 4M.
Then doing a full fsync and power failing:
$ xfs_io -c "fsync" /mnt/foobar
<power failure>
will result in the full fsync not logging the file extent item that
represents the hole for the range [0, 4M), because its generation is 7,
which is lower than the generation of the current transaction (8).
As a consequence, after mounting again the filesystem (after log replay),
the region [2M, 4M) does not have a hole, it still points to the
previous data extent.
So fix this by always updating the generation of existing file extent
items representing holes when we merge/expand them. This solves the
problem and it's the same approach as when we merge prealloc extents that
got written (at btrfs_mark_extent_written()). Setting the generation to
the current transaction's generation is also what we do when merging
the new hole extent map with the previous one or the next one.
A test case for fstests, covering both cases of hole file extent item
merging (to the left and to the right), will be sent soon.
Fixes: 7f30c07288 ("btrfs: stop copying old file extents when doing a full fsync")
CC: stable@vger.kernel.org # 5.18+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmLnyNUACgkQxWXV+ddt
WDt9vA/9HcF+v5EkknyW07tatTap/Hm/ZB86Z5OZi6ikwIEcHsWhp3rUICejm88e
GecDPIluDtCtyD6x4stuqkwOm22aDP5q2T9H6+gyw92ozyb436OV1Z8IrmftzXKY
EpZO70PHZT+E6E/WYvyoTmmoCrjib7YlqCWZZhSLUFpsqqlOInmHEH49PW6KvM4r
acUZ/RxHurKdmI3kNY6ECbAQl6CASvtTdYcVCx8fT2zN0azoLIQxpYa7n/9ca1R6
8WnYilCbLbNGtcUXvO2M3tMZ4/5kvxrwQsUn93ccCJYuiN0ASiDXbLZ2g4LZ+n56
JGu+y5v5oBwjpVf+46cuvnENP5BQ61594WPseiVjrqODWnPjN28XkcVC0XmPsiiZ
lszeHO2cuIrIFoCah8ELMl8usu8+qxfXmPxIXtPu9rEyKsDtOjxVYc8SMXqLp0qQ
qYtBoFm0JcZHqtZRpB+dhQ37/xXtH4ljUi/mI6x8iALVujeR273URs7yO9zgIdeW
uZoFtbwpHFLUk+TL7Ku82/zOXp3fCwtDpNmlYbxeMbea/be3ShjncM4+mYzvHYri
dYON2LFrq+mnRDqtIXTCaAYwX7zU8Y18Ev9QwlNll8dKlKwS89+jpqLoa+eVYy3c
/HitHFza70KxmOj4dvDVZlzDpPvl7kW1UBkmskg4u3jnNWzedkM=
=sS1q
-----END PGP SIGNATURE-----
Merge tag 'for-5.20-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"This brings some long awaited changes, the send protocol bump,
otherwise lots of small improvements and fixes. The main core part is
reworking bio handling, cleaning up the submission and endio and
improving error handling.
There are some changes outside of btrfs adding helpers or updating
API, listed at the end of the changelog.
Features:
- sysfs:
- export chunk size, in debug mode add tunable for setting its size
- show zoned among features (was only in debug mode)
- show commit stats (number, last/max/total duration)
- send protocol updated to 2
- new commands:
- ability write larger data chunks than 64K
- send raw compressed extents (uses the encoded data ioctls),
ie. no decompression on send side, no compression needed on
receive side if supported
- send 'otime' (inode creation time) among other timestamps
- send file attributes (a.k.a file flags and xflags)
- this is first version bump, backward compatibility on send and
receive side is provided
- there are still some known and wanted commands that will be
implemented in the near future, another version bump will be
needed, however we want to minimize that to avoid causing
usability issues
- print checksum type and implementation at mount time
- don't print some messages at mount (mentioned as people asked about
it), we want to print messages namely for new features so let's
make some space for that
- big metadata - this has been supported for a long time and is
not a feature that's worth mentioning
- skinny metadata - same reason, set by default by mkfs
Performance improvements:
- reduced amount of reserved metadata for delayed items
- when inserted items can be batched into one leaf
- when deleting batched directory index items
- when deleting delayed items used for deletion
- overall improved count of files/sec, decreased subvolume lock
contention
- metadata item access bounds checker micro-optimized, with a few
percent of improved runtime for metadata-heavy operations
- increase direct io limit for read to 256 sectors, improved
throughput by 3x on sample workload
Notable fixes:
- raid56
- reduce parity writes, skip sectors of stripe when there are no
data updates
- restore reading from on-disk data instead of using stripe cache,
this reduces chances to damage correct data due to RMW cycle
- refuse to replay log with unknown incompat read-only feature bit
set
- zoned
- fix page locking when COW fails in the middle of allocation
- improved tracking of active zones, ZNS drives may limit the
number and there are ENOSPC errors due to that limit and not
actual lack of space
- adjust maximum extent size for zone append so it does not cause
late ENOSPC due to underreservation
- mirror reading error messages show the mirror number
- don't fallback to buffered IO for NOWAIT direct IO writes, we don't
have the NOWAIT semantics for buffered io yet
- send, fix sending link commands for existing file paths when there
are deleted and created hardlinks for same files
- repair all mirrors for profiles with more than 1 copy (raid1c34)
- fix repair of compressed extents, unify where error detection and
repair happen
Core changes:
- bio completion cleanups
- don't double defer compression bios
- simplify endio workqueues
- add more data to btrfs_bio to avoid allocation for read requests
- rework bio error handling so it's same what block layer does,
the submission works and errors are consumed in endio
- when asynchronous bio offload fails fall back to synchronous
checksum calculation to avoid errors under writeback or memory
pressure
- new trace points
- raid56 events
- ordered extent operations
- super block log_root_transid deprecated (never used)
- mixed_backref and big_metadata sysfs feature files removed, they've
been default for sufficiently long time, there are no known users
and mixed_backref could be confused with mixed_groups
Non-btrfs changes, API updates:
- minor highmem API update to cover const arguments
- switch all kmap/kmap_atomic to kmap_local
- remove redundant flush_dcache_page()
- address_space_operations::writepage callback removed
- add bdev_max_segments() helper"
* tag 'for-5.20-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (163 commits)
btrfs: don't call btrfs_page_set_checked in finish_compressed_bio_read
btrfs: fix repair of compressed extents
btrfs: remove the start argument to check_data_csum and export
btrfs: pass a btrfs_bio to btrfs_repair_one_sector
btrfs: simplify the pending I/O counting in struct compressed_bio
btrfs: repair all known bad mirrors
btrfs: merge btrfs_dev_stat_print_on_error with its only caller
btrfs: join running log transaction when logging new name
btrfs: simplify error handling in btrfs_lookup_dentry
btrfs: send: always use the rbtree based inode ref management infrastructure
btrfs: send: fix sending link commands for existing file paths
btrfs: send: introduce recorded_ref_alloc and recorded_ref_free
btrfs: zoned: wait until zone is finished when allocation didn't progress
btrfs: zoned: write out partially allocated region
btrfs: zoned: activate necessary block group
btrfs: zoned: activate metadata block group on flush_space
btrfs: zoned: disable metadata overcommit for zoned
btrfs: zoned: introduce space_info->active_total_bytes
btrfs: zoned: finish least available block group on data bg allocation
btrfs: let can_allocate_chunk return error
...
One of the goals is to reduce the overhead of using ->read_iter()
and ->write_iter() instead of ->read()/->write(); new_sync_{read,write}()
has a surprising amount of overhead, in particular inside iocb_flags().
That's why the beginning of the series is in this pile; it's not directly
iov_iter-related, but it's a part of the same work...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCYurGOQAKCRBZ7Krx/gZQ
6ysyAP91lvBfMRepcxpd9kvtuzWkU8A3rfSziZZteEHANB9Q7QEAiPn2a2OjWkcZ
uAyUWfCkHCNx+dSMkEvUgR5okQ0exAM=
=9UCV
-----END PGP SIGNATURE-----
Merge tag 'pull-work.iov_iter-base' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs iov_iter updates from Al Viro:
"Part 1 - isolated cleanups and optimizations.
One of the goals is to reduce the overhead of using ->read_iter() and
->write_iter() instead of ->read()/->write().
new_sync_{read,write}() has a surprising amount of overhead, in
particular inside iocb_flags(). That's the explanation for the
beginning of the series is in this pile; it's not directly
iov_iter-related, but it's a part of the same work..."
* tag 'pull-work.iov_iter-base' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
first_iovec_segment(): just return address
iov_iter: massage calling conventions for first_{iovec,bvec}_segment()
iov_iter: first_{iovec,bvec}_segment() - simplify a bit
iov_iter: lift dealing with maxpages out of first_{iovec,bvec}_segment()
iov_iter_get_pages{,_alloc}(): cap the maxsize with MAX_RW_COUNT
iov_iter_bvec_advance(): don't bother with bvec_iter
copy_page_{to,from}_iter(): switch iovec variants to generic
keep iocb_flags() result cached in struct file
iocb: delay evaluation of IS_SYNC(...) until we want to check IOCB_DSYNC
struct file: use anonymous union member for rcuhead and llist
btrfs: use IOMAP_DIO_NOSYNC
teach iomap_dio_rw() to suppress dsync
No need of likely/unlikely on calls of check_copy_size()
Currently, for a direct IO write, if we need to fallback to buffered IO,
either to satisfy the whole write operation or just a part of it, we do
it in the current context even if it's a NOWAIT context. This is not ideal
because we currently don't have support for NOWAIT semantics in the
buffered IO path (we can block for several reasons), so we should instead
return -EAGAIN to the caller, so that it knows it should retry (the whole
operation or what's left of it) in a context where blocking is acceptable.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use simple bool type for the block reserve failfast status, there's
short to save space as there used to be int but there's no reason for
that.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The chained assignments may be convenient to write, but make readability
a bit worse as it's too easy to overlook that there are several values
set on the same line while this is rather an exception. Making it
consistent everywhere avoids surprises.
The pattern where inode times are initialized reuses the first value and
the order is mtime, ctime. In other blocks the assignments are expanded
so the order of variables is similar to the neighboring code.
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we will return 1 or -EAGAIN if we decide we need to commit
the transaction rather than sync the log. In practice this doesn't
really matter, we interpret any !0 and !BTRFS_NO_LOG_SYNC as needing to
commit the transaction. However this makes it hard to figure out what
the correct thing to do is.
Fix this up by defining BTRFS_LOG_FORCE_COMMIT and using this in all the
places where we want to force the transaction to be committed.
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmK4dV4ACgkQxWXV+ddt
WDs4uQ/7B0XqPK05NJntJfwnuIoT/yOreKf47wt/6DyFV3CDMFte/qzaZwthwu6P
F0GMpSYAlVszLlML5elvF9VXymlV+e+QROtbD6QCNLNW1IwHA7ZiF5fV/a1Rj930
XSuaDyVFPAK7892RR6yMQ20IeMBuvqiAhXWEzaIJ2tIcAHn+fP+VkY8Nc0aZj3iC
mI+ep4n93karDxmnHVGUxJTxAe0l/uNopx+fYBWQDj7HuoMLo0Cu+rAdv0gRIxi2
RWUBkR4e4PBwV1OFScwNCsljjt6bHdUHrtdB3fo5Hzu9cO5hHdL7NEsKB1K2w7rV
bgNuNqfj6Y4xUBchAfQO5CCJ9ISci5KoJ4RBpk6EprZR3QN40kN8GPlhi2519K7w
F3d8jolDDHlkqxIsqoe47MYOcSepNEadVNsiYKb0rM6doilfxyXiu6dtTFMrC8Vy
K2HDCdTyuIgw+TnwqT1puaUwxiIL8DFJf1CVyjwGuQ4UgaIEkHXKIsCssyyJ76Jh
QkWX1aeRldbfkVArJWHQWqDQopx9pFBz1gjlws0YjAsU5YijOOXva464P9Rxg+Gq
4pRlgnO48joQam9bRirP2Z6yhqa4O6jkzKDOXSYduAUYD7IMfpsYnz09wKS95jj+
QCrR7VmKnpQdsXg5a/mqyacfIH30ph002VywRxPiFM89Syd25yo=
=rUrf
-----END PGP SIGNATURE-----
Merge tag 'for-5.19-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- zoned relocation fixes:
- fix critical section end for extent writeback, this could lead
to out of order write
- prevent writing to previous data relocation block group if space
gets low
- reflink fixes:
- fix race between reflinking and ordered extent completion
- proper error handling when block reserve migration fails
- add missing inode iversion/mtime/ctime updates on each iteration
when replacing extents
- fix deadlock when running fsync/fiemap/commit at the same time
- fix false-positive KCSAN report regarding pid tracking for read locks
and data race
- minor documentation update and link to new site
* tag 'for-5.19-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Documentation: update btrfs list of features and link to readthedocs.io
btrfs: fix deadlock with fsync+fiemap+transaction commit
btrfs: don't set lock_owner when locking extent buffer for reading
btrfs: zoned: fix critical section of relocation inode writeback
btrfs: zoned: prevent allocation from previous data relocation BG
btrfs: do not BUG_ON() on failure to migrate space when replacing extents
btrfs: add missing inode updates on each iteration when replacing extents
btrfs: fix race between reflinking and ordered extent completion
We are hitting the following deadlock in production occasionally
Task 1 Task 2 Task 3 Task 4 Task 5
fsync(A)
start trans
start commit
falloc(A)
lock 5m-10m
start trans
wait for commit
fiemap(A)
lock 0-10m
wait for 5m-10m
(have 0-5m locked)
have btrfs_need_log_full_commit
!full_sync
wait_ordered_extents
finish_ordered_io(A)
lock 0-5m
DEADLOCK
We have an existing dependency of file extent lock -> transaction.
However in fsync if we tried to do the fast logging, but then had to
fall back to committing the transaction, we will be forced to call
btrfs_wait_ordered_range() to make sure all of our extents are updated.
This creates a dependency of transaction -> file extent lock, because
btrfs_finish_ordered_io() will need to take the file extent lock in
order to run the ordered extents.
Fix this by stopping the transaction if we have to do the full commit
and we attempted to do the fast logging. Then attach to the transaction
and commit it if we need to.
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At btrfs_replace_file_extents(), if we fail to migrate reserved metadata
space from the transaction block reserve into the local block reserve,
we trigger a BUG_ON(). This is because it should not be possible to have
a failure here, as we reserved more space when we started the transaction
than the space we want to migrate. However having a BUG_ON() is way too
drastic, we can perfectly handle the failure and return the error to the
caller. So just do that instead, and add a WARN_ON() to make it easier
to notice the failure if it ever happens (which is particularly useful
for fstests, and the warning will trigger a failure of a test case).
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When replacing file extents, called during fallocate, hole punching,
clone and deduplication, we may not be able to replace/drop all the
target file extent items with a single transaction handle. We may get
-ENOSPC while doing it, in which case we release the transaction handle,
balance the dirty pages of the btree inode, flush delayed items and get
a new transaction handle to operate on what's left of the target range.
By dropping and replacing file extent items we have effectively modified
the inode, so we should bump its iversion and update its mtime/ctime
before we update the inode item. This is because if the transaction
we used for partially modifying the inode gets committed by someone after
we release it and before we finish the rest of the range, a power failure
happens, then after mounting the filesystem our inode has an outdated
iversion and mtime/ctime, corresponding to the values it had before we
changed it.
So add the missing iversion and mtime/ctime updates.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
New helper to be used instead of direct checks for IOCB_DSYNC:
iocb_is_dsync(iocb). Checks converted, which allows to avoid
the IS_SYNC(iocb->ki_filp->f_mapping->host) part (4 cache lines)
from iocb_flags() - it's checked in iocb_is_dsync() instead
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
... instead of messing with iocb flags
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
- Appoint myself page cache maintainer
- Fix how scsicam uses the page cache
- Use the memalloc_nofs_save() API to replace AOP_FLAG_NOFS
- Remove the AOP flags entirely
- Remove pagecache_write_begin() and pagecache_write_end()
- Documentation updates
- Convert several address_space operations to use folios:
- is_dirty_writeback
- readpage becomes read_folio
- releasepage becomes release_folio
- freepage becomes free_folio
- Change filler_t to require a struct file pointer be the first argument
like ->read_folio
-----BEGIN PGP SIGNATURE-----
iQEzBAABCgAdFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAmKNMDUACgkQDpNsjXcp
gj4/mwf/bpHhXH4ZoNIvtUpTF6rZbqeffmc0VrbxCZDZ6igRnRPglxZ9H9v6L53O
7B0FBQIfxgNKHZpdqGdOkv8cjg/GMe/HJUbEy5wOakYPo4L9fZpHbDZ9HM2Eankj
xBqLIBgBJ7doKr+Y62DAN19TVD8jfRfVtli5mqXJoNKf65J7BkxljoTH1L3EXD9d
nhLAgyQjR67JQrT/39KMW+17GqLhGefLQ4YnAMONtB6TVwX/lZmigKpzVaCi4r26
bnk5vaR/3PdjtNxIoYvxdc71y2Eg05n2jEq9Wcy1AaDv/5vbyZUlZ2aBSaIVbtKX
WfrhN9O3L0bU5qS7p9PoyfLc9wpq8A==
=djLv
-----END PGP SIGNATURE-----
Merge tag 'folio-5.19' of git://git.infradead.org/users/willy/pagecache
Pull page cache updates from Matthew Wilcox:
- Appoint myself page cache maintainer
- Fix how scsicam uses the page cache
- Use the memalloc_nofs_save() API to replace AOP_FLAG_NOFS
- Remove the AOP flags entirely
- Remove pagecache_write_begin() and pagecache_write_end()
- Documentation updates
- Convert several address_space operations to use folios:
- is_dirty_writeback
- readpage becomes read_folio
- releasepage becomes release_folio
- freepage becomes free_folio
- Change filler_t to require a struct file pointer be the first
argument like ->read_folio
* tag 'folio-5.19' of git://git.infradead.org/users/willy/pagecache: (107 commits)
nilfs2: Fix some kernel-doc comments
Appoint myself page cache maintainer
fs: Remove aops->freepage
secretmem: Convert to free_folio
nfs: Convert to free_folio
orangefs: Convert to free_folio
fs: Add free_folio address space operation
fs: Convert drop_buffers() to use a folio
fs: Change try_to_free_buffers() to take a folio
jbd2: Convert release_buffer_page() to use a folio
jbd2: Convert jbd2_journal_try_to_free_buffers to take a folio
reiserfs: Convert release_buffer_page() to use a folio
fs: Remove last vestiges of releasepage
ubifs: Convert to release_folio
reiserfs: Convert to release_folio
orangefs: Convert to release_folio
ocfs2: Convert to release_folio
nilfs2: Remove comment about releasepage
nfs: Convert to release_folio
jfs: Convert to release_folio
...
Add a wrapper around iomap_dio_rw that keeps the direct I/O internals
isolated in inode.c.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When doing a NOWAIT direct IO write, if we can NOCOW then it means we can
proceed with the non-blocking, NOWAIT path. However reserving the metadata
space and qgroup meta space can often result in blocking - flushing
delalloc, wait for ordered extents to complete, trigger transaction
commits, etc, going against the semantics of a NOWAIT write.
So make the NOWAIT write path to try to reserve all the metadata it needs
without resulting in a blocking behaviour - if we get -ENOSPC or -EDQUOT
then return -EAGAIN to make the caller fallback to a blocking direct IO
write.
This is part of a patchset comprised of the following patches:
btrfs: avoid blocking on page locks with nowait dio on compressed range
btrfs: avoid blocking nowait dio when locking file range
btrfs: avoid double nocow check when doing nowait dio writes
btrfs: stop allocating a path when checking if cross reference exists
btrfs: free path at can_nocow_extent() before checking for checksum items
btrfs: release path earlier at can_nocow_extent()
btrfs: avoid blocking when allocating context for nowait dio read/write
btrfs: avoid blocking on space revervation when doing nowait dio writes
The following test was run before and after applying this patchset:
$ cat io-uring-nodatacow-test.sh
#!/bin/bash
DEV=/dev/sdc
MNT=/mnt/sdc
MOUNT_OPTIONS="-o ssd -o nodatacow"
MKFS_OPTIONS="-R free-space-tree -O no-holes"
NUM_JOBS=4
FILE_SIZE=8G
RUN_TIME=300
cat <<EOF > /tmp/fio-job.ini
[io_uring_rw]
rw=randrw
fsync=0
fallocate=posix
group_reporting=1
direct=1
ioengine=io_uring
iodepth=64
bssplit=4k/20:8k/20:16k/20:32k/10:64k/10:128k/5:256k/5:512k/5:1m/5
filesize=$FILE_SIZE
runtime=$RUN_TIME
time_based
filename=foobar
directory=$MNT
numjobs=$NUM_JOBS
thread
EOF
echo performance | \
tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
umount $MNT &> /dev/null
mkfs.btrfs -f $MKFS_OPTIONS $DEV &> /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
fio /tmp/fio-job.ini
umount $MNT
The test was run a 12 cores box with 64G of ram, using a non-debug kernel
config (Debian's default config) and a spinning disk.
Result before the patchset:
READ: bw=407MiB/s (427MB/s), 407MiB/s-407MiB/s (427MB/s-427MB/s), io=119GiB (128GB), run=300175-300175msec
WRITE: bw=407MiB/s (427MB/s), 407MiB/s-407MiB/s (427MB/s-427MB/s), io=119GiB (128GB), run=300175-300175msec
Result after the patchset:
READ: bw=436MiB/s (457MB/s), 436MiB/s-436MiB/s (457MB/s-457MB/s), io=128GiB (137GB), run=300044-300044msec
WRITE: bw=435MiB/s (456MB/s), 435MiB/s-435MiB/s (456MB/s-456MB/s), io=128GiB (137GB), run=300044-300044msec
That's about +7.2% throughput for reads and +6.9% for writes.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When doing a NOWAIT direct IO write we are checking twice if we can COW
into the target file range using can_nocow_extent() - once at the very
beginning of the write path, at btrfs_write_check() via
check_nocow_nolock(), and later again at btrfs_get_blocks_direct_write().
The can_nocow_extent() function does a lot of expensive things - searching
for the file extent item in the inode's subvolume tree, searching for the
extent item in the extent tree, checking delayed references, etc, so it
isn't a very cheap call.
We can remove the first check at btrfs_write_check(), and add there a
quick check to verify if the inode has the NODATACOW or PREALLOC flags,
and quickly bail out if it doesn't have neither of those flags, as that
means we have to COW and therefore can't comply with the NOWAIT semantics.
After this we do only one call to can_nocow_extent(), while we are at
btrfs_get_blocks_direct_write(), where we have already locked the file
range and we did a try lock on the range before, at
btrfs_dio_iomap_begin() (since the previous patch in the series).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have four different scenarios where we don't expect to find ordered
extents after locking a file range:
1) During plain fallocate;
2) During hole punching;
3) During zero range;
4) During reflinks (both cloning and deduplication).
This is because in all these cases we follow the pattern:
1) Lock the inode's VFS lock in exclusive mode;
2) Lock the inode's i_mmap_lock in exclusive node, to serialize with
mmap writes;
3) Flush delalloc in a file range and wait for all ordered extents
to complete - both done through btrfs_wait_ordered_range();
4) Lock the file range in the inode's io_tree.
So add a helper that asserts that we don't have ordered extents for a
given range. Make the four scenarios listed above use this helper after
locking the respective file range.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For hole punching and zero range we have this loop that checks if we have
ordered extents after locking the file range, and if so unlock the range,
wait for ordered extents, and retry until we don't find more ordered
extents.
This logic was needed in the past because:
1) Direct IO writes within the i_size boundary did not take the inode's
VFS lock. This was because that lock used to be a mutex, then some
years ago it was switched to a rw semaphore (commit 9902af79c0
("parallel lookups: actual switch to rwsem")), and then btrfs was
changed to take the VFS inode's lock in shared mode for writes that
don't cross the i_size boundary (commit e9adabb971 ("btrfs: use
shared lock for direct writes within EOF"));
2) We could race with memory mapped writes, because memory mapped writes
don't acquire the inode's VFS lock. We don't have that race anymore,
as we have a rw semaphore to synchronize memory mapped writes with
fallocate (and reflinking too). That change happened with commit
8d9b4a162a ("btrfs: exclude mmap from happening during all
fallocate operations").
So stop looking for ordered extents after locking the file range when
doing hole punching and zero range operations.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When doing hole punching we are flushing delalloc and waiting for ordered
extents to complete before locking the inode (VFS lock and the btrfs
specific i_mmap_lock). This is fine because even if a write happens after
we call btrfs_wait_ordered_range() and before we lock the inode (call
btrfs_inode_lock()), we will notice the write at
btrfs_punch_hole_lock_range() and flush delalloc and wait for its ordered
extent.
We can however make this simpler by locking first the inode an then call
btrfs_wait_ordered_range(), which will allow us to remove the ordered
extent lookup logic from btrfs_punch_hole_lock_range() in the next patch.
It also makes the behaviour the same as plain fallocate, hole punching
and reflinks.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For fallocate() we have this loop that checks if we have ordered extents
after locking the file range, and if so unlock the range, wait for ordered
extents, and retry until we don't find more ordered extents.
This logic was needed in the past because:
1) Direct IO writes within the i_size boundary did not take the inode's
VFS lock. This was because that lock used to be a mutex, then some
years ago it was switched to a rw semaphore (commit 9902af79c0
("parallel lookups: actual switch to rwsem")), and then btrfs was
changed to take the VFS inode's lock in shared mode for writes that
don't cross the i_size boundary (commit e9adabb971 ("btrfs: use
shared lock for direct writes within EOF"));
2) We could race with memory mapped writes, because memory mapped writes
don't acquire the inode's VFS lock. We don't have that race anymore,
as we have a rw semaphore to synchronize memory mapped writes with
fallocate (and reflinking too). That change happened with commit
8d9b4a162a ("btrfs: exclude mmap from happening during all
fallocate operations").
So stop looking for ordered extents after locking the file range when
doing a plain fallocate.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When starting a fallocate zero range operation, before getting the first
extent map for the range, we make a call to inode_dio_wait().
This logic was needed in the past because direct IO writes within the
i_size boundary did not take the inode's VFS lock. This was because that
lock used to be a mutex, then some years ago it was switched to a rw
semaphore (by commit 9902af79c0 ("parallel lookups: actual switch to
rwsem")), and then btrfs was changed to take the VFS inode's lock in
shared mode for writes that don't cross the i_size boundary (done in
commit e9adabb971 ("btrfs: use shared lock for direct writes within
EOF")). The lockless direct IO writes could result in a race with the
zero range operation, resulting in the later getting a stale extent
map for the range.
So remove this no longer needed call to inode_dio_wait(), as fallocate
takes the inode's VFS lock in exclusive mode and direct IO writes within
i_size take that same lock in shared mode.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
During a plain fallocate, we always start by reserving an amount of data
space that matches the length of the range passed to fallocate. When we
already have extents allocated in that range, we may end up trying to
reserve a lot more data space then we need, which can result in several
undesired behaviours:
1) We fail with -ENOSPC. For example the passed range has a length
of 1G, but there's only one hole with a size of 1M in that range;
2) We temporarily reserve excessive data space that could be used by
other operations happening concurrently;
3) By reserving much more data space then we need, we can end up
doing expensive things like triggering dellaloc for other inodes,
waiting for the ordered extents to complete, trigger transaction
commits, allocate new block groups, etc.
Example:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdj
MNT=/mnt/sdj
mkfs.btrfs -f -b 1g $DEV
mount $DEV $MNT
# Create a file with a size of 600M and two holes, one at [200M, 201M[
# and another at [401M, 402M[
xfs_io -f -c "pwrite -S 0xab 0 200M" \
-c "pwrite -S 0xcd 201M 200M" \
-c "pwrite -S 0xef 402M 198M" \
$MNT/foobar
# Now call fallocate against the whole file range, see if it fails
# with -ENOSPC or not - it shouldn't since we only need to allocate
# 2M of data space.
xfs_io -c "falloc 0 600M" $MNT/foobar
umount $MNT
$ ./test.sh
(...)
wrote 209715200/209715200 bytes at offset 0
200 MiB, 51200 ops; 0.8063 sec (248.026 MiB/sec and 63494.5831 ops/sec)
wrote 209715200/209715200 bytes at offset 210763776
200 MiB, 51200 ops; 0.8053 sec (248.329 MiB/sec and 63572.3172 ops/sec)
wrote 207618048/207618048 bytes at offset 421527552
198 MiB, 50688 ops; 0.7925 sec (249.830 MiB/sec and 63956.5548 ops/sec)
fallocate: No space left on device
$
So fix this by not allocating an amount of data space that matches the
length of the range passed to fallocate. Instead allocate an amount of
data space that corresponds to the sum of the sizes of each hole found
in the range. This reservation now happens after we have locked the file
range, which is safe since we know at this point there's no delalloc
in the range because we've taken the inode's VFS lock in exclusive mode,
we have taken the inode's i_mmap_lock in exclusive mode, we have flushed
delalloc and waited for all ordered extents in the range to complete.
This type of failure actually seems to happen in practice with systemd,
and we had at least one report about this in a very long thread which
is referenced by the Link tag below.
Link: https://lore.kernel.org/linux-btrfs/bdJVxLiFr_PyQSXRUbZJfFW_jAjsGgoMetqPHJMbg-hdy54Xt_ZHhRetmnJ6cJ99eBlcX76wy-AvWwV715c3YndkxneSlod11P1hlaADx0s=@protonmail.com/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
I've only converted the outer layers of the btrfs release_folio paths
to use folios; the use of folios should be pushed further down into
btrfs from here.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
With all implementations of aops->readpage converted to aops->read_folio,
we can stop checking whether it's set and remove the member from aops.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
This is a "weak" conversion which converts straight back to using pages.
A full conversion should be performed at some point, hopefully by
someone familiar with the filesystem.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Change all the callers of ->readpage to call ->read_folio in preference,
if it exists. This is a transitional duplication, and will be removed
by the end of the series.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Since the initial introduction of (posix) fallocate back at the turn of
the century, it has been possible to use this syscall to change the
user-visible contents of files. This can happen by extending the file
size during a preallocation, or through any of the newer modes (punch,
zero range). Because the call can be used to change file contents, we
should treat it like we do any other modification to a file -- update
the mtime, and drop set[ug]id privileges/capabilities.
The VFS function file_modified() does all this for us if pass it a
locked inode, so let's make fallocate drop permissions correctly.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: David Sterba <dsterba@suse.com>
When an inode has a last_reflink_trans matching the current transaction,
we have to take special care when logging its checksums in order to
avoid getting checksum items with overlapping ranges in a log tree,
which could result in missing checksums after log replay (more on that
in the changelogs of commit 40e046acbd ("Btrfs: fix missing data
checksums after replaying a log tree") and commit e289f03ea7 ("btrfs:
fix corrupt log due to concurrent fsync of inodes with shared extents")).
We also need to make sure a full fsync will copy all old file extent
items it finds in modified leaves, because they might have been copied
from some other inode.
However once we fsync an inode, we don't need to keep paying the price of
that extra special care in future fsyncs done in the same transaction,
unless the inode is used for another reflink operation or the full sync
flag is set on it (truncate, failure to allocate extent maps for holes,
and other exceptional and infrequent cases).
So after we fsync an inode reset its last_unlink_trans to zero. In case
another reflink happens, we continue to update the last_reflink_trans of
the inode, just as before. Also set last_reflink_trans to the generation
of the last transaction that modified the inode whenever we need to set
the full sync flag on the inode, just like when we need to load an inode
from disk after eviction.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The implementation resembles direct I/O: we have to flush any ordered
extents, invalidate the page cache, and do the io tree/delalloc/extent
map/ordered extent dance. From there, we can reuse the compression code
with a minor modification to distinguish the write from writeback. This
also creates inline extents when possible.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently, we always reserve the same extent size in the file and extent
size on disk for delalloc because the former is the worst case for the
latter. For BTRFS_IOC_ENCODED_WRITE writes, we know the exact size of
the extent on disk, which may be less than or greater than (for
bookends) the size in the file. Add a disk_num_bytes parameter to
btrfs_delalloc_reserve_metadata() so that we can reserve the correct
amount of csum bytes. No functional change.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At btrfs_drop_extents(), we try to replace a range of file extent items
with a new file extent in a single btree search, to avoid the need to do
a search for deletion, followed by a path release and followed by yet
another search for insertion.
When I originally added that optimization, in commit 1acae57b16
("Btrfs: faster file extent item replace operations"), I left a constraint
to do the fast replace only if we visited a single leaf. That was because
in the most common case we find all file extent items that need to be
deleted (or trimmed) in a single leaf, however it can work for other
common cases like when we need to delete a few file extent items located
at the end of a leaf and a few more located at the beginning of the next
leaf. The key for the new file extent item is greater than the key of
any deleted or trimmed file extent item from previous leaves, so we are
fine to use the last leaf that we found as long as we are holding a
write lock on it - even if the new key ends up at slot 0, as if that's
the case, the btree search has obtained a write lock on any upper nodes
that need to have a key pointer updated.
So removed the constraint that limits the optimization to the case where
we visited only a single leaf.
This change if part of a patchset that is comprised of the following
patches:
1/6 btrfs: remove unnecessary leaf free space checks when pushing items
2/6 btrfs: avoid unnecessary COW of leaves when deleting items from a leaf
3/6 btrfs: avoid unnecessary computation when deleting items from a leaf
4/6 btrfs: remove constraint on number of visited leaves when replacing extents
5/6 btrfs: remove useless path release in the fast fsync path
6/6 btrfs: prepare extents to be logged before locking a log tree path
The last patch in the series has some performance test result in its
changelog.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is a big gap between inode_should_defrag() and autodefrag extent
size threshold. For inode_should_defrag() it has a flexible
@small_write value. For compressed extent is 16K, and for non-compressed
extent it's 64K.
However for autodefrag extent size threshold, it's always fixed to the
default value (256K).
This means, the following write sequence will trigger autodefrag to
defrag ranges which didn't trigger autodefrag:
pwrite 0 8k
sync
pwrite 8k 128K
sync
The latter 128K write will also be considered as a defrag target (if
other conditions are met). While only that 8K write is really
triggering autodefrag.
Such behavior can cause extra IO for autodefrag.
Close the gap, by copying the @small_write value into inode_defrag, so
that later autodefrag can use the same @small_write value which
triggered autodefrag.
With the existing transid value, this allows autodefrag really to scan
the ranges which triggered autodefrag.
Although this behavior change is mostly reducing the extent_thresh value
for autodefrag, I believe in the future we should allow users to specify
the autodefrag extent threshold through mount options, but that's an
other problem to consider in the future.
CC: stable@vger.kernel.org # 5.16+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Although we have btrfs_requeue_inode_defrag(), for autodefrag we are
still just exhausting all inode_defrag items in the tree.
This means, it doesn't make much difference to requeue an inode_defrag,
other than scan the inode from the beginning till its end.
Change the behaviour to always scan from offset 0 of an inode, and till
the end.
By this we get the following benefit:
- Straight-forward code
- No more re-queue related check
- Fewer members in inode_defrag
We still keep the same btrfs_get_fs_root() and btrfs_iget() check for
each loop, and added extra should_auto_defrag() check per-loop.
Note: the patch needs to be backported and is intentionally written
to minimize the diff size, code will be cleaned up later.
CC: stable@vger.kernel.org # 5.16
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we do a direct IO read or write when the buffer given by the user is
memory mapped to the file range we are going to do IO, we end up ending
in a deadlock. This is triggered by the new test case generic/647 from
fstests.
For a direct IO read we get a trace like this:
[967.872718] INFO: task mmap-rw-fault:12176 blocked for more than 120 seconds.
[967.874161] Not tainted 5.14.0-rc7-btrfs-next-95 #1
[967.874909] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[967.875983] task:mmap-rw-fault state:D stack: 0 pid:12176 ppid: 11884 flags:0x00000000
[967.875992] Call Trace:
[967.875999] __schedule+0x3ca/0xe10
[967.876015] schedule+0x43/0xe0
[967.876020] wait_extent_bit.constprop.0+0x1eb/0x260 [btrfs]
[967.876109] ? do_wait_intr_irq+0xb0/0xb0
[967.876118] lock_extent_bits+0x37/0x90 [btrfs]
[967.876150] btrfs_lock_and_flush_ordered_range+0xa9/0x120 [btrfs]
[967.876184] ? extent_readahead+0xa7/0x530 [btrfs]
[967.876214] extent_readahead+0x32d/0x530 [btrfs]
[967.876253] ? lru_cache_add+0x104/0x220
[967.876255] ? kvm_sched_clock_read+0x14/0x40
[967.876258] ? sched_clock_cpu+0xd/0x110
[967.876263] ? lock_release+0x155/0x4a0
[967.876271] read_pages+0x86/0x270
[967.876274] ? lru_cache_add+0x125/0x220
[967.876281] page_cache_ra_unbounded+0x1a3/0x220
[967.876291] filemap_fault+0x626/0xa20
[967.876303] __do_fault+0x36/0xf0
[967.876308] __handle_mm_fault+0x83f/0x15f0
[967.876322] handle_mm_fault+0x9e/0x260
[967.876327] __get_user_pages+0x204/0x620
[967.876332] ? get_user_pages_unlocked+0x69/0x340
[967.876340] get_user_pages_unlocked+0xd3/0x340
[967.876349] internal_get_user_pages_fast+0xbca/0xdc0
[967.876366] iov_iter_get_pages+0x8d/0x3a0
[967.876374] bio_iov_iter_get_pages+0x82/0x4a0
[967.876379] ? lock_release+0x155/0x4a0
[967.876387] iomap_dio_bio_actor+0x232/0x410
[967.876396] iomap_apply+0x12a/0x4a0
[967.876398] ? iomap_dio_rw+0x30/0x30
[967.876414] __iomap_dio_rw+0x29f/0x5e0
[967.876415] ? iomap_dio_rw+0x30/0x30
[967.876420] ? lock_acquired+0xf3/0x420
[967.876429] iomap_dio_rw+0xa/0x30
[967.876431] btrfs_file_read_iter+0x10b/0x140 [btrfs]
[967.876460] new_sync_read+0x118/0x1a0
[967.876472] vfs_read+0x128/0x1b0
[967.876477] __x64_sys_pread64+0x90/0xc0
[967.876483] do_syscall_64+0x3b/0xc0
[967.876487] entry_SYSCALL_64_after_hwframe+0x44/0xae
[967.876490] RIP: 0033:0x7fb6f2c038d6
[967.876493] RSP: 002b:00007fffddf586b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000011
[967.876496] RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007fb6f2c038d6
[967.876498] RDX: 0000000000001000 RSI: 00007fb6f2c17000 RDI: 0000000000000003
[967.876499] RBP: 0000000000001000 R08: 0000000000000003 R09: 0000000000000000
[967.876501] R10: 0000000000001000 R11: 0000000000000246 R12: 0000000000000003
[967.876502] R13: 0000000000000000 R14: 00007fb6f2c17000 R15: 0000000000000000
This happens because at btrfs_dio_iomap_begin() we lock the extent range
and return with it locked - we only unlock in the endio callback, at
end_bio_extent_readpage() -> endio_readpage_release_extent(). Then after
iomap called the btrfs_dio_iomap_begin() callback, it triggers the page
faults that resulting in reading the pages, through the readahead callback
btrfs_readahead(), and through there we end to attempt to lock again the
same extent range (or a subrange of what we locked before), resulting in
the deadlock.
For a direct IO write, the scenario is a bit different, and it results in
trace like this:
[1132.442520] run fstests generic/647 at 2021-08-31 18:53:35
[1330.349355] INFO: task mmap-rw-fault:184017 blocked for more than 120 seconds.
[1330.350540] Not tainted 5.14.0-rc7-btrfs-next-95 #1
[1330.351158] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1330.351900] task:mmap-rw-fault state:D stack: 0 pid:184017 ppid:183725 flags:0x00000000
[1330.351906] Call Trace:
[1330.351913] __schedule+0x3ca/0xe10
[1330.351930] schedule+0x43/0xe0
[1330.351935] btrfs_start_ordered_extent+0x108/0x1c0 [btrfs]
[1330.352020] ? do_wait_intr_irq+0xb0/0xb0
[1330.352028] btrfs_lock_and_flush_ordered_range+0x8c/0x120 [btrfs]
[1330.352064] ? extent_readahead+0xa7/0x530 [btrfs]
[1330.352094] extent_readahead+0x32d/0x530 [btrfs]
[1330.352133] ? lru_cache_add+0x104/0x220
[1330.352135] ? kvm_sched_clock_read+0x14/0x40
[1330.352138] ? sched_clock_cpu+0xd/0x110
[1330.352143] ? lock_release+0x155/0x4a0
[1330.352151] read_pages+0x86/0x270
[1330.352155] ? lru_cache_add+0x125/0x220
[1330.352162] page_cache_ra_unbounded+0x1a3/0x220
[1330.352172] filemap_fault+0x626/0xa20
[1330.352176] ? filemap_map_pages+0x18b/0x660
[1330.352184] __do_fault+0x36/0xf0
[1330.352189] __handle_mm_fault+0x1253/0x15f0
[1330.352203] handle_mm_fault+0x9e/0x260
[1330.352208] __get_user_pages+0x204/0x620
[1330.352212] ? get_user_pages_unlocked+0x69/0x340
[1330.352220] get_user_pages_unlocked+0xd3/0x340
[1330.352229] internal_get_user_pages_fast+0xbca/0xdc0
[1330.352246] iov_iter_get_pages+0x8d/0x3a0
[1330.352254] bio_iov_iter_get_pages+0x82/0x4a0
[1330.352259] ? lock_release+0x155/0x4a0
[1330.352266] iomap_dio_bio_actor+0x232/0x410
[1330.352275] iomap_apply+0x12a/0x4a0
[1330.352278] ? iomap_dio_rw+0x30/0x30
[1330.352292] __iomap_dio_rw+0x29f/0x5e0
[1330.352294] ? iomap_dio_rw+0x30/0x30
[1330.352306] btrfs_file_write_iter+0x238/0x480 [btrfs]
[1330.352339] new_sync_write+0x11f/0x1b0
[1330.352344] ? NF_HOOK_LIST.constprop.0.cold+0x31/0x3e
[1330.352354] vfs_write+0x292/0x3c0
[1330.352359] __x64_sys_pwrite64+0x90/0xc0
[1330.352365] do_syscall_64+0x3b/0xc0
[1330.352369] entry_SYSCALL_64_after_hwframe+0x44/0xae
[1330.352372] RIP: 0033:0x7f4b0a580986
[1330.352379] RSP: 002b:00007ffd34d75418 EFLAGS: 00000246 ORIG_RAX: 0000000000000012
[1330.352382] RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007f4b0a580986
[1330.352383] RDX: 0000000000001000 RSI: 00007f4b0a3a4000 RDI: 0000000000000003
[1330.352385] RBP: 00007f4b0a3a4000 R08: 0000000000000003 R09: 0000000000000000
[1330.352386] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003
[1330.352387] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
Unlike for reads, at btrfs_dio_iomap_begin() we return with the extent
range unlocked, but later when the page faults are triggered and we try
to read the extents, we end up btrfs_lock_and_flush_ordered_range() where
we find the ordered extent for our write, created by the iomap callback
btrfs_dio_iomap_begin(), and we wait for it to complete, which makes us
deadlock since we can't complete the ordered extent without reading the
pages (the iomap code only submits the bio after the pages are faulted
in).
Fix this by setting the nofault attribute of the given iov_iter and retry
the direct IO read/write if we get an -EFAULT error returned from iomap.
For reads, also disable page faults completely, this is because when we
read from a hole or a prealloc extent, we can still trigger page faults
due to the call to iov_iter_zero() done by iomap - at the moment, it is
oblivious to the value of the ->nofault attribute of an iov_iter.
We also need to keep track of the number of bytes written or read, and
pass it to iomap_dio_rw(), as well as use the new flag IOMAP_DIO_PARTIAL.
This depends on the iov_iter and iomap changes introduced in commit
c03098d4b9 ("Merge tag 'gfs2-v5.15-rc5-mmap-fault' of
git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2").
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Functions gfs2_file_read_iter and gfs2_file_write_iter are both
accessing the user buffer to write to or read from while holding the
inode glock. In the most basic scenario, that buffer will not be
resident and it will be mapped to the same file. Accessing the buffer
will trigger a page fault, and gfs2 will deadlock trying to take the
same inode glock again while trying to handle that fault.
Fix that and similar, more complex scenarios by disabling page faults
while accessing user buffers. To make this work, introduce a small
amount of new infrastructure and fix some bugs that didn't trigger so
far, with page faults enabled.
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEEJZs3krPW0xkhLMTc1b+f6wMTZToFAmGBPisUHGFncnVlbmJh
QHJlZGhhdC5jb20ACgkQ1b+f6wMTZTpE6A/7BezUnGuNJxJrR8pC+vcLYA7xAgUU
6STQ6IN7w5UHRlSkNzZxZ2XPxW4uVQ4SxSEeaLqBsHZihepjcLNFZ/8MhQ6UPSD0
8noHOi7CoIcp6IuWQtCpxRM/xjjm2SlMt2XbVJZaiJcdzCV9gB6TU9EkBRq7Zm/X
9WFBbv1xZF0skn9ISCJvNtiiI+VyWKgMDUKxJUiTQjmJcklyyqHcVGmQi9BjqPz4
4s3F+WH6CoGbDKlmNk/6Y9wZ/2+sbvGswVscUxPwJVPoZWsR1xBBUdAeAmEMD1P4
BgE/Y1J8JXyVPYtyvZKq70XUhKdQkxB7RfX87YasOk9mY4Kjd5rIIGEykh+o2vC9
kDhCHvf2Mnw5I6Rum3B7UXyB1vemY+fECIHsXhgBnS+ztabRtcAdpCuWoqb43ymw
yEX1KwXyU4FpRYbrRvdZT42Fmh6ty8TW+N4swg8S2TrffirvgAi5yrcHZ4mPupYv
lyzvsCW7Wv8hPXn/twNObX+okRgJnsxcCdBXARdCnRXfA8tH23xmu88u8RA1Vdxh
nzTvv6Dx2EowwojuDWMx29Mw3fA2IqIfbOV+4FaRU7NZ2ZKtknL8yGl27qQUsMoJ
vYsHTmagasjQr+NDJ3vQRLCw+JQ6B1hENpdkmixFD9moo7X1ZFW3HBi/UL973Bv6
5CmgeXto8FRUFjI=
=WeNd
-----END PGP SIGNATURE-----
Merge tag 'gfs2-v5.15-rc5-mmap-fault' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 mmap + page fault deadlocks fixes from Andreas Gruenbacher:
"Functions gfs2_file_read_iter and gfs2_file_write_iter are both
accessing the user buffer to write to or read from while holding the
inode glock.
In the most basic deadlock scenario, that buffer will not be resident
and it will be mapped to the same file. Accessing the buffer will
trigger a page fault, and gfs2 will deadlock trying to take the same
inode glock again while trying to handle that fault.
Fix that and similar, more complex scenarios by disabling page faults
while accessing user buffers. To make this work, introduce a small
amount of new infrastructure and fix some bugs that didn't trigger so
far, with page faults enabled"
* tag 'gfs2-v5.15-rc5-mmap-fault' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: Fix mmap + page fault deadlocks for direct I/O
iov_iter: Introduce nofault flag to disable page faults
gup: Introduce FOLL_NOFAULT flag to disable page faults
iomap: Add done_before argument to iomap_dio_rw
iomap: Support partial direct I/O on user copy failures
iomap: Fix iomap_dio_rw return value for user copies
gfs2: Fix mmap + page fault deadlocks for buffered I/O
gfs2: Eliminate ip->i_gh
gfs2: Move the inode glock locking to gfs2_file_buffered_write
gfs2: Introduce flag for glock holder auto-demotion
gfs2: Clean up function may_grant
gfs2: Add wrapper for iomap_file_buffered_write
iov_iter: Introduce fault_in_iov_iter_writeable
iov_iter: Turn iov_iter_fault_in_readable into fault_in_iov_iter_readable
gup: Turn fault_in_pages_{readable,writeable} into fault_in_{readable,writeable}
powerpc/kvm: Fix kvm_use_magic_page
iov_iter: Fix iov_iter_get_pages{,_alloc} page fault return value
In order to make 'real_root' used only in ref-verify it's required to
have the necessary context to perform the same checks that this member
is used for. So add 'mod_root' which will contain the root on behalf of
which a delayed ref was created and a 'skip_group' parameter which
will contain callsite-specific override of skip_qgroup.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have a few flags that are inconsistently used to describe the fs in
different states of failure. As of 5963ffcaf3 ("btrfs: always abort
the transaction if we abort a trans handle") we will always set
BTRFS_FS_STATE_ERROR if we abort, so we don't have to check both ABORTED
and ERROR to see if things have gone wrong. Add a helper to check
BTRFS_FS_STATE_ERROR and then convert all checkers of FS_STATE_ERROR to
use the helper.
The TRANS_ABORTED bit check was added in af72273381 ("Btrfs: clean up
resources during umount after trans is aborted") but is not actually
specific.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Although in btrfs we have very limited usage of PageChecked flag, it's
still some page flag not yet subpage compatible.
Fix it by introducing btrfs_subpage::checked_offset to do the convert.
For most call sites, especially for free-space cache, COW fixup and
btrfs_invalidatepage(), they all work in full page mode anyway.
For other call sites, they work as subpage compatible mode.
Some call sites need extra modification:
- btrfs_drop_pages()
Needs extra parameter to get the real range we need to clear checked
flag.
Also since btrfs_drop_pages() will accept pages beyond the dirtied
range, update btrfs_subpage_clamp_range() to handle such case
by setting @len to 0 if the page is beyond target range.
- btrfs_invalidatepage()
We need to call subpage helper before calling __btrfs_releasepage(),
or it will trigger ASSERT() as page->private will be cleared.
- btrfs_verify_data_csum()
In theory we don't need the io_bio->csum check anymore, but it's
won't hurt. Just change the comment.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since setup_items_for_insert() is not used anymore outside of ctree.c,
make it static and remove its prototype from ctree.h. This also requires
to move the definition of setup_item_for_insert() from ctree.h to ctree.c
and move down btrfs_duplicate_item() so that it's defined after
setup_items_for_insert().
Further, since setup_item_for_insert() is used outside ctree.c, rename it
to btrfs_setup_item_for_insert().
This patch is part of a small patchset that is comprised of the following
patches:
btrfs: loop only once over data sizes array when inserting an item batch
btrfs: unexport setup_items_for_insert()
btrfs: use single bulk copy operations when logging directories
This is patch 2/3 and performance results, and the specific tests, are
included in the changelog of patch 3/3.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When inserting a batch of items into a btree, we end up looping over the
data sizes array 3 times:
1) Once in the caller of btrfs_insert_empty_items(), when it populates the
array with the data sizes for each item;
2) Once at btrfs_insert_empty_items() to sum the elements of the data
sizes array and compute the total data size;
3) And then once again at setup_items_for_insert(), where we do exactly
the same as what we do at btrfs_insert_empty_items(), to compute the
total data size.
That is not bad for small arrays, but when the arrays have hundreds of
elements, the time spent on looping is not negligible. For example when
doing batch inserts of delayed items for dir index items or when logging
a directory, it's common to have 200 to 260 dir index items in a single
batch when using a leaf size of 16K and using file names between 8 and 12
characters. For a 64K leaf size, multiply that by 4. Taking into account
that during directory logging or when flushing delayed dir index items we
can have many of those large batches, the time spent on the looping adds
up quickly.
It's also more important to avoid it at setup_items_for_insert(), since
we are holding a write lock on a leaf and, in some cases, on upper nodes
of the btree, which causes us to block other tasks that want to access
the leaf and nodes for longer than necessary.
So change the code so that setup_items_for_insert() and
btrfs_insert_empty_items() no longer compute the total data size, and
instead rely on the caller to supply it. This makes us loop over the
array only once, where we can both populate the data size array and
compute the total data size, taking advantage of spatial and temporal
locality. To make this more manageable, use a structure to contain
all the relevant details for a batch of items (keys array, data sizes
array, total data size, number of items), and use it as an argument
for btrfs_insert_empty_items() and setup_items_for_insert().
This patch is part of a small patchset that is comprised of the following
patches:
btrfs: loop only once over data sizes array when inserting an item batch
btrfs: unexport setup_items_for_insert()
btrfs: use single bulk copy operations when logging directories
This is patch 1/3 and performance results, and the specific tests, are
included in the changelog of patch 3/3.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add a done_before argument to iomap_dio_rw that indicates how much of
the request has already been transferred. When the request succeeds, we
report that done_before additional bytes were tranferred. This is
useful for finishing a request asynchronously when part of the request
has already been completed synchronously.
We'll use that to allow iomap_dio_rw to be used with page faults
disabled: when a page fault occurs while submitting a request, we
synchronously complete the part of the request that has already been
submitted. The caller can then take care of the page fault and call
iomap_dio_rw again for the rest of the request, passing in the number of
bytes already tranferred.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Turn iov_iter_fault_in_readable into a function that returns the number
of bytes not faulted in, similar to copy_to_user, instead of returning a
non-zero value when any of the requested pages couldn't be faulted in.
This supports the existing users that require all pages to be faulted in
as well as new users that are happy if any pages can be faulted in.
Rename iov_iter_fault_in_readable to fault_in_iov_iter_readable to make
sure this change doesn't silently break things.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Error injection testing uncovered a case where we'd end up with a
corrupt file system with a missing extent in the middle of a file. This
occurs because the if statement to decide if we should abort is wrong.
The only way we would abort in this case is if we got a ret !=
-EOPNOTSUPP and we called from the file clone code. However the
prealloc code uses this path too. Instead we need to abort if there is
an error, and the only error we _don't_ abort on is -EOPNOTSUPP and only
if we came from the clone file code.
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
I hit a stuck relocation on btrfs/061 during my overnight testing. This
turned out to be because we had left over extent entries in our extent
root for a data reloc inode that no longer existed. This happened
because in btrfs_drop_extents() we only update refs if we have SHAREABLE
set or we are the tree_root. This regression was introduced by
aeb935a455 ("btrfs: don't set SHAREABLE flag for data reloc tree")
where we stopped setting SHAREABLE for the data reloc tree.
The problem here is we actually do want to update extent references for
data extents in the data reloc tree, in fact we only don't want to
update extent references if the file extents are in the log tree.
Update this check to only skip updating references in the case of the
log tree.
This is relatively rare, because you have to be running scrub at the
same time, which is what btrfs/061 does. The data reloc inode has its
extents pre-allocated, and then we copy the extent into the
pre-allocated chunks. We theoretically should never be calling
btrfs_drop_extents() on a data reloc inode. The exception of course is
with scrub, if our pre-allocated extent falls inside of the block group
we are scrubbing, then the block group will be marked read only and we
will be forced to cow that extent. This means we will call
btrfs_drop_extents() on that range when we COW that file extent.
This isn't really problematic if we do this, the data reloc inode
requires that our extent lengths match exactly with the extent we are
copying, thankfully we validate the extent is correct with
get_new_location(), so if we happen to COW only part of the extent we
won't link it in when we do the relocation, so we are safe from any
other shenanigans that arise because of this interaction with scrub.
Fixes: aeb935a455 ("btrfs: don't set SHAREABLE flag for data reloc tree")
CC: stable@vger.kernel.org # 5.8+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add support for fsverity in btrfs. To support the generic interface in
fs/verity, we add two new item types in the fs tree for inodes with
verity enabled. One stores the per-file verity descriptor and btrfs
verity item and the other stores the Merkle tree data itself.
Verity checking is done in end_page_read just before a page is marked
uptodate. This naturally handles a variety of edge cases like holes,
preallocated extents, and inline extents. Some care needs to be taken to
not try to verity pages past the end of the file, which are accessed by
the generic buffered file reading code under some circumstances like
reading to the end of the last page and trying to read again. Direct IO
on a verity file falls back to buffered reads.
Verity relies on PageChecked for the Merkle tree data itself to avoid
re-walking up shared paths in the tree. For this reason, we need to
cache the Merkle tree data. Since the file is immutable after verity is
turned on, we can cache it at an index past EOF.
Use the new inode ro_flags to store verity on the inode item, so that we
can enable verity on a file, then rollback to an older kernel and still
mount the file system and read the file. Since we can't safely write the
file anymore without ruining the invariants of the Merkle tree, we mark
a ro_compat flag on the file system when a file has verity enabled.
Acked-by: Eric Biggers <ebiggers@google.com>
Co-developed-by: Chris Mason <clm@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
There is a possible use-after-free bug when running generic/095.
BUG: Unable to handle kernel data access on write at 0x6b6b6b6b6b6b725b
Faulting instruction address: 0xc000000000283654
c000000000283078 do_raw_spin_unlock+0x88/0x230
c0000000012b1e14 _raw_spin_unlock_irqrestore+0x44/0x90
c000000000a918dc btrfs_subpage_clear_writeback+0xac/0xe0
c0000000009e0458 end_bio_extent_writepage+0x158/0x270
c000000000b6fd14 bio_endio+0x254/0x270
c0000000009fc0f0 btrfs_end_bio+0x1a0/0x200
c000000000b6fd14 bio_endio+0x254/0x270
c000000000b781fc blk_update_request+0x46c/0x670
c000000000b8b394 blk_mq_end_request+0x34/0x1d0
c000000000d82d1c lo_complete_rq+0x11c/0x140
c000000000b880a4 blk_complete_reqs+0x84/0xb0
c0000000012b2ca4 __do_softirq+0x334/0x680
c0000000001dd878 irq_exit+0x148/0x1d0
c000000000016f4c do_IRQ+0x20c/0x240
c000000000009240 hardware_interrupt_common_virt+0x1b0/0x1c0
[CAUSE]
There is very small race window like the following in generic/095.
Thread 1 | Thread 2
--------------------------------+------------------------------------
end_bio_extent_writepage() | btrfs_releasepage()
|- spin_lock_irqsave() | |
|- end_page_writeback() | |
| | |- if (PageWriteback() ||...)
| | |- clear_page_extent_mapped()
| | |- kfree(subpage);
|- spin_unlock_irqrestore().
The race can also happen between writeback and btrfs_invalidatepage(),
although that would be much harder as btrfs_invalidatepage() has much
more work to do before the clear_page_extent_mapped() call.
[FIX]
Here we "wait" for the subapge spinlock to be released before we detach
subpage structure.
So this patch will introduce a new function, wait_subpage_spinlock(), to
do the "wait" by acquiring the spinlock and release it.
Since the caller has ensured the page is not dirty nor writeback, and
page is already locked, the only way to hold the subpage spinlock is
from endio function.
Thus we only need to acquire the spinlock to wait for any existing
holder.
Reported-by: Ritesh Harjani <riteshh@linux.ibm.com>
Tested-by: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
When running generic/095, there is a high chance to crash with subpage
data RW support:
assertion failed: PagePrivate(page) && page->private
------------[ cut here ]------------
kernel BUG at fs/btrfs/ctree.h:3403!
Internal error: Oops - BUG: 0 [#1] SMP
CPU: 1 PID: 3567 Comm: fio Tainted: 5.12.0-rc7-custom+ #17
Hardware name: Khadas VIM3 (DT)
Call trace:
assertfail.constprop.0+0x28/0x2c [btrfs]
btrfs_subpage_assert+0x80/0xa0 [btrfs]
btrfs_subpage_set_uptodate+0x34/0xec [btrfs]
btrfs_page_clamp_set_uptodate+0x74/0xa4 [btrfs]
btrfs_dirty_pages+0x160/0x270 [btrfs]
btrfs_buffered_write+0x444/0x630 [btrfs]
btrfs_direct_write+0x1cc/0x2d0 [btrfs]
btrfs_file_write_iter+0xc0/0x160 [btrfs]
new_sync_write+0xe8/0x180
vfs_write+0x1b4/0x210
ksys_pwrite64+0x7c/0xc0
__arm64_sys_pwrite64+0x24/0x30
el0_svc_common.constprop.0+0x70/0x140
do_el0_svc+0x28/0x90
el0_svc+0x2c/0x54
el0_sync_handler+0x1a8/0x1ac
el0_sync+0x170/0x180
Code: f0000160 913be042 913c4000 955444bc (d4210000)
---[ end trace 3fdd39f4cccedd68 ]---
[CAUSE]
Although prepare_pages() calls find_or_create_page(), which returns the
page locked, but in later prepare_uptodate_page() calls, we may call
btrfs_readpage() which will unlock the page before it returns.
This leaves a window where btrfs_releasepage() can sneak in and release
the page, clearing page->private and causing above ASSERT().
[FIX]
In prepare_uptodate_page(), we should not only check page->mapping, but
also PagePrivate() to ensure we are still holding the correct page which
has proper fs context setup.
Reported-by: Ritesh Harjani <riteshh@linux.ibm.com>
Tested-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pull iov_iter updates from Al Viro:
"iov_iter cleanups and fixes.
There are followups, but this is what had sat in -next this cycle. IMO
the macro forest in there became much thinner and easier to follow..."
* 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (37 commits)
csum_and_copy_to_pipe_iter(): leave handling of csum_state to caller
clean up copy_mc_pipe_to_iter()
pipe_zero(): we don't need no stinkin' kmap_atomic()...
iov_iter: clean csum_and_copy_...() primitives up a bit
copy_page_from_iter(): don't need kmap_atomic() for kvec/bvec cases
copy_page_to_iter(): don't bother with kmap_atomic() for bvec/kvec cases
iterate_xarray(): only of the first iteration we might get offset != 0
pull handling of ->iov_offset into iterate_{iovec,bvec,xarray}
iov_iter: make iterator callbacks use base and len instead of iovec
iov_iter: make the amount already copied available to iterator callbacks
iov_iter: get rid of separate bvec and xarray callbacks
iov_iter: teach iterate_{bvec,xarray}() about possible short copies
iterate_bvec(): expand bvec.h macro forest, massage a bit
iov_iter: unify iterate_iovec and iterate_kvec
iov_iter: massage iterate_iovec and iterate_kvec to logics similar to iterate_bvec
iterate_and_advance(): get rid of magic in case when n is 0
csum_and_copy_to_iter(): massage into form closer to csum_and_copy_from_iter()
iov_iter: replace iov_iter_copy_from_user_atomic() with iterator-advancing variant
[xarray] iov_iter_npages(): just use DIV_ROUND_UP()
iov_iter_npages(): don't bother with iterate_all_kinds()
...
By way of inverting the list_empty conditional the insert label can be
eliminated, making the function's flow entirely linear.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
With current subpage RW support, the following script can hang the fs
with 64K page size.
# mkfs.btrfs -f -s 4k $dev
# mount $dev -o nospace_cache $mnt
# fsstress -w -n 50 -p 1 -s 1607749395 -d $mnt
The kernel will do an infinite loop in btrfs_punch_hole_lock_range().
[CAUSE]
In btrfs_punch_hole_lock_range() we:
- Truncate page cache range
- Lock extent io tree
- Wait any ordered extents in the range.
We exit the loop until we meet all the following conditions:
- No ordered extent in the lock range
- No page is in the lock range
The latter condition has a pitfall, it only works for sector size ==
PAGE_SIZE case.
While can't handle the following subpage case:
0 32K 64K 96K 128K
| |///////||//////| ||
lockstart=32K
lockend=96K - 1
In this case, although the range crosses 2 pages,
truncate_pagecache_range() will invalidate no page at all, but only zero
the [32K, 96K) range of the two pages.
Thus filemap_range_has_page(32K, 96K-1) will always return true, thus we
will never meet the loop exit condition.
[FIX]
Fix the problem by doing page alignment for the lock range.
Function filemap_range_has_page() has already handled lend < lstart
case, we only need to round up @lockstart, and round_down @lockend for
truncate_pagecache_range().
This modification should not change any thing for sector size ==
PAGE_SIZE case, as in that case our range is already page aligned.
Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since the extent io tree operations in btrfs_dirty_pages() are already
subpage compatible, we only need to make the page status update to use
subpage helpers.
Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of calling list_entry with head->prev simply call
list_last_entry which makes it obvious which member of the list is
being referred. This allows to remove the extra 'prev' pointer.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Replacement is called copy_page_from_iter_atomic(); unlike the old primitive the
callers do *not* need to do iov_iter_advance() after it. In case when they end
up consuming less than they'd been given they need to do iov_iter_revert() on
everything they had not consumed. That, however, needs to be done only on slow
paths.
All in-tree callers converted. And that kills the last user of iterate_all_kinds()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We always return 0 even in case of an error in btrfs_mark_extent_written().
Fix it to return proper error value in case of a failure. All callers
handle it.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When doing a fast fsync on a file, there is a race which can result in the
fsync returning success to user space without logging the inode and without
durably persisting new data.
The following example shows one possible scenario for this:
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt
$ touch /mnt/bar
$ xfs_io -f -c "pwrite -S 0xab 0 1M" -c "fsync" /mnt/baz
# Now we have:
# file bar == inode 257
# file baz == inode 258
$ mv /mnt/baz /mnt/foo
# Now we have:
# file bar == inode 257
# file foo == inode 258
$ xfs_io -c "pwrite -S 0xcd 0 1M" /mnt/foo
# fsync bar before foo, it is important to trigger the race.
$ xfs_io -c "fsync" /mnt/bar
$ xfs_io -c "fsync" /mnt/foo
# After this:
# inode 257, file bar, is empty
# inode 258, file foo, has 1M filled with 0xcd
<power failure>
# Replay the log:
$ mount /dev/sdc /mnt
# After this point file foo should have 1M filled with 0xcd and not 0xab
The following steps explain how the race happens:
1) Before the first fsync of inode 258, when it has the "baz" name, its
->logged_trans is 0, ->last_sub_trans is 0 and ->last_log_commit is -1.
The inode also has the full sync flag set;
2) After the first fsync, we set inode 258 ->logged_trans to 6, which is
the generation of the current transaction, and set ->last_log_commit
to 0, which is the current value of ->last_sub_trans (done at
btrfs_log_inode()).
The full sync flag is cleared from the inode during the fsync.
The log sub transaction that was committed had an ID of 0 and when we
synced the log, at btrfs_sync_log(), we incremented root->log_transid
from 0 to 1;
3) During the rename:
We update inode 258, through btrfs_update_inode(), and that causes its
->last_sub_trans to be set to 1 (the current log transaction ID), and
->last_log_commit remains with a value of 0.
After updating inode 258, because we have previously logged the inode
in the previous fsync, we log again the inode through the call to
btrfs_log_new_name(). This results in updating the inode's
->last_log_commit from 0 to 1 (the current value of its
->last_sub_trans).
The ->last_sub_trans of inode 257 is updated to 1, which is the ID of
the next log transaction;
4) Then a buffered write against inode 258 is made. This leaves the value
of ->last_sub_trans as 1 (the ID of the current log transaction, stored
at root->log_transid);
5) Then an fsync against inode 257 (or any other inode other than 258),
happens. This results in committing the log transaction with ID 1,
which results in updating root->last_log_commit to 1 and bumping
root->log_transid from 1 to 2;
6) Then an fsync against inode 258 starts. We flush delalloc and wait only
for writeback to complete, since the full sync flag is not set in the
inode's runtime flags - we do not wait for ordered extents to complete.
Then, at btrfs_sync_file(), we call btrfs_inode_in_log() before the
ordered extent completes. The call returns true:
static inline bool btrfs_inode_in_log(...)
{
bool ret = false;
spin_lock(&inode->lock);
if (inode->logged_trans == generation &&
inode->last_sub_trans <= inode->last_log_commit &&
inode->last_sub_trans <= inode->root->last_log_commit)
ret = true;
spin_unlock(&inode->lock);
return ret;
}
generation has a value of 6 (fs_info->generation), ->logged_trans also
has a value of 6 (set when we logged the inode during the first fsync
and when logging it during the rename), ->last_sub_trans has a value
of 1, set during the rename (step 3), ->last_log_commit also has a
value of 1 (set in step 3) and root->last_log_commit has a value of 1,
which was set in step 5 when fsyncing inode 257.
As a consequence we don't log the inode, any new extents and do not
sync the log, resulting in a data loss if a power failure happens
after the fsync and before the current transaction commits.
Also, because we do not log the inode, after a power failure the mtime
and ctime of the inode do not match those we had before.
When the ordered extent completes before we call btrfs_inode_in_log(),
then the call returns false and we log the inode and sync the log,
since at the end of ordered extent completion we update the inode and
set ->last_sub_trans to 2 (the value of root->log_transid) and
->last_log_commit to 1.
This problem is found after removing the check for the emptiness of the
inode's list of modified extents in the recent commit 209ecbb858
("btrfs: remove stale comment and logic from btrfs_inode_in_log()"),
added in the 5.13 merge window. However checking the emptiness of the
list is not really the way to solve this problem, and was never intended
to, because while that solves the problem for COW writes, the problem
persists for NOCOW writes because in that case the list is always empty.
In the case of NOCOW writes, even though we wait for the writeback to
complete before returning from btrfs_sync_file(), we end up not logging
the inode, which has a new mtime/ctime, and because we don't sync the log,
we never issue disk barriers (send REQ_PREFLUSH to the device) since that
only happens when we sync the log (when we write super blocks at
btrfs_sync_log()). So effectively, for a NOCOW case, when we return from
btrfs_sync_file() to user space, we are not guaranteeing that the data is
durably persisted on disk.
Also, while the example above uses a rename exchange to show how the
problem happens, it is not the only way to trigger it. An alternative
could be adding a new hard link to inode 258, since that also results
in calling btrfs_log_new_name() and updating the inode in the log.
An example reproducer using the addition of a hard link instead of a
rename operation:
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt
$ touch /mnt/bar
$ xfs_io -f -c "pwrite -S 0xab 0 1M" -c "fsync" /mnt/foo
$ ln /mnt/foo /mnt/foo_link
$ xfs_io -c "pwrite -S 0xcd 0 1M" /mnt/foo
$ xfs_io -c "fsync" /mnt/bar
$ xfs_io -c "fsync" /mnt/foo
<power failure>
# Replay the log:
$ mount /dev/sdc /mnt
# After this point file foo often has 1M filled with 0xab and not 0xcd
The reasons leading to the final fsync of file foo, inode 258, not
persisting the new data are the same as for the previous example with
a rename operation.
So fix by never skipping logging and log syncing when there are still any
ordered extents in flight. To avoid making the conditional if statement
that checks if logging an inode is needed harder to read, place all the
logic into an helper function with separate if statements to make it more
manageable and easier to read.
A test case for fstests will follow soon.
For NOCOW writes, the problem existed before commit b5e6c3e170
("btrfs: always wait on ordered extents at fsync time"), introduced in
kernel 4.19, then it went away with that commit since we started to always
wait for ordered extent completion before logging.
The problem came back again once the fast fsync path was changed again to
avoid waiting for ordered extent completion, in commit 487781796d
("btrfs: make fast fsyncs wait only for writeback"), added in kernel 5.10.
However, for COW writes, the race only happens after the recent
commit 209ecbb858 ("btrfs: remove stale comment and logic from
btrfs_inode_in_log()"), introduced in the 5.13 merge window. For NOCOW
writes, the bug existed before that commit. So tag 5.10+ as the release
for stable backports.
CC: stable@vger.kernel.org # 5.10+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In commit d77815461f ("btrfs: Avoid trucating page or punching hole
in a already existed hole."), existing holes can be skipped by calling
find_first_non_hole() to adjust start and len. However, if the given len
is invalid and large, when an EXTENT_MAP_HOLE extent is found, len will
not be set to zero because (em->start + em->len) is less than
(start + len). Then the ret will be 1 but len will not be set to 0.
The propagated non-zero ret will result in fallocate failure.
In the while-loop of btrfs_replace_file_extents(), len is not updated
every time before it calls find_first_non_hole(). That is, after
btrfs_drop_extents() successfully drops the last non-hole file extent,
it may fail with ENOSPC when attempting to drop a file extent item
representing a hole. The problem can happen. After it calls
find_first_non_hole(), the cur_offset will be adjusted to be larger
than or equal to end. However, since the len is not set to zero, the
break-loop condition (ret && !len) will not be met. After it leaves the
while-loop, fallocate will return 1, which is an unexpected return
value.
We're not able to construct a reproducible way to let
btrfs_drop_extents() fail with ENOSPC after it drops the last non-hole
file extent but with remaining holes left. However, it's quite easy to
fix. We just need to update and check the len every time before we call
find_first_non_hole(). To make the while loop more readable, we also
pull the variable updates to the bottom of loop like this:
while (cur_offset < end) {
...
// update cur_offset & len
// advance cur_offset & len in hole-punching case if needed
}
Reported-by: Robbie Ko <robbieko@synology.com>
Fixes: d77815461f ("btrfs: Avoid trucating page or punching hole in a already existed hole.")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Chung-Chiang Cheng <cccheng@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: BingJing Chang <bingjingc@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is a comment at btrfs_replace_file_extents() that mentions that we
set the full sync flag on an inode when cloning into a file with a size
greater than or equals to 16MiB, through try_release_extent_mapping() when
we truncate the page cache after replacing file extents during a clone
operation.
That is not true anymore since commit 5e548b3201 ("btrfs: do not set
the full sync flag on the inode during page release"), so update the
comment to remove that part and rephrase it slightly to make it more
clear why the full sync flag is set at btrfs_replace_file_extents().
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have a race between marking that an inode needs to be logged, either
at btrfs_set_inode_last_trans() or at btrfs_page_mkwrite(), and between
btrfs_sync_log(). The following steps describe how the race happens.
1) We are at transaction N;
2) Inode I was previously fsynced in the current transaction so it has:
inode->logged_trans set to N;
3) The inode's root currently has:
root->log_transid set to 1
root->last_log_commit set to 0
Which means only one log transaction was committed to far, log
transaction 0. When a log tree is created we set ->log_transid and
->last_log_commit of its parent root to 0 (at btrfs_add_log_tree());
4) One more range of pages is dirtied in inode I;
5) Some task A starts an fsync against some other inode J (same root), and
so it joins log transaction 1.
Before task A calls btrfs_sync_log()...
6) Task B starts an fsync against inode I, which currently has the full
sync flag set, so it starts delalloc and waits for the ordered extent
to complete before calling btrfs_inode_in_log() at btrfs_sync_file();
7) During ordered extent completion we have btrfs_update_inode() called
against inode I, which in turn calls btrfs_set_inode_last_trans(),
which does the following:
spin_lock(&inode->lock);
inode->last_trans = trans->transaction->transid;
inode->last_sub_trans = inode->root->log_transid;
inode->last_log_commit = inode->root->last_log_commit;
spin_unlock(&inode->lock);
So ->last_trans is set to N and ->last_sub_trans set to 1.
But before setting ->last_log_commit...
8) Task A is at btrfs_sync_log():
- it increments root->log_transid to 2
- starts writeback for all log tree extent buffers
- waits for the writeback to complete
- writes the super blocks
- updates root->last_log_commit to 1
It's a lot of slow steps between updating root->log_transid and
root->last_log_commit;
9) The task doing the ordered extent completion, currently at
btrfs_set_inode_last_trans(), then finally runs:
inode->last_log_commit = inode->root->last_log_commit;
spin_unlock(&inode->lock);
Which results in inode->last_log_commit being set to 1.
The ordered extent completes;
10) Task B is resumed, and it calls btrfs_inode_in_log() which returns
true because we have all the following conditions met:
inode->logged_trans == N which matches fs_info->generation &&
inode->last_subtrans (1) <= inode->last_log_commit (1) &&
inode->last_subtrans (1) <= root->last_log_commit (1) &&
list inode->extent_tree.modified_extents is empty
And as a consequence we return without logging the inode, so the
existing logged version of the inode does not point to the extent
that was written after the previous fsync.
It should be impossible in practice for one task be able to do so much
progress in btrfs_sync_log() while another task is at
btrfs_set_inode_last_trans() right after it reads root->log_transid and
before it reads root->last_log_commit. Even if kernel preemption is enabled
we know the task at btrfs_set_inode_last_trans() can not be preempted
because it is holding the inode's spinlock.
However there is another place where we do the same without holding the
spinlock, which is in the memory mapped write path at:
vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
{
(...)
BTRFS_I(inode)->last_trans = fs_info->generation;
BTRFS_I(inode)->last_sub_trans = BTRFS_I(inode)->root->log_transid;
BTRFS_I(inode)->last_log_commit = BTRFS_I(inode)->root->last_log_commit;
(...)
So with preemption happening after setting ->last_sub_trans and before
setting ->last_log_commit, it is less of a stretch to have another task
do enough progress at btrfs_sync_log() such that the task doing the memory
mapped write ends up with ->last_sub_trans and ->last_log_commit set to
the same value. It is still a big stretch to get there, as the task doing
btrfs_sync_log() has to start writeback, wait for its completion and write
the super blocks.
So fix this in two different ways:
1) For btrfs_set_inode_last_trans(), simply set ->last_log_commit to the
value of ->last_sub_trans minus 1;
2) For btrfs_page_mkwrite() only set the inode's ->last_sub_trans, just
like we do for buffered and direct writes at btrfs_file_write_iter(),
which is all we need to make sure multiple writes and fsyncs to an
inode in the same transaction never result in an fsync missing that
the inode changed and needs to be logged. Turn this into a helper
function and use it both at btrfs_page_mkwrite() and at
btrfs_file_write_iter() - this also fixes the problem that at
btrfs_page_mkwrite() we were setting those fields without the
protection of the inode's spinlock.
This is an extremely unlikely race to happen in practice.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When doing an fsync we flush all delalloc, lock the inode (VFS lock), flush
any new delalloc that might have been created before taking the lock and
then wait either for the ordered extents to complete or just for the
writeback to complete (depending on whether the full sync flag is set or
not). We then start logging the inode and assume that while we are doing it
no one else is touching the inode's file extent items (or adding new ones).
That is generally true because all operations that modify an inode acquire
the inode's lock first, including buffered and direct IO writes. However
there is one exception: memory mapped writes, which do not and can not
acquire the inode's lock.
This can cause two types of issues: ending up logging file extent items
with overlapping ranges, which is detected by the tree checker and will
result in aborting the transaction when starting writeback for a log
tree's extent buffers, or a silent corruption where we log a version of
the file that never existed.
Scenario 1 - logging overlapping extents
The following steps explain how we can end up with file extents items with
overlapping ranges in a log tree due to a race between a fsync and memory
mapped writes:
1) Task A starts an fsync on inode X, which has the full sync runtime flag
set. First it starts by flushing all delalloc for the inode;
2) Task A then locks the inode and flushes any other delalloc that might
have been created after the previous flush and waits for all ordered
extents to complete;
3) In the inode's root we have the following leaf:
Leaf N, generation == current transaction id:
---------------------------------------------------------
| (...) [ file extent item, offset 640K, length 128K ] |
---------------------------------------------------------
The last file extent item in leaf N covers the file range from 640K to
768K;
4) Task B does a memory mapped write for the page corresponding to the
file range from 764K to 768K;
5) Task A starts logging the inode. At copy_inode_items_to_log() it uses
btrfs_search_forward() to search for leafs modified in the current
transaction that contain items for the inode. It finds leaf N and copies
all the inode items from that leaf into the log tree.
Now the log tree has a copy of the last file extent item from leaf N.
At the end of the while loop at copy_inode_items_to_log(), we have the
minimum key set to:
min_key.objectid = <inode X number>
min_key.type = BTRFS_EXTENT_DATA_KEY
min_key.offset = 640K
Then we increment the key's offset by 1 so that the next call to
btrfs_search_forward() leaves us at the first key greater than the key
we just processed.
But before btrfs_search_forward() is called again...
6) Dellaloc for the page at offset 764K, dirtied by task B, is started.
It can be started for several reasons:
- The async reclaim task is attempting to satisfy metadata or data
reservation requests, and it has reached a point where it decided
to flush delalloc;
- Due to memory pressure the VMM triggers writeback of dirty pages;
- The system call sync_file_range(2) is called from user space.
7) When the respective ordered extent completes, it trims the length of
the existing file extent item for file offset 640K from 128K to 124K,
and a new file extent item is added with a key offset of 764K and a
length of 4K;
8) Task A calls btrfs_search_forward(), which returns us a path pointing
to the leaf (can be leaf N or some other) containing the new file extent
item for file offset 764K.
We end up copying this item to the log tree, which overlaps with the
last copied file extent item, which covers the file range from 640K to
768K.
When writeback is triggered for log tree's extent buffers, the issue
will be detected by the tree checker which will dump a trace and an
error message on dmesg/syslog. If the writeback is triggered when
syncing the log, which typically is, then we also end up aborting the
current transaction.
This is the same type of problem fixed in 0c713cbab6 ("Btrfs: fix race
between ranged fsync and writeback of adjacent ranges").
Scenario 2 - logging a version of the file that never existed
This scenario only happens when using the NO_HOLES feature and results in
a silent corruption, in the sense that is not detectable by 'btrfs check'
or the tree checker:
1) We have an inode I with a size of 1M and two file extent items, one
covering an extent with disk_bytenr == X for the file range [0, 512K)
and another one covering another extent with disk_bytenr == Y for the
file range [512K, 1M);
2) A hole is punched for the file range [512K, 1M);
3) Task A starts an fsync of inode I, which has the full sync runtime flag
set. It starts by flushing all existing delalloc, locks the inode (VFS
lock), starts any new delalloc that might have been created before
taking the lock and waits for all ordered extents to complete;
4) Some other task does a memory mapped write for the page corresponding to
the file range [640K, 644K) for example;
5) Task A then logs all items of the inode with the call to
copy_inode_items_to_log();
6) In the meanwhile delalloc for the range [640K, 644K) is started. It can
be started for several reasons:
- The async reclaim task is attempting to satisfy metadata or data
reservation requests, and it has reached a point where it decided
to flush delalloc;
- Due to memory pressure the VMM triggers writeback of dirty pages;
- The system call sync_file_range(2) is called from user space.
7) The ordered extent for the range [640K, 644K) completes and a file
extent item for that range is added to the subvolume tree, pointing
to a 4K extent with a disk_bytenr == Z;
8) Task A then calls btrfs_log_holes(), to scan for implicit holes in
the subvolume tree. It finds two implicit holes:
- one for the file range [512K, 640K)
- one for the file range [644K, 1M)
As a result we end up neither logging a hole for the range [640K, 644K)
nor logging the file extent item with a disk_bytenr == Z.
This means that if we have a power failure and replay the log tree we
end up getting the following file extent layout:
[ disk_bytenr X ] [ hole ] [ disk_bytenr Y ] [ hole ]
0 512K 512K 640K 640K 644K 644K 1M
Which does not corresponding to any layout the file ever had before
the power failure. The only two valid layouts would be:
[ disk_bytenr X ] [ hole ]
0 512K 512K 1M
and
[ disk_bytenr X ] [ hole ] [ disk_bytenr Z ] [ hole ]
0 512K 512K 640K 640K 644K 644K 1M
This can be fixed by serializing memory mapped writes with fsync, and there
are two ways to do it:
1) Make a fsync lock the entire file range, from 0 to (u64)-1 / LLONG_MAX
in the inode's io tree. This prevents the race but also blocks any reads
during the duration of the fsync, which has a negative impact for many
common workloads;
2) Make an fsync write lock the i_mmap_lock semaphore in the inode. This
semaphore was recently added by Josef's patch set:
btrfs: add a i_mmap_lock to our inode
btrfs: cleanup inode_lock/inode_unlock uses
btrfs: exclude mmaps while doing remap
btrfs: exclude mmap from happening during all fallocate operations
and is used to solve races between memory mapped writes and
clone/dedupe/fallocate. This also makes us have the same behaviour we
have regarding other writes (buffered and direct IO) and fsync - block
them while the inode logging is in progress.
This change uses the second approach due to the performance impact of the
first one.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's a small window where a deadlock can happen between fallocate and
mmap. This is described in detail by Filipe:
"""
When doing a fallocate operation we lock the inode, flush delalloc within
the target range, wait for any ordered extents to complete and then lock
the file range. Before we lock the range and after we flush delalloc,
there is a time window where another task can come in and do a memory
mapped write for a page within the fallocate range.
This means that after fallocate locks the range, there can be a dirty page
in the range. More often than not, this does not cause any problem.
The exception is when we are low on available metadata space, because an
fallocate operation needs to start a transaction while holding the file
range locked, either through btrfs_prealloc_file_range() or through the
call to btrfs_fallocate_update_isize(). If that's the case, we can end up
in a deadlock. The following list of steps explains how that happens:
1) A fallocate operation starts, locks the inode, flushes delalloc in the
range and waits for ordered extents in the range to complete;
2) Before the fallocate task locks the file range, another task does a
memory mapped write for a page in the fallocate target range. This is
possible since memory mapped writes do not (and can not) lock the
inode;
3) The fallocate task locks the file range. At this point there is one
dirty page in the range (due to the memory mapped write);
4) When the fallocate task attempts to start a transaction, it blocks when
attempting to reserve metadata space, since we are low on available
metadata space. Before blocking (wait on its reservation ticket), it
starts the async reclaim task (if not running already);
5) The async reclaim task is not able to release space through any other
means, so it decides to flush delalloc for inodes with dirty pages.
It finds that the inode used in the fallocate operation has a dirty
page and therefore queues a job (fs_info->flush_workers workqueue) to
flush delalloc for that inode and waits on that job to complete;
6) The flush job blocks when attempting to lock the file range because
it is currently locked by the fallocate task;
7) The fallocate task keeps waiting for its metadata reservation, waiting
for a wakeup on its reservation ticket. The async reclaim task is
waiting on the flush job, which in turn is waiting for locking the file
range that is currently locked by the fallocate task. So unless some
other task is able to release enough metadata space, for example an
ordered extent for some other inode completes, we end up in a deadlock
between all these tasks.
When this happens stack traces like the following show up in dmesg/syslog:
INFO: task kworker/u16:11:1810830 blocked for more than 120 seconds.
Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u16:11 state:D stack: 0 pid:1810830 ppid: 2 flags:0x00004000
Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs]
Call Trace:
__schedule+0x5d1/0xcf0
schedule+0x45/0xe0
lock_extent_bits+0x1e6/0x2d0 [btrfs]
? finish_wait+0x90/0x90
btrfs_invalidatepage+0x32c/0x390 [btrfs]
? __mod_memcg_state+0x8e/0x160
__extent_writepage+0x2d4/0x400 [btrfs]
extent_write_cache_pages+0x2b2/0x500 [btrfs]
? lock_release+0x20e/0x4c0
? trace_hardirqs_on+0x1b/0xf0
extent_writepages+0x43/0x90 [btrfs]
? lock_acquire+0x1a3/0x490
do_writepages+0x43/0xe0
? __filemap_fdatawrite_range+0xa4/0x100
__filemap_fdatawrite_range+0xc5/0x100
btrfs_run_delalloc_work+0x17/0x40 [btrfs]
btrfs_work_helper+0xf1/0x600 [btrfs]
process_one_work+0x24e/0x5e0
worker_thread+0x50/0x3b0
? process_one_work+0x5e0/0x5e0
kthread+0x153/0x170
? kthread_mod_delayed_work+0xc0/0xc0
ret_from_fork+0x22/0x30
INFO: task kworker/u16:1:2426217 blocked for more than 120 seconds.
Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u16:1 state:D stack: 0 pid:2426217 ppid: 2 flags:0x00004000
Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
Call Trace:
__schedule+0x5d1/0xcf0
? kvm_clock_read+0x14/0x30
? wait_for_completion+0x81/0x110
schedule+0x45/0xe0
schedule_timeout+0x30c/0x580
? _raw_spin_unlock_irqrestore+0x3c/0x60
? lock_acquire+0x1a3/0x490
? try_to_wake_up+0x7a/0xa20
? lock_release+0x20e/0x4c0
? lock_acquired+0x199/0x490
? wait_for_completion+0x81/0x110
wait_for_completion+0xab/0x110
start_delalloc_inodes+0x2af/0x390 [btrfs]
btrfs_start_delalloc_roots+0x12d/0x250 [btrfs]
flush_space+0x24f/0x660 [btrfs]
btrfs_async_reclaim_metadata_space+0x1bb/0x480 [btrfs]
process_one_work+0x24e/0x5e0
worker_thread+0x20f/0x3b0
? process_one_work+0x5e0/0x5e0
kthread+0x153/0x170
? kthread_mod_delayed_work+0xc0/0xc0
ret_from_fork+0x22/0x30
(...)
several tasks waiting for the inode lock held by the fallocate task below
(...)
RIP: 0033:0x7f61efe73fff
Code: Unable to access opcode bytes at RIP 0x7f61efe73fd5.
RSP: 002b:00007ffc3371bbe8 EFLAGS: 00000202 ORIG_RAX: 000000000000013c
RAX: ffffffffffffffda RBX: 00007ffc3371bea0 RCX: 00007f61efe73fff
RDX: 00000000ffffff9c RSI: 0000560fbd5d90a0 RDI: 00000000ffffff9c
RBP: 00007ffc3371beb0 R08: 0000000000000001 R09: 0000000000000003
R10: 0000560fbd5d7ad0 R11: 0000000000000202 R12: 0000000000000001
R13: 000000000000005e R14: 00007ffc3371bea0 R15: 00007ffc3371beb0
task:fdm-stress state:D stack: 0 pid:2508243 ppid:2508153 flags:0x00000000
Call Trace:
__schedule+0x5d1/0xcf0
? _raw_spin_unlock_irqrestore+0x3c/0x60
schedule+0x45/0xe0
__reserve_bytes+0x4a4/0xb10 [btrfs]
? finish_wait+0x90/0x90
btrfs_reserve_metadata_bytes+0x29/0x190 [btrfs]
btrfs_block_rsv_add+0x1f/0x50 [btrfs]
start_transaction+0x2d1/0x760 [btrfs]
btrfs_replace_file_extents+0x120/0x930 [btrfs]
? btrfs_fallocate+0xdcf/0x1260 [btrfs]
btrfs_fallocate+0xdfb/0x1260 [btrfs]
? filename_lookup+0xf1/0x180
vfs_fallocate+0x14f/0x440
ioctl_preallocate+0x92/0xc0
do_vfs_ioctl+0x66b/0x750
? __do_sys_newfstat+0x53/0x60
__x64_sys_ioctl+0x62/0xb0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
"""
Fix this by disallowing mmaps from happening while we're doing any of
the fallocate operations on this inode.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A few places we intermix btrfs_inode_lock with a inode_unlock, and some
places we just use inode_lock/inode_unlock instead of btrfs_inode_lock.
None of these places are using this incorrectly, but as we adjust some
of these callers it would be nice to keep everything consistent, so
convert everybody to use btrfs_inode_lock/btrfs_inode_unlock.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmBCOi4ACgkQxWXV+ddt
WDtXvw//TWx3m05qHJqqG8V90uel8hB2J5vd4CA2r62Je1G8RDho57Bo7fyvL4l+
mdCPt+INajb0mpp0IoHMtyLHefojgNOsrX6FAK1/gjnLkjRLFZ3wQqkA34Ue9pNs
2u+rMY6eB105iaS3VejEmiebr++MZfjfQRV+GXU336AEeOEDZdgol8o6jMyde5TO
zRH9Dni5Sy/YAGGAb0vaoG2BMyVigrqkbjkzwjYChbUj/KuyffAgQj0v8BvsC9Y6
DnPD5yrt5kSZzuqQFH7c2jxLN0cvW+tJ0znCpnwn/nmiCALbl6y2a4dmewC32TwJ
II+3OPGpYudafLJEP15qafsJb7LmEfnGwUIrfEZbyb4lQG12uyYOdP3IN7+8td14
fd29GE62w5aErsmurcMFj/x43k4DIfcqC8b+Y+S27JZF1szh7ExCfoYC/6c5e5Qf
j6/6RtRSVqdxImRd0QYv3mCIeSG0CH2UR/1otvC81jRTHRyB3r6TV8wPLo+5K/Rk
ongKZ+BQa5RUk8skdFburhrkDDKgfBcjlexl5Gsqw+D/xTGNAcVnNQrTtW9sTSle
hB3b7CunXA1eCyui2SIqN1dR8hwao4b9RzYNs3y2jWjSPZD/Bp0BdQ8oxSPvIWkX
a8kauFGhKhY2Tdqau+CQ4UbbQWzEB7FulkPCOLiHDDZjyxIvAA4=
=tlU3
-----END PGP SIGNATURE-----
Merge tag 'for-5.12-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"More regression fixes and stabilization.
Regressions:
- zoned mode
- count zone sizes in wider int types
- fix space accounting for read-only block groups
- subpage: fix page tail zeroing
Fixes:
- fix spurious warning when remounting with free space tree
- fix warning when creating a directory with smack enabled
- ioctl checks for qgroup inheritance when creating a snapshot
- qgroup
- fix missing unlock on error path in zero range
- fix amount of released reservation on error
- fix flushing from unsafe context with open transaction,
potentially deadlocking
- minor build warning fixes"
* tag 'for-5.12-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: zoned: do not account freed region of read-only block group as zone_unusable
btrfs: zoned: use sector_t for zone sectors
btrfs: subpage: fix the false data csum mismatch error
btrfs: fix warning when creating a directory with smack enabled
btrfs: don't flush from btrfs_delayed_inode_reserve_metadata
btrfs: export and rename qgroup_reserve_meta
btrfs: free correct amount of space in btrfs_delayed_inode_reserve_metadata
btrfs: fix spurious free_space_tree remount warning
btrfs: validate qgroup inherit for SNAP_CREATE_V2 ioctl
btrfs: unlock extents in btrfs_zero_range in case of quota reservation errors
btrfs: ref-verify: use 'inline void' keyword ordering
If btrfs_qgroup_reserve_data returns an error (i.e quota limit reached)
the handling logic directly goes to the 'out' label without first
unlocking the extent range between lockstart, lockend. This results in
deadlocks as other processes try to lock the same extent.
Fixes: a7f8b1c2ac ("btrfs: file: reserve qgroup space after the hole punch range is locked")
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Rename generic_file_buffered_read to match the naming of filemap_fault,
also update the written parameter to a more descriptive name and improve
the kerneldoc comment.
Link: https://lkml.kernel.org/r/20210122160140.223228-18-willy@infradead.org
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Adjust the final parameter of iomap_dio_rw.
- Add a new flag to request that iomap directio writes return EAGAIN if
the write is not a pure overwrite within EOF; this will be used to
reduce lock contention with unaligned direct writes on XFS.
- Amend XFS' directio code to eliminate exclusive locking for unaligned
direct writes if the circumstances permit
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmAZgQAACgkQ+H93GTRK
tOtNqw/+KPff1NjQVK2k361R0+LjlEHfe2nxh7+kS10IiR5nbBz4Fu+GwEosZKq+
H9ficBbZ0wIveV+5CEt2xZLEJFC4LZUpNPVVrUf8XPLKiVexP/U3wtKzmv9Z7D5J
5walMWQycVeR+ycomynV36giqekvARL7KCQG5By2ITfSNxfnb/wvKhn1d61ZDOF6
f4xzq7F6+cEOrSZt2LcFzGSfsTl6oakYMAomPU57sqGmw7MHRqoPTErbdh2HnVJy
yQ47eiZgSKWKA+Qm+VvHHePYCYnu0nvA2rbNerjTN70hnO8rK9S0Vle6Sp5CUqAX
sXOy8zxOLYKqyM4S/QkIN2TGIyWg+CHiakVLZGF3Q4AUDDYfpD0cHvAe9N3v9euL
qt8ypT8dz2C3qiTg5E31xy033wlAP0wg3FZiLAqEjL5o3fzD+qbplTiSmYbMV2Fb
xuu7a2T6u1MHaIn1IhaL0cB49Fzn+5EMyp6BlAucAOakyuqJCyJiXokdk0Looy5e
jUshvcwWcmHMpI/YYYY6t56KV6tl2exGq5sySY5U6dr8/r5lwc0SI+TrYFG0jTR8
59DGd5CkKgdBFcuys+eaZDXgr7A4ymkVE+pE0QNDz9UwNP20tLb3dQNlhgxchUgu
NgPaFgQkoNM3HmQNyU2wX/t1aFlC/doqSkb/96UWQSxq6IrajMU=
=AR07
-----END PGP SIGNATURE-----
Merge tag 'iomap-5.12-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull iomap updates from Darrick Wong:
"The big change in this cycle is some new code to make it possible for
XFS to try unaligned directio overwrites without taking locks. If the
block is fully written and within EOF (i.e. doesn't require any
further fs intervention) then we can let the unlocked write proceed.
If not, we fall back to synchronizing direct writes.
Summary:
- Adjust the final parameter of iomap_dio_rw.
- Add a new flag to request that iomap directio writes return EAGAIN
if the write is not a pure overwrite within EOF; this will be used
to reduce lock contention with unaligned direct writes on XFS.
- Amend XFS' directio code to eliminate exclusive locking for
unaligned direct writes if the circumstances permit"
* tag 'iomap-5.12-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: reduce exclusive locking on unaligned dio
xfs: split the unaligned DIO write code out
xfs: improve the reflink_bounce_dio_write tracepoint
xfs: simplify the read/write tracepoints
xfs: remove the buffered I/O fallback assert
xfs: cleanup the read/write helper naming
xfs: make xfs_file_aio_write_checks IOCB_NOWAIT-aware
xfs: factor out a xfs_ilock_iocb helper
iomap: add a IOMAP_DIO_OVERWRITE_ONLY flag
iomap: pass a flags argument to iomap_dio_rw
iomap: rename the flags variable in __iomap_dio_rw
Enable zone append writing for zoned mode. When using zone append, a
bio is issued to the start of a target zone and the device decides to
place it inside the zone. Upon completion the device reports the actual
written position back to the host.
Three parts are necessary to enable zone append mode. First, modify the
bio to use REQ_OP_ZONE_APPEND in btrfs_submit_bio_hook() and adjust the
bi_sector to point the beginning of the zone.
Second, record the returned physical address (and disk/partno) to the
ordered extent in end_bio_extent_writepage() after the bio has been
completed. We cannot resolve the physical address to the logical address
because we can neither take locks nor allocate a buffer in this end_bio
context. So, we need to record the physical address to resolve it later
in btrfs_finish_ordered_io().
And finally, rewrite the logical addresses of the extent mapping and
checksum data according to the physical address using btrfs_rmap_block.
If the returned address matches the originally allocated address, we can
skip this rewriting process.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
To support subpage sector size, data also need extra info to make sure
which sectors in a page are uptodate/dirty/...
This patch will make pages for data inodes get btrfs_subpage structure
attached, and detached when the page is freed.
This patch also slightly changes the timing when
set_page_extent_mapped() is called to make sure:
- We have page->mapping set
page->mapping->host is used to grab btrfs_fs_info, thus we can only
call this function after page is mapped to an inode.
One call site attaches pages to inode manually, thus we have to modify
the timing of set_page_extent_mapped() a bit.
- As soon as possible, before other operations
Since memory allocation can fail, we have to do extra error handling.
Calling set_page_extent_mapped() as soon as possible can simply the
error handling for several call sites.
The idea is pretty much the same as iomap_page, but with more bitmaps
for btrfs specific cases.
Currently the plan is to switch iomap if iomap can provide sector
aligned write back (only write back dirty sectors, but not the full
page, data balance require this feature).
So we will stick to btrfs specific bitmap for now.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Often an fsync needs to fallback to a transaction commit for several
reasons (to ensure consistency after a power failure, a new block group
was allocated or a temporary error such as ENOMEM or ENOSPC happened).
In that case the log is marked as needing a full commit and any concurrent
tasks attempting to log inodes or commit the log will also fallback to the
transaction commit. When this happens they all wait for the task that first
started the transaction commit to finish the transaction commit - however
they wait until the full transaction commit happens, which is not needed,
as they only need to wait for the superblocks to be persisted and not for
unpinning all the extents pinned during the transaction's lifetime, which
even for short lived transactions can be a few thousand and take some
significant amount of time to complete - for dbench workloads I have
observed up to 4~5 milliseconds of time spent unpinning extents in the
worst cases, and the number of pinned extents was between 2 to 3 thousand.
So allow fsync tasks to skip waiting for the unpinning of extents when
they call btrfs_commit_transaction() and they were not the task that
started the transaction commit (that one has to do it, the alternative
would be to offload the transaction commit to another task so that it
could avoid waiting for the extent unpinning or offload the extent
unpinning to another task).
This patch is part of a patchset comprised of the following patches:
btrfs: remove unnecessary directory inode item update when deleting dir entry
btrfs: stop setting nbytes when filling inode item for logging
btrfs: avoid logging new ancestor inodes when logging new inode
btrfs: skip logging directories already logged when logging all parents
btrfs: skip logging inodes already logged when logging new entries
btrfs: remove unnecessary check_parent_dirs_for_sync()
btrfs: make concurrent fsyncs wait less when waiting for a transaction commit
After applying the entire patchset, dbench shows improvements in respect
to throughput and latency. The script used to measure it is the following:
$ cat dbench-test.sh
#!/bin/bash
DEV=/dev/sdk
MNT=/mnt/sdk
MOUNT_OPTIONS="-o ssd"
MKFS_OPTIONS="-m single -d single"
echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
umount $DEV &> /dev/null
mkfs.btrfs -f $MKFS_OPTIONS $DEV
mount $MOUNT_OPTIONS $DEV $MNT
dbench -D $MNT -t 300 64
umount $MNT
The test was run on a physical machine with 12 cores (Intel corei7), 64G
of ram, using a NVMe device and a non-debug kernel configuration (Debian's
default configuration).
Before applying patchset, 32 clients:
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 9627107 0.153 61.938
Close 7072076 0.001 3.175
Rename 407633 1.222 44.439
Unlink 1943895 0.658 44.440
Deltree 256 17.339 110.891
Mkdir 128 0.003 0.009
Qpathinfo 8725406 0.064 17.850
Qfileinfo 1529516 0.001 2.188
Qfsinfo 1599884 0.002 1.457
Sfileinfo 784200 0.005 3.562
Find 3373513 0.411 30.312
WriteX 4802132 0.053 29.054
ReadX 15089959 0.002 5.801
LockX 31344 0.002 0.425
UnlockX 31344 0.001 0.173
Flush 674724 5.952 341.830
Throughput 1008.02 MB/sec 32 clients 32 procs max_latency=341.833 ms
After applying patchset, 32 clients:
After patchset, with 32 clients:
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 9931568 0.111 25.597
Close 7295730 0.001 2.171
Rename 420549 0.982 49.714
Unlink 2005366 0.497 39.015
Deltree 256 11.149 89.242
Mkdir 128 0.002 0.014
Qpathinfo 9001863 0.049 20.761
Qfileinfo 1577730 0.001 2.546
Qfsinfo 1650508 0.002 3.531
Sfileinfo 809031 0.005 5.846
Find 3480259 0.309 23.977
WriteX 4952505 0.043 41.283
ReadX 15568127 0.002 5.476
LockX 32338 0.002 0.978
UnlockX 32338 0.001 2.032
Flush 696017 7.485 228.835
Throughput 1049.91 MB/sec 32 clients 32 procs max_latency=228.847 ms
--> +4.1% throughput, -39.6% max latency
Before applying patchset, 64 clients:
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 8956748 0.342 108.312
Close 6579660 0.001 3.823
Rename 379209 2.396 81.897
Unlink 1808625 1.108 131.148
Deltree 256 25.632 172.176
Mkdir 128 0.003 0.018
Qpathinfo 8117615 0.131 55.916
Qfileinfo 1423495 0.001 2.635
Qfsinfo 1488496 0.002 5.412
Sfileinfo 729472 0.007 8.643
Find 3138598 0.855 78.321
WriteX 4470783 0.102 79.442
ReadX 14038139 0.002 7.578
LockX 29158 0.002 0.844
UnlockX 29158 0.001 0.567
Flush 627746 14.168 506.151
Throughput 924.738 MB/sec 64 clients 64 procs max_latency=506.154 ms
After applying patchset, 64 clients:
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 9069003 0.303 43.193
Close 6662328 0.001 3.888
Rename 383976 2.194 46.418
Unlink 1831080 1.022 43.873
Deltree 256 24.037 155.763
Mkdir 128 0.002 0.005
Qpathinfo 8219173 0.137 30.233
Qfileinfo 1441203 0.001 3.204
Qfsinfo 1507092 0.002 4.055
Sfileinfo 738775 0.006 5.431
Find 3177874 0.936 38.170
WriteX 4526152 0.084 39.518
ReadX 14213562 0.002 24.760
LockX 29522 0.002 1.221
UnlockX 29522 0.001 0.694
Flush 635652 14.358 422.039
Throughput 990.13 MB/sec 64 clients 64 procs max_latency=422.043 ms
--> +6.8% throughput, -18.1% max latency
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The original comment is from the initial merge, which has several
problems:
- No holes check any more
- No inline decision is made
Update the out-of-date comment with more correct one.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
First replace all inode instances with a pointer to btrfs_inode. This
removes multiple invocations of the BTRFS_I macro, subsequently remove
2 local variables as they are called only once and simply refer to
them directly.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pass a set of flags to iomap_dio_rw instead of the boolean
wait_for_completion argument. The IOMAP_DIO_FORCE_WAIT flag
replaces the wait_for_completion, but only needs to be passed
when the iocb isn't synchronous to start with to simplify the
callers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
[djwong: rework xfs_file.c so that we can push iomap changes separately]
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
fallocate() is implemented by reserving actual extent instead of
reservations. This can result in exposing the sequential write
constraint of host-managed zoned block devices to the application, which
would break the POSIX semantic for the fallocated file. To avoid this,
report fallocate() as not supported when in ZONED mode for now.
In the future, we may be able to implement "in-memory" fallocate() in
ZONED mode by utilizing space_info->bytes_may_use or similar, so this
returns EOPNOTSUPP.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are several occasions where we do not update the inode's number of
used bytes atomically, resulting in a concurrent stat(2) syscall to report
a value of used blocks that does not correspond to a valid value, that is,
a value that does not match neither what we had before the operation nor
what we get after the operation completes.
In extreme cases it can result in stat(2) reporting zero used blocks, which
can cause problems for some userspace tools where they can consider a file
with a non-zero size and zero used blocks as completely sparse and skip
reading data, as reported/discussed a long time ago in some threads like
the following:
https://lists.gnu.org/archive/html/bug-tar/2016-07/msg00001.html
The cases where this can happen are the following:
-> Case 1
If we do a write (buffered or direct IO) against a file region for which
there is already an allocated extent (or multiple extents), then we have a
short time window where we can report a number of used blocks to stat(2)
that does not take into account the file region being overwritten. This
short time window happens when completing the ordered extent(s).
This happens because when we drop the extents in the write range we
decrement the inode's number of bytes and later on when we insert the new
extent(s) we increment the number of bytes in the inode, resulting in a
short time window where a stat(2) syscall can get an incorrect number of
used blocks.
If we do writes that overwrite an entire file, then we have a short time
window where we report 0 used blocks to stat(2).
Example reproducer:
$ cat reproducer-1.sh
#!/bin/bash
MNT=/mnt/sdi
DEV=/dev/sdi
stat_loop()
{
trap "wait; exit" SIGTERM
local filepath=$1
local expected=$2
local got
while :; do
got=$(stat -c %b $filepath)
if [ $got -ne $expected ]; then
echo -n "ERROR: unexpected used blocks"
echo " (got: $got expected: $expected)"
fi
done
}
mkfs.btrfs -f $DEV > /dev/null
# mkfs.xfs -f $DEV > /dev/null
# mkfs.ext4 -F $DEV > /dev/null
# mkfs.f2fs -f $DEV > /dev/null
# mkfs.reiserfs -f $DEV > /dev/null
mount $DEV $MNT
xfs_io -f -s -c "pwrite -b 64K 0 64K" $MNT/foobar >/dev/null
expected=$(stat -c %b $MNT/foobar)
# Create a process to keep calling stat(2) on the file and see if the
# reported number of blocks used (disk space used) changes, it should
# not because we are not increasing the file size nor punching holes.
stat_loop $MNT/foobar $expected &
loop_pid=$!
for ((i = 0; i < 50000; i++)); do
xfs_io -s -c "pwrite -b 64K 0 64K" $MNT/foobar >/dev/null
done
kill $loop_pid &> /dev/null
wait
umount $DEV
$ ./reproducer-1.sh
ERROR: unexpected used blocks (got: 0 expected: 128)
ERROR: unexpected used blocks (got: 0 expected: 128)
(...)
Note that since this is a short time window where the race can happen, the
reproducer may not be able to always trigger the bug in one run, or it may
trigger it multiple times.
-> Case 2
If we do a buffered write against a file region that does not have any
allocated extents, like a hole or beyond EOF, then during ordered extent
completion we have a short time window where a concurrent stat(2) syscall
can report a number of used blocks that does not correspond to the value
before or after the write operation, a value that is actually larger than
the value after the write completes.
This happens because once we start a buffered write into an unallocated
file range we increment the inode's 'new_delalloc_bytes', to make sure
any stat(2) call gets a correct used blocks value before delalloc is
flushed and completes. However at ordered extent completion, after we
inserted the new extent, we increment the inode's number of bytes used
with the size of the new extent, and only later, when clearing the range
in the inode's iotree, we decrement the inode's 'new_delalloc_bytes'
counter with the size of the extent. So this results in a short time
window where a concurrent stat(2) syscall can report a number of used
blocks that accounts for the new extent twice.
Example reproducer:
$ cat reproducer-2.sh
#!/bin/bash
MNT=/mnt/sdi
DEV=/dev/sdi
stat_loop()
{
trap "wait; exit" SIGTERM
local filepath=$1
local expected=$2
local got
while :; do
got=$(stat -c %b $filepath)
if [ $got -ne $expected ]; then
echo -n "ERROR: unexpected used blocks"
echo " (got: $got expected: $expected)"
fi
done
}
mkfs.btrfs -f $DEV > /dev/null
# mkfs.xfs -f $DEV > /dev/null
# mkfs.ext4 -F $DEV > /dev/null
# mkfs.f2fs -f $DEV > /dev/null
# mkfs.reiserfs -f $DEV > /dev/null
mount $DEV $MNT
touch $MNT/foobar
write_size=$((64 * 1024))
for ((i = 0; i < 16384; i++)); do
offset=$(($i * $write_size))
xfs_io -c "pwrite -S 0xab $offset $write_size" $MNT/foobar >/dev/null
blocks_used=$(stat -c %b $MNT/foobar)
# Fsync the file to trigger writeback and keep calling stat(2) on it
# to see if the number of blocks used changes.
stat_loop $MNT/foobar $blocks_used &
loop_pid=$!
xfs_io -c "fsync" $MNT/foobar
kill $loop_pid &> /dev/null
wait $loop_pid
done
umount $DEV
$ ./reproducer-2.sh
ERROR: unexpected used blocks (got: 265472 expected: 265344)
ERROR: unexpected used blocks (got: 284032 expected: 283904)
(...)
Note that since this is a short time window where the race can happen, the
reproducer may not be able to always trigger the bug in one run, or it may
trigger it multiple times.
-> Case 3
Another case where such problems happen is during other operations that
replace extents in a file range with other extents. Those operations are
extent cloning, deduplication and fallocate's zero range operation.
The cause of the problem is similar to the first case. When we drop the
extents from a range, we decrement the inode's number of bytes, and later
on, after inserting the new extents we increment it. Since this is not
done atomically, a concurrent stat(2) call can see and return a number of
used blocks that is smaller than it should be, does not match the number
of used blocks before or after the clone/deduplication/zero operation.
Like for the first case, when doing a clone, deduplication or zero range
operation against an entire file, we end up having a time window where we
can report 0 used blocks to a stat(2) call.
Example reproducer:
$ cat reproducer-3.sh
#!/bin/bash
MNT=/mnt/sdi
DEV=/dev/sdi
mkfs.btrfs -f $DEV > /dev/null
# mkfs.xfs -f -m reflink=1 $DEV > /dev/null
mount $DEV $MNT
extent_size=$((64 * 1024))
num_extents=16384
file_size=$(($extent_size * $num_extents))
# File foo has many small extents.
xfs_io -f -s -c "pwrite -S 0xab -b $extent_size 0 $file_size" $MNT/foo \
> /dev/null
# File bar has much less extents and has exactly the same data as foo.
xfs_io -f -c "pwrite -S 0xab 0 $file_size" $MNT/bar > /dev/null
expected=$(stat -c %b $MNT/foo)
# Now deduplicate bar into foo. While the deduplication is in progres,
# the number of used blocks/file size reported by stat should not change
xfs_io -c "dedupe $MNT/bar 0 0 $file_size" $MNT/foo > /dev/null &
dedupe_pid=$!
while [ -n "$(ps -p $dedupe_pid -o pid=)" ]; do
used=$(stat -c %b $MNT/foo)
if [ $used -ne $expected ]; then
echo "Unexpected blocks used: $used (expected: $expected)"
fi
done
umount $DEV
$ ./reproducer-3.sh
Unexpected blocks used: 2076800 (expected: 2097152)
Unexpected blocks used: 2097024 (expected: 2097152)
Unexpected blocks used: 2079872 (expected: 2097152)
(...)
Note that since this is a short time window where the race can happen, the
reproducer may not be able to always trigger the bug in one run, or it may
trigger it multiple times.
So fix this by:
1) Making btrfs_drop_extents() not decrement the VFS inode's number of
bytes, and instead return the number of bytes;
2) Making any code that drops extents and adds new extents update the
inode's number of bytes atomically, while holding the btrfs inode's
spinlock, which is also used by the stat(2) callback to get the inode's
number of bytes;
3) For ranges in the inode's iotree that are marked as 'delalloc new',
corresponding to previously unallocated ranges, increment the inode's
number of bytes when clearing the 'delalloc new' bit from the range,
in the same critical section that decrements the inode's
'new_delalloc_bytes' counter, delimited by the btrfs inode's spinlock.
An alternative would be to have btrfs_getattr() wait for any IO (ordered
extents in progress) and locking the whole range (0 to (u64)-1) while it
it computes the number of blocks used. But that would mean blocking
stat(2), which is a very used syscall and expected to be fast, waiting
for writes, clone/dedupe, fallocate, page reads, fiemap, etc.
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are many arguments for __btrfs_drop_extents() and its wrapper
btrfs_drop_extents(), which makes it hard to add more arguments to it and
requires changing every caller. I have added a couple myself back in 2014
commit 1acae57b16 ("Btrfs: faster file extent item replace operations")
and therefore know firsthand that it is a bit cumbersome to add additional
arguments to these functions.
Since I will need to add more arguments in a subsequent bug fix, this
change is preparatory work and adds a data structure that holds all the
arguments, for both input and output, that are passed to this function,
with some comments in the structure's definition mentioning what each
field is and how it relates to other fields.
Callers of this function need only to zero out the content of the
structure and setup only the fields they need. This also removes the
need to have both __btrfs_drop_extents() and btrfs_drop_extents(), so
now we have a single function named btrfs_drop_extents() that takes a
pointer to this new data structure (struct btrfs_drop_extents_args).
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we're using a rw_semaphore we no longer need to indicate if a
lock is blocking or not, nor do we need to flip the entire path from
blocking to spinning. Remove these helpers and all the places they are
called.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The value of super_block::s_blocksize_bits is the same as
fs_info::sectorsize_bits, but we don't need to do the extra dereferences
in many functions and storing the bits as u32 (in fs_info) generates
shorter assembly.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This effectively reverts 09745ff88d93 ("btrfs: dio iomap DSYNC
workaround") now that the iomap API has been updated to allow
iomap_dio_complete() not to be called under i_rwsem anymore.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If direct writes are called with O_DIRECT | O_DSYNC, it will result in a
deadlock because iomap_dio_rw() is called under i_rwsem which calls:
iomap_dio_complete()
generic_write_sync()
btrfs_sync_file()
btrfs_sync_file() requires i_rwsem, so call __iomap_dio_rw() with the
i_rwsem locked, and call iomap_dio_complete() after unlocking i_rwsem.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The inode dio_sem can be eliminated because all DIO synchronization is
now performed through inode->i_rwsem that provides the same guarantees.
This reduces btrfs_inode size by 40 bytes.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Direct writes within EOF are safe to be performed with inode shared lock
to improve parallelization with other direct writes or reads because EOF
is not changed and there is no race with truncate().
Direct reads are already performed under shared inode lock.
This patch is precursor to removing btrfs_inode->dio_sem.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Push inode locking and unlocking closer to where we perform the I/O. For
this we need to move the write checks inside the respective functions as
well.
pos is evaluated after generic_write_checks because O_APPEND can change
iocb->ki_pos.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_inode_lock/unlock() are wrappers around inode locks, separating
the type of lock and actual locking.
- 0 - default, exclusive lock
- BTRFS_ILOCK_SHARED - for shared locks, for possible parallel DIO
- BTRFS_ILOCK_TRY - for the RWF_NOWAIT sequence
The bits SHARED and TRY can be combined together.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_write_check() checks write parameters in one place before
beginning a write. This does away with inode_unlock() after every check.
In the later patches, it will help push inode_lock/unlock() in buffered
and direct write functions respectively.
generic_write_checks needs to be called before as it could truncate
iov_iter and its return used as count.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
fs_info::fs_state is a filesystem bit check as opposed to inode and can
be performed before we begin with write checks. This eliminates inode
lock/unlock in case the error bit is set.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
While we do this, correct the call to pagecache_isize_extended:
- pagecache_isize_extended needs to be called to the start of the write
as opposed to i_size
- we don't need to check range before the call, this is done in the
function
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The read and write DIO don't have anything in common except for the
call to iomap_dio_rw. Extract the write call into a new function to get
rid of conditional statements for direct write.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Set the extent bits EXTENT_NORESERVE inside btrfs_dirty_pages() as
opposed to calling set_extent_bits again later.
Fold check for written length within the function.
Note: EXTENT_NORESERVE is set before unlocking extents.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
round_down looks prettier than the bit mask operations.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
write_bytes can change in btrfs_check_nocow_lock(). Calculate variables
such as num_pages and reserve_bytes once we are sure of the value of
write_bytes so there is no need to re-calculate.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When doing a buffered write, through one of the write family syscalls, we
look for ranges which currently don't have allocated extents and set the
'delalloc new' bit on them, so that we can report a correct number of used
blocks to the stat(2) syscall until delalloc is flushed and ordered extents
complete.
However there are a few other places where we can do a buffered write
against a range that is mapped to a hole (no extent allocated) and where
we do not set the 'new delalloc' bit. Those places are:
- Doing a memory mapped write against a hole;
- Cloning an inline extent into a hole starting at file offset 0;
- Calling btrfs_cont_expand() when the i_size of the file is not aligned
to the sector size and is located in a hole. For example when cloning
to a destination offset beyond EOF.
So after such cases, until the corresponding delalloc range is flushed and
the respective ordered extents complete, we can report an incorrect number
of blocks used through the stat(2) syscall.
In some cases we can end up reporting 0 used blocks to stat(2), which is a
particular bad value to report as it may mislead tools to think a file is
completely sparse when its i_size is not zero, making them skip reading
any data, an undesired consequence for tools such as archivers and other
backup tools, as reported a long time ago in the following thread (and
other past threads):
https://lists.gnu.org/archive/html/bug-tar/2016-07/msg00001.html
Example reproducer:
$ cat reproducer.sh
#!/bin/bash
MNT=/mnt/sdi
DEV=/dev/sdi
mkfs.btrfs -f $DEV > /dev/null
# mkfs.xfs -f $DEV > /dev/null
# mkfs.ext4 -F $DEV > /dev/null
# mkfs.f2fs -f $DEV > /dev/null
mount $DEV $MNT
xfs_io -f -c "truncate 64K" \
-c "mmap -w 0 64K" \
-c "mwrite -S 0xab 0 64K" \
-c "munmap" \
$MNT/foo
blocks_used=$(stat -c %b $MNT/foo)
echo "blocks used: $blocks_used"
if [ $blocks_used -eq 0 ]; then
echo "ERROR: blocks used is 0"
fi
umount $DEV
$ ./reproducer.sh
blocks used: 0
ERROR: blocks used is 0
So move the logic that decides to set the 'delalloc bit' bit into the
function btrfs_set_extent_delalloc(), since that is what we use for all
those missing cases as well as for the cases that currently work well.
This change is also preparatory work for an upcoming patch that fixes
other problems related to tracking and reporting the number of bytes used
by an inode.
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since we switched to the iomap infrastructure in b5ff9f1a96e8f ("btrfs:
switch to iomap for direct IO") we're calling generic_file_buffered_read()
directly and not via generic_file_read_iter() anymore.
If the read could read everything there is no need to bother calling
generic_file_buffered_read(), like it is handled in
generic_file_read_iter().
If we call generic_file_buffered_read() in this case we can hit a
situation where we do an invalid readahead and cause this UBSAN splat
in fstest generic/091:
run fstests generic/091 at 2020-10-21 10:52:32
================================================================================
UBSAN: shift-out-of-bounds in ./include/linux/log2.h:57:13
shift exponent 64 is too large for 64-bit type 'long unsigned int'
CPU: 0 PID: 656 Comm: fsx Not tainted 5.9.0-rc7+ #821
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4-rebuilt.opensuse.org 04/01/2014
Call Trace:
__dump_stack lib/dump_stack.c:77
dump_stack+0x57/0x70 lib/dump_stack.c:118
ubsan_epilogue+0x5/0x40 lib/ubsan.c:148
__ubsan_handle_shift_out_of_bounds.cold+0x61/0xe9 lib/ubsan.c:395
__roundup_pow_of_two ./include/linux/log2.h:57
get_init_ra_size mm/readahead.c:318
ondemand_readahead.cold+0x16/0x2c mm/readahead.c:530
generic_file_buffered_read+0x3ac/0x840 mm/filemap.c:2199
call_read_iter ./include/linux/fs.h:1876
new_sync_read+0x102/0x180 fs/read_write.c:415
vfs_read+0x11c/0x1a0 fs/read_write.c:481
ksys_read+0x4f/0xc0 fs/read_write.c:615
do_syscall_64+0x33/0x40 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9 arch/x86/entry/entry_64.S:118
RIP: 0033:0x7fe87fee992e
RSP: 002b:00007ffe01605278 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
RAX: ffffffffffffffda RBX: 000000000004f000 RCX: 00007fe87fee992e
RDX: 0000000000004000 RSI: 0000000001677000 RDI: 0000000000000003
RBP: 000000000004f000 R08: 0000000000004000 R09: 000000000004f000
R10: 0000000000053000 R11: 0000000000000246 R12: 0000000000004000
R13: 0000000000000000 R14: 000000000007a120 R15: 0000000000000000
================================================================================
BTRFS info (device nullb0): has skinny extents
BTRFS info (device nullb0): ZONED mode enabled, zone size 268435456 B
BTRFS info (device nullb0): enabling ssd optimizations
Fixes: f85781fb50 ("btrfs: switch to iomap for direct IO")
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 8d875f95da ("btrfs: disable strict file flushes for
renames and truncates") eliminated the notion of ordered operations and
instead BTRFS_INODE_ORDERED_DATA_CLOSE only remained as a flag
indicating that a file's content should be synced to disk in case a
file is truncated and any writes happen to it concurrently. In fact
this intendend behavior was broken until it was fixed in
f6dc45c7a9 ("Btrfs: fix filemap_flush call in btrfs_file_release").
All things considered let's give the flag a more descriptive name. Also
slightly reword comments.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The passed in ordered_extent struct is always well-formed and contains
the inode making the explicit argument redundant.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
That parameter can easily be derived based on the "data_size" and "nr"
parameters exploit this fact to simply the function's signature. No
functional changes.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The value of this argument can be derived from the total_data as it's
simply the value of the data size + size of btrfs_items being touched.
Move the parameter calculation inside the function. This results in a
simpler interface and also a minor size reduction:
./scripts/bloat-o-meter ctree.original fs/btrfs/ctree.o
add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-34 (-34)
Function old new delta
btrfs_duplicate_item 260 259 -1
setup_items_for_insert 1200 1190 -10
btrfs_insert_empty_items 177 154 -23
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we use the same mechanism to replace all the extents in a file
range with either a hole, an existing extent (when cloning) or a new
extent (when using fallocate), the name of btrfs_insert_clone_extent()
no longer reflects its genericity.
So rename it to btrfs_insert_replace_extent(), since what it does is
to either insert an existing extent or a new extent into a file range.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function btrfs_punch_hole_range() is now used to replace all the file
extents in a given file range with an extent described in the given struct
btrfs_replace_extent_info argument. This extent can either be an existing
extent that is being cloned or it can be a new extent (namely a prealloc
extent). When that argument is NULL it only punches a hole (drops all the
existing extents) in the file range.
So rename the function to btrfs_replace_file_extents().
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we can use btrfs_clone_extent_info to convey information for a
new prealloc extent as well, and not just for existing extents that are
being cloned, rename it to btrfs_replace_extent_info, which reflects the
fact that this is now more generic and it is used to replace all existing
extents in a file range with the extent described by the structure.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The value of item_size of struct btrfs_clone_extent_info is always set to
the size of a non-inline file extent item, and in fact the infrastructure
that uses this structure (btrfs_punch_hole_range()) does not work with
inline file extents at all (and it is not supposed to).
So just remove that field from the structure and use directly
sizeof(struct btrfs_file_extent_item) instead. Also assert that the
file extent type is not inline at btrfs_insert_clone_extent().
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When doing an fallocate(), specially a zero range operation, we assume
that reserving 3 units of metadata space is enough, that at most we touch
one leaf in subvolume/fs tree for removing existing file extent items and
inserting a new file extent item. This assumption is generally true for
most common use cases. However when we end up needing to remove file extent
items from multiple leaves, we can end up failing with -ENOSPC and abort
the current transaction, turning the filesystem to RO mode. When this
happens a stack trace like the following is dumped in dmesg/syslog:
[ 1500.620934] ------------[ cut here ]------------
[ 1500.620938] BTRFS: Transaction aborted (error -28)
[ 1500.620973] WARNING: CPU: 2 PID: 30807 at fs/btrfs/inode.c:9724 __btrfs_prealloc_file_range+0x512/0x570 [btrfs]
[ 1500.620974] Modules linked in: btrfs intel_rapl_msr intel_rapl_common kvm_intel (...)
[ 1500.621010] CPU: 2 PID: 30807 Comm: xfs_io Tainted: G W 5.9.0-rc3-btrfs-next-67 #1
[ 1500.621012] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
[ 1500.621023] RIP: 0010:__btrfs_prealloc_file_range+0x512/0x570 [btrfs]
[ 1500.621026] Code: 8b 40 50 f0 48 (...)
[ 1500.621028] RSP: 0018:ffffb05fc8803ca0 EFLAGS: 00010286
[ 1500.621030] RAX: 0000000000000000 RBX: ffff9608af276488 RCX: 0000000000000000
[ 1500.621032] RDX: 0000000000000001 RSI: 0000000000000027 RDI: 00000000ffffffff
[ 1500.621033] RBP: ffffb05fc8803d90 R08: 0000000000000001 R09: 0000000000000001
[ 1500.621035] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000003200000
[ 1500.621037] R13: 00000000ffffffe4 R14: ffff9608af275fe8 R15: ffff9608af275f60
[ 1500.621039] FS: 00007fb5b2368ec0(0000) GS:ffff9608b6600000(0000) knlGS:0000000000000000
[ 1500.621041] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1500.621043] CR2: 00007fb5b2366fb8 CR3: 0000000202d38005 CR4: 00000000003706e0
[ 1500.621046] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1500.621047] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1500.621049] Call Trace:
[ 1500.621076] btrfs_prealloc_file_range+0x10/0x20 [btrfs]
[ 1500.621087] btrfs_fallocate+0xccd/0x1280 [btrfs]
[ 1500.621108] vfs_fallocate+0x14d/0x290
[ 1500.621112] ksys_fallocate+0x3a/0x70
[ 1500.621117] __x64_sys_fallocate+0x1a/0x20
[ 1500.621120] do_syscall_64+0x33/0x80
[ 1500.621123] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1500.621126] RIP: 0033:0x7fb5b248c477
[ 1500.621128] Code: 89 7c 24 08 (...)
[ 1500.621130] RSP: 002b:00007ffc7bee9060 EFLAGS: 00000293 ORIG_RAX: 000000000000011d
[ 1500.621132] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fb5b248c477
[ 1500.621134] RDX: 0000000000000000 RSI: 0000000000000010 RDI: 0000000000000003
[ 1500.621136] RBP: 0000557718faafd0 R08: 0000000000000000 R09: 0000000000000000
[ 1500.621137] R10: 0000000003200000 R11: 0000000000000293 R12: 0000000000000010
[ 1500.621139] R13: 0000557718faafb0 R14: 0000557718faa480 R15: 0000000000000003
[ 1500.621151] irq event stamp: 1026217
[ 1500.621154] hardirqs last enabled at (1026223): [<ffffffffba965570>] console_unlock+0x500/0x5c0
[ 1500.621156] hardirqs last disabled at (1026228): [<ffffffffba9654c7>] console_unlock+0x457/0x5c0
[ 1500.621159] softirqs last enabled at (1022486): [<ffffffffbb6003dc>] __do_softirq+0x3dc/0x606
[ 1500.621161] softirqs last disabled at (1022477): [<ffffffffbb4010b2>] asm_call_on_stack+0x12/0x20
[ 1500.621162] ---[ end trace 2955b08408d8b9d4 ]---
[ 1500.621167] BTRFS: error (device sdj) in __btrfs_prealloc_file_range:9724: errno=-28 No space left
When we use fallocate() internally, for reserving an extent for a space
cache, inode cache or relocation, we can't hit this problem since either
there aren't any file extent items to remove from the subvolume tree or
there is at most one.
When using plain fallocate() it's very unlikely, since that would require
having many file extent items representing holes for the target range and
crossing multiple leafs - we attempt to increase the range (merge) of such
file extent items when punching holes, so at most we end up with 2 file
extent items for holes at leaf boundaries.
However when using the zero range operation of fallocate() for a large
range (100+ MiB for example) that's fairly easy to trigger. The following
example reproducer triggers the issue:
$ cat reproducer.sh
#!/bin/bash
umount /dev/sdj &> /dev/null
mkfs.btrfs -f -n 16384 -O ^no-holes /dev/sdj > /dev/null
mount /dev/sdj /mnt/sdj
# Create a 100M file with many file extent items. Punch a hole every 8K
# just to speedup the file creation - we could do 4K sequential writes
# followed by fsync (or O_SYNC) as well, but that takes a lot of time.
file_size=$((100 * 1024 * 1024))
xfs_io -f -c "pwrite -S 0xab -b 10M 0 $file_size" /mnt/sdj/foobar
for ((i = 0; i < $file_size; i += 8192)); do
xfs_io -c "fpunch $i 4096" /mnt/sdj/foobar
done
# Force a transaction commit, so the zero range operation will be forced
# to COW all metadata extents it need to touch.
sync
xfs_io -c "fzero 0 $file_size" /mnt/sdj/foobar
umount /mnt/sdj
$ ./reproducer.sh
wrote 104857600/104857600 bytes at offset 0
100 MiB, 10 ops; 0.0669 sec (1.458 GiB/sec and 149.3117 ops/sec)
fallocate: No space left on device
$ dmesg
<shows the same stack trace pasted before>
To fix this use the existing infrastructure that hole punching and
extent cloning use for replacing a file range with another extent. This
deals with doing the removal of file extent items and inserting the new
one using an incremental approach, reserving more space when needed and
always ensuring we don't leave an implicit hole in the range in case
we need to do multiple iterations and a crash happens between iterations.
A test case for fstests will follow up soon.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's counterintuitive to have a function named btrfs_inode_xxx which
takes a generic inode. Also move the function to btrfs_inode.h so that
it has access to the definition of struct btrfs_inode.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
iomap dio will run generic_write_sync() for us if the iocb is DSYNC.
This is problematic for us because of 2 reasons:
1. we hold the inode_lock() during this operation, and we take it in
generic_write_sync()
2. we hold a read lock on the dio_sem but take the write lock in fsync
Since we don't want to rip out this code right now, but reworking the
locking is a bit much to do at this point, work around this problem with
this masterpiece of a patch.
First, we clear DSYNC on the iocb so that the iomap stuff doesn't know
that it needs to handle the sync. We save this fact in
current->journal_info, because we need to see do special things once
we're in iomap_begin, and we have no way to pass private information
into iomap_dio_rw().
Next we specify a separate iomap_dio_ops for sync, which implements an
->end_io() callback that gets called when the dio completes. This is
important for AIO, because we really do need to run generic_write_sync()
if we complete asynchronously. However if we're still in the submitting
context when we enter ->end_io() we clear the flag so that the submitter
knows they're the ones that needs to run generic_write_sync().
This is meant to be temporary. We need to work out how to eliminate the
inode_lock() and the dio_sem in our fsync and use another mechanism to
protect these operations.
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We're using direct io implementation based on buffer heads. This patch
switches to the new iomap infrastructure.
Switch from __blockdev_direct_IO() to iomap_dio_rw(). Rename
btrfs_get_blocks_direct() to btrfs_dio_iomap_begin() and use it as
iomap_begin() for iomap direct I/O functions. This function allocates
and locks all the blocks required for the I/O. btrfs_submit_direct() is
used as the submit_io() hook for direct I/O ops.
Since we need direct I/O reads to go through iomap_dio_rw(), we change
file_operations.read_iter() to a btrfs_file_read_iter() which calls
btrfs_direct_IO() for direct reads and falls back to
generic_file_buffered_read() for incomplete reads and buffered reads.
We don't need address_space.direct_IO() anymore: set it to noop.
Similarly, we don't need flags used in __blockdev_direct_IO(). iomap is
capable of direct I/O reads from a hole, so we don't need to return
-ENOENT.
Btrfs direct I/O is now done under i_rwsem, shared in case of reads and
exclusive in case of writes. This guards against simultaneous truncates.
Use iomap->iomap_end() to check for failed or incomplete direct I/O:
- for writes, call __endio_write_update_ordered()
- for reads, unlock extents
btrfs_dio_data is now hooked in iomap->private and not
current->journal_info. It carries the reservation variable and the
amount of data submitted, so we can calculate the amount of data to call
__endio_write_update_ordered in case of an error.
This patch removes last use of struct buffer_head from btrfs.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently regardless of a full or a fast fsync we always wait for ordered
extents to complete, and then start logging the inode after that. However
for fast fsyncs we can just wait for the writeback to complete, we don't
need to wait for the ordered extents to complete since we use the list of
modified extents maps to figure out which extents we must log and we can
get their checksums directly from the ordered extents that are still in
flight, otherwise look them up from the checksums tree.
Until commit b5e6c3e170 ("btrfs: always wait on ordered extents at
fsync time"), for fast fsyncs, we used to start logging without even
waiting for the writeback to complete first, we would wait for it to
complete after logging, while holding a transaction open, which lead to
performance issues when using cgroups and probably for other cases too,
as wait for IO while holding a transaction handle should be avoided as
much as possible. After that, for fast fsyncs, we started to wait for
ordered extents to complete before starting to log, which adds some
latency to fsyncs and we even got at least one report about a performance
drop which bisected to that particular change:
https://lore.kernel.org/linux-btrfs/20181109215148.GF23260@techsingularity.net/
This change makes fast fsyncs only wait for writeback to finish before
starting to log the inode, instead of waiting for both the writeback to
finish and for the ordered extents to complete. This brings back part of
the logic we had that extracts checksums from in flight ordered extents,
which are not yet in the checksums tree, and making sure transaction
commits wait for the completion of ordered extents previously logged
(by far most of the time they have already completed by the time a
transaction commit starts, resulting in no wait at all), to avoid any
data loss if an ordered extent completes after the transaction used to
log an inode is committed, followed by a power failure.
When there are no other tasks accessing the checksums and the subvolume
btrees, the ordered extent completion is pretty fast, typically taking
100 to 200 microseconds only in my observations. However when there are
other tasks accessing these btrees, ordered extent completion can take a
lot more time due to lock contention on nodes and leaves of these btrees.
I've seen cases over 2 milliseconds, which starts to be significant. In
particular when we do have concurrent fsyncs against different files there
is a lot of contention on the checksums btree, since we have many tasks
writing the checksums into the btree and other tasks that already started
the logging phase are doing lookups for checksums in the btree.
This change also turns all ranged fsyncs into full ranged fsyncs, which
is something we already did when not using the NO_HOLES features or when
doing a full fsync. This is to guarantee we never miss checksums due to
writeback having been triggered only for a part of an extent, and we end
up logging the full extent but only checksums for the written range, which
results in missing checksums after log replay. Allowing ranged fsyncs to
operate again only in the original range, when using the NO_HOLES feature
and doing a fast fsync is doable but requires some non trivial changes to
the writeback path, which can always be worked on later if needed, but I
don't think they are a very common use case.
Several tests were performed using fio for different numbers of concurrent
jobs, each writing and fsyncing its own file, for both sequential and
random file writes. The tests were run on bare metal, no virtualization,
on a box with 12 cores (Intel i7-8700), 64Gb of RAM and a NVMe device,
with a kernel configuration that is the default of typical distributions
(debian in this case), without debug options enabled (kasan, kmemleak,
slub debug, debug of page allocations, lock debugging, etc).
The following script that calls fio was used:
$ cat test-fsync.sh
#!/bin/bash
DEV=/dev/nvme0n1
MNT=/mnt/btrfs
MOUNT_OPTIONS="-o ssd -o space_cache=v2"
MKFS_OPTIONS="-d single -m single"
if [ $# -ne 5 ]; then
echo "Use $0 NUM_JOBS FILE_SIZE FSYNC_FREQ BLOCK_SIZE [write|randwrite]"
exit 1
fi
NUM_JOBS=$1
FILE_SIZE=$2
FSYNC_FREQ=$3
BLOCK_SIZE=$4
WRITE_MODE=$5
if [ "$WRITE_MODE" != "write" ] && [ "$WRITE_MODE" != "randwrite" ]; then
echo "Invalid WRITE_MODE, must be 'write' or 'randwrite'"
exit 1
fi
cat <<EOF > /tmp/fio-job.ini
[writers]
rw=$WRITE_MODE
fsync=$FSYNC_FREQ
fallocate=none
group_reporting=1
direct=0
bs=$BLOCK_SIZE
ioengine=sync
size=$FILE_SIZE
directory=$MNT
numjobs=$NUM_JOBS
EOF
echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
echo
echo "Using config:"
echo
cat /tmp/fio-job.ini
echo
umount $MNT &> /dev/null
mkfs.btrfs -f $MKFS_OPTIONS $DEV
mount $MOUNT_OPTIONS $DEV $MNT
fio /tmp/fio-job.ini
umount $MNT
The results were the following:
*************************
*** sequential writes ***
*************************
==== 1 job, 8GiB file, fsync frequency 1, block size 64KiB ====
Before patch:
WRITE: bw=36.6MiB/s (38.4MB/s), 36.6MiB/s-36.6MiB/s (38.4MB/s-38.4MB/s), io=8192MiB (8590MB), run=223689-223689msec
After patch:
WRITE: bw=40.2MiB/s (42.1MB/s), 40.2MiB/s-40.2MiB/s (42.1MB/s-42.1MB/s), io=8192MiB (8590MB), run=203980-203980msec
(+9.8%, -8.8% runtime)
==== 2 jobs, 4GiB files, fsync frequency 1, block size 64KiB ====
Before patch:
WRITE: bw=35.8MiB/s (37.5MB/s), 35.8MiB/s-35.8MiB/s (37.5MB/s-37.5MB/s), io=8192MiB (8590MB), run=228950-228950msec
After patch:
WRITE: bw=43.5MiB/s (45.6MB/s), 43.5MiB/s-43.5MiB/s (45.6MB/s-45.6MB/s), io=8192MiB (8590MB), run=188272-188272msec
(+21.5% throughput, -17.8% runtime)
==== 4 jobs, 2GiB files, fsync frequency 1, block size 64KiB ====
Before patch:
WRITE: bw=50.1MiB/s (52.6MB/s), 50.1MiB/s-50.1MiB/s (52.6MB/s-52.6MB/s), io=8192MiB (8590MB), run=163446-163446msec
After patch:
WRITE: bw=64.5MiB/s (67.6MB/s), 64.5MiB/s-64.5MiB/s (67.6MB/s-67.6MB/s), io=8192MiB (8590MB), run=126987-126987msec
(+28.7% throughput, -22.3% runtime)
==== 8 jobs, 1GiB files, fsync frequency 1, block size 64KiB ====
Before patch:
WRITE: bw=64.0MiB/s (68.1MB/s), 64.0MiB/s-64.0MiB/s (68.1MB/s-68.1MB/s), io=8192MiB (8590MB), run=126075-126075msec
After patch:
WRITE: bw=86.8MiB/s (91.0MB/s), 86.8MiB/s-86.8MiB/s (91.0MB/s-91.0MB/s), io=8192MiB (8590MB), run=94358-94358msec
(+35.6% throughput, -25.2% runtime)
==== 16 jobs, 512MiB files, fsync frequency 1, block size 64KiB ====
Before patch:
WRITE: bw=79.8MiB/s (83.6MB/s), 79.8MiB/s-79.8MiB/s (83.6MB/s-83.6MB/s), io=8192MiB (8590MB), run=102694-102694msec
After patch:
WRITE: bw=107MiB/s (112MB/s), 107MiB/s-107MiB/s (112MB/s-112MB/s), io=8192MiB (8590MB), run=76446-76446msec
(+34.1% throughput, -25.6% runtime)
==== 32 jobs, 512MiB files, fsync frequency 1, block size 64KiB ====
Before patch:
WRITE: bw=93.2MiB/s (97.7MB/s), 93.2MiB/s-93.2MiB/s (97.7MB/s-97.7MB/s), io=16.0GiB (17.2GB), run=175836-175836msec
After patch:
WRITE: bw=111MiB/s (117MB/s), 111MiB/s-111MiB/s (117MB/s-117MB/s), io=16.0GiB (17.2GB), run=147001-147001msec
(+19.1% throughput, -16.4% runtime)
==== 64 jobs, 512MiB files, fsync frequency 1, block size 64KiB ====
Before patch:
WRITE: bw=108MiB/s (114MB/s), 108MiB/s-108MiB/s (114MB/s-114MB/s), io=32.0GiB (34.4GB), run=302656-302656msec
After patch:
WRITE: bw=133MiB/s (140MB/s), 133MiB/s-133MiB/s (140MB/s-140MB/s), io=32.0GiB (34.4GB), run=246003-246003msec
(+23.1% throughput, -18.7% runtime)
************************
*** random writes ***
************************
==== 1 job, 8GiB file, fsync frequency 16, block size 4KiB ====
Before patch:
WRITE: bw=11.5MiB/s (12.0MB/s), 11.5MiB/s-11.5MiB/s (12.0MB/s-12.0MB/s), io=8192MiB (8590MB), run=714281-714281msec
After patch:
WRITE: bw=11.6MiB/s (12.2MB/s), 11.6MiB/s-11.6MiB/s (12.2MB/s-12.2MB/s), io=8192MiB (8590MB), run=705959-705959msec
(+0.9% throughput, -1.7% runtime)
==== 2 jobs, 4GiB files, fsync frequency 16, block size 4KiB ====
Before patch:
WRITE: bw=12.8MiB/s (13.5MB/s), 12.8MiB/s-12.8MiB/s (13.5MB/s-13.5MB/s), io=8192MiB (8590MB), run=638101-638101msec
After patch:
WRITE: bw=13.1MiB/s (13.7MB/s), 13.1MiB/s-13.1MiB/s (13.7MB/s-13.7MB/s), io=8192MiB (8590MB), run=625374-625374msec
(+2.3% throughput, -2.0% runtime)
==== 4 jobs, 2GiB files, fsync frequency 16, block size 4KiB ====
Before patch:
WRITE: bw=15.4MiB/s (16.2MB/s), 15.4MiB/s-15.4MiB/s (16.2MB/s-16.2MB/s), io=8192MiB (8590MB), run=531146-531146msec
After patch:
WRITE: bw=17.8MiB/s (18.7MB/s), 17.8MiB/s-17.8MiB/s (18.7MB/s-18.7MB/s), io=8192MiB (8590MB), run=460431-460431msec
(+15.6% throughput, -13.3% runtime)
==== 8 jobs, 1GiB files, fsync frequency 16, block size 4KiB ====
Before patch:
WRITE: bw=19.9MiB/s (20.8MB/s), 19.9MiB/s-19.9MiB/s (20.8MB/s-20.8MB/s), io=8192MiB (8590MB), run=412664-412664msec
After patch:
WRITE: bw=22.2MiB/s (23.3MB/s), 22.2MiB/s-22.2MiB/s (23.3MB/s-23.3MB/s), io=8192MiB (8590MB), run=368589-368589msec
(+11.6% throughput, -10.7% runtime)
==== 16 jobs, 512MiB files, fsync frequency 16, block size 4KiB ====
Before patch:
WRITE: bw=29.3MiB/s (30.7MB/s), 29.3MiB/s-29.3MiB/s (30.7MB/s-30.7MB/s), io=8192MiB (8590MB), run=279924-279924msec
After patch:
WRITE: bw=30.4MiB/s (31.9MB/s), 30.4MiB/s-30.4MiB/s (31.9MB/s-31.9MB/s), io=8192MiB (8590MB), run=269258-269258msec
(+3.8% throughput, -3.8% runtime)
==== 32 jobs, 512MiB files, fsync frequency 16, block size 4KiB ====
Before patch:
WRITE: bw=36.9MiB/s (38.7MB/s), 36.9MiB/s-36.9MiB/s (38.7MB/s-38.7MB/s), io=16.0GiB (17.2GB), run=443581-443581msec
After patch:
WRITE: bw=41.6MiB/s (43.6MB/s), 41.6MiB/s-41.6MiB/s (43.6MB/s-43.6MB/s), io=16.0GiB (17.2GB), run=394114-394114msec
(+12.7% throughput, -11.2% runtime)
==== 64 jobs, 512MiB files, fsync frequency 16, block size 4KiB ====
Before patch:
WRITE: bw=45.9MiB/s (48.1MB/s), 45.9MiB/s-45.9MiB/s (48.1MB/s-48.1MB/s), io=32.0GiB (34.4GB), run=714614-714614msec
After patch:
WRITE: bw=48.8MiB/s (51.1MB/s), 48.8MiB/s-48.8MiB/s (51.1MB/s-51.1MB/s), io=32.0GiB (34.4GB), run=672087-672087msec
(+6.3% throughput, -6.0% runtime)
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We're just doing rounding up to sectorsize to calculate the lockend.
There is no need to do the unnecessary length calculation, just direct
round_up() is enough.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl9D5EkACgkQxWXV+ddt
WDto+g/6A/2QzxhgOmqqHTiDvn3DkL60XfjB6lmq3NEvinrST+VH20EoX/EuX2Kn
u2+gMiWrgBUwlvERkoSasxdJf/6dCCc+9zYDjjKkAxCckENT85Np71o3iEc7Z5z+
LFgS26mt6aYlCCHyIsHutzHtK2MKiUz7/oaUYZMJBHHkKS/5hL1mzIbwiWAqfU2H
q0iMz9L2mjp1kZnpwa/yhg/NJ/oGZsKm3UPGDhdc0RlCWHBbDXHFk1wvNRo/yKQW
l+yy0dh6PAZ45pRL0/WZwvOzcAglb+uSmwa64UOvwio4Na9P7oAcBzTFmtbBtvP4
WBrOUPCTzkvgQcmoAsWFpD4nrzgW4oS71EICTOIRlPx7A86TP3wYpFEygUlLCoZC
Pd4e9mPClmW78hcRT12eJeGcJIzgoKWhR8597jNUEYz3R5T2wKHOcNnq9a1E1PLv
zR+5MFShsylUHd7HbMC1O86XnfXe5esegNQMvx36kTS+cR9Dyt5EWMNIAYK4BPM3
/tXWZRqlZPOh3T7DZ4QR5oSSDDNq7ROTdv9jmsleno+woG0MNDYsA7jCbeJnGTmI
CtTUP+p41otyM2lFZjV8PG/XyXDKb3UfU5gcsDOZdGP5S0tkyBiKSA6eqhz6DaTi
fQOLGZdkNpNN/burbMq7d7YEHr3F6LC17U3L4k5V4MTAm2lp7ZQ=
=ONgI
-----END PGP SIGNATURE-----
Merge tag 'for-5.9-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix swapfile activation on subvolumes with deleted snapshots
- error value mixup when removing directory entries from tree log
- fix lzo compression level reset after previous level setting
- fix space cache memory leak after transaction abort
- fix const function attribute
- more error handling improvements
* tag 'for-5.9-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: detect nocow for swap after snapshot delete
btrfs: check the right error variable in btrfs_del_dir_entries_in_log
btrfs: fix space cache memory leak after transaction abort
btrfs: use the correct const function attribute for btrfs_get_num_csums
btrfs: reset compression level for lzo on remount
btrfs: handle errors from async submission
can_nocow_extent and btrfs_cross_ref_exist both rely on a heuristic for
detecting a must cow condition which is not exactly accurate, but saves
unnecessary tree traversal. The incorrect assumption is that if the
extent was created in a generation smaller than the last snapshot
generation, it must be referenced by that snapshot. That is true, except
the snapshot could have since been deleted, without affecting the last
snapshot generation.
The original patch claimed a performance win from this check, but it
also leads to a bug where you are unable to use a swapfile if you ever
snapshotted the subvolume it's in. Make the check slower and more strict
for the swapon case, without modifying the general cow checks as a
compromise. Turning swap on does not seem to be a particularly
performance sensitive operation, so incurring a possibly unnecessary
btrfs_search_slot seems worthwhile for the added usability.
Note: Until the snapshot is competely cleaned after deletion,
check_committed_refs will still cause the logic to think that cow is
necessary, so the user must until 'btrfs subvolu sync' finished before
activating the swapfile swapon.
CC: stable@vger.kernel.org # 5.4+
Suggested-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl8m7asQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgplrCD/0S17kio+k4cOJDGwl88WoJw+QiYmM5019k
decZ1JymQvV1HXRmlcZiEAu0hHDD0FoovSRrw7II3gw3GouETmYQM62f6ZTpDeMD
CED/fidnfULAkPaI6h+bj3jyI0cEuujG/R47rGSQEkIIr3RttqKZUzVkB9KN+KMw
+OBuXZtMIoFFEVJ91qwC2dm2qHLqOn1/5MlT59knso/xbPOYOXsFQpGiACJqF97x
6qSSI8uGE+HZqvL2OLWPDBbLEJhrq+dzCgxln5VlvLele4UcRhOdonUb7nUwEKCe
zwvtXzz16u1D1b8bJL4Kg5bGqyUAQUCSShsfBJJxh6vTTULiHyCX5sQaai1OEB16
4dpBL9E+nOUUix4wo9XBY0/KIYaPWg5L1CoEwkAXqkXPhFvNUucsC0u6KvmzZR3V
1OogVTjl6GhS8uEVQjTKNshkTIC9QHEMXDUOHtINDCb/sLU+ANXU5UpvsuzZ9+kt
KGc4mdyCwaKBq4YW9sVwhhq/RHLD4AUtWZiUVfOE+0cltCLJUNMbQsJ+XrcYaQnm
W4zz22Rep+SJuQNVcCW/w7N2zN3yB6gC1qeroSLvzw4b5el2TdFp+BcgVlLHK+uh
xjsGNCq++fyzNk7vvMZ5hVq4JGXYjza7AiP5HlQ8nqdiPUKUPatWCBqUm9i9Cz/B
n+0dlYbRwQ==
=2vmy
-----END PGP SIGNATURE-----
Merge tag 'for-5.9/io_uring-20200802' of git://git.kernel.dk/linux-block
Pull io_uring updates from Jens Axboe:
"Lots of cleanups in here, hardening the code and/or making it easier
to read and fixing bugs, but a core feature/change too adding support
for real async buffered reads. With the latter in place, we just need
buffered write async support and we're done relying on kthreads for
the fast path. In detail:
- Cleanup how memory accounting is done on ring setup/free (Bijan)
- sq array offset calculation fixup (Dmitry)
- Consistently handle blocking off O_DIRECT submission path (me)
- Support proper async buffered reads, instead of relying on kthread
offload for that. This uses the page waitqueue to drive retries
from task_work, like we handle poll based retry. (me)
- IO completion optimizations (me)
- Fix race with accounting and ring fd install (me)
- Support EPOLLEXCLUSIVE (Jiufei)
- Get rid of the io_kiocb unionizing, made possible by shrinking
other bits (Pavel)
- Completion side cleanups (Pavel)
- Cleanup REQ_F_ flags handling, and kill off many of them (Pavel)
- Request environment grabbing cleanups (Pavel)
- File and socket read/write cleanups (Pavel)
- Improve kiocb_set_rw_flags() (Pavel)
- Tons of fixes and cleanups (Pavel)
- IORING_SQ_NEED_WAKEUP clear fix (Xiaoguang)"
* tag 'for-5.9/io_uring-20200802' of git://git.kernel.dk/linux-block: (127 commits)
io_uring: flip if handling after io_setup_async_rw
fs: optimise kiocb_set_rw_flags()
io_uring: don't touch 'ctx' after installing file descriptor
io_uring: get rid of atomic FAA for cq_timeouts
io_uring: consolidate *_check_overflow accounting
io_uring: fix stalled deferred requests
io_uring: fix racy overflow count reporting
io_uring: deduplicate __io_complete_rw()
io_uring: de-unionise io_kiocb
io-wq: update hash bits
io_uring: fix missing io_queue_linked_timeout()
io_uring: mark ->work uninitialised after cleanup
io_uring: deduplicate io_grab_files() calls
io_uring: don't do opcode prep twice
io_uring: clear IORING_SQ_NEED_WAKEUP after executing task works
io_uring: batch put_task_struct()
tasks: add put_task_struct_many()
io_uring: return locked and pinned page accounting
io_uring: don't miscount pinned memory
io_uring: don't open-code recv kbuf managment
...
Instead of calling BTRFS_I on the passed vfs_inode take btrfs_inode
directly.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It needs btrfs_inode so take it as a parameter directly.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It only uses btrfs_inode internally so take it as a parameter.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's only a single use of vfs_inode in a tracepoint so let's take
btrfs_inode directly.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is a single use of the generic vfs_inode so let's take btrfs_inode
as a parameter and remove couple of redundant BTRFS_I() calls.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Preparation to make btrfs_dirty_pages take btrfs_inode as parameter.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function btrfs_check_can_nocow() now has two completely different
call patterns.
For nowait variant, callers don't need to do any cleanup. While for
wait variant, callers need to release the lock if they can do nocow
write.
This is somehow confusing, and is already a problem for the exported
btrfs_check_can_nocow().
So this patch will separate the different patterns into different
functions.
For nowait variant, the function will be called check_nocow_nolock().
For wait variant, the function pair will be btrfs_check_nocow_lock()
btrfs_check_nocow_unlock().
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These two functions have extra conditions that their callers need to
meet, and some not-that-common parameters used for return value.
So adding some comments may save reviewers some time.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
When the data space is exhausted, even if the inode has NOCOW attribute,
we will still refuse to truncate unaligned range due to ENOSPC.
The following script can reproduce it pretty easily:
#!/bin/bash
dev=/dev/test/test
mnt=/mnt/btrfs
umount $dev &> /dev/null
umount $mnt &> /dev/null
mkfs.btrfs -f $dev -b 1G
mount -o nospace_cache $dev $mnt
touch $mnt/foobar
chattr +C $mnt/foobar
xfs_io -f -c "pwrite -b 4k 0 4k" $mnt/foobar > /dev/null
xfs_io -f -c "pwrite -b 4k 0 1G" $mnt/padding &> /dev/null
sync
xfs_io -c "fpunch 0 2k" $mnt/foobar
umount $mnt
Currently this will fail at the fpunch part.
[CAUSE]
Because btrfs_truncate_block() always reserves space without checking
the NOCOW attribute.
Since the writeback path follows NOCOW bit, we only need to bother the
space reservation code in btrfs_truncate_block().
[FIX]
Make btrfs_truncate_block() follow btrfs_buffered_write() to try to
reserve data space first, and fall back to NOCOW check only when we
don't have enough space.
Such always-try-reserve is an optimization introduced in
btrfs_buffered_write(), to avoid expensive btrfs_check_can_nocow() call.
This patch will export check_can_nocow() as btrfs_check_can_nocow(), and
use it in btrfs_truncate_block() to fix the problem.
Reported-by: Martin Doucha <martin.doucha@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It has only 4 uses of a vfs_inode for inode_sub_bytes but unifies the
interface with the non __ prefixed version. Will also makes converting
its callers to btrfs_inode easier.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The incoming qgroup reserved space timing will move the data reservation
to ordered extent completely.
However in btrfs_punch_hole_lock_range() will call
btrfs_invalidate_page(), which will clear QGROUP_RESERVED bit for the
range.
In current stage it's OK, but if we're making ordered extents handle the
reserved space, then btrfs_punch_hole_lock_range() can clear the
QGROUP_RESERVED bit before we submit ordered extent, leading to qgroup
reserved space leakage.
So here change the timing to make reserve data space after
btrfs_punch_hole_lock_range().
The new timing is fine for either current code or the new code.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The call to btrfs_btree_balance_dirty has been there since the early
days of BTRFS, when the btree was directly modified from the write path,
hence dirtied btree inode pages. With the implementation of b888db2bd7
("Btrfs: Add delayed allocation to the extent based page tree code")
13 years ago the btree is no longer modified from the write path, hence
there is no point in calling this function. Just remove it.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs implements the iter_write op and thus can use the more efficient
iov_iter based splice implementation. For now falling back to the less
efficient default is pretty harmless, but I have a pending series that
removes the default, and thus would cause btrfs to not support splice
at all.
Reported-by: Andy Lavr <andy.lavr@gmail.com>
Tested-by: Andy Lavr <andy.lavr@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A RWF_NOWAIT write is not supposed to wait on filesystem locks that can be
held for a long time or for ongoing IO to complete.
However when calling check_can_nocow(), if the inode has prealloc extents
or has the NOCOW flag set, we can block on extent (file range) locks
through the call to btrfs_lock_and_flush_ordered_range(). Such lock can
take a significant amount of time to be available. For example, a fiemap
task may be running, and iterating through the entire file range checking
all extents and doing backref walking to determine if they are shared,
or a readpage operation may be in progress.
Also at btrfs_lock_and_flush_ordered_range(), called by check_can_nocow(),
after locking the file range we wait for any existing ordered extent that
is in progress to complete. Another operation that can take a significant
amount of time and defeat the purpose of RWF_NOWAIT.
So fix this by trying to lock the file range and if it's currently locked
return -EAGAIN to user space. If we are able to lock the file range without
waiting and there is an ordered extent in the range, return -EAGAIN as
well, instead of waiting for it to complete. Finally, don't bother trying
to lock the snapshot lock of the root when attempting a RWF_NOWAIT write,
as that is only important for buffered writes.
Fixes: edf064e7c6 ("btrfs: nowait aio support")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we attempt to do a RWF_NOWAIT write against a file range for which we
can only do NOCOW for a part of it, due to the existence of holes or
shared extents for example, we proceed with the write as if it were
possible to NOCOW the whole range.
Example:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ touch /mnt/sdj/bar
$ chattr +C /mnt/sdj/bar
$ xfs_io -d -c "pwrite -S 0xab -b 256K 0 256K" /mnt/bar
wrote 262144/262144 bytes at offset 0
256 KiB, 1 ops; 0.0003 sec (694.444 MiB/sec and 2777.7778 ops/sec)
$ xfs_io -c "fpunch 64K 64K" /mnt/bar
$ sync
$ xfs_io -d -c "pwrite -N -V 1 -b 128K -S 0xfe 0 128K" /mnt/bar
wrote 131072/131072 bytes at offset 0
128 KiB, 1 ops; 0.0007 sec (160.051 MiB/sec and 1280.4097 ops/sec)
This last write should fail with -EAGAIN since the file range from 64K to
128K is a hole. On xfs it fails, as expected, but on ext4 it currently
succeeds because apparently it is expensive to check if there are extents
allocated for the whole range, but I'll check with the ext4 people.
Fix the issue by checking if check_can_nocow() returns a number of
NOCOW'able bytes smaller then the requested number of bytes, and if it
does return -EAGAIN.
Fixes: edf064e7c6 ("btrfs: nowait aio support")
CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we do a successful RWF_NOWAIT write we end up locking the snapshot lock
of the inode, through a call to check_can_nocow(), but we never unlock it.
This means the next attempt to create a snapshot on the subvolume will
hang forever.
Trivial reproducer:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ touch /mnt/foobar
$ chattr +C /mnt/foobar
$ xfs_io -d -c "pwrite -S 0xab 0 64K" /mnt/foobar
$ xfs_io -d -c "pwrite -N -V 1 -S 0xfe 0 64K" /mnt/foobar
$ btrfs subvolume snapshot -r /mnt /mnt/snap
--> hangs
Fix this by unlocking the snapshot lock if check_can_nocow() returned
success.
Fixes: edf064e7c6 ("btrfs: nowait aio support")
CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This reverts commit a43a67a2d7.
This patch reverts the main part of switching direct io implementation
to iomap infrastructure. There's a problem in invalidate page that
couldn't be solved as regression in this development cycle.
The problem occurs when buffered and direct io are mixed, and the ranges
overlap. Although this is not recommended, filesystems implement
measures or fallbacks to make it somehow work. In this case, fallback to
buffered IO would be an option for btrfs (this already happens when
direct io is done on compressed data), but the change would be needed in
the iomap code, bringing new semantics to other filesystems.
Another problem arises when again the buffered and direct ios are mixed,
invalidation fails, then -EIO is set on the mapping and fsync will fail,
though there's no real error.
There have been discussions how to fix that, but revert seems to be the
least intrusive option.
Link: https://lore.kernel.org/linux-btrfs/20200528192103.xm45qoxqmkw7i5yl@fiona/
Signed-off-by: David Sterba <dsterba@suse.com>
This reverts commit d8f3e73587.
The patch is a cleanup of direct IO port to iomap infrastructure,
which gets reverted.
Signed-off-by: David Sterba <dsterba@suse.com>
The read and write versions don't have anything in common except for the
call to iomap_dio_rw. So split this function, and merge each half into
its only caller.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Switch from __blockdev_direct_IO() to iomap_dio_rw().
Rename btrfs_get_blocks_direct() to btrfs_dio_iomap_begin() and use it
as iomap_begin() for iomap direct I/O functions. This function
allocates and locks all the blocks required for the I/O.
btrfs_submit_direct() is used as the submit_io() hook for direct I/O
ops.
Since we need direct I/O reads to go through iomap_dio_rw(), we change
file_operations.read_iter() to a btrfs_file_read_iter() which calls
btrfs_direct_IO() for direct reads and falls back to
generic_file_buffered_read() for incomplete reads and buffered reads.
We don't need address_space.direct_IO() anymore so set it to noop.
Similarly, we don't need flags used in __blockdev_direct_IO(). iomap is
capable of direct I/O reads from a hole, so we don't need to return
-ENOENT.
BTRFS direct I/O is now done under i_rwsem, shared in case of reads and
exclusive in case of writes. This guards against simultaneous truncates.
Use iomap->iomap_end() to check for failed or incomplete direct I/O:
- for writes, call __endio_write_update_ordered()
- for reads, unlock extents
btrfs_dio_data is now hooked in iomap->private and not
current->journal_info. It carries the reservation variable and the
amount of data submitted, so we can calculate the amount of data to call
__endio_write_update_ordered in case of an error.
This patch removes last use of struct buffer_head from btrfs.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The inode lookup starting at btrfs_iget takes the full location key,
while only the objectid is used to match the inode, because the lookup
happens inside the given root thus the inode number is unique.
The entire location key is properly set up in btrfs_init_locked_inode.
Simplify the helpers and pass only inode number, renaming it to 'ino'
instead of 'objectid'. This allows to remove temporary variables key,
saving some stack space.
Signed-off-by: David Sterba <dsterba@suse.com>
The main function to lookup a root by its id btrfs_get_fs_root takes the
whole key, while only using the objectid. The value of offset is preset
to (u64)-1 but not actually used until btrfs_find_root that does the
actual search.
Switch btrfs_get_fs_root to use only objectid and remove all local
variables that existed just for the lookup. The actual key for search is
set up in btrfs_get_fs_root, reusing another key variable.
Signed-off-by: David Sterba <dsterba@suse.com>
The name BTRFS_ROOT_REF_COWS is not very clear about the meaning.
In fact, that bit can only be set to those trees:
- Subvolume roots
- Data reloc root
- Reloc roots for above roots
All other trees won't get this bit set. So just by the result, it is
obvious that, roots with this bit set can have tree blocks shared with
other trees. Either shared by snapshots, or by reloc roots (an special
snapshot created by relocation).
This patch will rename BTRFS_ROOT_REF_COWS to BTRFS_ROOT_SHAREABLE to
make it easier to understand, and update all comment mentioning
"reference counted" to follow the rename.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is a revert of commit 0a8068a3dd ("btrfs: make ranged full
fsyncs more efficient"), with updated comment in btrfs_sync_file.
Commit 0a8068a3dd ("btrfs: make ranged full fsyncs more efficient")
made full fsyncs operate on the given range only as it assumed it was safe
when using the NO_HOLES feature, since the hole detection was simplified
some time ago and no longer was a source for races with ordered extent
completion of adjacent file ranges.
However it's still not safe to have a full fsync only operate on the given
range, because extent maps for new extents might not be present in memory
due to inode eviction or extent cloning. Consider the following example:
1) We are currently at transaction N;
2) We write to the file range [0, 1MiB);
3) Writeback finishes for the whole range and ordered extents complete,
while we are still at transaction N;
4) The inode is evicted;
5) We open the file for writing, causing the inode to be loaded to
memory again, which sets the 'full sync' bit on its flags. At this
point the inode's list of modified extent maps is empty (figuring
out which extents were created in the current transaction and were
not yet logged by an fsync is expensive, that's why we set the
'full sync' bit when loading an inode);
6) We write to the file range [512KiB, 768KiB);
7) We do a ranged fsync (such as msync()) for file range [512KiB, 768KiB).
This correctly flushes this range and logs its extent into the log
tree. When the writeback started an extent map for range [512KiB, 768KiB)
was added to the inode's list of modified extents, and when the fsync()
finishes logging it removes that extent map from the list of modified
extent maps. This fsync also clears the 'full sync' bit;
8) We do a regular fsync() (full ranged). This fsync() ends up doing
nothing because the inode's list of modified extents is empty and
no other changes happened since the previous ranged fsync(), so
it just returns success (0) and we end up never logging extents for
the file ranges [0, 512KiB) and [768KiB, 1MiB).
Another scenario where this can happen is if we replace steps 2 to 4 with
cloning from another file into our test file, as that sets the 'full sync'
bit in our inode's flags and does not populate its list of modified extent
maps.
This was causing test case generic/457 to fail sporadically when using the
NO_HOLES feature, as it exercised this later case where the inode has the
'full sync' bit set and has no extent maps in memory to represent the new
extents due to extent cloning.
Fix this by reverting commit 0a8068a3dd ("btrfs: make ranged full fsyncs
more efficient") since there is no easy way to work around it.
Fixes: 0a8068a3dd ("btrfs: make ranged full fsyncs more efficient")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Ordered ops are started twice in sync file, once outside of inode mutex
and once inside, taking the dio semaphore. There was one error path
missing the semaphore unlock.
Fixes: aab15e8ec2 ("Btrfs: fix rare chances for data loss when doing a fast fsync")
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
[ add changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we have proper root ref counting everywhere we can kill the
subvol_srcu.
* removal of fs_info::subvol_srcu reduces size of fs_info by 1176 bytes
* the refcount_t used for the references checks for accidental 0->1
in cases where the root lifetime would not be properly protected
* there's a leak detector for roots to catch unfreed roots at umount
time
* SRCU served us well over the years but is was not a proper
synchronization mechanism for some cases
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 0c713cbab6 ("Btrfs: fix race between ranged fsync and writeback
of adjacent ranges") fixed a bug where we could end up with file extent
items in a log tree that represent file ranges that overlap due to a race
between the hole detection of a ranged full fsync and writeback for a
different file range.
The problem was solved by forcing any ranged full fsync to become a
non-ranged full fsync - setting the range start to 0 and the end offset to
LLONG_MAX. This was a simple solution because the code that detected and
marked holes was very complex, it used to be done at copy_items() and
implied several searches on the fs/subvolume tree. The drawback of that
solution was that we started to flush delalloc for the entire file and
wait for all the ordered extents to complete for ranged full fsyncs
(including ordered extents covering ranges completely outside the given
range). Fortunatelly ranged full fsyncs are not the most common case
(hopefully for most workloads).
However a later fix for detecting and marking holes was made by commit
0e56315ca1 ("Btrfs: fix missing hole after hole punching and fsync
when using NO_HOLES") and it simplified a lot the detection of holes,
and now copy_items() no longer does it and we do it in a much more simple
way at btrfs_log_holes().
This makes it now possible to simply make the code that detects holes to
operate only on the initial range and no longer need to operate on the
whole file, while also avoiding the need to flush delalloc for the entire
file and wait for ordered extents that cover ranges that don't overlap the
given range.
Another special care is that we must skip file extent items that fall
entirely outside the fsync range when copying inode items from the
fs/subvolume tree into the log tree - this is to avoid races with ordered
extent completion for extents falling outside the fsync range, which could
cause us to end up with file extent items in the log tree that have
overlapping ranges - for example if the fsync range is [1Mb, 2Mb], when
we copy inode items we could copy an extent item for the range [0, 512K],
then release the search path and before moving to the next leaf, an
ordered extent for a range of [256Kb, 512Kb] completes - this would
cause us to copy the new extent item for range [256Kb, 512Kb] into the
log tree after we have copied one for the range [0, 512Kb] - the extents
overlap, resulting in a corruption.
So this change just does these steps:
1) When the NO_HOLES feature is enabled it leaves the initial range
intact - no longer sets it to [0, LLONG_MAX] when the full sync bit
is set in the inode. If NO_HOLES is not enabled, always set the range
to a full, just like before this change, to avoid missing file extent
items representing holes after replaying the log (for both full and
fast fsyncs);
2) Make the hole detection code to operate only on the fsync range;
3) Make the code that copies items from the fs/subvolume tree to skip
copying file extent items that cover a range completely outside the
range of the fsync.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When doing a fast fsync for a range that starts at an offset greater than
zero, we can end up with a log that when replayed causes the respective
inode miss a file extent item representing a hole if we are not using the
NO_HOLES feature. This is because for fast fsyncs we don't log any extents
that cover a range different from the one requested in the fsync.
Example scenario to trigger it:
$ mkfs.btrfs -O ^no-holes -f /dev/sdd
$ mount /dev/sdd /mnt
# Create a file with a single 256K and fsync it to clear to full sync
# bit in the inode - we want the msync below to trigger a fast fsync.
$ xfs_io -f -c "pwrite -S 0xab 0 256K" -c "fsync" /mnt/foo
# Force a transaction commit and wipe out the log tree.
$ sync
# Dirty 768K of data, increasing the file size to 1Mb, and flush only
# the range from 256K to 512K without updating the log tree
# (sync_file_range() does not trigger fsync, it only starts writeback
# and waits for it to finish).
$ xfs_io -c "pwrite -S 0xcd 256K 768K" /mnt/foo
$ xfs_io -c "sync_range -abw 256K 256K" /mnt/foo
# Now dirty the range from 768K to 1M again and sync that range.
$ xfs_io -c "mmap -w 768K 256K" \
-c "mwrite -S 0xef 768K 256K" \
-c "msync -s 768K 256K" \
-c "munmap" \
/mnt/foo
<power fail>
# Mount to replay the log.
$ mount /dev/sdd /mnt
$ umount /mnt
$ btrfs check /dev/sdd
Opening filesystem to check...
Checking filesystem on /dev/sdd
UUID: 482fb574-b288-478e-a190-a9c44a78fca6
[1/7] checking root items
[2/7] checking extents
[3/7] checking free space cache
[4/7] checking fs roots
root 5 inode 257 errors 100, file extent discount
Found file extent holes:
start: 262144, len: 524288
ERROR: errors found in fs roots
found 720896 bytes used, error(s) found
total csum bytes: 512
total tree bytes: 131072
total fs tree bytes: 32768
total extent tree bytes: 16384
btree space waste bytes: 123514
file data blocks allocated: 589824
referenced 589824
Fix this issue by setting the range to full (0 to LLONG_MAX) when the
NO_HOLES feature is not enabled. This results in extra work being done
but it gives the guarantee we don't end up with missing holes after
replaying the log.
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The reflink code is quite large and has been living in ioctl.c since ever.
It has grown over the years after many bug fixes and improvements, and
since I'm planning on making some further improvements on it, it's time
to get it better organized by moving into its own file, reflink.c
(similar to what xfs does for example).
This change only moves the code out of ioctl.c into the new file, it
doesn't do any other change.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This patch removes all haphazard code implementing nocow writers
exclusion from pending snapshot creation and switches to using the drew
lock to ensure this invariant still holds.
'Readers' are snapshot creators from create_snapshot and 'writers' are
nocow writers from buffered write path or btrfs_setsize. This locking
scheme allows for multiple snapshots to happen while any nocow writers
are blocked, since writes to page cache in the nocow path will make
snapshots inconsistent.
So for performance reasons we'd like to have the ability to run multiple
concurrent snapshots and also favors readers in this case. And in case
there aren't pending snapshots (which will be the majority of the cases)
we rely on the percpu's writers counter to avoid cacheline contention.
The main gain from using the drew lock is it's now a lot easier to
reason about the guarantees of the locking scheme and whether there is
some silent breakage lurking.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The tree pointer can be safely read from the inode so we can drop the
redundant argument from btrfs_lock_and_flush_ordered_range.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We are now using these for all roots, rename them to btrfs_put_root()
and btrfs_grab_root();
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that all callers of btrfs_get_fs_root are subsequently calling
btrfs_grab_fs_root and handling dropping the ref when they are done
appropriately, go ahead and push btrfs_grab_fs_root up into
btrfs_get_fs_root.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We are looking up an arbitrary inode, we need to hold a ref on the root
while we're doing this.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All this does is call btrfs_get_fs_root() with check_ref == true. Just
use btrfs_get_fs_root() so we don't have a bunch of different helpers
that do the same thing.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we have a safe way to update the i_size, replace all uses of
btrfs_ordered_update_i_size with btrfs_inode_safe_disk_i_size_write.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We want to use this everywhere we modify the file extent items
permanently. These include:
1) Inserting new file extents for writes and prealloc extents.
2) Truncating inode items.
3) btrfs_cont_expand().
4) Insert inline extents.
5) Insert new extents from log replay.
6) Insert a new extent for clone, as it could be past i_size.
7) Hole punching
For hole punching in particular it might seem it's not necessary because
anybody extending would use btrfs_cont_expand, however there is a corner
that still can give us trouble. Start with an empty file and
fallocate KEEP_SIZE 1M-2M
We now have a 0 length file, and a hole file extent from 0-1M, and a
prealloc extent from 1M-2M. Now
punch 1M-1.5M
Because this is past i_size we have
[HOLE EXTENT][ NOTHING ][PREALLOC]
[0 1M][1M 1.5M][1.5M 2M]
with an i_size of 0. Now if we pwrite 0-1.5M we'll increas our i_size
to 1.5M, but our disk_i_size is still 0 until the ordered extent
completes.
However if we now immediately truncate 2M on the file we'll just call
btrfs_cont_expand(inode, 1.5M, 2M), since our old i_size is 1.5M. If we
commit the transaction here and crash we'll expose the gap.
To fix this we need to clear the file extent mapping for the range that
we punched but didn't insert a corresponding file extent for. This will
mean the truncate will only get an disk_i_size set to 1M if we crash
before the finish ordered io happens.
I've written an xfstest to reproduce the problem and validate this fix.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We only pass this as 1 from __extent_writepage_io(). The parameter
basically means "pretend I didn't pass in a page". This is silly since
we can simply not pass in the page. Get rid of the parameter from
btrfs_get_extent(), and since it's used as a get_extent_t callback,
remove it from get_extent_t and btree_get_extent(), neither of which
need it.
While we're here, let's document btrfs_get_extent().
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
ordered->start, ordered->len, and ordered->disk_len correspond to
fi->disk_bytenr, fi->num_bytes, and fi->disk_num_bytes, respectively.
It's confusing to translate between the two naming schemes. Since a
btrfs_ordered_extent is basically a pending btrfs_file_extent_item,
let's make the former use the naming from the latter.
Note that I didn't touch the names in tracepoints just in case there are
scripts depending on the current naming.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When using the NO_HOLES feature if we clone a range that contains a hole
and a temporary ENOSPC happens while dropping extents from the target
inode's range, we can end up failing and aborting the transaction with
-EEXIST or with a corrupt file extent item, that has a length greater
than it should and overlaps with other extents. For example when cloning
the following range from inode A to inode B:
Inode A:
extent A1 extent A2
[ ----------- ] [ hole, implicit, 4MB length ] [ ------------- ]
0 1MB 5MB 6MB
Range to clone: [1MB, 6MB)
Inode B:
extent B1 extent B2 extent B3 extent B4
[ ---------- ] [ --------- ] [ ---------- ] [ ---------- ]
0 1MB 1MB 2MB 2MB 5MB 5MB 6MB
Target range: [1MB, 6MB) (same as source, to make it easier to explain)
The following can happen:
1) btrfs_punch_hole_range() gets -ENOSPC from __btrfs_drop_extents();
2) At that point, 'cur_offset' is set to 1MB and __btrfs_drop_extents()
set 'drop_end' to 2MB, meaning it was able to drop only extent B2;
3) We then compute 'clone_len' as 'drop_end' - 'cur_offset' = 2MB - 1MB =
1MB;
4) We then attempt to insert a file extent item at inode B with a file
offset of 5MB, which is the value of clone_info->file_offset. This
fails with error -EEXIST because there's already an extent at that
offset (extent B4);
5) We abort the current transaction with -EEXIST and return that error
to user space as well.
Another example, for extent corruption:
Inode A:
extent A1 extent A2
[ ----------- ] [ hole, implicit, 10MB length ] [ ------------- ]
0 1MB 11MB 12MB
Inode B:
extent B1 extent B2
[ ----------- ] [ --------- ] [ ----------------------------- ]
0 1MB 1MB 5MB 5MB 12MB
Target range: [1MB, 12MB) (same as source, to make it easier to explain)
1) btrfs_punch_hole_range() gets -ENOSPC from __btrfs_drop_extents();
2) At that point, 'cur_offset' is set to 1MB and __btrfs_drop_extents()
set 'drop_end' to 5MB, meaning it was able to drop only extent B2;
3) We then compute 'clone_len' as 'drop_end' - 'cur_offset' = 5MB - 1MB =
4MB;
4) We then insert a file extent item at inode B with a file offset of 11MB
which is the value of clone_info->file_offset, and a length of 4MB (the
value of 'clone_len'). So we get 2 extents items with ranges that
overlap and an extent length of 4MB, larger then the extent A2 from
inode A (1MB length);
5) After that we end the transaction, balance the btree dirty pages and
then start another or join the previous transaction. It might happen
that the transaction which inserted the incorrect extent was committed
by another task so we end up with extent corruption if a power failure
happens.
So fix this by making sure we attempt to insert the extent to clone at
the destination inode only if we are past dropping the sub-range that
corresponds to a hole.
Fixes: 690a5dbfc5 ("Btrfs: fix ENOSPC errors, leading to transaction aborts, when cloning extents")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We can now remove the bdev from extent_map. Previous patches made sure
that bio_set_dev is correctly in all places and that we don't need to
grab it from latest_bdev or pass it around inside the extent map.
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of using an input pointer parameter as the return value and have
an int as the return type of find_desired_extent, rework the function to
directly return the found offset. Doing that the 'ret' variable in
btrfs_llseek_file can be removed. Additional (subjective) benefit is
that btrfs' llseek function now resemebles those of the other major
filesystems.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Handle SEEK_END/SEEK_CUR in a single 'default' case by directly
returning from generic_file_llseek. This makes the 'out' label
redundant. Finally return directly the vale from vfs_setpos. No
semantic changes.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Modifying the file position is done on a per-file basis. This renders
holding the inode lock for writing useless and makes the performance of
concurrent llseek's abysmal.
Fix this by holding the inode for read. This provides protection against
concurrent truncates and find_desired_extent already includes proper
extent locking for the range which ensures proper locking against
concurrent writes. SEEK_CUR and SEEK_END can be done lockessly.
The former is synchronized by file::f_lock spinlock. SEEK_END is not
synchronized but atomic, but that's OK since there is not guarantee that
SEEK_END will always be at the end of the file in the face of tail
modifications.
This change brings ~82% performance improvement when doing a lot of
parallel fseeks. The workload essentially does:
for (d=0; d<num_seek_read; d++)
{
/* offset %= 16777216; */
fseek (f, 256 * d % 16777216, SEEK_SET);
fread (buffer, 64, 1, f);
}
Without patch:
num workprocesses = 16
num fseek/fread = 8000000
step = 256
fork 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
real 0m41.412s
user 0m28.777s
sys 2m16.510s
With patch:
num workprocesses = 16
num fseek/fread = 8000000
step = 256
fork 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
real 0m11.479s
user 0m27.629s
sys 0m21.040s
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When doing a buffered write it's possible to leave the subv_writers
counter of the root, used for synchronization between buffered nocow
writers and snapshotting. This happens in an exceptional case like the
following:
1) We fail to allocate data space for the write, since there's not
enough available data space nor enough unallocated space for allocating
a new data block group;
2) Because of that failure, we try to go to NOCOW mode, which succeeds
and therefore we set the local variable 'only_release_metadata' to true
and set the root's sub_writers counter to 1 through the call to
btrfs_start_write_no_snapshotting() made by check_can_nocow();
3) The call to btrfs_copy_from_user() returns zero, which is very unlikely
to happen but not impossible;
4) No pages are copied because btrfs_copy_from_user() returned zero;
5) We call btrfs_end_write_no_snapshotting() which decrements the root's
subv_writers counter to 0;
6) We don't set 'only_release_metadata' back to 'false' because we do
it only if 'copied', the value returned by btrfs_copy_from_user(), is
greater than zero;
7) On the next iteration of the while loop, which processes the same
page range, we are now able to allocate data space for the write (we
got enough data space released in the meanwhile);
8) After this if we fail at btrfs_delalloc_reserve_metadata(), because
now there isn't enough free metadata space, or in some other place
further below (prepare_pages(), lock_and_cleanup_extent_if_need(),
btrfs_dirty_pages()), we break out of the while loop with
'only_release_metadata' having a value of 'true';
9) Because 'only_release_metadata' is 'true' we end up decrementing the
root's subv_writers counter to -1 (through a call to
btrfs_end_write_no_snapshotting()), and we also end up not releasing the
data space previously reserved through btrfs_check_data_free_space().
As a consequence the mechanism for synchronizing NOCOW buffered writes
with snapshotting gets broken.
Fix this by always setting 'only_release_metadata' to false at the start
of each iteration.
Fixes: 8257b2dc3c ("Btrfs: introduce btrfs_{start, end}_nocow_write() for each subvolume")
Fixes: 7ee9e4405f ("Btrfs: check if we can nocow if we don't have data space")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The parameter is now always set to NULL and could be dropped. The last
user was get_default_root but that got reworked in 05dbe6837b ("Btrfs:
unify subvol= and subvolid= mounting") and the parameter became unused.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is similar to 942491c9e6 ("xfs: fix AIM7 regression"). Apparently
our current rwsem code doesn't like doing the trylock, then lock for
real scheme. This causes extra contention on the lock and can be
measured eg. by AIM7 benchmark. So change our read/write methods to
just do the trylock for the RWF_NOWAIT case.
Fixes: edf064e7c6 ("btrfs: nowait aio support")
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
We were checking for the full fsync flag in the inode before locking the
inode, which is racy, since at that that time it might not be set but
after we acquire the inode lock some other task set it. One case where
this can happen is on a system low on memory and some concurrent task
failed to allocate an extent map and therefore set the full sync flag on
the inode, to force the next fsync to work in full mode.
A consequence of missing the full fsync flag set is hitting the problems
fixed by commit 0c713cbab6 ("Btrfs: fix race between ranged fsync and
writeback of adjacent ranges"), BUG_ON() when dropping extents from a log
tree, hitting assertion failures at tree-log.c:copy_items() or all sorts
of weird inconsistencies after replaying a log due to file extents items
representing ranges that overlap.
So just move the check such that it's done after locking the inode and
before starting writeback again.
Fixes: 0c713cbab6 ("Btrfs: fix race between ranged fsync and writeback of adjacent ranges")
CC: stable@vger.kernel.org # 5.2+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[Background]
Btrfs qgroup uses two types of reserved space for METADATA space,
PERTRANS and PREALLOC.
PERTRANS is metadata space reserved for each transaction started by
btrfs_start_transaction().
While PREALLOC is for delalloc, where we reserve space before joining a
transaction, and finally it will be converted to PERTRANS after the
writeback is done.
[Inconsistency]
However there is inconsistency in how we handle PREALLOC metadata space.
The most obvious one is:
In btrfs_buffered_write():
btrfs_delalloc_release_extents(BTRFS_I(inode), reserve_bytes, true);
We always free qgroup PREALLOC meta space.
While in btrfs_truncate_block():
btrfs_delalloc_release_extents(BTRFS_I(inode), blocksize, (ret != 0));
We only free qgroup PREALLOC meta space when something went wrong.
[The Correct Behavior]
The correct behavior should be the one in btrfs_buffered_write(), we
should always free PREALLOC metadata space.
The reason is, the btrfs_delalloc_* mechanism works by:
- Reserve metadata first, even it's not necessary
In btrfs_delalloc_reserve_metadata()
- Free the unused metadata space
Normally in:
btrfs_delalloc_release_extents()
|- btrfs_inode_rsv_release()
Here we do calculation on whether we should release or not.
E.g. for 64K buffered write, the metadata rsv works like:
/* The first page */
reserve_meta: num_bytes=calc_inode_reservations()
free_meta: num_bytes=0
total: num_bytes=calc_inode_reservations()
/* The first page caused one outstanding extent, thus needs metadata
rsv */
/* The 2nd page */
reserve_meta: num_bytes=calc_inode_reservations()
free_meta: num_bytes=calc_inode_reservations()
total: not changed
/* The 2nd page doesn't cause new outstanding extent, needs no new meta
rsv, so we free what we have reserved */
/* The 3rd~16th pages */
reserve_meta: num_bytes=calc_inode_reservations()
free_meta: num_bytes=calc_inode_reservations()
total: not changed (still space for one outstanding extent)
This means, if btrfs_delalloc_release_extents() determines to free some
space, then those space should be freed NOW.
So for qgroup, we should call btrfs_qgroup_free_meta_prealloc() other
than btrfs_qgroup_convert_reserved_meta().
The good news is:
- The callers are not that hot
The hottest caller is in btrfs_buffered_write(), which is already
fixed by commit 336a8bb8e3 ("btrfs: Fix wrong
btrfs_delalloc_release_extents parameter"). Thus it's not that
easy to cause false EDQUOT.
- The trans commit in advance for qgroup would hide the bug
Since commit f5fef45936 ("btrfs: qgroup: Make qgroup async transaction
commit more aggressive"), when btrfs qgroup metadata free space is slow,
it will try to commit transaction and free the wrongly converted
PERTRANS space, so it's not that easy to hit such bug.
[FIX]
So to fix the problem, remove the @qgroup_free parameter for
btrfs_delalloc_release_extents(), and always pass true to
btrfs_inode_rsv_release().
Reported-by: Filipe Manana <fdmanana@suse.com>
Fixes: 43b18595d6 ("btrfs: qgroup: Use separate meta reservation type for delalloc")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we have a buffered write that starts at an offset greater than or
equals to the file's size happening concurrently with a full ranged
fiemap, we can end up leaking an extent state structure.
Suppose we have a file with a size of 1Mb, and before the buffered write
and fiemap are performed, it has a single extent state in its io tree
representing the range from 0 to 1Mb, with the EXTENT_DELALLOC bit set.
The following sequence diagram shows how the memory leak happens if a
fiemap a buffered write, starting at offset 1Mb and with a length of
4Kb, are performed concurrently.
CPU 1 CPU 2
extent_fiemap()
--> it's a full ranged fiemap
range from 0 to LLONG_MAX - 1
(9223372036854775807)
--> locks range in the inode's
io tree
--> after this we have 2 extent
states in the io tree:
--> 1 for range [0, 1Mb[ with
the bits EXTENT_LOCKED and
EXTENT_DELALLOC_BITS set
--> 1 for the range
[1Mb, LLONG_MAX[ with
the EXTENT_LOCKED bit set
--> start buffered write at offset
1Mb with a length of 4Kb
btrfs_file_write_iter()
btrfs_buffered_write()
--> cached_state is NULL
lock_and_cleanup_extent_if_need()
--> returns 0 and does not lock
range because it starts
at current i_size / eof
--> cached_state remains NULL
btrfs_dirty_pages()
btrfs_set_extent_delalloc()
(...)
__set_extent_bit()
--> splits extent state for range
[1Mb, LLONG_MAX[ and now we
have 2 extent states:
--> one for the range
[1Mb, 1Mb + 4Kb[ with
EXTENT_LOCKED set
--> another one for the range
[1Mb + 4Kb, LLONG_MAX[ with
EXTENT_LOCKED set as well
--> sets EXTENT_DELALLOC on the
extent state for the range
[1Mb, 1Mb + 4Kb[
--> caches extent state
[1Mb, 1Mb + 4Kb[ into
@cached_state because it has
the bit EXTENT_LOCKED set
--> btrfs_buffered_write() ends up
with a non-NULL cached_state and
never calls anything to release its
reference on it, resulting in a
memory leak
Fix this by calling free_extent_state() on cached_state if the range was
not locked by lock_and_cleanup_extent_if_need().
The same issue can happen if anything else other than fiemap locks a range
that covers eof and beyond.
This could be triggered, sporadically, by test case generic/561 from the
fstests suite, which makes duperemove run concurrently with fsstress, and
duperemove does plenty of calls to fiemap. When CONFIG_BTRFS_DEBUG is set
the leak is reported in dmesg/syslog when removing the btrfs module with
a message like the following:
[77100.039461] BTRFS: state leak: start 6574080 end 6582271 state 16402 in tree 0 refs 1
Otherwise (CONFIG_BTRFS_DEBUG not set) detectable with kmemleak.
CC: stable@vger.kernel.org # 4.16+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since commit fee187d9d9 ("Btrfs: do not set EXTENT_DIRTY along with
EXTENT_DELALLOC"), we never set EXTENT_DIRTY in inode->io_tree, so we
can simplify and stop trying to clear it.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The VFS indicates a synchronous write to ->write_iter() via
iocb->ki_flags. The IOCB_{,D}SYNC flags may be set based on the file
(see iocb_flags()) or the RWF_* flags passed to a syscall like
pwritev2() (see kiocb_set_rw_flags()).
However, in btrfs_file_write_iter(), we're checking if a write is
synchronous based only on the file; we use this to decide when to bump
the sync_writers counter and thus do CRCs synchronously. Make sure we do
this for all synchronous writes as determined by the VFS.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add const ]
Signed-off-by: David Sterba <dsterba@suse.com>
generic_write_checks() may modify iov_iter_count(), so we must get the
count after the call, not before. Using the wrong one has a couple of
consequences:
1. We check a longer range in check_can_nocow() for nowait than we're
actually writing.
2. We create extra hole extent maps in btrfs_cont_expand(). As far as I
can tell, this is harmless, but I might be missing something.
These issues are pretty minor, but let's fix it before something more
important trips on it.
Fixes: edf064e7c6 ("btrfs: nowait aio support")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_calc_trunc_metadata_size differs from trans_metadata_size in that
it doesn't take into account any splitting at the levels, because
truncate will never split nodes. However truncate _and_ changing will
never split nodes, so rename btrfs_calc_trunc_metadata_size to
btrfs_calc_metadata_size. Also btrfs_calc_trans_metadata_size is purely
for inserting items, so rename this to btrfs_calc_insert_metadata_size.
Making these clearer will help when I start using them differently in
upcoming patches.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>