Currently unallocated chunks are always trimmed. For example
2 consecutive trims on large storage would trim freespace twice
irrespective of whether the space was actually allocated or not between
those trims.
Optimise this behavior by exploiting the newly introduced alloc_state
tree of btrfs_device. A new CHUNK_TRIMMED bit is used to mark
those unallocated chunks which have been trimmed and have not been
allocated afterwards. On chunk allocation the respective underlying devices'
physical space will have its CHUNK_TRIMMED flag cleared. This avoids
submitting discards for space which hasn't been changed since the last
time discard was issued.
This applies to the single mount period of the filesystem as the
information is not stored permanently.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The pending chunks list contains chunks that are allocated in the
current transaction but haven't been created yet. The pinned chunks
list contains chunks that are being released in the current transaction.
Both describe chunks that are not reflected on disk as in use but are
unavailable just the same.
The pending chunks list is anchored by the transaction handle, which
means that we need to hold a reference to a transaction when working
with the list.
The way we use them is by iterating over both lists to perform
comparisons on the stripes they describe for each device. This is
backwards and requires that we keep a transaction handle open while
we're trimming.
This patchset adds an extent_io_tree to btrfs_device that maintains
the allocation state of the device. Extents are set dirty when
chunks are first allocated -- when the extent maps are added to the
mapping tree. They're cleared when last removed -- when the extent
maps are removed from the mapping tree. This matches the lifespan
of the pending and pinned chunks list and allows us to do trims
on unallocated space safely without pinning the transaction for what
may be a lengthy operation. We can also use this io tree to mark
which chunks have already been trimmed so we don't repeat the operation.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We can never have extents marked as EXTENT_MAP_DELALLOC since this
value is only ever used by btrfs_get_extent_fiemap. In this case the
extent map is created by btrfs_get_extent_fiemap and is never really
published, this flag is used to return the corresponding userspace one.
Considering this, it's pointless having a check for EXTENT_MAP_DELALLOC
in mergable_maps. Just remove it.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The typos accumulate over time so once in a while time they get fixed in
a large patch.
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
rb_first_cached() trades an extra pointer "leftmost" for doing the
same job as rb_first() but in O(1).
As evict_inode_truncate_pages() removes all extent mapping by always
looking for the first rb entry, it's helpful to use rb_first_cached
instead.
For more details about the optimization see patch "Btrfs: delayed-refs:
use rb_first_cached for href_root".
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
remove_extent_mapping uses the variable "ret" for return value, but it
is not modified after initialzation. Further, I find that any of the
callers do not handle the return value and the callees are only simple
functions so the return values does not need to be passed.
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We really want to know to which filesystem the extent map events belong,
but as it cannot be reached from the extent_map pointers, we need to
pass it down the callchain.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Remove GPL boilerplate text (long, short, one-line) and keep the rest,
ie. personal, company or original source copyright statements. Add the
SPDX header.
Signed-off-by: David Sterba <dsterba@suse.com>
The __cold functions are placed to a special section, as they're
expected to be called rarely. This could help i-cache prefetches or help
compiler to decide which branches are more/less likely to be taken
without any other annotations needed.
Though we can't add more __exit annotations, it's still possible to add
__cold (that's also added with __exit). That way the following function
categories are tagged:
- printf wrappers, error messages
- exit helpers
Signed-off-by: David Sterba <dsterba@suse.com>
This is adding a tracepoint 'btrfs_handle_em_exist' to help debug the
subtle bugs around merge_extent_mapping.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Preempt counter APIs have been split out, currently, hardirq.h just
includes irq_enter/exit APIs which are not used by btrfs at all.
So, remove the unused hardirq.h.
Signed-off-by: Yang Shi <yang.s@alibaba-inc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In order to debug subtle bugs around merge_extent_mapping(), perf probe
can be used to check the arguments, but sometimes merge_extent_mapping()
got inlined by compiler and couldn't be probed.
This is adding noinline attribute to merge_extent_mapping().
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is a subtle case, so in order to understand the problem, it'd be good
to know the content of existing and em when any error occurs.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These helpers are extent map specific, move them to extent_map.c.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
BTRFS is using a variety of slab caches to satisfy internal needs.
Those slab caches are always allocated with the SLAB_RECLAIM_ACCOUNT,
meaning allocations from the caches are going to be accounted as
SReclaimable. At the same time btrfs is not registering any shrinkers
whatsoever, thus preventing memory from the slabs to be shrunk. This
means those caches are not in fact reclaimable.
To fix this remove the SLAB_RECLAIM_ACCOUNT on all caches apart from the
inode cache, since this one is being freed by the generic VFS super_block
shrinker. Also set the transaction related caches as SLAB_TEMPORARY,
to better document the lifetime of the objects (it just translates
to SLAB_RECLAIM_ACCOUNT).
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
So that its better organized.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Cleanup.
kmem_cache_destroy has support NULL argument checking,
so drop the double null testing before calling it.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Overloading extent_map->bdev to struct map_lookup * might have started out
as a means to an end, but it's a pattern that's used all over the place
now. Let's get rid of the casting and just add a union instead.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We use the modified list to keep track of which extents have been modified so we
know which ones are candidates for logging at fsync() time. Newly modified
extents are added to the list at modification time, around the same time the
ordered extent is created. We do this so that we don't have to wait for ordered
extents to complete before we know what we need to log. The problem is when
something like this happens
log extent 0-4k on inode 1
copy csum for 0-4k from ordered extent into log
sync log
commit transaction
log some other extent on inode 1
ordered extent for 0-4k completes and adds itself onto modified list again
log changed extents
see ordered extent for 0-4k has already been logged
at this point we assume the csum has been copied
sync log
crash
On replay we will see the extent 0-4k in the log, drop the original 0-4k extent
which is the same one that we are replaying which also drops the csum, and then
we won't find the csum in the log for that bytenr. This of course causes us to
have errors about not having csums for certain ranges of our inode. So remove
the modified list manipulation in unpin_extent_cache, any modified extents
should have been added well before now, and we don't want them re-logged. This
fixes my test that I could reliably reproduce this problem with. Thanks,
cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
While running balance, scrub, fsstress concurrently we hit the
following kernel crash:
[56561.448845] BTRFS info (device sde): relocating block group 11005853696 flags 132
[56561.524077] BUG: unable to handle kernel NULL pointer dereference at 0000000000000078
[56561.524237] IP: [<ffffffffa038956d>] scrub_chunk.isra.12+0xdd/0x130 [btrfs]
[56561.524297] PGD 9be28067 PUD 7f3dd067 PMD 0
[56561.524325] Oops: 0000 [#1] SMP
[....]
[56561.527237] Call Trace:
[56561.527309] [<ffffffffa038980e>] scrub_enumerate_chunks+0x24e/0x490 [btrfs]
[56561.527392] [<ffffffff810abe00>] ? abort_exclusive_wait+0x50/0xb0
[56561.527476] [<ffffffffa038add4>] btrfs_scrub_dev+0x1a4/0x530 [btrfs]
[56561.527561] [<ffffffffa0368107>] btrfs_ioctl+0x13f7/0x2a90 [btrfs]
[56561.527639] [<ffffffff811c82f0>] do_vfs_ioctl+0x2e0/0x4c0
[56561.527712] [<ffffffff8109c384>] ? vtime_account_user+0x54/0x60
[56561.527788] [<ffffffff810f768c>] ? __audit_syscall_entry+0x9c/0xf0
[56561.527870] [<ffffffff811c8551>] SyS_ioctl+0x81/0xa0
[56561.527941] [<ffffffff815707f7>] tracesys+0xdd/0xe2
[...]
[56561.528304] RIP [<ffffffffa038956d>] scrub_chunk.isra.12+0xdd/0x130 [btrfs]
[56561.528395] RSP <ffff88004c0f5be8>
[56561.528454] CR2: 0000000000000078
This is because in btrfs_relocate_chunk(), we will free @bdev directly while
scrub may still hold extent mapping, and may access freed memory.
Fix this problem by wrapping freeing @bdev work into free_extent_map() which
is based on reference count.
Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
While droping extent map structures from the extent cache that cover our
target range, we would remove each extent map structure from the red black
tree and then add either 1 or 2 new extent map structures if the former
extent map covered sections outside our target range.
This change simply attempts to replace the existing extent map structure
with a new one that covers the subsection we're not interested in, instead
of doing a red black remove operation followed by an insertion operation.
The number of elements in an inode's extent map tree can get very high for large
files under random writes. For example, while running the following test:
sysbench --test=fileio --file-num=1 --file-total-size=10G \
--file-test-mode=rndrw --num-threads=32 --file-block-size=32768 \
--max-requests=500000 --file-rw-ratio=2 [prepare|run]
I captured the following histogram capturing the number of extent_map items
in the red black tree while that test was running:
Count: 122462
Range: 1.000 - 172231.000; Mean: 96415.831; Median: 101855.000; Stddev: 49700.981
Percentiles: 90th: 160120.000; 95th: 166335.000; 99th: 171070.000
1.000 - 5.231: 452 |
5.231 - 187.392: 87 |
187.392 - 585.911: 206 |
585.911 - 1827.438: 623 |
1827.438 - 5695.245: 1962 #
5695.245 - 17744.861: 6204 ####
17744.861 - 55283.764: 21115 ############
55283.764 - 172231.000: 91813 #####################################################
Benchmark:
sysbench --test=fileio --file-num=1 --file-total-size=10G --file-test-mode=rndwr \
--num-threads=64 --file-block-size=32768 --max-requests=0 --max-time=60 \
--file-io-mode=sync --file-fsync-freq=0 [prepare|run]
Before this change: 122.1Mb/sec
After this change: 125.07Mb/sec
(averages of 5 test runs)
Test machine: quad core intel i5-3570K, 32Gb of ram, SSD
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
We don't need to have an unsigned int field in the extent_map struct
to tell us whether the extent map is in the inode's extent_map tree or
not. We can use the rb_node struct field and the RB_CLEAR_NODE and
RB_EMPTY_NODE macros to achieve the same task.
This reduces sizeof(struct extent_map) from 152 bytes to 144 bytes (on a
64 bits system).
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
When merging an extent_map with its right neighbor, increment
its block_len with the neighbor's block_len.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Before this change, adding an extent map to the extent map tree of an
inode required 2 tree nevigations:
1) doing a tree navigation to search for an existing extent map starting
at the same offset or an extent map that overlaps the extent map we
want to insert;
2) Another tree navigation to add the extent map to the tree (if the
former tree search didn't found anything).
This change just merges these 2 steps into a single one.
While running first few btrfs xfstests I had noticed these trees easily
had a few hundred elements, and then with the following sysbench test it
reached over 1100 elements very often.
Test:
sysbench --test=fileio --file-num=32 --file-total-size=10G \
--file-test-mode=seqwr --num-threads=512 --file-block-size=8192 \
--max-requests=1000000 --file-io-mode=sync [prepare|run]
(fs created with mkfs.btrfs -l 4096 -f /dev/sdb3 before each sysbench
prepare phase)
Before this patch:
run 1 - 41.894Mb/sec
run 2 - 40.527Mb/sec
run 3 - 40.922Mb/sec
run 4 - 49.433Mb/sec
run 5 - 40.959Mb/sec
average - 42.75Mb/sec
After this patch:
run 1 - 48.036Mb/sec
run 2 - 50.21Mb/sec
run 3 - 50.929Mb/sec
run 4 - 46.881Mb/sec
run 5 - 53.192Mb/sec
average - 49.85Mb/sec
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Big patch, but all it does is add statics to functions which
are in fact static, then remove the associated dead-code fallout.
removed functions:
btrfs_iref_to_path()
__btrfs_lookup_delayed_deletion_item()
__btrfs_search_delayed_insertion_item()
__btrfs_search_delayed_deletion_item()
find_eb_for_page()
btrfs_find_block_group()
range_straddles_pages()
extent_range_uptodate()
btrfs_file_extent_length()
btrfs_scrub_cancel_devid()
btrfs_start_transaction_lflush()
btrfs_print_tree() is left because it is used for debugging.
btrfs_start_transaction_lflush() and btrfs_reada_detach() are
left for symmetry.
ulist.c functions are left, another patch will take care of those.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
A user sent me a btrfs-image of a file system that was panicing on mount during
the log recovery. I had originally thought these problems were from a bug in
the free space cache code, but that was just a symptom of the problem. The
problem is if your application does something like this
[prealloc][prealloc][prealloc]
the internal extent maps will merge those all together into one extent map, even
though on disk they are 3 separate extents. So if you go to write into one of
these ranges the extent map will be right since we use the physical extent when
doing the write, but when we log the extents they will use the wrong sizes for
the remainder prealloc space. If this doesn't happen to trip up the free space
cache (which it won't in a lot of cases) then you will get bogus entries in your
extent tree which will screw stuff up later. The data and such will still work,
but everything else is broken. This patch fixes this by not allowing extents
that are on the modified list to be merged. This has the side effect that we
are no longer adding everything to the modified list all the time, which means
we now have to call btrfs_drop_extents every time we log an extent into the
tree. So this allows me to drop all this speciality code I was using to get
around calling btrfs_drop_extents. With this patch the testcase I've created no
longer creates a bogus file system after replaying the log. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We want to avoid module.h where posible, since it in turn includes
nearly all of header space. This means removing it where it is not
required, and using export.h where we are only exporting symbols via
EXPORT_SYMBOL and friends.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Pull btrfs fixes from Chris Mason:
"We've got corner cases for updating i_size that ceph was hitting,
error handling for quotas when we run out of space, a very subtle
snapshot deletion race, a crash while removing devices, and one
deadlock between subvolume creation and the sb_internal code (thanks
lockdep)."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: move d_instantiate outside the transaction during mksubvol
Btrfs: fix EDQUOT handling in btrfs_delalloc_reserve_metadata
Btrfs: fix possible stale data exposure
Btrfs: fix missing i_size update
Btrfs: fix race between snapshot deletion and getting inode
Btrfs: fix missing release of the space/qgroup reservation in start_transaction()
Btrfs: fix wrong sync_writers decrement in btrfs_file_aio_write()
Btrfs: do not merge logged extents if we've removed them from the tree
btrfs: don't try to notify udev about missing devices
You can run into this problem where if somebody is fsyncing and writing out
the existing extents you will have removed the extent map from the em tree,
but it's still valid for the current fsync so we go ahead and write it. The
problem is we unconditionally try to merge it back into the em tree, but if
we've removed it from the em tree that will cause use after free problems.
Fix this to only merge if we are still a part of the tree. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Pull btrfs fixes from Chris Mason:
"It turns out that we had two crc bugs when running fsx-linux in a
loop. Many thanks to Josef, Miao Xie, and Dave Sterba for nailing it
all down. Miao also has a new OOM fix in this v2 pull as well.
Ilya fixed a regression Liu Bo found in the balance ioctls for pausing
and resuming a running balance across drives.
Josef's orphan truncate patch fixes an obscure corruption we'd see
during xfstests.
Arne's patches address problems with subvolume quotas. If the user
destroys quota groups incorrectly the FS will refuse to mount.
The rest are smaller fixes and plugs for memory leaks."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (30 commits)
Btrfs: fix repeated delalloc work allocation
Btrfs: fix wrong max device number for single profile
Btrfs: fix missed transaction->aborted check
Btrfs: Add ACCESS_ONCE() to transaction->abort accesses
Btrfs: put csums on the right ordered extent
Btrfs: use right range to find checksum for compressed extents
Btrfs: fix panic when recovering tree log
Btrfs: do not allow logged extents to be merged or removed
Btrfs: fix a regression in balance usage filter
Btrfs: prevent qgroup destroy when there are still relations
Btrfs: ignore orphan qgroup relations
Btrfs: reorder locks and sanity checks in btrfs_ioctl_defrag
Btrfs: fix unlock order in btrfs_ioctl_rm_dev
Btrfs: fix unlock order in btrfs_ioctl_resize
Btrfs: fix "mutually exclusive op is running" error code
Btrfs: bring back balance pause/resume logic
btrfs: update timestamps on truncate()
btrfs: fix btrfs_cont_expand() freeing IS_ERR em
Btrfs: fix a bug when llseek for delalloc bytes behind prealloc extents
Btrfs: fix off-by-one in lseek
...
We drop the extent map tree lock while we're logging extents, so somebody
could come in and merge another extent into this one and screw up our
logging, or they could even remove us from the list which would keep us from
logging the extent or freeing our ref on it, so we need to make sure to not
clear LOGGING until after the extent is logged, and then we can merge it to
adjacent extents. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Pull btrfs update from Chris Mason:
"A big set of fixes and features.
In terms of line count, most of the code comes from Stefan, who added
the ability to replace a single drive in place. This is different
from how btrfs normally replaces drives, and is much much much faster.
Josef is plowing through our synchronous write performance. This pull
request does not include the DIO_OWN_WAITING patch that was discussed
on the list, but it has a number of other improvements to cut down our
latencies and CPU time during fsync/O_DIRECT writes.
Miao Xie has a big series of fixes and is spreading out ordered
operations over more CPUs. This improves performance and reduces
contention.
I've put in fixes for error handling around hash collisions. These
are going back to individual stable kernels as I test against them.
Otherwise we have a lot of fixes and cleanups, thanks everyone!
raid5/6 is being rebased against the device replacement code. I'll
have it posted this Friday along with a nice series of benchmarks."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (115 commits)
Btrfs: fix a bug of per-file nocow
Btrfs: fix hash overflow handling
Btrfs: don't take inode delalloc mutex if we're a free space inode
Btrfs: fix autodefrag and umount lockup
Btrfs: fix permissions of empty files not affected by umask
Btrfs: put raid properties into global table
Btrfs: fix BUG() in scrub when first superblock reading gives EIO
Btrfs: do not call file_update_time in aio_write
Btrfs: only unlock and relock if we have to
Btrfs: use tokens where we can in the tree log
Btrfs: optimize leaf_space_used
Btrfs: don't memset new tokens
Btrfs: only clear dirty on the buffer if it is marked as dirty
Btrfs: move checks in set_page_dirty under DEBUG
Btrfs: log changed inodes based on the extent map tree
Btrfs: add path->really_keep_locks
Btrfs: do not mark ems as prealloc if we are writing to them
Btrfs: keep track of the extents original block length
Btrfs: inline csums if we're fsyncing
Btrfs: don't bother copying if we're only logging the inode
...
We don't really need to copy extents from the source tree since we have all
of the information already available to us in the extent_map tree. So
instead just write the extents straight to the log tree and don't bother to
copy the extent items from the source tree.
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We are going to use EM's to log extents in the future, so we need to not
mark them as prealloc if they aren't actually prealloc extents. Instead
mark them with FILLING so we know to ammend mod_start/mod_len and that way
we don't confuse the extent logging code. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Dave Sterba pointed out a sleeping while atomic bug while doing fsync. This
is because I'm an idiot and didn't realize that rwlock's were spin locks, so
we've been holding this thing while doing allocations and such which is not
good. This patch fixes this by dropping the write lock before we do
anything heavy and re-acquire it when it is done. We also need to take a
ref on the em's in case their corresponding pages are evicted and mark them
as being logged so that releasepage does not remove them and doesn't remove
them from our local list. Thanks,
Reported-by: Dave Sterba <dave@jikos.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
This is based on Josef's "Btrfs: turbo charge fsync".
The above Josef's patch performs very good in random sync write test,
because we won't have too much extents to merge.
However, it does not performs good on the test:
dd if=/dev/zero of=foobar bs=4k count=12500 oflag=sync
The reason is when we do sequencial sync write, we need to merge the
current extent just with the previous one, so that we can get accumulated
extents to log:
A(4k) --> AA(8k) --> AAA(12k) --> AAAA(16k) ...
So we'll have to flush more and more checksum into log tree, which is the
bottleneck according to my tests.
But we can avoid this by telling fsync the real extents that are needed
to be logged.
With this, I did the above dd sync write test (size=50m),
w/o (orig) w/ (josef's) w/ (this)
SATA 104KB/s 109KB/s 121KB/s
ramdisk 1.5MB/s 1.5MB/s 10.7MB/s (613%)
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
At least for the vm workload. Currently on fsync we will
1) Truncate all items in the log tree for the given inode if they exist
and
2) Copy all items for a given inode into the log
The problem with this is that for things like VMs you can have lots of
extents from the fragmented writing behavior, and worst yet you may have
only modified a few extents, not the entire thing. This patch fixes this
problem by tracking which transid modified our extent, and then when we do
the tree logging we find all of the extents we've modified in our current
transaction, sort them and commit them. We also only truncate up to the
xattrs of the inode and copy that stuff in normally, and then just drop any
extents in the range we have that exist in the log already. Here are some
numbers of a 50 meg fio job that does random writes and fsync()s after every
write
Original Patched
SATA drive 82KB/s 140KB/s
Fusion drive 431KB/s 2532KB/s
So around 2-6 times faster depending on your hardware. There are a few
corner cases, for example if you truncate at all we have to do it the old
way since there is no way to be sure what is in the log is ok. This
probably could be done smarter, but if you write-fsync-truncate-write-fsync
you deserve what you get. All this work is in RAM of course so if your
inode gets evicted from cache and you read it in and fsync it we'll do it
the slow way if we are still in the same transaction that we last modified
the inode in.
The biggest cool part of this is that it requires no changes to the recovery
code, so if you fsync with this patch and crash and load an old kernel, it
will run the recovery and be a-ok. I have tested this pretty thoroughly
with an fsync tester and everything comes back fine, as well as xfstests.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
unpin_extent_cache() and add_extent_mapping() shares the same code
that merges extent maps.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
lookup_extent_map() and search_extent_map() can share most of code.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
rb_node returned by __tree_search() can be a valid pointer or NULL,
but won't be some errno.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
I add the check on the return value of alloc_extent_map() to several places.
In addition, alloc_extent_map() returns only the address or NULL.
Therefore, check by IS_ERR() is unnecessary. So, I remove IS_ERR() checking.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Make the code aware of compression type, instead of always assuming
zlib compression.
Also make the zlib workspace function as common code for all
compression types.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Use ERR_CAST(x) rather than ERR_PTR(PTR_ERR(x)). The former makes more
clear what is the purpose of the operation, which otherwise looks like a
no-op.
The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@@
type T;
T x;
identifier f;
@@
T f (...) { <+...
- ERR_PTR(PTR_ERR(x))
+ x
...+> }
@@
expression x;
@@
- ERR_PTR(PTR_ERR(x))
+ ERR_CAST(x)
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the followings.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: kfree correct pointer during mount option parsing
Btrfs: use RB_ROOT to intialize rb_trees instead of setting rb_node to NULL
btrfs inialize rb trees in quite a number of places by settin rb_node =
NULL; The problem with this is that 17d9ddc72f in the
linux-next tree adds a new field to that struct which needs to be NULL for
the new rbtree library code to work properly. This patch uses RB_ROOT as
the intializer so all of the relevant fields will be NULL'd. Without the
patch I get a panic.
Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: check total number of devices when removing missing
Btrfs: check return value of open_bdev_exclusive properly
Btrfs: do not mark the chunk as readonly if in degraded mode
Btrfs: run orphan cleanup on default fs root
Btrfs: fix a memory leak in btrfs_init_acl
Btrfs: Use correct values when updating inode i_size on fallocate
Btrfs: remove tree_search() in extent_map.c
Btrfs: Add mount -o compress-force
This patch removes tree_search() in extent_map.c because it is not called by
anything.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
I re-orderred the checks to avoid dereferencing "em" if it was null.
Found by smatch static checker.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The allocator has some nice knobs for sending hints about where
to try and allocate new blocks, but when we're doing file allocations
we're not sending any hint at all.
This commit adds a simple extent map search to see if we can
quickly and easily find a hint for the allocator.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Data COW means that whenever we write to a file, we replace any old
extent pointers with new ones. There was a window where a readpage
might find the old extent pointers on disk and cache them in the
extent_map tree in ram in the middle of a given write replacing them.
Even though both the readpage and the write had their respective bytes
in the file locked, the extent readpage inserts may cover more bytes than
it had locked down.
This commit closes the race by keeping the new extent pinned in the extent
map tree until after the on-disk btree is properly setup with the new
extent pointers.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
There are two main users of the extent_map tree. The
first is regular file inodes, where it is evenly spread
between readers and writers.
The second is the chunk allocation tree, which maps blocks from
logical addresses to phyiscal ones, and it is 99.99% reads.
The mapping tree is a point of lock contention during heavy IO
workloads, so this commit switches things to a rw lock.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Currently the extent_map code is only for btrfs so don't export it's
symbols.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This is a large change for adding compression on reading and writing,
both for inline and regular extents. It does some fairly large
surgery to the writeback paths.
Compression is off by default and enabled by mount -o compress. Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.
If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts later.
* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler. This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.
* Inline extents are inserted at delalloc time now. This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.
* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.
From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field. Neither the encryption or the
'other' field are currently used.
In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k. This is a
software only limit, the disk format supports u64 sized compressed extents.
In order to limit the ram consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k. This is a software only limit
and will be subject to tuning later.
Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data. This way additional encodings can be
layered on without having to figure out which encoding to checksum.
Compression happens at delalloc time, which is basically singled threaded because
it is usually done by a single pdflush thread. This makes it tricky to
spread the compression load across all the cpus on the box. We'll have to
look at parallel pdflush walks of dirty inodes at a later time.
Decompression is hooked into readpages and it does spread across CPUs nicely.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This improves the comments at the top of many functions. It didn't
dive into the guts of functions because I was trying to
avoid merging problems with the new allocator and back reference work.
extent-tree.c and volumes.c were both skipped, and there is definitely
more work todo in cleaning and commenting the code.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
add_extent_mapping was allowing the insertion of overlapping extents.
This never used to happen because it only inserted the extents from disk
and those were never overlapping.
But, with the data=ordered code, the disk and memory representations of the
file are not the same. add_extent_mapping needs to ensure a new extent
does not overlap before it inserts.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Stress testing was showing data checksum errors, most of which were caused
by a lookup bug in the extent_map tree. The tree was caching the last
pointer returned, and searches would check the last pointer first.
But, search callers also expect the search to return the very first
matching extent in the range, which wasn't always true with the last
pointer usage.
For now, the code to cache the last return value is just removed. It is
easy to fix, but I think lookups are rare enough that it isn't required anymore.
This commit also replaces do_sync_mapping_range with a local copy of the
related functions.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
It was possible for stale mappings from disk to be used instead of the
new pending ordered extent. This adds a flag to the extent map struct
to keep it pinned until the pending ordered extent is actually on disk.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The old data=ordered code would force commit to wait until
all the data extents from the transaction were fully on disk. This
introduced large latencies into the commit and stalled new writers
in the transaction for a long time.
The new code changes the way data allocations and extents work:
* When delayed allocation is filled, data extents are reserved, and
the extent bit EXTENT_ORDERED is set on the entire range of the extent.
A struct btrfs_ordered_extent is allocated an inserted into a per-inode
rbtree to track the pending extents.
* As each page is written EXTENT_ORDERED is cleared on the bytes corresponding
to that page.
* When all of the bytes corresponding to a single struct btrfs_ordered_extent
are written, The previously reserved extent is inserted into the FS
btree and into the extent allocation trees. The checksums for the file
data are also updated.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
There is now extent_map for mapping offsets in the file to disk and
extent_io for state tracking, IO submission and extent_bufers.
The new extent_map code shifts from [start,end] pairs to [start,len], and
pushes the locking out into the caller. This allows a few performance
optimizations and is easier to use.
A number of extent_map usage bugs were fixed, mostly with failing
to remove extent_map entries when changing the file.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
There were a few places that could cause duplicate extent insertion,
this adjusts the code that creates holes to avoid it.
lookup_extent_map is changed to correctly return all of the extents in a
range, even when there are none matching at the start of the range.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
test_range_bit doesn't properly handle the case: there's a hole at the
end of the range and there's no other extent_state after the range.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This is intended to prevent accidentally filling the drive. A determined
user can still make things oops.
It includes some accounting of the current bytes under delayed allocation,
but this will change as things get optimized
Signed-off-by: Chris Mason <chris.mason@oracle.com>
--Boundary-00=_CcOWHFYK4T+JwSj
Content-Type: text/plain;
charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
Hello everybody,
compiling btrfs into the kernel results in section mismatch warnings. __exit
functions are called where they are not allowed to. The attached patch fixes
this for me. Not sure if it is correct though.
Signed-off-by: Christian Hesse <mail@earthworm.de>
--
Regards,
Chris
--Boundary-00=_CcOWHFYK4T+JwSj
Content-Type: text/x-diff; charset="iso-8859-1";
name="btrfs-section_mismatches.patch"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
filename="btrfs-section_mismatches.patch"
Signed-off-by: Chris Mason <chris.mason@oracle.com>
There is a 'finish_wait', but no 'prepare_to_wait' . So I think that
the 'prepare_to_wait' is missing. The second change is according to
the name of variable.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The fixes do a number of things:
1) Most btrfs_drop_extent callers will try to leave the inline extents in
place. It can truncate bytes off the beginning of the inline extent if
required.
2) writepage can now update the inline extent, allowing mmap writes to
go directly into the inline extent.
3) btrfs_truncate_in_transaction truncates inline extents
4) extent_map.c fixed to not merge inline extent mappings and hole
mappings together
Signed-off-by: Chris Mason <chris.mason@oracle.com>
It now returns void and it is never called for partial completions, so
the bio->bi_size check must go.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
XFS updates the ondisk inode size only after the data I/O has finished,
so it needs a hook when the writepage end_bio handler has finished.
Might not be worth applying as-is as the per-page callback is very
ineffcient. What XFS really wants is a callback when writeout of a
whole extent has completed. This delayed i_size updates scheme might
be worthwile for btrfs aswell, btw.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The writepage_io is not mandatory, e.g. my port of xfs to the extent_map
code does not have one for now. So handle a NULL pointer gracefully
here.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
generic_bmap is completely trivial, while the extent to bh mapping in
btrfs is rather complex. So provide a extent_bmap instead that takes
a get_extent callback and can be used by filesystem using the extent_map
code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The bio completion handlers can be run in any context, e.g. when using
the old ide driver they run in hardirq context with irqs disabled so
lockdep rightfully warns about using write_lock_irq useage in these
handlers.
This patch switches clear_extent_bit and set_extent_bit to
write_lock_irqsave to fix this problem.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Yan Zheng noticed that set_extent_bit was exiting too early when there
was a hole in the map. The fix is to reorder the tests to check for the
hole first.
Signed-off-by: Chris Mason <chris.mason@oracle.com>