Commit Graph

442 Commits

Author SHA1 Message Date
Qu Wenruo
150e4b0597 btrfs: submit read time repair only for each corrupted sector
Currently btrfs_submit_read_repair() has some extra check on whether the
failed bio needs extra validation for repair.  But we can avoid all
these extra mechanisms if we submit the repair for each sector.

By this, each read repair can be easily handled without the need to
verify which sector is corrupted.

This will also benefit subpage, as one subpage bvec can contain several
sectors, making the extra verification more complex.

So this patch will:

- Introduce repair_one_sector()
  The main code submitting repair, which is more or less the same as old
  btrfs_submit_read_repair().
  But this time, it only repairs one sector.

- Make btrfs_submit_read_repair() to handle sectors differently
  There are 3 different cases:

  * Good sector
    We need to release the page and extent, set the range uptodate.

  * Bad sector and failed to submit repair bio
    We need to release the page and extent, but not set the range
    uptodate.

  * Bad sector but repair bio submitted
    The page and extent release will be handled by the submitted repair
    bio. Nothing needs to be done.

  Since btrfs_submit_read_repair() will handle the page and extent
  release now, we need to skip to next bvec even we hit some error.

- Change the lifespan of @uptodate in end_bio_extent_readpage()
  Since now btrfs_submit_read_repair() will handle the full bvec
  which contains any corruption, we don't need to bother updating
  @uptodate bit anymore.
  Just let @uptodate to be local variable inside the main loop,
  so that any error from one bvec won't affect later bvec.

- Only export btrfs_repair_one_sector(), unexport
  btrfs_submit_read_repair()
  The only outside caller for read repair is DIO, which already submits
  its repair for just one sector.
  Only export btrfs_repair_one_sector() for DIO.

This patch will focus on the change on the repair path, the extra
validation code is still kept as is, and will be cleaned up later.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21 15:19:05 +02:00
Wan Jiabing
183ebab766 btrfs: move forward declarations to the beginning of extent_io.h
There are two forward declarations deep in extent_io.h, move them
to the beginning and remove the duplicate one.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Wan Jiabing <wanjiabing@vivo.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-19 17:25:19 +02:00
Naohiro Aota
d3575156f6 btrfs: zoned: redirty released extent buffers
Tree manipulating operations like merging nodes often release
once-allocated tree nodes. Such nodes are cleaned so that pages in the
node are not uselessly written out. On zoned volumes, however, such
optimization blocks the following IOs as the cancellation of the write
out of the freed blocks breaks the sequential write sequence expected by
the device.

Introduce a list of clean and unwritten extent buffers that have been
released in a transaction. Redirty the buffers so that
btree_write_cache_pages() can send proper bios to the devices.

Besides it clears the entire content of the extent buffer not to confuse
raw block scanners e.g. 'btrfs check'. By clearing the content,
csum_dirty_buffer() complains about bytenr mismatch, so avoid the
checking and checksum using newly introduced buffer flag
EXTENT_BUFFER_NO_CHECK.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:04 +01:00
Qu Wenruo
32443de338 btrfs: introduce btrfs_subpage for data inodes
To support subpage sector size, data also need extra info to make sure
which sectors in a page are uptodate/dirty/...

This patch will make pages for data inodes get btrfs_subpage structure
attached, and detached when the page is freed.

This patch also slightly changes the timing when
set_page_extent_mapped() is called to make sure:

- We have page->mapping set
  page->mapping->host is used to grab btrfs_fs_info, thus we can only
  call this function after page is mapped to an inode.

  One call site attaches pages to inode manually, thus we have to modify
  the timing of set_page_extent_mapped() a bit.

- As soon as possible, before other operations
  Since memory allocation can fail, we have to do extra error handling.
  Calling set_page_extent_mapped() as soon as possible can simply the
  error handling for several call sites.

The idea is pretty much the same as iomap_page, but with more bitmaps
for btrfs specific cases.

Currently the plan is to switch iomap if iomap can provide sector
aligned write back (only write back dirty sectors, but not the full
page, data balance require this feature).

So we will stick to btrfs specific bitmap for now.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:59:03 +01:00
Qu Wenruo
6869b0a8be btrfs: merge PAGE_CLEAR_DIRTY and PAGE_SET_WRITEBACK to PAGE_START_WRITEBACK
PAGE_CLEAR_DIRTY and PAGE_SET_WRITEBACK are two defines used in
__process_pages_contig(), to let the function know to clear page dirty
bit and then set page writeback.

However page writeback and dirty bits are conflicting (at least for
sector size == PAGE_SIZE case), this means these two have to be always
updated together.

This means we can merge PAGE_CLEAR_DIRTY and PAGE_SET_WRITEBACK to
PAGE_START_WRITEBACK.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:59:01 +01:00
Qu Wenruo
4a3dc93843 btrfs: update num_extent_pages to support subpage sized extent buffer
For subpage sized extent buffer, we have ensured no extent buffer will
cross page boundary, thus we would only need one page for any extent
buffer.

Update function num_extent_pages to handle such case.  Now
num_extent_pages() returns 1 for subpage sized extent buffer.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-09 19:16:10 +01:00
Qu Wenruo
deb6789553 btrfs: calculate inline extent buffer page size based on page size
Btrfs only support 64K as maximum node size, thus for 4K page system, we
would have at most 16 pages for one extent buffer.

For a system using 64K page size, we would really have just one page.

While we always use 16 pages for extent_buffer::pages, this means for
systems using 64K pages, we are wasting memory for 15 page pointers
which will never be used.

Calculate the array size based on page size and the node size maximum.

- for systems using 4K page size, it will stay 16 pages
- for systems using 64K page size, it will be 1 page

Move the definition of BTRFS_MAX_METADATA_BLOCKSIZE to btrfs_tree.h, to
avoid circular inclusion of ctree.h.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-09 19:16:10 +01:00
Qu Wenruo
7ffd27e378 btrfs: pass bio_offset to check_data_csum() directly
Parameter icsum for check_data_csum() is a little hard to understand.
So is the phy_offset for btrfs_verify_data_csum().

Both parameters are calculated values for csum lookup.

Instead of some calculated value, just pass bio_offset and let the
final and only user, check_data_csum(), calculate whatever it needs.

Since we are here, also make the bio_offset parameter and some related
variables to be u32 (unsigned int).
As bio size is limited by its bi_size, which is unsigned int, and has
extra size limit check during various bio operations.
Thus we are ensured that bio_offset won't overflow u32.

Thus for all involved functions, not only rename the parameter from
@phy_offset to @bio_offset, but also reduce its width to u32, so we
won't have suspicious "u32 = u64 >> sector_bits;" lines anymore.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-09 19:16:09 +01:00
Qu Wenruo
1941b64b08 btrfs: rename bio_offset of extent_submit_bio_start_t to dio_file_offset
The parameter bio_offset of extent_submit_bio_start_t is very confusing.
If it's really bio_offset (offset to bio), then it should be u32.  But
in fact, it's only utilized by dio read, and that member is used as file
offset, which must be u64.

Rename it to dio_file_offset since the only user uses it as file offset,
and add comment for who is using it.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-09 19:16:09 +01:00
Qu Wenruo
f97e27e91d btrfs: use fixed width int type for extent_state::state
Currently the type is unsigned int which could change its width
depending on the architecture. We need up to 32 bits so make it
explicit.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:13 +01:00
Josef Bacik
a55463c9f0 btrfs: remove extent_buffer::recursed
It is unused everywhere now, it can be removed.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:10 +01:00
Josef Bacik
3fbaf25817 btrfs: pass the owner_root and level to alloc_extent_buffer
Now that we've plumbed all of the callers to have the owner root and the
level, plumb it down into alloc_extent_buffer().

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:07 +01:00
Josef Bacik
bfb484d922 btrfs: cleanup extent buffer readahead
We're going to pass around more information when we allocate extent
buffers, in order to make that cleaner how we do readahead.  Most of the
callers have the parent node that we're getting our blockptr from, with
the sole exception of relocation which simply has the bytenr it wants to
read.

Add a helper that takes the current arguments that we need (bytenr and
gen), and add another helper for simply reading the slot out of a node.
In followup patches the helper that takes all the extra arguments will
be expanded, and the simpler helper won't need to have it's arguments
adjusted.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:05 +01:00
David Sterba
dc51616486 btrfs: reorder extent buffer members for better packing
After the rwsem replaced the tree lock implementation, the extent buffer
got smaller but leaving some holes behind. By changing log_index type
and reordering, we can squeeze the size further to 240 bytes, measured on
release config on x86_64. Log_index spans only 3 values and needs to be
signed.

Before:

struct extent_buffer {
        u64                        start;                /*     0     8 */
        long unsigned int          len;                  /*     8     8 */
        long unsigned int          bflags;               /*    16     8 */
        struct btrfs_fs_info *     fs_info;              /*    24     8 */
        spinlock_t                 refs_lock;            /*    32     4 */
        atomic_t                   refs;                 /*    36     4 */
        atomic_t                   io_pages;             /*    40     4 */
        int                        read_mirror;          /*    44     4 */
        struct callback_head       callback_head __attribute__((__aligned__(8))); /*    48    16 */
        /* --- cacheline 1 boundary (64 bytes) --- */
        pid_t                      lock_owner;           /*    64     4 */
        bool                       lock_recursed;        /*    68     1 */

        /* XXX 3 bytes hole, try to pack */

        struct rw_semaphore        lock;                 /*    72    40 */
        short int                  log_index;            /*   112     2 */

        /* XXX 6 bytes hole, try to pack */

        struct page *              pages[16];            /*   120   128 */

        /* size: 248, cachelines: 4, members: 14 */
        /* sum members: 239, holes: 2, sum holes: 9 */
        /* forced alignments: 1 */
        /* last cacheline: 56 bytes */
} __attribute__((__aligned__(8)));

After:

struct extent_buffer {
        u64                        start;                /*     0     8 */
        long unsigned int          len;                  /*     8     8 */
        long unsigned int          bflags;               /*    16     8 */
        struct btrfs_fs_info *     fs_info;              /*    24     8 */
        spinlock_t                 refs_lock;            /*    32     4 */
        atomic_t                   refs;                 /*    36     4 */
        atomic_t                   io_pages;             /*    40     4 */
        int                        read_mirror;          /*    44     4 */
        struct callback_head       callback_head __attribute__((__aligned__(8))); /*    48    16 */
        /* --- cacheline 1 boundary (64 bytes) --- */
        pid_t                      lock_owner;           /*    64     4 */
        bool                       lock_recursed;        /*    68     1 */
        s8                         log_index;            /*    69     1 */

        /* XXX 2 bytes hole, try to pack */

        struct rw_semaphore        lock;                 /*    72    40 */
        struct page *              pages[16];            /*   112   128 */

        /* size: 240, cachelines: 4, members: 14 */
        /* sum members: 238, holes: 1, sum holes: 2 */
        /* forced alignments: 1 */
        /* last cacheline: 48 bytes */
} __attribute__((__aligned__(8)));

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:02 +01:00
Qu Wenruo
8896a08d8e btrfs: replace fs_info and private_data with inode in btrfs_wq_submit_bio
All callers of btrfs_wq_submit_bio() pass struct inode as @private_data,
so there is no need for it to be (void *), replace it with "struct inode
*inode".

While we can extract fs_info from struct inode, also remove the @fs_info
parameter.

Since we're here, also replace all the (void *private_data) into (struct
inode *inode).

Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:54 +01:00
Josef Bacik
196d59ab9c btrfs: switch extent buffer tree lock to rw_semaphore
Historically we've implemented our own locking because we wanted to be
able to selectively spin or sleep based on what we were doing in the
tree.  For instance, if all of our nodes were in cache then there's
rarely a reason to need to sleep waiting for node locks, as they'll
likely become available soon.  At the time this code was written the
rw_semaphore didn't do adaptive spinning, and thus was orders of
magnitude slower than our home grown locking.

However now the opposite is the case.  There are a few problems with how
we implement blocking locks, namely that we use a normal waitqueue and
simply wake everybody up in reverse sleep order.  This leads to some
suboptimal performance behavior, and a lot of context switches in highly
contended cases.  The rw_semaphores actually do this properly, and also
have adaptive spinning that works relatively well.

The locking code is also a bit of a bear to understand, and we lose the
benefit of lockdep for the most part because the blocking states of the
lock are simply ad-hoc and not mapped into lockdep.

So rework the locking code to drop all of this custom locking stuff, and
simply use a rw_semaphore for everything.  This makes the locking much
simpler for everything, as we can now drop a lot of cruft and blocking
transitions.  The performance numbers vary depending on the workload,
because generally speaking there doesn't tend to be a lot of contention
on the btree.  However, on my test system which is an 80 core single
socket system with 256GiB of RAM and a 2TiB NVMe drive I get the
following results (with all debug options off):

  dbench 200 baseline
  Throughput 216.056 MB/sec  200 clients  200 procs  max_latency=1471.197 ms

  dbench 200 with patch
  Throughput 737.188 MB/sec  200 clients  200 procs  max_latency=714.346 ms

Previously we also used fs_mark to test this sort of contention, and
those results are far less impressive, mostly because there's not enough
tasks to really stress the locking

  fs_mark -d /d[0-15] -S 0 -L 20 -n 100000 -s 0 -t 16

  baseline
    Average Files/sec:     160166.7
    p50 Files/sec:         165832
    p90 Files/sec:         123886
    p99 Files/sec:         123495

    real    3m26.527s
    user    2m19.223s
    sys     48m21.856s

  patched
    Average Files/sec:     164135.7
    p50 Files/sec:         171095
    p90 Files/sec:         122889
    p99 Files/sec:         113819

    real    3m29.660s
    user    2m19.990s
    sys     44m12.259s

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:43 +01:00
Nikolay Borisov
905eb88bce btrfs: remove struct extent_io_ops
It's no longer used just remove the function and any related code which
was initialising it for inodes. No functional changes.

Removing 8 bytes from extent_io_tree in turn reduces size of other
structures where it is embedded, notably btrfs_inode where it reduces
size by 24 bytes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:13:25 +02:00
Nikolay Borisov
1f03d9cfda btrfs: remove extent_io_ops::readpage_end_io_hook
It's no longer used so let's remove it.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:13:24 +02:00
Nikolay Borisov
0f20881249 btrfs: open code extent_read_full_page to its sole caller
This makes reading the code a tad easier by decreasing the level of
indirection by one.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:13:21 +02:00
Nikolay Borisov
6f15af6060 btrfs: sink read_flags argument into extent_read_full_page
It's always set to 0 by its sole caller - btrfs_readpage. Simply remove
it.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:13:20 +02:00
Nikolay Borisov
003c286aef btrfs: sink mirror_num argument in extent_read_full_page
It's always set to 0 from the sole caller - btrfs_readpage.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:13:20 +02:00
Nikolay Borisov
c1be9c1ad5 btrfs: promote extent_read_full_page to btrfs_readpage
Now that btrfs_readpage is the only caller of extent_read_full_page the
latter can be open coded in the former. Use the occassion to rename
__extent_read_full_page to extent_read_full_page. To facillitate this
change submit_one_bio has to be exported as well.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:13:20 +02:00
Nikolay Borisov
72cffee463 btrfs: remove mirror_num argument from extent_read_full_page
It's called only from btrfs_readpage which always passes 0 so just sink
the argument into extent_read_full_page.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:13:20 +02:00
Nikolay Borisov
1a5ee1e626 btrfs: remove btrfs_get_extent indirection from __do_readpage
Now that this function is only responsible for reading data pages it's
no longer necessary to pass get_extent_t parameter across several
layers of functions. This patch removes this parameter from multiple
functions: __get_extent_map/__do_readpage/__extent_read_full_page/
extent_read_full_page and simply calls btrfs_get_extent directly in
__get_extent_map.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:13:20 +02:00
Nikolay Borisov
facee0a09c btrfs: make extent_fiemap take btrfs_inode
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:12:19 +02:00
Josef Bacik
329ced799b btrfs: rename extent_buffer::lock_nested to extent_buffer::lock_recursed
Nested locking with lockdep and everything else refers to lock hierarchy
within the same lock map.  This is how we indicate the same locks for
different objects are ok to take in a specific order, for our use case
that would be to take the lock on a leaf and then take a lock on an
adjacent leaf.

What ->lock_nested _actually_ refers to is if we happen to already be
holding the write lock on the extent buffer and we're allowing a read
lock to be taken on that extent buffer, which is recursion.  Rename this
so we don't get confused when we switch to a rwsem and have to start
using the _nested helpers.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:12:15 +02:00
Josef Bacik
a48b73eca4 btrfs: fix potential deadlock in the search ioctl
With the conversion of the tree locks to rwsem I got the following
lockdep splat:

  ======================================================
  WARNING: possible circular locking dependency detected
  5.8.0-rc7-00165-g04ec4da5f45f-dirty #922 Not tainted
  ------------------------------------------------------
  compsize/11122 is trying to acquire lock:
  ffff889fabca8768 (&mm->mmap_lock#2){++++}-{3:3}, at: __might_fault+0x3e/0x90

  but task is already holding lock:
  ffff889fe720fe40 (btrfs-fs-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #2 (btrfs-fs-00){++++}-{3:3}:
	 down_write_nested+0x3b/0x70
	 __btrfs_tree_lock+0x24/0x120
	 btrfs_search_slot+0x756/0x990
	 btrfs_lookup_inode+0x3a/0xb4
	 __btrfs_update_delayed_inode+0x93/0x270
	 btrfs_async_run_delayed_root+0x168/0x230
	 btrfs_work_helper+0xd4/0x570
	 process_one_work+0x2ad/0x5f0
	 worker_thread+0x3a/0x3d0
	 kthread+0x133/0x150
	 ret_from_fork+0x1f/0x30

  -> #1 (&delayed_node->mutex){+.+.}-{3:3}:
	 __mutex_lock+0x9f/0x930
	 btrfs_delayed_update_inode+0x50/0x440
	 btrfs_update_inode+0x8a/0xf0
	 btrfs_dirty_inode+0x5b/0xd0
	 touch_atime+0xa1/0xd0
	 btrfs_file_mmap+0x3f/0x60
	 mmap_region+0x3a4/0x640
	 do_mmap+0x376/0x580
	 vm_mmap_pgoff+0xd5/0x120
	 ksys_mmap_pgoff+0x193/0x230
	 do_syscall_64+0x50/0x90
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #0 (&mm->mmap_lock#2){++++}-{3:3}:
	 __lock_acquire+0x1272/0x2310
	 lock_acquire+0x9e/0x360
	 __might_fault+0x68/0x90
	 _copy_to_user+0x1e/0x80
	 copy_to_sk.isra.32+0x121/0x300
	 search_ioctl+0x106/0x200
	 btrfs_ioctl_tree_search_v2+0x7b/0xf0
	 btrfs_ioctl+0x106f/0x30a0
	 ksys_ioctl+0x83/0xc0
	 __x64_sys_ioctl+0x16/0x20
	 do_syscall_64+0x50/0x90
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  other info that might help us debug this:

  Chain exists of:
    &mm->mmap_lock#2 --> &delayed_node->mutex --> btrfs-fs-00

   Possible unsafe locking scenario:

	 CPU0                    CPU1
	 ----                    ----
    lock(btrfs-fs-00);
				 lock(&delayed_node->mutex);
				 lock(btrfs-fs-00);
    lock(&mm->mmap_lock#2);

   *** DEADLOCK ***

  1 lock held by compsize/11122:
   #0: ffff889fe720fe40 (btrfs-fs-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180

  stack backtrace:
  CPU: 17 PID: 11122 Comm: compsize Kdump: loaded Not tainted 5.8.0-rc7-00165-g04ec4da5f45f-dirty #922
  Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
  Call Trace:
   dump_stack+0x78/0xa0
   check_noncircular+0x165/0x180
   __lock_acquire+0x1272/0x2310
   lock_acquire+0x9e/0x360
   ? __might_fault+0x3e/0x90
   ? find_held_lock+0x72/0x90
   __might_fault+0x68/0x90
   ? __might_fault+0x3e/0x90
   _copy_to_user+0x1e/0x80
   copy_to_sk.isra.32+0x121/0x300
   ? btrfs_search_forward+0x2a6/0x360
   search_ioctl+0x106/0x200
   btrfs_ioctl_tree_search_v2+0x7b/0xf0
   btrfs_ioctl+0x106f/0x30a0
   ? __do_sys_newfstat+0x5a/0x70
   ? ksys_ioctl+0x83/0xc0
   ksys_ioctl+0x83/0xc0
   __x64_sys_ioctl+0x16/0x20
   do_syscall_64+0x50/0x90
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

The problem is we're doing a copy_to_user() while holding tree locks,
which can deadlock if we have to do a page fault for the copy_to_user().
This exists even without my locking changes, so it needs to be fixed.
Rework the search ioctl to do the pre-fault and then
copy_to_user_nofault for the copying.

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-08-27 13:56:27 +02:00
David Sterba
bab16e21e8 btrfs: don't use UAPI types for fiemap callback
The fiemap callback is not part of UAPI interface and the prototypes
don't have the __u64 types either.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:27 +02:00
Nikolay Borisov
ad7ff17b65 btrfs: make extent_clear_unlock_delalloc take btrfs_inode
It has one VFS and 1 btrfs inode usages but converting it to btrfs_inode
interface will allow seamless conversion of its callers.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:26 +02:00
Linus Torvalds
0b166a57e6 A lot of bug fixes and cleanups for ext4, including:
* Fix performance problems found in dioread_nolock now that it is the
   default, caused by transaction leaks.
 * Clean up fiemap handling in ext4
 * Clean up and refactor multiple block allocator (mballoc) code
 * Fix a problem with mballoc with a smaller file systems running out
   of blocks because they couldn't properly use blocks that had been
   reserved by inode preallocation.
 * Fixed a race in ext4_sync_parent() versus rename()
 * Simplify the error handling in the extent manipulation code
 * Make sure all metadata I/O errors are felected to ext4_ext_dirty()'s and
   ext4_make_inode_dirty()'s callers.
 * Avoid passing an error pointer to brelse in ext4_xattr_set()
 * Fix race which could result to freeing an inode on the dirty last
   in data=journal mode.
 * Fix refcount handling if ext4_iget() fails
 * Fix a crash in generic/019 caused by a corrupted extent node
 -----BEGIN PGP SIGNATURE-----
 
 iQEyBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAl7Ze8kACgkQ8vlZVpUN
 gaNChAf4xn0ytFSrweI/S2Sp05G/2L/ocZ2TZZk2ZdGeN1E+ABdSIv/zIF9zuFgZ
 /pY/C+fyEZWt4E3FlNO8gJzoEedkzMCMnUhSIfI+wZbcclyTOSNMJtnrnJKAEtVH
 HOvGZJmg357jy407RCGhZpJ773nwU2xhBTr5OFxvSf9mt/vzebxIOnw5D7HPlC1V
 Fgm6Du8q+tRrPsyjv1Yu4pUEVXMJ7qUcvt326AXVM3kCZO1Aa5GrURX0w3J4mzW1
 tc1tKmtbLcVVYTo9CwHXhk/edbxrhAydSP2iACand3tK6IJuI6j9x+bBJnxXitnr
 vsxsfTYMG18+2SxrJ9LwmagqmrRq
 =HMTs
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "A lot of bug fixes and cleanups for ext4, including:

   - Fix performance problems found in dioread_nolock now that it is the
     default, caused by transaction leaks.

   - Clean up fiemap handling in ext4

   - Clean up and refactor multiple block allocator (mballoc) code

   - Fix a problem with mballoc with a smaller file systems running out
     of blocks because they couldn't properly use blocks that had been
     reserved by inode preallocation.

   - Fixed a race in ext4_sync_parent() versus rename()

   - Simplify the error handling in the extent manipulation code

   - Make sure all metadata I/O errors are felected to
     ext4_ext_dirty()'s and ext4_make_inode_dirty()'s callers.

   - Avoid passing an error pointer to brelse in ext4_xattr_set()

   - Fix race which could result to freeing an inode on the dirty last
     in data=journal mode.

   - Fix refcount handling if ext4_iget() fails

   - Fix a crash in generic/019 caused by a corrupted extent node"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (58 commits)
  ext4: avoid unnecessary transaction starts during writeback
  ext4: don't block for O_DIRECT if IOCB_NOWAIT is set
  ext4: remove the access_ok() check in ext4_ioctl_get_es_cache
  fs: remove the access_ok() check in ioctl_fiemap
  fs: handle FIEMAP_FLAG_SYNC in fiemap_prep
  fs: move fiemap range validation into the file systems instances
  iomap: fix the iomap_fiemap prototype
  fs: move the fiemap definitions out of fs.h
  fs: mark __generic_block_fiemap static
  ext4: remove the call to fiemap_check_flags in ext4_fiemap
  ext4: split _ext4_fiemap
  ext4: fix fiemap size checks for bitmap files
  ext4: fix EXT4_MAX_LOGICAL_BLOCK macro
  add comment for ext4_dir_entry_2 file_type member
  jbd2: avoid leaking transaction credits when unreserving handle
  ext4: drop ext4_journal_free_reserved()
  ext4: mballoc: use lock for checking free blocks while retrying
  ext4: mballoc: refactor ext4_mb_good_group()
  ext4: mballoc: introduce pcpu seqcnt for freeing PA to improve ENOSPC handling
  ext4: mballoc: refactor ext4_mb_discard_preallocations()
  ...
2020-06-05 16:19:28 -07:00
Christoph Hellwig
10c5db2864 fs: move the fiemap definitions out of fs.h
No need to pull the fiemap definitions into almost every file in the
kernel build.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Link: https://lore.kernel.org/r/20200523073016.2944131-5-hch@lst.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:55 -04:00
Linus Torvalds
f3cdc8ae11 for-5.8-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl7U50AACgkQxWXV+ddt
 WDtK1g//RXeNsTguYQr1N9R5eUPThjLEI0+4J0l4SYfCPU8Ou3C7nqpOEJJQgm8F
 ezZE+16cWi9U5uGueOc+w0rfyz4AuIXKgzoz+c0/GG2+yV5jp6DsAMbWqojAb96L
 V/N3HxEzR66jqwgVUBE/x5okb2SyY7//B1l/O0amc66XDO7KTMImpIwThere6zWZ
 o2SNpYpHAPQeUYJQx8h+FAW3w1CxrCZmnifazU9Jqe9J7QeQLg7rbUlJDV38jySm
 ZOA8ohKN9U1gPZy+dTU3kdyyuBIq1etkIaSPJANyTo5TczPKiC0IMg75cXtS4ae/
 NSxhccMpSIjVMcIHARzSFGYKNP3sGNRsmaTUg/2Cx/9GoHOhYMiCAVc8qtBBpwJO
 UI0siexrCe64RuTBMRRc128GdFv7IjmSImcdi8xaR62bCcUiNdEa3zvjRe/9tOEH
 ET7Z85oBnKpSzpC3MdhSUU4dtHY5XLawP8z3oUU1VSzSWM2DVjlHf79/VzbOfp18
 miCVpt94lCn/gUX7el6qcnbuvMAjDyeC6HmfD+TwzQgGwyV6TLgKN9lRXeH/Oy6/
 VgjGQSavGHMll3zIGURmrBCXKudjJg0J+IP4wN1TimmSEMfwKH+7tnekQd8y5qlF
 eXEIqlWNykKeDzEnmV9QJy+/cV83hVWM/mUslcTx39tLN/3B/Us=
 =qTt8
 -----END PGP SIGNATURE-----

Merge tag 'for-5.8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs updates from David Sterba:
 "Highlights:

   - speedup dead root detection during orphan cleanup, eg. when there
     are many deleted subvolumes waiting to be cleaned, the trees are
     now looked up in radix tree instead of a O(N^2) search

   - snapshot creation with inherited qgroup will mark the qgroup
     inconsistent, requires a rescan

   - send will emit file capabilities after chown, this produces a
     stream that does not need postprocessing to set the capabilities
     again

   - direct io ported to iomap infrastructure, cleaned up and simplified
     code, notably removing last use of struct buffer_head in btrfs code

  Core changes:

   - factor out backreference iteration, to be used by ordinary
     backreferences and relocation code

   - improved global block reserve utilization
      * better logic to serialize requests
      * increased maximum available for unlink
      * improved handling on large pages (64K)

   - direct io cleanups and fixes
      * simplify layering, where cloned bios were unnecessarily created
        for some cases
      * error handling fixes (submit, endio)
      * remove repair worker thread, used to avoid deadlocks during
        repair

   - refactored block group reading code, preparatory work for new type
     of block group storage that should improve mount time on large
     filesystems

  Cleanups:

   - cleaned up (and slightly sped up) set/get helpers for metadata data
     structure members

   - root bit REF_COWS got renamed to SHAREABLE to reflect the that the
     blocks of the tree get shared either among subvolumes or with the
     relocation trees

  Fixes:

   - when subvolume deletion fails due to ENOSPC, the filesystem is not
     turned read-only

   - device scan deals with devices from other filesystems that changed
     ownership due to overwrite (mkfs)

   - fix a race between scrub and block group removal/allocation

   - fix long standing bug of a runaway balance operation, printing the
     same line to the syslog, caused by a stale status bit on a reloc
     tree that prevented progress

   - fix corrupt log due to concurrent fsync of inodes with shared
     extents

   - fix space underflow for NODATACOW and buffered writes when it for
     some reason needs to fallback to COW mode"

* tag 'for-5.8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (133 commits)
  btrfs: fix space_info bytes_may_use underflow during space cache writeout
  btrfs: fix space_info bytes_may_use underflow after nocow buffered write
  btrfs: fix wrong file range cleanup after an error filling dealloc range
  btrfs: remove redundant local variable in read_block_for_search
  btrfs: open code key_search
  btrfs: split btrfs_direct_IO to read and write part
  btrfs: remove BTRFS_INODE_READDIO_NEED_LOCK
  fs: remove dio_end_io()
  btrfs: switch to iomap_dio_rw() for dio
  iomap: remove lockdep_assert_held()
  iomap: add a filesystem hook for direct I/O bio submission
  fs: export generic_file_buffered_read()
  btrfs: turn space cache writeout failure messages into debug messages
  btrfs: include error on messages about failure to write space/inode caches
  btrfs: remove useless 'fail_unlock' label from btrfs_csum_file_blocks()
  btrfs: do not ignore error from btrfs_next_leaf() when inserting checksums
  btrfs: make checksum item extension more efficient
  btrfs: fix corrupt log due to concurrent fsync of inodes with shared extents
  btrfs: unexport btrfs_compress_set_level()
  btrfs: simplify iget helpers
  ...
2020-06-02 19:59:25 -07:00
Matthew Wilcox (Oracle)
ba206a026f btrfs: convert from readpages to readahead
Implement the new readahead method in btrfs using the new
readahead_page_batch() function.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Cc: Chao Yu <yuchao0@huawei.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Gao Xiang <gaoxiang25@huawei.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Link: http://lkml.kernel.org/r/20200414150233.24495-18-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:07 -07:00
David Sterba
2b48966a4d btrfs: constify extent_buffer in the API functions
There are many helpers around extent buffers, found in extent_io.h and
ctree.h. Most of them can be converted to take constified eb as there
are no changes to the extent buffer structure itself but rather the
pages.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:34 +02:00
David Sterba
db3756c879 btrfs: remove unused map_private_extent_buffer
All uses of map_private_extent_buffer have been replaced by more
effective way. The set/get helpers have their own bounds checker.
The function name was confusing since the non-private helper was removed
in a65917156e ("Btrfs: stop using highmem for extent_buffers") many
years ago.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:33 +02:00
Omar Sandoval
77d5d68931 btrfs: unify buffered and direct I/O read repair
Currently, direct I/O has its own versions of bio_readpage_error() and
btrfs_check_repairable() (dio_read_error() and
btrfs_check_dio_repairable(), respectively). The main difference is that
the direct I/O version doesn't do read validation. The rework of direct
I/O repair makes it possible to do validation, so we can get rid of
btrfs_check_dio_repairable() and combine bio_readpage_error() and
dio_read_error() into a new helper, btrfs_submit_read_repair().

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:27 +02:00
Omar Sandoval
ce06d3ec2b btrfs: make btrfs_check_repairable() static
Since its introduction in commit 2fe6303e7c ("Btrfs: split
bio_readpage_error into several functions"), btrfs_check_repairable()
has only been used from extent_io.c where it is defined.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:25 +02:00
Omar Sandoval
c7333972b9 btrfs: look at full bi_io_vec for repair decision
Read repair does two things: it finds a good copy of data to return to
the reader, and it corrects the bad copy on disk. If a read of multiple
sectors has an I/O error, repair does an extra "validation" step that
issues a separate read for each sector. This allows us to find the exact
failing sectors and only rewrite those.

This heuristic is implemented in
bio_readpage_error()/btrfs_check_repairable() as:

	failed_bio_pages = failed_bio->bi_iter.bi_size >> PAGE_SHIFT;
	if (failed_bio_pages > 1)
		do validation

However, at this point, bi_iter may have already been advanced. This
means that we'll skip the validation step and rewrite the entire failed
read.

Fix it by getting the actual size from the biovec (which we can do
because this is only called for non-cloned bios, although that will
change in a later commit).

Fixes: 8a2ee44a37 ("btrfs: look at bi_size for repair decisions")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:25 +02:00
Josef Bacik
3fd6372758 btrfs: make the extent buffer leak check per fs info
I'm going to make the entire destruction of btrfs_root's controlled by
their refcount, so it will be helpful to notice if we're leaking their
eb's on umount.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:58 +01:00
David Sterba
71ad38b44e btrfs: sink argument tree to extent_read_full_page
The tree pointer can be safely read from the page's inode, use it and
drop the redundant argument.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:35 +01:00
Omar Sandoval
39b07b5d70 btrfs: drop create parameter to btrfs_get_extent()
We only pass this as 1 from __extent_writepage_io(). The parameter
basically means "pretend I didn't pass in a page". This is silly since
we can simply not pass in the page. Get rid of the parameter from
btrfs_get_extent(), and since it's used as a get_extent_t callback,
remove it from get_extent_t and btree_get_extent(), neither of which
need it.

While we're here, let's document btrfs_get_extent().

Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-20 16:40:55 +01:00
David Sterba
67439dadb0 btrfs: opencode extent_buffer_get
The helper is trivial and we can understand what the atomic_inc on
something named refs does.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:54 +01:00
Josef Bacik
b3f167aa6c btrfs: move the failrec tree stuff into extent-io-tree.h
This needs to be cleaned up in the future, but for now it belongs to the
extent-io-tree stuff since it uses the internal tree search code.
Needed to export get_state_failrec and set_state_failrec as well since
we're not going to move the actual IO part of the failrec stuff out at
this point.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:47 +01:00
Josef Bacik
9c7d3a5483 btrfs: move extent_io_tree defs to their own header
extent_io.c/h are huge, encompassing a bunch of different things.  The
extent_io_tree code can live on its own, so separate this out.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:47 +01:00
Josef Bacik
6f0d04f8e7 btrfs: separate out the extent io init function
We are moving extent_io_tree into it's on file, so separate out the
extent_state init stuff from extent_io_tree_init().

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:47 +01:00
Nikolay Borisov
74e9194afb btrfs: Remove delalloc_end argument from extent_clear_unlock_delalloc
It was added in ba8b04c1d4 ("btrfs: extend btrfs_set_extent_delalloc
and its friends to support in-band dedupe and subpage size patchset") as
a preparatory patch for in-band and subapge block size patchsets.
However neither of those are likely to be merged anytime soon and the
code has diverged significantly from the last public post of either
of those patchsets.

It's unlikely either of the patchests are going to use those preparatory
steps so just remove the variables. Since cow_file_range also took
delalloc_end to pass it to extent_clear_unlock_delalloc remove the
parameter from that function as well.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09 14:58:59 +02:00
Goldwyn Rodrigues
9978059be8 btrfs: Evaluate io_tree in find_lock_delalloc_range()
Simplification.  No point passing the tree variable when it can be
evaluated from inode. The tests now use the io_tree from btrfs_inode as
opposed to creating one.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-04 17:26:17 +02:00
David Sterba
e749af443f btrfs: lift bio_set_dev from bio allocation helpers
The block device is passed around for the only purpose to set it in new
bios. Move the assignment one level up. This is a preparatory patch for
further bdev cleanups.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02 12:30:51 +02:00
David Sterba
00801ae4bb btrfs: switch extent_buffer write_locks from atomic to int
The write_locks is either 0 or 1 and always updated under the lock,
so we don't need the atomic_t semantics.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02 12:30:47 +02:00
David Sterba
f3dc24c52a btrfs: switch extent_buffer spinning_writers from atomic to int
The spinning_writers is either 0 or 1 and always updated under the lock,
so we don't need the atomic_t semantics.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02 12:30:47 +02:00
David Sterba
06297d8cef btrfs: switch extent_buffer blocking_writers from atomic to int
The blocking_writers is either 0 or 1 and always updated under the lock,
so we don't need the atomic_t semantics.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02 12:30:47 +02:00
Nikolay Borisov
50489a5734 btrfs: Remove bio_offset argument from submit_bio_hook
None of the implementers of the submit_bio_hook use the bio_offset
parameter, simply remove it. No functional changes.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:47 +02:00
Nikolay Borisov
c2ccfbc62e btrfs: Remove 'tree' argument from read_extent_buffer_pages
This function always uses the btree inode's io_tree. Stop taking the
tree as a function argument and instead access it internally from
read_extent_buffer_pages. No functional changes.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:46 +02:00
Nikolay Borisov
a56b1c7bc8 btrfs: Change submit_bio_hook to taking an inode directly
The only possible 'private_data' that is passed to this function is
actually an inode. Make that explicit by changing the signature of the
call back. No functional changes.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:46 +02:00
Nikolay Borisov
a9355a0ef3 btrfs: Define submit_bio_hook's type directly
There is no need to use a typedef to define the type of the function and
then use that to define the respective member in extent_io_ops.  Define
struct's member directly. No functional changes.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:46 +02:00
Nikolay Borisov
45bfcfc168 btrfs: Implement find_first_clear_extent_bit
This function is very similar to find_first_extent_bit except that it
locates the first contiguous span of space which does not have bits set.
It's intended use is in the freespace trimming code.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:38 +02:00
Nikolay Borisov
8811133d8a btrfs: Optimize unallocated chunks discard
Currently unallocated chunks are always trimmed. For example
2 consecutive trims on large storage would trim freespace twice
irrespective of whether the space was actually allocated or not between
those trims.

Optimise this behavior by exploiting the newly introduced alloc_state
tree of btrfs_device. A new CHUNK_TRIMMED bit is used to mark
those unallocated chunks which have been trimmed and have not been
allocated afterwards. On chunk allocation the respective underlying devices'
physical space will have its CHUNK_TRIMMED flag cleared. This avoids
submitting discards for space which hasn't been changed since the last
time discard was issued.

This applies to the single mount period of the filesystem as the
information is not stored permanently.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:38 +02:00
Nikolay Borisov
4ca7365606 btrfs: Implement set_extent_bits_nowait
It will be used in a future patch that will require modifying an
extent_io_tree struct under a spinlock.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:36 +02:00
Nikolay Borisov
930b090729 btrfs: Introduce new bits for device allocation tree
Rather than hijacking the existing defines let's just define new bits,
with more descriptive names. Instead of using yet more (currently at 18)
bits for the new flags, use the fact those flags will be specific to
the device allocation tree so define them using existing EXTENT_* flags.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:36 +02:00
Nikolay Borisov
41e7acd38c btrfs: Rename and export clear_btree_io_tree
This function is going to be used to clear out the device extent
allocation information. Give it a more generic name and export it. This
is in preparation to replacing the pending/pinned chunk lists with an
extent tree. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:36 +02:00
David Sterba
20a1fbf97e btrfs: get fs_info from eb in repair_eb_io_failure
We can read fs_info from extent buffer and can drop it from the
parameters. As all callsites are updated, add the btrfs_ prefix as the
function is exported.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:29 +02:00
David Sterba
ed1b4ed79d btrfs: switch extent_buffer::lock_nested to bool
The member is tracking simple status of the lock, we can use bool for
that and make some room for further space reduction in the structure.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:21 +02:00
David Sterba
c79adfc085 btrfs: use assertion helpers for extent buffer write lock counters
Use the helpers where open coded. On non-debug builds, the warnings will
not trigger and extent_buffer::write_locks become unused and can be
moved to the appropriate section, saving a few bytes.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:21 +02:00
David Sterba
5c9c799ab7 btrfs: use assertion helpers for extent buffer read lock counters
Use the helpers where open coded. On non-debug builds, the warnings will
not trigger and extent_buffer::read_locks become unused and can be
moved to the appropriate section, saving a few bytes.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:21 +02:00
David Sterba
afd495a826 btrfs: use assertion helpers for spinning readers
Use the helpers where open coded. On non-debug builds, the warnings will
not trigger and extent_buffer::spining_readers become unused and can be
moved to the appropriate section, saving a few bytes.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:21 +02:00
David Sterba
843ccf9f46 btrfs: use assertion helpers for spinning writers
Use the helpers where open coded. On non-debug builds, the warnings will
not trigger and extent_buffer::spining_writers become unused and can be
moved to the appropriate section, saving a few bytes.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:20 +02:00
Nikolay Borisov
8882679ea5 btrfs: Remove EXTENT_IOBITS
This flag just became synonymous to EXTENT_LOCKED, so just remove it and
used EXTENT_LOCKED directly. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:20 +02:00
Nikolay Borisov
4e586ca3c3 btrfs: Remove EXTENT_WRITEBACK
This flag was introduced in a52d9a8033 ("Btrfs: Extent based page
cache code.") and subsequently it's usage effectively was removed by
1edbb734b4 ("Btrfs: reduce CPU usage in the extent_state tree") and
f2a97a9dbd ("btrfs: remove all unused functions"). Just remove it,
no functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:20 +02:00
Qu Wenruo
43eb5f2975 btrfs: Introduce extent_io_tree::owner to distinguish different io_trees
Btrfs has the following different extent_io_trees used:

- fs_info::free_extents[2]
- btrfs_inode::io_tree - for both normal inodes and the btree inode
- btrfs_inode::io_failure_tree
- btrfs_transaction::dirty_pages
- btrfs_root::dirty_log_pages

If we want to trace changes in those trees, it will be pretty hard to
distinguish them.

Instead of using hard-to-read pointer address, this patch will introduce
a new member extent_io_tree::owner to track the owner.

This modification needs all the callers of extent_io_tree_init() to
accept a new parameter @owner.

This patch provides the basis for later trace events.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:18 +02:00
David Sterba
7b4397386f btrfs: switch extent_io_tree::track_uptodate to bool
This patch is split from the following one "btrfs: Introduce
extent_io_tree::owner to distinguish different io_trees" from Qu, so the
different changes are not mixed together.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:18 +02:00
Qu Wenruo
c258d6e364 btrfs: Introduce fs_info to extent_io_tree
This patch will add a new member fs_info to extent_io_tree.

This provides the basis for later trace events to distinguish the output
between different btrfs filesystems. While this increases the size of
the structure, we want to know the source of the trace events and
passing the fs_info as an argument to all contexts is not possible.

The selftests are now allowed to set it to NULL as they don't use the
tracepoints.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:18 +02:00
Nikolay Borisov
ba8f5206a4 btrfs: Remove EXTENT_FIRST_DELALLOC bit
With the refactoring introduced in 8b62f87bad ("Btrfs: reworki
outstanding_extents") this flag became unused. Remove it and renumber
the following flags accordingly. No functional changes.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25 14:13:36 +01:00
Andrea Gelmini
52042d8e82 btrfs: Fix typos in comments and strings
The typos accumulate over time so once in a while time they get fixed in
a large patch.

Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:50 +01:00
Lu Fengqi
3522e90301 btrfs: remove always true if branch in find_delalloc_range
The @found is always false when it comes to the if branch. Besides, the
bool type is more suitable for @found. Change the return value of the
function and its caller to bool as well.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:44 +01:00
David Sterba
80cb38362d btrfs: switch EXTENT_BUFFER_* to enums
We can use simple enum for values that are not part of on-disk format:
extent buffer flags;

Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:43 +01:00
Nikolay Borisov
78e62c02ab btrfs: Remove extent_io_ops::readpage_io_failed_hook
For data inodes this hook does nothing but to return -EAGAIN which is
used to signal to the endio routines that this bio belongs to a data
inode. If this is the case the actual retrying is handled by
bio_readpage_error. Alternatively, if this bio belongs to the btree
inode then btree_io_failed_hook just does some cleanup and doesn't retry
anything.

This patch simplifies the code flow by eliminating
readpage_io_failed_hook and instead open-coding btree_io_failed_hook in
end_bio_extent_readpage. Also eliminate some needless checks since IO is
always performed on either data inode or btree inode, both of which are
guaranteed to have their extent_io_tree::ops set.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:41 +01:00
Filipe Manana
0e6ec385b5 Btrfs: allow clear_extent_dirty() to receive a cached extent state record
We can have a lot freed extents during the life span of transaction, so
the red black tree that keeps track of the ranges of each freed extent
(fs_info->freed_extents[]) can get quite big. When finishing a
transaction commit we find each range, process it (discard the extents,
unpin them) and then remove it from the red black tree.

We can use an extent state record as a cache when searching for a range,
so that when we clean the range we can use the cached extent state we
passed to the search function instead of iterating the red black tree
again. Doing things as fast as possible when finishing a transaction (in
state TRANS_STATE_UNBLOCKED) is convenient as it reduces the time we
block another task that wants to commit the next transaction.

So change clear_extent_dirty() to allow an optional extent state record to
be passed as an argument, which will be passed down to __clear_extent_bit.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:38 +01:00
Johannes Thumshirn
ce9f967f31 btrfs: use EXPORT_FOR_TESTS for conditionally exported functions
Several functions in BTRFS are only used inside the source file they are
declared if CONFIG_BTRFS_FS_RUN_SANITY_TESTS is not defined. However if
CONFIG_BTRFS_FS_RUN_SANITY_TESTS is defined these functions are shared
with the unit tests code.

Before the introduction of the EXPORT_FOR_TESTS macro, these functions
could not be declared as static and the compiler had a harder task when
optimizing and inlining them.

As we have EXPORT_FOR_TESTS now, use it where appropriate to support the
compiler.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:37 +01:00
Nikolay Borisov
917aacecc5 btrfs: Sink find_lock_delalloc_range's 'max_bytes' argument
All callers of this function pass BTRFS_MAX_EXTENT_SIZE (128M) so let's
reduce the argument count and make that a local variable. No functional
changes.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:36 +01:00
Nikolay Borisov
abbb55f4cd btrfs: Remove extent_io_ops::split_extent_hook callback
This is the counterpart to merge_extent_hook, similarly, it's used only
for data/freespace inodes so let's remove it, rename it and call it
directly where necessary. No functional changes.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:29 +01:00
Nikolay Borisov
5c848198aa btrfs: Remove extent_io_ops::merge_extent_hook callback
This callback is used only for data and free space inodes. Such inodes
are guaranteed to have their extent_io_tree::private_data set to the
inode struct. Exploit this fact to directly call the function. Also give
it a more descriptive name. No functional changes.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:28 +01:00
Nikolay Borisov
a36bb5f9a9 btrfs: Remove extent_io_ops::clear_bit_hook callback
This is the counterpart to ex-set_bit_hook (now btrfs_set_delalloc_extent),
similar to what was done before remove clear_bit_hook and rename the
function. No functional changes.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:28 +01:00
Nikolay Borisov
e06a1fc99c btrfs: Remove extent_io_ops::set_bit_hook extent_io callback
This callback is used to properly account delalloc extents for data
inodes (ordinary file inodes and freespace v1 inodes). Those can be
easily identified since they have their extent_io trees ->private_data
member point to the inode. Let's exploit this fact to remove the
needless indirection through extent_io_hooks and directly call the
function. Also give the function a name which reflects its purpose -
btrfs_set_delalloc_extent.

This patch also modified test_find_delalloc so that the extent_io_tree
used for testing doesn't have its ->private_data set which would have
caused a crash in btrfs_set_delalloc_extent due to the btrfs_inode->root
member not being initialised. The old version of the code also didn't
call set_bit_hook since the extent_io ops weren't set for the inode.  No
functional changes.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:28 +01:00
Nikolay Borisov
65a680f6b7 btrfs: Remove extent_io_ops::check_extent_io_range callback
This callback was only used in debug builds by btrfs_leak_debug_check.
A better approach is to move its implementation in
btrfs_leak_debug_check and ensure the latter is only executed for extent
tree which have ->private_data set i.e. relate to a data node and not
the btree one. No functional changes.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:28 +01:00
Nikolay Borisov
7087a9d8db btrfs: Remove extent_io_ops::writepage_end_io_hook
This callback is ony ever called for data page writeout so there is no
need to actually abstract it via extent_io_ops. Lets just export it,
remove the definition of the callback and call it directly in the
functions that invoke the callback. Also rename the function to
btrfs_writepage_endio_finish_ordered since what it really does is
account finished io in the ordered extent data structures.  No
functional changes.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:28 +01:00
Nikolay Borisov
d75855b451 btrfs: Remove extent_io_ops::writepage_start_hook
This hook is called only from __extent_writepage_io which is already
called only from the data page writeout path. So there is no need to
make an indirect call via extent_io_ops. This patch just removes the
callback definition, exports the callback function and calls it directly
at the only call site. Also give the function a more descriptive name.
No functional changes.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:27 +01:00
Nikolay Borisov
5eaad97af8 btrfs: Remove extent_io_ops::fill_delalloc
This callback is called only from writepage_delalloc which in turn is
guaranteed to be called from the data page writeout path. In the end
there is no reason to have the call to this function to be indrected via
the extent_io_ops structure. This patch removes the callback definition,
exports the function and calls it directly. No functional changes.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ rename to btrfs_run_delalloc_range ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:27 +01:00
David Sterba
9c36396c2a btrfs: tests: add separate stub for find_lock_delalloc_range
The helper find_lock_delalloc_range is now conditionally built static,
dpending on whether the self-tests are enabled or not. There's a macro
that is supposed to hide the export, used only once. To discourage
further use, drop it an add a public wrapper for the helper needed by
tests.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:34 +02:00
Liu Bo
abb57ef3ff Btrfs: skip set_page_dirty if eb pages are already dirty
As long as @eb is marked with EXTENT_BUFFER_DIRTY, all of its pages
are dirty, so no need to set pages dirty again.

Ftrace showed that the loop took 10us on my dev box, so removing this
can save us at least 10us if eb is already dirty and otherwise avoid a
potentially expensive calls to set_page_dirty.

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:31 +02:00
David Sterba
5cdc84bfde btrfs: drop extent_io_ops::set_range_writeback callback
The data and metadata callback implementation both use the same
function. We can remove the call indirection and intermediate helper
completely.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:56 +02:00
David Sterba
00032d38ea btrfs: drop extent_io_ops::merge_bio_hook callback
The data and metadata callback implementation both use the same
function. We can remove the call indirection completely.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:56 +02:00
David Sterba
05912a3c04 btrfs: drop extent_io_ops::tree_fs_info callback
All implementations of the callback are trivial and do the same and
there's only one user. Merge everything together.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:55 +02:00
David Sterba
e288c080dd btrfs: unify end_io callbacks of async_submit_bio
The end_io callbacks passed to btrfs_wq_submit_bio
(btrfs_submit_bio_done and btree_submit_bio_done) are effectively the
same code, there's no point to do the indirection. Export
btrfs_submit_bio_done and call it directly.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:55 +02:00
Nikolay Borisov
b0132a3be5 btrfs: Rename EXTENT_BUFFER_DUMMY to EXTENT_BUFFER_UNMAPPED
EXTENT_BUFFER_DUMMY is an awful name for this flag. Buffers which have
this flag set are not in any way dummy. Rather, they are private in the
sense that are not mapped and linked to the global buffer tree. This
flag has subtle implications to the way free_extent_buffer works for
example, as well as controls whether page->mapping->private_lock is held
during extent_buffer release. Pages for an unmapped buffer cannot be
under io, nor can they be written by a 3rd party so taking the lock is
unnecessary.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ EXTENT_BUFFER_UNMAPPED, update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:55 +02:00
David Sterba
cc5e31a477 btrfs: switch types to int when counting eb pages
The loops iterating eb pages use unsigned long, that's an overkill as
we know that there are at most 16 pages (64k / 4k), and 4 by default
(with nodesize 16k).

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:44 +02:00
David Sterba
8791d43207 btrfs: use round_up wrapper in num_extent_pages
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:44 +02:00
David Sterba
65ad010488 btrfs: pass only eb to num_extent_pages
Almost all callers pass the start and len as 2 arguments but this is not
necessary, all the information is provided by the eb. By reordering the
calls to num_extent_pages, we don't need the local variables with
start/len.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:43 +02:00
Nikolay Borisov
8ae225a8a4 btrfs: Remove tree argument from extent_writepages
It can be directly referenced from the passed address_space so do that.
No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:07:20 +02:00
Nikolay Borisov
2a3ff0adc9 btrfs: Remove redundant tree argument from extent_readpages
This function is called only from btrfs_readpage and is already passed
the mapping. Simplify its signature by moving the code obtaining
reference to the extent tree in the function. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:07:19 +02:00
Nikolay Borisov
477a30ba5f btrfs: Sink extent_tree arguments in try_release_extent_mapping
This function already gets the page from which the two extent trees
are referenced. Simplify its signature by moving the code getting the
trees inside the function. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:07:19 +02:00
Howard McLauchlan
3b079a919a btrfs: remove unused le_test_bit()
With commit b18253ec57c0 ("btrfs: optimize free space tree bitmap
conversion"), there are no more callers to le_test_bit(). This patch
removes le_test_bit().

Signed-off-by: Howard McLauchlan <hmclauchlan@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:07:18 +02:00
Howard McLauchlan
6faa8f475e btrfs: clean up le_bitmap_{set, clear}()
le_bitmap_set() is only used by free-space-tree, so move it there and
make it static. le_bitmap_clear() is not used, so remove it.

Signed-off-by: Howard McLauchlan <hmclauchlan@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:07:18 +02:00
David Sterba
9888c3402c btrfs: replace GPL boilerplate by SPDX -- headers
Remove GPL boilerplate text (long, short, one-line) and keep the rest,
ie. personal, company or original source copyright statements. Add the
SPDX header.

Unify the include protection macros to match the file names.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-04-12 16:29:46 +02:00
David Sterba
6c55343587 btrfs: remove unused parameters from extent_submit_bio_done_t
Remove parameters not used by any of the callbacks.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-31 01:26:55 +02:00
David Sterba
d0779291b1 btrfs: remove unused parameters from extent_submit_bio_start_t
Remove parameters not used by any of the callbacks.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-31 01:26:55 +02:00
David Sterba
a758781d4b btrfs: separate types for submit_bio_start and submit_bio_done
The callbacks make use of different parameters that are passed to the
other type unnecessarily. This patch adds separate types for each and
the unused parameters will be removed.

The type extent_submit_bio_hook_t keeps all parameters and can be used
where the start/done types are not appropriate.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-31 01:26:55 +02:00
David Sterba
ab0d093616 btrfs: drop extern from function declarations
Extern for functions does not make any difference, there are only a few
so let's remove them before it's too late.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:41 +02:00
David Sterba
e67c718b5b btrfs: add more __cold annotations
The __cold functions are placed to a special section, as they're
expected to be called rarely. This could help i-cache prefetches or help
compiler to decide which branches are more/less likely to be taken
without any other annotations needed.

Though we can't add more __exit annotations, it's still possible to add
__cold (that's also added with __exit). That way the following function
categories are tagged:

- printf wrappers, error messages
- exit helpers

Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:39 +02:00
Anand Jain
ba020491c8 btrfs: extent_buffer_uptodate() make it static and inline
extent_buffer_uptodate() is a trivial wrapper around test_bit() and
nothing else. So make it static and inline, save on code space and call
indirection.

Before:
   text	   data	    bss	    dec	    hex	filename
1131257	  82898	  18992	1233147	 12d0fb	fs/btrfs/btrfs.ko

After:
   text	   data	    bss	    dec	    hex	filename
1131090	  82898	  18992	1232980	 12d054	fs/btrfs/btrfs.ko

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:33 +02:00
Linus Torvalds
31466f3ed7 for-4.16-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlpvikQACgkQxWXV+ddt
 WDs6qA//ZE7eEH0sKpD4Z+3gUevk/MMXwE9prRijEdjXz/K/UXtvpq0sI7HMQskZ
 Ls9Wmzof+3WEQoa08RQZFzwuclW1Udm09SqE2oHP2gXQB5rC0BtWdrlMaKUJy03y
 NUwxHetbE6TsFLU5HIVmi05NexNx9SVV6oJTWt00RlXTePw9Aoc88ikoXXUE2vqH
 wbH9/ccmM9EkDFxdG+YG5QX054kQV8/5RXdqBJnIiGVRX5ZsAY84AN9x9YoRCVUw
 wq9TfPu6XmeA6Uq6knpeLlXDms5w+FE3n5CduROk7Q7YNgpoZekF20c8uK8HzT4T
 KF8hc0QpQgRCVBJ8I4MbPSMRIDf3IWfZmWSDEDda/6/ep6Bl99b8PFvdDKDBMUct
 8wsgGrwGbHuz2l2QUIXjpBL9Cv9Tbu8vjmg0h2hFrpiH1c8JaXjKtJXAMtigWsZ1
 DdX+5Y0zqvV/YLpzKF4aMDWXIteN4qaznvjdmj3B7BxgcnITOV/cmPCyMplNrtUa
 Cs2fzGV5tpxhBzxE490v+frMULmLq2W1e6WPfFCqPKBCqulcR75TozDQS9M2/h4k
 uAZzVKoguHrUPP1ONVas9aVC05K473nbYPd28eecYwkZ4z32hK4SAdnQY1aPreTe
 axoV7p7a+i1bkzT5LK6gIfpddVWth8w45nz4P0lwxp0Z6XlkbJE=
 =Irul
 -----END PGP SIGNATURE-----

Merge tag 'for-4.16-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs updates from David Sterba:
 "Features or user visible changes:

   - fallocate: implement zero range mode

   - avoid losing data raid profile when deleting a device

   - tree item checker: more checks for directory items and xattrs

  Notable fixes:

   - raid56 recovery: don't use cached stripes, that could be
     potentially changed and a later RMW or recovery would lead to
     corruptions or failures

   - let raid56 try harder to rebuild damaged data, reading from all
     stripes if necessary

   - fix scrub to repair raid56 in a similar way as in the case above

  Other:

   - cleanups: device freeing, removed some call indirections, redundant
     bio_put/_get, unused parameters, refactorings and renames

   - RCU list traversal fixups

   - simplify mount callchain, remove recursing back when mounting a
     subvolume

   - plug for fsync, may improve bio merging on multiple devices

   - compression heurisic: replace heap sort with radix sort, gains some
     performance

   - add extent map selftests, buffered write vs dio"

* tag 'for-4.16-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (155 commits)
  btrfs: drop devid as device_list_add() arg
  btrfs: get device pointer from device_list_add()
  btrfs: set the total_devices in device_list_add()
  btrfs: move pr_info into device_list_add
  btrfs: make btrfs_free_stale_devices() to match the path
  btrfs: rename btrfs_free_stale_devices() arg to skip_dev
  btrfs: make btrfs_free_stale_devices() argument optional
  btrfs: make btrfs_free_stale_device() to iterate all stales
  btrfs: no need to check for btrfs_fs_devices::seeding
  btrfs: Use IS_ALIGNED in btrfs_truncate_block instead of opencoding it
  Btrfs: noinline merge_extent_mapping
  Btrfs: add WARN_ONCE to detect unexpected error from merge_extent_mapping
  Btrfs: extent map selftest: dio write vs dio read
  Btrfs: extent map selftest: buffered write vs dio read
  Btrfs: add extent map selftests
  Btrfs: move extent map specific code to extent_map.c
  Btrfs: add helper for em merge logic
  Btrfs: fix unexpected EEXIST from btrfs_get_extent
  Btrfs: fix incorrect block_len in merge_extent_mapping
  btrfs: Remove unused readahead spinlock
  ...
2018-01-29 14:04:23 -08:00
David Sterba
e43bbe5e16 btrfs: sink unlock_extent parameter gfp_flags
All callers pass either GFP_NOFS or GFP_KERNEL now, so we can sink the
parameter to the function, though we lose some of the slightly better
semantics of GFP_KERNEL in some places, it's worth cleaning up the
callchains.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:19 +01:00
David Sterba
d810a4be1a btrfs: add separate helper for unlock_extent_cached with GFP_ATOMIC
There's only one instance where we pass different gfp mask to
unlock_extent_cached. Add a separate helper for that and then we can
drop the gfp parameter from unlock_extent_cached.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:19 +01:00
Nikolay Borisov
0a9b0e5351 btrfs: sink extent_write_full_page tree argument
The tree argument passed to extent_write_full_page is referenced from
the page being passed to the same function. Since we already have
enough information to get the reference, remove the function parameter.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:16 +01:00
Nikolay Borisov
5e3ee23648 btrfs: sink extent_write_locked_range tree parameter
This function is called only from submit_compressed_extents and the
io tree being passed is always that of the inode. But we are also
passing the inode, so just move getting the io tree pointer in
extent_write_locked_range to simplify the signature.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:16 +01:00
David Sterba
6af49dbde9 btrfs: sink get_extent parameter to read_extent_buffer_pages
All callers pass btree_get_extent, which needs to be exported.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:13 +01:00
David Sterba
0932584b66 btrfs: sink get_extent parameter to extent_readpages
There's only one caller that passes btrfs_get_extent.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:13 +01:00
David Sterba
2135fb9bb4 btrfs: sink get_extent parameter to extent_fiemap
All callers pass btrfs_get_extent_fiemap and we don't expect anything
else in the context of extent_fiemap.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:13 +01:00
David Sterba
deac642d7e btrfs: sink get_extent parameter to extent_write_full_page
There's only one caller.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:13 +01:00
David Sterba
916b929831 btrfs: sink get_extent parameter to extent_write_locked_range
There's only one caller.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:12 +01:00
David Sterba
433175992c btrfs: sink get_extent parameter to extent_writepages
There's only one caller.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:12 +01:00
David Sterba
f08dc36f78 btrfs: sink gfp parameter to clear_extent_uptodate
There's only one callsite with GFP_NOFS.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:12 +01:00
David Sterba
ae0f162534 btrfs: sink gfp parameter to clear_extent_bit
All callers use GFP_NOFS, we don't have to pass it as an argument. The
built-in tests pass GFP_KERNEL, but they run only at module load time
and NOFS works there as well.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:12 +01:00
David Sterba
66b0c887bb btrfs: prepare to drop gfp mask parameter from clear_extent_bit
Use __clear_extent_bit directly in case we want to pass unknown
gfp flags. Otherwise all clear_extent_bit callers use GFP_NOFS, so we
can sink them to the function and reduce argument count, at the cost
that __clear_extent_bit has to be exported.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:12 +01:00
Ming Lei
a0b60d725e btrfs: avoid access to .bi_vcnt directly
BTRFS uses bio->bi_vcnt to figure out page numbers, this approach is no
longer valid once we start enabling multipage bvecs.
correct once we start to enable multipage bvec.

Use bio_nr_pages() to do that instead.

Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: David Sterba <dsterba@suse.com>
Cc: linux-btrfs@vger.kernel.org
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-06 09:18:00 -07:00
Linus Torvalds
26cd94744e for-4.15-rc2-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlofBpkACgkQxWXV+ddt
 WDvtTQ//emI1QsD4N0e4BxMcZ1bcigiEk3jc4gj+biRapnMHHAHOqJbVtpK1v8gS
 PCTw+4uD5UOGvhBtS4kXJn8e2qxWCESWJDXwVlW0RHmWLfwd9z7ly0sBMi3oiIqH
 qief8CIkk3oTexiuuJ3mZGxqnDQjRGtWx2LM+bRJBWMk+jN32v2ObSlv9V505a5M
 1daDBsjWojFWa8d4r3YZNJq1df2om/dwVQZ0Wk59bacIo9Xbvok0X459cOlWuv0p
 mjx8m8uA/z+HVdkTYlzyKpq08O8Z4shj3GrBbSnZ511gKzV+c9jJPxij5pKm3Z2z
 KW4Mp17+/7GSNcSsJiqnOYi+wtOrak2lD0COlZTijnY2jrv18h8ianoIM6CpzUdy
 +b09yuFXbPLoUfyl6vFaO/JHuvAkQdaR2tJbds6lvW+liC1ReoL4W1WcUjY6nv9f
 6wTaIv0vwgrHaxeIzxKNpnsTlpHAgorFFk0/w8nLb40WX8AoJ/95lo2zws8oaFDN
 0Fylu3NYhoDrJZK+D8dbsWx2eTsFVCqep4w0+iEVZl3lfuy3FZl1pu8CL7ru9vJl
 DNieh+lUvK1Fk+SYIoilGoriW96RbU8+jPo2W4A1ENzeMJfrNCSWtUSZZp4XT4tO
 8m1PGud07XBLSxd62bAEDV3KZO2DnY1WxgXbKuIHSi9D5CI1LMo=
 =7UW+
 -----END PGP SIGNATURE-----

Merge tag 'for-4.15-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "We've collected some fixes in since the pre-merge window freeze.

  There's technically only one regression fix for 4.15, but the rest
  seems important and candidates for stable.

   - fix missing flush bio puts in error cases (is serious, but rarely
     happens)

   - fix reporting stat::st_blocks for buffered append writes

   - fix space cache invalidation

   - fix out of bound memory access when setting zlib level

   - fix potential memory corruption when fsync fails in the middle

   - fix crash in integrity checker

   - incremetnal send fix, path mixup for certain unlink/rename
     combination

   - pass flags to writeback so compressed writes can be throttled
     properly

   - error handling fixes"

* tag 'for-4.15-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  Btrfs: incremental send, fix wrong unlink path after renaming file
  btrfs: tree-checker: Fix false panic for sanity test
  Btrfs: fix list_add corruption and soft lockups in fsync
  btrfs: Fix wild memory access in compression level parser
  btrfs: fix deadlock when writing out space cache
  btrfs: clear space cache inode generation always
  Btrfs: fix reported number of inode blocks after buffered append writes
  Btrfs: move definition of the function btrfs_find_new_delalloc_bytes
  Btrfs: bail out gracefully rather than BUG_ON
  btrfs: dev_alloc_list is not protected by RCU, use normal list_del
  btrfs: add missing device::flush_bio puts
  btrfs: Fix transaction abort during failure in btrfs_rm_dev_item
  Btrfs: add write_flags for compression bio
2017-11-29 14:26:50 -08:00
Filipe Manana
e3b8a48585 Btrfs: fix reported number of inode blocks after buffered append writes
The patch from commit a7e3b975a0 ("Btrfs: fix reported number of inode
blocks") introduced a regression where if we do a buffered write starting
at position equal to or greater than the file's size and then stat(2) the
file before writeback is triggered, the number of used blocks does not
change (unless there's a prealloc/unwritten extent). Example:

  $ xfs_io -f -c "pwrite -S 0xab 0 64K" foobar
  $ du -h foobar
  0	foobar
  $ sync
  $ du -h foobar
  64K	foobar

The first version of that patch didn't had this regression and the second
version, which was the one committed, was made only to address some
performance regression detected by the intel test robots using fs_mark.

This fixes the regression by setting the new delaloc bit in the range, and
doing it at btrfs_dirty_pages() while setting the regular dealloc bit as
well, so that this way we set both bits at once avoiding navigation of the
inode's io tree twice. Doing it at btrfs_dirty_pages() is also the most
meaninful place, as we should set the new dellaloc bit when if we set the
delalloc bit, which happens only if we copied bytes into the pages at
__btrfs_buffered_write().

This was making some of LTP's du tests fail, which can be quickly run
using a command line like the following:

  $ ./runltp -q -p -l /ltp.log -f commands -s du -d /mnt

Fixes: a7e3b975a0 ("Btrfs: fix reported number of inode blocks")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-15 17:27:46 +01:00
Liu Bo
f82b735936 Btrfs: add write_flags for compression bio
Compression code path has only flaged bios with REQ_OP_WRITE no matter
where the bios come from, but it could be a sync write if fsync starts
this writeback or a normal writeback write if wb kthread starts a
periodic writeback.

It breaks the rule that sync writes and writeback writes need to be
differentiated from each other, because from the POV of block layer,
all bios need to be recognized by these flags in order to do some
management, e.g. throttlling.

This passes writeback_control to compression write path so that it can
send bios with proper flags to block layer.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-15 14:44:31 +01:00
Linus Torvalds
5cea7647e6 Merge branch 'for-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
 "There are some new user features and the usual load of invisible
  enhancements or cleanups.

  New features:

   - extend mount options to specify zlib compression level, -o
     compress=zlib:9

   - v2 of ioctl "extent to inode mapping", addressing a usecase where
     we want to retrieve more but inaccurate results and do the
     postprocessing in userspace, aiding defragmentation or
     deduplication tools

   - populate compression heuristics logic, do data sampling and try to
     guess compressibility by: looking for repeated patterns, counting
     unique byte values and distribution, calculating Shannon entropy;
     this will need more benchmarking and possibly fine tuning, but the
     base should be good enough

   - enable indexing for btrfs as lower filesystem in overlayfs

   - speedup page cache readahead during send on large files

  Internal enhancements:

   - more sanity checks of b-tree items when reading them from disk

   - more EINVAL/EUCLEAN fixups, missing BLK_STS_* conversion, other
     errno or error handling fixes

   - remove some homegrown IO-related logic, that's been obsoleted by
     core block layer changes (batching, plug/unplug, own counters)

   - add ref-verify, optional debugging feature to verify extent
     reference accounting

   - simplify code handling outstanding extents, make it more clear
     where and how the accounting is done

   - make delalloc reservations per-inode, simplify the code and make
     the logic more straightforward

   - extensive cleanup of delayed refs code

  Notable fixes:

   - fix send ioctl on 32bit with 64bit kernel"

* 'for-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (102 commits)
  btrfs: Fix bug for misused dev_t when lookup in dev state hash table.
  Btrfs: heuristic: add Shannon entropy calculation
  Btrfs: heuristic: add byte core set calculation
  Btrfs: heuristic: add byte set calculation
  Btrfs: heuristic: add detection of repeated data patterns
  Btrfs: heuristic: implement sampling logic
  Btrfs: heuristic: add bucket and sample counters and other defines
  Btrfs: compression: separate heuristic/compression workspaces
  btrfs: move btrfs_truncate_block out of trans handle
  btrfs: don't call btrfs_start_delalloc_roots in flushoncommit
  btrfs: track refs in a rb_tree instead of a list
  btrfs: add a comp_refs() helper
  btrfs: switch args for comp_*_refs
  btrfs: make the delalloc block rsv per inode
  btrfs: add tracepoints for outstanding extents mods
  Btrfs: rework outstanding_extents
  btrfs: increase output size for LOGICAL_INO_V2 ioctl
  btrfs: add a flags argument to LOGICAL_INO and call it LOGICAL_INO_V2
  btrfs: add a flag to iterate_inodes_from_logical to find all extent refs for uncompressed extents
  btrfs: send: remove unused code
  ...
2017-11-14 13:35:29 -08:00
Greg Kroah-Hartman
b24413180f License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier.  The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
 - file had no licensing information it it.
 - file was a */uapi/* one with no licensing information in it,
 - file was a */uapi/* one with existing licensing information,

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne.  Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed.  Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging was:
 - Files considered eligible had to be source code files.
 - Make and config files were included as candidates if they contained >5
   lines of source
 - File already had some variant of a license header in it (even if <5
   lines).

All documentation files were explicitly excluded.

The following heuristics were used to determine which SPDX license
identifiers to apply.

 - when both scanners couldn't find any license traces, file was
   considered to have no license information in it, and the top level
   COPYING file license applied.

   For non */uapi/* files that summary was:

   SPDX license identifier                            # files
   ---------------------------------------------------|-------
   GPL-2.0                                              11139

   and resulted in the first patch in this series.

   If that file was a */uapi/* path one, it was "GPL-2.0 WITH
   Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:

   SPDX license identifier                            # files
   ---------------------------------------------------|-------
   GPL-2.0 WITH Linux-syscall-note                        930

   and resulted in the second patch in this series.

 - if a file had some form of licensing information in it, and was one
   of the */uapi/* ones, it was denoted with the Linux-syscall-note if
   any GPL family license was found in the file or had no licensing in
   it (per prior point).  Results summary:

   SPDX license identifier                            # files
   ---------------------------------------------------|------
   GPL-2.0 WITH Linux-syscall-note                       270
   GPL-2.0+ WITH Linux-syscall-note                      169
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
   LGPL-2.1+ WITH Linux-syscall-note                      15
   GPL-1.0+ WITH Linux-syscall-note                       14
   ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
   LGPL-2.0+ WITH Linux-syscall-note                       4
   LGPL-2.1 WITH Linux-syscall-note                        3
   ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
   ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1

   and that resulted in the third patch in this series.

 - when the two scanners agreed on the detected license(s), that became
   the concluded license(s).

 - when there was disagreement between the two scanners (one detected a
   license but the other didn't, or they both detected different
   licenses) a manual inspection of the file occurred.

 - In most cases a manual inspection of the information in the file
   resulted in a clear resolution of the license that should apply (and
   which scanner probably needed to revisit its heuristics).

 - When it was not immediately clear, the license identifier was
   confirmed with lawyers working with the Linux Foundation.

 - If there was any question as to the appropriate license identifier,
   the file was flagged for further research and to be revisited later
   in time.

In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.

Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights.  The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.

Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.

In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.

Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
 - a full scancode scan run, collecting the matched texts, detected
   license ids and scores
 - reviewing anything where there was a license detected (about 500+
   files) to ensure that the applied SPDX license was correct
 - reviewing anything where there was no detection but the patch license
   was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
   SPDX license was correct

This produced a worksheet with 20 files needing minor correction.  This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.

These .csv files were then reviewed by Greg.  Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected.  This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.)  Finally Greg ran the script using the .csv files to
generate the patches.

Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-02 11:10:55 +01:00
Liu Bo
18fdc67900 Btrfs: remove bio_flags which indicates a meta block of log-tree
Since both committing transaction and writing log-tree are doing
plugging on metadata IO, we can unify to use %sync_writers to benefit
both cases, instead of checking bio_flags while writing meta blocks of
log-tree.

We can remove this bio_flags because in order to write dirty blocks,
log tree also uses btrfs_write_marked_extents(), inside which we
have enabled %sync_writers, therefore, every write goes in a
synchronous way, so does checksuming.

Please also note that, bio_flags is applied per-context while
%sync_writers is applied per-inode, so this might incur some overhead, ie.

1) while log tree is flushing its dirty blocks via
   btrfs_write_marked_extents(), in which %sync_writers is increased
   by one.

2) in the meantime, some writeback operations may happen upon btrfs's
   metadata inode, so these writes go synchronously, too.

However, AFAICS, the overhead is not a big one while the win is that
we unify the two places that needs synchronous way and remove a
special hack/flag.

This removes the bio_flags related stuff for writing log-tree.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:56 +01:00
Jeff Mahoney
1cbb1f454e btrfs: struct-funcs, constify readers
We have reader helpers for most of the on-disk structures that use
an extent_buffer and pointer as offset into the buffer that are
read-only.  We should mark them as const and, in turn, allow consumers
of these interfaces to mark the buffers const as well.

No impact on code, but serves as documentation that a buffer is intended
not to be modified.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 14:19:53 +02:00
Linus Torvalds
bc243704fb Merge branch 'for-4.13-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
 "We've identified and fixed a silent corruption (introduced by code in
  the first pull), a fixup after the blk_status_t merge and two fixes to
  incremental send that Filipe has been hunting for some time"

* 'for-4.13-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  Btrfs: fix unexpected return value of bio_readpage_error
  btrfs: btrfs_create_repair_bio never fails, skip error handling
  btrfs: cloned bios must not be iterated by bio_for_each_segment_all
  Btrfs: fix write corruption due to bio cloning on raid5/6
  Btrfs: incremental send, fix invalid memory access
  Btrfs: incremental send, fix invalid path for link commands
2017-07-14 22:55:52 -07:00
Liu Bo
c3cfb65630 Btrfs: fix unexpected return value of bio_readpage_error
With blk_status_t conversion (that are now present in master),
bio_readpage_error() may return 1 as now ->submit_bio_hook() may not set
%ret if it runs without problems.

This fixes that unexpected return value by changing
btrfs_check_repairable() to return a bool instead of updating %ret, and
patch is applicable to both codebases with and without blk_status_t.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-07-14 20:42:37 +02:00
Linus Torvalds
8c27cb3566 Merge branch 'for-4.13-part1' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
 "The core updates improve error handling (mostly related to bios), with
  the usual incremental work on the GFP_NOFS (mis)use removal,
  refactoring or cleanups. Except the two top patches, all have been in
  for-next for an extensive amount of time.

  User visible changes:

   - statx support

   - quota override tunable

   - improved compression thresholds

   - obsoleted mount option alloc_start

  Core updates:

   - bio-related updates:
       - faster bio cloning
       - no allocation failures
       - preallocated flush bios

   - more kvzalloc use, memalloc_nofs protections, GFP_NOFS updates

   - prep work for btree_inode removal

   - dir-item validation

   - qgoup fixes and updates

   - cleanups:
       - removed unused struct members, unused code, refactoring
       - argument refactoring (fs_info/root, caller -> callee sink)
       - SEARCH_TREE ioctl docs"

* 'for-4.13-part1' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (115 commits)
  btrfs: Remove false alert when fiemap range is smaller than on-disk extent
  btrfs: Don't clear SGID when inheriting ACLs
  btrfs: fix integer overflow in calc_reclaim_items_nr
  btrfs: scrub: fix target device intialization while setting up scrub context
  btrfs: qgroup: Fix qgroup reserved space underflow by only freeing reserved ranges
  btrfs: qgroup: Introduce extent changeset for qgroup reserve functions
  btrfs: qgroup: Fix qgroup reserved space underflow caused by buffered write and quotas being enabled
  btrfs: qgroup: Return actually freed bytes for qgroup release or free data
  btrfs: qgroup: Cleanup btrfs_qgroup_prepare_account_extents function
  btrfs: qgroup: Add quick exit for non-fs extents
  Btrfs: rework delayed ref total_bytes_pinned accounting
  Btrfs: return old and new total ref mods when adding delayed refs
  Btrfs: always account pinned bytes when dropping a tree block ref
  Btrfs: update total_bytes_pinned when pinning down extents
  Btrfs: make BUG_ON() in add_pinned_bytes() an ASSERT()
  Btrfs: make add_pinned_bytes() take an s64 num_bytes instead of u64
  btrfs: fix validation of XATTR_ITEM dir items
  btrfs: Verify dir_item in iterate_object_props
  btrfs: Check name_len before in btrfs_del_root_ref
  btrfs: Check name_len before reading btrfs_get_name
  ...
2017-07-05 16:41:23 -07:00
Qu Wenruo
364ecf3651 btrfs: qgroup: Introduce extent changeset for qgroup reserve functions
Introduce a new parameter, struct extent_changeset for
btrfs_qgroup_reserved_data() and its callers.

Such extent_changeset was used in btrfs_qgroup_reserve_data() to record
which range it reserved in current reserve, so it can free it in error
paths.

The reason we need to export it to callers is, at buffered write error
path, without knowing what exactly which range we reserved in current
allocation, we can free space which is not reserved by us.

This will lead to qgroup reserved space underflow.

Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-29 20:17:02 +02:00
Qu Wenruo
7bc329c183 btrfs: qgroup: Return actually freed bytes for qgroup release or free data
btrfs_qgroup_release/free_data() only returns 0 or a negative error
number (ENOMEM is the only possible error).

This is normally good enough, but sometimes we need the exact byte
count it freed/released.

Change it to return actually released/freed bytenr number instead of 0
for success.
And slightly modify related extent_changeset structure, since in btrfs
one no-hole data extent won't be larger than 128M, so "unsigned int"
is large enough for the use case.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-29 20:17:02 +02:00
David Sterba
c5e4c3d750 btrfs: sink gfp parameter to btrfs_io_bio_alloc
We can hardcode GFP_NOFS to btrfs_io_bio_alloc, although it means we
change it back from GFP_KERNEL in scrub. I'd rather save a few stack
bytes from not passing the gfp flags in the remaining, more imporatant,
contexts and the bio allocating API now looks more consistent.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:04 +02:00
David Sterba
c821e7f3da btrfs: pass bytes to btrfs_bio_alloc
Most callers of btrfs_bio_alloc convert from bytes to sectors. Hide that
in the helper and simplify the logic in the callsers.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:03 +02:00
David Sterba
9f2179a5e7 btrfs: remove redundant parameters from btrfs_bio_alloc
All callers pass gfp_flags=GFP_NOFS and nr_vecs=BIO_MAX_PAGES.

submit_extent_page adds __GFP_HIGH that does not make a difference in
our case as it allows access to memory reserves but otherwise does not
change the constraints.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:03 +02:00
David Sterba
8b6c1d56f2 btrfs: sink gfp parameter to btrfs_bio_clone
All callers pass GFP_NOFS.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:03 +02:00
Liu Bo
e477094f0d Btrfs: hardcode GFP_NOFS for btrfs_bio_clone_partial
We only pass GFP_NOFS to btrfs_bio_clone_partial, so lets hardcode it.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:59 +02:00
Liu Bo
2f8e914042 Btrfs: new helper btrfs_bio_clone_partial
This adds a new helper btrfs_bio_clone_partial, it'll allocate a cloned
bio that only owns a part of the original bio's data.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:58 +02:00
Josef Bacik
7870d0822b Btrfs: don't pass the inode through clean_io_failure
Instead pass around the failure tree and the io tree.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:58 +02:00
Josef Bacik
6ec656bc0f btrfs: remove inode argument from repair_io_failure
Once we remove the btree_inode we won't have an inode to pass anymore,
just pass the fs_info directly and the inum since we use that to print
out the repair message.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:58 +02:00
Josef Bacik
c6100a4b4e Btrfs: replace tree->mapping with tree->private_data
For extent_io tree's we have carried the address_mapping of the inode
around in the io tree in order to pull the inode back out for calling
into various tree ops hooks.  This works fine when everything that has
an extent_io_tree has an inode.  But we are going to remove the
btree_inode, so we need to change this.  Instead just have a generic
void * for private data that we can initialize with, and have all the
tree ops use that instead.  This had a lot of cascading changes but
should be relatively straightforward.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor reordering of the callback prototypes ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:58 +02:00
Christoph Hellwig
4e4cbee93d block: switch bios to blk_status_t
Replace bi_error with a new bi_status to allow for a clear conversion.
Note that device mapper overloaded bi_error with a private value, which
we'll have to keep arround at least for now and thus propagate to a
proper blk_status_t value.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-06-09 09:27:32 -06:00
Chris Mason
bce19f9d23 Merge branch 'for-chris-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux into for-linus-4.12 2017-04-27 14:13:09 -07:00
Filipe Manana
a7e3b975a0 Btrfs: fix reported number of inode blocks
Currently when there are buffered writes that were not yet flushed and
they fall within allocated ranges of the file (that is, not in holes or
beyond eof assuming there are no prealloc extents beyond eof), btrfs
simply reports an incorrect number of used blocks through the stat(2)
system call (or any of its variants), regardless of mount options or
inode flags (compress, compress-force, nodatacow). This is because the
number of blocks used that is reported is based on the current number
of bytes in the vfs inode plus the number of dealloc bytes in the btrfs
inode. The later covers bytes that both fall within allocated regions
of the file and holes.

Example scenarios where the number of reported blocks is wrong while the
buffered writes are not flushed:

  $ mkfs.btrfs -f /dev/sdc
  $ mount /dev/sdc /mnt/sdc

  $ xfs_io -f -c "pwrite -S 0xaa 0 64K" /mnt/sdc/foo1
  wrote 65536/65536 bytes at offset 0
  64 KiB, 16 ops; 0.0000 sec (259.336 MiB/sec and 66390.0415 ops/sec)

  $ sync

  $ xfs_io -c "pwrite -S 0xbb 0 64K" /mnt/sdc/foo1
  wrote 65536/65536 bytes at offset 0
  64 KiB, 16 ops; 0.0000 sec (192.308 MiB/sec and 49230.7692 ops/sec)

  # The following should have reported 64K...
  $ du -h /mnt/sdc/foo1
  128K	/mnt/sdc/foo1

  $ sync

  # After flushing the buffered write, it now reports the correct value.
  $ du -h /mnt/sdc/foo1
  64K	/mnt/sdc/foo1

  $ xfs_io -f -c "falloc -k 0 128K" -c "pwrite -S 0xaa 0 64K" /mnt/sdc/foo2
  wrote 65536/65536 bytes at offset 0
  64 KiB, 16 ops; 0.0000 sec (520.833 MiB/sec and 133333.3333 ops/sec)

  $ sync

  $ xfs_io -c "pwrite -S 0xbb 64K 64K" /mnt/sdc/foo2
  wrote 65536/65536 bytes at offset 65536
  64 KiB, 16 ops; 0.0000 sec (260.417 MiB/sec and 66666.6667 ops/sec)

  # The following should have reported 128K...
  $ du -h /mnt/sdc/foo2
  192K	/mnt/sdc/foo2

  $ sync

  # After flushing the buffered write, it now reports the correct value.
  $ du -h /mnt/sdc/foo2
  128K	/mnt/sdc/foo2

So the number of used file blocks is simply incorrect, unlike in other
filesystems such as ext4 and xfs for example, but only while the buffered
writes are not flushed.

Fix this by tracking the number of delalloc bytes that fall within holes
and beyond eof of a file, and use instead this new counter when reporting
the number of used blocks for an inode.

Another different problem that exists is that the delalloc bytes counter
is reset when writeback starts (by clearing the EXTENT_DEALLOC flag from
the respective range in the inode's iotree) and the vfs inode's bytes
counter is only incremented when writeback finishes (through
insert_reserved_file_extent()). Therefore while writeback is ongoing we
simply report a wrong number of blocks used by an inode if the write
operation covers a range previously unallocated. While this change does
not fix this problem, it does minimizes it a lot by shortening that time
window, as the new dealloc bytes counter (new_delalloc_bytes) is only
decremented when writeback finishes right before updating the vfs inode's
bytes counter. Fully fixing this second problem is not trivial and will
be addressed later by a different patch.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2017-04-26 16:27:26 +01:00
Filipe Manana
a315e68f6e Btrfs: fix invalid attempt to free reserved space on failure to cow range
When attempting to COW a file range (we are starting writeback and doing
COW), if we manage to reserve an extent for the range we will write into
but fail after reserving it and before creating the respective ordered
extent, we end up in an error path where we attempt to decrement the
data space's bytes_may_use counter after we already did it while
reserving the extent, leading to a warning/trace like the following:

[  847.621524] ------------[ cut here ]------------
[  847.625441] WARNING: CPU: 5 PID: 4905 at fs/btrfs/extent-tree.c:4316 btrfs_free_reserved_data_space_noquota+0x60/0x9f [btrfs]
[  847.633704] Modules linked in: btrfs crc32c_generic xor raid6_pq acpi_cpufreq i2c_piix4 ppdev psmouse tpm_tis serio_raw pcspkr parport_pc tpm_tis_core i2c_core sg
[  847.644616] CPU: 5 PID: 4905 Comm: xfs_io Not tainted 4.10.0-rc8-btrfs-next-37+ #2
[  847.648601] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
[  847.648601] Call Trace:
[  847.648601]  dump_stack+0x67/0x90
[  847.648601]  __warn+0xc2/0xdd
[  847.648601]  warn_slowpath_null+0x1d/0x1f
[  847.648601]  btrfs_free_reserved_data_space_noquota+0x60/0x9f [btrfs]
[  847.648601]  btrfs_clear_bit_hook+0x140/0x258 [btrfs]
[  847.648601]  clear_state_bit+0x87/0x128 [btrfs]
[  847.648601]  __clear_extent_bit+0x222/0x2b7 [btrfs]
[  847.648601]  clear_extent_bit+0x17/0x19 [btrfs]
[  847.648601]  extent_clear_unlock_delalloc+0x3b/0x6b [btrfs]
[  847.648601]  cow_file_range.isra.39+0x387/0x39a [btrfs]
[  847.648601]  run_delalloc_nocow+0x4d7/0x70e [btrfs]
[  847.648601]  ? arch_local_irq_save+0x9/0xc
[  847.648601]  run_delalloc_range+0xa7/0x2b5 [btrfs]
[  847.648601]  writepage_delalloc.isra.31+0xb9/0x15c [btrfs]
[  847.648601]  __extent_writepage+0x249/0x2e8 [btrfs]
[  847.648601]  extent_write_cache_pages.constprop.33+0x28b/0x36c [btrfs]
[  847.648601]  ? arch_local_irq_save+0x9/0xc
[  847.648601]  ? mark_lock+0x24/0x201
[  847.648601]  extent_writepages+0x4b/0x5c [btrfs]
[  847.648601]  ? btrfs_writepage_start_hook+0xed/0xed [btrfs]
[  847.648601]  btrfs_writepages+0x28/0x2a [btrfs]
[  847.648601]  do_writepages+0x23/0x2c
[  847.648601]  __filemap_fdatawrite_range+0x5a/0x61
[  847.648601]  filemap_fdatawrite_range+0x13/0x15
[  847.648601]  btrfs_fdatawrite_range+0x20/0x46 [btrfs]
[  847.648601]  start_ordered_ops+0x19/0x23 [btrfs]
[  847.648601]  btrfs_sync_file+0x136/0x42c [btrfs]
[  847.648601]  vfs_fsync_range+0x8c/0x9e
[  847.648601]  vfs_fsync+0x1c/0x1e
[  847.648601]  do_fsync+0x31/0x4a
[  847.648601]  SyS_fsync+0x10/0x14
[  847.648601]  entry_SYSCALL_64_fastpath+0x18/0xad
[  847.648601] RIP: 0033:0x7f5b05200800
[  847.648601] RSP: 002b:00007ffe204f71c8 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
[  847.648601] RAX: ffffffffffffffda RBX: ffffffff8109637b RCX: 00007f5b05200800
[  847.648601] RDX: 00000000008bd0a0 RSI: 00000000008bd2e0 RDI: 0000000000000003
[  847.648601] RBP: ffffc90001d67f98 R08: 000000000000ffff R09: 000000000000001f
[  847.648601] R10: 00000000000001f6 R11: 0000000000000246 R12: 0000000000000046
[  847.648601] R13: ffffc90001d67f78 R14: 00007f5b054be740 R15: 00007f5b054be740
[  847.648601]  ? trace_hardirqs_off_caller+0x3f/0xaa
[  847.685787] ---[ end trace 2a4a3e15382508e8 ]---

So fix this by not attempting to decrement the data space info's
bytes_may_use counter if we already reserved the extent and an error
happened before creating the ordered extent. We are already correctly
freeing the reserved extent if an error happens, so there's no additional
measure needed.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2017-04-26 16:27:22 +01:00
Elena Reshetova
b7ac31b7b2 btrfs: convert extent_state.refs from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:23 +02:00