Commit Graph

330 Commits

Author SHA1 Message Date
Jan Kara
dc3b17cc8b block: Use pointer to backing_dev_info from request_queue
We will want to have struct backing_dev_info allocated separately from
struct request_queue. As the first step add pointer to backing_dev_info
to request_queue and convert all users touching it. No functional
changes in this patch.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-02 08:20:48 -07:00
Christoph Hellwig
f73f44eb00 block: add a op_is_flush helper
This centralizes the checks for bios that needs to be go into the flush
state machine.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-27 09:01:45 -07:00
Eric Wheeler
b8c0d911ac bcache: partition support: add 16 minors per bcacheN device
Signed-off-by: Eric Wheeler <bcache@linux.ewheeler.net>
Tested-by: Wido den Hollander <wido@widodh.nl>
2016-12-17 13:02:00 -07:00
Kent Overstreet
be628be095 bcache: Make gc wakeup sane, remove set_task_state()
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2016-12-17 13:01:55 -07:00
Ming Lei
4113b88a65 bcache: debug: avoid accessing .bi_io_vec directly
Instead we use standard iterator way to do that.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-22 08:57:55 -07:00
Ming Lei
3a83f46775 block: bio: pass bvec table to bio_init()
Some drivers often use external bvec table, so introduce
this helper for this case. It is always safe to access the
bio->bi_io_vec in this way for this case.

After converting to this usage, it will becomes a bit easier
to evaluate the remaining direct access to bio->bi_io_vec,
so it can help to prepare for the following multipage bvec
support.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>

Fixed up the new O_DIRECT cases.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-22 08:57:21 -07:00
Christoph Hellwig
70fd76140a block,fs: use REQ_* flags directly
Remove the WRITE_* and READ_SYNC wrappers, and just use the flags
directly.  Where applicable this also drops usage of the
bio_set_op_attrs wrapper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-01 09:43:26 -06:00
Christoph Hellwig
83b5df67c5 bcache: use op_is_sync to check for synchronous requests
(and remove one layer of masking for the op_is_write call next to it).

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-01 09:43:26 -06:00
Guoqing Jiang
491221f88d block: export bio_free_pages to other modules
bio_free_pages is introduced in commit 1dfa0f68c0
("block: add a helper to free bio bounce buffer pages"),
we can reuse the func in other modules after it was
imported.

Cc: Christoph Hellwig <hch@infradead.org>
Cc: Jens Axboe <axboe@fb.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Shaohua Li <shli@fb.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Acked-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-09-22 07:48:03 -06:00
Eric Wheeler
90706094d5 bcache: pr_err: more meaningful error message when nr_stripes is invalid
The original error was thought to be corruption, but was actually caused by:
	make-bcache --data-offset N
where N was in bytes and should have been in sectors.  While userspace
tools should be updated to check --data-offset beyond end of volume,
hopefully this will help others that might not have noticed the units.

Signed-off-by: Eric Wheeler <bcache@linux.ewheeler.net>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
2016-08-18 20:31:03 -07:00
Kent Overstreet
acc9cf8c66 bcache: RESERVE_PRIO is too small by one when prio_buckets() is a power of two.
This patch fixes a cachedev registration-time allocation deadlock.
This can deadlock on boot if your initrd auto-registeres bcache devices:

Allocator thread:
[  720.727614] INFO: task bcache_allocato:3833 blocked for more than 120 seconds.
[  720.732361]  [<ffffffff816eeac7>] schedule+0x37/0x90
[  720.732963]  [<ffffffffa05192b8>] bch_bucket_alloc+0x188/0x360 [bcache]
[  720.733538]  [<ffffffff810e6950>] ? prepare_to_wait_event+0xf0/0xf0
[  720.734137]  [<ffffffffa05302bd>] bch_prio_write+0x19d/0x340 [bcache]
[  720.734715]  [<ffffffffa05190bf>] bch_allocator_thread+0x3ff/0x470 [bcache]
[  720.735311]  [<ffffffff816ee41c>] ? __schedule+0x2dc/0x950
[  720.735884]  [<ffffffffa0518cc0>] ? invalidate_buckets+0x980/0x980 [bcache]

Registration thread:
[  720.710403] INFO: task bash:3531 blocked for more than 120 seconds.
[  720.715226]  [<ffffffff816eeac7>] schedule+0x37/0x90
[  720.715805]  [<ffffffffa05235cd>] __bch_btree_map_nodes+0x12d/0x150 [bcache]
[  720.716409]  [<ffffffffa0522d30>] ? bch_btree_insert_check_key+0x1c0/0x1c0 [bcache]
[  720.717008]  [<ffffffffa05236e4>] bch_btree_insert+0xf4/0x170 [bcache]
[  720.717586]  [<ffffffff810e6950>] ? prepare_to_wait_event+0xf0/0xf0
[  720.718191]  [<ffffffffa0527d9a>] bch_journal_replay+0x14a/0x290 [bcache]
[  720.718766]  [<ffffffff810cc90d>] ? ttwu_do_activate.constprop.94+0x5d/0x70
[  720.719369]  [<ffffffff810cf684>] ? try_to_wake_up+0x1d4/0x350
[  720.719968]  [<ffffffffa05317d0>] run_cache_set+0x580/0x8e0 [bcache]
[  720.720553]  [<ffffffffa053302e>] register_bcache+0xe2e/0x13b0 [bcache]
[  720.721153]  [<ffffffff81354cef>] kobj_attr_store+0xf/0x20
[  720.721730]  [<ffffffff812a2dad>] sysfs_kf_write+0x3d/0x50
[  720.722327]  [<ffffffff812a225a>] kernfs_fop_write+0x12a/0x180
[  720.722904]  [<ffffffff81225177>] __vfs_write+0x37/0x110
[  720.723503]  [<ffffffff81228048>] ? __sb_start_write+0x58/0x110
[  720.724100]  [<ffffffff812cedb3>] ? security_file_permission+0x23/0xa0
[  720.724675]  [<ffffffff812258a9>] vfs_write+0xa9/0x1b0
[  720.725275]  [<ffffffff8102479c>] ? do_audit_syscall_entry+0x6c/0x70
[  720.725849]  [<ffffffff81226755>] SyS_write+0x55/0xd0
[  720.726451]  [<ffffffff8106a390>] ? do_page_fault+0x30/0x80
[  720.727045]  [<ffffffff816f2cae>] system_call_fastpath+0x12/0x71

The fifo code in upstream bcache can't use the last element in the buffer,
which was the cause of the bug: if you asked for a power of two size,
it'd give you a fifo that could hold one less than what you asked for
rather than allocating a buffer twice as big.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Tested-by: Eric Wheeler <bcache@linux.ewheeler.net>
Cc: stable@vger.kernel.org
2016-08-18 20:29:49 -07:00
Eric Wheeler
d9dc1702b2 bcache: register_bcache(): call blkdev_put() when cache_alloc() fails
register_cache() is supposed to return an error string on error so that
register_bcache() will will blkdev_put and cleanup other user counters,
but it does not set 'char *err' when cache_alloc() fails (eg, due to
memory pressure) and thus register_bcache() performs no cleanup.

register_bcache() <----------\  <- no jump to err_close, no blkdev_put()
   |                         |
   +->register_cache()       |  <- fails to set char *err
         |                   |
         +->cache_alloc() ---/  <- returns error

This patch sets `char *err` for this failure case so that register_cache()
will cause register_bcache() to correctly jump to err_close and do
cleanup.  This was tested under OOM conditions that triggered the bug.

Signed-off-by: Eric Wheeler <bcache@linux.ewheeler.net>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: stable@vger.kernel.org
2016-08-18 20:28:23 -07:00
Jens Axboe
1eff9d322a block: rename bio bi_rw to bi_opf
Since commit 63a4cc2486, bio->bi_rw contains flags in the lower
portion and the op code in the higher portions. This means that
old code that relies on manually setting bi_rw is most likely
going to be broken. Instead of letting that brokeness linger,
rename the member, to force old and out-of-tree code to break
at compile time instead of at runtime.

No intended functional changes in this commit.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-07 14:41:02 -06:00
Christoph Hellwig
ed996a52c8 block: simplify and cleanup bvec pool handling
Instead of a flag and an index just make sure an index of 0 means
no need to free the bvec array.  Also move the constants related
to the bvec pools together and use a consistent naming scheme for
them.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-20 17:37:02 -06:00
Yijing Wang
89b920e003 bcache: Remove redundant block_size assignment
We have assigned sb->block_size before the switch,
so remove the redundant one.

Reviewed-by: Coly Li <colyli@suse.de>
Signed-off-by: Yijing Wang <wangyijing@huawei.com>
Acked-by: Eric Wheeler <bcache@lists.ewheeler.net>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-05 11:34:50 -06:00
Yijing Wang
7abc70d700 bcache: update document info
There is no return in continue_at(), update the documentation.

Signed-off-by: Yijing Wang <wangyijing@huawei.com>
Acked-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-05 11:34:49 -06:00
Yijing Wang
c50d4d5dd3 bcache: Remove redundant parameter for cache_alloc()
Cache_sb is not used in cache_alloc, and we have copied
sb info to cache->sb already, remove it.

Reviewed-by: Coly Li <colyli@suse.de>
Signed-off-by: Yijing Wang <wangyijing@huawei.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-05 11:34:47 -06:00
Bhaktipriya Shridhar
81baf90af2 bcache: Remove deprecated create_workqueue
alloc_workqueue replaces deprecated create_workqueue().

Dedicated workqueues have been used since bcache_wq and moving_gc_wq
are workqueues for writes and are being used on a memory reclaim path.
WQ_MEM_RECLAIM has been set to ensure forward progress under memory
pressure.
Since there are only a fixed number of work items, explicit concurrency
limit is unnecessary here.

Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-11 20:03:04 -06:00
Mike Christie
28a8f0d317 block, drivers, fs: rename REQ_FLUSH to REQ_PREFLUSH
To avoid confusion between REQ_OP_FLUSH, which is handled by
request_fn drivers, and upper layers requesting the block layer
perform a flush sequence along with possibly a WRITE, this patch
renames REQ_FLUSH to REQ_PREFLUSH.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Mike Christie
ad0d9e76a4 bcache: use bio op accessors
Separate the op from the rq_flag_bits and have bcache
set/get the bio using bio_set_op_attrs/bio_op.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Mike Christie
c8d93247f1 bcache: use op_is_write instead of checking for REQ_WRITE
We currently set REQ_WRITE/WRITE for all non READ IOs
like discard, flush, writesame, etc. In the next patches where we
no longer set up the op as a bitmap, we will not be able to
detect a operation direction like writesame by testing if REQ_WRITE is
set.

This has bcache use the op_is_write helper which will do the right
thing.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Mike Christie
4e49ea4a3d block/fs/drivers: remove rw argument from submit_bio
This has callers of submit_bio/submit_bio_wait set the bio->bi_rw
instead of passing it in. This makes that use the same as
generic_make_request and how we set the other bio fields.

Signed-off-by: Mike Christie <mchristi@redhat.com>

Fixed up fs/ext4/crypto.c

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Jiri Kosina
29e6c57cc7 bcache: bch_gc_thread() is not freezable
bch_gc_thread() doesn't mark itself freezable, so calling try_to_freeze()
in its context is just an expensive no-op.

Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-05-24 09:00:45 -06:00
Jiri Kosina
770b8ce400 bcache: bch_allocator_thread() is not freezable
bch_allocator_thread() is calling try_to_freeze(), but that's just an
expensive no-op given the fact that the thread is not marked freezable.

Bucket allocator has to be up and running to the very last stages of the
suspend, as the bcache I/O that's in flight (think of writing an
hibernation image to a swap device served by bcache).

Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-05-24 09:00:43 -06:00
Jiri Kosina
7c87df9c15 bcache: bch_writeback_thread() is not freezable
bch_writeback_thread() is calling try_to_freeze(), but that's just an
expensive no-op given the fact that the thread is not marked freezable.

I/O helper kthreads, exactly such as the bcache writeback thread, actually
shouldn't be freezable, because they are potentially necessary for
finalizing the image write-out.

Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-05-24 09:00:40 -06:00
Jens Axboe
84b4ff9ef2 bcache: switch to using blk_queue_write_cache()
Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-04-12 16:00:39 -06:00
Eric Wheeler
f8b11260a4 bcache: fix cache_set_flush() NULL pointer dereference on OOM
When bch_cache_set_alloc() fails to kzalloc the cache_set, the
asyncronous closure handling tries to dereference a cache_set that
hadn't yet been allocated inside of cache_set_flush() which is called
by __cache_set_unregister() during cleanup.  This appears to happen only
during an OOM condition on bcache_register.

Signed-off-by: Eric Wheeler <bcache@linux.ewheeler.net>
Cc: stable@vger.kernel.org
2016-03-08 09:19:10 -07:00
Eric Wheeler
9b299728ed bcache: cleaned up error handling around register_cache()
Fix null pointer dereference by changing register_cache() to return an int
instead of being void.  This allows it to return -ENOMEM or -ENODEV and
enables upper layers to handle the OOM case without NULL pointer issues.

See this thread:
  http://thread.gmane.org/gmane.linux.kernel.bcache.devel/3521

Fixes this error:
  gargamel:/sys/block/md5/bcache# echo /dev/sdh2 > /sys/fs/bcache/register

  bcache: register_cache() error opening sdh2: cannot allocate memory
  BUG: unable to handle kernel NULL pointer dereference at 00000000000009b8
  IP: [<ffffffffc05a7e8d>] cache_set_flush+0x102/0x15c [bcache]
  PGD 120dff067 PUD 1119a3067 PMD 0
  Oops: 0000 [#1] SMP
  Modules linked in: veth ip6table_filter ip6_tables
  (...)
  CPU: 4 PID: 3371 Comm: kworker/4:3 Not tainted 4.4.2-amd64-i915-volpreempt-20160213bc1 #3
  Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 04/27/2013
  Workqueue: events cache_set_flush [bcache]
  task: ffff88020d5dc280 ti: ffff88020b6f8000 task.ti: ffff88020b6f8000
  RIP: 0010:[<ffffffffc05a7e8d>]  [<ffffffffc05a7e8d>] cache_set_flush+0x102/0x15c [bcache]

Signed-off-by: Eric Wheeler <bcache@linux.ewheeler.net>
Tested-by: Marc MERLIN <marc@merlins.org>
Cc: <stable@vger.kernel.org>
2016-03-08 09:19:08 -07:00
Eric Wheeler
07cc6ef8ed bcache: fix race of writeback thread starting before complete initialization
The bch_writeback_thread might BUG_ON in read_dirty() if
dc->sb==BDEV_STATE_DIRTY and bch_sectors_dirty_init has not yet completed
its related initialization.  This patch downs the dc->writeback_lock until
after initialization is complete, thus preventing bch_writeback_thread
from proceeding prematurely.

See this thread:
  http://thread.gmane.org/gmane.linux.kernel.bcache.devel/3453

Signed-off-by: Eric Wheeler <bcache@linux.ewheeler.net>
Tested-by: Marc MERLIN <marc@merlins.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-03-08 09:17:30 -07:00
Linus Torvalds
641203549a Merge branch 'for-4.5/drivers' of git://git.kernel.dk/linux-block
Pull block driver updates from Jens Axboe:
 "This is the block driver pull request for 4.5, with the exception of
  NVMe, which is in a separate branch and will be posted after this one.

  This pull request contains:

   - A set of bcache stability fixes, which have been acked by Kent.
     These have been used and tested for more than a year by the
     community, so it's about time that they got in.

   - A set of drbd updates from the drbd team (Andreas, Lars, Philipp)
     and Markus Elfring, Oleg Drokin.

   - A set of fixes for xen blkback/front from the usual suspects, (Bob,
     Konrad) as well as community based fixes from Kiri, Julien, and
     Peng.

   - A 2038 time fix for sx8 from Shraddha, with a fix from me.

   - A small mtip32xx cleanup from Zhu Yanjun.

   - A null_blk division fix from Arnd"

* 'for-4.5/drivers' of git://git.kernel.dk/linux-block: (71 commits)
  null_blk: use sector_div instead of do_div
  mtip32xx: restrict variables visible in current code module
  xen/blkfront: Fix crash if backend doesn't follow the right states.
  xen/blkback: Fix two memory leaks.
  xen/blkback: make st_ statistics per ring
  xen/blkfront: Handle non-indirect grant with 64KB pages
  xen-blkfront: Introduce blkif_ring_get_request
  xen-blkback: clear PF_NOFREEZE for xen_blkif_schedule()
  xen/blkback: Free resources if connect_ring failed.
  xen/blocks: Return -EXX instead of -1
  xen/blkback: make pool of persistent grants and free pages per-queue
  xen/blkback: get the number of hardware queues/rings from blkfront
  xen/blkback: pseudo support for multi hardware queues/rings
  xen/blkback: separate ring information out of struct xen_blkif
  xen/blkfront: correct setting for xen_blkif_max_ring_order
  xen/blkfront: make persistent grants pool per-queue
  xen/blkfront: Remove duplicate setting of ->xbdev.
  xen/blkfront: Cleanup of comments, fix unaligned variables, and syntax errors.
  xen/blkfront: negotiate number of queues/rings to be used with backend
  xen/blkfront: split per device io_lock
  ...
2016-01-21 18:19:38 -08:00
Al Viro
93bbf5831d md: more open-coded offset_in_page()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-01-04 10:29:12 -05:00
Kent Overstreet
627ccd20b4 bcache: Change refill_dirty() to always scan entire disk if necessary
Previously, it would only scan the entire disk if it was starting from
the very start of the disk - i.e. if the previous scan got to the end.

This was broken by refill_full_stripes(), which updates last_scanned so
that refill_dirty was never triggering the searched_from_start path.

But if we change refill_dirty() to always scan the entire disk if
necessary, regardless of what last_scanned was, the code gets cleaner
and we fix that bug too.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-12-30 20:23:16 -07:00
Stefan Bader
8d16ce540c bcache: prevent crash on changing writeback_running
Added a safeguard in the shutdown case. At least while not being
attached it is also possible to trigger a kernel bug by writing into
writeback_running. This change  adds the same check before trying to
wake up the thread for that case.

Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-12-30 20:23:14 -07:00
Gabriel de Perthuis
d7076f2162 bcache: allows use of register in udev to avoid "device_busy" error.
Allows to use register, not register_quiet in udev to avoid "device_busy" error.
The initial patch proposed at https://lkml.org/lkml/2013/8/26/549 by Gabriel de Perthuis
<g2p.code@gmail.com> does not unlock the mutex and hangs the kernel.

See http://thread.gmane.org/gmane.linux.kernel.bcache.devel/2594 for the discussion.

Cc: Denis Bychkov <manover@gmail.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Eric Wheeler <bcache@linux.ewheeler.net>
Cc: Gabriel de Perthuis <g2p.code@gmail.com>
Cc: stable@vger.kernel.org

Signed-off-by: Jens Axboe <axboe@fb.com>
2015-12-30 20:23:13 -07:00
Zheng Liu
2ecf0cdb2b bcache: unregister reboot notifier if bcache fails to unregister device
In bcache_init() function it forgot to unregister reboot notifier if
bcache fails to unregister a block device.  This commit fixes this.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Tested-by: Joshua Schmid <jschmid@suse.com>
Tested-by: Eric Wheeler <bcache@linux.ewheeler.net>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-12-30 20:23:11 -07:00
Al Viro
4d4d8573a8 bcache: fix a leak in bch_cached_dev_run()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Tested-by: Joshua Schmid <jschmid@suse.com>
Tested-by: Eric Wheeler <bcache@linux.ewheeler.net>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-12-30 20:23:10 -07:00
Zheng Liu
fecaee6f20 bcache: clear BCACHE_DEV_UNLINK_DONE flag when attaching a backing device
This bug can be reproduced by the following script:

  #!/bin/bash

  bcache_sysfs="/sys/fs/bcache"

  function clear_cache()
  {
  	if [ ! -e $bcache_sysfs ]; then
  		echo "no bcache sysfs"
  		exit
  	fi

  	cset_uuid=$(ls -l $bcache_sysfs|head -n 2|tail -n 1|awk '{print $9}')
  	sudo sh -c "echo $cset_uuid > /sys/block/sdb/sdb1/bcache/detach"
  	sleep 5
  	sudo sh -c "echo $cset_uuid > /sys/block/sdb/sdb1/bcache/attach"
  }

  for ((i=0;i<10;i++)); do
  	clear_cache
  done

The warning messages look like below:
[  275.948611] ------------[ cut here ]------------
[  275.963840] WARNING: at fs/sysfs/dir.c:512 sysfs_add_one+0xb8/0xd0() (Tainted: P        W
---------------   )
[  275.979253] Hardware name: Tecal RH2285
[  275.994106] sysfs: cannot create duplicate filename '/devices/pci0000:00/0000:00:09.0/0000:08:00.0/host4/target4:2:1/4:2:1:0/block/sdb/sdb1/bcache/cache'
[  276.024105] Modules linked in: bcache tcp_diag inet_diag ipmi_devintf ipmi_si ipmi_msghandler
bonding 8021q garp stp llc ipv6 ext3 jbd loop sg iomemory_vsl(P) bnx2 microcode serio_raw i2c_i801
i2c_core iTCO_wdt iTCO_vendor_support i7core_edac edac_core shpchp ext4 jbd2 mbcache megaraid_sas
pata_acpi ata_generic ata_piix dm_mod [last unloaded: scsi_wait_scan]
[  276.072643] Pid: 2765, comm: sh Tainted: P        W  ---------------    2.6.32 #1
[  276.089315] Call Trace:
[  276.105801]  [<ffffffff81070fe7>] ? warn_slowpath_common+0x87/0xc0
[  276.122650]  [<ffffffff810710d6>] ? warn_slowpath_fmt+0x46/0x50
[  276.139361]  [<ffffffff81205c08>] ? sysfs_add_one+0xb8/0xd0
[  276.156012]  [<ffffffff8120609b>] ? sysfs_do_create_link+0x12b/0x170
[  276.172682]  [<ffffffff81206113>] ? sysfs_create_link+0x13/0x20
[  276.189282]  [<ffffffffa03bda21>] ? bcache_device_link+0xc1/0x110 [bcache]
[  276.205993]  [<ffffffffa03bfa08>] ? bch_cached_dev_attach+0x478/0x4f0 [bcache]
[  276.222794]  [<ffffffffa03c4a17>] ? bch_cached_dev_store+0x627/0x780 [bcache]
[  276.239680]  [<ffffffff8116783a>] ? alloc_pages_current+0xaa/0x110
[  276.256594]  [<ffffffff81203b15>] ? sysfs_write_file+0xe5/0x170
[  276.273364]  [<ffffffff811887b8>] ? vfs_write+0xb8/0x1a0
[  276.290133]  [<ffffffff811890b1>] ? sys_write+0x51/0x90
[  276.306368]  [<ffffffff8100c072>] ? system_call_fastpath+0x16/0x1b
[  276.322301] ---[ end trace 9f5d4fcdd0c3edfb ]---
[  276.338241] ------------[ cut here ]------------
[  276.354109] WARNING: at /home/wenqing.lz/bcache/bcache/super.c:720
bcache_device_link+0xdf/0x110 [bcache]() (Tainted: P        W  ---------------   )
[  276.386017] Hardware name: Tecal RH2285
[  276.401430] Couldn't create device <-> cache set symlinks
[  276.401759] Modules linked in: bcache tcp_diag inet_diag ipmi_devintf ipmi_si ipmi_msghandler
bonding 8021q garp stp llc ipv6 ext3 jbd loop sg iomemory_vsl(P) bnx2 microcode serio_raw i2c_i801
i2c_core iTCO_wdt iTCO_vendor_support i7core_edac edac_core shpchp ext4 jbd2 mbcache megaraid_sas
pata_acpi ata_generic ata_piix dm_mod [last unloaded: scsi_wait_scan]
[  276.465477] Pid: 2765, comm: sh Tainted: P        W  ---------------    2.6.32 #1
[  276.482169] Call Trace:
[  276.498610]  [<ffffffff81070fe7>] ? warn_slowpath_common+0x87/0xc0
[  276.515405]  [<ffffffff810710d6>] ? warn_slowpath_fmt+0x46/0x50
[  276.532059]  [<ffffffffa03bda3f>] ? bcache_device_link+0xdf/0x110 [bcache]
[  276.548808]  [<ffffffffa03bfa08>] ? bch_cached_dev_attach+0x478/0x4f0 [bcache]
[  276.565569]  [<ffffffffa03c4a17>] ? bch_cached_dev_store+0x627/0x780 [bcache]
[  276.582418]  [<ffffffff8116783a>] ? alloc_pages_current+0xaa/0x110
[  276.599341]  [<ffffffff81203b15>] ? sysfs_write_file+0xe5/0x170
[  276.616142]  [<ffffffff811887b8>] ? vfs_write+0xb8/0x1a0
[  276.632607]  [<ffffffff811890b1>] ? sys_write+0x51/0x90
[  276.648671]  [<ffffffff8100c072>] ? system_call_fastpath+0x16/0x1b
[  276.664756] ---[ end trace 9f5d4fcdd0c3edfc ]---

We forget to clear BCACHE_DEV_UNLINK_DONE flag in bcache_device_attach()
function when we attach a backing device first time.  After detaching this
backing device, this flag will be true and sysfs_remove_link() isn't called in
bcache_device_unlink().  Then when we attach this backing device again,
sysfs_create_link() will return EEXIST error in bcache_device_link().

So the fix is trival and we clear this flag in bcache_device_link().

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Tested-by: Joshua Schmid <jschmid@suse.com>
Tested-by: Eric Wheeler <bcache@linux.ewheeler.net>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-12-30 20:23:08 -07:00
Kent Overstreet
c5f1e5adf9 bcache: Add a cond_resched() call to gc
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Tested-by: Eric Wheeler <bcache@linux.ewheeler.net>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-12-30 20:23:06 -07:00
Zheng Liu
2ef9ccbfcb bcache: fix a livelock when we cause a huge number of cache misses
Subject :	[PATCH v2] bcache: fix a livelock in btree lock
Date :	Wed, 25 Feb 2015 20:32:09 +0800 (02/25/2015 04:32:09 AM)

This commit tries to fix a livelock in bcache.  This livelock might
happen when we causes a huge number of cache misses simultaneously.

When we get a cache miss, bcache will execute the following path.

->cached_dev_make_request()
  ->cached_dev_read()
    ->cached_lookup()
      ->bch->btree_map_keys()
        ->btree_root()  <------------------------
          ->bch_btree_map_keys_recurse()        |
            ->cache_lookup_fn()                 |
              ->cached_dev_cache_miss()         |
                ->bch_btree_insert_check_key() -|
                  [If btree->seq is not equal to seq + 1, we should return
                   EINTR and traverse btree again.]

In bch_btree_insert_check_key() function we first need to check upgrade
flag (op->lock == -1), and when this flag is true we need to release
read btree->lock and try to take write btree->lock.  During taking and
releasing this write lock, btree->seq will be monotone increased in
order to prevent other threads modify this in cache miss (see btree.h:74).
But if there are some cache misses caused by some requested, we could
meet a livelock because btree->seq is always changed by others.  Thus no
one can make progress.

This commit will try to take write btree->lock if it encounters a race
when we traverse btree.  Although it sacrifice the scalability but we
can ensure that only one can modify the btree.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Tested-by: Joshua Schmid <jschmid@suse.com>
Tested-by: Eric Wheeler <bcache@linux.ewheeler.net>
Cc: Joshua Schmid <jschmid@suse.com>
Cc: Zhu Yanhai <zhu.yanhai@gmail.com>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-12-30 20:23:05 -07:00
Linus Torvalds
3419b45039 Merge branch 'for-4.4/io-poll' of git://git.kernel.dk/linux-block
Pull block IO poll support from Jens Axboe:
 "Various groups have been doing experimentation around IO polling for
  (really) fast devices.  The code has been reviewed and has been
  sitting on the side for a few releases, but this is now good enough
  for coordinated benchmarking and further experimentation.

  Currently O_DIRECT sync read/write are supported.  A framework is in
  the works that allows scalable stats tracking so we can auto-tune
  this.  And we'll add libaio support as well soon.  Fow now, it's an
  opt-in feature for test purposes"

* 'for-4.4/io-poll' of git://git.kernel.dk/linux-block:
  direct-io: be sure to assign dio->bio_bdev for both paths
  directio: add block polling support
  NVMe: add blk polling support
  block: add block polling support
  blk-mq: return tag/queue combo in the make_request_fn handlers
  block: change ->make_request_fn() and users to return a queue cookie
2015-11-10 17:23:49 -08:00
Linus Torvalds
75021d2859 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial updates from Jiri Kosina:
 "Trivial stuff from trivial tree that can be trivially summed up as:

   - treewide drop of spurious unlikely() before IS_ERR() from Viresh
     Kumar

   - cosmetic fixes (that don't really affect basic functionality of the
     driver) for pktcdvd and bcache, from Julia Lawall and Petr Mladek

   - various comment / printk fixes and updates all over the place"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
  bcache: Really show state of work pending bit
  hwmon: applesmc: fix comment typos
  Kconfig: remove comment about scsi_wait_scan module
  class_find_device: fix reference to argument "match"
  debugfs: document that debugfs_remove*() accepts NULL and error values
  net: Drop unlikely before IS_ERR(_OR_NULL)
  mm: Drop unlikely before IS_ERR(_OR_NULL)
  fs: Drop unlikely before IS_ERR(_OR_NULL)
  drivers: net: Drop unlikely before IS_ERR(_OR_NULL)
  drivers: misc: Drop unlikely before IS_ERR(_OR_NULL)
  UBI: Update comments to reflect UBI_METAONLY flag
  pktcdvd: drop null test before destroy functions
2015-11-07 13:05:44 -08:00
Jens Axboe
dece16353e block: change ->make_request_fn() and users to return a queue cookie
No functional changes in this patch, but it prepares us for returning
a more useful cookie related to the IO that was queued up.

Signed-off-by: Jens Axboe <axboe@fb.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Keith Busch <keith.busch@intel.com>
2015-11-07 10:40:46 -07:00
Petr Mladek
8d090f4731 bcache: Really show state of work pending bit
WORK_STRUCT_PENDING is a mask for testing the pending bit.
test_bit() expects the number of the bit and we need to
use WORK_STRUCT_PENDING_BIT there.

Also work_data_bits() is defined in workqueues.h now.

I have noticed this just by chance when looking how
WORK_STRUCT_PENDING_BIT is used. The change is compile
tested.

Signed-off-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2015-11-06 15:06:05 +01:00
Kent Overstreet
749b61dab3 bcache: remove driver private bio splitting code
The bcache driver has always accepted arbitrarily large bios and split
them internally.  Now that every driver must accept arbitrarily large
bios this code isn't nessecary anymore.

Cc: linux-bcache@vger.kernel.org
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
[dpark: add more description in commit message]
Signed-off-by: Dongsu Park <dpark@posteo.net>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-08-13 12:31:40 -06:00
Christoph Hellwig
4246a0b63b block: add a bi_error field to struct bio
Currently we have two different ways to signal an I/O error on a BIO:

 (1) by clearing the BIO_UPTODATE flag
 (2) by returning a Linux errno value to the bi_end_io callback

The first one has the drawback of only communicating a single possible
error (-EIO), and the second one has the drawback of not beeing persistent
when bios are queued up, and are not passed along from child to parent
bio in the ever more popular chaining scenario.  Having both mechanisms
available has the additional drawback of utterly confusing driver authors
and introducing bugs where various I/O submitters only deal with one of
them, and the others have to add boilerplate code to deal with both kinds
of error returns.

So add a new bi_error field to store an errno value directly in struct
bio and remove the existing mechanisms to clean all this up.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-07-29 08:55:15 -06:00
Jens Axboe
2bb4cd5cc4 block: have drivers use blk_queue_max_discard_sectors()
Some drivers use it now, others just set the limits field manually.
But in preparation for splitting this into a hard and soft limit,
ensure that they all call the proper function for setting the hw
limit for discards.

Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-07-17 08:41:53 -06:00
Jens Axboe
77b5a08427 bcache: don't embed 'return' statements in closure macros
This is horribly confusing, it breaks the flow of the code without
it being apparent in the caller.

Signed-off-by: Jens Axboe <axboe@fb.com>
Acked-by: Christoph Hellwig <hch@lst.de>
2015-07-11 09:57:32 -06:00
Joe Perches
d1aa1ab33d MAINTAINERS: BCACHE: Kent Overstreet has changed email address
Kent's email address in MAINTAINERS seems to be invalid.
This was his last sign-off address, so use that if appropriate.

Fix the S: status entry while there.

Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-30 19:45:01 -07:00
Pekka Enberg
958b43384e bcache: use kvfree() in various places
Use kvfree() instead of open-coding it.

Signed-off-by: Pekka Enberg <penberg@kernel.org>
Cc: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-30 19:45:00 -07:00
Tejun Heo
66114cad64 writeback: separate out include/linux/backing-dev-defs.h
With the planned cgroup writeback support, backing-dev related
declarations will be more widely used across block and cgroup;
unfortunately, including backing-dev.h from include/linux/blkdev.h
makes cyclic include dependency quite likely.

This patch separates out backing-dev-defs.h which only has the
essential definitions and updates blkdev.h to include it.  c files
which need access to more backing-dev details now include
backing-dev.h directly.  This takes backing-dev.h off the common
include dependency chain making it a lot easier to use it across block
and cgroup.

v2: fs/fat build failure fixed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-02 08:33:34 -06:00
Mike Snitzer
326e1dbb57 block: remove management of bi_remaining when restoring original bi_end_io
Commit c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for
non-chains") regressed all existing callers that followed this pattern:
 1) saving a bio's original bi_end_io
 2) wiring up an intermediate bi_end_io
 3) restoring the original bi_end_io from intermediate bi_end_io
 4) calling bio_endio() to execute the restored original bi_end_io

The regression was due to BIO_CHAIN only ever getting set if
bio_inc_remaining() is called.  For the above pattern it isn't set until
step 3 above (step 2 would've needed to establish BIO_CHAIN).  As such
the first bio_endio(), in step 2 above, never decremented __bi_remaining
before calling the intermediate bi_end_io -- leaving __bi_remaining with
the value 1 instead of 0.  When bio_inc_remaining() occurred during step
3 it brought it to a value of 2.  When the second bio_endio() was
called, in step 4 above, it should've called the original bi_end_io but
it didn't because there was an extra reference that wasn't dropped (due
to atomic operations being optimized away since BIO_CHAIN wasn't set
upfront).

Fix this issue by removing the __bi_remaining management complexity for
all callers that use the above pattern -- bio_chain() is the only
interface that _needs_ to be concerned with __bi_remaining.  For the
above pattern callers just expect the bi_end_io they set to get called!
Remove bio_endio_nodec() and also remove all bio_inc_remaining() calls
that aren't associated with the bio_chain() interface.

Also, the bio_inc_remaining() interface has been moved local to bio.c.

Fixes: c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for non-chains")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-22 08:58:55 -06:00
Jens Axboe
dac56212e8 bio: skip atomic inc/dec of ->bi_cnt for most use cases
Struct bio has a reference count that controls when it can be freed.
Most uses cases is allocating the bio, which then returns with a
single reference to it, doing IO, and then dropping that single
reference. We can remove this atomic_dec_and_test() in the completion
path, if nobody else is holding a reference to the bio.

If someone does call bio_get() on the bio, then we flag the bio as
now having valid count and that we must properly honor the reference
count when it's being put.

Tested-by: Robert Elliott <elliott@hp.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-05 13:32:49 -06:00
Gu Zheng
aae4933da9 md/bcache: use generic io stats accounting functions to simplify io stat accounting
Use generic io stats accounting help functions (generic_{start,end}_io_acct)
to simplify io stat accounting.

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Acked-by: Kent Overstreet <kmo@datera.io>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-24 08:05:12 -07:00
Mike Snitzer
b277da0a8a block: disable entropy contributions for nonrot devices
Clear QUEUE_FLAG_ADD_RANDOM in all block drivers that set
QUEUE_FLAG_NONROT.

Historically, all block devices have automatically made entropy
contributions.  But as previously stated in commit e2e1a148 ("block: add
sysfs knob for turning off disk entropy contributions"):
    - On SSD disks, the completion times aren't as random as they
      are for rotational drives. So it's questionable whether they
      should contribute to the random pool in the first place.
    - Calling add_disk_randomness() has a lot of overhead.

There are more reliable sources for randomness than non-rotational block
devices.  From a security perspective it is better to err on the side of
caution than to allow entropy contributions from unreliable "random"
sources.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-10-04 10:55:32 -06:00
Kent Overstreet
0781c8748c bcache: Drop unneeded blk_sync_queue() calls
this is needed for the queue/block device we created (it's done by
blk_cleanup_queue() which we do call) - but calling it for the block devices we
only opened is pointless.

Change-Id: I53dfded14ed15b9581d10ca8399d5e1b3abbf9f2
2014-08-04 15:23:04 -07:00
Jianjian Huo
789d21dbd9 bcache: add mutex lock for bch_is_open
Since bch_is_open will iterate linked list bch_cache_sets and
uncached_devices, it needs bch_register_lock.

Signed-off-by: Jianjian Huo <samuel.huo@gmail.com>
2014-08-04 15:23:04 -07:00
Surbhi Palande
5b25abade2 bcache: Correct printing of btree_gc_max_duration_ms
time_stats::btree_gc_max_duration_mc is not bit shifted by 8

Fixes BUG #138

Change-Id: I44fc6e1d0579674016acc533f1a546b080e5371a
Signed-off-by: Surbhi Palande <sap@daterainc.com>
2014-08-04 15:23:04 -07:00
Slava Pestov
2452cc8906 bcache: try to set b->parent properly
bcache_flash_dev.ktest would reliably crash with 8k and 16k bucket size
before; now it passes.

Change-Id: Ib542232235e39298c3a7548fe52b645cabb823d1
2014-08-04 15:23:04 -07:00
Slava Pestov
c9a78332b4 bcache: fix memory corruption in init error path
If register_cache_set() failed, we would touch ca->set after
it had already been freed. Also, fix an assertion to catch
this.

Change-Id: I748e5f5b223e2d9b2602075dec2f997cced2394d
2014-08-04 15:23:04 -07:00
Slava Pestov
bf0c55c986 bcache: fix crash with incomplete cache set
Change-Id: I6abde52afe917633480caaf4e2518f42a816d886
2014-08-04 15:23:04 -07:00
Kent Overstreet
d83353b319 bcache: Fix more early shutdown bugs
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-08-04 15:23:04 -07:00
Slava Pestov
400ffaa2ac bcache: fix use-after-free in btree_gc_coalesce()
If we goto out_nocoalesce after we free new_nodes[0], we end up freeing
new_nodes[0] again. This was generating a lockdep warning. The fix is
to set new_nodes[0] to NULL, since the out_nocoalesce path safely
ignores NULL entries in the new_nodes array.

This regression was introduced in 2d7f9531.

Change-Id: I76564d7257800583214376b4bacf236cda90c89c
2014-08-04 15:23:04 -07:00
Kent Overstreet
6b708de64a bcache: Fix an infinite loop in journal replay
When running with multiple cache devices, if one of the devices has a completely
empty journal but we'd already found some journal entries on a previosu device
we'd go into an infinite loop.

Change-Id: I1dcdc0d738192746de28f40e8b08825b0dea5e2b
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-08-04 15:23:03 -07:00
Slava Pestov
913dc33fb2 bcache: fix crash in bcache_btree_node_alloc_fail tracepoint
'b' was NULL.

Change-Id: Icac0fd04afa2d23f213d96d51afd53374e6dd0c0
2014-08-04 15:23:03 -07:00
Slava Pestov
60ae81eee8 bcache: bcache_write tracepoint was crashing
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-08-04 15:23:03 -07:00
Slava Pestov
8e09480806 bcache: fix typo in bch_bkey_equal_header
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-08-04 15:23:03 -07:00
Kent Overstreet
501d52a90c bcache: Allocate bounce buffers with GFP_NOWAIT
There's no point in blocking on these allocations, since our fallback paths will
probably go faster than blocking.

Change-Id: I733ca202c25cb36bde02607a0a60552229a4241c
2014-08-04 15:23:03 -07:00
Kent Overstreet
bcf090e004 bcache: Make sure to pass GFP_WAIT to mempool_alloc()
this was very wrong - mempool_alloc() only guarantees success with GFP_WAIT.
bcache uses GFP_NOWAIT in various other places where we have a fallback,
circuits must've gotten crossed when writing this code or something.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-08-04 15:23:03 -07:00
Slava Pestov
9e5c353510 bcache: fix uninterruptible sleep in writeback thread
There were two issues here:

- writeback thread did not start until the device first became dirty
- writeback thread used uninterruptible sleep once running

Without this patch I see kernel warnings printed and a load average of
1.52 after booting my test VM. With this patch the warnings are gone and
the load average is near 0.00 as expected.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-08-04 15:23:03 -07:00
Slava Pestov
c5aa4a3157 bcache: wait for buckets when allocating new btree root
Tested:
- sometimes bcache_tier test would hang on startup with a failure
  to allocate the btree root -- no longer seeing this

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-08-04 15:23:03 -07:00
Slava Pestov
a664d0f05a bcache: fix crash on shutdown in passthrough mode
We never started the writeback thread in this case, so don't stop it.
2014-08-04 15:23:03 -07:00
Slava Pestov
e5112201c1 bcache: fix lockdep warnings on shutdown 2014-08-04 15:23:03 -07:00
Slava Pestov
8b326d3a2a bcache allocator: send discards with correct size 2014-08-04 15:23:03 -07:00
Surbhi Palande
dbd810ab67 bcache: Fix to remove the rcu_sched stalls.
while loop was executing infinitely.
This fix ends the while loop gracefully.

Signed-off-by: Surbhi Palande <sap@daterainc.com>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-08-04 15:23:02 -07:00
Kent Overstreet
9aa61a992a bcache: Fix a journal replay bug
journal replay wansn't validating pointers with bch_extent_invalid() before
derefing, fixed

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-08-04 15:23:02 -07:00
Kent Overstreet
5b1016e62f bcache: Fix a bug when detaching
After detaching a backing device from a cache set, a bit wasn't getting
reset meaning the second detach wouldn't work correctly.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-08-04 15:23:02 -07:00
Peter Zijlstra
4e857c58ef arch: Mass conversion of smp_mb__*()
Mostly scripted conversion of the smp_mb__* barriers.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/n/tip-55dhyhocezdw1dg7u19hmh1u@git.kernel.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-arch@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-04-18 14:20:48 +02:00
John Sheu
cb85114956 bcache: remove nested function usage
Uninlined nested functions can cause crashes when using ftrace, as they don't
follow the normal calling convention and confuse the ftrace function graph
tracer as it examines the stack.

Also, nested functions are supported as a gcc extension, but may fail on other
compilers (e.g. llvm).

Signed-off-by: John Sheu <john.sheu@gmail.com>
2014-03-18 12:39:28 -07:00
Kent Overstreet
3a2fd9d509 bcache: Kill bucket->gc_gen
gc_gen was a temporary used to recalculate last_gc, but since we only need
bucket->last_gc when gc isn't running (gc_mark_valid = 1), we can just update
last_gc directly.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-03-18 12:24:54 -07:00
Kent Overstreet
2531d9ee61 bcache: Kill unused freelist
This was originally added as at optimization that for various reasons isn't
needed anymore, but it does add a lot of nasty corner cases (and it was
responsible for some recently fixed bugs). Just get rid of it now.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-03-18 12:23:36 -07:00
Kent Overstreet
0a63b66db5 bcache: Rework btree cache reserve handling
This changes the bucket allocation reserves to use _real_ reserves - separate
freelists - instead of watermarks, which if nothing else makes the current code
saner to reason about and is going to be important in the future when we add
support for multiple btrees.

It also adds btree_check_reserve(), which checks (and locks) the reserves for
both bucket allocation and memory allocation for btree nodes; the old code just
kinda sorta assumed that since (e.g. for btree node splits) it had the root
locked and that meant no other threads could try to make use of the same
reserve; this technically should have been ok for memory allocation (we should
always have a reserve for memory allocation (the btree node cache is used as a
reserve and we preallocate it)), but multiple btrees will mean that locking the
root won't be sufficient anymore, and for the bucket allocation reserve it was
technically possible for the old code to deadlock.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-03-18 12:23:35 -07:00
Kent Overstreet
56b30770b2 bcache: Kill btree_io_wq
With the locking rework in the last patch, this shouldn't be needed anymore -
btree_node_write_work() only takes b->write_lock which is never held for very
long.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-03-18 12:23:35 -07:00
Kent Overstreet
2a285686c1 bcache: btree locking rework
Add a new lock, b->write_lock, which is required to actually modify - or write -
a btree node; this lock is only held for short durations.

This means we can write out a btree node without taking b->lock, which _is_ held
for long durations - solving a deadlock when btree_flush_write() (from the
journalling code) is called with a btree node locked.

Right now just occurs in bch_btree_set_root(), but with an upcoming journalling
rework is going to happen a lot more.

This also turns b->lock is now more of a read/intent lock instead of a
read/write lock - but not completely, since it still blocks readers. May turn it
into a real intent lock at some point in the future.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-03-18 12:23:35 -07:00
Kent Overstreet
05335cff9f bcache: Fix a race when freeing btree nodes
This isn't a bulletproof fix; btree_node_free() -> bch_bucket_free() puts the
bucket on the unused freelist, where it can be reused right away without any
ordering requirements. It would be better to wait on at least a journal write to
go down before reusing the bucket. bch_btree_set_root() does this, and inserting
into non leaf nodes is completely synchronous so we should be ok, but future
patches are just going to get rid of the unused freelist - it was needed in the
past for various reasons but shouldn't be anymore.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-03-18 12:23:34 -07:00
Kent Overstreet
4fe6a81670 bcache: Add a real GC_MARK_RECLAIMABLE
This means the garbage collection code can better check for data and metadata
pointers to the same buckets.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-03-18 12:22:36 -07:00
Kent Overstreet
c13f3af924 bcache: Add bch_keylist_init_single()
This will potentially save us an allocation when we've got inode/dirent bkeys
that don't fit in the keylist's inline keys.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-03-18 12:22:36 -07:00
Kent Overstreet
1575402052 bcache: Improve priority_stats
Break down data into clean data/dirty data/metadata.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-03-18 12:22:35 -07:00
Kent Overstreet
7159b1ad3d bcache: Better alloc tracepoints
Change the invalidate tracepoint to indicate how much data we're invalidating,
and change the alloc tracepoints to indicate what offset they're for.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-03-18 12:22:35 -07:00
Kent Overstreet
3f5e0a34da bcache: Kill dead cgroup code
This hasn't been used or even enabled in ages.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-03-18 12:22:35 -07:00
Nicholas Swenson
3f6ef38110 bcache: stop moving_gc marking buckets that can't be moved.
Signed-off-by: Nicholas Swenson <nks@daterainc.com>
2014-03-18 12:22:34 -07:00
Kent Overstreet
10d9dcf6ee bcache: Fix moving_pred()
Avoid a potential null pointer deref (e.g. from check keys for cache misses)

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-03-18 12:22:34 -07:00
Nicholas Swenson
da415a096f bcache: Fix moving_gc deadlocking with a foreground write
Deadlock happened because a foreground write slept, waiting for a bucket
to be allocated. Normally the gc would mark buckets available for invalidation.
But the moving_gc was stuck waiting for outstanding writes to complete.
These writes used the bcache_wq, the same queue foreground writes used.

This fix gives moving_gc its own work queue, so it was still finish moving
even if foreground writes are stuck waiting for allocation. It also makes
work queue a parameter to the data_insert path, so moving_gc can use its
workqueue for writes.

Signed-off-by: Nicholas Swenson <nks@daterainc.com>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-03-18 12:22:33 -07:00
Kent Overstreet
90db6919f5 bcache: Fix discard granularity
blk_stack_limits() doesn't like a discard granularity of 0.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-03-18 12:22:33 -07:00
Kent Overstreet
487dded86e bcache: Fix another bug recovering from unclean shutdown
The on disk bucket gens are allowed to be out of date, when we reuse buckets
that didn't have any live data in them. To deal with this, the initial gc has to
update the bucket gen when we find a pointer gen newer than the bucket's gen.

Unfortunately we weren't doing this for pointers in the journal that we're about
to replay.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-03-18 12:22:33 -07:00
Kent Overstreet
0bd143fd80 bcache: Fix a bug recovering from unclean shutdown
The code to fixup incorrect bucket prios incorrectly did not skip btree node
freeing keys

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-03-18 12:22:32 -07:00
Kent Overstreet
27201cfdaa bcache: Fix a journalling reclaim after recovery bug
On recovery we weren't correctly keeping track of what journal buckets had open
journal entries, thus it was possible for them to be overwritten until we'd
written all new journal entries.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-03-18 12:21:48 -07:00
Kent Overstreet
65ddf45a31 bcache: Fix a null ptr deref in journal replay
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-03-17 19:01:03 -07:00
Kent Overstreet
4fa03402cd bcache: Fix a lockdep splat in an error path
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-03-17 18:59:09 -07:00
Kent Overstreet
dabb443340 bcache: Fix a shutdown bug
Shutdown wasn't cancelling/waiting on journal_write_work()

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-02-25 18:42:49 -08:00
Kent Overstreet
1b4eaf3d38 bcache: Fix flash_dev_cache_miss() for real this time
The code was using sectors to count the number of sectors it was zeroing... but
then it passed it to bio_advance()... after it had been set to 0. Amusing...

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-02-25 18:41:11 -08:00