Btree node scan must not use the btree node cache: that causes
interference from prior failed reads and from parallel workers.
Instead we need to allocate btree nodes that don't live in the btree
cache, so that we can call bch2_btree_node_read_done() directly.
This patch tweaks the low level helpers so they don't touch the btree
cache lists.
Cc: Nikita Ofitserov <himikof@gmail.com>
Reviewed-by: Nikita Ofitserov <himikof@gmail.com>
Reported-and-tested-by: Edoardo Codeglia <bcachefs@404.blue>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The new guard() and scoped_guard() helpers allow for more natural code.
Some uses with creative flow control have been left as-is.
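For reference, a minimal sketch of what guard()/scoped_guard() from
linux/cleanup.h give us (the struct and locks here are illustrative,
not bcachefs code):

  #include <linux/cleanup.h>
  #include <linux/mutex.h>
  #include <linux/spinlock.h>

  struct demo {
          struct mutex    lock;
          spinlock_t      list_lock;
          unsigned        count;
  };

  static unsigned demo_count(struct demo *d)
  {
          /* mutex_unlock() runs automatically when the function returns */
          guard(mutex)(&d->lock);
          return d->count;
  }

  static void demo_bump(struct demo *d)
  {
          /* lock held only for the duration of the block */
          scoped_guard(spinlock, &d->list_lock)
                  d->count++;
  }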
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We have a bug report that looks like we might be leaking open buckets -
let's check if they got left attached to the cached btree node.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Like we just did with the data read path, emit a single error message
per btree node read, nicely formatted, with all the actions we took
grouped together.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Debugging infrastructure for async objs: this lets us easily create
fast_lists for various object types so they'll be visible in debugfs.
Add new object types to the BCH_ASYNC_OBJS_TYPES() enum, and drop a
pretty-printer wrapper in async_objs.c.
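The mechanism looks roughly like this (a sketch of the x-macro pattern;
the entries and the name table are illustrative, not the actual
definitions):

  #define BCH_ASYNC_OBJS_TYPES()        \
          x(btree_read_bio)             \
          x(write_op)

  enum bch_async_obj_type {
  #define x(n)    BCH_ASYNC_OBJ_##n,
          BCH_ASYNC_OBJS_TYPES()
  #undef x
          BCH_ASYNC_OBJ_NR,
  };

  /* name table for debugfs output, generated from the same list */
  static const char * const async_obj_type_strs[] = {
  #define x(n)    #n,
          BCH_ASYNC_OBJS_TYPES()
  #undef x
          NULL
  };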
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Convert device IO refs to enumerated_refs, for easier debugging of
refcount issues.
Simple conversion: enumerate all users and convert to the new helpers.
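The idea, sketched with hypothetical names (not the actual
enumerated_ref API): every get/put is tagged with an enum identifying
the user, so a leaked ref shows up as a nonzero count for a specific
call site.

  #include <linux/atomic.h>

  enum bch_dev_read_ref {
          BCH_DEV_READ_REF_btree_node_read,
          BCH_DEV_READ_REF_journal_read,
          BCH_DEV_READ_REF_discard,
          BCH_DEV_READ_REF_NR,
  };

  struct enumerated_ref_demo {
          atomic_long_t refs[BCH_DEV_READ_REF_NR];
  };

  static inline void demo_ref_get(struct enumerated_ref_demo *r,
                                  enum bch_dev_read_ref idx)
  {
          atomic_long_inc(&r->refs[idx]);
  }

  static inline void demo_ref_put(struct enumerated_ref_demo *r,
                                  enum bch_dev_read_ref idx)
  {
          atomic_long_dec(&r->refs[idx]);
  }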
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Single device filesystems are now identified by the block device name,
not the UUID - and single device filesystems with the same UUID can be
mounted simultaneously, without any special options.
This allocates a new bit in the superblock, BCH_SB_MULTI_DEVICE, which
indicates whether a filesystem has ever been multi device.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We're occasionally seeing the WARN_ON() for bump allocator usage
exceeding BTREE_TRANS_MEM_MAX; add some tracing so we can see what's
going on.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We now have separate per device io_refs for read and write access.
This fixes a device removal bug where the discard workers were still
running while we were removing alloc info for that device.
It's also a bit of hardening; we no longer allow writes to devices that
are read-only.
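Roughly the shape of it (hypothetical sketch, not the actual bcachefs
helpers):

  #include <linux/percpu-refcount.h>

  struct bch_dev_demo {
          struct percpu_ref io_ref[2];    /* indexed by READ/WRITE */
          bool ro;
  };

  static bool demo_dev_get_ioref(struct bch_dev_demo *ca, int rw)
  {
          /* writes to a read-only device are refused outright */
          if (rw == WRITE && ca->ro)
                  return false;

          return percpu_ref_tryget(&ca->io_ref[rw]);
  }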
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Running the preempt tiering tests with a lower than normal journal
reclaim delay turned up a shutdown hang - a lost wakeup, caused because
flushing a journal pin (e.g. key cache/write buffer) can generate a new
journal pin.
The "simple" fix of adding the correct wakeup didn't work because of
ordering issues; if we flush btree node pins too aggressively before
other pins have completed, we end up spinning where each flush iteration
generates new work.
So to fix this correctly:
- The list of flushed journal pins is now broken out by type, so that
  we can wait for key cache/write buffer pin flushing to complete
  before flushing dirty btree nodes
- A new closure_waitlist is added for bch2_journal_flush_pins; it's
  only used under the journal lock (or while taking it), so it's
  pretty cheap to add rigorously correct wakeups to journal_pin_set()
  and journal_pin_drop().
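Roughly, the waitlist part (hypothetical field and predicate names;
the real code is in journal_reclaim.c):

  /*
   * Waker side - called under the journal lock from journal_pin_set()
   * and journal_pin_drop():
   */
  static void demo_journal_pin_changed(struct journal *j)
  {
          closure_wake_up(&j->flush_pins_wait);   /* hypothetical waitlist */
  }

  /* Waiter side, in bch2_journal_flush_pins(): */
  static void demo_wait_for_keycache_wb_pins(struct journal *j)
  {
          struct closure cl;

          closure_init_stack(&cl);

          while (demo_keycache_wb_pins_remain(j)) {       /* hypothetical */
                  closure_wait(&j->flush_pins_wait, &cl);

                  /* recheck so a concurrent pin drop isn't missed */
                  if (!demo_keycache_wb_pins_remain(j))
                          closure_wake_up(&j->flush_pins_wait);

                  closure_sync(&cl);
          }
  }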
Additionally, bch2_journal_seq_pins_to_text() is moved to
journal_reclaim.c, where it belongs, along with a bit of other small
renaming and refactoring.
Besides fixing the hang, the better ordering between key cache/write
buffer flushing and btree node flushing should help with, or fix, the
"unmount taking excessively long" issue a few users have been noticing.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
for_each_btree_node() now works similarly to for_each_btree_key(), where
the loop body is passed as an argument and run inside lockrestart_do().
This now calls trans_begin() on every loop iteration - which fixes an
SRCU warning in backpointers fsck.
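A call site now looks roughly like this (illustrative, not an exact
signature):

  ret = for_each_btree_node(trans, iter, BTREE_ID_extents, POS_MIN, 0, b, ({
          /* the body may be re-run after a transaction restart */
          bch_info(c, "btree node at level %u", b->c.level);
          0;
  }));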
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The debug code relies on btree_trans_list being ordered so that it can
resume on subsequent calls or lock restarts.
However, it was using trans->locking_wait.task.pid, which is incorrect
since btree_trans objects are cached and reused - typically by different
tasks.
Fix this by switching to pointer order, and also by sorting the list
lazily when required - speeding up the btree_trans_get() fastpath.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
debug.c was using closure_get() on a different thread's closure where
we don't know if the object being refcounted is alive.
We keep btree_trans objects on a list so they can be printed by debug
code, and because it is cost prohibitive to touch the btree_trans list
every time we allocate and free btree_trans objects, cached objects are
also on this list.
However, we do not want the debug code to see cached but not in use
btree_trans objects - critically because the btree_paths array will have
been freed (if it was reallocated).
closure_get() is also incorrect to use when that get may race with it
hitting zero, i.e. we must already have a ref on the object or know the
ref can't currently hit 0 for other reasons (as used in the cycle
detector).
To fix this, use the previously introduced closure_get_not_zero(),
closure_return_sync(), and closure_init_stack_release(); the debug code
now can only take a ref on a trans object if it's alive and in use.
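The debug-side pattern is now roughly (simplified sketch):

  struct btree_trans *trans;

  list_for_each_entry(trans, &c->btree_trans_list, list) {
          /* skip objects that are cached/idle or being torn down */
          if (!closure_get_not_zero(&trans->ref))
                  continue;

          bch2_btree_trans_to_text(out, trans);
          closure_put(&trans->ref);
  }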
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
btree_deadlock_to_text() searches the list of btree transactions to find
a deadlock - when it finds one it's done; it's not like the other
*_read() functions, which print each object.
Factor out btree_deadlock_to_text() to make this clearer.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We were grabbing the sequence number before unlock incremented it - fix
this by moving the increment to seqmutex_lock() (so the seqmutex_relock()
failure path skips the mutex_trylock()), and returning the sequence
number from unlock(), to make the API simpler and safer.
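The resulting API is tiny - roughly (a sketch of the fixed helpers):

  #include <linux/mutex.h>

  struct seqmutex {
          struct mutex    lock;
          u32             seq;
  };

  static inline void seqmutex_lock(struct seqmutex *lock)
  {
          mutex_lock(&lock->lock);
          lock->seq++;
  }

  /* returns the sequence number to pass to seqmutex_relock() */
  static inline u32 seqmutex_unlock(struct seqmutex *lock)
  {
          u32 seq = lock->seq;

          mutex_unlock(&lock->lock);
          return seq;
  }

  static inline bool seqmutex_relock(struct seqmutex *lock, u32 seq)
  {
          if (lock->seq != seq || !mutex_trylock(&lock->lock))
                  return false;

          if (lock->seq != seq) {
                  mutex_unlock(&lock->lock);
                  return false;
          }

          return true;
  }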
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Combine iter/update/trigger/str_hash flags into a single enum, and
x-macroize them for a to_text() function later.
These flags are all for a specific iter/key/update context, so it makes
sense to group them together - iter/update/trigger flags were already
given distinct bits; this cleans up and unifies that handling.
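Roughly the shape of it (illustrative flag names, not the full set):

  #define BCH_ITER_UPDATE_FLAGS()       \
          x(slots)                      \
          x(intent)                     \
          x(norun)

  enum btree_iter_update_flags_bits {
  #define x(n)    BTREE_ITER_FLAG_BIT_##n,
          BCH_ITER_UPDATE_FLAGS()
  #undef x
  };

  enum btree_iter_update_flags {
  #define x(n)    BTREE_ITER_##n = 1U << BTREE_ITER_FLAG_BIT_##n,
          BCH_ITER_UPDATE_FLAGS()
  #undef x
  };

  /* the same list will drive a to_text() name table later */
  static const char * const btree_iter_flag_strs[] = {
  #define x(n)    #n,
          BCH_ITER_UPDATE_FLAGS()
  #undef x
          NULL
  };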
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
sysfs is limited to PAGE_SIZE, and when we're debugging strange
deadlocks/priority inversions we need to see the full list.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bcachefs btree nodes are big - typically 256k - and btree roots are
pinned in memory. As we're now up to 18 btrees, we have significant
memory overhead in mostly empty btree roots.
And in the future we're going to start enforcing that certain btree node
boundaries exist, to solve lock contention issues - analogous to XFS's
AGIs.
Thus, we need to start allocating smaller btree node buffers when we
can. This patch changes code that refers to the filesystem constant
c->opts.btree_node_size to refer to the btree node buffer size -
btree_buf_bytes() - where appropriate.
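i.e. something like (sketch; the actual helper may differ):

  static inline size_t btree_buf_bytes(const struct btree *b)
  {
          return 1UL << b->byte_order;
  }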
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
- Some tweaks to greatly reduce locking overhead for the list of btree
transactions, so that it can always be enabled: leave btree_trans
objects on the list when they're on the percpu single item freelist,
and only check for duplicates in the same process when
CONFIG_BCACHEFS_DEBUG is enabled
- don't zero out the full btree_trans unless we allocated it from
the mempool
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
In the debugfs code, we had an incorrect use of drop_locks_do(); on
transaction restart we don't want to restart the current loop iteration,
since we've already emitted the current key to the buffer for userspace.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Since we can run with unknown btree IDs, we can't directly index btree
IDs into fixed size arrays.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
copy_to_user() returns the number of bytes that couldn't be copied - not
an errcode.
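For reference, the corrected pattern looks something like (hypothetical
helper):

  #include <linux/minmax.h>
  #include <linux/uaccess.h>

  static ssize_t demo_flush_buf(char __user *ubuf, size_t len,
                                const void *kbuf, size_t kbuf_len)
  {
          size_t n = min(len, kbuf_len);

          /* copy_to_user() returns the number of bytes NOT copied */
          if (copy_to_user(ubuf, kbuf, n))
                  return -EFAULT;

          return n;
  }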
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We're using more stack than we'd like in a number of functions, and
btree_trans is the biggest object that we stack allocate.
But we have to do a heap allocation to initialize it anyway, so
there's no real downside to heap allocating the entire thing.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
More reorganization: this splits up io.c into
- io_read.c
- io_misc.c - fallocate, fpunch, truncate
- io_write.c
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We can't be holding btree_trans_lock while copying to user space, which
might incur a page fault. To fix this, convert it to a seqmutex so we
can unlock/relock.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
GFP_NOIO dates from the bcache days, when we operated under the block
layer. Now, GFP_NOFS is more appropriate, so switch all GFP_NOIO uses to
GFP_NOFS.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Pulling out a helper from cmd_list.c, as the rest is being rewritten in
Rust but we're not ready to rewrite lower-level btree code in Rust.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Rust bindgen doesn't cope well with anonymous structs and unions. This
patch drops the fancy anonymous structs & unions in bkey_i that let us
use the same helpers for bkey_i and bkey_packed; since bkey_packed is an
internal type that's never exposed to outside code, it's only a minor
inconvenience.
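After this, bkey_i is roughly just (sketch, not the verbatim
definition):

  struct bkey_i {
          __u64           _data[0];

          struct bkey     k;
          struct bch_val  v;
  };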
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>