path->idx is now a code smell: we should be using path_idx_t, since it's
stable across btree path reallocation.
This is also a bit faster, using the same loop counter vs. fetching
path->idx from each path we iterate over.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
In the future we'll be making trans->paths resizable and potentially
having _many_ more paths (for fsck); we need to start fixing algorithms
that walk each path in a transaction where possible.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
for_each_btree_key() handles transaction restarts, like
for_each_btree_key2(), but only calls bch2_trans_begin() after a
transaction restart - for_each_btree_key2() wraps every loop iteration
in a transaction.
The for_each_btree_key() behaviour is problematic when it leads to
holding the SRCU lock that prevents key cache reclaim for an unbounded
amount of time - there's no real need to keep it around.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
For BTREE_ITER_WITH_JOURNAL, we memoize lookups in the journal keys, to
avoid the binary search overhead.
Previously we stashed the pos of the last key returned from the journal,
in order to force the lookup to be redone when rewinding.
Now bch2_journal_keys_peek_upto() handles rewinding itself when
necessary - so we can slim down btree_iter.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
As discussed in the previous patch, BTREE_ITER_ALL_LEVELS appears to be
racy with concurrent interior node updates - and perhaps it is fixable,
but it's tricky and unnecessary.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When bch2_fs_alloc() gets an error before calling
bch2_fs_btree_iter_init(), bch2_fs_btree_iter_exit() makes an invalid
memory access because btree_trans_list is uninitialized.
Signed-off-by: Thomas Bertschinger <tahbertschinger@gmail.com>
Fixes: 6bd68ec266 ("bcachefs: Heap allocate btree_trans")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
BTREE_INSERT_NOJOURNAL is primarily used for a performance optimization
related to inode updates and fsync - document it.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
rebalance_work entries may refer to entries in the extents btree, which
is a snapshots btree, or they may also refer to entries in the reflink
btree, which is not.
Hence rebalance_work keys may use the snapshot field but it's not
required to be nonzero - add a new btree flag to reflect this.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The SRCU read lock that btree_trans takes exists to make it safe for
bch2_trans_relock() to deref pointers to btree nodes/key cache items we
don't have locked, but as a side effect it blocks reclaim from freeing
those items.
Thus, it's important to not hold it for too long: we need to
differentiate between bch2_trans_unlock() calls that will be only for a
short duration, and ones that will be for an unbounded duration.
This introduces bch2_trans_unlock_long(), to be used mainly by the data
move paths.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
More forwards compatibility fixups: having BKEY_TYPE_btree at the end of
the enum conflicts with unnkown btree IDs, this shifts BKEY_TYPE_btree
to slot 0 and fixes things up accordingly.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We're using more stack than we'd like in a number of functions, and
btree_trans is the biggest object that we stack allocate.
But we have to do a heap allocatation to initialize it anyways, so
there's no real downside to heap allocating the entire thing.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When we handle a transaction restart in a nested context, we need to
return -BCH_ERR_transaction_restart_nested because we invalidated the
outer context's iterators and locks.
bch2_propagate_key_to_snapshot_leaves() wasn't doing this, this patch
fixes it to use trans_was_restarted().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This extents KEY_TYPE_snapshot to include some new fields:
- depth, to indicate depth of this particular node from the root
- skip[3], skiplist entries for quickly walking back up to the root
These are to improve bch2_snapshot_is_ancestor(), making it O(ln(n))
instead of O(n) in the snapshot tree depth.
Skiplist nodes are picked at random from the set of ancestor nodes, not
some fixed fraction.
This introduces bcachefs_metadata_version 1.1, snapshot_skiplists.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
- endianness fixes
- mark some things static
- fix a few __percpu annotations
- fix silent enum conversions
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add two new helpers for allocating memory with btree locks held: The
idea is to first try the allocation with GFP_NOWAIT|__GFP_NOWARN, then
if that fails - unlock, retry with GFP_KERNEL, and then call
trans_relock().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add a new helper for the common pattern of:
- trans_unlock()
- do something
- trans_relock()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
As with previous conversions, replace -ENOENT uses with more informative
private error codes.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_btree_trans_to_text() is used on btree_trans objects that are owned
by different threads - when printing out deadlock cycles - so we need a
safe version of trans_for_each_path(), else we race with seeing a
btree_path that was just allocated and not fully initialized:
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
As suggested by Linus, this drops the six_lock_state union in favor of
raw bitmasks.
On the one hand, bitfields give more type-level structure to the code.
However, a significant amount of the code was working with
six_lock_state as a u64/atomic64_t, and the conversions from the
bitfields to the u64 were deemed a bit too out-there.
More significantly, because bitfield order is poorly defined (#ifdef
__LITTLE_ENDIAN_BITFIELD can be used, but is gross), incrementing the
sequence number would overflow into the rest of the bitfield if the
compiler didn't put the sequence number at the high end of the word.
The new code is a bit saner when we're on an architecture without real
atomic64_t support - all accesses to lock->state now go through
atomic64_*() operations.
On architectures with real atomic64_t support, we additionally use
atomic bit ops for setting/clearing individual bits.
Text size: 7467 bytes -> 4649 bytes - compilers still suck at
bitfields.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
It's for doing updates - this is where it belongs, and next pathes will
be changing these helpers to use items from btree_update.h.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Introduce new helpers for a common pattern:
bch2_trans_iter_init();
bch2_btree_iter_peek_slot();
- bch2_bkey_get_iter_type() returns -ENOENT if it doesn't find a key of
the correct type
- bch2_bkey_get_val_typed() copies the val out of the btree to a
(typically stack allocated) variable; it handles the case where the
value in the btree is smaller than the current version of the type,
zeroing out the remainder.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This adds a new helper, bch2_trans_mutex_lock(), for locking a mutex -
dropping and retaking btree locks as needed.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
It's important that in BTREE_ITER_FILTER_SNAPSHOTS mode we always use
peek_upto() and provide an end for the interval we're searching for -
otherwise, when we hit the end of the inode the next inode be in a
different subvolume and not have any keys in the current snapshot, and
we'd iterate over arbitrarily many keys before returning one.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This replaces various BUG_ON() assertions with panics that tell us where
the restart was done and the restart type.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
In debug mode, we now track where btree iterators and paths are
initialized/allocated - helpful in tracking down btree path overflows.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This introduces some new conveniences, to help cut down on boilerplate:
- bch2_trans_kmalloc_nomemzero() - performance optimiation
- bch2_bkey_make_mut()
- bch2_bkey_get_mut()
- bch2_bkey_get_mut_typed()
- bch2_bkey_alloc()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This patch introduces
- bpos_eq()
- bpos_lt()
- bpos_le()
- bpos_gt()
- bpos_ge()
and equivalent replacements for bkey_cmp().
Looking at the generated assembly these could probably be improved
further, but we already see a significant code size improvement with
this patch.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When flags & btree_id are constants, we can constant fold the entire
calculation of the actual iterator flags - and the whole thing becomes
small enough to inline.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Previously, when we exited from the loop body with a break statement
_ret wouldn't have been assigned to yet, and we could spuriously return
a transaction restart error.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
- Marking a non-static function as inline doesn't actually work and is
now causing problems - drop that
- Introduce BCACHEFS_LOG_PREFIX for when we want to prefix log messages
with bcachefs (filesystem name)
- Userspace doesn't have real percpu variables (maybe we can get this
fixed someday), put an #ifdef around bch2_disk_reservation_add()
fastpath
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Now we store the transaction's fn idx in a local variable, instead of
redoing the lookup every time we call bch2_trans_init().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We've outgrown our own deadlock avoidance strategy.
The btree iterator API provides an interface where the user doesn't need
to concern themselves with lock ordering - different btree iterators can
be traversed in any order. Without special care, this will lead to
deadlocks.
Our previous strategy was to define a lock ordering internally, and
whenever we attempt to take a lock and trylock() fails, we'd check if
the current btree transaction is holding any locks that cause a lock
ordering violation. If so, we'd issue a transaction restart, and then
bch2_trans_begin() would re-traverse all previously used iterators, but
in the correct order.
That approach had some issues, though.
- Sometimes we'd issue transaction restarts unnecessarily, when no
deadlock would have actually occured. Lock ordering restarts have
become our primary cause of transaction restarts, on some workloads
totally 20% of actual transaction commits.
- To avoid deadlock or livelock, we'd often have to take intent locks
when we only wanted a read lock: with the lock ordering approach, it
is actually illegal to hold _any_ read lock while blocking on an intent
lock, and this has been causing us unnecessary lock contention.
- It was getting fragile - the various lock ordering rules are not
trivial, and we'd been seeing occasional livelock issues related to
this machinery.
So, since bcachefs is already a relational database masquerading as a
filesystem, we're stealing the next traditional database technique and
switching to a cycle detector for avoiding deadlocks.
When we block taking a btree lock, after adding ourself to the waitlist
but before sleeping, we do a DFS of btree transactions waiting on other
btree transactions, starting with the current transaction and walking
our held locks, and transactions blocking on our held locks.
If we find a cycle, we emit a transaction restart. Occasionally (e.g.
the btree split path) we can not allow the lock() operation to fail, so
if necessary we'll tell another transaction that it has to fail.
Result: trans_restart_would_deadlock events are reduced by a factor of
10 to 100, and we'll be able to delete a whole bunch of grotty, fragile
code.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
With the new deadlock cycle detector, it's critical that all held locks
be marked in a btree_path, because that's what the cycle detector
traverses - any locks that aren't correctly marked will cause deadlocks.
This changes the btree_path to allocate some btree_paths for the new
nodes, since until the final update is done we otherwise don't have a
path referencing them.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Also, do some reorganizing/renaming, convert atomic counters in bch_fs
to persistent counters, and add a few missing counters.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
- Don't decrease BTREE_ITER_MAX when building with CONFIG_LOCKDEP
anymore. The lockdep table sizes are configurable now, we don't need
this anymore.
- btree_trans_too_many_iters() is less conservative now. Previously it
was causing a transaction restart if we had used more than
BTREE_ITER_MAX / 2 paths, change this to BTREE_ITER_MAX - 8.
This helps with excessive transaction restarts/livelocks in the bucket
allocator path.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This patch
- tracks maximum bch2_trans_kmalloc() memory used in btree_transaction_stats
- makes it available in debugfs
- switches bch2_trans_init() to using that for the amount of memory to
preallocate, instead of the parameter passed in
This drastically reduces transaction restarts, and means we no longer
need to track this in the source code.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Start to centralize some of the locking code in a new file; more locking
code will be moving here in the future.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Instead of counting transaction restarts, count when the transaction is
restarted: if bch2_trans_begin() was called when the transaction wasn't
restarted we need to ensure restart_count is still incremented.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
We need a way to check if the machinery for handling btree_paths with in
a transaction is behaving reasonably, as it often has not been - we've
had bugs with transaction path overflows caused by duplicate paths and
plenty of other things.
This patch tracks, per transaction fn, the most btree paths ever
allocated by that transaction and makes it available in debugfs.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Our types are exported to the tracepoint code, so it's not necessary to
break things out individually when passing them to tracepoints - we can
also call other functions from TP_fast_assign().
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Clearing path->preserve means the path will be dropping in
bch2_trans_begin() - but on transaction restart, we're likely to need
that path again.
This fixes a livelock in the allocation path.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This adds a new macro, like for_each_btree_key2(), but for iterating in
reverse order.
Also, change for_each_btree_key2() to properly check the return value of
bch2_btree_iter_advance().
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Now that we have error codes, with subtypes, we can switch to our own
error code for transaction restarts - and even better, a distinct error
code for each transaction restart reason: clearer code and better
debugging.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This will help us improve nested transactions - we need to add
assertions that whenever an inner transaction handles a restart, it
still returns -EINTR to the outer transaction.
This also adds nested_lockrestart_do() and nested_commit_do() which use
the new counters to correctly return -EINTR when the transaction was
restarted.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
We need the caller name and a place to store our results, btree_trans provides this.
Signed-off-by: Daniel Hill <daniel@gluo.nz>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This introduces two new macros for iterating through the btree, with
transaction restart handling
- for_each_btree_key2()
- for_each_btree_key_commit()
Every iteration is now in an implicit transaction, and - as with
lockrestart_do() and commit_do() - returning -EINTR will cause the
transaction to be restarted, at the same key.
This patch converts a bunch of code that was open coding this to these
new macros, saving a substantial amount of code.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
When inserting a key type that's not valid for a given btree, we should
print out which btree we were inserting into.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Previously, on every btree_iter_peek() operation we were searching the
journal keys, doing a full binary search - which was slow.
This patch fixes that by saving our position in the journal keys, so
that we only do a full binary search when moving our position backwards
or a large jump forwards.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This adds bch2_btree_iter_peek_all_levels(), which returns keys from
every level of the btree - interior nodes included - in monotonically
increasing order, soon to be used by the backpointers check & repair
code.
- BTREE_ITER_ALL_LEVELS can now be passed to for_each_btree_key() to
iterate thusly, much like BTREE_ITER_SLOTS
- The existing algorithm in bch2_btree_iter_advance() doesn't work with
peek_all_levels(): we have to defer the actual advancing until the
next time we call peek, where we have the btree path traversed and
uptodate. So, we add an advanced bit to btree_iter; when
BTREE_ITER_ALL_LEVELS is set bch2_btree_iter_advanced() just marks
the iterator as advanced.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This fixes a bug where __bch2_btree_node_update_key() wasn't clearing
should_be_locked, leading to bch2_btree_path_traverse() always failing -
all callers of btree_path_make_mut() want should_be_locked cleared, so
do it there.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This turns bch2_dump_trans_updates() into a to_text() method - this way
it can be used by debug tracing.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add a new error macro that also dumps transaction updates in addition to
doing an emergency shutdown - when a transaction update discovers or is
causing a fs inconsistency, it's helpful to see what updates it was
doing.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
In BTREE_ITER_FILTER_SNAPHOTS mode, we skip over keys in unrelated
snapshots. When we hit the end of an inode, if the next inode(s) are in
a different subvolume, we could potentially have to skip past many keys
before finding a key we can return to the caller, so they can terminate
the iteration.
This adds a peek_upto() variant to solve this problem, to be used when
we know the range we're searching within.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This is the start of cache coherency with the btree key cache - this
adds a btree iterator flag that causes lookups to also check the key
cache when we're iterating over the btree (not iterating over the key
cache).
Note that we could still race with another thread creating at item in
the key cache and updating it, since we aren't holding the key cache
locked if it wasn't found. The next patch for the update path will
address this by causing the transaction to restart if the key cache is
found to be dirty.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
With BTREE_ITER_FILTER_SNAPSHOTS, we have to distinguish between the
path where the key was found, and the path for inserting into the
current snapshot. This adds a new field to struct btree_iter for saving
a path for the current snapshot, and plumbs it through
bch2_trans_update().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This splits bch2_btree_iter() up into two functions: an inner function
that handles BTREE_ITER_WITH_JOURNAL, BTREE_ITER_WITH_UPDATES, and
iterating acrcoss leaf nodes, and an outer one that implements
BTREE_ITER_FILTER_SNAPHSOTS.
This is prep work for remember a btree_path at our update position in
BTREE_ITER_FILTER_SNAPSHOTS mode.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Symbol decoding, via %ps, isn't supported in userspace - this will also
be faster when we're using trans->fn in the fast path, as with the new
BCH_JSET_ENTRY_log journal messages.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This adds a flag to not mark the initial btree_path as preserve, for
paths that we expect to be cheap to reconstitute if necessary - this
solves a btree_path overflow caused by need_whiteout_for_snapshot().
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Reading from cached data, which calls bch2_bucket_io_time_reset(), is
leading to transaction iterator overflows - this standardizes the
workaround.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This adds a new assertion to be used by bch2_inode_update_after_write(),
which updates the VFS inode based on the update to the btree inode we
just did - we require that the btree inode still be locked when we do
that update.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>