This adds a new helper, bch2_trans_mutex_lock(), for locking a mutex -
dropping and retaking btree locks as needed.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
It's important that in BTREE_ITER_FILTER_SNAPSHOTS mode we always use
peek_upto() and provide an end for the interval we're searching for -
otherwise, when we hit the end of the inode the next inode be in a
different subvolume and not have any keys in the current snapshot, and
we'd iterate over arbitrarily many keys before returning one.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This replaces various BUG_ON() assertions with panics that tell us where
the restart was done and the restart type.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
In debug mode, we now track where btree iterators and paths are
initialized/allocated - helpful in tracking down btree path overflows.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This introduces some new conveniences, to help cut down on boilerplate:
- bch2_trans_kmalloc_nomemzero() - performance optimiation
- bch2_bkey_make_mut()
- bch2_bkey_get_mut()
- bch2_bkey_get_mut_typed()
- bch2_bkey_alloc()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This patch introduces
- bpos_eq()
- bpos_lt()
- bpos_le()
- bpos_gt()
- bpos_ge()
and equivalent replacements for bkey_cmp().
Looking at the generated assembly these could probably be improved
further, but we already see a significant code size improvement with
this patch.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When flags & btree_id are constants, we can constant fold the entire
calculation of the actual iterator flags - and the whole thing becomes
small enough to inline.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Previously, when we exited from the loop body with a break statement
_ret wouldn't have been assigned to yet, and we could spuriously return
a transaction restart error.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
- Marking a non-static function as inline doesn't actually work and is
now causing problems - drop that
- Introduce BCACHEFS_LOG_PREFIX for when we want to prefix log messages
with bcachefs (filesystem name)
- Userspace doesn't have real percpu variables (maybe we can get this
fixed someday), put an #ifdef around bch2_disk_reservation_add()
fastpath
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Now we store the transaction's fn idx in a local variable, instead of
redoing the lookup every time we call bch2_trans_init().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We've outgrown our own deadlock avoidance strategy.
The btree iterator API provides an interface where the user doesn't need
to concern themselves with lock ordering - different btree iterators can
be traversed in any order. Without special care, this will lead to
deadlocks.
Our previous strategy was to define a lock ordering internally, and
whenever we attempt to take a lock and trylock() fails, we'd check if
the current btree transaction is holding any locks that cause a lock
ordering violation. If so, we'd issue a transaction restart, and then
bch2_trans_begin() would re-traverse all previously used iterators, but
in the correct order.
That approach had some issues, though.
- Sometimes we'd issue transaction restarts unnecessarily, when no
deadlock would have actually occured. Lock ordering restarts have
become our primary cause of transaction restarts, on some workloads
totally 20% of actual transaction commits.
- To avoid deadlock or livelock, we'd often have to take intent locks
when we only wanted a read lock: with the lock ordering approach, it
is actually illegal to hold _any_ read lock while blocking on an intent
lock, and this has been causing us unnecessary lock contention.
- It was getting fragile - the various lock ordering rules are not
trivial, and we'd been seeing occasional livelock issues related to
this machinery.
So, since bcachefs is already a relational database masquerading as a
filesystem, we're stealing the next traditional database technique and
switching to a cycle detector for avoiding deadlocks.
When we block taking a btree lock, after adding ourself to the waitlist
but before sleeping, we do a DFS of btree transactions waiting on other
btree transactions, starting with the current transaction and walking
our held locks, and transactions blocking on our held locks.
If we find a cycle, we emit a transaction restart. Occasionally (e.g.
the btree split path) we can not allow the lock() operation to fail, so
if necessary we'll tell another transaction that it has to fail.
Result: trans_restart_would_deadlock events are reduced by a factor of
10 to 100, and we'll be able to delete a whole bunch of grotty, fragile
code.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
With the new deadlock cycle detector, it's critical that all held locks
be marked in a btree_path, because that's what the cycle detector
traverses - any locks that aren't correctly marked will cause deadlocks.
This changes the btree_path to allocate some btree_paths for the new
nodes, since until the final update is done we otherwise don't have a
path referencing them.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Also, do some reorganizing/renaming, convert atomic counters in bch_fs
to persistent counters, and add a few missing counters.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
- Don't decrease BTREE_ITER_MAX when building with CONFIG_LOCKDEP
anymore. The lockdep table sizes are configurable now, we don't need
this anymore.
- btree_trans_too_many_iters() is less conservative now. Previously it
was causing a transaction restart if we had used more than
BTREE_ITER_MAX / 2 paths, change this to BTREE_ITER_MAX - 8.
This helps with excessive transaction restarts/livelocks in the bucket
allocator path.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This patch
- tracks maximum bch2_trans_kmalloc() memory used in btree_transaction_stats
- makes it available in debugfs
- switches bch2_trans_init() to using that for the amount of memory to
preallocate, instead of the parameter passed in
This drastically reduces transaction restarts, and means we no longer
need to track this in the source code.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Start to centralize some of the locking code in a new file; more locking
code will be moving here in the future.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Instead of counting transaction restarts, count when the transaction is
restarted: if bch2_trans_begin() was called when the transaction wasn't
restarted we need to ensure restart_count is still incremented.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
We need a way to check if the machinery for handling btree_paths with in
a transaction is behaving reasonably, as it often has not been - we've
had bugs with transaction path overflows caused by duplicate paths and
plenty of other things.
This patch tracks, per transaction fn, the most btree paths ever
allocated by that transaction and makes it available in debugfs.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Our types are exported to the tracepoint code, so it's not necessary to
break things out individually when passing them to tracepoints - we can
also call other functions from TP_fast_assign().
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Clearing path->preserve means the path will be dropping in
bch2_trans_begin() - but on transaction restart, we're likely to need
that path again.
This fixes a livelock in the allocation path.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This adds a new macro, like for_each_btree_key2(), but for iterating in
reverse order.
Also, change for_each_btree_key2() to properly check the return value of
bch2_btree_iter_advance().
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Now that we have error codes, with subtypes, we can switch to our own
error code for transaction restarts - and even better, a distinct error
code for each transaction restart reason: clearer code and better
debugging.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This will help us improve nested transactions - we need to add
assertions that whenever an inner transaction handles a restart, it
still returns -EINTR to the outer transaction.
This also adds nested_lockrestart_do() and nested_commit_do() which use
the new counters to correctly return -EINTR when the transaction was
restarted.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
We need the caller name and a place to store our results, btree_trans provides this.
Signed-off-by: Daniel Hill <daniel@gluo.nz>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This introduces two new macros for iterating through the btree, with
transaction restart handling
- for_each_btree_key2()
- for_each_btree_key_commit()
Every iteration is now in an implicit transaction, and - as with
lockrestart_do() and commit_do() - returning -EINTR will cause the
transaction to be restarted, at the same key.
This patch converts a bunch of code that was open coding this to these
new macros, saving a substantial amount of code.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
When inserting a key type that's not valid for a given btree, we should
print out which btree we were inserting into.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Previously, on every btree_iter_peek() operation we were searching the
journal keys, doing a full binary search - which was slow.
This patch fixes that by saving our position in the journal keys, so
that we only do a full binary search when moving our position backwards
or a large jump forwards.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This adds bch2_btree_iter_peek_all_levels(), which returns keys from
every level of the btree - interior nodes included - in monotonically
increasing order, soon to be used by the backpointers check & repair
code.
- BTREE_ITER_ALL_LEVELS can now be passed to for_each_btree_key() to
iterate thusly, much like BTREE_ITER_SLOTS
- The existing algorithm in bch2_btree_iter_advance() doesn't work with
peek_all_levels(): we have to defer the actual advancing until the
next time we call peek, where we have the btree path traversed and
uptodate. So, we add an advanced bit to btree_iter; when
BTREE_ITER_ALL_LEVELS is set bch2_btree_iter_advanced() just marks
the iterator as advanced.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This fixes a bug where __bch2_btree_node_update_key() wasn't clearing
should_be_locked, leading to bch2_btree_path_traverse() always failing -
all callers of btree_path_make_mut() want should_be_locked cleared, so
do it there.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This turns bch2_dump_trans_updates() into a to_text() method - this way
it can be used by debug tracing.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add a new error macro that also dumps transaction updates in addition to
doing an emergency shutdown - when a transaction update discovers or is
causing a fs inconsistency, it's helpful to see what updates it was
doing.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
In BTREE_ITER_FILTER_SNAPHOTS mode, we skip over keys in unrelated
snapshots. When we hit the end of an inode, if the next inode(s) are in
a different subvolume, we could potentially have to skip past many keys
before finding a key we can return to the caller, so they can terminate
the iteration.
This adds a peek_upto() variant to solve this problem, to be used when
we know the range we're searching within.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This is the start of cache coherency with the btree key cache - this
adds a btree iterator flag that causes lookups to also check the key
cache when we're iterating over the btree (not iterating over the key
cache).
Note that we could still race with another thread creating at item in
the key cache and updating it, since we aren't holding the key cache
locked if it wasn't found. The next patch for the update path will
address this by causing the transaction to restart if the key cache is
found to be dirty.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
With BTREE_ITER_FILTER_SNAPSHOTS, we have to distinguish between the
path where the key was found, and the path for inserting into the
current snapshot. This adds a new field to struct btree_iter for saving
a path for the current snapshot, and plumbs it through
bch2_trans_update().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This splits bch2_btree_iter() up into two functions: an inner function
that handles BTREE_ITER_WITH_JOURNAL, BTREE_ITER_WITH_UPDATES, and
iterating acrcoss leaf nodes, and an outer one that implements
BTREE_ITER_FILTER_SNAPHSOTS.
This is prep work for remember a btree_path at our update position in
BTREE_ITER_FILTER_SNAPSHOTS mode.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Symbol decoding, via %ps, isn't supported in userspace - this will also
be faster when we're using trans->fn in the fast path, as with the new
BCH_JSET_ENTRY_log journal messages.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This adds a flag to not mark the initial btree_path as preserve, for
paths that we expect to be cheap to reconstitute if necessary - this
solves a btree_path overflow caused by need_whiteout_for_snapshot().
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Reading from cached data, which calls bch2_bucket_io_time_reset(), is
leading to transaction iterator overflows - this standardizes the
workaround.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This adds a new assertion to be used by bch2_inode_update_after_write(),
which updates the VFS inode based on the update to the btree inode we
just did - we require that the btree inode still be locked when we do
that update.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
- We should only be clearing should_be_locked in btree_path_set_pos() -
it's the responsiblity of the btree_path code, not the btree_iter
code.
- bch2_path_put() needs to pay attention to path->should_be_locked, to
ensure we don't drop locks we're supposed to be keeping.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
The bch2_btree_path_upgrade() call was failing and tripping an assert -
path->level + 1 is in this case not necessarily exactly what we want,
fix it by upgrading exactly the locks we want.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
for_each_btree_key() now calls bch2_trans_begin() as needed; that means,
we can also call it when we're in danger of overflowing transaction
paths.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
But we don't need to call it from outside the btree iterator code
anymore, since it's called by bch2_trans_begin() and
bch2_btree_path_traverse().
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Code that uses for_each_btree_key often wants transaction restarts to be
handled locally and not returned. Originally, we wouldn't return
transaction restarts if there was a single iterator in the transaction -
the reasoning being if there weren't other iterators being invalidated,
and the current iterator was being advanced/retraversed, there weren't
any locks or iterators we were required to preserve.
But with the btree_path conversion that approach doesn't work anymore -
even when we're using for_each_btree_key() with a single iterator there
will still be two paths in the transaction, since we now always preserve
the path at the pos the iterator was initialized at - the reason being
that on restart we often restart from the same place.
And it turns out there's now a lot of for_each_btree_key() uses that _do
not_ want transaction restarts handled locally, and should be returning
them.
This patch splits out for_each_btree_key_norestart() and
for_each_btree_key_continue_norestart(), and converts existing users as
appropriate. for_each_btree_key(), for_each_btree_key_continue(), and
for_each_btree_node() now handle transaction restarts themselves by
calling bch2_trans_begin() when necessary - and the old hack to not
return transaction restarts when there's a single path in the
transaction has been deleted.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Now that peek_node()/next_node() are converted to return errors
directly, we don't need bch2_trans_exit() to return errors - it's
cleaner this way and wasn't used much anymore.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This changes for_each_btree_node() to work like for_each_btree_key(),
and to that end bch2_btree_iter_peek_node() and next_node() also return
error ptrs.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
For snapshots, we need to implement btree lookups that return the first
key that's an ancestor of the snapshot ID the lookup is being done in -
and filter out keys in unrelated snapshots. This patch adds the btree
iterator flag BTREE_ITER_FILTER_SNAPSHOTS which does that filtering.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This patch significantly reduces the number of btree lookups required in
the extent update path.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
New rule is: if a btree path holds any locks it should be holding
precisely the locks wanted (accoringing to path->level and
path->locks_want).
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
btree_path_traverse_all() traverses btree iterators in sorted order, and
thus shouldn't see transaction restarts due to potential deadlocks - but
sometimes we do. This patch adds some more assertions and tracks some
more state to help track this down.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This splits btree_iter into two components: btree_iter is now the
externally visible componont, and it points to a btree_path which is now
reference counted.
This means we no longer have to clone iterators up front if they might
be mutated - btree_path can be shared by multiple iterators, and cloned
if an iterator would mutate a shared btree_path. This will help us use
iterators more efficiently, as well as slimming down the main long lived
state in btree_trans, and significantly cleans up the logic for iterator
lifetimes.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This was used for an optimization that hasn't existing in quite awhile
- iter->uptodate will probably be going away as well.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
These utility functions are for managing btree node state within a
btree_trans - rename them for consistency, and drop some unneeded
arguments.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This is prep work for splitting btree_path out from btree_iter -
btree_path will not have a pointer to btree_trans.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This factors out bch2_dump_trans_iters_updates() from the iter alloc
overflow path, and makes some small improvements to what it prints.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This will be used to make other operations on btree iterators within a
transaction more efficient, and enable some other improvements to how we
manage btree iterators.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Start tracking when btree transactions have been restarted - and assert
that we're always calling bch2_trans_begin() immediately after
transaction restart.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Btree node merging now happens prior to transaction commit, not after,
so we don't need to pay attention to BTREE_INSERT_NOUNLOCK.
Also, foreground_maybe_merge shouldn't be calling
bch2_btree_iter_traverse_all() - this is becoming private to the btree
iterator code and should only be called by bch2_trans_begin().
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This adds a new helper for btree_cache.c that does what we want where
the iterator is still being traverse - and also eliminates some
unnecessary transaction restarts.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This closes a significant hole (and last known hole) in our ability to
verify metadata. Previously, since btree nodes are log structured, we
couldn't detect lost btree writes that weren't the first write to a
given node. Additionally, this seems to have lead to some significant
metadata corruption on multi device filesystems with metadata
replication: since a write may have made it to one device and not
another, if we read that btree node back from the replica that did have
that write and started appending after that point, the other replica
would have a gap in the bset entries and reading from that replica
wouldn't find the rest of the bsets.
But, since updates to interior btree nodes are now journalled, we can
close this hole by updating pointers to btree nodes after every write
with the currently written number of sectors, without negatively
affecting performance. This means we will always detect lost or corrupt
metadata - it also means that our btree is now a curious hybrid of COW
and non COW btrees, with all the benefits of both (excluding
complexity).
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
We weren't correctly verifying that we had interior node intent locks -
this patch also fixes bugs uncovered by the new assertions.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Add basic kernel docs for bch2_trans_reset and bch2_trans_begin.
Signed-off-by: Dan Robertson <dan@dlrobertson.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Adding iter->should_be_locked introduced a regression where it ended up
not being set on the iterator passed to bch2_btree_update_start(), which
is definitely not what we want.
This patch requires it to be set when calling bch2_trans_update(), and
adds various fixups to make that happen.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This drops bch2_btree_iter_peek_with_updates() and replaces it with a
new flag, BTREE_ITER_WITH_UPDATES, and also reworks
bch2_btree_iter_peek_slot() to respect it too.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This adds the ability for btree iterators to own child iterators - to be
used by an upcoming rework of bch2_btree_iter_peek_slot(), so we can
scan forwards while maintaining our current position.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Add a field to struct btree_iter for tracking whether it should be
locked - this fixes spurious transaction restarts in
bch2_trans_relock().
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This patch adds some new tracepoints to the btree iterator code, and
adds new fields to the existing tracepoints - primarily for the iterator
position.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
By changing it to upgrade iterators to intent locks to avoid lock
restarts we can simplify __bch2_btree_node_lock() quite a bit - this
fixes a probable bug where it could potentially drop a lock on an
unrelated error but still succeed instead of causing a transaction
restart.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
gcc is emitting rep stos here, which is silly (and slow) for an 8 byte
memset.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Eventually BTREE_ITER_NODES should be going away. This patch is to fix a
transaction iterator overflow in the btree node merge path because
BTREE_ITER_NODES iterators couldn't be reused.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Since we're no longer doing btree node merging post commit, we can now
delete a bunch of code.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This patch starts treating the bpos.snapshot field like part of the key
in the btree code:
* bpos_successor() and bpos_predecessor() now include the snapshot field
* Keys in btrees that will be using snapshots (extents, inodes, dirents
and xattrs) now always have their snapshot field set to U32_MAX
The btree iterator code gets a new flag, BTREE_ITER_ALL_SNAPSHOTS, that
determines whether we're iterating over keys in all snapshots or not -
internally, this controlls whether bkey_(successor|predecessor)
increment/decrement the snapshot field, or only the higher bits of the
key.
We add a new member to struct btree_iter, iter->snapshot: when
BTREE_ITER_ALL_SNAPSHOTS is not set, iter->pos.snapshot should always
equal iter->snapshot, which will be 0 for btrees that don't use
snapshots, and alsways U32_MAX for btrees that will use snapshots
(until we enable snapshot creation).
This patch also introduces a new metadata version number, and compat
code for reading from/writing to older versions - this isn't a forced
upgrade (yet).
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When we pass BTREE_INSERT_NOUNLOCK bch2_trans_commit isn't supposed to
unlock after a successful commit, but it was calling
bch2_trans_cond_resched() - oops.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
External (to the btree iterator code) users of bch2_btree_iter_traverse
expect that on success the iterator will be pointed at iter->pos and
have that position locked - but since we split iter->pos and
iter->real_pos, that means it has to update iter->real_pos if necessary.
Internal users don't expect it to modify iter->real_pos, so we need two
separate functions.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
peek() has to update iter->real_pos - there's no need for
bch2_btree_iter_set_pos() to update it as well.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The way btree iterators work internally has been changing, particularly
with the iter->real_pos changes, and bch2_btree_iter_next() is no longer
hyper optimized - it's just advance followed by peek, so it's more
efficient to just call advance where we're not using the return value of
bch2_btree_iter_next().
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This is a bit clearer than using bch2_btree_iter_free().
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Change fsck code to always put btree iterators - also, make some flow
control improvements to deal with lock restarts better, and refactor
check_extents() to not walk extents twice for counting/checking
i_sectors.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
New helper to clean things up a bit - also, improve iter->flags
handling.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The only reason we were keeping this around was for
BTREE_INSERT_NOUNLOCK semantics - if bch2_btree_iter_set_pos() advances
to the next leaf node, it'll drop the lock on the node that we just
inserted to.
But we don't rely on BTREE_INSERT_NOUNLOCK semantics for the extents
btree, just the inodes btree, and if we do need it for the extents btree
in the future we can do it more cleanly by cloning the iterator - this
lets us delete some special cases in the btree iterator code, which is
complicated enough as it is.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Also, print out more information on btree transaction iterator overflow.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Since we now always preallocate the maximum number of iterators when we
initialize a btree transaction, getting an iterator never fails - we can
delete a fair amount of error path code.
This patch also simplifies the iterator allocation code a bit.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The checks for lock ordering violations weren't quite right.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
There was a bug where bch2_trans_update() would incorrectly delete a
pending update where the new update did not actually overwrite the
existing update, because we were incorrectly using BTREE_ITER_TYPE when
sorting pending btree updates.
This affects the pending patch to use cached iterators for inode
updates.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This introduces a new kind of btree iterator, cached iterators, which
point to keys cached in a hash table. The cache also acts as a write
cache - in the update path, we journal the update but defer updating the
btree until the cached entry is flushed by journal reclaim.
Cache coherency is for now up to the users to handle, which isn't ideal
but should be good enough for now.
These new iterators will be used for updating inodes and alloc info (the
alloc and stripes btrees).
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_btree_iter_downgrade() was looping over all iterators in a
transaction; bch2_trans_downgrade() should be doing that.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
There was a bad interaction with bch2_btree_iter_set_pos_same_leaf(),
which can leave a btree node locked that is just outside iter->pos,
breaking the lock ordering checks in __bch2_btree_node_lock(). Ideally
we should get rid of this corner case, but for now fix it locally with
verbose comments.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Intented to help debug deadlocks, since we can't use lockdep to check
btree node lock ordering.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This will help with iterator overflow bugs.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Previously, BTREE_ID_INODES was special - inodes were indexed by the
inode field, which meant the offset field of struct bpos wasn't used,
which led to special cases in e.g. the btree iterator code.
Now, inodes in the inodes btree are indexed by the offset field.
Also: prevously min_key was special for extents btrees, min_key for
extents would equal max_key for the previous node. Now, min_key =
bkey_successor() of the previous node, same as non extent btrees.
This means we can completely get rid of
btree_type_sucessor/predecessor.
Also make some improvements to the metadata IO validate/compat code.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Introduce a new iterator method that provides a consistent view of the
btree plus uncommitted updates.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
More aggressively checking iterator invariants, and fixing the resulting
bugs. Also greatly simplifying iter_next() and iter_next_slot() - they
were hyper optimized before, but the optimizations were getting too
brittle.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
All iterators should be released now with bch2_trans_iter_put(), so
TRANS_RESET_ITERS shouldn't be needed anymore, and TRANS_RESET_MEM is
always used.
Also convert more code to __bch2_trans_do().
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This one takes an additional argument for whether we're searching for >=
or > the search key.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Clean up a bit of duplicated code.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
It needs to be called when we get -EINTR due to e.g. lock restart - this
fixes a transaction iterators overflow bug.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This is considerably cheaper than bch2_btree_node_iter_fix(), for cases
where the key was only modified and key ordering isn't changing.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The btree_trans struct needs to memoize/cache btree iterators, so that
on transaction restart we don't have to completely redo btree lookups,
and so that we can do them all at once in the correct order when the
transaction had to restart to avoid a deadlock.
This switches the btree iterator lookups to work based on iterator
position, instead of trying to match them up based on the stack trace.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Last of the basic operations for iterating forwards and backwards over
the btree: we now have
- peek(), returns key >= iter->pos
- next(), returns key > iter->pos
- peek_prev(), returns key <= iter->pos
- prev(), returns key < iter->pos
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This is prep work for the btree key cache: btree iterators will point to
either struct btree, or a new struct bkey_cached.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
exceptionally crappy "tracing", but it's a start at documenting the
places restarts can be triggered
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>