This switches btree_key_cache_fill() to use a btree iterator, not a
btree path, so that it can search for keys in previous snapshots.
We also add another iterator flag, BTREE_ITER_KEY_CACHE_FILL, to avoid
recursion back into the key cache.
This will allow us to re-enable the key cache for inodes in the next
patch.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
There was no reason for this to be a separate helper - we always want
the relock call that btree_trans_peek_key_cache() did.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This patch introduces
- bpos_eq()
- bpos_lt()
- bpos_le()
- bpos_gt()
- bpos_ge()
and equivalent replacements for bkey_cmp().
Looking at the generated assembly these could probably be improved
further, but we already see a significant code size improvement with
this patch.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When flags & btree_id are constants, we can constant fold the entire
calculation of the actual iterator flags - and the whole thing becomes
small enough to inline.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
- Marking a non-static function as inline doesn't actually work and is
now causing problems - drop that
- Introduce BCACHEFS_LOG_PREFIX for when we want to prefix log messages
with bcachefs (filesystem name)
- Userspace doesn't have real percpu variables (maybe we can get this
fixed someday), put an #ifdef around bch2_disk_reservation_add()
fastpath
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
btree_path_copy() doesn't need to call
bch2_btree_path_check_sort_fast() - the newly allocated path will always
be in the correct position, post copy; also delete some redundant
branches from __bch2_btree_path_make_mut().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
- Don't call into bch2_encrypt_bio() when we're not encrypting
- Pull slowpath out of trans_lock_write()
- Make sure bc2h_trans_journal_res_get() gets inlined.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
- In the btree iterator code that overlays keys from the journal, we
were incorrectly specifying level=0 instead of the btree_path's
current level in a few places
- When we didn't do journal replay, we shouldn't free the journal keys:
this fixes cmd_list and cmd_dump, which run in norecovery mode
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This moves the JOURNAL_REPLAY_DONE flag check from
bch2_trans_iter_init() to bch2_trans_init(), where we stash a copy in
btree_trans - gaining us a small performance improvement.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
checkpatch.pl gives lots of warnings that we don't want - suggested
ignore list:
ASSIGN_IN_IF
UNSPECIFIED_INT - bcachefs coding style prefers single token type names
NEW_TYPEDEFS - typedefs are occasionally good
FUNCTION_ARGUMENTS - we prefer to look at functions in .c files
(hopefully with docbook documentation), not .h
file prototypes
MULTISTATEMENT_MACRO_USE_DO_WHILE
- we have _many_ x-macros and other macros where
we can't do this
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Now we store the transaction's fn idx in a local variable, instead of
redoing the lookup every time we call bch2_trans_init().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
local_clock() isn't always completely accurate - e.g. on machines with
TSC drift - but ktime_get_ns() overhead is too high, unfortunately.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We were forgetting to count down the number of nodes to prefetch, firing
off _way_ more than intended - whoops.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
In the drop_alloc tests, we may end up calling
bch2_btree_iter_peek_slot() on an interior level that doesn't exist.
Previously, this would hit the path->uptodate assertion in
bch2_btree_path_peek_slot(); this path first checks a NULL btree node,
which is how we know we're at the end of the btree.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The btree iterator code may allocate extra btree paths, temporarily,
that do not refer to keys being returned: we don't need to wait until
transaction restart to drop these, when they're not referenced they
should be deleted right away.
This fixes a transaction path overflow bug.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
There was a rare bug when path->locks_want was nonzero, but not
BTREE_MAX_DEPTH, where we'd return on a valid node that wasn't locked -
oops.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This function is fairly small and only used in two places: one very hot,
the other cold, so it should definitely be inlined.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
- move slowpath code to a separate function, btree_path_overflow()
- no need to use hweight64
- copy nr_max_paths from btree_transaction_stats to btree_trans,
avoiding a data dependency in the fast path
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This adds a helper for printing a large buffer one line at a time, to
avoid the 1k printk limit.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Most of the node_relock_fail trace events are generated from
bch2_btree_path_verify_level(), when debugcheck_iterators is enabled -
but we're not interested in these trace events, they don't indicate that
we're in a slowpath.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
In the event that we're not finished debugging the cycle detector, this
adds a new file to debugfs that shows what the cycle detector finds, if
anything. By comparing this with btree_transactions, which shows held
locks for every btree_transaction, we'll be able to determine if it's
the cycle detector that's buggy or something else.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We've outgrown our own deadlock avoidance strategy.
The btree iterator API provides an interface where the user doesn't need
to concern themselves with lock ordering - different btree iterators can
be traversed in any order. Without special care, this will lead to
deadlocks.
Our previous strategy was to define a lock ordering internally, and
whenever we attempt to take a lock and trylock() fails, we'd check if
the current btree transaction is holding any locks that cause a lock
ordering violation. If so, we'd issue a transaction restart, and then
bch2_trans_begin() would re-traverse all previously used iterators, but
in the correct order.
That approach had some issues, though.
- Sometimes we'd issue transaction restarts unnecessarily, when no
deadlock would have actually occured. Lock ordering restarts have
become our primary cause of transaction restarts, on some workloads
totally 20% of actual transaction commits.
- To avoid deadlock or livelock, we'd often have to take intent locks
when we only wanted a read lock: with the lock ordering approach, it
is actually illegal to hold _any_ read lock while blocking on an intent
lock, and this has been causing us unnecessary lock contention.
- It was getting fragile - the various lock ordering rules are not
trivial, and we'd been seeing occasional livelock issues related to
this machinery.
So, since bcachefs is already a relational database masquerading as a
filesystem, we're stealing the next traditional database technique and
switching to a cycle detector for avoiding deadlocks.
When we block taking a btree lock, after adding ourself to the waitlist
but before sleeping, we do a DFS of btree transactions waiting on other
btree transactions, starting with the current transaction and walking
our held locks, and transactions blocking on our held locks.
If we find a cycle, we emit a transaction restart. Occasionally (e.g.
the btree split path) we can not allow the lock() operation to fail, so
if necessary we'll tell another transaction that it has to fail.
Result: trans_restart_would_deadlock events are reduced by a factor of
10 to 100, and we'll be able to delete a whole bunch of grotty, fragile
code.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
With the new deadlock cycle detector, it's critical that all held locks
be marked in a btree_path, because that's what the cycle detector
traverses - any locks that aren't correctly marked will cause deadlocks.
This changes the btree_path to allocate some btree_paths for the new
nodes, since until the final update is done we otherwise don't have a
path referencing them.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
With the upcoming cycle detector, we have to be careful about using
btree_node_lock_nopath - in particular, using it to take write locks can
cause deadlocks.
All held locks need to be tracked in a btree_path, so that the cycle
detector knows about them - unless we know that we cannot cause
deadlocks for other reasons: e.g. we are only taking read locks, or
we're in very early fsck (topology repair).
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add a type descriptor to btree_bkey_cached_common - there's no reason
not to since we've got padding that was otherwise unused, and this is a
nice cleanup (and helpful in later patches).
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Also, do some reorganizing/renaming, convert atomic counters in bch_fs
to persistent counters, and add a few missing counters.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Improve our debugfs output, to help in debugging deadlocks: this shows,
for every btree node we print, the current number of readers/intent
locks/write locks held.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This patch
- tracks maximum bch2_trans_kmalloc() memory used in btree_transaction_stats
- makes it available in debugfs
- switches bch2_trans_init() to using that for the amount of memory to
preallocate, instead of the parameter passed in
This drastically reduces transaction restarts, and means we no longer
need to track this in the source code.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Previously, we used two different bit arrays for tracking held btree
node locks. This patch switches to an array of two bit integers, which
will let us track, in a future patch, when we hold a write lock.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Held btree locks are tracked in btree_path->nodes_locked and
btree_path->nodes_intent_locked. Upcoming patches are going to change
the representation in struct btree_path, so this patch switches to
proper helpers instead of direct access to these fields.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Start to centralize some of the locking code in a new file; more locking
code will be moving here in the future.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
- fsck_inode_rm() wasn't returning BCH_ERR_transaction_restart_nested
- change bch2_trans_verify_not_restarted() to call panic() - we don't
want these errors to be missed
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
iter->k needs to be consistent with iter->pos - required for
bch2_btree_iter_(rewind|advance) to work correctly.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
When returning a key from the key cache, in BTREE_ITER_WITH_KEY_CACHE
mode, we don't want to set should_be_locked on iter->path; we're not
returning a key from that path, so we donn't need to, and also since we
traversed the key cache iterator before setting should_be_locked on that
path it might be unlocked (if we unlocked, bch2_trans_relock() won't
have relocked it).
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
We should be calling btree_node_mem_ptr_set() before path_level_init(),
since we already touched the key that btree_node_mem_ptr_set() will
modify and path_level_init() will be doing the lookup in the child btree
node we're recursing to.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Instead of counting transaction restarts, count when the transaction is
restarted: if bch2_trans_begin() was called when the transaction wasn't
restarted we need to ensure restart_count is still incremented.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
We need a way to check if the machinery for handling btree_paths with in
a transaction is behaving reasonably, as it often has not been - we've
had bugs with transaction path overflows caused by duplicate paths and
plenty of other things.
This patch tracks, per transaction fn, the most btree paths ever
allocated by that transaction and makes it available in debugfs.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Going to be adding more things to this in the next patch.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_path_put() is supposed to drop paths that aren't needed on
transaction restart, or to hold locks that we're supposed to keep until
transaction commit: but it was failing to free paths in some cases that
it should have, leading to transaction path overflows with lots of
duplicate paths.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
These were used more prior to getting rid of the in-memory bucket arrays
- they don't serve much purpose anymore, and deleting them lets us write
better assertions.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Our types are exported to the tracepoint code, so it's not necessary to
break things out individually when passing them to tracepoints - we can
also call other functions from TP_fast_assign().
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
It doesn't make any sense to set should_be_locked on btree_paths that
aren't locked, and is often a bug - this patch adds assertions and fixes
some of those bugs.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
- use strlcpy(), not strncpy()
- add tracepoints for btree_path alloc and free
- give the tracepoint for key cache upgrade fail a proper name
- add a tracepoint for btree_node_upgrade_fail
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
bch2_btree_trans_to_text() is used to print btree_transactions owned by
other threads; thus, it needs to be particularly careful. This fixes a
null ptr deref caused by racing with the owning thread changing
path->l[].b.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Currently seeing a very rare and difficult to explain btree_path
inconsistency - this patch adds assertions to the only place that seems
to be missing them.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
In CONFIG_BCACHEFS_DEBUG mode, we'll now randomly issue transaction
restarts - with a decaying probability based on the number of restarts
we've already had, to ensure that transactions eventually make forward
progress. This should help shake out some bugs.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Now that we have error codes, with subtypes, we can switch to our own
error code for transaction restarts - and even better, a distinct error
code for each transaction restart reason: clearer code and better
debugging.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This will help us improve nested transactions - we need to add
assertions that whenever an inner transaction handles a restart, it
still returns -EINTR to the outer transaction.
This also adds nested_lockrestart_do() and nested_commit_do() which use
the new counters to correctly return -EINTR when the transaction was
restarted.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
We now record the length of time btree locks are held and expose this in debugfs.
Enabled via CONFIG_BCACHEFS_LOCK_TIME_STATS.
Signed-off-by: Daniel Hill <daniel@gluo.nz>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We need the caller name and a place to store our results, btree_trans provides this.
Signed-off-by: Daniel Hill <daniel@gluo.nz>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We try to ensure we never hold btree locks for too long - bcachefs tries
to be soft realtime. This adds a check when restarting a transaction,
where a transaction restart is cheap - if we've been holding locks for
too long, drop and retake them.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
When inserting a key type that's not valid for a given btree, we should
print out which btree we were inserting into.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This converts bcachefs to the modern printbuf interface/implementation,
synced with the version to be submitted upstream.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Previously, on every btree_iter_peek() operation we were searching the
journal keys, doing a full binary search - which was slow.
This patch fixes that by saving our position in the journal keys, so
that we only do a full binary search when moving our position backwards
or a large jump forwards.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This is pretty expensive, and we've tested sufficiently with it now that
it doesn't need to be on by default.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This adds bch2_btree_iter_peek_all_levels(), which returns keys from
every level of the btree - interior nodes included - in monotonically
increasing order, soon to be used by the backpointers check & repair
code.
- BTREE_ITER_ALL_LEVELS can now be passed to for_each_btree_key() to
iterate thusly, much like BTREE_ITER_SLOTS
- The existing algorithm in bch2_btree_iter_advance() doesn't work with
peek_all_levels(): we have to defer the actual advancing until the
next time we call peek, where we have the btree path traversed and
uptodate. So, we add an advanced bit to btree_iter; when
BTREE_ITER_ALL_LEVELS is set bch2_btree_iter_advanced() just marks
the iterator as advanced.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This adds two new helpers to btree_iter.c for changing the level of a
path up or down - to be used by the new
bch2_btree_iter_peek_all_levels().
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
The new backpointers code will be using bch2_btree_iter_peek_slot() on
interior nodes - this patch updates peek_slot() to make that work.
- Pass the correct level to bch2_journal_keys_peek_slot()
- We should only set BTREE_ITER_CACHED or BTREE_ITER_WITH_KEY_CACHE
when using bch2_trans_iter_init(), not bch2_trans_node_iter_init()
- Update assertions
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
When many journal replay keys have been overwritten,
bch2_journal_keys_peek() was taking excessively long to scan before it
found a key to return.
Fix this by introducing bch2_journal_keys_peek_upto() which takes a
parameter for the end of the range we want, so that we can terminate the
search much sooner, and replace all uses of bch2_journal_keys_peek()
with peek_upto() or peek_slot().
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This fixes a bug where __bch2_btree_node_update_key() wasn't clearing
should_be_locked, leading to bch2_btree_path_traverse() always failing -
all callers of btree_path_make_mut() want should_be_locked cleared, so
do it there.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
bch2_btree_iter_next_node() was mucking with other btree_path state
without setting path->update to be consistent with the fact that the
path is very much no longer uptodate - oops.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This turns bch2_dump_trans_updates() into a to_text() method - this way
it can be used by debug tracing.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add a new error macro that also dumps transaction updates in addition to
doing an emergency shutdown - when a transaction update discovers or is
causing a fs inconsistency, it's helpful to see what updates it was
doing.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
In BTREE_ITER_FILTER_SNAPHOTS mode, we skip over keys in unrelated
snapshots. When we hit the end of an inode, if the next inode(s) are in
a different subvolume, we could potentially have to skip past many keys
before finding a key we can return to the caller, so they can terminate
the iteration.
This adds a peek_upto() variant to solve this problem, to be used when
we know the range we're searching within.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
traverse_all() traverses btree paths in sorted order, so it should never
see transaction restarts due to lock ordering violations. But some code
in __bch2_btree_path_upgrade(), while necessary when not running under
traverse_all(), was causing some confusing lock ordering violations -
disabling this code under traverse_all() will let us put in some more
assertions.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
In btree_path_traverse_all() we were failing to check for -EIO in the
retry loop, and after btree node read error we'd go into an infinite
loop.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
When bch2_trans_begin() is called and there hasn't been a transaction
restart, we presume that we're now doing something new - iterating over
different keys, and we now shouldn't keep aruond paths related to the
previous transaction, excepting the subvolumes btree.
This should fix some of our "transaction path overflow" bugs.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
In btree_update_interior.c, we were changing a path's level directly -
which affects path sort order - without re-sorting paths, leading to
assertions when bch2_path_get() verified paths were sorted correctly.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This patch changes printbufs dynamically allocate and reallocate a
buffer as needed. Stack usage has become a bit of a problem, and a major
cause of that has been static size string buffers on the stack.
The most involved part of this refactoring is that printbufs must now be
exited with printbuf_exit().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We're hitting a strange bug with transaction paths not being sorted
correctly - this dumps transaction paths in the order we thought was
sorted, which will hopefully shed some light as to what's going on.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
bch2_trans_begin() invalidates all iterators, until they're revalidated
by calling peek() or traverse().
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
The error code when we fail to allocate a node in the btree node cache
doesn't make it to bch2_btree_path_traverse_all(). Instead, we need to
stash a flag in btree_trans so we know we have to take the cannibalize
lock.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
The loop that traverses paths in traverse_all() needs to be a little bit
tricky, because traversing a path can cause other paths to be added (or
perhaps removed) at about the same position.
The old logic was buggy, replace it with simpler logic.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
__bch2_btree_node_lock() was implementing the wrong lock ordering for
cached vs. non cached paths - this fixes it to match the btree path sort
order as defined by __btree_path_cmp(), and also simplifies the code
some.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This consolidates some of the btree node lock path, so that when we're
blocked taking a write lock on a node it shows up in
bch2_btree_trans_to_text(), along with intent and read locks.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
The old .debugcheck methods are no more and this just calls the .invalid
method, which doesn't add much since we already check that when doing
btree updates and when reading metadata in.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
- Updates to non key cache iterators will now be transparently
redirected to the key cache for cached btrees.
- Except when creating new keys: then the update goes to underlying
btree
For for iterating over a cached btree to work, we need to ensure that if
a key exists in the key cache, it also exists in the btree - otherwise
the iterator code will skip past it and not check the key cache.
Otherwise, for consistency, all updates should go to the same place -
the key cache.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This is the start of cache coherency with the btree key cache - this
adds a btree iterator flag that causes lookups to also check the key
cache when we're iterating over the btree (not iterating over the key
cache).
Note that we could still race with another thread creating at item in
the key cache and updating it, since we aren't holding the key cache
locked if it wasn't found. The next patch for the update path will
address this by causing the transaction to restart if the key cache is
found to be dirty.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Previously, when doing updates and running triggers before journal
replay completes, triggers would see the incorrect key for the old key
being overwritten - this patch updates the trigger code to check the
journal keys when necessary, needed for the upcoming allocator rewrite.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
We currently need to call bch2_btree_path_peek_slot() multiple times in
the transaction commit path - and some of those need to be updated to
also check the keys from journal replay, too. Let's consolidate this and
stash the key being overwritten in btree_insert_entry.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>