We're ramping up on checking transaction restart handling correctness -
so, in debug mode we now save a backtrace for where the restart was
emitted, which makes it much easier to track down the incorrect
handling.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
lock_fail_root_changed has not been used since commit
0d7009d7ca ("bcachefs: Delete old deadlock avoidance code")
Remove it.
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add NULL check for key returned from bch2_btree_and_journal_iter_peek in
btree_node_iter_and_journal_peek to avoid NULL ptr dereference in
bch2_bkey_buf_reassemble.
When key returned from bch2_btree_and_journal_iter_peek is NULL it means
that btree topology needs repair. Print topology error message with
position at which node wasn't found, its parent node information and
btree_id with level.
Return error code returned by bch2_topology_error to ensure that topology
error is handled properly by recovery.
Reported-by: syzbot+005ef9aa519f30d97657@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=005ef9aa519f30d97657
Fixes: 5222a4607c ("bcachefs: BTREE_ITER_WITH_JOURNAL")
Suggested-by: Alan Huang <mmpgouride@gmail.com>
Suggested-by: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Piotr Zalewski <pZ010001011111@proton.me>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
for_each_btree_node() now works similarly to for_each_btree_key(), where
the loop body is passed as an argument to be passed to lockrestart_do().
This now calls trans_begin() on every loop iteration - which fixes an
SRCU warning in backpointers fsck.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This fixes a bug exposed by the next path - we pop an assert in
path_set_should_be_locked().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If a btree_trans is in use it's supposed to be passed to fsck_err so
that it can be unlocked if we're waiting on userspace input; but the
btree IO paths do call fsck errors where a btree_trans exists on the
stack but it's not passed through.
But it's ok, because it's unlocked while doing IO.
Fixes: a850bde649 ("bcachefs: fsck_err() may now take a btree_trans")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This adds lockdep tracking for held btree locks with a single dep_map in
btree_trans, i.e. tracking all held btree locks as one object.
This is more practical and more useful than having lockdep track held
btree locks individually, because
- we can take more locks than lockdep can track (unbounded, now that we
have dynamically resizable btree paths)
- there's no lock ordering between btree locks for lockdep to track (we
do cycle detection)
- and this makes it easy to teach lockdep that btree locks are not safe
to hold while invoking memory reclaim.
The last rule is one that lockdep would never learn, because we only do
trylock() from within shrinkers - but we very much do not want to be
invoking memory reclaim while holding btree node locks.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Don't allocate the new bkey_cached until after we've done the btree
lookup; this means we can kill bkey_cached.valid.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
fsck_err() now optionally takes a btree_trans; if the current thread has
one, it is required that it be passed.
The next patch will use this to unlock when waiting for user input.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Main part of the disk accounting rewrite.
This is a wholesale rewrite of the existing disk space accounting, which
relies on percepu counters that are sharded by journal buffer, and
rolled up and added to each journal write.
With the new scheme, every set of counters is a distinct key in the
accounting btree; this fixes scaling limitations of the old scheme,
where counters took up space in each journal entry and required multiple
percpu counters.
Now, in memory accounting requires a single set of percpu counters - not
multiple for each in flight journal buffer - and in the future we'll
probably also have counters that don't use in memory percpu counters,
they're not strictly required.
An accounting update is now a normal btree update, using the btree write
buffer path. At transaction commit time, we apply accounting updates to
the in memory counters, which are percpu counters indexed in an
eytzinger tree by the accounting key.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The debug code relies on btree_trans_list being ordered so that it can
resume on subsequent calls or lock restarts.
However, it was using trans->locknig_wait.task.pid, which is incorrect
since btree_trans objects are cached and reused - typically by different
tasks.
Fix this by switching to pointer order, and also sort them lazily when
required - speeding up the btree_trans_get() fastpath.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
debug.c was using closure_get() on a different thread's closure where
the we don't know if the object being refcounted is alive.
We keep btree_trans objects on a list so they can be printed by debug
code, and because it is cost prohibitive to touch the btree_trans list
every time we allocate and free btree_trans objects, cached objects are
also on this list.
However, we do not want the debug code to see cached but not in use
btree_trans objects - critically because the btree_paths array will have
been freed (if it was reallocated).
closure_get() is also incorrect to use when that get may race with it
hitting zero, i.e. we must already have a ref on the object or know the
ref can't currently hit 0 for other reasons (as used in the cycle
detector).
to fix this, use the previously introduced closure_get_not_zero(),
closure_return_sync(), and closure_init_stack_release(); the debug code
now can only take a ref on a trans object if it's alive and in use.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
reference: https://github.com/koverstreet/bcachefs/issues/692
trans->ref is the reference used by the cycle detector, which walks
btree_trans objects of other threads to walk the graph of held locks and
issue wakeups when an abort is required.
We have to wait for the ref to go to 1 before freeing trans->paths or
clearing trans->locking_wait.task.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
for forwards compat we now explicitly allow mounting and using
filesystems with unknown btrees, and we have to walk them for fsck.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
With the new assertions, we shouldn't be holding locks when
trans->locked is false, thus, we shouldn't use relock when we just want
to check if we can relock.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add a field for tracking whether a transaction object holds btree locks,
and assertions to verify state.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Combine iter/update/trigger/str_hash flags into a single enum, and
x-macroize them for a to_text() function later.
These flags are all for a specific iter/key/update context, so it makes
sense to group them together - iter/update/trigger flags were already
given distinct bits, this cleans up and unifies that handling.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Consolidate bch2_gc_check_topology() and btree_node_interior_verify(),
and replace them with an improved version,
bch2_btree_node_check_topology().
This checks that children of an interior node correctly span the full
range of the parent node with no overlaps.
Also, ensure that topology repairs at runtime are always a fatal error;
in particular, this adds a check in btree_iter_down() - if we don't find
a key while walking down the btree that's indicative of a topology error
and should be flagged as such, not a null ptr deref.
Some checks in btree_update_interior.c remaining BUG_ONS(), because we
already checked the node for topology errors when starting the update,
and the assertions indicate that we _just_ corrupted the btree node -
i.e. the problem can't be that existing on disk corruption, they
indicate an actual algorithmic bug.
In the future, we'll be annotating the fsck errors list with which
recovery pass corrects them; the open coded "run explicit recovery pass
or fatal error" in bch2_btree_node_check_topology() will in the future
be done for every fsck_err() call.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The old code doesn't consider the mem alloced from mempool when call
krealloc on trans->mem. Also in bch2_trans_put, using mempool_free to
free trans->mem by condition "trans->mem_bytes == BTREE_TRANS_MEM_MAX"
is inaccurate when trans->mem was allocated by krealloc function.
Instead, we use used_mempool stuff to record the situation, and realloc
or free the trans->mem in elegant way.
Also, after krealloc failed in __bch2_trans_kmalloc, the old data
should be copied to the new buffer when alloc from mempool_alloc.
Fixes: 31403dca5b ("bcachefs: optimize __bch2_trans_get(), kill DEBUG_TRANSACTIONS")
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
THis silences a mm/page_alloc.c warning about allocating more than a
page with GFP_NOFAIL - and there's no reason for this to not have a
vmalloc fallback anyways.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When bch2_btree_iter_peek_slot() clones the iterator to search for the
next key, and then discovers that the key from the cloned iterator is
the key we want to return - we also want to save the
iter->key_cache_path as well, for the update path.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This converts -EIOs related to btree node errors to private error codes,
which will help with some ongoing debugging by giving us better error
messages.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
we now always have a btree_trans when using a btree_and_journal_iter;
prep work for adding prefetching to btree_and_journal_iter
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We were failing to set path->uptodate when reaching the end of a btree
node iterator, causing the new prefetch code for backpointers gc to go
into an infinite loop.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When a btree root is unreadable, we still might be able to get some data
back by replaying what's in the journal. Previously though, we got
confused when journal replay would attempt to replay a key for a level
that didn't exist.
This adds bch2_btree_increase_depth(), so that journal replay can handle
this.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If we're in FILTER_SNAPSHOTS mode and we start scanning a range of the
keyspace where no keys are visible in the current snapshot, we have a
problem - we'll scan for a very long time before scanning terminates.
Awhile back, this was fixed for most cases with peek_upto() (and
assertions that enforce that it's being used).
But the fix missed the fact that the inodes btree is different - every
key offset is in a different snapshot tree, not just the inode field.
Fixes:
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
refactoring the BTREE_ITER_WITH_UPDATES code, prep for removing the flag
and making it always-on
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
refactoring the BTREE_ITER_WITH_UPDATES code, prep for removing the flag
and making it always-on
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
XXX: we're allocating memory with btree locks held - bad
We need to plumb through an error path so we can do
allocate_dropping_locks() - but we're merging this now because it fixes
a transaction path overflow caused by indirect extent fragmentation, and
the resize path is rare.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Start to plumb through dynamically growable btree_paths; this patch
replaces most BTREE_ITER_MAX references with trans->nr_paths.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
the reflink triggers are also bumping up against the maximum number of
paths in a transaction - and generating proportional numbers of updates.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
- Some tweaks to greatly reduce locking overhead for the list of btree
transactions, so that it can always be enabled: leave btree_trans
objects on the list when they're on the percpu single item freelist,
and only check for duplicates in the same process when
CONFIG_BCACHEFS_DEBUG is enabled
- don't zero out the full btree_trans() unless we allocated it from
the mempool
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Upcoming patches are going to be changing trans->paths to a
reallocatable buffer. We need to guard against use after free when it's
used by other threads; this introduces RCU protection to those paths and
changes them to check for trans->paths == NULL
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
path->idx is now a code smell: we should be using path_idx_t, since it's
stable across btree path reallocation.
This is also a bit faster, using the same loop counter vs. fetching
path->idx from each path we iterate over.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
In the future we'll be making trans->paths resizable and potentially
having _many_ more paths (for fsck); we need to start fixing algorithms
that walk each path in a transaction where possible.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Instead of using a darray, we now allocate journal entries for the
transaction commit path with our normal bump allocator - with an inlined
fastpath, and using btree_transaction_stats to remember how much to
initially allocate so as to avoid transaction restarts.
This is prep work for converting write buffer updates to use this
mechanism.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
These were for extra info in tracepoints for debugging a specialized
issue - we do not want to bloat btree_path for this, at least in release
builds.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
In the CI, we're seeing tests failing due to excessive would_deadlock
transaction restarts - the tracepoint now includes the lock cycle that
occured.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Now we can print out filesystem flags in sysfs, useful for debugging
various "what's my filesystem doing" issues.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
For BTREE_ITER_WITH_JOURNAL, we memoize lookups in the journal keys, to
avoid the binary search overhead.
Previously we stashed the pos of the last key returned from the journal,
in order to force the lookup to be redone when rewinding.
Now bch2_journal_keys_peek_upto() handles rewinding itself when
necessary - so we can slim down btree_iter.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
As discussed in the previous patch, BTREE_ITER_ALL_LEVELS appears to be
racy with concurrent interior node updates - and perhaps it is fixable,
but it's tricky and unnecessary.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
path->level was being read, but never used.
Reported-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
peek_upto() checks against the end position and bails out before
FILTER_SNAPSHOTS checks; this is because if we end up at a different
inode number than the original search key none of the keys we see might
be visibile in the current snapshot - we might be looking at inode in a
completely different subvolume.
But this is broken, because when we're iterating over extents we're
checking against the extent start position to decide when to bail out,
and the extent start position isn't monotonically increasing until after
we've run FILTER_SNAPSHOTS.
Fix this by adding a simple inode number check where the old bailout
check was, and moving the main check to the correct position.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Reported-by: "Carl E. Thompson" <list-bcachefs@carlthompson.net>
When bch2_fs_alloc() gets an error before calling
bch2_fs_btree_iter_init(), bch2_fs_btree_iter_exit() makes an invalid
memory access because btree_trans_list is uninitialized.
Signed-off-by: Thomas Bertschinger <tahbertschinger@gmail.com>
Fixes: 6bd68ec266 ("bcachefs: Heap allocate btree_trans")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The btree iterator code overlays keys from the journal until journal
replay is finished; since we're now starting copygc/rebalance etc.
before replay is finished, this is multithreaded access and thus needs
refcounting.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This deletes the complicated and somewhat expensive journal
pre-reservation machinery in favor of just using journal watermarks:
when the journal is more than half full, we run journal reclaim more
aggressively, and when the journal is more than 3/4s full we only allow
journal reclaim to get new journal reservations.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
BTREE_INSERT_NOJOURNAL is primarily used for a performance optimization
related to inode updates and fsync - document it.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>