Commit Graph

232 Commits

Author SHA1 Message Date
Kent Overstreet
1831840c2b bcachefs: Fix write buffer flushing from open journal entry
When flushing the btree write buffer, we pull write buffer keys directly
from the journal instead of letting the journal write path copy them to
the write buffer.

When flushing from the currently open journal buffer, we have to block
new reservations and wait for outstanding reservations to complete.

Recheck the reservation state after blocking new reservations:
previously, we were checking the reservation count from before calling
__journal_block().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-24 22:56:37 -04:00
Kent Overstreet
f5109c201c bcachefs: Use wait_on_allocator() when allocating journal
wait_on_allocator() emits debug info when we hang trying to allocate.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-24 18:16:01 -04:00
Kent Overstreet
2ba562cc04 bcachefs: pass last_seq into fs_journal_start()
Prep work for journal rewind, where the seq we're replaying from may be
different than the last journal entry's last_seq.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-15 22:11:56 -04:00
Kent Overstreet
09b9c72bd4 bcachefs: bch_err_throw()
Add a tracepoint for any time we return an error and unwind.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-02 12:16:35 -04:00
Kent Overstreet
18dad454cd bcachefs: Replace rcu_read_lock() with guards
The new guard(), scoped_guard() allow for more natural code.

Some of the uses with creative flow control have been left.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-01 00:03:12 -04:00
Kent Overstreet
d21262d4e3 bcachefs: bch2_dev_journal_bucket_delete()
Recover from "journal and btree in same bucket".

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-31 22:03:17 -04:00
Kent Overstreet
521f9584c2 bcachefs: Ensure we don't use a blacklisted journal seq
Different versions differ on the size of the blacklist range; it is
theoretically possible that we could end up with blacklisted journal
sequence numbers newer than the newest seq we find in the journal, and
pick a new start seq that's blacklisted.

Explicitly check for this in bch2_fs_journal_start().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-23 19:52:31 -04:00
Kent Overstreet
cca2c0d224 bcachefs: bch_dev.io_ref -> enumerated_ref
Convert device IO refs to enumerated_refs, for easier debugging of
refcount issues.

Simple conversion: enumerate all users and convert to the new helpers.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:28 -04:00
Kent Overstreet
c9b1d94a21 bcachefs: bch_fs.writes -> enumerated_refs
Drop the single-purpose write ref code in bcachefs.h, and convert to
enumarated refs.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:27 -04:00
Kent Overstreet
6d67de1079 bcachefs: for_each_rw_member_rcu()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:27 -04:00
Kent Overstreet
530112d88e bcachefs: BCH_FEATURE_small_image
We can't go RW if it's an image file that hasn't been resized.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:20 -04:00
Kent Overstreet
ebf561b208 bcachefs: print_str_as_lines() -> print_str()
bch2_print_string_as_lines() is a low level helper that allows messages
longer than 1k to be printed without truncation.

But we should always be printing with the helpers that take a filesystem
object, if we're in fsck they direct output to the userspace process
controlling fsck instead of the dmesg log.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:18 -04:00
Kent Overstreet
6f03e30e7c bcachefs: Clean up duplicated code in bch2_journal_halt()
It's now a wrapper around bch2_journal_halt_locked().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:13 -04:00
Kent Overstreet
2e0d51d00e bcachefs: bch2_dev_journal_alloc() now respects data_allowed
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:13 -04:00
Kent Overstreet
a17e985be9 bcachefs: Move various init code to _init_early()
_init_early() is for initialization that cannot fail, and often must
happen for teardown partway through initialization to work.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:02 -04:00
Kent Overstreet
25ee021c7f bcachefs: simplify journal pin initialization
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:13:59 -04:00
Kent Overstreet
4c327d03d7 bcachefs: Change __journal_entry_close() assert to ERO
We've got some reports of this happening in the wild, and need a bit
more info to debug it:

https://github.com/koverstreet/bcachefs/issues/854
https://www.reddit.com/r/bcachefs/comments/1k28kjm/surprise_soft_lockup/

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-20 19:41:38 -04:00
Kent Overstreet
4c0d2c67ac bcachefs: Fix early startup error path
Don't set JOURNAL_running until we're also calling
journal_space_available() for the first time.

If JOURNAL_running is set, shutdown will write an empty journal entry -
but this will hit an assert in journal_entry_open() if we've never
called journal_space_available().

Reported-by: syzbot+53bb24d476ef8368a7f0@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-20 19:41:38 -04:00
Kent Overstreet
77ad1df82b bcachefs: Fix "journal stuck" during recovery
If we crash when the journal pin fifo is completely full - i.e. we're
at the maximum number of dirty journal entries - that may put us in a
sticky situation in recovery, as journal replay will need to be able to
open new journal entries in order to get going.

bch2_fs_journal_start() already had provisions for resizing the journal
pin fifo if needed, but it needs a fudge factor to ensure there's room
for journal replay.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-03 12:11:43 -04:00
Kent Overstreet
dcffc3b1ae bcachefs: Split up bch_dev.io_ref
We now have separate per device io_refs for read and write access.

This fixes a device removal bug where the discard workers were still
running while we're removing alloc info for that device.

It's also a bit of hardening; we no longer allow writes to devices that
are read-only.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-02 10:24:34 -04:00
Kent Overstreet
2b47102b93 bcachefs: Reorder error messages that include journal debug
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-30 16:36:27 -04:00
Kent Overstreet
e1e50a6330 bcachefs: Use print_string_as_lines() for journal stuck messages
They were being truncated, printk has a 1k limit per call

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-25 11:49:46 -04:00
Kent Overstreet
5ae6f33053 bcachefs: zero init journal bios
fix a kmsan splat

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:37 -04:00
Kent Overstreet
4a4000b9a6 bcachefs: Kill JOURNAL_ERRORS()
Convert these to standard error codes, which means we can pass them
outside the journal code, they're easier to pass to tracepoints, etc.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:35 -04:00
Alan Huang
dd7ae389ff bcachefs: Remove spurious smp_mb()
The smp_mb() is paired with nothing.

Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:35 -04:00
Kent Overstreet
cb87f623c1 bcachefs: minor journal errcode cleanup
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:14 -04:00
Kent Overstreet
898bda5b72 bcachefs: Increase JOURNAL_BUF_NR
Increase journal pipelining.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:13 -04:00
Kent Overstreet
35282ce9e8 bcachefs: Free journal bufs when not in use
Since we're increasing the number of 'struct journal_bufs', we don't
want them all permanently holding onto buffers for the journal data -
that'd be 16 * 2MB = 32MB, or potentially more.

Add a single-element mempool (open coded, since buffer size varies),
this also means we won't be hitting the memory allocator every time we
open and close a journal entry/buffer.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:13 -04:00
Kent Overstreet
199a3578ed bcachefs: Kill journal_res.idx
More dead code.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:12 -04:00
Kent Overstreet
c2be81d48a bcachefs: Kill journal_res_state.unwritten_idx
Dead code

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:12 -04:00
Kent Overstreet
33255c161a bcachefs: Fix bch2_dev_journal_alloc() spuriously failing
Previously, we fixed journal resize spuriousl failing with
-BCH_ERR_open_buckets_empty, but initial journal allocation was missed
because it didn't invoke the "block on allocator" loop at all.

Factor out the "loop on allocator" code to fix that.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-06 18:15:01 -05:00
Kent Overstreet
7909d1fb90 bcachefs: Check for -BCH_ERR_open_buckets_empty in journal resize
This fixes occasional failures from journal resize.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-02-26 19:31:05 -05:00
Kent Overstreet
9e9033522a bcachefs: Fix discard path journal flushing
The discard path is supposed to issue journal flushes when there's too
many buckets empty buckets that need a journal commit before they can be
written to again, but at some point this code seems to have been lost.

Bring it back with a new optimization to make sure we don't issue too
many journal flushes: the journal now tracks the sequence number of the
most recent flush in progress, which the discard path uses when deciding
which buckets need a journal flush.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-02-06 22:35:11 -05:00
Jeongjun Park
2ef995df0c bcachefs: fix deadlock in journal_entry_open()
In the previous commit b3d82c2f27, code was added to prevent journal sequence
overflow. Among them, the code added to journal_entry_open() uses the
bch2_fs_fatal_err_on() function to handle errors.

However, __journal_res_get() , which calls journal_entry_open() , calls
journal_entry_open() while holding journal->lock , but bch2_fs_fatal_err_on()
internally tries to acquire journal->lock , which results in a deadlock.

So we need to add a locked helper to handle fatal errors even when the
journal->lock is held.

Fixes: b3d82c2f27 ("bcachefs: Guard against journal seq overflow")
Signed-off-by: Jeongjun Park <aha310510@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-02-06 22:35:11 -05:00
Kent Overstreet
35f5197009 bcachefs: Improve journal pin flushing
Running the preempt tiering tests with a lower than normal journal
reclaim delay turned up a shutdown hang - a lost wakeup, caused because
flushing a journal pin (e.g. key cache/write buffer) can generate a new
journal pin.

The "simple" fix of adding the correct wakeup didn't work because of
ordering issues; if we flush btree node pins too aggressively before
other pins have completed, we end up spinning where each flush iteration
generates new work.

So to fix this correctly:
- The list of flushed journal pins is now broken out by type, so that
  we can wait for key cache/write buffer pin flushing to complete
  before flushing dirty btree nodes

- A new closure_waitlist is added for bch2_journal_flush_pins; this one
  is only used under or when we're taking the journal lock, so it's
  pretty cheap to add rigorously correct wakeups to journal_pin_set()
  and journal_pin_drop().

Additionally, bch2_journal_seq_pins_to_text() is moved to
journal_reclaim.c, where it belongs, along with a bit of other small
renaming and refactoring.

Besides fixing the hang, the better ordering between key cache/write
buffer flushing and btree node flushing should help or fix the "unmount
taking excessively long" a few users have been noticing.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-25 19:37:43 -05:00
Kent Overstreet
2c5d8a8347 bcachefs: "Journal stuck" timeout now takes into account device latency
If a block device (e.g. your typical consumer SSD) is taking multiple
seconds for IOs (typically flushes), we don't want to emit the "journal
stuck" message prematurely.

Also, make sure to drop the btree_trans srcu lock if we're blocking for
more than a second.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-21 18:32:05 -05:00
Kent Overstreet
89e74eccab bcachefs: bch2_journal_noflush_seq() now takes [start, end)
Harder to screw up if we're explicit about the range, and more correct
as journal reservations can be outstanding on multiple journal entries
simultaneously.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:22 -05:00
Kent Overstreet
ff7e7c5367 bcachefs: Journal write path refactoring, debug improvements
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:22 -05:00
Kent Overstreet
b3d82c2f27 bcachefs: Guard against journal seq overflow
Wraparound is impractical to handle since in various places we use 0 as
a sentinal value - but 64 bits (or 56, because the btree write buffer
steals a few bits) is enough for all practical purposes.

Reported-by: syzbot+73ed43fbe826227bd4e0@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:20 -05:00
Kent Overstreet
828552ca74 bcachefs: Kill bch2_bucket_alloc_new_fs()
The early-early allocation path, bch2_bucket_alloc_new_fs(), is no
longer needed - and inconsistencies around new_fs_bucket_idx have been a
frequent source of bugs.

Reported-by: syzbot+592425844580a6598410@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:19 -05:00
Kent Overstreet
6534a404d4 bcachefs: errcode cleanup: journal errors
Instead of throwing standard error codes, we should be throwing
dedicated private error codes, this greatly improves debugability.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:19 -05:00
Kent Overstreet
0eaac0b44f bcachefs: btree_write_buffer_flush_seq() no longer closes journal
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:19 -05:00
Kent Overstreet
c601e5d7da bcachefs: Can now block journal activity without closing cur entry
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:19 -05:00
Kent Overstreet
fb8c835b18 bcachefs: bch2_journal_meta() takes ref on c->writes
This part of addressing
https://github.com/koverstreet/bcachefs/issues/656

where we're getting stuck in bch2_journal_meta() in the dump tool.

We shouldn't be invoking the journal without a ref on c->writes (if
we're not RW), and there's no reason for the dump tool to be going
read-write.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:15 -05:00
Kent Overstreet
3956ff8bc2 bcachefs: Don't use wait_event_interruptible() in recovery
Fix a bug where mount was failing with -ERESTARTSYS:
https://github.com/koverstreet/bcachefs/issues/741

We only want the interruptible wait when called from fsync.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-20 16:50:14 -04:00
Kent Overstreet
a7e2dd58fb bcachefs: Check if stuck in journal_res_get()
Like how we already do when the allocator seems to be stuck, check if
we're waiting too long for a journal reservation and print some debug
info.

This is specifically to track down
https://github.com/koverstreet/bcachefs/issues/656

which is showing up in userspace where we don't have sysfs/debugfs to
get the journal debug info.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-09 16:57:59 -04:00
Kent Overstreet
7f2de6947f bcachefs: Fix warning in bch2_fs_journal_stop()
j->last_empty_seq needs to match j->seq when the journal is empty

Reported-by: syzbot+4093905737cf289b6b38@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-22 02:07:23 -04:00
Uros Bizjak
68573b936d bcachefs: Use try_cmpxchg() family of functions instead of cmpxchg()
Use try_cmpxchg() family of functions instead of
cmpxchg (*ptr, old, new) == old. x86 CMPXCHG instruction returns
success in ZF flag, so this change saves a compare after cmpxchg
(and related move instruction in front of cmpxchg).

Also, try_cmpxchg() implicitly assigns old *ptr value to "old" when
cmpxchg fails. There is no need to re-read the value in the loop.

No functional change intended.

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:12 -04:00
Kent Overstreet
ef05bdf5d6 bcachefs: Add missing printbuf_tabstops_reset() calls
Fixes warnings from bch2_print_allocator_stuck()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-29 18:14:18 -04:00
Kent Overstreet
44ec599035 bcachefs: Don't use the new_fs() bucket alloc path on an initialized fs
On a new filesystem or device we have to allocate the journal with a
bump allocator, because allocation info isn't ready yet - but when
hot-adding a device that doesn't have a journal, we don't want to use
that path.

Reported-by: syzbot+24a867cb90d8315cccff@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-28 19:47:31 -04:00