74 Commits

Author SHA1 Message Date
Kent Overstreet
2ba562cc04 bcachefs: pass last_seq into fs_journal_start()
Prep work for journal rewind, where the seq we're replaying from may be
different than the last journal entry's last_seq.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-15 22:11:56 -04:00
Kent Overstreet
d21262d4e3 bcachefs: bch2_dev_journal_bucket_delete()
Recover from "journal and btree in same bucket".

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-31 22:03:17 -04:00
Kent Overstreet
6f03e30e7c bcachefs: Clean up duplicated code in bch2_journal_halt()
It's now a wrapper around bch2_journal_halt_locked().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:13 -04:00
Kent Overstreet
a17e985be9 bcachefs: Move various init code to _init_early()
_init_early() is for initialization that cannot fail, and often must
happen for teardown partway through initialization to work.
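
The split described above can be sketched as follows. This is a minimal illustration, not bcachefs code: `struct subsys`, `subsys_init_early()`, `subsys_init()` and `subsys_exit()` are hypothetical names. The assumption is that the early step only puts fields into a state teardown can handle, so teardown is safe even when the fallible step never ran or failed partway.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical subsystem, for illustration only; not bcachefs code. */
struct subsys {
	bool	early_done;	/* set by subsys_init_early() */
	int	*buf;		/* allocated by subsys_init(), may stay NULL */
};

/* Cannot fail: only puts fields into a state that teardown can handle. */
static void subsys_init_early(struct subsys *s)
{
	s->early_done = true;
	s->buf = NULL;
}

/* May fail: performs the allocation. Returns 0 on success, -1 on failure. */
static int subsys_init(struct subsys *s, bool simulate_failure)
{
	assert(s->early_done);

	if (simulate_failure)
		return -1;

	s->buf = calloc(16, sizeof(*s->buf));
	return s->buf ? 0 : -1;
}

/* Safe after subsys_init_early(), whether or not subsys_init() succeeded. */
static void subsys_exit(struct subsys *s)
{
	free(s->buf);		/* free(NULL) is a no-op */
	s->buf = NULL;
}
```

A caller would run `subsys_init_early()` unconditionally, then `subsys_init()`, and call `subsys_exit()` on any error path without tracking how far initialization got.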

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:02 -04:00
Kent Overstreet
4c0d2c67ac bcachefs: Fix early startup error path
Don't set JOURNAL_running until we're also calling
journal_space_available() for the first time.

If JOURNAL_running is set, shutdown will write an empty journal entry -
but this will hit an assert in journal_entry_open() if we've never
called journal_space_available().

Reported-by: syzbot+53bb24d476ef8368a7f0@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-20 19:41:38 -04:00
Alan Huang
5d361ae5af bcachefs: Add missing smp_rmb()
The smp_rmb() guarantees that reads from reservations.counter
occur before accessing cur_entry_u64s. It's paired with the
atomic64_try_cmpxchg in journal_entry_open.

Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:35 -04:00
Alan Huang
5cc0ab39fb bcachefs: Fix incorrect state count
atomic64_read(&j->seq) - j->seq_write_started == JOURNAL_STATE_BUF_NR is
the condition in journal_entry_open where we return JOURNAL_ERR_max_open,
so journal_cur_seq(j) - seq == JOURNAL_STATE_BUF_NR means that the buf
corresponding to seq has started to write.

Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:35 -04:00
Kent Overstreet
898bda5b72 bcachefs: Increase JOURNAL_BUF_NR
Increase journal pipelining.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:13 -04:00
Kent Overstreet
2e853fdbc7 bcachefs: Don't touch journal_buf->data->seq in journal_res_get
This is a small optimization, reducing the number of cachelines we touch
in the fast path - and it's also necessary for the next patch that
increases JOURNAL_BUF_NR.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:13 -04:00
Kent Overstreet
199a3578ed bcachefs: Kill journal_res.idx
More dead code.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:12 -04:00
Jeongjun Park
2ef995df0c bcachefs: fix deadlock in journal_entry_open()
In the previous commit b3d82c2f27, code was added to prevent journal sequence
overflow. Among them, the code added to journal_entry_open() uses the
bch2_fs_fatal_err_on() function to handle errors.

However, __journal_res_get(), which calls journal_entry_open(), calls
journal_entry_open() while holding journal->lock, but bch2_fs_fatal_err_on()
internally tries to acquire journal->lock, which results in a deadlock.

So we need to add a locked helper to handle fatal errors even when the
journal->lock is held.
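
The locked-helper pattern can be sketched like this. It is a simplified illustration under stated assumptions, not the actual patch: the toy lock, `struct toy_journal` and every `toy_*` function are hypothetical, and the toy lock asserts on a double acquire where a real non-recursive lock would deadlock.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy non-recursive lock: a second acquire trips an assertion here,
 * standing in for the deadlock a real lock would produce. */
struct toy_lock { bool held; };

static void toy_lock_acquire(struct toy_lock *l)
{
	assert(!l->held);
	l->held = true;
}

static void toy_lock_release(struct toy_lock *l)
{
	assert(l->held);
	l->held = false;
}

struct toy_journal {
	struct toy_lock	lock;
	bool		halted;
};

/* Caller must already hold j->lock. */
static void toy_journal_halt_locked(struct toy_journal *j)
{
	assert(j->lock.held);
	j->halted = true;
}

/* Caller must not hold j->lock: thin wrapper around the locked variant. */
static void toy_journal_halt(struct toy_journal *j)
{
	toy_lock_acquire(&j->lock);
	toy_journal_halt_locked(j);
	toy_lock_release(&j->lock);
}

/* Runs under j->lock, so on a fatal error it must use the _locked
 * variant; calling toy_journal_halt() here would re-acquire the lock.
 * Returns 0 on success, -1 on the fatal error. */
static int toy_entry_open(struct toy_journal *j, bool seq_overflow)
{
	int ret = 0;

	toy_lock_acquire(&j->lock);
	if (seq_overflow) {
		toy_journal_halt_locked(j);
		ret = -1;
	}
	toy_lock_release(&j->lock);
	return ret;
}
```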

Fixes: b3d82c2f27 ("bcachefs: Guard against journal seq overflow")
Signed-off-by: Jeongjun Park <aha310510@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-02-06 22:35:11 -05:00
Kent Overstreet
35f5197009 bcachefs: Improve journal pin flushing
Running the preempt tiering tests with a lower than normal journal
reclaim delay turned up a shutdown hang - a lost wakeup, caused because
flushing a journal pin (e.g. key cache/write buffer) can generate a new
journal pin.

The "simple" fix of adding the correct wakeup didn't work because of
ordering issues; if we flush btree node pins too aggressively before
other pins have completed, we end up spinning where each flush iteration
generates new work.

So to fix this correctly:
- The list of flushed journal pins is now broken out by type, so that
  we can wait for key cache/write buffer pin flushing to complete
  before flushing dirty btree nodes

- A new closure_waitlist is added for bch2_journal_flush_pins; this one
  is only used under or when we're taking the journal lock, so it's
  pretty cheap to add rigorously correct wakeups to journal_pin_set()
  and journal_pin_drop().

Additionally, bch2_journal_seq_pins_to_text() is moved to
journal_reclaim.c, where it belongs, along with a bit of other small
renaming and refactoring.

Besides fixing the hang, the better ordering between key cache/write
buffer flushing and btree node flushing should help or fix the "unmount
taking excessively long" a few users have been noticing.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-25 19:37:43 -05:00
Kent Overstreet
2c5d8a8347 bcachefs: "Journal stuck" timeout now takes into account device latency
If a block device (e.g. your typical consumer SSD) is taking multiple
seconds for IOs (typically flushes), we don't want to emit the "journal
stuck" message prematurely.

Also, make sure to drop the btree_trans srcu lock if we're blocking for
more than a second.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-21 18:32:05 -05:00
Kent Overstreet
89e74eccab bcachefs: bch2_journal_noflush_seq() now takes [start, end)
Harder to screw up if we're explicit about the range, and more correct
as journal reservations can be outstanding on multiple journal entries
simultaneously.
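
As a small illustration of the half-open [start, end) convention, the helpers below treat start as included and end as excluded. The names `seq_in_range()` and `seq_range_len()` are hypothetical and do not reflect the real function's signature.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if seq lies in the half-open range [start, end). */
static bool seq_in_range(uint64_t seq, uint64_t start, uint64_t end)
{
	assert(start <= end);
	return seq >= start && seq < end;
}

/* Number of sequence numbers covered by [start, end). */
static uint64_t seq_range_len(uint64_t start, uint64_t end)
{
	assert(start <= end);
	return end - start;
}
```

With this convention an empty range is simply start == end, and adjacent ranges [a, b) and [b, c) cover every sequence number exactly once.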

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:22 -05:00
Kent Overstreet
6534a404d4 bcachefs: errcode cleanup: journal errors
Instead of throwing standard error codes, we should be throwing
dedicated private error codes; this greatly improves debuggability.
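
One way to picture private error codes is an x-macro table that gives each code a distinct name plus the standard errno it is reported as. This is a sketch under assumptions: the `TOY_ERR_*` names, the numeric base and the errno mapping are all hypothetical and are not the codes this commit adds.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Hypothetical private codes: name, and the standard errno each one
 * maps to for callers that only understand errnos. */
#define TOY_ERRCODES()						\
	x(TOY_ERR_journal_full,		ENOSPC)			\
	x(TOY_ERR_journal_shutdown,	EROFS)			\
	x(TOY_ERR_journal_stuck,	EIO)

enum toy_errcode {
	TOY_ERR_START = 2048,	/* assumed to sit above standard errnos */
#define x(name, cls) name,
	TOY_ERRCODES()
#undef x
	TOY_ERR_MAX
};

/* Name of a private code, for log messages. */
static const char *toy_err_str(int err)
{
	switch (err) {
#define x(name, cls) case name: return #name;
	TOY_ERRCODES()
#undef x
	default:
		return "(unknown)";
	}
}

/* Standard errno a private code maps to; other values pass through. */
static int toy_err_class(int err)
{
	switch (err) {
#define x(name, cls) case name: return cls;
	TOY_ERRCODES()
#undef x
	default:
		return err;
	}
}
```

Because each failure site gets its own name, a log line carrying `toy_err_str()` output points at one specific cause, while `toy_err_class()` still yields a conventional errno at the boundary.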

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:19 -05:00
Kent Overstreet
0eaac0b44f bcachefs: btree_write_buffer_flush_seq() no longer closes journal
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:19 -05:00
Kent Overstreet
c601e5d7da bcachefs: Can now block journal activity without closing cur entry
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:19 -05:00
Kent Overstreet
3956ff8bc2 bcachefs: Don't use wait_event_interruptible() in recovery
Fix a bug where mount was failing with -ERESTARTSYS:
https://github.com/koverstreet/bcachefs/issues/741

We only want the interruptible wait when called from fsync.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-20 16:50:14 -04:00
Uros Bizjak
68573b936d bcachefs: Use try_cmpxchg() family of functions instead of cmpxchg()
Use try_cmpxchg() family of functions instead of
cmpxchg (*ptr, old, new) == old. x86 CMPXCHG instruction returns
success in ZF flag, so this change saves a compare after cmpxchg
(and related move instruction in front of cmpxchg).

Also, try_cmpxchg() implicitly assigns old *ptr value to "old" when
cmpxchg fails. There is no need to re-read the value in the loop.

No functional change intended.
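
The same loop shape can be shown in userspace with C11 atomics, whose compare-exchange functions also write the current value back into the expected-value argument on failure. This is an analogy only, not the kernel API: `saturating_add()` is a hypothetical helper and `atomic_compare_exchange_weak()` stands in for try_cmpxchg().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Atomically add n to *v, saturating at max. Returns the value stored. */
static uint64_t saturating_add(_Atomic uint64_t *v, uint64_t n, uint64_t max)
{
	uint64_t old = atomic_load(v);	/* read once, before the loop */
	uint64_t new;

	do {
		new = (old >= max || max - old < n) ? max : old + n;
		/* On failure, 'old' is refreshed with the current value of
		 * *v, so the loop body never has to re-read it. */
	} while (!atomic_compare_exchange_weak(v, &old, new));

	return new;
}
```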

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:12 -04:00
Kent Overstreet
44ec599035 bcachefs: Don't use the new_fs() bucket alloc path on an initialized fs
On a new filesystem or device we have to allocate the journal with a
bump allocator, because allocation info isn't ready yet - but when
hot-adding a device that doesn't have a journal, we don't want to use
that path.

Reported-by: syzbot+24a867cb90d8315cccff@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-28 19:47:31 -04:00
Kent Overstreet
b895c70326 bcachefs: x-macroize journal flags enums
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-08 17:29:22 -04:00
Kent Overstreet
497c982f05 bcachefs: New assertion for writing to the journal after shutdown
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-08 17:29:18 -04:00
Kent Overstreet
916abefd43 bcachefs: better journal pipelining
Recently a severe performance regression was discovered, which bisected
to

  a6548c8b5e bcachefs: Avoid flushing the journal in the discard path

It turns out the old behaviour, which issued excessive journal flushes,
worked around a performance issue where queueing delays would cause the
journal to not be able to write quickly enough and stall.

The journal flushes masked the issue because they periodically flushed
the device write cache, reducing write latency for non flushes.

This patch reworks the journalling code to allow more than one
(non-flush) write to be in flight at a time. With this patch, doing 4k
random writes and an iodepth of 128, we are now able to hit 560k iops to
a Samsung 970 EVO Plus - previously, we were stuck in the ~200k range.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10 15:34:08 -04:00
Kent Overstreet
cea07a7b6a bcachefs: vstruct_for_each() now declares loop iter
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01 11:47:42 -05:00
Kent Overstreet
09caeabe1a bcachefs: btree write buffer now slurps keys from journal
Previously, the transaction commit path would have to add keys to the
btree write buffer as a separate operation, requiring additional global
synchronization.

This patch introduces a new journal entry type, which indicates that the
keys need to be copied into the btree write buffer prior to being
written out. We switch the journal entry type back to
JSET_ENTRY_btree_keys prior to write, so this is not an on disk format
change.

Flushing the btree write buffer may require pulling keys out of journal
entries yet to be written, and quiescing outstanding journal
reservations; we previously added journal->buf_lock for synchronization
with the journal write path.

We also can't put strict bounds on the number of keys in the journal
destined for the write buffer, which means we might overflow the size of
the preallocated buffer and have to reallocate - this introduces a
potentially fatal memory allocation failure. This is something we'll
have to watch for; if it becomes an issue in practice we can do
additional mitigation.

The transaction commit path no longer has to explicitly check if the
write buffer is full and wait on flushing; this is another performance
optimization. Instead, when the btree write buffer is close to full we
change the journal watermark, so that only reservations for journal
reclaim are allowed.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01 11:47:41 -05:00
Kent Overstreet
8ab3fa9639 bcachefs: kill journal->preres_wait
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01 11:47:39 -05:00
Kent Overstreet
a66ff26b0f bcachefs: Close journal entry if necessary when flushing all pins
Since outstanding journal buffers hold a journal pin, when flushing all
pins we need to close the current journal entry if necessary so its pin
can be released.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-12-10 16:53:46 -05:00
Kent Overstreet
ef0beeb8dd bcachefs: move journal seq assertion
journal_cur_seq() can legitimately be used outside of the journal lock,
where this assert can race

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-28 22:58:22 -05:00
Kent Overstreet
006ccc3090 bcachefs: Kill journal pre-reservations
This deletes the complicated and somewhat expensive journal
pre-reservation machinery in favor of just using journal watermarks:
when the journal is more than half full, we run journal reclaim more
aggressively, and when the journal is more than 3/4s full we only allow
journal reclaim to get new journal reservations.
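
The two thresholds can be sketched as below. This is a simplified model, assuming fullness is tracked as used/total; the `toy_*` names and the two-level watermark enum are hypothetical and much coarser than the real watermark set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum toy_watermark {
	TOY_WATERMARK_normal,	/* ordinary reservations */
	TOY_WATERMARK_reclaim,	/* reservations made by journal reclaim */
};

struct toy_journal_space {
	uint64_t used;
	uint64_t total;
};

/* More than half full: run journal reclaim more aggressively. */
static bool toy_reclaim_aggressively(const struct toy_journal_space *s)
{
	return s->used * 2 > s->total;
}

/* More than 3/4 full: raise the watermark so only reclaim gets through. */
static enum toy_watermark
toy_current_watermark(const struct toy_journal_space *s)
{
	return s->used * 4 > s->total * 3
		? TOY_WATERMARK_reclaim
		: TOY_WATERMARK_normal;
}

/* A reservation is allowed if the requester's watermark is at or above
 * the journal's current watermark. */
static bool toy_res_allowed(const struct toy_journal_space *s,
			    enum toy_watermark requester)
{
	assert(s->used <= s->total);
	return requester >= toy_current_watermark(s);
}
```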

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-14 23:44:43 -05:00
Kent Overstreet
bbe682c767 bcachefs: Ensure devices are always correctly initialized
We can't mark device superblocks or allocate journal on a device that
isn't online.

That means we may need to do this on every mount, because we may have
formatted a new filesystem and then done the first mount
(bch2_fs_initialize()) in degraded mode.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-31 12:18:37 -04:00
Brian Foster
3e55189b50 bcachefs: fix race between journal entry close and pin set
bcachefs freeze testing via fstests generic/390 occasionally
reproduces the following BUG from bch2_fs_read_only():

  BUG_ON(atomic_long_read(&c->btree_key_cache.nr_dirty));

This indicates that one or more dirty key cache keys still exist
after the attempt to flush and quiesce the fs. The sequence that
leads to this problem actually occurs on unfreeze (ro->rw), and
looks something like the following:

- Task A begins a transaction commit and acquires journal_res for
  the current seq. This transaction intends to perform key cache
  insertion.
- Task B begins a bch2_journal_flush() via bch2_sync_fs(). This ends
  up in journal_entry_want_write(), which closes the current journal
  entry and drops the reference to the pin list created on entry open.
  The pin put pops the front of the journal via fast reclaim since the
  reference count has dropped to 0.
- Task A attempts to set the journal pin for the associated cached
  key, but bch2_journal_pin_set() skips the pin insert because the
  seq of the transaction reservation is behind the front of the pin
  list fifo.

The end result is that the pin associated with the cached key is not
added, which prevents a subsequent reclaim from processing the key
and thus leaves it dangling at freeze time. The fundamental cause of
this problem is that the front of the journal is allowed to pop
before a transaction with outstanding reservation on the associated
journal seq is able to add a pin. The count for the pin list
associated with the seq drops to zero and is prematurely reclaimed
as a result.

The logical fix for this problem lies in how the journal buffer is
managed in similar scenarios where the entry might have been closed
before a transaction with outstanding reservations happens to be
committed.

When a journal entry is opened, the current sequence number is
bumped, the associated pin list is initialized with a reference
count of 1, and the journal buffer reference count is bumped (via
journal_state_inc()). When a journal reservation is acquired, the
reservation also acquires a reference on the associated buffer. If
the journal entry is closed in the meantime, it drops both the pin
and buffer references held by the open entry, but the buffer still
has references held by outstanding reservation. After the associated
transaction commits, the reservation release drops the associated
buffer references and the buffer is written out once the reference
count has dropped to zero.

The fundamental problem here is that the lifecycle of the pin list
reference held by an open journal entry is too short to cover the
processing of transactions with outstanding reservations. The
simplest way to address this is to expand the pin list reference to
the lifecycle of the buffer vs. the shorter lifecycle of the open
journal entry. This ensures the pin list for a seq with outstanding
reservation cannot be popped and reclaimed before all outstanding
reservations have been released, even if the associated journal
entry has been closed for further reservations.

Move the pin put from journal entry close to where final processing
of the journal buffer occurs. Create a duplicate helper to cover the
case where the caller doesn't already hold the journal lock. This
allows generic/390 to pass reliably.
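
The reference counting described above can be modelled in a few lines. This is a toy single-threaded model, not the journal code: `struct toy_seq` and the `toy_*` functions are hypothetical, and the only point shown is that the pin list reference is dropped when the last buffer reference goes away rather than at entry close.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one journal sequence number. */
struct toy_seq {
	int	buf_refs;	/* open entry + outstanding reservations */
	int	pin_count;	/* references on this seq's pin list */
	bool	reclaimed;	/* pin list popped off the front */
};

static void toy_pin_put(struct toy_seq *e)
{
	assert(e->pin_count > 0);
	if (--e->pin_count == 0)
		e->reclaimed = true;
}

/* Final buffer put also drops the pin reference taken at open. */
static void toy_buf_put(struct toy_seq *e)
{
	assert(e->buf_refs > 0);
	if (--e->buf_refs == 0)
		toy_pin_put(e);
}

static void toy_seq_open(struct toy_seq *e)
{
	e->buf_refs = 1;
	e->pin_count = 1;
	e->reclaimed = false;
}

static void toy_res_get(struct toy_seq *e)
{
	assert(e->buf_refs > 0);
	e->buf_refs++;
}

/* Closing the entry drops only the open entry's buffer reference. */
static void toy_seq_close(struct toy_seq *e)	{ toy_buf_put(e); }
static void toy_res_put(struct toy_seq *e)	{ toy_buf_put(e); }

/* Adding a pin fails if the pin list was already reclaimed. */
static bool toy_pin_set(struct toy_seq *e)
{
	if (e->reclaimed)
		return false;
	e->pin_count++;
	return true;
}
```

Replaying the Task A / Task B sequence against this model (reserve, close, then set the pin) leaves the pin list alive until the reservation is released, which is the property the fix relies on.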

Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:14 -04:00
Brian Foster
fc08031bb8 bcachefs: prepare journal buf put to handle pin put
bcachefs freeze testing has uncovered some raciness between journal
entry open/close and pin list reference count management. The
details of the problem are described in a separate patch. In
preparation for the associated fix, refactor the journal buffer put
path a bit to allow it to eventually handle dropping the pin list
reference currently held by an open journal entry.

Retain the journal write dispatch helper since the closure code is
inlined and we don't want to increase the amount of inline code in
the transaction commit path, but rename the function to reflect
the purpose of final processing of the journal buffer.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:14 -04:00
Kent Overstreet
ec14fc6010 bcachefs: Kill JOURNAL_WATERMARK
This unifies JOURNAL_WATERMARK with BCH_WATERMARK; we're working towards
specifying watermarks once in the transaction commit path.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:05 -04:00
Kent Overstreet
87ced107f3 bcachefs: Convert EAGAIN errors to private error codes
More error code cleanup, for better error messages and debuggability.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:49 -04:00
Kent Overstreet
5bbe3f2d0e bcachefs: Log more messages in the journal
This patch

 - Adds a mechanism for queuing up journal entries prior to the journal
   being started, which will be used for early journal log messages

 - Adds bch2_fs_log_msg() and improves bch2_trans_log_msg(), which now
   take format strings. bch2_fs_log_msg() can be used before or after
   the journal has been started, and will use the appropriate mechanism.

 - Deletes the now obsolete bch2_journal_log_msg()

 - And adds more log messages to the recovery path - messages for
   journal/filesystem started, journal entries being blacklisted, and
   journal replay starting/finishing.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:48 -04:00
Kent Overstreet
a101957649 bcachefs: More style fixes
Fixes for various checkpatch errors.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:45 -04:00
Kent Overstreet
43ddf44834 bcachefs: Refactor journal entry adding
This takes copying the payload out of bch2_journal_add_entry(), which
means we can use it for journal_transaction_name() - also prep work for
journalling overwrites.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:33 -04:00
Kent Overstreet
ce6201c456 bcachefs: Use a genradix for reading journal entries
Previously, the journal read path used a linked list for storing the
journal entries we read from disk. But there's been a bug that's been
causing journal_flush_delay to incorrectly be set to 0, leading to far
more journal entries than is normal being written out, which then means
filesystems are no longer able to start due to the O(n^2) behaviour of
inserting into/searching that linked list.

Fix this by switching to a radix tree.
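
To illustrate why indexing by sequence number avoids the quadratic cost, the sketch below uses a growable array addressed by seq - base_seq as a stand-in for a radix tree. It is not the genradix API: `struct toy_entry_table` and its functions are hypothetical, and a real radix tree would not need one contiguous allocation.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Table of journal entries addressed directly by sequence number. */
struct toy_entry_table {
	uint64_t	base_seq;	/* lowest seq the table can hold */
	size_t		nr;		/* slots allocated */
	void		**slots;	/* slots[seq - base_seq], NULL if unseen */
};

/* Store entry at seq without walking earlier entries.
 * Returns 0 on success, -1 on allocation failure. */
static int toy_table_insert(struct toy_entry_table *t, uint64_t seq,
			    void *entry)
{
	assert(seq >= t->base_seq);
	size_t idx = seq - t->base_seq;

	if (idx >= t->nr) {
		size_t new_nr = t->nr ? t->nr : 16;

		while (new_nr <= idx)
			new_nr *= 2;

		void **n = realloc(t->slots, new_nr * sizeof(*n));
		if (!n)
			return -1;

		/* New slots start out empty. */
		for (size_t i = t->nr; i < new_nr; i++)
			n[i] = NULL;

		t->slots = n;
		t->nr = new_nr;
	}

	t->slots[idx] = entry;
	return 0;
}

/* Constant-time lookup; NULL if seq is out of range or unseen. */
static void *toy_table_lookup(const struct toy_entry_table *t, uint64_t seq)
{
	if (seq < t->base_seq || seq - t->base_seq >= t->nr)
		return NULL;
	return t->slots[seq - t->base_seq];
}
```

Inserting n entries into a sorted linked list costs up to n steps each, so O(n^2) overall; here each insert or lookup touches one slot.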

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:30 -04:00
Kent Overstreet
31f63fd124 bcachefs: Introduce a separate journal watermark for copygc
Since journal reclaim -> btree key cache flushing may require the
allocation of new btree nodes, it has an implicit dependency on copygc
in order to make forward progress - so we should avoid blocking copygc
unless the journal is really close to full.

This introduces watermarks to replace our single MAY_GET_UNRESERVED bit
in the journal, and adds a watermark for copygc and plumbs it through.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:29 -04:00
Kent Overstreet
d5d3be7dc5 bcachefs: bch2_journal_log_msg()
This adds bch2_journal_log_msg(), which just logs a message to the
journal, and uses it to mark startup and when journal replay finishes.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:27 -04:00
Kent Overstreet
24a3d53b28 bcachefs: __journal_entry_close() never fails
Previous patch just moved responsibility for incrementing the journal
sequence number and initializing the new journal entry from
__journal_entry_close() to journal_entry_open(); this patch makes the
analogous change for journal reservation state, incrementing the index
into array of journal_bufs at open time.

This means that __journal_entry_close() never fails to close an open
journal entry, which is important for the next patch that will change
our emergency shutdown behaviour.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:26 -04:00
Kent Overstreet
30ef633a0b bcachefs: Refactor journal code to not use unwritten_idx
It makes the code more readable if we work off of sequence numbers,
instead of direct indexes into the array of journal buffers.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:26 -04:00
Kent Overstreet
75ef2c59bc bcachefs: Start moving debug info from sysfs to debugfs
In sysfs, files can only output at most PAGE_SIZE. This is a problem for
debug info that needs to list an arbitrary number of items, and because
of this limit some of our debug info has been terser and harder to read
than we'd like.

This patch moves info about journal pins and cached btree nodes to
debugfs, and greatly expands and improves the output we return.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:26 -04:00
Kent Overstreet
b66b2bc0f6 bcachefs: Revert "Ensure journal doesn't get stuck in nochanges mode"
This patch was originally to work around the journal getting stuck in
nochanges mode - but that was just a hack, we needed to fix the actual
bug. It should be fixed now, so revert it.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:25 -04:00
Kent Overstreet
e201f70b11 bcachefs: Fix for journal getting stuck
The journal can get stuck if we need to get a journal reservation for
something we have a pre-reservation for, but aren't able to reclaim
space, or if the pin fifo is full - it's impractical to resize the pin
fifo at runtime.

Previously, we reserved 8 entries in the pin fifo for pre-reservations,
but that seems small - we're seeing the journal occasionally get stuck.
Let's reserve a quarter of it.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:25 -04:00
Kent Overstreet
5b2e599f50 bcachefs: bch2_journal_noflush_seq()
Add bch2_journal_noflush_seq(), for telling the journal that entries
before a given sequence number should not be flushes - to be used by an
upcoming allocator optimization.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:20 -04:00
Kent Overstreet
0e030f5e20 bcachefs: Kill journal buf bloom filter
This was used for recording which inodes have been modified by in flight
journal writes, but was broken and has been superseded.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:16 -04:00
Kent Overstreet
fae1157d18 bcachefs: Ensure journal doesn't get stuck in nochanges mode
This tweaks the journal code to always act as if there's space available
in nochanges mode, when we're not going to be doing any writes. This
helps in recovering filesystems that won't mount because they need
journal replay and the journal has gotten stuck.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:15 -04:00
Kent Overstreet
8ce600d447 bcachefs: Fix for btree_gc repairing interior btree ptrs
Using the normal transaction commit path to insert and journal updates
to interior nodes hadn't been done before this repair code was written,
so it's not surprising that there was a bug.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:01 -04:00
Kent Overstreet
671cc8a51b bcachefs: Eliminate memory barrier from fast path of journal_preres_put()
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:59 -04:00