Commit Graph

125 Commits

Author SHA1 Message Date
Kent Overstreet
74f3931a1b bcachefs: Fix additional misalignment in journal space calculations
Additional fix on top of

f54b2a80d0 bcachefs: Fix misaligned bucket check in journal space calculations

Make sure that when we calculate space for the next entry it's not
misaligned: we need to round_down() to filesystem block size in multiple
places (next entry size calculation as well as total space available).

Reported-by: Ondřej Kraus <neverberlerfellerer@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-07 18:19:30 -04:00
Kent Overstreet
09b9c72bd4 bcachefs: bch_err_throw()
Add a tracepoint for any time we return an error and unwind.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-02 12:16:35 -04:00
Kent Overstreet
18dad454cd bcachefs: Replace rcu_read_lock() with guards
The new guard(), scoped_guard() allow for more natural code.

Some of the uses with creative flow control have been left.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-01 00:03:12 -04:00
Kent Overstreet
f54b2a80d0 bcachefs: Fix misaligned bucket check in journal space calculations
Fix an assertion pop in the tiering_misaligned test: rounding down to
bucket size at the end of the journal space calculations leaves
cur_entry_sectors == 0, which is incorrect with !cur_entry_err.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-30 01:21:12 -04:00
Kent Overstreet
a78a11900e bcachefs: journal path now uses discard_opt_enabled()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:15:01 -04:00
Gustavo A. R. Silva
ae0386e111 bcachefs: Avoid -Wflex-array-member-not-at-end warnings
-Wflex-array-member-not-at-end was introduced in GCC-14, and we are
getting ready to enable it, globally.

Refactor a couple of structs that contain flexible arrays in the
middle by replacing them with unions.

So, with these changes, fix the following warnings:

fs/bcachefs/disk_accounting.c:429:51: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]
fs/bcachefs/ec_types.h:8:41: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:38 -04:00
Kent Overstreet
cca2c0d224 bcachefs: bch_dev.io_ref -> enumerated_ref
Convert device IO refs to enumerated_refs, for easier debugging of
refcount issues.

Simple conversion: enumerate all users and convert to the new helpers.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:28 -04:00
Kent Overstreet
6d67de1079 bcachefs: for_each_rw_member_rcu()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:27 -04:00
Alan Huang
09279bba72 bcachefs: Kill dead code
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:24 -04:00
Kent Overstreet
530112d88e bcachefs: BCH_FEATURE_small_image
We can't go RW if it's an image file that hasn't been resized.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:20 -04:00
Kent Overstreet
7b6759b199 bcachefs: Fix livelock in journal_entry_open()
When the journal is low on space, we might do discards from
journal_res_get() -> journal_entry_open().

Make sure we set j->can_discard correctly, so that if we're low on space
but not because discards aren't keeping up we don't livelock.

Fixes: 8e4d28036c ("bcachefs: Don't aggressively discard the journal")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-14 17:05:19 -04:00
Kent Overstreet
8e4d28036c bcachefs: Don't aggressively discard the journal
We frequently use 'bcachefs list_journal -a' for debugging, as it
provides a record of all btree transactions, and a history of what
happened.

But it's not so useful if we immediately discard journal buckets right
after they're no longer dirty.

This tweaks journal reclaim to only discard when we're low on space,
keeping the journal mostly un-discarded.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-07 17:10:10 -04:00
Kent Overstreet
6468aef231 bcachefs: Ensure journal space is block size aligned
We don't require that bucket size is block size aligned (although it
should be!) - so we need to handle this in the journal code.

This fixes an assertion pop in jorunal_entry_close(), where the journal
entry overruns available space - after rounding it up to block size.

Fixes: https://github.com/koverstreet/bcachefs/issues/854
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-20 19:41:38 -04:00
Kent Overstreet
8a9f3d0582 bcachefs: EIO cleanup
Replace these with proper private error codes, so that when we get an
error message we're not sifting through the entire codebase to see where
it came from.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:36 -04:00
Kent Overstreet
4a4000b9a6 bcachefs: Kill JOURNAL_ERRORS()
Convert these to standard error codes, which means we can pass them
outside the journal code, they're easier to pass to tracepoints, etc.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:35 -04:00
Kent Overstreet
1e690efa72 bcachefs: Split out journal pins by btree level
This lets us flush the journal to go read-only more effectively.

Flushing the journal and going read-only requires halting mutually
recursive processes, which strictly speaking are not guaranteed to
terminate.

Flushing btree node journal pins will kick off a btree node write, and
btree node writes on completion must do another btree update to the
parent node to update the 'sectors_written' field for that node's key.

If the parent node is full and requires a split or compaction, that's
going to generate a whole bunch of additional btree updates - alloc
info, LRU btree, and more - which then have to be flushed, and the cycle
repeats.

This process will terminate much more effectively if we tweak journal
reclaim to flush btree updates leaf to root: i.e., don't flush updates
for a given btree node (kicking off a write, and consuming space within
that node up to the next block boundary) if there might still be
unflushed updates in child nodes.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-02-11 10:10:32 -05:00
Kent Overstreet
35f5197009 bcachefs: Improve journal pin flushing
Running the preempt tiering tests with a lower than normal journal
reclaim delay turned up a shutdown hang - a lost wakeup, caused because
flushing a journal pin (e.g. key cache/write buffer) can generate a new
journal pin.

The "simple" fix of adding the correct wakeup didn't work because of
ordering issues; if we flush btree node pins too aggressively before
other pins have completed, we end up spinning where each flush iteration
generates new work.

So to fix this correctly:
- The list of flushed journal pins is now broken out by type, so that
  we can wait for key cache/write buffer pin flushing to complete
  before flushing dirty btree nodes

- A new closure_waitlist is added for bch2_journal_flush_pins; this one
  is only used under or when we're taking the journal lock, so it's
  pretty cheap to add rigorously correct wakeups to journal_pin_set()
  and journal_pin_drop().

Additionally, bch2_journal_seq_pins_to_text() is moved to
journal_reclaim.c, where it belongs, along with a bit of other small
renaming and refactoring.

Besides fixing the hang, the better ordering between key cache/write
buffer flushing and btree node flushing should help or fix the "unmount
taking excessively long" a few users have been noticing.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-25 19:37:43 -05:00
Kent Overstreet
58117dbdd6 bcachefs: Journal space calculations should skip durability=0 devices
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:22 -05:00
Kent Overstreet
9c22dd02ae bcachefs: Fix allocating too big journal entry
The "journal space available" calculations didn't take into account
mismatched bucket sizes; we need to take the minimum space available out
of our devices.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:21 -05:00
Kent Overstreet
828552ca74 bcachefs: Kill bch2_bucket_alloc_new_fs()
The early-early allocation path, bch2_bucket_alloc_new_fs(), is no
longer needed - and inconsistencies around new_fs_bucket_idx have been a
frequent source of bugs.

Reported-by: syzbot+592425844580a6598410@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:19 -05:00
Kent Overstreet
135c0c8524 bcachefs: Fix racy use of jiffies
Calculate the timeout, then check if it's positive before calling
schedule_timeout().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:13 -05:00
Kent Overstreet
7a51608d01 bcachefs: Rework btree node pinning
In backpointers fsck, we do a seqential scan of one btree, and check
references to another: extents <-> backpointers

Checking references generates random lookups, so we want to pin that
btree in memory (or only a range, if it doesn't fit in ram).

Previously, this was done with a simple check in the shrinker - "if
btree node is in range being pinned, don't free it" - but this generated
OOMs, as our shrinker wasn't well behaved if there was less memory
available than expected.

Instead, we now have two different shrinkers and lru lists; the second
shrinker being for pinned nodes, with seeks set much higher than normal
- so they can still be freed if necessary, but we'll prefer not to.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 11:39:48 -04:00
Kent Overstreet
91ddd71510 bcachefs: split up btree cache counters for live, freeable
this is prep for introducing a second live list and shrinker for pinned
nodes

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 11:39:48 -04:00
Kent Overstreet
691f2cba22 bcachefs: btree cache counters should be size_t
32 bits won't overflow any time soon, but size_t is the correct type for
counting objects in memory.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 11:39:48 -04:00
Kent Overstreet
737759fc09 bcachefs: Fix printbuf usage while atomic
Reported-by: syzbot+f765e51170cf13493f0b@syzkaller.appspotmail.com
Fixes: f12410bb7d ("bcachefs: Add an error message for insufficient rw journal devs")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-22 11:27:15 -04:00
Kent Overstreet
f12410bb7d bcachefs: Add an error message for insufficient rw journal devs
This causes us to go read-only - need an error message saying why.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-18 18:33:30 -04:00
Kent Overstreet
b895c70326 bcachefs: x-macroize journal flags enums
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-08 17:29:22 -04:00
Kent Overstreet
497c982f05 bcachefs: New assertion for writing to the journal after shutdown
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-08 17:29:18 -04:00
Kent Overstreet
6088234ce8 bcachefs: JOURNAL_SPACE_LOW
"bcachefs; Fix deadlock in bch2_btree_update_start()" was a significant
performance regression (nearly 50%) on multithreaded random writes with
fio.

The reason is that the journal watermark checks multiple things,
including the state of the btree write buffer, and on multithreaded
update heavy workloads we're bottleneked on write buffer flushing - we
don't want kicknig off btree updates to depend on the state of the write
buffer.

This isn't strictly correct; the interior btree update path does do
write buffer updates, but it's a tiny fraction of total accounting
updates and we're more concerned with space in the journal itself.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-06 13:50:26 -04:00
Kent Overstreet
f1ca1abfb0 bcachefs: pull out time_stats.[ch]
prep work for lifting out of fs/bcachefs/

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:30:35 -04:00
Kent Overstreet
4f70176cb9 bcachefs: Kill unnecessary wakeups in journal reclaim
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10 15:34:08 -04:00
Kent Overstreet
097471f9e4 bcachefs: Fix bch2_journal_flush_device_pins()
If a journal write errored, the list of devices it was written to could
be empty - we're not supposed to mark an empty replicas list.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-02-24 20:46:48 -05:00
Kent Overstreet
4e07447503 bcachefs: Clamp replicas_required to replicas
This prevents going emergency read only when the user has specified
replicas_required > replicas.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-02-13 20:33:38 -05:00
Kent Overstreet
41b84fb489 bcachefs: for_each_member_device_rcu() now declares loop iter
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01 11:47:42 -05:00
Kent Overstreet
9fea2274f7 bcachefs: for_each_member_device() now declares loop iter
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01 11:47:42 -05:00
Kent Overstreet
cf904c8d96 bcachefs: bch_err_(fn|msg) check if should print
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01 11:47:41 -05:00
Kent Overstreet
09caeabe1a bcachefs: btree write buffer now slurps keys from journal
Previosuly, the transaction commit path would have to add keys to the
btree write buffer as a separate operation, requiring additional global
synchronization.

This patch introduces a new journal entry type, which indicates that the
keys need to be copied into the btree write buffer prior to being
written out. We switch the journal entry type back to
JSET_ENTRY_btree_keys prior to write, so this is not an on disk format
change.

Flushing the btree write buffer may require pulling keys out of journal
entries yet to be written, and quiescing outstanding journal
reservations; we previously added journal->buf_lock for synchronization
with the journal write path.

We also can't put strict bounds on the number of keys in the journal
destined for the write buffer, which means we might overflow the size of
the preallocated buffer and have to reallocate - this introduces a
potentially fatal memory allocation failure. This is something we'll
have to watch for, if it becomes an issue in practice we can do
additional mitigation.

The transaction commit path no longer has to explicitly check if the
write buffer is full and wait on flushing; this is another performance
optimization. Instead, when the btree write buffer is close to full we
change the journal watermark, so that only reservations for journal
reclaim are allowed.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01 11:47:41 -05:00
Kent Overstreet
0ba9375a11 bcachefs: Unwritten journal buffers are always dirty
Ensure that journal bufs that haven't been written can't be reclaimed
from the journal pin fifo, and can thus have new pins taken.

Prep work for changing the btree write buffer to pull keys from the
journal directly.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01 11:47:41 -05:00
Kent Overstreet
066a26460b bcachefs: track_event_change()
This introduces a new helper for connecting time_stats to state changes,
i.e. when taking journal reservations is blocked for some reason.

We use this to track separately the different reasons the journal might
be blocked - i.e. space in the journal full, or the journal pin fifo
full.

Also do some cleanup and improvements on the time stats code.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01 11:47:37 -05:00
Kent Overstreet
3eedfe1af9 bcachefs: Journal pins must always have a flush_fn
flush_fn is how we identify journal pins in debugfs - this is a
debugging aid.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01 11:47:37 -05:00
Kent Overstreet
df8e13ccf3 bcachefs: Add an assertion in bch2_journal_pin_set()
Previously, bch2_journal_pin_set() would silently ignore a request to
pin a journal sequence number that was no longer dirty, because it was
used internally by bch2_journal_pin_copy() which could race with the src
pin being flushed.

Split these apart so that we can properly assert that @seq is a
currently dirty journal sequence number - this is almost always a bug.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01 11:47:37 -05:00
Kent Overstreet
a66ff26b0f bcachefs: Close journal entry if necessary when flushing all pins
Since outstanding journal buffers hold a journal pin, when flushing all
pins we need to close the current journal entry if necessary so its pin
can be released.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-12-10 16:53:46 -05:00
Kent Overstreet
006ccc3090 bcachefs: Kill journal pre-reservations
This deletes the complicated and somewhat expensive journal
pre-reservation machinery in favor of just using journal watermarks:
when the journal is more than half full, we run journal reclaim more
aggressively, and when the journal is more than 3/4s full we only allow
journal reclaim to get new journal reservations.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-14 23:44:43 -05:00
Brian Foster
92b63f5bf0 bcachefs: refactor pin put helpers
We have a couple journal pin put helpers to handle cases where the
journal lock is already held or not. Refactor the helpers to lock
and reclaim from the highest level and open code the reclaim from
the one caller of the internal variant. The latter call will be
moved into the journal buf release helper in a later patch.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:14 -04:00
Kent Overstreet
96dea3d599 bcachefs: Fix W=12 build errors
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:13 -04:00
Kent Overstreet
e46c181af9 bcachefs: Convert more code to bch_err_msg()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:12 -04:00
Kent Overstreet
fb8e5b4cae bcachefs: sb-members.c
Split out a new file for bch_sb_field_members - we'll likely want to
move more code here in the future.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:10 -04:00
Kent Overstreet
1e81f89b02 bcachefs: Fix assorted checkpatch nits
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:10 -04:00
Kent Overstreet
9a644843c4 bcachefs: Fix error path in bch2_journal_flush_device_pins()
We need to always call bch2_replicas_gc_end() after we've called
bch2_replicas_gc_start(), else we leave state around that needs to be
cleaned up.

Partial fix for: https://github.com/koverstreet/bcachefs/issues/560

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:06 -04:00
Kent Overstreet
73bd774d28 bcachefs: Assorted sparse fixes
- endianness fixes
 - mark some things static
 - fix a few __percpu annotations
 - fix silent enum conversions

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:06 -04:00