Commit Graph

361 Commits

Author SHA1 Message Date
Kent Overstreet
15eaaa4c31 bcachefs: Upgrades now specify errors to fix, like downgrades
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-05 23:24:20 -05:00
Kent Overstreet
d55ddf6e7a bcachefs: Online fsck can now fix errors
BCH_FS_fsck_done -> BCH_FS_fsck_running; set when we might be fixing
fsck errors. Also; set fix_errors to ask by default when fsck is
running.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-05 23:24:20 -05:00
Kent Overstreet
eff1f728be bcachefs: Upgrading uses bch_sb.recovery_passes_required
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-05 23:24:19 -05:00
Kent Overstreet
62719cf33c bcachefs: Fix nochanges/read_only interaction
nochanges means "we cannot issue writes at all"; it's possible to go
into a pseudo read-write mode where we pin dirty metadata in memory,
which is used for fsck in dry run mode and doing journal replay on a
read only mount, but we do not want to allow an actual read-write mount
in nochanges mode.

But we do always want to allow early read-write, during recovery - this
patch clarifies that.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-05 23:24:19 -05:00
Kent Overstreet
cea07a7b6a bcachefs: vstruct_for_each() now declares loop iter
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01 11:47:42 -05:00
Kent Overstreet
9fea2274f7 bcachefs: for_each_member_device() now declares loop iter
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01 11:47:42 -05:00
Kent Overstreet
defd9e39b5 bcachefs: darray_for_each() now declares loop iter
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01 11:47:42 -05:00
Kent Overstreet
cf904c8d96 bcachefs: bch_err_(fn|msg) check if should print
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01 11:47:41 -05:00
Kent Overstreet
249bf593e8 bcachefs: Fix snapshot.c assertion for online fsck
c->curr_recovery_pass can go backwards; this adds a non rewinding
version, c->recovery_pass_done.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01 11:47:41 -05:00
Kent Overstreet
27b2df982f bcachefs: Kill for_each_btree_key()
for_each_btree_key() handles transaction restarts, like
for_each_btree_key2(), but only calls bch2_trans_begin() after a
transaction restart - for_each_btree_key2() wraps every loop iteration
in a transaction.

The for_each_btree_key() behaviour is problematic when it leads to
holding the SRCU lock that prevents key cache reclaim for an unbounded
amount of time - there's no real need to keep it around.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01 11:47:40 -05:00
Kent Overstreet
7f391b2f8e bcachefs: bch2_run_online_recovery_passes()
Add a new helper for running online recovery passes - i.e. online fsck.
This is a subset of our normal recovery passes, and does not - for now -
use or follow c->curr_recovery_pass.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01 11:47:40 -05:00
Kent Overstreet
2b41226d7f bcachefs: Add ability to redirect log output
Upcoming patches are going to add two new ioctls for running fsck in the
kernel, but pretending that we're running our normal userspace fsck.

This patch adds some plumbing for redirecting our normal log messages
away from the dmesg log to a thread_with_file file descriptor - via a
struct log_output, which will be consumed by the fsck f_op's read method.

The new ioctls will allow for running fsck in the kernel against an
offline filesystem (without mounting it), and an online filesystem. For
an offline filesystem we need a way to pass in a pointer to the
log_output, which is done via a new hidden opts.h option.

For online fsck, we can set c->output directly, but only want to
redirect log messages from the thread running fsck - hence the new
c->output_filter method.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01 11:47:40 -05:00
Kent Overstreet
3f0e297d86 bcachefs: Explicity go RW for fsck
This eliminates a lot of BCH_TRANS_COMMIT_lazy_rw flags, and is less
error prone.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01 11:47:39 -05:00
Kent Overstreet
3c471b6588 bcachefs: convert bch_fs_flags to x-macro
Now we can print out filesystem flags in sysfs, useful for debugging
various "what's my filesystem doing" issues.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01 11:47:38 -05:00
Kent Overstreet
9b34f02cdc bcachefs: Kill dev_usage->buckets_ec
This counter is redundant; it's simply the sum of BCH_DATA_stripe and
BCH_DATA_parity buckets.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01 11:47:38 -05:00
Kent Overstreet
b27d7afb79 bcachefs: Don't flush journal after replay
The flush_all_pins() after journal replay was unecessary, and trying to
completely flush the journal while RW is not a great idea - it's not
guaranteed to terminate if other threads keep adding things to the
jorunal.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01 11:47:37 -05:00
Kent Overstreet
cb52d23e77 bcachefs: Rename BTREE_INSERT flags
BTREE_INSERT flags are actually transaction commit flags - rename them
for clarity.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01 11:47:37 -05:00
Kent Overstreet
573224301c bcachefs: Make journal replay more efficient
Journal replay now first attempts to replay keys in sorted order,
similar to how the btree write buffer flush path works.

Any keys that can not be replayed due to journal deadlock are then left
for later and replayed in journal order, unpinning journal entries as we
go.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01 11:47:37 -05:00
Kent Overstreet
bdde9829de bcachefs: Go rw before journal replay
This gets us slightly nicer log messages.

Also, this slightly clarifies synchronization of c->journal_keys; after
we go RW it's in use by multiple threads (so that the btree iterator
code can overlay keys from the journal); so it has to be prepped before
that point.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01 11:47:37 -05:00
Kent Overstreet
9a71de675f bcachefs: BTREE_INSERT_JOURNAL_REPLAY now "don't init trans->journal_res"
This slightly changes how trans->journal_res works, in preparation for
changing the btree write buffer flush path to use it.

Now, BTREE_INSERT_JOURNAL_REPLAY means "don't take a journal
reservation; trans->journal_res.seq already refers to the journal
sequence number to pin".

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01 11:47:37 -05:00
Kent Overstreet
fbf9270817 bcachefs: Print old version when scanning for old metadata
Also: we should be using bch2_fs_read_write_early()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01 11:47:36 -05:00
Kent Overstreet
30418de09e bcachefs: Flush fsck errors before running twice
It's confusing if we run fsck a second time (in debug mode, to verify
the second run is clean), but errors are still ratelimited from the
first run.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01 11:47:36 -05:00
Kent Overstreet
84f1638795 bcachefs: bch_sb_field_downgrade
Add a new superblock section that contains a list of
  { minor version, recovery passes, errors_to_fix }

that is - a list of recovery passes that must be run when downgrading
past a given version, and a list of errors to silently fix.

The upcoming disk accounting rewrite is not going to be fully
compatible: we're going to have to regenerate accounting both when
upgrading to the new version, and also from downgrading from the new
version, since the new method of doing disk space accounting is a
completely different architecture based on deltas, and synchronizing
them for every jounal entry write to maintain compatibility is going to
be too expensive and impractical.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01 11:47:07 -05:00
Kent Overstreet
8b16413cda bcachefs: bch_sb.recovery_passes_required
Add two new superblock fields. Since the main section of the superblock
is now fully, we have to add a new variable length section for them -
bch_sb_field_ext.

 - recovery_passes_requried: recovery passes that must be run on the
   next mount
 - errors_silent: errors that will be silently fixed

These are to improve upgrading and dwongrading: these fields won't be
cleared until after recovery successfully completes, so there won't be
any issues with crashing partway through an upgrade or a downgrade.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01 11:47:07 -05:00
Kent Overstreet
808c680f2a bcachefs: Add persistent identifiers for recovery passes
The next patch will start to refer to recovery passes from the
superblock; naturally, we now need identifiers that don't change, since
the existing enum is in the order in which they are run and is not
fixed.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01 11:47:07 -05:00
Kent Overstreet
f87bf892ea bcachefs: fix setting version_upgrade_complete
If a superblock write hasn't happened (i.e. we never had to go rw), then
c->sb.version will be out of date w.r.t. c->disk_sb.sb->version.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01 11:46:52 -05:00
Kent Overstreet
4a147af208 bcachefs: Fix uninitialized var in bch2_journal_replay()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-12-10 12:23:07 -05:00
Kent Overstreet
8a443d3ea1 bcachefs: Proper refcounting for journal_keys
The btree iterator code overlays keys from the journal until journal
replay is finished; since we're now starting copygc/rebalance etc.
before replay is finished, this is multithreaded access and thus needs
refcounting.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-24 02:43:12 -05:00
Kent Overstreet
5a53f851e6 bcachefs: Fix recovery when forced to use JSET_NO_FLUSH journal entry
When we didn't find anything in the journal that we'd like to use, and
we're forced to use whatever we can find - that entry will have been a
JSET_NO_FLUSH entry with a garbage last_seq value, since it's not
normally used.

Initialize it to something sane, for bch2_fs_journal_start().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-04 22:19:13 -04:00
Kent Overstreet
6dfa10ab22 bcachefs: Fix build errors with gcc 10
gcc 10 seems to complain about array bounds in situations where gcc 11
does not - curious.

This unfortunately requires adding some casts for now; we may
investigate getting rid of our __u64 _data[] VLA in a future patch so
that our start[0] members can be VLAs.

Reported-by: John Stoffel <john@stoffel.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-04 14:17:11 -04:00
Kent Overstreet
b65db750e2 bcachefs: Enumerate fsck errors
This patch adds a superblock error counter for every distinct fsck
error; this means that when analyzing filesystems out in the wild we'll
be able to see what sorts of inconsistencies are being found and repair,
and hence what bugs to look for.

Errors validating bkeys are not yet considered distinct fsck errors, but
this patch adds a new helper, bkey_fsck_err(), in order to add distinct
error types for them as well.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-01 21:11:08 -04:00
Kent Overstreet
fb3f57bb11 bcachefs: rebalance_work
This adds a new btree, rebalance_work, to eliminate scanning required
for finding extents that need work done on them in the background - i.e.
for the background_target and background_compression options.

rebalance_work is a bitset btree, where a KEY_TYPE_set corresponds to an
extent in the extents or reflink btree at the same pos.

A new extent field is added, bch_extent_rebalance, which indicates that
this extent has work that needs to be done in the background - and which
options to use. This allows per-inode options to be propagated to
indirect extents - at least in some circumstances. In this patch,
changing IO options on a file will not propagate the new options to
indirect extents pointed to by that file.

Updating (setting/clearing) the rebalance_work btree is done by the
extent trigger, which looks at the bch_extent_rebalance field.

Scanning is still requrired after changing IO path options - either just
for a given inode, or for the whole filesystem. We indicate that
scanning is required by adding a KEY_TYPE_cookie key to the
rebalance_work btree: the cookie counter is so that we can detect that
scanning is still required when an option has been flipped mid-way
through an existing scan.

Future possible work:
 - Propagate options to indirect extents when being changed
 - Add other IO path options - nr_replicas, ec, to rebalance_work so
   they can be applied in the background when they change
 - Add a counter, for bcachefs fs usage output, showing the pending
   amount of rebalance work: we'll probably want to do this after the
   disk space accounting rewrite (moving it to a new btree)

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-01 21:11:05 -04:00
Kent Overstreet
bbe682c767 bcachefs: Ensure devices are always correctly initialized
We can't mark device superblocks or allocate journal on a device that
isn't online.

That means we may need to do this on every mount, because we may have
formatted a new filesystem and then done the first mount
(bch2_fs_initialize()) in degraded mode.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-31 12:18:37 -04:00
Kent Overstreet
88dfe193bd bcachefs: bch2_btree_id_str()
Since we can run with unknown btree IDs, we can't directly index btree
IDs into fixed size arrays.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-31 12:18:37 -04:00
Kent Overstreet
b0b5bbf99f bcachefs: Don't run bch2_delete_dead_snapshots() unnecessarily
Be a bit more careful about when bch2_delete_dead_snapshots needs to
run: it only needs to run synchronously if we're running fsck, and it
only needs to run at all if we have snapshot nodes to delete or if fsck
has noticed that it needs to run.

Also:
  Rename BCH_FS_HAVE_DELETED_SNAPSHOTS -> BCH_FS_NEED_DELETE_DEAD_SNAPSHOTS

  Kill bch2_delete_dead_snapshots_hook(), move functionality to
  bch2_mark_snapshot()

  Factor out bch2_check_snapshot_needs_deletion(), to explicitly check
  if we need to be running snapshot deletion.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-31 12:18:37 -04:00
Kent Overstreet
795413c548 bcachefs: Fix drop_alloc_keys()
For consistency with the rest of the reconstruct_alloc option, we should
be skipping all alloc keys.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:16 -04:00
Kent Overstreet
4fc1f402c6 bcachefs: Fix another smatch complaint
This should be harmless, but initialize last_seq anyways.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:16 -04:00
Kent Overstreet
7dcf62c06d bcachefs: Make btree root read errors recoverable
The entire btree will be lost, but that is better than the entire
filesystem not being recoverable.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:15 -04:00
Kent Overstreet
6bd68ec266 bcachefs: Heap allocate btree_trans
We're using more stack than we'd like in a number of functions, and
btree_trans is the biggest object that we stack allocate.

But we have to do a heap allocatation to initialize it anyways, so
there's no real downside to heap allocating the entire thing.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:13 -04:00
Kent Overstreet
96dea3d599 bcachefs: Fix W=12 build errors
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:13 -04:00
Colin Ian King
6bf3766b52 bcachefs: Fix a handful of spelling mistakes in various messages
There are several spelling mistakes in error messages. Fix these.

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:13 -04:00
Kent Overstreet
aaad530ac6 bcachefs: BTREE_ID_logged_ops
Add a new btree for long running logged operations - i.e. for logging
operations that we can't do within a single btree transaction, so that
they can be resumed if we crash.

Keys in the logged operations btree will represent operations in
progress, with the state of the operation stored in the value.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:12 -04:00
Kent Overstreet
8e877caaad bcachefs: Split out snapshot.c
subvolume.c has gotten a bit large, this splits out a separate file just
for managing snapshot trees - BTREE_ID_snapshots.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:11 -04:00
Kent Overstreet
e0a2b00a42 bcachefs: Fix check_version_upgrade()
We were failing to upgrade to the latest compatible version - whoops.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:10 -04:00
Kent Overstreet
401585fe87 bcachefs: btree_journal_iter.c
Split out a new file from recovery.c for managing the list of keys we
read from the journal: before journal replay finishes the btree iterator
code needs to be able to iterate over and return keys from the journal
as well, so there's a fair bit of code here.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:10 -04:00
Kent Overstreet
a37ad1a3ab bcachefs: sb-clean.c
Pull code for bch_sb_field_clean out into its own file.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:10 -04:00
Kent Overstreet
1e81f89b02 bcachefs: Fix assorted checkpatch nits
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:10 -04:00
Kent Overstreet
e08e63e44e bcachefs: BCH_COMPAT_bformat_overflow_done no longer required
Awhile back, we changed bkey_format generation to ensure that the packed
representation could never represent fields larger than the unpacked
representation.

This was to ensure that bkey_packed_successor() always gave a sensible
result, but in the current code bkey_packed_successor() is only used in
a debug assertion - not for anything important.

This kills the requirement that we've gotten rid of those weird bkey
formats, and instead changes the assertion to check if we're dealing
with an old weird bkey format.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:09 -04:00
Kent Overstreet
0ed4ca146e bcachefs: Ensure topology repair runs
This fixes should_restart_for_topology_repair() - previously it was
returning false if the btree io path had already seleceted topology
repair to run, even if it hadn't run yet.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:09 -04:00
Kent Overstreet
ad52bac251 bcachefs: Log a message when running an explicit recovery pass
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:09 -04:00
Kent Overstreet
a1d1072fe7 bcachefs: Print out required recovery passes on version upgrade
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:09 -04:00
Kent Overstreet
b56b787c7d bcachefs: In debug mode, run fsck again after fixing errors
We want to ensure that fsck actually fixed all the errors it found - the
second fsck run should be clean.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:09 -04:00
Kent Overstreet
922bc5a037 bcachefs: Make topology repair a normal recovery pass
This adds bch2_run_explicit_recovery_pass(), for rewinding recovery and
explicitly running a specific recovery pass - this is a more general
replacement for how we were running topology repair before.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:08 -04:00
Kent Overstreet
ae2e13d780 bcachefs: bch2_run_explicit_recovery_pass()
This introduces bch2_run_explicit_recovery_pass() and uses it for when
fsck detects that we need to re-run dead snaphots cleanup, and makes
dead snapshot cleanup more like a normal recovery pass.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:08 -04:00
Kent Overstreet
a0f8faea5f bcachefs: fix_errors option is now a proper enum
Before, it was parsed as a bool but internally it was really an enum:
this lets us pass in all the possible values.

But we special case the option parsing: no supplied value is parsed as
FSCK_FIX_yes, to match the previous behaviour.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:07 -04:00
Kent Overstreet
f26c67f4a7 bcachefs: Snapshot depth, skiplist fields
This extents KEY_TYPE_snapshot to include some new fields:
 - depth, to indicate depth of this particular node from the root
 - skip[3], skiplist entries for quickly walking back up to the root

These are to improve bch2_snapshot_is_ancestor(), making it O(ln(n))
instead of O(n) in the snapshot tree depth.

Skiplist nodes are picked at random from the set of ancestor nodes, not
some fixed fraction.

This introduces bcachefs_metadata_version 1.1, snapshot_skiplists.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:06 -04:00
Kent Overstreet
065bd3356c bcachefs: Version table now lists required recovery passes
Now that we've got forward compatibility sorted out, we should be doing
more frequent version upgrades in the future.

To avoid having to run a full fsck for every version upgrade, this
improves the BCH_METADATA_VERSIONS() table to explicitly specify a
bitmask of recovery passes to run when upgrading to or past a given
version.

This means we can also delete PASS_UPGRADE().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:06 -04:00
Kent Overstreet
6619d84626 bcachefs: bch2_sb_maybe_downgrade(), bch2_sb_upgrade()
Add some new helpers, and fix upgrade/downgrade in bch2_fs_initialize().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:06 -04:00
Kent Overstreet
ba8eeae8ee bcachefs: bcachefs_metadata_version_major_minor
This introduces major/minor versioning to the superblock version number.
Major version number changes indicate incompatible releases; we can move
forward to a new major version number, but not backwards. Minor version
numbers indicate compatible changes - these add features, but can still
be mounted and used by old versions.

With the recent patches that make it possible to roll out new btrees and
key types without breaking compatibility, we should be able to roll out
most new features without incompatible changes.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:06 -04:00
Kent Overstreet
067d228bb0 bcachefs: Enumerate recovery passes
Recovery and fsck have many different passes/jobs to do, which always
run in the same order - but not all of them run all the time. Some are
for fsck, some for unclean shutdown, some for version upgrades.

This adds some new structure: a defined list of recovery passes that we
can run in a loop, as well as consolidating the log messages.

The main benefit is consolidating the "should run this recovery pass"
logic, as well as cleaning up the "this recovery pass has finished"
state; instead of having a bunch of ad-hoc state bits in c->flags, we've
now got c->curr_recovery_pass.

By consolidating the "should run this recovery pass" logic, in the
future on disk format upgrades will be able to say "upgrading to this
version requires x passes to run", instead of forcing all of fsck to
run.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:06 -04:00
Kent Overstreet
78328fec70 bcachefs: Stash journal replay params in bch_fs
For the upcoming enumeration of recovery passes, we need all recovery
passes to be called the same way - including journal replay.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:06 -04:00
Kent Overstreet
10a6ced2da bcachefs: Kill bch2_bucket_gens_read()
This folds bch2_bucket_gens_read() into bch2_alloc_read(), doing the
version check there.

This is prep work for enumarating all recovery passes: we need some
cleanup first to make calling all the recovery passes consistent.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:06 -04:00
Kent Overstreet
3045bb958a bcachefs: version_upgrade is now an enum
The version_upgrade parameter is now an enum, not a bool, and it's
persistent in the superblock:
 - compatible (default):	upgrade to the latest compatible version
 - incompatible:		upgrade to latest incompatible version
 - none

Currently all upgrades are incompatible upgrades, but the next release
will introduce major:minor versions.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:06 -04:00
Kent Overstreet
24964e1c5c bcachefs: BCH_SB_VERSION_UPGRADE_COMPLETE()
Version upgrades are not atomic operations: when we do a version upgrade
we need to update the superblock before we start using new features, and
then when the upgrade completes we need to update the superblock again.
This adds a new superblock field so we can detect and handle incomplete
version upgrades.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:06 -04:00
Kent Overstreet
c8b4534d82 bcachefs: Delete redundant log messages
Now that we have distinct error codes for different memory allocation
failures, the early init log messages are no longer needed.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:06 -04:00
Kent Overstreet
73bd774d28 bcachefs: Assorted sparse fixes
- endianness fixes
 - mark some things static
 - fix a few __percpu annotations
 - fix silent enum conversions

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:06 -04:00
Kent Overstreet
faa6cb6c13 bcachefs: Allow for unknown btree IDs
We need to allow filesystems with metadata from newer versions to be
mountable and usable by older versions.

This patch enables us to roll out new btrees without a new major version
number; we can now handle btree roots for unknown btree types.

The unknown btree roots will be retained, and fsck (including
backpointers) will check them, the same as other btree types.

We add a dynamic array for the extra, unknown btree roots, in addition
to the fixed size btree root array, and add new helpers for looking up
btree roots.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:05 -04:00
Kent Overstreet
e3804b55e4 bcachefs: bch2_version_to_text()
Add a new helper for printing out metadata versions in a standard
format.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:05 -04:00
Kent Overstreet
ec14fc6010 bcachefs: Kill JOURNAL_WATERMARK
This unifies JOURNAL_WATERMARK with BCH_WATERMARK; we're working towards
specifying watermarks once in the transaction commit path.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:05 -04:00
Kent Overstreet
1bb3c2a974 bcachefs: New error message helpers
Add two new helpers for printing error messages with __func__ and
bch2_err_str():
 - bch_err_fn
 - bch_err_msg

Also kill the old error strings in the recovery path, which were causing
us to incorrectly report memory allocation failures - they're not needed
anymore.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:04 -04:00
Kent Overstreet
e47a390aa5 bcachefs: Convert -ENOENT to private error codes
As with previous conversions, replace -ENOENT uses with more informative
private error codes.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:03 -04:00
Kent Overstreet
1c59b483a3 bcachefs: BTREE_ID_snapshot_tree
This adds a new btree which gets us a persistent per-snapshot-tree
identifier.

 - BTREE_ID_snapshot_trees
 - KEY_TYPE_snapshot_tree
 - bch_snapshot now has a field that points to a snapshot_tree

This is going to be used to designate one snapshot ID/subvolume out of a
given tree of snapshots as the "main" subvolume, so that we can do quota
accounting in that subvolume and not the rest.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:01 -04:00
Kent Overstreet
bcb79a51cb bcachefs: bch2_bkey_get_iter() helpers
Introduce new helpers for a common pattern:

  bch2_trans_iter_init();
  bch2_btree_iter_peek_slot();

 - bch2_bkey_get_iter_type() returns -ENOENT if it doesn't find a key of
   the correct type
 - bch2_bkey_get_val_typed() copies the val out of the btree to a
   (typically stack allocated) variable; it handles the case where the
   value in the btree is smaller than the current version of the type,
   zeroing out the remainder.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:00 -04:00
Kent Overstreet
2776369266 bcachefs: Add a cond_resched() call to journal_keys_sort()
We're just doing cpu work here and it could take awhile, a
cond_resched() is definitely needed.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:59 -04:00
Kent Overstreet
62a03559d6 bcachefs: Rip out code for storing backpointers in alloc keys
We don't store backpointers in alloc keys anymore, since we gained the
btree write buffer.

This patch drops support for backpointers in alloc keys, and revs the on
disk format version so that we know a fsck is required.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:59 -04:00
Brian Foster
349b1d832b bcachefs: use reservation for log messages during recovery
If we block on journal reservation attempting to log journal
messages during recovery, particularly for the first message(s)
before we start doing actual work, chances are the filesystem ends
up deadlocked.

Allow logged messages to use reserved journal space to mitigate this
problem. In the worst case where no space is available whatsoever,
this at least allows the fs to recognize that the journal is stuck
and fail the mount gracefully.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:59 -04:00
Kent Overstreet
26559553e4 bcachefs: Add a fallback when journal_keys doesn't fit in ram
We may end up in a situation where allocating the buffer for the sorted
journal_keys fails - but it would likely succeed, post compaction where
we drop duplicates.

We've had reports of this allocation failing, so this adds a slowpath to
do the compaction incrementally.

This is only a band-aid fix; we need to look at limiting the number of
keys in the journal based on the amount of system RAM.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:57 -04:00
Kent Overstreet
40a18fe273 bcachefs: Add error message for failing to allocate sorted journal keys
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:57 -04:00
Kent Overstreet
65d48e3525 bcachefs: Private error codes: ENOMEM
This adds private error codes for most (but not all) of our ENOMEM uses,
which makes it easier to track down assorted allocation failures.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:57 -04:00
Kent Overstreet
ac2ccddc26 bcachefs: Drop some anonymous structs, unions
Rust bindgen doesn't cope well with anonymous structs and unions. This
patch drops the fancy anonymous structs & unions in bkey_i that let us
use the same helpers for bkey_i and bkey_packed; since bkey_packed is an
internal type that's never exposed to outside code, it's only a minor
inconvenienc.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:55 -04:00
Kent Overstreet
27616a3124 bcachefs: Simplify ec stripes heap
Now that we have a separate data structure for tracking open stripes,
the stripes heap can track all existing stripes, which is a nice
simplification.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:54 -04:00
Kent Overstreet
80c3308578 bcachefs: Fragmentation LRU
Now that we have much more efficient updates to the LRU btree, this
patch adds a new LRU that indexes buckets by fragmentation.

This means copygc no longer has to scan every bucket to find buckets
that need to be evacuated.

Changes:
 - A new field in bch_alloc_v4, fragmentation_lru - this corresponds to
   the bucket's position in the fragmentation LRU. We add a new field
   for this instead of calculating it as needed because we may make the
   fragmentation LRU optional; this field indicates whether a bucket is
   on the fragmentation LRU.

   Also, zoned devices will introduce variable bucket sizes; explicitly
   recording the LRU position will be safer for them.

 - A new copygc path for using the fragmentation LRU instead of
   scanning every bucket and building up an in-memory heap.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:53 -04:00
Kent Overstreet
806c8a6aa8 bcachefs: Fix failure to read btree roots
If failed to read a btree root - or if we're not using a btree root,
because of the reconstruct_alloc option - make sure we update the
corresponding info for the key/level for the root on disk.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:52 -04:00
Kent Overstreet
83f33d6865 bcachefs: Rework lru btree
This patch changes how the LRU index works:

Instead of using KEY_TYPE_lru where the bucket the lru entry points to
is part of the value, this switches to KEY_TYPE_set and encoding the
bucket we refer to in the low bits of the key.

This means that we no longer have to check for collisions when inserting
LRU entries. We'll be making using of this in the next patch, which adds
a btree write buffer - a pure write buffer for btree updates, where
updates are appended to a simple array and then periodically sorted and
batch inserted.

This is a new on disk format version, and a forced upgrade.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:52 -04:00
Kent Overstreet
5250b74d55 bcachefs: bucket_gens btree
To improve mount times, add a btree for just bucket gens, 256 of them
per key: this means we'll have to scan drastically less metadata at
startup.

This adds
 - trigger for keeping it in sync with the all btree
 - initialization code, for filesystems from previous versions
 - new path for reading bucket gens
 - new fsck code

And a new on disk format version.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:51 -04:00
Kent Overstreet
8dd69d9f64 bcachefs: KEY_TYPE_inode_v3, metadata_version_inode_v3
Move bi_size and bi_sectors into the non-varint portion of the inode, so
that the write path can update them without going through the relatively
expensive unpack/pack operations.

Other changes:
 - Add a field for the offset of the varint section, so we can add new
   non-varint fields without needing a new inode type, like alloc_v3
 - Move bi_mode into the flags field, so that the varint section can be
   u64 aligned

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:51 -04:00
Kent Overstreet
47b323a0b0 bcachefs: Start snapshots before bch2_gc()
bch2_gc may require snapshots to be started - the repair path when
checking the reflink btree may do updates to the extents btree.

This moves bch2_fs_initialize_subvolumes() and bch2_fs_snapshots_start()
to before bch2_gc() - since we haven't gone RW yet, the updates in
bch2_fs_initialize_subvolumes() are done via the journal replay keys
list, so it's fine to do this before bch2_gc().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:51 -04:00
Kent Overstreet
a8c752bb1d bcachefs: New on disk format: Backpointers
This patch adds backpointers: we now have a reverse index from device
and offset on that device (specifically, offset within a bucket) back to
btree nodes and (non cached) data extents.

The first 40 backpointers within a bucket are stored in the alloc key;
after that backpointers spill over to the next backpointers btree. This
is to help avoid performance regressions from additional btree updates
on large streaming workloads.

This patch adds all the code for creating, checking and repairing
backpointers. The next patch in the series is going to use backpointers
for copygc - finally getting rid of the need to scan all extents to do
copygc.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:50 -04:00
Kent Overstreet
f2b542ba42 bcachefs: Go RW before check_alloc_info()
It's possible to do btree updates before going RW by adding them to the
list of updates for journal replay to do, but this is limited by what
fits in RAM. This patch switches the second alloc info phase to run
after going RW - btree_gc has already ensured the alloc btree itself is
correct - and tweaks the allocation path to deal with the potential
small inconsistencies.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:50 -04:00
Kent Overstreet
5f5c746617 bcachefs: Start copygc when first going read-write
In the distant past, it wasn't possible to start copygc until after
journal replay had finished. Now, the btree iterator code overlays keys
from the journal, so there's no reason not to start it earlier - and it
solves a rare deadlock.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:50 -04:00
Kent Overstreet
858536c7ce bcachefs: Convert EROFS errors to private error codes
More error code improvements - this gets us more useful error messages.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:49 -04:00
Kent Overstreet
5bbe3f2d0e bcachefs: Log more messages in the journal
This patch

 - Adds a mechanism for queuing up journal entries prior to the journal
   being started, which will be used for early journal log messages

 - Adds bch2_fs_log_msg() and improves bch2_trans_log_msg(), which now
   take format strings. bch2_fs_log_msg() can be used before or after
   the journal has been started, and will use the appropriate mechanism.

 - Deletes the now obsolete bch2_journal_log_msg()

 - And adds more log messages to the recovery path - messages for
   journal/filesystem started, journal entries being blacklisted, and
   journal replay starting/finishing.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:48 -04:00
Kent Overstreet
67ace27246 bcachefs: Add a missing bch2_err_str() call
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:48 -04:00
Kent Overstreet
1ba8a796b4 bcachefs: Recover from blacklisted journal entries
If it so happens that we crash while dirty, meaning we don't have the
superblock clean section, and we erroneously mark a journal entry we
wrote as blacklisted, we won't be able to recover.

This patch fixes this by adding a fallback: if we've got no superblock
clean section, and no non-ignored journal entries, we try the most
recent ignored journal entry.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:48 -04:00
Kent Overstreet
4f948723ed bcachefs: Fix bch2_journal_keys_peek_upto()
bch2_journal_keys_peek_upto() was comparing against btree_id & level
incorrectly - fix this by using __journal_key_cmp().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:47 -04:00
Kent Overstreet
e0de429a3a bcachefs: Don't error out when just reading the journal
This tweaks the recovery and journal paths so that we don't error out
before we need to: the list_journal command should work, even if we
wouldn't be able to replay successfully.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:47 -04:00
Kent Overstreet
e88a75ebe8 bcachefs: New bpos_cmp(), bkey_cmp() replacements
This patch introduces
 - bpos_eq()
 - bpos_lt()
 - bpos_le()
 - bpos_gt()
 - bpos_ge()

and equivalent replacements for bkey_cmp().

Looking at the generated assembly these could probably be improved
further, but we already see a significant code size improvement with
this patch.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:47 -04:00
Kent Overstreet
b2d1d56b1d bcachefs: Fixes for building in userspace
- Marking a non-static function as inline doesn't actually work and is
   now causing problems - drop that

 - Introduce BCACHEFS_LOG_PREFIX for when we want to prefix log messages
   with bcachefs (filesystem name)

 - Userspace doesn't have real percpu variables (maybe we can get this
   fixed someday), put an #ifdef around bch2_disk_reservation_add()
   fastpath

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:46 -04:00
Kent Overstreet
a101957649 bcachefs: More style fixes
Fixes for various checkpatch errors.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:45 -04:00
Kent Overstreet
c167f9e541 bcachefs: Journal keys overlay fixes
- In the btree iterator code that overlays keys from the journal, we
   were incorrectly specifying level=0 instead of the btree_path's
   current level in a few places
 - When we didn't do journal replay, we shouldn't free the journal keys:
   this fixes cmd_list and cmd_dump, which run in norecovery mode

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:45 -04:00