360 Commits

Author SHA1 Message Date
Kent Overstreet
94426e4201 bcachefs: opts.casefold_disabled
Add an option for completely disabling casefolding on a filesystem, as a
workaround for overlayfs.

This should only be needed as a temporary workaround, until the
overlayfs fix arrives.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-01 19:33:46 -04:00
Alan Huang
56be92c63f bcachefs: Fix pool->alloc NULL pointer dereference
btree_interior_update_pool has not been initialized before the
filesystem becomes read-write, thus mempool_alloc in bch2_btree_update_start
will trigger pool->alloc NULL pointer dereference in mempool_alloc_noprof

Reported-by: syzbot+2f3859bd28f20fa682e6@syzkaller.appspotmail.com
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-16 19:03:52 -04:00
Alan Huang
0acb385ec1 bcachefs: Fix possible console lock involved deadlock
Link: https://lore.kernel.org/all/6822ab02.050a0220.f2294.00cb.GAE@google.com/T/
Reported-by: syzbot+2c3ef91c9523c3d1a25c@syzkaller.appspotmail.com
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-11 23:21:30 -04:00
Kent Overstreet
09b9c72bd4 bcachefs: bch_err_throw()
Add a tracepoint for any time we return an error and unwind.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-02 12:16:35 -04:00
Kent Overstreet
011d644b76 bcachefs: subvol_inum_eq()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:15:08 -04:00
Kent Overstreet
247abee6ae bcachefs: btree_trans_subbuf
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:15:06 -04:00
Kent Overstreet
15f969326e bcachefs: Improve bucket_bitmap code
Add some more helpers, and mismatches is now a superset of the empty
bitmap - simplifies most checks.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:15:04 -04:00
Kent Overstreet
06977ea82b bcachefs: Run recovery passes asynchronously
When we request a recovery pass to be run online, i.e. not during
recovery, if it's an online pass it'll now be run in the background,
instead of waiting for the next mount.

To avoid situations where recovery passes are running continuously, this
also includes ratelimiting: if the RUN_RECOVERY_PASS_ratelimit flag is
passed, the pass may be deferred until later - depending on the runtime
and last run stats in the recovery_passes superblock section.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:15:04 -04:00
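The ratelimiting described in "Run recovery passes asynchronously" can be pictured with a small userspace sketch. The commit only says the decision depends on the runtime and last-run stats; the struct, the field names, and the cooldown rule below (wait a multiple of the previous runtime) are assumptions made for illustration and are not taken from the bcachefs source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-pass stats; not the on-disk recovery_passes format. */
struct pass_stats {
	uint64_t last_run;	/* time the pass last finished, in seconds; 0 = never */
	uint64_t last_runtime;	/* how long that run took, in seconds */
};

/* Assumed policy: after a run of R seconds, wait at least R * COOLDOWN_FACTOR. */
#define COOLDOWN_FACTOR 100

static bool pass_should_defer(const struct pass_stats *s, uint64_t now,
			      bool ratelimit)
{
	assert(s);

	if (!ratelimit)
		return false;	/* caller did not ask for ratelimiting */
	if (!s->last_run)
		return false;	/* never run before: run right away */

	return now - s->last_run < s->last_runtime * COOLDOWN_FACTOR;
}
```

Tying the cooldown to the previous runtime means cheap passes can rerun soon while expensive ones are spaced out; the real policy may well differ.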
Kent Overstreet
68708efcac bcachefs: struct bch_fs_recovery
bch_fs has gotten obnoxiously big, let's start organizing things a bit
better.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:15:03 -04:00
Kent Overstreet
a78a11900e bcachefs: journal path now uses discard_opt_enabled()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:15:01 -04:00
Kent Overstreet
367cad0966 bcachefs: Rename fsck_running, recovery_running flags
Slightly more readable.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:56 -04:00
Kent Overstreet
34aeb820f9 bcachefs: debug_check_bset_lookups
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:55 -04:00
Kent Overstreet
c4e3889440 bcachefs: debug_check_iterators no longer requires BCACHEFS_DEBUG
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:55 -04:00
Kent Overstreet
110bb6cb8b bcachefs: debug_check_btree_locking modparam
Don't put btree locking asserts behind CONFIG_BCACHEFS_DEBUG, put them
behind a module parameter.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:54 -04:00
Kent Overstreet
2842515575 bcachefs: Debug params are now static_keys
We'd like users to be able to debug without building custom kernels, so
this will help us get rid of CONFIG_BCACHEFS_DEBUG, at least for most
things.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:54 -04:00
Kent Overstreet
001c1d146f bcachefs: online_fsck_mutex -> run_recovery_passes_lock
Prep work for automatically running recovery passes asynchronously.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:53 -04:00
Kent Overstreet
20a4b7f3b8 bcachefs: recovery_passes_types.h -> recovery_passes_format.h
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:52 -04:00
Kent Overstreet
13ffcbae86 bcachefs: "buckets with backpointer mismatches" now allocated on demand
More self healing work: we're going to be calling
check_bucket_backpointer_mismatch() at runtime, outside of fsck.

Then when we need to we'll kick off the full
check_extents_to_backpointers recovery pass.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:52 -04:00
Kent Overstreet
7f9dada701 bcachefs: delete dead items in bch_dev
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:51 -04:00
Kent Overstreet
15dbd0d814 bcachefs: snapshot delete progress indicator
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:40 -04:00
Kent Overstreet
aff2b6a7fc bcachefs: provide unlocked version of run_explicit_recovery_pass_persistent
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:36 -04:00
Kent Overstreet
156d9e8341 bcachefs: Emit a single log message on data read error
Instead of emitting a message immediately when we get an error in the
read path, and then another at the end if we successfully retry - emit
one single log message before returning from bch2_rbio_retry().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:32 -04:00
Kent Overstreet
0499a82b18 bcachefs: Async object debugging
Debugging infrastructure for async objs: this lets us easily create
fast_lists for various object types so they'll be visible in debugfs.

Add new object types to the BCH_ASYNC_OBJS_TYPES() enum, and drop a
pretty-printer wrapper in async_objs.c.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:29 -04:00
Kent Overstreet
cca2c0d224 bcachefs: bch_dev.io_ref -> enumerated_ref
Convert device IO refs to enumerated_refs, for easier debugging of
refcount issues.

Simple conversion: enumerate all users and convert to the new helpers.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:28 -04:00
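The idea behind the enumerated_ref conversion in "bch_dev.io_ref -> enumerated_ref" is that every user of a refcount gets its own counter, so a leaked ref can be traced to a specific user. The following is a minimal userspace sketch of that idea only: the type, the enum values, and the helper names are hypothetical, and the counters are plain integers rather than the atomic or percpu primitives a kernel implementation would need.

```c
#include <assert.h>

/* Hypothetical list of ref users; the real enum lives in the bcachefs source. */
enum io_ref_user {
	IO_REF_read,
	IO_REF_journal_write,
	IO_REF_discard,
	IO_REF_NR,
};

struct enumerated_ref_sketch {
	long refs[IO_REF_NR];	/* one counter per user instead of a single total */
};

static void ref_get(struct enumerated_ref_sketch *r, enum io_ref_user u)
{
	r->refs[u]++;
}

static void ref_put(struct enumerated_ref_sketch *r, enum io_ref_user u)
{
	assert(r->refs[u] > 0);	/* a put without a matching get is a bug */
	r->refs[u]--;
}

/* Return the first user still holding a ref, or -1 if none are outstanding. */
static int ref_first_outstanding(const struct enumerated_ref_sketch *r)
{
	for (int u = 0; u < IO_REF_NR; u++)
		if (r->refs[u])
			return u;
	return -1;
}
```

When shutdown hangs waiting for refs to drain, a helper like ref_first_outstanding() can name the user that is still holding on, which a single shared counter cannot do.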
Kent Overstreet
c9b1d94a21 bcachefs: bch_fs.writes -> enumerated_refs
Drop the single-purpose write ref code in bcachefs.h, and convert to
enumerated refs.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:27 -04:00
Kent Overstreet
9fa4a8a3bd bcachefs: for_each_online_member_rcu()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:26 -04:00
Kent Overstreet
ebf561b208 bcachefs: print_str_as_lines() -> print_str()
bch2_print_string_as_lines() is a low level helper that allows messages
longer than 1k to be printed without truncation.

But we should always be printing with the helpers that take a filesystem
object: if we're in fsck, they direct output to the userspace process
controlling fsck instead of the dmesg log.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:18 -04:00
Kent Overstreet
c02e5b5728 bcachefs: Single device mode
Single device filesystems are now identified by the block device name,
not the UUID - and single device filesystems with the same UUID can be
mounted simultaneously, without any special options.

This allocates a new bit in the superblock, BCH_SB_MULTI_DEVICE, which
indicates whether a filesystem has ever been multi device.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:15 -04:00
Kent Overstreet
d4d71b58e5 bcachefs: RO mounts now use less memory
Defer memory allocations only needed in RW mode until we actually go RW.

This is part of improved support for RO images.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:04 -04:00
Kent Overstreet
2767f4f258 bcachefs: btree_io_complete_wq -> btree_write_complete_wq
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:13:56 -04:00
Kent Overstreet
d02755b8c5 bcachefs: trace bch2_trans_kmalloc()
We're occasionally seeing the WARN_ON() for bump allocator usage
exceeding BTREE_TRANS_MEM_MAX; add some tracing so we can see what's
going on.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:13:27 -04:00
Kent Overstreet
c3b02e6d67 bcachefs: Log message when incompat version requested but not enabled
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-15 11:35:05 -04:00
Eric Biggers
4bf4b5046d bcachefs: use library APIs for ChaCha20 and Poly1305
Just use the ChaCha20 and Poly1305 libraries instead of the clunky
crypto API.  This is much simpler.  It is also slightly faster, since
the libraries provide more direct access to the same
architecture-optimized ChaCha20 and Poly1305 code.

I've tested that existing encrypted bcachefs filesystems can continue
to be accessed with this patch applied.

Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-06 19:33:53 -04:00
Kent Overstreet
955ba7b5ea bcachefs: bch_dev_usage_full
All the fastpaths that need device usage don't need the sector totals or
fragmentation, just bucket counts.

Split bch_dev_usage up into two different versions, the normal one with
just bucket counts.

This is also a stack usage improvement, since we have a bch_dev_usage on
the stack in the allocation path.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-02 10:24:34 -04:00
Kent Overstreet
dcffc3b1ae bcachefs: Split up bch_dev.io_ref
We now have separate per device io_refs for read and write access.

This fixes a device removal bug where the discard workers were still
running while we're removing alloc info for that device.

It's also a bit of hardening; we no longer allow writes to devices that
are read-only.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-02 10:24:34 -04:00
Kent Overstreet
80be08cdb5 bcachefs: Filesystem discard option now propagates to devices
The discard option is special, because it's both a filesystem and a
device option.

When set at the filesystem level, it's supposed to propagate to (if set
persistently via sysfs) or override (if set non-persistently as a mount
option) the devices - that now works correctly.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:35 -04:00
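The two behaviours described in "Filesystem discard option now propagates to devices" (propagate when set persistently, override when given as a mount option) can be sketched as follows. The structs and function names are hypothetical and exist only to show the distinction; the kernel's option handling is more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct dev_sketch {
	bool discard_persistent;	/* value stored for the device */
};

struct fs_sketch {
	struct dev_sketch *devs;
	size_t nr_devs;
	bool mount_discard_set;		/* was discard given as a mount option? */
	bool mount_discard;
};

/* Persistent set (e.g. via sysfs): write the value through to every device. */
static void fs_set_discard_persistent(struct fs_sketch *fs, bool v)
{
	for (size_t i = 0; i < fs->nr_devs; i++)
		fs->devs[i].discard_persistent = v;
}

/* Effective value: a mount option overrides without touching stored state. */
static bool dev_discard_enabled(const struct fs_sketch *fs, size_t dev)
{
	assert(dev < fs->nr_devs);

	return fs->mount_discard_set
		? fs->mount_discard
		: fs->devs[dev].discard_persistent;
}
```

Keeping the mount-time value separate from the stored per-device value is what lets the override disappear again on the next mount.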
Eric Biggers
71fbb0b86e bcachefs: use sha256() instead of crypto_shash API
Just use sha256() instead of the clunky crypto API.  This is much
simpler.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:34 -04:00
Kent Overstreet
94373026d9 bcachefs: We no longer read stripes into memory at startup
And the stripes heap gets deleted.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:16 -04:00
Kent Overstreet
981e380144 bcachefs: Kick devices out after too many write IO errors
We're improving our handling of write errors - we shouldn't write
degraded data just because a write failed once, we should retry it (on
other devices, if possible).

But for this to work, we need to kick devices out when they're only
returning errors - otherwise those retries will loop infinitely.

This adds a configurable timeout - if writes are failing for too long,
we'll set that device read-only.

In the future we should also implement more tracking and another knob
for an "allowed error rate", so that we can kick out drives that are
acting "unhealthy".

Another thing we'll want is a mechanism (likely in userspace) for
bringing a device back in after a transient error - perhaps a cable was
jiggled, or there was a controller reset.

After transient errors we also need a mechanism to walk (from the
journal) recent btree updates that weren't flushed to that device and
treat them as "degraded", since unflushed data may well not have been
written. Out of scope for this patch, but becoming relevant.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:16 -04:00
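The timeout described in "Kick devices out after too many write IO errors" might look roughly like the sketch below: track how long writes to a device have been failing and set it read-only once that exceeds a configured limit. All names are hypothetical, and the rule that a single successful write resets the window is an assumption, not something the commit message states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct dev_write_health {
	bool failing;		/* currently in a run of failed writes? */
	uint64_t failing_since;	/* start of that run, in seconds */
	bool read_only;
};

/* Record one write completion; `timeout` is the configurable limit in seconds. */
static void dev_write_done(struct dev_write_health *d, bool ok,
			   uint64_t now, uint64_t timeout)
{
	assert(d);

	if (ok) {
		d->failing = false;	/* assumed: any success ends the run */
		return;
	}
	if (!d->failing) {
		d->failing = true;
		d->failing_since = now;
	}
	if (now - d->failing_since >= timeout)
		d->read_only = true;	/* stop retries from looping forever */
}
```

Once the device is read-only, retried writes can only land on the remaining devices, which is what bounds the retry loop the commit message mentions.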
Joshua Ashton
d37c14ac6f bcachefs: bcachefs_metadata_version_casefolding
This patch implements support for case-insensitive file name lookups
in bcachefs.

The implementation uses the same UTF-8 lowering and normalization that
ext4 and f2fs use.

More information is provided in Documentation/bcachefs/casefolding.rst

Compatibility notes:

This uses the new versioning scheme for incompatible features where an
incompatible feature is tied to a version number: the superblock says
"we may use incompat features up to x" and "incompat features up to x
are in use", disallowing mounting by previous versions.

Additionally, an old-style incompat feature bit is used, so that
kernels without utf8 casefolding support know if casefolding
specifically is in use and they're allowed to mount.

Signed-off-by: Joshua Ashton <joshua@froggi.es>
Cc: André Almeida <andrealmeid@igalia.com>
Cc: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:15 -04:00
Kent Overstreet
fd49882f12 bcachefs: Add time_stat for btree writes
We have other metadata IO types covered, this was missing.

Note: this includes the time until completion, i.e. including parent
pointer update.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:13 -04:00
Kent Overstreet
898bda5b72 bcachefs: Increase JOURNAL_BUF_NR
Increase journal pipelining.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:13 -04:00
Kent Overstreet
3e2ad29865 bcachefs: bch2_btree_node_scrub()
Add a function for scrubbing btree nodes - reading them in, and kicking
off a rewrite if there's an error.

The btree_node_read_done() checks have to be duplicated because we're
not using a pointer to a struct btree - the btree node might already be
in cache, and we need to check a specific replica, which might not be
the one we previously read from.

This will be used in the next patch implementing high-level scrub.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:12 -04:00
Kent Overstreet
df448ca355 bcachefs: bcachefs_metadata_version_persistent_inode_cursors
Persistent cursors for inode allocation.

A free inodes btree would add substantial overhead to inode allocation
and freeing - a "next num to allocate" cursor is always going to be
faster.

We just need it to be persistent, to avoid scanning the inodes btree
from the start on startup.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-09 23:38:41 -05:00
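The "next num to allocate" cursor from the persistent_inode_cursors commit can be illustrated with a small sketch. Here the inodes btree is replaced by a plain in-memory array of flags and the cursor is an ordinary struct; both are stand-ins chosen for illustration, as are the function name and the wraparound behaviour.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* In the real filesystem this would be saved on disk so it survives a remount. */
struct inode_cursor {
	uint64_t next;
};

/*
 * Allocate from [min, max), starting at the cursor and wrapping once.
 * in_use[] is indexed by inode number. Returns 0 if the range is full.
 * The caller is expected to mark the returned number as used.
 */
static uint64_t inode_alloc(struct inode_cursor *c, const bool *in_use,
			    uint64_t min, uint64_t max)
{
	assert(min > 0 && min < max);

	if (c->next < min || c->next >= max)
		c->next = min;

	for (uint64_t i = 0; i < max - min; i++) {
		uint64_t inum = c->next;

		c->next = inum + 1 < max ? inum + 1 : min;
		if (!in_use[inum])
			return inum;
	}
	return 0;
}
```

Because the cursor already points past recently allocated numbers, the common case finds a free number on the first probe; without persistence the scan would restart from min after every mount.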
Kent Overstreet
a36d8f0e0e bcachefs: BCH_SB_VERSION_INCOMPAT
We've been getting away from feature bits: they don't have any kind of
ordering, and thus it's possible for people to enable weird combinations
of features that were never tested or intended to be run.

Much better to just give every new feature, compatible or incompatible,
a version number.

Additionally, we probably won't ever rev the major version number: major
version numbers represent incompatible versions, but that doesn't really
fit with how we actually roll out incompatible features - we need a
better way of rolling out incompatible features.

So, this patch adds two new superblock fields:
- BCH_SB_VERSION_INCOMPAT
- BCH_SB_VERSION_INCOMPAT_ALLOWED

BCH_SB_VERSION_INCOMPAT_ALLOWED indicates that incompatible features up
to version number x are allowed to be used without user prompting, but
it does not by itself deny old versions from mounting.

BCH_SB_VERSION_INCOMPAT does deny old versions from mounting, and must
be <= BCH_SB_VERSION_INCOMPAT_ALLOWED.

BCH_SB_VERSION_INCOMPAT will only be set when a codepath attempts to use
an incompatible feature, so as to not unnecessarily break compatibility
with old versions.

bch2_request_incompat_feature() is the new interface to check if an
incompatible feature may be used.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-29 13:30:39 -05:00
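The relationship between the two superblock fields introduced by the BCH_SB_VERSION_INCOMPAT commit can be summarised in a short sketch. It follows the rules stated in the message (incompat must stay <= incompat_allowed, and is raised only when a feature is first used), but the struct, the bool return type, and the function names are simplifications; this is not the kernel's bch2_request_incompat_feature().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct sb_versions {
	uint16_t version_incompat;		/* in use: older versions must not mount */
	uint16_t version_incompat_allowed;	/* ceiling for unprompted use */
};

/* May a codepath start using the feature introduced at `feature_version`? */
static bool request_incompat_feature(struct sb_versions *sb,
				     uint16_t feature_version)
{
	assert(sb->version_incompat <= sb->version_incompat_allowed);

	if (feature_version > sb->version_incompat_allowed)
		return false;	/* not permitted without user opt-in */

	/* Raise the floor only now, on first actual use. */
	if (feature_version > sb->version_incompat)
		sb->version_incompat = feature_version;
	return true;
}

/* An implementation that understands versions up to `supported` may mount. */
static bool can_mount(const struct sb_versions *sb, uint16_t supported)
{
	return supported >= sb->version_incompat;
}
```

Note that raising version_incompat_allowed alone never locks anyone out; only a successful feature request does.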
Kent Overstreet
c738866e47 bcachefs: check_extents_to_backpointers() now only checks buckets with mismatches
Instead of walking every extent and every backpointer it points to,
first sum up backpointers in each bucket and check for mismatches, and
only look for missing backpointers if mismatches were detected, and only
check extents in those buckets.

This is a major fsck scalability improvement, since the two backpointers
passes (backpointers -> extents and extents -> backpointers) are the
most expensive fsck passes by far.

Additionally, to speed up the upgrade for backpointer bucket gens, or in
situations when we have to rebuild alloc info, add a special case for
when no backpointers are found in a bucket - don't check each individual
backpointer (in particular, avoiding the write buffer flushes), just
recreate them.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-29 13:30:39 -05:00
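The two-phase approach in the check_extents_to_backpointers() commit (sum backpointers per bucket first, then look closer only where the sums disagree) can be sketched like this. The commit message does not say what the sums are compared against; the sketch assumes a per-bucket sector count from alloc info, and every type and function name here is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct backpointer {
	size_t bucket;		/* bucket the backpointer lives in */
	uint32_t sectors;	/* sectors it accounts for */
};

enum bucket_action {
	BUCKET_OK,		/* sums agree: skip this bucket entirely */
	BUCKET_CHECK_EXTENTS,	/* mismatch: check extents in this bucket */
	BUCKET_RECREATE,	/* no backpointers at all: just recreate them */
};

/* Phase 1: accumulate backpointer sectors per bucket. */
static void sum_backpointers(const struct backpointer *bps, size_t nr_bps,
			     uint32_t *sums, size_t nr_buckets)
{
	for (size_t i = 0; i < nr_bps; i++) {
		assert(bps[i].bucket < nr_buckets);
		sums[bps[i].bucket] += bps[i].sectors;
	}
}

/* Phase 2: decide per bucket how much work is needed. */
static enum bucket_action classify_bucket(uint32_t alloc_sectors,
					  uint32_t bp_sectors)
{
	if (bp_sectors == alloc_sectors)
		return BUCKET_OK;
	if (bp_sectors == 0)
		return BUCKET_RECREATE;
	return BUCKET_CHECK_EXTENTS;
}
```

On a healthy filesystem nearly every bucket classifies as BUCKET_OK, so the expensive per-extent work is limited to the few buckets that need it.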
Kent Overstreet
49f2d18263 bcachefs: Kill unnecessary mark_lock usage
We can't hold mark_lock while calling fsck_err() - that's a deadlock,
mark_lock is meant to be a leaf node lock.

It's also unnecessary for gc_bucket() and bucket_gen(); rcu suffices
since the bucket_gens array describes its size, and we can't race with
device removal or resize during gc/fsck since that takes state lock.

Reported-by: syzbot+38641fcbda1aaffefdd4@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:22 -05:00
Kent Overstreet
c67fab0774 bcachefs: Make sure __bch2_run_explicit_recovery_pass() signals to rewind
We should always signal to rewind if the requested pass hasn't been run,
even if called multiple times.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:22 -05:00
Kent Overstreet
c1f618f4f7 bcachefs: bch2_async_btree_node_rewrites_flush()
Add a method to flush btree node rewrites at the end of recovery, to
ensure that corrected errors are persisted.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:21 -05:00
Kent Overstreet
2cd85fea49 bcachefs: Don't recurse in check_discard_freespace_key
When calling check_discard_freespace_key from the allocator, we can't
repair without recursing - run it asynchronously instead.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:21 -05:00