Commit Graph

5280 Commits

Author SHA1 Message Date
Kent Overstreet
c21f41f690 bcachefs: bch2_dirent_to_text() shows casefolded dirents
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:36 -04:00
Kent Overstreet
cd3cdb1ef7 bcachefs: Single err message for btree node reads
Like we just did with the data read path, emit a single error message
per btree node reads, nicely formatted, with all the actions we took
grouped together.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:35 -04:00
Kent Overstreet
9c2472658b bcachefs: bch2_mark_btree_validate_failure()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:35 -04:00
Kent Overstreet
d31f155964 bcachefs: bch2_fsck_err_opt()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:34 -04:00
Kent Overstreet
600a9207c8 bcachefs: Plumb printbuf through bch2_btree_lost_data()
Part of the ongoing project to improve error messages by building them
up in printbufs and emitting them all at once, so that we can easily see
what events are related in the log.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:34 -04:00
Kent Overstreet
300904700f bcachefs: kill bch2_run_explicit_recovery_pass_persistent()
No longer has users, so we can kill it and rename
bch2_run_explicit_recovery_pass_persistent_locked().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:33 -04:00
Kent Overstreet
3aecbb01a1 bcachefs: Remove redundant calls to btree_lost_data()
The btree node read path calls this before returning the read error.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:33 -04:00
Kent Overstreet
3be132f93c bcachefs: bch2_btree_lost_data() now handles snapshots tree
We have a consolidated places for "this btree lost data, run this
repair", so use it.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:33 -04:00
Kent Overstreet
b3bbd47f83 bcachefs: Kill redundant error message in topology repair
The btree node read path already logs btree node read errors, this isn't
needed.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:32 -04:00
Kent Overstreet
156d9e8341 bcachefs: Emit a single log message on data read error
Instead of emitting a message immediately when we get an error in the
read path, and then another at the end if we successfully retry - emit
one single log message before returning from bch2_rbio_retry().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:32 -04:00
Kent Overstreet
353b89c6e6 bcachefs: bch2_io_failures_to_text()
Pretty printer for bch_io_failures, to be used for better read error
messages.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:31 -04:00
Kent Overstreet
dbc18c97f1 bcachefs: print_string_as_lines: avoid printing empty line
If the final line in in the message to be printed is blang, don't print
it.

This happens with indented printbufs - after a newline we emit spaces up
to the indent level.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:31 -04:00
Kent Overstreet
41e51769b8 bcachefs: Make various async objs visible in debugfs
Add async objs list for
- promote_op
- bch_read_bio
- btree_read_bio
- btree_write_bio

This gets us introspection on in-flight async ops, and because under the
hood it uses fast_lists (percpu slot buffer on top of a radix tree),
it'll be fast enough to enable in production.

This will be very helpful for debugging "something got stuck" issues,
which have been cropping up from time to time (in the CI, especially
with folio writeback).

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:30 -04:00
Kent Overstreet
0499a82b18 bcachefs: Async object debugging
Debugging infrastructure for async objs: this lets us easily create
fast_lists for various object types so they'll be visible in debugfs.

Add new object types to the BCH_ASYNC_OBJS_TYPES() enum, and drop a
pretty-printer wrapper in async_objs.c.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:29 -04:00
Kent Overstreet
d49bafdc5d bcachefs: fast_list
A fast "list" data structure, which is actually a radix tree, with an
IDA for slot allocation and a percpu buffer on top of that.

Items cannot be added or moved to the head or tail, only added at some
(arbitrary) position and removed. The advantage is that adding, removing
and iteration is generally lockless, only hitting the lock in ida when
the percpu buffer is full or empty.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:29 -04:00
Kent Overstreet
989b4c375a bcachefs: bch2_read_bio_to_text
Pretty printer for struct bch_read_bio.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:29 -04:00
Kent Overstreet
5f0de475f9 bcachefs: bch2_bio_to_text()
Pretty printer for struct bio, to be used for async object debugging.

This is pretty minimal, we'll add more to it as we discover what we
need.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:28 -04:00
Kent Overstreet
cca2c0d224 bcachefs: bch_dev.io_ref -> enumerated_ref
Convert device IO refs to enumerated_refs, for easier debugging of
refcount issues.

Simple conversion: enumerate all users and convert to the new helpers.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:28 -04:00
Kent Overstreet
c9b1d94a21 bcachefs: bch_fs.writes -> enumerated_refs
Drop the single-purpose write ref code in bcachefs.h, and convert to
enumarated refs.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:27 -04:00
Kent Overstreet
f5241e4127 bcachefs: enumerated_ref.c
Factor out the debug code for rw filesystem refs into a small library.

In release mode an enumerated ref is a normal percpu refcount, but in
debug mode all enumerated users of the ref get their own atomic_long_t
ref - making it much easier to chase down refcount usage bugs for when a
refcount has many users.

For debugging, we have enumerated_ref_to_text(), which prints the
current value of each different user.

Additionally, in debug mode enumerated_ref_stop() has a 10 second
timeout, after which it will dump outstanding refcounts.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:27 -04:00
Kent Overstreet
6d67de1079 bcachefs: for_each_rw_member_rcu()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:27 -04:00
Kent Overstreet
e14e06e91d bcachefs: __bch2_fs_read_write() no longer depends on io_ref
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:26 -04:00
Kent Overstreet
9fa4a8a3bd bcachefs: for_each_online_member_rcu()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:26 -04:00
Kent Overstreet
2483dd1243 bcachefs: recalc_capacity() no longer depends on io_ref
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:25 -04:00
Kent Overstreet
c53be0ffaa bcachefs: bch2_target_to_text() no longer depends on io_ref
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:25 -04:00
Kent Overstreet
834f9475aa bcachefs: bch2_check_rebalance_work()
Add a pass for checking the rebalance_work btree.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:24 -04:00
Alan Huang
09279bba72 bcachefs: Kill dead code
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:24 -04:00
Kent Overstreet
62095464e9 bcachefs: Fix struct with flex member ABI warning
This pops up when buliding in userspace.

The structs aren't actually variable length, but no way to tell the
compiler that...

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:24 -04:00
Kent Overstreet
fe27298b92 bcachefs: bch2_move_data_btree() can now walk roots
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:22 -04:00
Kent Overstreet
3484840ece bcachefs: bch2_move_data_btree() can move btree nodes
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:22 -04:00
Kent Overstreet
7a274285d3 bcachefs: plumb btree_id through move_pred_fd
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:22 -04:00
Kent Overstreet
f3c8eaf7a1 bcachefs: Plumb target parameter through btree_node_rewrite_pos()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:21 -04:00
Kent Overstreet
ecedc87cfa bcachefs: export bch2_move_data_phys()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:21 -04:00
Kent Overstreet
0ca375b177 bcachefs: BCH_MEMBER_RESIZE_ON_MOUNT
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:21 -04:00
Kent Overstreet
530112d88e bcachefs: BCH_FEATURE_small_image
We can't go RW if it's an image file that hasn't been resized.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:20 -04:00
Kent Overstreet
203852d9db bcachefs: BCH_FEATURE_no_alloc_info
If a filesystem is going to only be used read-only, and will be a
deployable image, we can strip out alloc info for a substantial
reduction in metadata size - around half, due to backpointers.

Alloc info will be regenerated on first read-write mount.

Remounting RW is disallowed for now, since we don't yet have
check_allocations running in RW mode.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:20 -04:00
Kent Overstreet
576493133f bcachefs: Print features on startup with -o verbose
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:20 -04:00
Kent Overstreet
0dc73809e9 bcachefs: Shrink superblock downgrade table
Don't generate entries for versions that won't be able to mount.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:20 -04:00
Kent Overstreet
1c8dfd7ba5 bcachefs: sb_validate() no longer requires members_v1
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:19 -04:00
Kent Overstreet
d12bd41018 bcachefs: Add a recovery pass for making sure root inode is readable
If the root inode/subvolume is unreadable we can repair automatically -
but only if we're still in recovery, so that we can rewind to the
appropriate recovery pass.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:19 -04:00
Kent Overstreet
bdad8962c9 bcachefs: Flag for repair on missing subvolume
Instead of going emegency read only with a bch2_fs_inconsistent() call,
log the error and recovery pass appropriately.

If we're still in recovery it'll be repaired immediately, otherwise
it'll be repaired on the next mount.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:18 -04:00
Kent Overstreet
ebf561b208 bcachefs: print_str_as_lines() -> print_str()
bch2_print_string_as_lines() is a low level helper that allows messages
longer than 1k to be printed without truncation.

But we should always be printing with the helpers that take a filesystem
object, if we're in fsck they direct output to the userspace process
controlling fsck instead of the dmesg log.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:18 -04:00
Kent Overstreet
040c762152 bcachefs: bch2_dev_missing_bkey()
Part of the ongoing project to kill off bch2_(fs|trans)_inconsistent
calls - they generally need to be replaced with either

- a fsck_err() call that can repair the error, or

- logging an error of the appropriate type in the superblock, and
  flagging the appropriate recovery pass to repair the error

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:17 -04:00
Kent Overstreet
2085325171 bcachefs: Simplify bch2_count_fsck_err()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:17 -04:00
Kent Overstreet
bb36a12921 bcachefs: bch2_run_explicit_recovery_pass_printbuf()
We prefer helpers that emit log messages to printbufs rather than
printing them directly; that way, we can ensure that different log
messages from the same event are grouped together and formatted
appropriately in the dmesg log.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:16 -04:00
Kent Overstreet
5022d0e183 bcachefs: Incompatible features may now be enabled at runtime
version_upgrade is now a runtime option.

In the future we'll want to add compatible upgrades at runtime, and call
the full check_version_upgrade() when the option changes, but we don't
have compatible optional upgrades just yet.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:16 -04:00
Kent Overstreet
c79eb06da4 bcachefs: Clean up option pre/post hooks, small fixes
The helpers are now:
- bch2_opt_hook_pre_set()
- bch2_opts_hooks_pre_set()
- bch2_opt_hook_post_set

Fix a bug where the filesystem discard option would incorrectly be
changed when setting the device option, and don't trigger rebalance
scans unnecessarily (when options aren't changing).

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:16 -04:00
Kent Overstreet
83ecd1b122 bcachefs: Use drop_locks_do() in bch2_inode_hash_find()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:16 -04:00
Kent Overstreet
c02e5b5728 bcachefs: Single device mode
Single device filesystems are now identified by the block device name,
not the UUID - and single device filesystems with the same UUID can be
mounted simultaneously, without any special options.

This allocates a new bit in the superblock, BCH_SB_MULTI_DEVICE, which
indicates whether a filesystem has ever been multi device.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:15 -04:00
Kent Overstreet
58c36e6710 bcachefs: Initialize c->name earlier on single dev filesystems
On single device filesystems, c->name contains the block device name,
not the UUID.

Initialize this earlier, so that single device mode can use it for
initializing sysfs/debugfs.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:15 -04:00
Alan Huang
0e43bf5a6a bcachefs: Simplify logic
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:14 -04:00
Alan Huang
152bae193c bcachefs: Remove spurious +1/-1 operation
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:14 -04:00
Alan Huang
f013b4ca35 bcachefs: Kill bch2_trans_unlock_noassert
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:14 -04:00
Kent Overstreet
6f03e30e7c bcachefs: Clean up duplicated code in bch2_journal_halt()
It's now a wrapper around bch2_journal_halt_locked().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:13 -04:00
Kent Overstreet
03f8f9a129 bcachefs: bch2_dev_allocator_set_rw()
Add a helper that lets us change bch_member.data_allowed at runtime.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:13 -04:00
Kent Overstreet
2e0d51d00e bcachefs: bch2_dev_journal_alloc() now respects data_allowed
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:13 -04:00
Kent Overstreet
93ac4d5f92 bcachefs: Improve bch2_btree_cache_to_text()
Make the output slightly clearer, and include a counter for "nodes we
couldn't free because we would have gone under our reserve".

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:13 -04:00
Kent Overstreet
e50fe14c54 bcachefs: __btree_node_reclaim_checks()
Factor out a helper so we're not duplicating checks after locking the
btree node.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:13 -04:00
Kent Overstreet
68aaeb7c8b bcachefs: kill BTREE_CACHE_NOT_FREED_INCREMENT()
Small cleanup, just always increment the counters.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:12 -04:00
Kent Overstreet
ef8dd631f7 bcachefs: Improve opts.degraded
Kill 'opts.very_degraded', and make 'opts.degraded' a persistent option,
stored in the superblock.

It's now an enum, with available choices ask/yes/very/no.

"ask" mode will be handled by the mount helper, for prompting the user
(on a machine used interactively) for whether to do a degraded mount.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:12 -04:00
Kent Overstreet
2758c28aca bcachefs: export bch2_chacha20
Needed for userspcae.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:11 -04:00
Integral
dd1b99f706 bcachefs: indent error messages of invalid compression
This patch uses printbuf_indent_add_nextline() to set a consistent
indentation level for error messages of invalid compression.

In my previous patch [1], the newline is added by using '\n' in
the argument of prt_str(). This patch replaces prt_str() with
prt_printf() to make indentation level work correctly.

[1] Link: https://lore.kernel.org/20250406152659.205997-2-integral@archlinuxcn.org

Signed-off-by: Integral <integral@archlinuxcn.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:10 -04:00
Integral
84ccd47d26 bcachefs: split error messages of invalid compression into two lines
When an invalid compression type or level is passed as an argument
to `--compression`, two error messages are squashed into one line:

    > bcachefs format --compression=lzo bcachefs-comp.img
    invalid option: invalid compression typecompression: parse error

    > bcachefs format --compression=lz4:16 bcachefs-comp.img
    invalid option: invalid compression levelcompression: parse error

To resolve this issue, add a newline character at the end of the
first error message to separate them into two lines.

Signed-off-by: Integral <integral@archlinuxcn.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:08 -04:00
Integral
0e790469bf bcachefs: early return for negative values when parsing BCH_OPT_UINT
Currently, when passing a negative integer as argument, the error
message is "too big" due to casting to an unsigned integer:

    > bcachefs format --block_size=-1 bcachefs.img
    invalid option: block_size: too big (max 65536)

When negative value in argument detected, return early before
calling bch2_opt_validate().

A new error code `BCH_ERR_option_negative` is added.

Signed-off-by: Integral <integral@archlinuxcn.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:07 -04:00
Kent Overstreet
3a2a0d08b2 bcachefs: move_data_phys: stats are not required
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:05 -04:00
Kent Overstreet
d4d71b58e5 bcachefs: RO mounts now use less memory
Defer memory allocations only needed in RW mode until we actually go RW.

This is part of improved support for RO images.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:04 -04:00
Kent Overstreet
a17e985be9 bcachefs: Move various init code to _init_early()
_init_early() is for initialization that cannot fail, and often must
happen for teardown partway through initialization to work.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:02 -04:00
Kent Overstreet
31813dcf37 bcachefs: alphabetize init function calls
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:14:00 -04:00
Kent Overstreet
25ee021c7f bcachefs: simplify journal pin initialization
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:13:59 -04:00
Kent Overstreet
2767f4f258 bcachefs: btree_io_complete_wq -> btree_write_complete_wq
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:13:56 -04:00
Kent Overstreet
c9b5d9cd26 bcachefs: bch2_kvmalloc() mem alloc profiling
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:13:56 -04:00
Kent Overstreet
bcaea61adc bcachefs: add missing include
Hygeine, and fix build in userspace.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:13:52 -04:00
Kent Overstreet
b974357c63 bcachefs: bch2_snapshot_table_make_room()
Add a better helper for check_snapshot_exists().

create_snapids() can't be changed to use this, unfortunately, because
the transaction that creates new snapshot will also be inserting other
keys (e.g. root inode) that reference that snapshot ID, and they expect
the snapshot table to already be updated.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:13:50 -04:00
Kent Overstreet
ea27e8ca5d bcachefs: darray: provide typedefs for primitive types
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:13:47 -04:00
Kent Overstreet
2a81bd454c bcachefs: reduce new_stripe_alloc_buckets() stack usage
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:13:44 -04:00
Kent Overstreet
a0b0b9bb9e bcachefs: alloc_request no longer on stack
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:13:43 -04:00
Kent Overstreet
95f2315af7 bcachefs: alloc_request.ptrs2
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:13:40 -04:00
Kent Overstreet
e038213658 bcachefs: alloc_request.ca
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:13:39 -04:00
Kent Overstreet
7f65d1cf5c bcachefs: alloc_request.counters
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:13:37 -04:00
Kent Overstreet
4d00e88d21 bcachefs: alloc_request.usage
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:13:36 -04:00
Kent Overstreet
a0312f4251 bcachefs: alloc_request: deallocate_extra_replicas()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:13:35 -04:00
Kent Overstreet
ac0952b0e5 bcachefs: new_stripe_alloc_buckets() takes alloc_request
More stack usage improvements: instead of creating a new alloc_request
(currently on the stack), save/restore just the fields we need to reuse.

This is a bit tricky, because we're doing a normal alloc_foreground.c
allocation, which calls into ec.c to get a stripe, which then does more
normal allocations - some of the fields get reused, and used
differently.

So we have to save and restore them - but the stack usage improvements
will be well worth it.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:13:33 -04:00
Kent Overstreet
7100344301 bcachefs: bch2_ec_stripe_head_get() takes alloc_request
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:13:32 -04:00
Kent Overstreet
9259883b79 bcachefs: bch2_bucket_alloc_trans() takes alloc_request
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:13:31 -04:00
Kent Overstreet
799c418303 bcachefs: alloc_request.data_type
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:13:30 -04:00
Kent Overstreet
ad63f9f1e9 bcachefs: struct alloc_request
Add a struct for common state for satisfying an on disk allocation,
instead of passing the same long list of items to every function.

This will help with stack usage, performance, and perhaps enable some
code cleanups.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:13:27 -04:00
Kent Overstreet
d02755b8c5 bcachefs: trace bch2_trans_kmalloc()
We're occasionally seeing the WARN_ON() for bump allocator usage
exceeding BTREE_TRANS_MEM_MAX; add some tracing so we can see what's
going on.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:13:27 -04:00
Roxana Nicolescu
caa6baa45f bcachefs: replace memcpy with memcpy_and_pad for jset_entry_log->d buff
This was achieved before by zero-ing out the source buffer and then
copying the bytes into the destination buffer. This can also be done with
memcpy_and_pad which will zero out only the destination buffer if its
size is bigger than the size of the source buffer. This is already used in
the same way in journal_transaction_name().

Moreover, zero-ing the source buffer was done twice, first in
__bch2_fs_log_msg() and then in bch2_trans_log_msg(). And this method
may also require allocating some extra memory for the source buffer.

In conclusion, using memcpy_and_pad is better even tough the result is
the same because it brings uniformity with what's already used in
journal_transaction_name, it avoids code duplication and reallocating
extra memory.

Signed-off-by: Roxana Nicolescu <nicolescu.roxana@protonmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:13:25 -04:00
Roxana Nicolescu
4e2caf82ce bcachefs: replace strncpy() with memcpy_and_pad in journal_transaction_name
Strncpy is now deprecated.
The buffer destination is not required to be NULL-terminated, but we also
want to zero out the rest of the buffer as it is already done in other
places.

Link: https://github.com/KSPP/linux/issues/90
Signed-off-by: Roxana Nicolescu <nicolescu.roxana@protonmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:13:24 -04:00
Kent Overstreet
8c087d2ddf bcachefs: Rebalance now skips poisoned extents
Let's not move poisoned extents unnecessarily, since we can't guard
against introducing more bitrot.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:13:22 -04:00
Kent Overstreet
cb8336ca42 bcachefs: Data move can read from poisoned extents
Now, if an extent is poisoned we can move it even if there was a
checksum error. We'll have to give it a new checksum, but the poison bit
means that userspace will still see the appropriate error when they try
to read it.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:13:21 -04:00
Kent Overstreet
760be1ad5e bcachefs: Poison extents that can't be read due to checksum errors
Copygc needs to be able to move extents that have bitrotted. We don't
want to delete them - in the future we'll have an API for "read me the
data even if there's checksum errors", and in general we don't want to
delete anything unless the user asks us to.

That will require writing it with a new checksum, which means we can't
forget that there was a checksum error so we return the correct error to
userspace.

Rebalance also wants to skip bad extents; we can now use the poison flag
for that.

This is currently disabled by default, as we want read fua support so
that we can distinguish between transient and permanent errors from the
device. It may be enabled with the module parameter:

  poison_extents_on_checksum_error

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:13:19 -04:00
Kent Overstreet
6659ba3b18 bcachefs: Be precise about bch_io_failures
If the extent we're reading from changes, due to be being overwritten or
moved (possibly partially) - we need to reset bch_io_failures so that we
don't accidentally mark a new extent as poisoned prematurely.

This means we have to separately track (in the retry path) the extent we
previously read from.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:13:17 -04:00
Kent Overstreet
0e5f1f3f8f bcachefs: bch2_subvolume_wait_for_pagecache_and_delete() cleanup
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:13:15 -04:00
Kent Overstreet
010c894681 bcachefs: Check for casefolded dirents in non casefolded dirs
Check for mismatches between casefold dirents and casefold directories.

A mismatch will cause lookups to fail, as we'll be doing the lookup with
the casefolded name, which won't match the non-casefolded dirent, and
vice versa.

Reported-by: Christopher Snowhill <chris@kode54.net>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:13:14 -04:00
Kent Overstreet
ecd76c5f10 bcachefs: Fix bch2_dirent_create_snapshot() for casefolding
bch2_dirent_create_snapshot(), used in fsck, neglected to create a
casefolded dirent.

Just move this into dirent_create_key().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:13:13 -04:00
Kent Overstreet
8d5ac187da bcachefs: Fix casefold opt via xattr interface
Changing the casefold option requires extra checks/work - factor out a
helper from bch2_fileattr_set() for the xattr code to use.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21 20:13:09 -04:00
Kent Overstreet
cbed8287e5 bcachefs: mkwrite() now only dirties one page
Don't dirty the whole folio - fixes write amplification with
applications doing mmaped writes.

https://www.reddit.com/r/bcachefs/comments/1klzcg1/incredible_amounts_of_write_amplification_when/

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-19 08:28:41 -04:00
Kent Overstreet
494d458cfa bcachefs: fix extent_has_stripe_ptr()
This wasn't checking indirect extents.

Fixes: https://github.com/koverstreet/bcachefs/issues/887
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-18 22:35:33 -04:00
Kent Overstreet
49771a7578 bcachefs: Fix bch2_btree_path_traverse_cached() when paths realloced
btree_key_cache_fill() will allocate and traverse another path (for the
underlying btree), so we can't hold pointers to paths across a call - we
have to pass indices.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-17 18:46:17 -04:00
Kent Overstreet
9c09e59cc5 bcachefs: fix wrong arg to fsck_err()
fsck_err() needs the btree transaction passed to it if there is one - so
that it can unlock/relock around prompting userspace for fixing the
error.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-14 18:59:15 -04:00
Kent Overstreet
d1041d8eab bcachefs: Fix missing commit in backpointer to missing target
Fsck wants to do transaction commits from an outer context; it may have
other repair to do (i.e. duplicate backpointers).

But when calling backpointer_not_found() from runtime code, i.e. runtime
self healing, we should be doing the commit - the outer context expects
to just be doing lookups.

This fixes bugs where we get stuck spinning, reported as "RCU lock hold
time warnings.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-14 17:05:19 -04:00
Kent Overstreet
a12cb6f758 bcachefs: Fix accidental O(n^2) in fiemap
Since bch2_seek_pagecache_data() searches for dirty data, we only want
to call it for holes in the extents btree - otherwise we have an
accidental O(n^2), as we repeatedly search the same range.

Reported-by: Marcin Mirosław <marcin@mejor.pl>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-14 17:05:19 -04:00
Kent Overstreet
43b9fece2d bcachefs: Fix set_should_be_locked() call in peek_slot()
set_should_be_locked() needs to be called before peek_key_cache(), which
traverses other paths and may do a trans unlock/relock.

This fixes an assertion pop in path_peek_slot(), when the path we're
using is unexpectedly not uptodate.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-14 17:05:19 -04:00
Alan Huang
61198e6287 bcachefs: Fix self deadlock
Before invoking bch2_accounting_mem_mod_locked in
bch2_gc_accounting_done, we already write locked mark_lock,
in bch2_accounting_mem_insert, we lock mark_lock again.

Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-14 17:05:19 -04:00
Kent Overstreet
19b22d04cd bcachefs: Don't set btree nodes as accessed on fill
Prevent jobs that do lots of scanning (i.e. evacuatee, scrub) from
causing OOMs.

The shrinker code seems to be having issues when it doesn't do any
freeing because it's just flipping off the acccessed bit - and the
accessed bit shouldn't be set on first use anyways.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-14 17:05:19 -04:00
Kent Overstreet
7b6759b199 bcachefs: Fix livelock in journal_entry_open()
When the journal is low on space, we might do discards from
journal_res_get() -> journal_entry_open().

Make sure we set j->can_discard correctly, so that if we're low on space
but not because discards aren't keeping up we don't livelock.

Fixes: 8e4d28036c ("bcachefs: Don't aggressively discard the journal")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-14 17:05:19 -04:00
Kent Overstreet
b1c71cb492 bcachefs: Fix broken btree_path lock invariants in next_node()
This fixes btree locking assert pops users were seeing during evacuate:

https://github.com/koverstreet/bcachefs/issues/878

May 09 22:45:02 sharon kernel: bcachefs (68116e25-fa2d-4c6f-86c7-e8b431d792ae):   bch2_btree_insert_node(): node not locked at level 1
May 09 22:45:02 sharon kernel:   bch2_btree_node_rewrite [bcachefs]: watermark=btree no_check_rw alloc l=0-1 mode=none nodes_written=0 cl.remaining=2 journal_seq=0
May 09 22:45:02 sharon kernel:   path: idx   1 ref 1:0   S B btree=alloc level=0 pos 0:3699637:0 0:3698012:1-0:3699637:0 bch2_move_btree.isra.0+0x1db/0x490 [bcachefs] uptodate 0 locks_want 2
May 09 22:45:02 sharon kernel:     l=0 locks intent seq 4 node ffff8bd700c93600
May 09 22:45:02 sharon kernel:     l=1 locks unlocked seq 1712 node ffff8bd6fd5e7a00
May 09 22:45:02 sharon kernel:     l=2 locks unlocked seq 2295 node ffff8bd6cc725400
May 09 22:45:02 sharon kernel:     l=3 locks unlocked seq 0 node 0000000000000000

Evacuate walks btree nodes with bch2_btree_iter_next_node() and rewrites
them, bch2_btree_update_start() upgrades the path to take intent locks
as far as it needs to.

But next_node() does low level unlock/relock calls on individual nodes,
and didn't handle the case where a path is supposed to be holding
multiple intent locks. If a path has locks_want > 1, it needs to be
either holding locks on all the btree nodes (at each level) requested,
or none of them.

Fix this with a bch2_btree_path_downgrade().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-14 17:05:19 -04:00
Kent Overstreet
cd52cc3544 bcachefs: Don't strip rebalance_opts from indirect extents
Fix bch2_bkey_clear_needs_rebalance(): indirect extents are never
supposed to have bch_extent_rebalance stripped off, because that's how
we get the IO path options when we don't have the original inode it
belonged to.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-14 17:05:19 -04:00
Eric Biggers
607c92141c crypto: lib/chacha - add strongly-typed state zeroization
Now that the ChaCha state matrix is strongly-typed, add a helper
function chacha_zeroize_state() which zeroizes it.  Then convert all
applicable callers to use it instead of direct memzero_explicit.  No
functional changes.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2025-05-12 13:32:53 +08:00
Eric Biggers
98066f2f89 crypto: lib/chacha - strongly type the ChaCha state
The ChaCha state matrix is 16 32-bit words.  Currently it is represented
in the code as a raw u32 array, or even just a pointer to u32.  This
weak typing is error-prone.  Instead, introduce struct chacha_state:

    struct chacha_state {
            u32 x[16];
    };

Convert all ChaCha and HChaCha functions to use struct chacha_state.
No functional changes.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2025-05-12 13:32:53 +08:00
Fedor Pchelkin
f3def8270c sort.h: hoist cmp_int() into generic header file
Deduplicate the same functionality implemented in several places by
moving the cmp_int() helper macro into linux/sort.h.

The macro performs a three-way comparison of the arguments mostly useful
in different sorting strategies and algorithms.

Link: https://lkml.kernel.org/r/20250427201451.900730-1-pchelkin@ispras.ru
Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
Suggested-by: Darrick J. Wong <djwong@kernel.org>
Acked-by: Kent Overstreet <kent.overstreet@linux.dev>
Acked-by: Coly Li <colyli@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Carlos Maiolino <cem@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Coly Li <colyli@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11 17:54:12 -07:00
Ingo Molnar
aad823aa3a treewide, timers: Rename destroy_timer_on_stack() as timer_destroy_on_stack()
Move this API to the canonical timer_*() namespace.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250507175338.672442-10-mingo@kernel.org
2025-05-08 19:49:33 +02:00
Kent Overstreet
8e4d28036c bcachefs: Don't aggressively discard the journal
We frequently use 'bcachefs list_journal -a' for debugging, as it
provides a record of all btree transactions, and a history of what
happened.

But it's not so useful if we immediately discard journal buckets right
after they're no longer dirty.

This tweaks journal reclaim to only discard when we're low on space,
keeping the journal mostly un-discarded.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-07 17:10:10 -04:00
Kent Overstreet
da18dabc37 bcachefs: Ensure superblock gets written when we go ERO
When we go emergency read-only, make sure we do a final write_super() to
persist counters and error counts - this can be critical for piecing
together what fsck was doing.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-07 17:09:59 -04:00
Kent Overstreet
2fea3aa76e bcachefs: Filter out harmless EROFS error messages
These just indicate that we're shutting down.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-07 16:58:32 -04:00
Kent Overstreet
473f09f362 bcachefs: journal_shutdown is EROFS, not EIO
We often filter out EROFS errors to avoid log spew after an emergency
shutdown - journal_shutdown is just another emergency shutdown error.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-07 16:58:26 -04:00
Kent Overstreet
9c61856099 bcachefs: Call bch2_fs_start before getting vfs superblock
This reverts

1fdbe0b184 bcachefs: Make sure c->vfs_sb is set before starting fs

switched up bch2_fs_get_tree() so that we got a superblock before
calling bch2_fs_start, so that c->vfs_sb would always be initialized
while the filesystem was active.

This turned out not to be necessary, because blk_holder_ops were
implemented using our own locking, not vfs locking.

And this had the side effect of creating a super_block and doing our
full recovery (including potentially fsck) before setting SB_BORN, which
causes things like sync calls to hang until our recovery is finished.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-05 16:06:35 -04:00
Kent Overstreet
aed4ccbf45 bcachefs: fix hung task timeout in journal read
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-05 14:21:28 -04:00
Kent Overstreet
7a69fa6571 bcachefs: Add missing barriers before wake_up_bit()
wake_up() doesn't require a barrier - but wake_up_bit() does.

This only affected non x86, and primarily lead to lost wakeups after
btree node reads.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-05 14:19:10 -04:00
Kent Overstreet
50a7b899a0 bcachefs: Ensure proper write alignment
There was a buggy version of bcachefs-tools which picked misaligned
bucket sizes when formatting, and we're also about to do dynamic block
sizes - which will allow picking logical block size or physical block
size of the device per-write, allowing for better compression ratios at
the cost of slightly worse write performance (i.e. forcing the device to
do RMW or extra buffering).

To account for this, tweak bch2_alloc_sectors_start() to properly align
open_buckets to the blocksize of the write we're about to do.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-05 14:19:01 -04:00
Kent Overstreet
844f766e02 bcachefs: Improve want_cached_ptr()
If promote target isn't set, rebalance should still leave a cached copy
on the faster device.

Fall back to foreground_target if it's set, or allow a cached copy on
any device if neither are set.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-05 14:16:20 -04:00
Kent Overstreet
df2e19a883 bcachefs: thread_with_stdio: fix spinning instead of exiting
bch2_stdio_redirect_vprintf() was missing a check for stdio->done, i.e.
exiting.

This caused the thread attempting to print to spin, and since it was
being called from the kthread ran by thread_with_stdio, the userspace
side hung as well.

Change it to return -EPIPE - i.e. writing to a pipe that's been closed.

Reported-by: Jan Solanti <jhs@psonet.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-04 14:00:14 -04:00
Alan Huang
6846100b00 bcachefs: Remove incorrect __counted_by annotation
This actually reverts 86e92eeeb2 ("bcachefs: Annotate struct bch_xattr
with __counted_by()").

After the x_name, there is a value. According to the disscussion[1],
__counted_by assumes that the flexible array member contains exactly
the amount of elements that are specified. Now there are users came across
a false positive detection of an out of bounds write caused by
the __counted_by here[2], so revert that.

[1] https://lore.kernel.org/lkml/Zv8VDKWN1GzLRT-_@archlinux/T/#m0ce9541c5070146320efd4f928cc1ff8de69e9b2
[2] https://privatebin.net/?a0d4e97d590d71e1#9bLmp2Kb5NU6X6cZEucchDcu88HzUQwHUah8okKPReEt

Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-01 16:38:58 -04:00
Kent Overstreet
28580052e6 bcachefs: add missing sched_annotate_sleep()
00594 ------------[ cut here ]------------
00594 do not call blocking ops when !TASK_RUNNING; state=2 set at [<000000003e51ef4a>] prepare_to_wait_event+0x5c/0x1c0
00594 WARNING: CPU: 12 PID: 1117 at kernel/sched/core.c:8741 __might_sleep+0x74/0x88
00594 Modules linked in:
00594 CPU: 12 UID: 0 PID: 1117 Comm: umount Not tainted 6.15.0-rc4-ktest-g3a72e369412d #21845 PREEMPT
00594 Hardware name: linux,dummy-virt (DT)
00594 pstate: 60001005 (nZCv daif -PAN -UAO -TCO -DIT +SSBS BTYPE=--)
00594 pc : __might_sleep+0x74/0x88
00594 lr : __might_sleep+0x74/0x88
00594 sp : ffffff80c8d67a90
00594 x29: ffffff80c8d67a90 x28: ffffff80f5903500 x27: 0000000000000000
00594 x26: 0000000000000000 x25: ffffff80cf5002a0 x24: ffffffc087dad000
00594 x23: ffffff80c8d67b40 x22: 0000000000000000 x21: 0000000000000000
00594 x20: 0000000000000242 x19: ffffffc080b92020 x18: 00000000ffffffff
00594 x17: 30303c5b20746120 x16: 74657320323d6574 x15: 617473203b474e49
00594 x14: 0000000000000001 x13: 00000000000c0000 x12: ffffff80facc0000
00594 x11: 0000000000000001 x10: 0000000000000001 x9 : ffffffc0800b0774
00594 x8 : c0000000fffbffff x7 : ffffffc087dac670 x6 : 00000000015fffa8
00594 x5 : ffffff80facbffa8 x4 : ffffff80fbd30b90 x3 : 0000000000000000
00594 x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffffff80f5903500
00594 Call trace:
00594  __might_sleep+0x74/0x88 (P)
00594  __mutex_lock+0x64/0x8d8
00594  mutex_lock_nested+0x28/0x38
00594  bch2_fs_ec_flush+0xf8/0x128
00594  __bch2_fs_read_only+0x54/0x1d8
00594  bch2_fs_read_only+0x3e0/0x438
00594  __bch2_fs_stop+0x5c/0x250
00594  bch2_put_super+0x18/0x28
00594  generic_shutdown_super+0x6c/0x140
00594  bch2_kill_sb+0x1c/0x38
00594  deactivate_locked_super+0x54/0xd0
00594  deactivate_super+0x70/0x90
00594  cleanup_mnt+0xec/0x188
00594  __cleanup_mnt+0x18/0x28
00594  task_work_run+0x90/0xd8
00594  do_notify_resume+0x138/0x148
00594  el0_svc+0x9c/0xa0
00594  el0t_64_sync_handler+0x104/0x130
00594  el0t_64_sync+0x154/0x158

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-01 13:54:58 -04:00
Kent Overstreet
e2699274d5 bcachefs: Fix __bch2_dev_group_set()
bch2_sb_disk_groups_to_cpu() goes off of the superblock member info, so
we need to set that first.

Reported-by: Stijn Tintel <stijn@linux-ipv6.be>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-01 12:22:10 -04:00
Kent Overstreet
e660d7ca74 bcachefs: Kill ERO for i_blocks check in truncate
Replace with logging the error in the superblock.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-01 06:19:58 -04:00
Kent Overstreet
3a72e36941 bcachefs: check for inode.bi_sectors underflow
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-01 06:19:58 -04:00
Kent Overstreet
05450c48a3 bcachefs: Kill ERO in __bch2_i_sectors_acct()
We won't be root causing this in the immediate future, and it's fairly
innocuous - so just log it in the superblock.

https://github.com/koverstreet/bcachefs/issues/869

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-01 06:19:58 -04:00
Kent Overstreet
5e63d579e7 bcachefs: readdir fixes
- Don't call bch2_trans_relock() after dir_emit(); taking a transaction
  restart here will cause us to emit the same dirent to userspace twice

- Fix incorrect checking of the return value on dir_emit(): "true" means
  success, keep going, but bch2_dir_emit() needs to return true when
  we're finished iterating.

https://github.com/koverstreet/bcachefs/issues/867

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-30 11:49:34 -04:00
Kent Overstreet
2feaa92c7c bcachefs: improve missing journal write device error message
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-30 11:49:28 -04:00
Kent Overstreet
dbe4674802 bcachefs: Topology error after insert is now an ERO
A user hit this, and this will naturally be easier to debug if we don't
panic.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-28 22:42:17 -04:00
Kent Overstreet
9a4a858c9b bcachefs: Use bch2_kvmalloc() for journal keys array
We can hit this limit fairly easy when we have to reconstuct large
amounts of alloc info on large filesystems.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-28 22:42:17 -04:00
Kent Overstreet
e5a3b8cf33 bcachefs: More informative error message when shutting down due to error
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-28 22:42:17 -04:00
Kent Overstreet
652dd6558b bcachefs: btree_root_unreadable_and_scan_found_nothing autofix for non data btrees
If loosing a btree won't cause data loss - i.e. it's an alloc btree, or
we can easily reconstruct it - we shouldn't require user action to
continue repair.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-28 22:42:17 -04:00
Kent Overstreet
c366b1672d bcachefs: btree_node_data_missing is now autofix
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-28 16:46:13 -04:00
Kent Overstreet
eca5b56ccf bcachefs: Don't generate alloc updates to invalid buckets
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-28 16:46:13 -04:00
Kent Overstreet
e7f1a52849 bcachefs: Improve bch2_dev_bucket_missing()
More useful error message.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-28 16:46:13 -04:00
Kent Overstreet
002466446a bcachefs: fix bch2_dev_buckets_resize()
The resize memcpy path was totally busted.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-28 16:46:13 -04:00
Kent Overstreet
9e9c28acfd bcachefs: Add upgrade table entry from 0.14
There are a few errors that needed to be marked as autofix.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-28 16:46:12 -04:00
Kent Overstreet
3c24020119 bcachefs: Run BCH_RECOVERY_PASS_reconstruct_snapshots on missing subvol -> snapshot
Fix this repair path.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-28 16:46:12 -04:00
Kent Overstreet
bdc32a10a2 bcachefs: Add missing utf8_unload()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-28 16:46:12 -04:00
Kent Overstreet
70c3d89f49 bcachefs: Emit unicode version message on startup
fstests expects this

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-28 16:46:12 -04:00
Kent Overstreet
c83311c5b9 bcachefs: Use generic_set_sb_d_ops for standard casefolding d_ops
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-28 16:46:12 -04:00
Kent Overstreet
a2f546330e bcachefs: Fix losing return code in next_fiemap_extent()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-28 16:46:12 -04:00
Kent Overstreet
d1b0f9aa73 bcachefs: Rework fiemap transaction restart handling
Restart handling in the previous patch was incorrect, so: move btree
operations into a separate helper, and run it with a lockrestart_do().

Additionally, clarify whether pagecache or the btree takes precedence.

Right now, the btree takes precedence: this is incorrect, but it's
needed to pass fstests. Add a giant comment explaining why.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-24 19:10:29 -04:00
Brian Foster
b9b0494017 bcachefs: add fiemap delalloc extent detection
bcachefs currently populates fiemap data from the extents btree.
This works correctly when the fiemap sync flag is provided, but if
not, it skips all delalloc extents that have not yet been flushed.
This is because delalloc extents from buffered writes are first
stored as reservation in the pagecache, and only become resident in
the extents btree after writeback completes.

Update the fiemap implementation to process holes between extents by
scanning pagecache for data, via seek data/hole. If a valid data
range is found over a hole in the extent btree, fake up an extent
key and flag the extent as delalloc for reporting to userspace.

Note that this does not necessarily change behavior for the case
where there is dirty pagecache over already written extents, where
when in COW mode, writeback will allocate new blocks for the
underlying ranges. The existing behavior is consistent with btrfs
and it is recommended to use the sync flag for the most up to date
extent state from fiemap.

Signed-off-by: Brian Foster <bfoster@redhat.com>
2025-04-24 19:10:29 -04:00
Brian Foster
2d55a63709 bcachefs: refactor fiemap processing into extent helper and struct
The bulk of the loop in bch2_fiemap() involves processing the
current extent key from the iter, including following indirections
and trimming the extent size and such. This patch makes a few
changes to reduce the size of the loop and facilitate future changes
to support delalloc extents.

Define a new bch_fiemap_extent structure to wrap the bkey buffer
that holds the extent key to report to userspace along with
associated fiemap flags. Update bch2_fill_extent() to take the
bch_fiemap_extent as a param instead of the individual fields.
Finally, lift the bulk of the extent processing into a
bch2_fiemap_extent() helper that takes the current key and formats
the bch_fiemap_extent appropriately for the fill function.

No functional changes intended by this patch.

Signed-off-by: Brian Foster <bfoster@redhat.com>
2025-04-24 19:10:29 -04:00
Brian Foster
d020a9fb11 bcachefs: track current fiemap offset in start variable
Signed-off-by: Brian Foster <bfoster@redhat.com>
2025-04-24 19:10:28 -04:00
Brian Foster
28d2d19ccc bcachefs: drop duplicate fiemap sync flag
FIEMAP_FLAG_SYNC handling was deliberately moved into core code in
commit 45dd052e67 ("fs: handle FIEMAP_FLAG_SYNC in fiemap_prep"),
released in kernel v5.8. Update bcachefs accordingly.

Signed-off-by: Brian Foster <bfoster@redhat.com>
2025-04-24 19:10:28 -04:00
Kent Overstreet
353739f1d1 bcachefs: Fix btree_iter_peek_prev() at end of inode
At the end of the inode, on an extents iterator, peek_slot() has to
advance to the next position to avoid returning a 0 size extent, which
is not allowed.

Changing iter->pos confuses peek_prev(), but we don't need to call
peek_slot() in this case.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-24 19:09:52 -04:00
Kent Overstreet
c4f89a1d35 bcachefs: Make btree_iter_peek_prev() assert more precise
The issue this assert is guarding against is that in
BTREE_ITER_filter_snapshots mode we only want to be iterating within a
single inode number - if we iterate into another inode number with keys
for a different snapshot tree, we'll loop arbitrarily long before
finding a key we can return.

This comes up in the unit tests, where we're using inode 0 for our test
keys.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-24 19:09:52 -04:00
Kent Overstreet
394ef278e1 bcachefs: Unit test fixes
The peek_end() tests expect an empty btree.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-24 19:09:52 -04:00
Kent Overstreet
caab547686 bcachefs: Print mount opts earlier
If we aren't mounting with the correct degraded option, it's helpful to
know that before we fail to mount degraded.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-24 19:09:52 -04:00
Kent Overstreet
7cb85324c4 bcachefs: unlink: casefold d_invalidate
casefolding results in additional aliases on lookup for the
non-casefolded names - these need invalidating on unlink.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-24 19:09:52 -04:00
Kent Overstreet
9cdde3c7aa bcachefs: Fix casefold lookups
Add casefolding to bch2_lookup_trans:

During the delay between when casefolding was written and when it was
merged, the main filesystem lookup path grew self healing - which meant
it was no longer using bch2_dirent_lookup_trans(), where casefolding on
lookups happens.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-24 19:09:52 -04:00
Kent Overstreet
b9e1f873d2 bcachefs: Casefold is now a regular opts.h option
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-24 19:09:00 -04:00
Kent Overstreet
7a4a86618e bcachefs: Implement fileattr_(get|set)
inode_operations.fileattr_(get|set) didn't exist when the various flag
ioctls where implemented - but they do now, which means we can delete a
bunch of ioctl code in favor of standard VFS level wrappers.

Closes: https://lore.kernel.org/linux-bcachefs/7ltgrgqgfummyrlvw7hnfhnu42rfiamoq3lpcvrjnlyytldmzp@yazbhusnztqn/
Cc: Petr Vorel <pvorel@suse.cz>
Cc: Andrea Cervesato <andrea.cervesato@suse.de>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-21 19:50:56 -04:00
Kent Overstreet
4ede80a9a8 bcachefs: Allocator now copes with unaligned buckets
We had a buggy release of bcachefs-tools that wasn't properly aligning
bucket sizes.

We can't ask users to reformat - and it's easy to teach the allocator to
make sure writes are properly aligned.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-21 19:36:45 -04:00
Kent Overstreet
387df33129 bcachefs: Start copygc, rebalance threads earlier
Previously, copygc and rebalance weren't started until the very end of
mounting, after all recvoery passes have finished.

But copygc really should be started earlier, since it may be needed for
allocations to make forward progress. Additionally, we've been seeing
occasional bug reports where starting the kthread fails due to a pending
signal - i.e. we're getting timed out by systemd (during a version
upgrade), but we're not seeing the signal until mount is about to
complete.

Additionally, we now have copygc/rebalance explicitly wait for
check_snapshots to complete (if being run); they require that for
snapshot_is_ancestor() in the data move path.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-21 11:57:24 -04:00
Kent Overstreet
d64e8e842b bcachefs: Refactor bch2_run_recovery_passes()
Don't use a continue; this simplifies the next patch where
run_recovery_passes() will be responsible for waking up copygc and
rebalance at the appropriate time.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-21 11:56:43 -04:00
Kent Overstreet
10e42b6f25 bcachefs: bch2_copygc_wakeup()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-20 20:01:48 -04:00
Kent Overstreet
bfbb76ec98 bcachefs: Fix ref leak in write_super()
found with the new enumerated_ref code

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-20 19:41:38 -04:00
Kent Overstreet
4c327d03d7 bcachefs: Change __journal_entry_close() assert to ERO
We've got some reports of this happening in the wild, and need a bit
more info to debug it:

https://github.com/koverstreet/bcachefs/issues/854
https://www.reddit.com/r/bcachefs/comments/1k28kjm/surprise_soft_lockup/

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-20 19:41:38 -04:00
Kent Overstreet
6468aef231 bcachefs: Ensure journal space is block size aligned
We don't require that bucket size is block size aligned (although it
should be!) - so we need to handle this in the journal code.

This fixes an assertion pop in jorunal_entry_close(), where the journal
entry overruns available space - after rounding it up to block size.

Fixes: https://github.com/koverstreet/bcachefs/issues/854
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-20 19:41:38 -04:00
Kent Overstreet
71f8e806a5 bcachefs: Stricter checks on "key allowed in this btree"
Syzbot managed to come up with a filesystem where check/repair got
rather confused at finding a reflink pointer in the inodes btree.

Currently, the "key allowed in this btree" checks only apply at commit
time, not read time - for forwards compatibility. It seems this is too
loose.

Now, strict key type allowed checks apply:
 - at commit time (no forward compatibility issues)
 - for btree node pointers
 - if it's a known btree, known key type, and the key type has the
   "BKEY_TYPE_strict_btree_checks" flag.

This means we still have the option of using generic key types - e.g.
KEY_TYPE_error, KEY_TYPE_set - on more existing btrees in the future,
while most key types that are intended for only a specific btree get
stricter checks.

Reported-by: syzbot+baee8591f336cab0958b@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-20 19:41:38 -04:00
Kent Overstreet
417f01e726 bcachefs: Error ratelimiting is no longer only during fsck
We now more often do repair automatically, without the user invoking
fsck - and sometimes that can involve fixing lots of errors, so let's
avoid flooding the dmesg log.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-20 19:41:38 -04:00
Kent Overstreet
aa6a591f0f bcachefs: Fix null ptr deref in bch2_snapshot_tree_oldest_subvol()
Reported-by: syzbot+baee8591f336cab0958b@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-20 19:41:38 -04:00
Kent Overstreet
4c0d2c67ac bcachefs: Fix early startup error path
Don't set JOURNAL_running until we're also calling
journal_space_available() for the first time.

If JOURNAL_running is set, shutdown will write an empty journal entry -
but this will hit an assert in journal_entry_open() if we've never
called journal_space_available().

Reported-by: syzbot+53bb24d476ef8368a7f0@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-20 19:41:38 -04:00
Linus Torvalds
9e99c1accb bcachefs fixes for 6.15-rc3
Usual set of small fixes/logging improvements.
 
 One bigger user reported fix, for inode <-> dirent inconsistencies
 reported in fsck, after moving a subvolume that had been snapshotted.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmgBYaIACgkQE6szbY3K
 bnYcGQ/+K+LsEvGAZ5wtTwUN4KqJIYWhcHYcuLS2mHKf8PMbgYhL7TjmCwb9VWyr
 0+GFQcJgfLsl++kX4j7CjG4gHd22aLiwbhMDmSt3r6c4aF29rG+zCpe4W1+7o60k
 UIKokfbLUV6b+0vF5bA/W3PmtXK7S8E0yAPMfWxv4/sACu8RUvrUJtrUCKEWwLzC
 bcrRGsN91456qNhCrOp3e4t3yZjiGtZIz+SbPYIxdNrZYIMURlGUm+f9sLH3O+2R
 NKsi41sggo/TmgUyspH3KCtMT88IDbN07F7O9/zcxgtdfzfC9l9FI6HnvRVSHDOV
 boFaH/NdRaIbg+O5kqZXYul+/EPXsYp5B77TL6KQ3jhv3q16uwpv9EL4v6HIwvz9
 BTDOfI2y/+YWHMfrtzXgh3C9dZDPS7qxFFWjSjCs/lXwKVz46RjBWVmtQoTJSEmb
 Ee29kBGMpkwmH8fqr5KQheJUIeYewpyTVeB6orgtshnrr+aezS6zunIbk7fJ6+Ng
 Tc08H/Aqc2KGcyBS3KTLhbReQ1clQKGOqWJymeb1p2V3SMXfABMbh61B1VU1XulC
 Al5B7/w/WPwb+T2XZIM2qbmeoRJ8OBara5RWkx4HN8pcYuWV8H6GWJtRJQD/eKSO
 pOT5bz8z9N2n/otwrfLT5lfO2fNW1mULCAamn6iSzR+EDHyuaMU=
 =pi0D
 -----END PGP SIGNATURE-----

Merge tag 'bcachefs-2025-04-17' of git://evilpiepirate.org/bcachefs

Pull bcachefs fixes from Kent Overstreet:
 "Usual set of small fixes/logging improvements.

  One bigger user reported fix, for inode <-> dirent inconsistencies
  reported in fsck, after moving a subvolume that had been snapshotted"

* tag 'bcachefs-2025-04-17' of git://evilpiepirate.org/bcachefs:
  bcachefs: Fix snapshotting a subvolume, then renaming it
  bcachefs: Add missing READ_ONCE() for metadata replicas
  bcachefs: snapshot_node_missing is now autofix
  bcachefs: Log message when incompat version requested but not enabled
  bcachefs: Print version_incompat_allowed on startup
  bcachefs: Silence extent_poisoned error messages
  bcachefs: btree_root_unreadable_and_scan_found_nothing now AUTOFIX
  bcachefs: fix bch2_dev_usage_full_read_fast()
  bcachefs: Don't print data read retry success on non-errors
  bcachefs: Add missing error handling
  bcachefs: Prevent granting write refs when filesystem is read-only
2025-04-17 15:08:29 -07:00
Kent Overstreet
261592ba06 bcachefs: Fix snapshotting a subvolume, then renaming it
Subvolume roots and the dirents that point to them are special; they
don't obey the normal snapshot versioning rules because they cross
snapshot boundaries.

We don't keep around older versions of subvolume dirents on rename - we
don't need to, because subvolume dirents are only visible in the parent
subvolume, and we wouldn't be able to match up the different dirent and
inode versions due to crossing the snapshot ID boundary.

That means that when we rename a subvolume, that's been snapshotted, the
older version of the subvolume root will become dangling - it won't have
a dirent that points to it.

That's expected, we just need to tell fsck that this is ok.

Fixes: https://github.com/koverstreet/bcachefs/issues/856
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-17 14:17:16 -04:00
Kent Overstreet
8dd3804bf4 bcachefs: Add missing READ_ONCE() for metadata replicas
If we race with the user changing the metadata_replicas setting, this
could cause us to get an incorrectly sized disk reservation.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-16 18:31:47 -04:00
Kent Overstreet
72b5259053 bcachefs: snapshot_node_missing is now autofix
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-15 17:37:39 -04:00
Kent Overstreet
c3b02e6d67 bcachefs: Log message when incompat version requested but not enabled
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-15 11:35:05 -04:00
Kent Overstreet
14bcf982f4 bcachefs: Print version_incompat_allowed on startup
Let users know if incompatible features aren't enabled

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-15 11:35:05 -04:00
Kent Overstreet
a06459657e bcachefs: Silence extent_poisoned error messages
extent poisoning is partly so that we don't keep spewing the dmesg log
when we've got unreadable data - we don't want to print these.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-15 11:34:37 -04:00
Kent Overstreet
8692c7db9a bcachefs: btree_root_unreadable_and_scan_found_nothing now AUTOFIX
This will likely mean that the btree had only one node - there was
nothing or almost nothing in it, and we should reconstruct and continue.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-13 21:17:55 -04:00
Kent Overstreet
345731a389 bcachefs: fix bch2_dev_usage_full_read_fast()
One reference to bch_dev_usage wasn't updated, which meant we weren't
reading the full bch_dev_usage_full - oops.

Fixes: 955ba7b5ea ("bcachefs: bch_dev_usage_full")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-13 14:54:34 -04:00
Kent Overstreet
7dfd42a07a bcachefs: Don't print data read retry success on non-errors
We may end up in the data read retry path when reading cached data and
racing with invalidation, or on checksum error when we were reading into
a userspace buffer that might have been modified while the read was in
flight.

These aren't real errors, so we shouldn't print the 'retry success'
message.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-13 10:17:00 -04:00
Alan Huang
806776ad9c bcachefs: Add missing error handling
Reported-by: syzbot+d10151bf01574a09a915@syzkaller.appspotmail.com
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-13 07:56:28 -04:00
Gabriel Shahrouzi
d62922ba3c bcachefs: Prevent granting write refs when filesystem is read-only
Fix a shutdown WARNING in bch2_dev_free caused by active write I/O
references (ca->io_ref[WRITE]) on a device being freed.

The problem occurs when:
- The filesystem is marked read-only (BCH_FS_rw clear in c->flags).
- A subsequent operation (e.g., error handling for device removal)
  incorrectly tries to grant write references back to a device.
- During final shutdown, the read-only flag causes the system to skip
  stopping write I/O references (bch2_dev_io_ref_stop(ca, WRITE)).
- The leftover active write reference triggers the WARN_ON in
  bch2_dev_free.

Prevent this by checking if the filesystem is read-only before
attempting to grant write references to a device in the problematic
code path. Ensure consistency between the filesystem state flag
and the device I/O reference state during shutdown.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-13 07:56:28 -04:00
Linus Torvalds
ef77858826 bcachefs fixes for 6.15-rc2
Mostly minor fixes.
 
 Eric Biggers' crypto API conversion is included because of long standing
 sporadic crashes - mostly, but not entirely syzbot - in the crypto API
 code when calling poly1305, which have been nigh impossible to reproduce
 and debug. His rework deletes the code where we've seen the crashes, so
 either it'll be a fix or we'll end up with backtraces we can debug.
 (Thanks Eric!).
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmf4Yq8ACgkQE6szbY3K
 bnb1fxAAu68Ll/4PLWr3xHVp7ETWgEZSzuwRqA87fqs/Q0jNRC2aDIO03Wmj28qM
 ckEM3PFPuiQubLhOmU21Osta/sFU6GxL0IggMoEC50F5XiVlcKRiNSWhRLnr07Qp
 v7sc5MQ1HDGnZXQMcdRymzRWixn2hzdRMwKOvbhBHuj0YUPd4yk5I/+AtJi5SAnT
 ri+dGIQLBRmG7J2pd4AZNza0YZ5pkTAFj9/4wRVoJdKX2pPzf40e7qzYNPt4/6rK
 A6P9ecU3TDDDQEE4S7s8Dng4rXwsa+9qTUcpXnTTC1L6YbbnZd/IQYzWI4b+FUsS
 wqnUD+aE7UEMANZh891QlJpGj3ih/6z8opUP4T6RdsVuJwt9X1vFJY99CsOTEf1o
 7jAcssL+ueEWPZj8tBoN1niujikyFsXM+xKiUOMZxbuM6BhE40j/WrA77sRhI5+I
 7DXlf5s8SDh+gw0IGUboBJe3ofGisRXnfxeZAKQHGHgtEFboY4bDAURcGW4MbIqE
 uN5Cd+5IJlcKmJdXLCbHMb5KktfBNWu9/VrOMcZ2QHhIuOfd3fFgLzE0ZEroj4lN
 kTWxpzKeNDt3bPF4esYnvduafHDbzClwfkTt5TBgcOeE4TcIL2mOmweLE2LTKIwW
 xr5Xhqx1/9//PeaOTwxbCoeZ26G0Q9B8L1+eUZgjPS0FcRdZH9w=
 =DpNC
 -----END PGP SIGNATURE-----

Merge tag 'bcachefs-2025-04-10' of git://evilpiepirate.org/bcachefs

Pull bcachefs fixes from Kent Overstreet:
 "Mostly minor fixes.

  Eric Biggers' crypto API conversion is included because of long
  standing sporadic crashes - mostly, but not entirely syzbot - in the
  crypto API code when calling poly1305, which have been nigh impossible
  to reproduce and debug.

  His rework deletes the code where we've seen the crashes, so either
  it'll be a fix or we'll end up with backtraces we can debug. (Thanks
  Eric!)"

* tag 'bcachefs-2025-04-10' of git://evilpiepirate.org/bcachefs:
  bcachefs: Use sort_nonatomic() instead of sort()
  bcachefs: Remove unnecessary softdep on xxhash
  bcachefs: use library APIs for ChaCha20 and Poly1305
  bcachefs: Fix duplicate "ro,read_only" in opts at startup
  bcachefs: Fix UAF in bchfs_read()
  bcachefs: Use cpu_to_le16 for dirent lengths
  bcachefs: Fix type for parameter in journal_advance_devs_to_next_bucket
  bcachefs: Fix escape sequence in prt_printf
2025-04-10 19:38:22 -07:00
Linus Torvalds
97c484ccb8 CRC cleanups for 6.15
Finish cleaning up the CRC kconfig options by removing the remaining
 unnecessary prompts and an unnecessary 'default y', removing
 CONFIG_LIBCRC32C, and documenting all the CRC library options.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCZ/P7QhQcZWJpZ2dlcnNA
 Z29vZ2xlLmNvbQAKCRDzXCl4vpKOKyoOAQCynFcS1dWuD27S+SdUREmBjMAoZo5M
 zdsIvlPv9KLycgD/QX5lXjW3KIYY6jQ8vHUuLVwfDl/JEp4GJS9dLGU+agg=
 =0R1T
 -----END PGP SIGNATURE-----

Merge tag 'crc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux

Pull CRC cleanups from Eric Biggers:
 "Finish cleaning up the CRC kconfig options by removing the remaining
  unnecessary prompts and an unnecessary 'default y', removing
  CONFIG_LIBCRC32C, and documenting all the CRC library options"

* tag 'crc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux:
  lib/crc: remove CONFIG_LIBCRC32C
  lib/crc: document all the CRC library kconfig options
  lib/crc: remove unnecessary prompt for CONFIG_CRC_ITU_T
  lib/crc: remove unnecessary prompt for CONFIG_CRC_T10DIF
  lib/crc: remove unnecessary prompt for CONFIG_CRC16
  lib/crc: remove unnecessary prompt for CONFIG_CRC_CCITT
  lib/crc: remove unnecessary prompt for CONFIG_CRC32 and drop 'default y'
2025-04-08 12:09:28 -07:00
Kent Overstreet
55fd97fbc4 bcachefs: Use sort_nonatomic() instead of sort()
Fixes "task out to lunch" warnings during recovery on large machines
with lots of dirty data in the journal.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-06 19:33:53 -04:00
Eric Biggers
c92896ffb7 bcachefs: Remove unnecessary softdep on xxhash
As with the other algorithms, bcachefs does not access xxhash through
the crypto API.  So there is no need to use a module softdep to ensure
that it is loaded.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-06 19:33:53 -04:00
Eric Biggers
4bf4b5046d bcachefs: use library APIs for ChaCha20 and Poly1305
Just use the ChaCha20 and Poly1305 libraries instead of the clunky
crypto API.  This is much simpler.  It is also slightly faster, since
the libraries provide more direct access to the same
architecture-optimized ChaCha20 and Poly1305 code.

I've tested that existing encrypted bcachefs filesystems can be continue
to be accessed with this patch applied.

Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-06 19:33:53 -04:00
Kent Overstreet
1ec94a9f6d bcachefs: Fix duplicate "ro,read_only" in opts at startup
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-06 19:33:53 -04:00
Kent Overstreet
34b47e3d73 bcachefs: Fix UAF in bchfs_read()
Commit 3ba0240a87 fixed a bug in the read retry path in __bch2_read(),
and changed bchfs_read() to match - to avoid a landmine if
bch2_read_extent() ever starts returning transaction restarts.

But that was incorrect, because bchfs_read() doesn't use a separate
stack allocated bvec_iter, it uses the one in the rbio being submitted.

Add a comment explaining the issue, and revert the buggy change.

Fixes: 3ba0240a87 ("bcachefs: Fix silent short reads in data read retry path")
Reported-by: syzbot+2deb10b8dc9aae6fab67@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-06 19:13:43 -04:00
Gabriel Shahrouzi
4a22a73323 bcachefs: Use cpu_to_le16 for dirent lengths
Prevent incorrect byte ordering for big-endian systems.

Signed-off-by: Gabriel Shahrouzi <gshahrouzi@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-06 19:13:43 -04:00
Gabriel Shahrouzi
afc5444e4d bcachefs: Fix type for parameter in journal_advance_devs_to_next_bucket
Replace u64 with __le64 to match the expected parameter type. Ensure consistency both in function calls and within the function itself.

Signed-off-by: Gabriel Shahrouzi <gshahrouzi@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-06 19:13:43 -04:00
Gabriel Shahrouzi
f5cd27ec71 bcachefs: Fix escape sequence in prt_printf
Remove backslash before format specifier. Ensure correct output.

Signed-off-by: Gabriel Shahrouzi <gshahrouzi@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-06 19:13:43 -04:00
Thomas Gleixner
8fa7292fee treewide: Switch/rename to timer_delete[_sync]()
timer_delete[_sync]() replaces del_timer[_sync](). Convert the whole tree
over and remove the historical wrapper inlines.

Conversion was done with coccinelle plus manual fixups where necessary.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-04-05 10:30:12 +02:00
Eric Biggers
b261d22220 lib/crc: remove CONFIG_LIBCRC32C
Now that LIBCRC32C does nothing besides select CRC32, make every option
that selects LIBCRC32C instead select CRC32 directly.  Then remove
LIBCRC32C.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20250401221600.24878-8-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
2025-04-04 11:31:42 -07:00
Linus Torvalds
56770e24f6 bcachefs fixes for 6.15-rc1
More notable fixes:
 
 - Fix for striping behaviour on tiering filesystems where replicas
   exceeds durability on destination target
 - Fix a race in device removal where deleting alloc info races with the
   discard worker
 - Some small stack usage improvements: this is just enough for KMSAN
   builds to not blow the stack, more is queued up for 6.16.
 -----BEGIN PGP SIGNATURE-----
 
 iQIyBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmfu7pMACgkQE6szbY3K
 bnY6rw/3W4dho57OPjOoHUbQ7A7IK1hI4SFvGtDDb3vX1RjF2r+RbdBupMAd0zGj
 T+5SzhYCQLGbyfBa6MW+iPqiHokZsG904+3mRogOf9cpz2Mup9ZOq/vV+Z7ndaF8
 2i9wpQTb7GShkSaXkeTvQqnx3YAUxVcRB2ExraTXmv4wIxr1SYyJEeakmMBDasJB
 UanXXVHzrKo9WLiqWz0JSZCiuQW2v03P84zZo1d/GyMKlTxYDt5aAteos77lJBef
 5CWVr4/HKKozt/vI2qHQ+3LJXktLjvb07zoENXwadmgQawYA3nQ+9jLT3Q0FKjXG
 bK28AHTtiXgWsYsbCs5sVh1+WLPdEj0UBBoFZGWo++TzaN2hXhoMsFTQfuddhaEh
 W63MWtelv4TGIVOEFk+ayHRgPL6ajhCsa1boHS9EKdosl2nl9Vk9Nq0i++hYZDGW
 KhWqENT9E5EpVCnZ6H4m1tsXprWavNqXnkOJzXW0T2F3t8+94zp1n6YXkwDdgLfs
 l+xTEEAL5J8lvlfSS6dW7QcMSMtMKbo3+qlerpH8J4zBZJBbb2nF1ggCtpYg6zFt
 4Jgs5FPQLVqWsPXQr4CaSF2UIt3zMPnNIawL1cEpRBU1j35qo0e/kxIjEpS0Pnjt
 mX67gBlodY54/pwGGLfc/Vkw4xqh//dqTmYIdHkibdAEvKf0dg==
 =2TfM
 -----END PGP SIGNATURE-----

Merge tag 'bcachefs-2025-04-03' of git://evilpiepirate.org/bcachefs

Pull more bcachefs updates from Kent Overstreet:
 "More notable fixes:

   - Fix for striping behaviour on tiering filesystems where replicas
     exceeds durability on destination target

   - Fix a race in device removal where deleting alloc info races with
     the discard worker

   - Some small stack usage improvements: this is just enough for KMSAN
     builds to not blow the stack, more is queued up for 6.16"

* tag 'bcachefs-2025-04-03' of git://evilpiepirate.org/bcachefs:
  bcachefs: Fix "journal stuck" during recovery
  bcachefs: backpointer_get_key: check for null from peek_slot()
  bcachefs: Fix null ptr deref in invalidate_one_bucket()
  bcachefs: Fix check_snapshot_exists() restart handling
  bcachefs: use nonblocking variant of print_string_as_lines in error path
  bcachefs: Fix scheduling while atomic from logging changes
  bcachefs: Add error handling for zlib_deflateInit2()
  bcachefs: add missing selection of XARRAY_MULTI
  bcachefs: bch_dev_usage_full
  bcachefs: Kill btree_iter.trans
  bcachefs: do_trace_key_cache_fill()
  bcachefs: Split up bch_dev.io_ref
  bcachefs: fix ref leak in btree_node_read_all_replicas
  bcachefs: Fix null ptr deref in bch2_write_endio()
  bcachefs: Fix field spanning write warning
  bcachefs: Fix striping behaviour
2025-04-03 15:39:47 -07:00
Kent Overstreet
77ad1df82b bcachefs: Fix "journal stuck" during recovery
If we crash when the journal pin fifo is completely full - i.e. we're
at the maximum number of dirty journal entries - that may put us in a
sticky situation in recovery, as journal replay will need to be able to
open new journal entries in order to get going.

bch2_fs_journal_start() already had provisions for resizing the journal
pin fifo if needed, but it needs a fudge factor to ensure there's room
for journal replay.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-03 12:11:43 -04:00
Kent Overstreet
2581f89ac8 bcachefs: backpointer_get_key: check for null from peek_slot()
peek_slot() doesn't normally return bkey_s_c_null - except when we ask
for a key at a btree level that doesn't exist, which can happen here.

We might want to revisit this, but we'll have to look over all the
places where we use peek_slot() on interior nodes.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-03 12:11:43 -04:00
Kent Overstreet
39ebd74864 bcachefs: Fix null ptr deref in invalidate_one_bucket()
bch2_backpointer_get_key() returns bkey_s_c_null when the target isn't
found.

backpointer_get_key() flags the error, so there's nothing else to do
here - just skip it and move on.

Link: https://github.com/koverstreet/bcachefs/issues/847
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-03 12:11:43 -04:00
Kent Overstreet
83d539b1b0 bcachefs: Fix check_snapshot_exists() restart handling
Codepaths that create entries in the snapshots btree currently call
bch2_mark_snapshot(), which updates the in-memory snapshot table, before
transaction commit.

This is because bch2_mark_snapshot() is an atomic trigger, run with
btree write locks held, and isn't allowed to fail - but it might need to
reallocate the table, hence we call it early when we're still allowed to
fail.

This is generally harmless - if we fail, we'll have left an entry in the
snapshots table around, but nothing will reference it and it'll get
overwritten if reused by another transaction.

But check_snapshot_exists(), which reconstructs snapshots when the
snapshots btree has been corrupted or lost, was erronously rechecking if
the snapshot exists inside the transaction commit loop - so on
transaction restart (in this case mem_realloced), the second iteration
would return without repairing.

This code needs some cleanup: splitting out a "maybe realloc snapshots
table" helper would have avoided this, that will be in the next patch.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-03 12:11:43 -04:00
Bharadwaj Raju
570f5050bb bcachefs: use nonblocking variant of print_string_as_lines in error path
The inconsistency error path calls print_string_as_lines, which calls
console_lock, which is a potentially-sleeping function and so can't be
called in an atomic context.

Replace calls to it with the nonblocking variant which is safe to call.

Signed-off-by: Bharadwaj Raju <bharadwaj.raju777@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-03 12:11:42 -04:00
Kent Overstreet
b2ffadcc7f bcachefs: Fix scheduling while atomic from logging changes
Two fixes from the recent logging changes:

bch2_inconsistent(), bch2_fs_inconsistent() be called from interrupt
context, or with rcu_read_lock() held.

The one syzbot found is in
  bch2_bkey_pick_read_device
  bch2_dev_rcu
  bch2_fs_inconsistent

We're starting to switch to lift the printbufs up to higher levels so we
can emit better log messages and print them all in one go (avoid
garbling), so that conversion will help with spotting these in the
future; when we declare a printbuf it must be flagged if we're in an
atomic context.

Secondly, in btree_node_write_endio:

00085 BUG: sleeping function called from invalid context at include/linux/sched/mm.h:321
00085 in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 618, name: bch-reclaim/fa6
00085 preempt_count: 10001, expected: 0
00085 RCU nest depth: 0, expected: 0
00085 4 locks held by bch-reclaim/fa6/618:
00085  #0: ffffff80d7ccad68 (&j->reclaim_lock){+.+.}-{4:4}, at: bch2_journal_reclaim_thread+0x84/0x198
00085  #1: ffffff80d7c84218 (&c->btree_trans_barrier){.+.+}-{0:0}, at: __bch2_trans_get+0x1c0/0x440
00085  #2: ffffff80cd3f8140 (bcachefs_btree){+.+.}-{0:0}, at: __bch2_trans_get+0x22c/0x440
00085  #3: ffffff80c3823c20 (&vblk->vqs[i].lock){-.-.}-{3:3}, at: virtblk_done+0x58/0x130
00085 irq event stamp: 328
00085 hardirqs last  enabled at (327): [<ffffffc080073a14>] finish_task_switch.isra.0+0xbc/0x2a0
00085 hardirqs last disabled at (328): [<ffffffc080971a10>] el1_interrupt+0x20/0x60
00085 softirqs last  enabled at (0): [<ffffffc08002f920>] copy_process+0x7c8/0x2118
00085 softirqs last disabled at (0): [<0000000000000000>] 0x0
00085 Preemption disabled at:
00085 [<ffffffc08003ada0>] irq_enter_rcu+0x18/0x90
00085 CPU: 8 UID: 0 PID: 618 Comm: bch-reclaim/fa6 Not tainted 6.14.0-rc6-ktest-g04630bde23e8 #18798
00085 Hardware name: linux,dummy-virt (DT)
00085 Call trace:
00085  show_stack+0x1c/0x30 (C)
00085  dump_stack_lvl+0x84/0xc0
00085  dump_stack+0x14/0x20
00085  __might_resched+0x180/0x288
00085  __might_sleep+0x4c/0x88
00085  __kmalloc_node_track_caller_noprof+0x34c/0x3e0
00085  krealloc_noprof+0x1a0/0x2d8
00085  bch2_printbuf_make_room+0x9c/0x120
00085  bch2_prt_printf+0x60/0x1b8
00085  btree_node_write_endio+0x1b0/0x2d8
00085  bio_endio+0x138/0x1f0
00085  btree_node_write_endio+0xe8/0x2d8
00085  bio_endio+0x138/0x1f0
00085  blk_update_request+0x220/0x4c0
00085  blk_mq_end_request+0x28/0x148
00085  virtblk_request_done+0x64/0xe8
00085  blk_mq_complete_request+0x34/0x40
00085  virtblk_done+0x78/0x130
00085  vring_interrupt+0x6c/0xb0
00085  __handle_irq_event_percpu+0x8c/0x2e0
00085  handle_irq_event+0x50/0xb0
00085  handle_fasteoi_irq+0xc4/0x250
00085  handle_irq_desc+0x44/0x60
00085  generic_handle_domain_irq+0x20/0x30
00085  gic_handle_irq+0x54/0xc8
00085  call_on_irq_stack+0x24/0x40

Reported-by: syzbot+c82cd2906e2f192410bb@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-03 12:11:42 -04:00
Wentao Liang
9364f17ba4 bcachefs: Add error handling for zlib_deflateInit2()
In attempt_compress(), the return value of zlib_deflateInit2() needs to be
checked. A proper implementation can be found in  pstore_compress().

Add an error check and return 0 immediately if the initialzation fails.

Fixes: 986e9842fb ("bcachefs: Compression levels")
Signed-off-by: Wentao Liang <vulab@iscas.ac.cn>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-03 12:11:42 -04:00
Eric Biggers
a07c43e6c2 bcachefs: add missing selection of XARRAY_MULTI
When CONFIG_XARRAY_MULTI is not set, reading from a bcachefs file hits
the 'BUG_ON(order > 0);' in xas_set_order(), because it tries to insert
a large folio in the page cache.  Fix this by making bcachefs select
XARRAY_MULTI.

Fixes: be212d86b1 ("bcachefs: bs > ps support")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-02 10:24:34 -04:00
Kent Overstreet
955ba7b5ea bcachefs: bch_dev_usage_full
All the fastpaths that need device usage don't need the sector totals or
fragmentation, just bucket counts.

Split bch_dev_usage up into two different versions, the normal one with
just bucket counts.

This is also a stack usage improvement, since we have a bch_dev_usage on
the stack in the allocation path.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-02 10:24:34 -04:00
Kent Overstreet
9180ad2e16 bcachefs: Kill btree_iter.trans
This was planned to be done ages ago, now finally completed; there are
places where we have quite a few btree_trans objects on the stack, so
this reduces stack usage somewhat.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-02 10:24:34 -04:00
Kent Overstreet
1c8f4587d2 bcachefs: do_trace_key_cache_fill()
Reducing stack frame usage; this moves the printbuf out of the main
stack frame.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-02 10:24:34 -04:00
Kent Overstreet
dcffc3b1ae bcachefs: Split up bch_dev.io_ref
We now have separate per device io_refs for read and write access.

This fixes a device removal bug where the discard workers were still
running while we're removing alloc info for that device.

It's also a bit of hardening; we no longer allow writes to devices that
are read-only.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-02 10:24:34 -04:00
Linus Torvalds
eb0ece1602 - The 6 patch series "Enable strict percpu address space checks" from
Uros Bizjak uses x86 named address space qualifiers to provide
   compile-time checking of percpu area accesses.
 
   This has caused a small amount of fallout - two or three issues were
   reported.  In all cases the calling code was founf to be incorrect.
 
 - The 4 patch series "Some cleanup for memcg" from Chen Ridong
   implements some relatively monir cleanups for the memcontrol code.
 
 - The 17 patch series "mm: fixes for device-exclusive entries (hmm)"
   from David Hildenbrand fixes a boatload of issues which David found then
   using device-exclusive PTE entries when THP is enabled.  More work is
   needed, but this makes thins better - our own HMM selftests now succeed.
 
 - The 2 patch series "mm: zswap: remove z3fold and zbud" from Yosry
   Ahmed remove the z3fold and zbud implementations.  They have been
   deprecated for half a year and nobody has complained.
 
 - The 5 patch series "mm: further simplify VMA merge operation" from
   Lorenzo Stoakes implements numerous simplifications in this area.  No
   runtime effects are anticipated.
 
 - The 4 patch series "mm/madvise: remove redundant mmap_lock operations
   from process_madvise()" from SeongJae Park rationalizes the locking in
   the madvise() implementation.  Performance gains of 20-25% were observed
   in one MADV_DONTNEED microbenchmark.
 
 - The 12 patch series "Tiny cleanup and improvements about SWAP code"
   from Baoquan He contains a number of touchups to issues which Baoquan
   noticed when working on the swap code.
 
 - The 2 patch series "mm: kmemleak: Usability improvements" from Catalin
   Marinas implements a couple of improvements to the kmemleak user-visible
   output.
 
 - The 2 patch series "mm/damon/paddr: fix large folios access and
   schemes handling" from Usama Arif provides a couple of fixes for DAMON's
   handling of large folios.
 
 - The 3 patch series "mm/damon/core: fix wrong and/or useless
   damos_walk() behaviors" from SeongJae Park fixes a few issues with the
   accuracy of kdamond's walking of DAMON regions.
 
 - The 3 patch series "expose mapping wrprotect, fix fb_defio use" from
   Lorenzo Stoakes changes the interaction between framebuffer deferred-io
   and core MM.  No functional changes are anticipated - this is
   preparatory work for the future removal of page structure fields.
 
 - The 4 patch series "mm/damon: add support for hugepage_size DAMOS
   filter" from Usama Arif adds a DAMOS filter which permits the filtering
   by huge page sizes.
 
 - The 4 patch series "mm: permit guard regions for file-backed/shmem
   mappings" from Lorenzo Stoakes extends the guard region feature from its
   present "anon mappings only" state.  The feature now covers shmem and
   file-backed mappings.
 
 - The 4 patch series "mm: batched unmap lazyfree large folios during
   reclamation" from Barry Song cleans up and speeds up the unmapping for
   pte-mapped large folios.
 
 - The 18 patch series "reimplement per-vma lock as a refcount" from
   Suren Baghdasaryan puts the vm_lock back into the vma.  Our reasons for
   pulling it out were largely bogus and that change made the code more
   messy.  This patchset provides small (0-10%) improvements on one
   microbenchmark.
 
 - The 5 patch series "Docs/mm/damon: misc DAMOS filters documentation
   fixes and improves" from SeongJae Park does some maintenance work on the
   DAMON docs.
 
 - The 27 patch series "hugetlb/CMA improvements for large systems" from
   Frank van der Linden addresses a pile of issues which have been observed
   when using CMA on large machines.
 
 - The 2 patch series "mm/damon: introduce DAMOS filter type for unmapped
   pages" from SeongJae Park enables users of DMAON/DAMOS to filter my the
   page's mapped/unmapped status.
 
 - The 19 patch series "zsmalloc/zram: there be preemption" from Sergey
   Senozhatsky teaches zram to run its compression and decompression
   operations preemptibly.
 
 - The 12 patch series "selftests/mm: Some cleanups from trying to run
   them" from Brendan Jackman fixes a pile of unrelated issues which
   Brendan encountered while runnimg our selftests.
 
 - The 2 patch series "fs/proc/task_mmu: add guard region bit to pagemap"
   from Lorenzo Stoakes permits userspace to use /proc/pid/pagemap to
   determine whether a particular page is a guard page.
 
 - The 7 patch series "mm, swap: remove swap slot cache" from Kairui Song
   removes the swap slot cache from the allocation path - it simply wasn't
   being effective.
 
 - The 5 patch series "mm: cleanups for device-exclusive entries (hmm)"
   from David Hildenbrand implements a number of unrelated cleanups in this
   code.
 
 - The 5 patch series "mm: Rework generic PTDUMP configs" from Anshuman
   Khandual implements a number of preparatoty cleanups to the
   GENERIC_PTDUMP Kconfig logic.
 
 - The 8 patch series "mm/damon: auto-tune aggregation interval" from
   SeongJae Park implements a feedback-driven automatic tuning feature for
   DAMON's aggregation interval tuning.
 
 - The 5 patch series "Fix lazy mmu mode" from Ryan Roberts fixes some
   issues in powerpc, sparc and x86 lazy MMU implementations.  Ryan did
   this in preparation for implementing lazy mmu mode for arm64 to optimize
   vmalloc.
 
 - The 2 patch series "mm/page_alloc: Some clarifications for migratetype
   fallback" from Brendan Jackman reworks some commentary to make the code
   easier to follow.
 
 - The 3 patch series "page_counter cleanup and size reduction" from
   Shakeel Butt cleans up the page_counter code and fixes a size increase
   which we accidentally added late last year.
 
 - The 3 patch series "Add a command line option that enables control of
   how many threads should be used to allocate huge pages" from Thomas
   Prescher does that.  It allows the careful operator to significantly
   reduce boot time by tuning the parallalization of huge page
   initialization.
 
 - The 3 patch series "Fix calculations in trace_balance_dirty_pages()
   for cgwb" from Tang Yizhou fixes the tracing output from the dirty page
   balancing code.
 
 - The 9 patch series "mm/damon: make allow filters after reject filters
   useful and intuitive" from SeongJae Park improves the handling of allow
   and reject filters.  Behaviour is made more consistent and the
   documention is updated accordingly.
 
 - The 5 patch series "Switch zswap to object read/write APIs" from Yosry
   Ahmed updates zswap to the new object read/write APIs and thus permits
   the removal of some legacy code from zpool and zsmalloc.
 
 - The 6 patch series "Some trivial cleanups for shmem" from Baolin Wang
   does as it claims.
 
 - The 20 patch series "fs/dax: Fix ZONE_DEVICE page reference counts"
   from Alistair Popple regularizes the weird ZONE_DEVICE page refcount
   handling in DAX, permittig the removal of a number of special-case
   checks.
 
 - The 4 patch series "refactor mremap and fix bug" from Lorenzo Stoakes
   is a preparatoty refactoring and cleanup of the mremap() code.
 
 - The 20 patch series "mm: MM owner tracking for large folios (!hugetlb)
   + CONFIG_NO_PAGE_MAPCOUNT" from David Hildenbrand reworks the manner in
   which we determine whether a large folio is known to be mapped
   exclusively into a single MM.
 
 - The 8 patch series "mm/damon: add sysfs dirs for managing DAMOS
   filters based on handling layers" from SeongJae Park adds a couple of
   new sysfs directories to ease the management of DAMON/DAMOS filters.
 
 - The 13 patch series "arch, mm: reduce code duplication in mem_init()"
   from Mike Rapoport consolidates many per-arch implementations of
   mem_init() into code generic code, where that is practical.
 
 - The 13 patch series "mm/damon/sysfs: commit parameters online via
   damon_call()" from SeongJae Park continues the cleaning up of sysfs
   access to DAMON internal data.
 
 - The 3 patch series "mm: page_ext: Introduce new iteration API" from
   Luiz Capitulino reworks the page_ext initialization to fix a boot-time
   crash which was observed with an unusual combination of compile and
   cmdline options.
 
 - The 8 patch series "Buddy allocator like (or non-uniform) folio split"
   from Zi Yan reworks the code to split a folio into smaller folios.  The
   main benefit is lessened memory consumption: fewer post-split folios are
   generated.
 
 - The 2 patch series "Minimize xa_node allocation during xarry split"
   from Zi Yan reduces the number of xarray xa_nodes which are generated
   during an xarray split.
 
 - The 2 patch series "drivers/base/memory: Two cleanups" from Gavin Shan
   performs some maintenance work on the drivers/base/memory code.
 
 - The 3 patch series "Add tracepoints for lowmem reserves, watermarks
   and totalreserve_pages" from Martin Liu adds some more tracepoints to
   the page allocator code.
 
 - The 4 patch series "mm/madvise: cleanup requests validations and
   classifications" from SeongJae Park cleans up some warts which SeongJae
   observed during his earlier madvise work.
 
 - The 3 patch series "mm/hwpoison: Fix regressions in memory failure
   handling" from Shuai Xue addresses two quite serious regressions which
   Shuai has observed in the memory-failure implementation.
 
 - The 5 patch series "mm: reliable huge page allocator" from Johannes
   Weiner makes huge page allocations cheaper and more reliable by reducing
   fragmentation.
 
 - The 5 patch series "Minor memcg cleanups & prep for memdescs" from
   Matthew Wilcox is preparatory work for the future implementation of
   memdescs.
 
 - The 4 patch series "track memory used by balloon drivers" from Nico
   Pache introduces a way to track memory used by our various balloon
   drivers.
 
 - The 2 patch series "mm/damon: introduce DAMOS filter type for active
   pages" from Nhat Pham permits users to filter for active/inactive pages,
   separately for file and anon pages.
 
 - The 2 patch series "Adding Proactive Memory Reclaim Statistics" from
   Hao Jia separates the proactive reclaim statistics from the direct
   reclaim statistics.
 
 - The 2 patch series "mm/vmscan: don't try to reclaim hwpoison folio"
   from Jinjiang Tu fixes our handling of hwpoisoned pages within the
   reclaim code.
 -----BEGIN PGP SIGNATURE-----
 
 iHQEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZ+nZaAAKCRDdBJ7gKXxA
 jsOWAPiP4r7CJHMZRK4eyJOkvS1a1r+TsIarrFZtjwvf/GIfAQCEG+JDxVfUaUSF
 Ee93qSSLR1BkNdDw+931Pu0mXfbnBw==
 =Pn2K
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2025-03-30-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - The series "Enable strict percpu address space checks" from Uros
   Bizjak uses x86 named address space qualifiers to provide
   compile-time checking of percpu area accesses.

   This has caused a small amount of fallout - two or three issues were
   reported. In all cases the calling code was found to be incorrect.

 - The series "Some cleanup for memcg" from Chen Ridong implements some
   relatively monir cleanups for the memcontrol code.

 - The series "mm: fixes for device-exclusive entries (hmm)" from David
   Hildenbrand fixes a boatload of issues which David found then using
   device-exclusive PTE entries when THP is enabled. More work is
   needed, but this makes thins better - our own HMM selftests now
   succeed.

 - The series "mm: zswap: remove z3fold and zbud" from Yosry Ahmed
   remove the z3fold and zbud implementations. They have been deprecated
   for half a year and nobody has complained.

 - The series "mm: further simplify VMA merge operation" from Lorenzo
   Stoakes implements numerous simplifications in this area. No runtime
   effects are anticipated.

 - The series "mm/madvise: remove redundant mmap_lock operations from
   process_madvise()" from SeongJae Park rationalizes the locking in the
   madvise() implementation. Performance gains of 20-25% were observed
   in one MADV_DONTNEED microbenchmark.

 - The series "Tiny cleanup and improvements about SWAP code" from
   Baoquan He contains a number of touchups to issues which Baoquan
   noticed when working on the swap code.

 - The series "mm: kmemleak: Usability improvements" from Catalin
   Marinas implements a couple of improvements to the kmemleak
   user-visible output.

 - The series "mm/damon/paddr: fix large folios access and schemes
   handling" from Usama Arif provides a couple of fixes for DAMON's
   handling of large folios.

 - The series "mm/damon/core: fix wrong and/or useless damos_walk()
   behaviors" from SeongJae Park fixes a few issues with the accuracy of
   kdamond's walking of DAMON regions.

 - The series "expose mapping wrprotect, fix fb_defio use" from Lorenzo
   Stoakes changes the interaction between framebuffer deferred-io and
   core MM. No functional changes are anticipated - this is preparatory
   work for the future removal of page structure fields.

 - The series "mm/damon: add support for hugepage_size DAMOS filter"
   from Usama Arif adds a DAMOS filter which permits the filtering by
   huge page sizes.

 - The series "mm: permit guard regions for file-backed/shmem mappings"
   from Lorenzo Stoakes extends the guard region feature from its
   present "anon mappings only" state. The feature now covers shmem and
   file-backed mappings.

 - The series "mm: batched unmap lazyfree large folios during
   reclamation" from Barry Song cleans up and speeds up the unmapping
   for pte-mapped large folios.

 - The series "reimplement per-vma lock as a refcount" from Suren
   Baghdasaryan puts the vm_lock back into the vma. Our reasons for
   pulling it out were largely bogus and that change made the code more
   messy. This patchset provides small (0-10%) improvements on one
   microbenchmark.

 - The series "Docs/mm/damon: misc DAMOS filters documentation fixes and
   improves" from SeongJae Park does some maintenance work on the DAMON
   docs.

 - The series "hugetlb/CMA improvements for large systems" from Frank
   van der Linden addresses a pile of issues which have been observed
   when using CMA on large machines.

 - The series "mm/damon: introduce DAMOS filter type for unmapped pages"
   from SeongJae Park enables users of DMAON/DAMOS to filter my the
   page's mapped/unmapped status.

 - The series "zsmalloc/zram: there be preemption" from Sergey
   Senozhatsky teaches zram to run its compression and decompression
   operations preemptibly.

 - The series "selftests/mm: Some cleanups from trying to run them" from
   Brendan Jackman fixes a pile of unrelated issues which Brendan
   encountered while runnimg our selftests.

 - The series "fs/proc/task_mmu: add guard region bit to pagemap" from
   Lorenzo Stoakes permits userspace to use /proc/pid/pagemap to
   determine whether a particular page is a guard page.

 - The series "mm, swap: remove swap slot cache" from Kairui Song
   removes the swap slot cache from the allocation path - it simply
   wasn't being effective.

 - The series "mm: cleanups for device-exclusive entries (hmm)" from
   David Hildenbrand implements a number of unrelated cleanups in this
   code.

 - The series "mm: Rework generic PTDUMP configs" from Anshuman Khandual
   implements a number of preparatoty cleanups to the GENERIC_PTDUMP
   Kconfig logic.

 - The series "mm/damon: auto-tune aggregation interval" from SeongJae
   Park implements a feedback-driven automatic tuning feature for
   DAMON's aggregation interval tuning.

 - The series "Fix lazy mmu mode" from Ryan Roberts fixes some issues in
   powerpc, sparc and x86 lazy MMU implementations. Ryan did this in
   preparation for implementing lazy mmu mode for arm64 to optimize
   vmalloc.

 - The series "mm/page_alloc: Some clarifications for migratetype
   fallback" from Brendan Jackman reworks some commentary to make the
   code easier to follow.

 - The series "page_counter cleanup and size reduction" from Shakeel
   Butt cleans up the page_counter code and fixes a size increase which
   we accidentally added late last year.

 - The series "Add a command line option that enables control of how
   many threads should be used to allocate huge pages" from Thomas
   Prescher does that. It allows the careful operator to significantly
   reduce boot time by tuning the parallalization of huge page
   initialization.

 - The series "Fix calculations in trace_balance_dirty_pages() for cgwb"
   from Tang Yizhou fixes the tracing output from the dirty page
   balancing code.

 - The series "mm/damon: make allow filters after reject filters useful
   and intuitive" from SeongJae Park improves the handling of allow and
   reject filters. Behaviour is made more consistent and the documention
   is updated accordingly.

 - The series "Switch zswap to object read/write APIs" from Yosry Ahmed
   updates zswap to the new object read/write APIs and thus permits the
   removal of some legacy code from zpool and zsmalloc.

 - The series "Some trivial cleanups for shmem" from Baolin Wang does as
   it claims.

 - The series "fs/dax: Fix ZONE_DEVICE page reference counts" from
   Alistair Popple regularizes the weird ZONE_DEVICE page refcount
   handling in DAX, permittig the removal of a number of special-case
   checks.

 - The series "refactor mremap and fix bug" from Lorenzo Stoakes is a
   preparatoty refactoring and cleanup of the mremap() code.

 - The series "mm: MM owner tracking for large folios (!hugetlb) +
   CONFIG_NO_PAGE_MAPCOUNT" from David Hildenbrand reworks the manner in
   which we determine whether a large folio is known to be mapped
   exclusively into a single MM.

 - The series "mm/damon: add sysfs dirs for managing DAMOS filters based
   on handling layers" from SeongJae Park adds a couple of new sysfs
   directories to ease the management of DAMON/DAMOS filters.

 - The series "arch, mm: reduce code duplication in mem_init()" from
   Mike Rapoport consolidates many per-arch implementations of
   mem_init() into code generic code, where that is practical.

 - The series "mm/damon/sysfs: commit parameters online via
   damon_call()" from SeongJae Park continues the cleaning up of sysfs
   access to DAMON internal data.

 - The series "mm: page_ext: Introduce new iteration API" from Luiz
   Capitulino reworks the page_ext initialization to fix a boot-time
   crash which was observed with an unusual combination of compile and
   cmdline options.

 - The series "Buddy allocator like (or non-uniform) folio split" from
   Zi Yan reworks the code to split a folio into smaller folios. The
   main benefit is lessened memory consumption: fewer post-split folios
   are generated.

 - The series "Minimize xa_node allocation during xarry split" from Zi
   Yan reduces the number of xarray xa_nodes which are generated during
   an xarray split.

 - The series "drivers/base/memory: Two cleanups" from Gavin Shan
   performs some maintenance work on the drivers/base/memory code.

 - The series "Add tracepoints for lowmem reserves, watermarks and
   totalreserve_pages" from Martin Liu adds some more tracepoints to the
   page allocator code.

 - The series "mm/madvise: cleanup requests validations and
   classifications" from SeongJae Park cleans up some warts which
   SeongJae observed during his earlier madvise work.

 - The series "mm/hwpoison: Fix regressions in memory failure handling"
   from Shuai Xue addresses two quite serious regressions which Shuai
   has observed in the memory-failure implementation.

 - The series "mm: reliable huge page allocator" from Johannes Weiner
   makes huge page allocations cheaper and more reliable by reducing
   fragmentation.

 - The series "Minor memcg cleanups & prep for memdescs" from Matthew
   Wilcox is preparatory work for the future implementation of memdescs.

 - The series "track memory used by balloon drivers" from Nico Pache
   introduces a way to track memory used by our various balloon drivers.

 - The series "mm/damon: introduce DAMOS filter type for active pages"
   from Nhat Pham permits users to filter for active/inactive pages,
   separately for file and anon pages.

 - The series "Adding Proactive Memory Reclaim Statistics" from Hao Jia
   separates the proactive reclaim statistics from the direct reclaim
   statistics.

 - The series "mm/vmscan: don't try to reclaim hwpoison folio" from
   Jinjiang Tu fixes our handling of hwpoisoned pages within the reclaim
   code.

* tag 'mm-stable-2025-03-30-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (431 commits)
  mm/page_alloc: remove unnecessary __maybe_unused in order_to_pindex()
  x86/mm: restore early initialization of high_memory for 32-bits
  mm/vmscan: don't try to reclaim hwpoison folio
  mm/hwpoison: introduce folio_contain_hwpoisoned_page() helper
  cgroup: docs: add pswpin and pswpout items in cgroup v2 doc
  mm: vmscan: split proactive reclaim statistics from direct reclaim statistics
  selftests/mm: speed up split_huge_page_test
  selftests/mm: uffd-unit-tests support for hugepages > 2M
  docs/mm/damon/design: document active DAMOS filter type
  mm/damon: implement a new DAMOS filter type for active pages
  fs/dax: don't disassociate zero page entries
  MM documentation: add "Unaccepted" meminfo entry
  selftests/mm: add commentary about 9pfs bugs
  fork: use __vmalloc_node() for stack allocation
  docs/mm: Physical Memory: Populate the "Zones" section
  xen: balloon: update the NR_BALLOON_PAGES state
  hv_balloon: update the NR_BALLOON_PAGES state
  balloon_compaction: update the NR_BALLOON_PAGES state
  meminfo: add a per node counter for balloon drivers
  mm: remove references to folio in __memcg_kmem_uncharge_page()
  ...
2025-04-01 09:29:18 -07:00
Kent Overstreet
f1350c2c74 bcachefs: fix ref leak in btree_node_read_all_replicas
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-01 02:47:41 -04:00
Linus Torvalds
98fb679d19 bcachefs updates for 6.15, part 2
All bugfixes and logging improvements.
 
 Minor merge conflict, see:
 https://lore.kernel.org/linux-next/20250331092816.778a7c83@canb.auug.org.au/T/#u
 
 CI says the fs-next tree is good:
 https://evilpiepirate.org/~testdashboard/ci?user=fs-next&branch=master
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmfqyxsACgkQE6szbY3K
 bnY4aQ/7BylMgHZsAG2OLRRtegCsuFZ5fZt148TObofSGTTDPcVKYQWcz249Hlao
 RZzv9nbqq2M7fJrUK5Xloc4DA0ICuWIh9n+uRf+5od7JgtygjJpXqMRz9HrGBtTo
 QZJE/wAzsa1A8xBORjWVki4koHT3YivaMW2zdgbHIHWTjDJso5Es7RW0/WZYv3lW
 cFLFEfOnBCEXckhEtK7TAPnJpHEPw+/d0bMFU/PIHbokUwTjxCR0bmRL/RecKUXa
 5U1o1x7gAo1iPi3XPGLJVVxXWgjmxzQlF/3aXva+DYaeLgxPMqKxUlC6hkV4f6Oc
 9lH/w1pEiCMcANbbp2E3Q91sDFRlafFCgvsKhEz79W5WoNq+vSrxLhLaynyuBT/K
 lfoiig6IFRTWJDYHu2L6YHFMmp8JOxgJSJ0+dcgyVRnaDJQeGgbuv1tEldonQLsg
 9DT8iRJpVDomffwPUoVhujlvJOqUi8zFkxyMCgVWExFzC3ief2B5s3D4uLXcpApO
 nZfb01W0ElW7qBMQxjyD0Vy+wY8EryzTht9ZKJq5Id1T/LWc9Qi+jPaY86OBC9/w
 GJgW9OcYLFjYdsDokk5XkwOd/IAXz6fU+vHGtahFJPVfH4T8zzdBnxfPbiR2mXo8
 4EfeNmRevZP/oK7/2l2cqIzY7tYBJBUK1gFyvz1+7bcuFwVI8rc=
 =Udka
 -----END PGP SIGNATURE-----

Merge tag 'bcachefs-2025-03-31' of git://evilpiepirate.org/bcachefs

Pull more bcachefs updates from Kent Overstreet:
 "All bugfixes and logging improvements"

* tag 'bcachefs-2025-03-31' of git://evilpiepirate.org/bcachefs: (35 commits)
  bcachefs: fix bch2_write_point_to_text() units
  bcachefs: Log original key being moved in data updates
  bcachefs: BCH_JSET_ENTRY_log_bkey
  bcachefs: Reorder error messages that include journal debug
  bcachefs: Don't use designated initializers for disk_accounting_pos
  bcachefs: Silence errors after emergency shutdown
  bcachefs: fix units in rebalance_status
  bcachefs: bch2_ioctl_subvolume_destroy() fixes
  bcachefs: Clear fs_path_parent on subvolume unlink
  bcachefs: Change btree_insert_node() assertion to error
  bcachefs: Better printing of inconsistency errors
  bcachefs: bch2_count_fsck_err()
  bcachefs: Better helpers for inconsistency errors
  bcachefs: Consistent indentation of multiline fsck errors
  bcachefs: Add an "ignore unknown" option to bch2_parse_mount_opts()
  bcachefs: bch2_time_stats_init_no_pcpu()
  bcachefs: Fix bch2_fs_get_tree() error path
  bcachefs: fix logging in journal_entry_err_msg()
  bcachefs: add missing newline in bch2_trans_updates_to_text()
  bcachefs: print_string_as_lines: fix extra newline
  ...
2025-03-31 18:33:51 -07:00
Kent Overstreet
de39965858 bcachefs: Fix null ptr deref in bch2_write_endio()
This was previously hard to hit since it requires racing with device
removal, but splitting up io_ref uncovered it.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-31 17:39:27 -04:00
Kent Overstreet
7f10fde38f bcachefs: Fix field spanning write warning
Struct with embedded VLA...

memcpy: detected field-spanning write (size 8) of single field "&gc->r.e" at fs/bcachefs/ec.c:465 (size 3)
WARNING: CPU: 1 PID: 936 at fs/bcachefs/ec.c:465 bch2_trigger_stripe+0x706/0x730
Modules linked in:
CPU: 1 UID: 0 PID: 936 Comm: mount.bcachefs Not tainted 6.14.0-rc6-ktest-00236-gefb0b5c62dbc #55
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
RIP: 0010:bch2_trigger_stripe+0x706/0x730
Code: b4 00 01 b9 03 00 00 00 48 89 fb 48 c7 c7 33 54 da 81 48 89 d6 49 89 d6 48 c7 c2 c3 36 db 81 e8 60 54 c5 ff 48 89 df 4c 89 f2 <0f> 0b e9 5c fd ff ff e8 fe 5e 4e 00 bf 10 00 00 00 48 c7 c6 ff ff
RSP: 0018:ffff88817081f680 EFLAGS: 00010246
RAX: f8fe7dd1c56b5600 RBX: ffff888101265368 RCX: 0000000000000027
RDX: 0000000000000008 RSI: 00000000fffbffff RDI: ffff888101265368
RBP: 0000000000000000 R08: 000000000003ffff R09: ffff88817f1fe000
R10: 00000000000bfffd R11: 0000000000000004 R12: ffff8881012652c0
R13: 0000000000000000 R14: 0000000000000008 R15: ffff88817081f6c9
FS:  00007fc428bc7c80(0000) GS:ffff888179280000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffd3ee4a038 CR3: 000000010a9bc000 CR4: 0000000000750eb0
PKRU: 55555554
Call Trace:
 <TASK>
 ? __warn+0xce/0x1b0
 ? bch2_trigger_stripe+0x706/0x730
 ? report_bug+0x11b/0x1a0
 ? bch2_trigger_stripe+0x706/0x730
 ? handle_bug+0x5e/0x90
 ? exc_invalid_op+0x1a/0x50
 ? asm_exc_invalid_op+0x1a/0x20
 ? bch2_trigger_stripe+0x706/0x730
 bch2_gc_mark_key+0x2cf/0x430
 bch2_check_allocations+0x1a64/0x1ed0
 ? vsnprintf+0x1ad/0x420
 ? bch2_check_allocations+0x191f/0x1ed0
 bch2_run_recovery_passes+0x13b/0x2b0
 bch2_fs_recovery+0x9b7/0x1290
 ? __bch2_print+0xb2/0xf0
 ? bch2_printbuf_exit+0x1e/0x30
 ? print_mount_opts+0x153/0x180
 bch2_fs_start+0x274/0x3b0
 bch2_fs_get_tree+0x516/0x6e0
 vfs_get_tree+0x21/0xa0
 do_new_mount+0x153/0x350
 __x64_sys_mount+0x16c/0x1f0
 do_syscall_64+0x6c/0x140
 ? arch_exit_to_user_mode_prepare+0x9/0x40
 entry_SYSCALL_64_after_hwframe+0x4b/0x53

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-31 17:39:10 -04:00
Kent Overstreet
f540876f4e bcachefs: Fix striping behaviour
For striping across devices, we maintain "clocks", and we advance them
by the inverse of "how much free space this device has left", so that we
round robin biased in favor of devices with more free space.

This code was originally trying to do EWMA-ish stuff when originally
written, ~10 years ago, and was never properly cleaned up when it was
realized that an EWMA is not the right approach here.

That left a bug, when we rescale to keep all the clocks in the correct
range and prevent overflow.

It was assumed that we'd always be allocated from the device with the
smallest clock hand, but that's actually not correct: with the target
options, allocations will be first tried from a subset of devices, and
then the entire filesystem if that fails.

Thus, the rescale from the first allocation - allocating from a subset
of devices - can pick the wrong rescale value and cause the rest of the
clocks to go to 0, losing information.

This resuls in incorrect striping behaviour when the desired number of
replicas doesn't fit on the foreground target.

Link: https://www.reddit.com/r/bcachefs/comments/1jn3t26/replica_allocation_not_evenly_distributed_among/
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-31 17:39:10 -04:00
Kent Overstreet
650f5353dc bcachefs: fix bch2_write_point_to_text() units
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-30 20:04:16 -04:00
Kent Overstreet
7fdc3fa3cb bcachefs: Log original key being moved in data updates
There's something going on with the data move path; log the original key
being moved for debugging.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-30 18:25:12 -04:00
Kent Overstreet
edaed8ee8c bcachefs: BCH_JSET_ENTRY_log_bkey
Add a journal entry type for logging - but logging a bkey, not a string;
to be used for data move path debugging.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-30 18:25:12 -04:00
Kent Overstreet
2b47102b93 bcachefs: Reorder error messages that include journal debug
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-30 16:36:27 -04:00
Kent Overstreet
393a05a741 bcachefs: Don't use designated initializers for disk_accounting_pos
Not all compilers fully initialize these - they're not guaranteed to
because of the union shenanigans.

Fixes: https://github.com/koverstreet/bcachefs/issues/844
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-30 16:35:13 -04:00
Kent Overstreet
f548db4d31 bcachefs: Silence errors after emergency shutdown
We don't care about errors from asynchronous ops that were because we
did an emergency shutdown; silence them.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-30 16:35:13 -04:00
Kent Overstreet
458e2ef882 bcachefs: fix units in rebalance_status
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-30 16:35:13 -04:00
Kent Overstreet
707549600c bcachefs: bch2_ioctl_subvolume_destroy() fixes
bch2_evict_subvolume_inodes() was getting stuck - due to incorrectly
pruning the dcache.

Also, fix missing permissions checks.

Reported-by: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-30 16:33:22 -04:00
Kent Overstreet
b3981564ca bcachefs: Clear fs_path_parent on subvolume unlink
This fixes recursive subvolume removal.

Subvolume deletion is asynchronous; fs_path_parent, and thus the entry
in the subvolume_children btree, need to be cleared when the subvolume
is unlinked from the fs heirarchy - else we'll spuriously think a
subvolume has children and deletion will fail.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-29 20:16:49 -04:00
Kent Overstreet
63c3b8f616 bcachefs: Change btree_insert_node() assertion to error
Debug for https://github.com/koverstreet/bcachefs/issues/843

Print useful debug info and go emergency read-only.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-29 14:24:48 -04:00
Kent Overstreet
6d77ce4a27 bcachefs: Better printing of inconsistency errors
Build up and emit the error message for an inconsistency error all at
once, instead of spread over multiple printk calls, so they're not
jumbled in the dmesg log.

Also, add better indenting.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-29 13:26:13 -04:00
Kent Overstreet
7337f9f14e bcachefs: bch2_count_fsck_err()
Factor out a helper from __bch2_fsck_err(), for counting the error in
the superblock and deciding whether to print or ratelimit - will be used
to replace some log_fsck_err() calls, where we want to lift out printing
the error message.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-29 13:26:13 -04:00
Kent Overstreet
b00750c2e5 bcachefs: Better helpers for inconsistency errors
An inconsistency error often happens as part of an event with multiple
error messages, and we want to build up one single error message with
proper indenting to produce more readable log messages that don't get
garbled.

Add new helpers that emit messages to a printbuf instead of printing
them directly, next patch will convert to use them.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-28 22:31:47 -04:00
Kent Overstreet
1ece53237e bcachefs: Consistent indentation of multiline fsck errors
Add the new helper printbuf_indent_add_nextline(), and use it in
__bch2_fsck_err() to centralize setting the indentation of multiline
fsck errors.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-28 22:31:47 -04:00
Kent Overstreet
a7cdf2276e bcachefs: Add an "ignore unknown" option to bch2_parse_mount_opts()
To be used by the mount helper in userspace, where we still have options
to be parsed by other layers.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-28 22:31:47 -04:00
Kent Overstreet
daa771332e bcachefs: bch2_time_stats_init_no_pcpu()
Add a mode to disable automatic switching to percpu mode, useful when a
time_stats will only be used by one thread and we don't want to have to
flush the percpu buffers.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-28 22:31:47 -04:00
Florian Albrechtskirchinger
7c4cb50e1a bcachefs: Fix bch2_fs_get_tree() error path
When a filesystem is mounted read-only, subsequent attempts to mount it
as read-write fail with EBUSY. Previously, the error path in
bch2_fs_get_tree() would unconditionally call __bch2_fs_stop(),
improperly freeing resources for a filesystem that was still actively
mounted. This change modifies the error path to only call
__bch2_fs_stop() if the superblock has no valid root dentry, ensuring
resources are not cleaned up prematurely when the filesystem is in use.

Signed-off-by: Florian Albrechtskirchinger <falbrechtskirchinger@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-28 13:51:09 -04:00
Kent Overstreet
6b1e0b9e18 bcachefs: fix logging in journal_entry_err_msg()
We want to log errors all at once, not spread across multiple printks.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-28 12:36:32 -04:00
Kent Overstreet
ff4e0f7de6 bcachefs: add missing newline in bch2_trans_updates_to_text()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-28 12:36:32 -04:00
Kent Overstreet
35a11506a3 bcachefs: print_string_as_lines: fix extra newline
Don't print a newline on empty string; this was causing us to also print
an extra newline when we got to the end of th string.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-28 12:36:32 -04:00
Kent Overstreet
3c72d3eea9 bcachefs: Fix WARN() in bch2_bkey_pick_read_device()
syzbot discovered that this one is possible: we have pointers, but none
of them are to valid devices.

Reported-by: syzbot+336a6e6a2dbb7d4dba9a@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-28 12:35:05 -04:00
Kent Overstreet
af3d4c276a bcachefs: Don't return 0 size holes from bch2_seek_hole()
The hole we find in the btree might be fully dirty in the page cache. If
so, keep searching.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-28 11:30:14 -04:00
Kent Overstreet
1f4bb8254c bcachefs: Fix bch2_seek_hole() locking
We can't call bch2_seek_pagecache_hole(), and block on page locks, with
btree locks held.

This is easily fixed because we're at the end of the transaction - we
can just unlock, we don't need a drop_locks_do().

Reported-by: https://github.com/nagalun
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-28 11:30:14 -04:00
Kent Overstreet
2dd202dbaf bcachefs: Recovery no longer holds state_lock
state_lock guards against devices coming or leaving, changing state, or
the filesystem changing between ro <-> rw.

But it's not necessary for running recovery passes, and holding it
blocks asynchronous events that would cause us to go RO or kick out
devices.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-28 11:13:25 -04:00
Kent Overstreet
c6c6a39109 bcachefs: Fix permissions on version modparam
There's no reason for this not to be world readable - it provides the
currently supported on disk format version.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-28 11:13:23 -04:00
Linus Torvalds
4a4b30ea80 bcachefs updates for 6.15
On disk format is now soft frozen: no more required/automatic are
 anticipated before taking off the experimental label.
 
 Major changes/features since 6.14:
 
 - Scrub
 
 - Blocksize greater than page size support
 
 - A number of "rebalance spinning and doing no work" issues have been
   fixed; we now check if the write allocation will succeed in
   bch2_data_update_init(), before kicking off the read.
 
   There's still more work to do in this area. Later we may want to add
   another bitset btree, like rebalance_work, to track "extents that
   rebalance was requested to move but couldn't", e.g. due to destination
   target having insufficient online devices.
 
 - We can now support scaling well into the petabyte range: latest
   bcachefs-tools will pick an appropriate bucket size at format time to
   ensure fsck can run in available memory (e.g. a server with 256GB of
   ram and 100PB of storage would want 16MB buckets).
 
 On disk format changes:
 
 - 1.21: cached backpointers (scalability improvement)
 
   Cached replicas now get backpointers, which means we no longer rely on
   incrementing bucket generation numbers to invalidate cached data: this
   lets us get rid of the bucket generation number garbage collection,
   which had to periodically rescan all extents to recompute bucket
   oldest_gen.
 
   Bucket generation numbers are now only used as a consistency check,
   but they're quite useful for that.
 
 - 1.22: stripe backpointers
 
   Stripes now have backpointers: erasure coded stripes have their own
   checksums, separate from the checksums for the extents they contain
   (and stripe checksums also cover the parity blocks). This is required
   for implementing scrub for stripes.
 
 - 1.23: stripe lru (scalability improvement)
 
   Persistent lru for stripes, ordered by "number of empty blocks". This
   is used by the stripe creation path, which depending on free space
   may create a new stripe out of a partially empty existing stripe
   instead of starting a brand new stripe.
 
   This replaces an in-memory heap, and means we no longer have to read
   in the stripes btree at startup.
 
 - 1.24: casefolding
 
   Case insensitive directory support, courtesy of Valve.
 
   This is an incompatible feature, to enable mount with
     -o version_upgrade=incompatible
 
 - 1.25: extent_flags
 
   Another incompatible feature requiring explicit opt-in to enable.
 
   This adds a flags entry to extents, and a flag bit that marks extents
   as poisoned.
 
   A poisoned extent is an extent that was unreadable due to checksum
   errors. We can't move such extents without giving them a new checksum,
   and we may have to move them (for e.g. copygc or device evacuate).
   We also don't want to delete them: in the future we'll have an API
   that lets userspace ignore checksum errors and attempt to deal with
   simple bitrot itself. Marking them as poisoned lets us continue to
   return the correct error to userspace on normal read calls.
 
 Other changes/features:
 
 - BCH_IOCTL_QUERY_COUNTERS: this is used by the new 'bcachefs fs top'
   command, which shows a live view of all internal filesystem counters.
 
 - Improved journal pipelining: we can now have 16 journal writes in
   flight concurrently, up from 4. We're logging significantly more to
   the journal than we used to with all the recent disk accounting
   changes and additions, so some users should see a performance
   increase on some workloads.
 
 - BCH_MEMBER_STATE_failed: previously, we would do no IO at all to
   devices marked as failed. Now we will attempt to read from them, but
   only if we have no better options.
 
 - New option, write_error_timeout: devices will be kicked out of the
   filesystem if all writes have been failing for x number of seconds.
 
   We now also kick devices out when notified by blk_holder_ops that
   they've gone offline.
 
 - Device option handling improvements: the discard option should now be
   working as expected (additionally, in -tools, all device options that
   can be set at format time can now be set at device add time, i.e.
   data_allowed, state).
 
 - We now try harder to read data after a checksum error: we'll do
   additional retries if necessary to a device after after it gave us
   data with a checksum error.
 
 - More self healing work: the full inode <-> dirent consistency checks
   that are currently run by fsck are now also run every time we do a
   lookup, meaning we'll be able to correct errors at runtime. Runtime
   self healing will be flipped on after the new changes have seen more
   testing, currently they're just checking for consistency.
 
 - KMSAN fixes: our KMSAN builds should be nearly clean now, which will
   put a massive dent in the syzbot dashboard.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmfhbnsACgkQE6szbY3K
 bnY6ew/9FXh3m71BvVpuqTYcUGzIC7gVrnkFy6n4W96v07OjSOoTNHOVVovajxc3
 P9LvA77BHC4Xro3H7ORpsIurOZUc6yx18ZizzulVbQFuYa7LY/kNri4ZBtGHcRiV
 pIdQDLSNmwFjPA4x2S1qTFSF1c586lad+UNQiLam5ophBwQPEO6vG51ZEHa4wld9
 +OhWTDYfrvij4D3Lt1ppvhuDP+PQBjhu/QFc0bGjHvKOjfV6sw9XU91sCYKOJIzd
 qzpsiQd5sepnX717Br3f5SLdxMq2lJYvRp9756vltOCaMBvJYJtHqtXCglHQEkFw
 yjhmPjk4r3VlKTF8K+wEJfAHwbC2kEn7csJNbt0+Nko5PPtFyrb8ok6QUbHCKscL
 L0VMnzaXHVqvG2VgYa31temfdz7HM/zHjQ8Al3eQPaqTHIoTXIBQxOQSea/apVMt
 TIlastvLoHfR8W7+LrwOmTjnBJGCJ+MrdcJzJDVk2tQmmcMA0boeZvl4aSklFuyB
 zNN5fxp0VMsxNyIHLJjQ3UcwVqHXC5w+f5H1ByQLUyQh+m/xaAaz7S+BTVdVbFPa
 1Z1xDuvuHOTnjIOamnOD1l36afJnhq5RciPCXCNtQSB819mc+AfNGQNQTVNOTReC
 iTiUCcNxu0/DIPlPmeJzAlukVJUgz+/knOI/6zPs3eI7/o88ZGg=
 =k3cV
 -----END PGP SIGNATURE-----

Merge tag 'bcachefs-2025-03-24' of git://evilpiepirate.org/bcachefs

Pull bcachefs updates from Kent Overstreet:
 "On disk format is now soft frozen: no more required/automatic are
  anticipated before taking off the experimental label.

  Major changes/features since 6.14:

   - Scrub

   - Blocksize greater than page size support

   - A number of "rebalance spinning and doing no work" issues have been
     fixed; we now check if the write allocation will succeed in
     bch2_data_update_init(), before kicking off the read.

     There's still more work to do in this area. Later we may want to
     add another bitset btree, like rebalance_work, to track "extents
     that rebalance was requested to move but couldn't", e.g. due to
     destination target having insufficient online devices.

   - We can now support scaling well into the petabyte range: latest
     bcachefs-tools will pick an appropriate bucket size at format time
     to ensure fsck can run in available memory (e.g. a server with
     256GB of ram and 100PB of storage would want 16MB buckets).

  On disk format changes:

   - 1.21: cached backpointers (scalability improvement)

     Cached replicas now get backpointers, which means we no longer rely
     on incrementing bucket generation numbers to invalidate cached
     data: this lets us get rid of the bucket generation number garbage
     collection, which had to periodically rescan all extents to
     recompute bucket oldest_gen.

     Bucket generation numbers are now only used as a consistency check,
     but they're quite useful for that.

   - 1.22: stripe backpointers

     Stripes now have backpointers: erasure coded stripes have their own
     checksums, separate from the checksums for the extents they contain
     (and stripe checksums also cover the parity blocks). This is
     required for implementing scrub for stripes.

   - 1.23: stripe lru (scalability improvement)

     Persistent lru for stripes, ordered by "number of empty blocks".
     This is used by the stripe creation path, which depending on free
     space may create a new stripe out of a partially empty existing
     stripe instead of starting a brand new stripe.

     This replaces an in-memory heap, and means we no longer have to
     read in the stripes btree at startup.

   - 1.24: casefolding

     Case insensitive directory support, courtesy of Valve.

     This is an incompatible feature, to enable mount with
       -o version_upgrade=incompatible

   - 1.25: extent_flags

     Another incompatible feature requiring explicit opt-in to enable.

     This adds a flags entry to extents, and a flag bit that marks
     extents as poisoned.

     A poisoned extent is an extent that was unreadable due to checksum
     errors. We can't move such extents without giving them a new
     checksum, and we may have to move them (for e.g. copygc or device
     evacuate). We also don't want to delete them: in the future we'll
     have an API that lets userspace ignore checksum errors and attempt
     to deal with simple bitrot itself. Marking them as poisoned lets us
     continue to return the correct error to userspace on normal read
     calls.

  Other changes/features:

   - BCH_IOCTL_QUERY_COUNTERS: this is used by the new 'bcachefs fs top'
     command, which shows a live view of all internal filesystem
     counters.

   - Improved journal pipelining: we can now have 16 journal writes in
     flight concurrently, up from 4. We're logging significantly more to
     the journal than we used to with all the recent disk accounting
     changes and additions, so some users should see a performance
     increase on some workloads.

   - BCH_MEMBER_STATE_failed: previously, we would do no IO at all to
     devices marked as failed. Now we will attempt to read from them,
     but only if we have no better options.

   - New option, write_error_timeout: devices will be kicked out of the
     filesystem if all writes have been failing for x number of seconds.

     We now also kick devices out when notified by blk_holder_ops that
     they've gone offline.

   - Device option handling improvements: the discard option should now
     be working as expected (additionally, in -tools, all device options
     that can be set at format time can now be set at device add time,
     i.e. data_allowed, state).

   - We now try harder to read data after a checksum error: we'll do
     additional retries if necessary to a device after after it gave us
     data with a checksum error.

   - More self healing work: the full inode <-> dirent consistency
     checks that are currently run by fsck are now also run every time
     we do a lookup, meaning we'll be able to correct errors at runtime.
     Runtime self healing will be flipped on after the new changes have
     seen more testing, currently they're just checking for consistency.

   - KMSAN fixes: our KMSAN builds should be nearly clean now, which
     will put a massive dent in the syzbot dashboard"

* tag 'bcachefs-2025-03-24' of git://evilpiepirate.org/bcachefs: (180 commits)
  bcachefs: Kill unnecessary bch2_dev_usage_read()
  bcachefs: btree node write errors now print btree node
  bcachefs: Fix race in print_chain()
  bcachefs: btree_trans_restart_foreign_task()
  bcachefs: bch2_disk_accounting_mod2()
  bcachefs: zero init journal bios
  bcachefs: Eliminate padding in move_bucket_key
  bcachefs: Fix a KMSAN splat in btree_update_nodes_written()
  bcachefs: kmsan asserts
  bcachefs: Fix kmsan warnings in bch2_extent_crc_pack()
  bcachefs: Disable asm memcpys when kmsan enabled
  bcachefs: Handle backpointers with unknown data types
  bcachefs: Count BCH_DATA_parity backpointers correctly
  bcachefs: Run bch2_check_dirent_target() at lookup time
  bcachefs: Refactor bch2_check_dirent_target()
  bcachefs: Move bch2_check_dirent_target() to namei.c
  bcachefs: fs-common.c -> namei.c
  bcachefs: EIO cleanup
  bcachefs: bch2_write_prep_encoded_data() now returns errcode
  bcachefs: Simplify bch2_write_op_error()
  ...
2025-03-27 13:20:07 -07:00
Kent Overstreet
d0fb2a266a bcachefs: cond_resched() in journal_key_sort_cmp()
Fixes "task out to lunch" warnings during recovery on large machines
with lots of dirty data in the journal.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-26 16:26:35 -04:00
Kent Overstreet
ef488bb5d0 bcachefs: Fix 'hung task' messages in btree node scan
btree node scan has to wait on kthread workers that scan each device,
potentially for awhile.

We would like this to be interruptible, but we may need a different
mechanism than signals for that - we've had bugs in the past where
mounts were failing due to checking for signals, and no explanation on
where they came from.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-26 16:26:35 -04:00
Kent Overstreet
9314e2fb26 bcachefs: Fix btree iter flags in data move (2)
Data move -> move_get_io_opts -> bch2_get_update_rebalance_opts

requires a not_extents iterator; this fixes the path where we're walking
the extents btree and chase a reflink pointer into the reflink btree.

bch2_lookup_indirect_extent() requires working with an extents iterator
(due to peek_slot() semantics), so we implement
bch2_lookup_indirect_extent_for_move().

This is simplified because there's no need to report
indirect_extent_missing_errors here, that can be deferred until fsck or
when a user reads that data.

Reported-by: Maël Kerbiriou <mael.kerbiriou@free.fr>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-26 16:26:35 -04:00
Kent Overstreet
19ff84b20d bcachefs: Don't unnecessarily decrypt data when moving
There's various checks for "are we going to compress this" - but we're
not going to compress if we know it's incompressible.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-26 16:26:35 -04:00
Kent Overstreet
a44e4f8f00 bcachefs: Document disk accounting keys and conuters
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-26 16:26:35 -04:00
Kent Overstreet
9c893face2 bcachefs: Validate number of counters for accounting keys
We weren't checking that accounting keys have the expected number of
accounters. Originally we probably wanted to be flexible on this, but it
doesn't look like that will be required - accounting is extended by
adding new counter types, not more counters to an existing type.

This means we can drop a BUG_ON() that popped once in automated testing,
and the new validation will make that bug easier to track down.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-26 16:26:35 -04:00
Kent Overstreet
e1e50a6330 bcachefs: Use print_string_as_lines() for journal stuck messages
They were being truncated, printk has a 1k limit per call

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-25 11:49:46 -04:00
Kent Overstreet
a76db26a96 bcachefs: Fix duplicate checksum error messages in write path
Also, improve the message in prep_encoded_data() - it now prints
good/bad checksums, and checksum type.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-25 11:49:43 -04:00
Kent Overstreet
3ba0240a87 bcachefs: Fix silent short reads in data read retry path
__bch2_read, before calling __bch2_read_extent(), sets bvec_iter.bi_size
to "the size we can read from the current extent" with a swap, and
restores it to "the size for the total read" after the read_extent call
with another swap.

But we neglected to do the restore before the "if (ret) goto err;" -
which is a problem if we're retrying those errors.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-25 11:49:16 -04:00
Kent Overstreet
5af61dbd96 bcachefs: Fix nonce inconsistency in bch2_write_prep_encoded_data()
If we're moving an extent that was partially overwritten,
bch2_write_rechecksum() will trim it to the currenty live range.

If we then also want to compress it, it'll be decrypted - but the nonce
has been advanced for the overwritten start of the extent that we
dropped, and we were using the nonce we calculated before rechecksum().

Reported-by: Gabriel de Perthuis <g2p.code@gmail.com>
Fixes: 127d90d282 ("bcachefs: bch2_write_prep_encoded_data() now returns errcode")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-25 11:40:35 -04:00
Linus Torvalds
e41170cc5e vfs-6.15-rc1.pagesize
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZ90rxAAKCRCRxhvAZXjc
 ooIPAQCwMjDjtWegvBy8kefiRw+fa4z3ZWHrwRT9DJrD/K9WyAD+JVd0ou27SVpQ
 jKpRSRct2eTbyxdYiGydHQGm5F5sLg4=
 =0FyQ
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.15-rc1.pagesize' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs pagesize updates from Christian Brauner:
 "This enables block sizes greater than the page size for block devices.

  With this we can start supporting block devices with logical block
  sizes larger than 4k.

  It also allows to lift the device cache sector size support to 64k.
  This allows filesystems which can use larger sector sizes up to 64k to
  ensure that the filesystem will not generate writes that are smaller
  than the specified sector size"

* tag 'vfs-6.15-rc1.pagesize' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  bdev: add back PAGE_SIZE block size validation for sb_set_blocksize()
  bdev: use bdev_io_min() for statx block size
  block/bdev: lift block size restrictions to 64k
  block/bdev: enable large folio support for large logical block sizes
  fs/buffer fs/mpage: remove large folio restriction
  fs/mpage: use blocks_per_folio instead of blocks_per_page
  fs/mpage: avoid negative shift for large blocksize
  fs/buffer: remove batching from async read
  fs/buffer: simplify block_read_full_folio() with bh_offset()
2025-03-24 12:01:29 -07:00
Linus Torvalds
26d8e43079 vfs-6.15-rc1.async.dir
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZ90rNwAKCRCRxhvAZXjc
 onBJAP9Z8Ywmlb5KQ1E3HvDmkwyY6yOSyZ9/CmbzrkCJ8ywYkQD/d9/xt0EP/O/q
 N8YtzXArHWt7u0YbcVpy9WK3F72BdwU=
 =VJgY
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.15-rc1.async.dir' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs async dir updates from Christian Brauner:
 "This contains cleanups that fell out of the work from async directory
  handling:

   - Change kern_path_locked() and user_path_locked_at() to never return
     a negative dentry. This simplifies the usability of these helpers
     in various places

   - Drop d_exact_alias() from the remaining place in NFS where it is
     still used. This also allows us to drop the d_exact_alias() helper
     completely

   - Drop an unnecessary call to fh_update() from nfsd_create_locked()

   - Change i_op->mkdir() to return a struct dentry

     Change vfs_mkdir() to return a dentry provided by the filesystems
     which is hashed and positive. This allows us to reduce the number
     of cases where the resulting dentry is not positive to very few
     cases. The code in these places becomes simpler and easier to
     understand.

   - Repack DENTRY_* and LOOKUP_* flags"

* tag 'vfs-6.15-rc1.async.dir' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  doc: fix inline emphasis warning
  VFS: Change vfs_mkdir() to return the dentry.
  nfs: change mkdir inode_operation to return alternate dentry if needed.
  fuse: return correct dentry for ->mkdir
  ceph: return the correct dentry on mkdir
  hostfs: store inode in dentry after mkdir if possible.
  Change inode_operations.mkdir to return struct dentry *
  nfsd: drop fh_update() from S_IFDIR branch of nfsd_create_locked()
  nfs/vfs: discard d_exact_alias()
  VFS: add common error checks to lookup_one_qstr_excl()
  VFS: change kern_path_locked() and user_path_locked_at() to never return negative dentry
  VFS: repack LOOKUP_ bit flags.
  VFS: repack DENTRY_ flags.
2025-03-24 10:47:14 -07:00
Kent Overstreet
d8bdc8daac bcachefs: Kill unnecessary bch2_dev_usage_read()
bch2_dev_usage_read() is fairly expensive, we should optimize this more.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:37 -04:00
Kent Overstreet
2adfa46734 bcachefs: btree node write errors now print btree node
It turned out a user was wondering why we were going read-only after a
write error, and he didn't realize he didn't have replication enabled -
this will make that more obvious, and we should be printing it anyways.

Link: https://www.reddit.com/r/bcachefs/comments/1jf9akl/large_data_transfers_switched_bcachefs_to_readonly/
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:37 -04:00
Kent Overstreet
739200c573 bcachefs: Fix race in print_chain()
00636 Unable to handle kernel NULL pointer dereference at virtual address 00000000000000b0
00636 Mem abort info:
00636   ESR = 0x0000000096000005
00636   EC = 0x25: DABT (current EL), IL = 32 bits
00636   SET = 0, FnV = 0
00636   EA = 0, S1PTW = 0
00636   FSC = 0x05: level 1 translation fault
00636 Data abort info:
00636   ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000
00636   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
00636   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
00636 user pgtable: 4k pages, 39-bit VAs, pgdp=0000000101b10000
00636 [00000000000000b0] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000
00636 Internal error: Oops: 0000000096000005 [#1] SMP
00636 Modules linked in:
00636 CPU: 12 UID: 0 PID: 79369 Comm: cat Not tainted 6.14.0-rc6-ktest-g3783b8973ab7 #17757
00636 Hardware name: linux,dummy-virt (DT)
00636 pstate: 20001005 (nzCv daif -PAN -UAO -TCO -DIT +SSBS BTYPE=--)
00636 pc : print_chain+0xb8/0x170
00636 lr : print_chain+0xa0/0x170
00636 sp : ffffff80d9c1bbb0
00636 x29: ffffff80d9c1bbb0 x28: 0000000000000002 x27: ffffff80c1be8250
00636 x26: ffffff80dd9b0000 x25: 0000000000000020 x24: 000000000000002d
00636 x23: 000000000000003c x22: ffffffc080a54518 x21: ffffff80da6e00d0
00636 x20: ffffff80da6e0170 x19: ffffff80c1a1d240 x18: 00000000ffffffff
00636 x17: 3535303937202d3c x16: 203139202d3c2035 x15: 00000000ffffffff
00636 x14: 0000000000000000 x13: ffffff80d71b63f1 x12: 0000000000000006
00636 x11: ffffffc080beb1c0 x10: 0000000000000020 x9 : 00000000000134cc
00636 x8 : 0000000000000020 x7 : 0000000000000004 x6 : 0000000000000020
00636 x5 : ffffff80d71b63f7 x4 : ffffffc080a5451b x3 : 0000000000000000
00636 x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000000000000
00636 Call trace:
00636  print_chain+0xb8/0x170 (P)
00636  bch2_check_for_deadlock+0x444/0x5a0
00636  bch2_btree_deadlock_read+0xb4/0x1c8
00636  full_proxy_read+0x74/0xd8
00636  vfs_read+0x90/0x300

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:37 -04:00
Kent Overstreet
0b4fd56726 bcachefs: btree_trans_restart_foreign_task()
In debug mode, we save the call stack on transaction restart - but
there's no locking, so we can't touch it if we're issuing the restart
from another thread.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:37 -04:00
Kent Overstreet
f4a584f4bf bcachefs: bch2_disk_accounting_mod2()
We're hitting some issues with uninitialized struct padding, flagged by
kmsan.

They appear to be falso positives, otherwise bch2_accounting_validate()
would have flagged them as "junk at end". But for now, we'll need to
initialize disk_accounting_pos with memset().

This adds a new helper, bch2_disk_accounting_mod2(), that initializes a
disk_accounting_pos and does the accounting mod all at once - so overall
things actually get slightly more ergonomic.

BCH_DISK_ACCOUNTING_replicas keys are left for now; KMSAN isn't warning
about them and they're a bit special.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:37 -04:00
Kent Overstreet
5ae6f33053 bcachefs: zero init journal bios
fix a kmsan splat

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:37 -04:00
Kent Overstreet
9ea24b287b bcachefs: Eliminate padding in move_bucket_key
We appear to be tripping over a compiler/kmsan bug with padding fields -
this is an easy workaround.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:37 -04:00
Kent Overstreet
1f88c35674 bcachefs: Fix a KMSAN splat in btree_update_nodes_written()
We may sometimes read from uninitialized memory; we know, and that's ok.

We check if a btree node has been reused before waiting on any
outstanding IO.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:37 -04:00
Kent Overstreet
28aa859b6b bcachefs: kmsan asserts
Catching these early makes them a lot easier to track down.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:36 -04:00
Kent Overstreet
53cf2a3daa bcachefs: Fix kmsan warnings in bch2_extent_crc_pack()
We store to all fields, so the kmsan warnings were spurious - but
initializing via stores to bitfields appear to have been giving the
compiler/kmsan trouble, and they're not necessary.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:36 -04:00
Kent Overstreet
9c3a2c9b47 bcachefs: Disable asm memcpys when kmsan enabled
kmsan doesn't know about inline assembly, obviously; this will close a
ton of syzbot bugs.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:36 -04:00
Kent Overstreet
962322475b bcachefs: Handle backpointers with unknown data types
New data types might be added later, so we don't want to disallow
unknown data types - that'll be a compatibility hassle later. Instead,
ignore them.

Reported-by: syzbot+3a290f5ff67ca3023834@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:36 -04:00
Kent Overstreet
6a9f681ef6 bcachefs: Count BCH_DATA_parity backpointers correctly
These are counted as stripe data in the corresponding alloc keys.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:36 -04:00
Kent Overstreet
04e90891be bcachefs: Run bch2_check_dirent_target() at lookup time
More on the "full online self healing" project:

We now run most of the dirent <-> inode consistency checks, with repair
code, at runtime - the exact same check and repair code that fsck runs.

This will allow us to repair the "dirent points to inode that does not
point back" inconsistencies that have been popping up at runtime.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:36 -04:00
Kent Overstreet
9b0d00a369 bcachefs: Refactor bch2_check_dirent_target()
Prep work for calling bch2_check_dirent_target() from bch2_lookup().

- Add an inline wrapper, if the target and backpointer match we can skip
  the function call.

- We don't (yet?) want to remove the dirent we did the lookup from (when
  we find a directory or subvol with multiple valid dirents pointing to
  it), we can defer on that until later. For now, add an "are we in
  fsck?" parameter.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:36 -04:00
Kent Overstreet
758ea4ff81 bcachefs: Move bch2_check_dirent_target() to namei.c
We're gradually running more and more fsck.c checks at runtime,
whereever applicable; when we do so they get moved out of fsck.c.

Next patch will call bch2_check_dirent_target() from bch2_lookup().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:36 -04:00
Kent Overstreet
4fcd4de0a6 bcachefs: fs-common.c -> namei.c
name <-> inode, code for managing the relationships between inodes and
dirents.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:36 -04:00
Kent Overstreet
8a9f3d0582 bcachefs: EIO cleanup
Replace these with proper private error codes, so that when we get an
error message we're not sifting through the entire codebase to see where
it came from.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:36 -04:00
Kent Overstreet
127d90d282 bcachefs: bch2_write_prep_encoded_data() now returns errcode
Prep work for killing off EIO and replacing them with proper private
error codes.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:36 -04:00
Kent Overstreet
2fe208303a bcachefs: Simplify bch2_write_op_error()
There's no reason for the caller to do the actual logging, it's all done
the same.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:36 -04:00
Kent Overstreet
af2ff37da7 bcachefs: Fix block/btree node size defaults
We're fixing option parsing in userspace, it now obeys
OPT_SB_FIELD_SECTORS

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:35 -04:00
Alan Huang
5d361ae5af bcachefs: Add missing smp_rmb()
The smp_rmb() guarantees that reads from reservations.counter
occur before accessing cur_entry_u64s. It's paired with the
atomic64_try_cmpxchg in journal_entry_open.

Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:35 -04:00
Kent Overstreet
4a4000b9a6 bcachefs: Kill JOURNAL_ERRORS()
Convert these to standard error codes, which means we can pass them
outside the journal code, they're easier to pass to tracepoints, etc.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:35 -04:00
Kent Overstreet
80be08cdb5 bcachefs: Filesystem discard option now propagates to devices
the discard option is special, because it's both a filesystem and a
device option.

When set at the filesytsem level, it's supposed to propagate to (if set
persistently via sysfs) or override (if non persistently as a mount
option) the devices - that now works correctly.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:35 -04:00
Kent Overstreet
8d7b7ac367 bcachefs: Device state is now a runtime option
Other options can normally be set at runtime via sysfs, no reason for
this one not to be as well - it just doesn't support the degraded flags
argument this way, that requires the ioctl.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:35 -04:00
Kent Overstreet
7b84d934a1 bcachefs: Setting foreground_target at runtime now triggers rebalance
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:35 -04:00
Kent Overstreet
8b294a9b5c bcachefs: Device options now use standard sysfs code
Device options now use the common code for sysfs, and can superblock
fields (in a struct bch_member).

This replaces BCH_DEV_OPT_SETTERS(), which was weird and easy to miss.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:35 -04:00
Kent Overstreet
d2bad59255 bcachefs: Kill BCH_DEV_OPT_SETTERS()
Previously, device options had their superblock option field listed
separately, which was weird and easy to miss when defining options.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:35 -04:00
Alan Huang
dd7ae389ff bcachefs: Remove spurious smp_mb()
The smp_mb() is paired with nothing.

Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:35 -04:00
Alan Huang
5cc0ab39fb bcachefs: Fix incorrect state count
atomic64_read(&j->seq) - j->seq_write_started == JOURNAL_STATE_BUF_NR is
the condition in journal_entry_open where we return JOURNAL_ERR_max_open,
so journal_cur_seq(j) - seq == JOURNAL_STATE_BUF_NR means that the buf
corresponding to seq has started to write.

Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:35 -04:00
Kent Overstreet
16a8d5d00b bcachefs: Fix btree iter flags in data move
Rebalance requires a not_extents iterator.

This wasn't hit before because all_snapshots disableds is_extents on
snapshots btrees - but has no effect on the reflink btree.

Reported-by: Maël Kerbiriou <mael.kerbiriou@free.fr>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:35 -04:00
Kent Overstreet
92c7789a9e bcachefs: Validate bch_sb.offset field
This was missed - but it needs to be correct for the superblock recovery
tool that scans the start and end of the device for backup superblocks:
we don't want to pick up superblocks that belong to a different
partition that starts at a different offset.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:35 -04:00
Kent Overstreet
8bd875ae47 bcachefs: bch2_sb_validate() doesn't need bch_sb_handle
Minor refactoring, so that bch2_sb_validate() can be used in the new
userspace superblock recovery tool.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:35 -04:00
Kent Overstreet
5e67243ea6 bcachefs: Add missing random.h includes
Fix build in userspace, and good hygeine.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:34 -04:00
Kent Overstreet
2eb985c549 bcachefs: Better incompat version/feature error messages
If we can't mount because of an incompatibility, print what's supported
and unsupported - to help solve PEBKAC issues.

Reported-by: Roland Vet <vet.roland@protonmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:34 -04:00
Kent Overstreet
6aa446c05a bcachefs: Fix offset_into_extent in data move path
Fixes the following:

[   17.607394] kernel BUG at fs/bcachefs/reflink.c:261!
[   17.608316] Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[   17.608485] CPU: 0 UID: 0 PID: 564 Comm: bch-rebalance/3 Tainted: G           OE      6.14.0-rc6-arch1-gfcb0bd9609d2 #7 0efd7a8f4a00afeb2c5fb6e7ecb1aec8ddcbb1e1
[   17.608616] Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[   17.608736] Hardware name: Micro-Star International Co., Ltd. MS-7D75/MAG B650 TOMAHAWK WIFI (MS-7D75), BIOS 1.74 08/01/2023
[   17.608855] RIP: 0010:bch2_lookup_indirect_extent+0x252/0x290 [bcachefs]
[   17.609006] Code: 00 00 00 00 e8 7f 51 f5 ff 89 c3 85 c0 74 52 48 8b 7d b0 4c 89 ee e8 4d 4b f4 ff 48 63 d3 48 89 d0 31 d2 e9 2e ff ff ff 0f 0b <0f> 0b 48 8b 7d b0 4c 89 ee 48 89 55 a8 e8 2c 4b f4 ff 4c 8b 55 a8
[   17.609136] RSP: 0018:ffffa3714455f850 EFLAGS: 00010246
[   17.609261] RAX: 0000000000000080 RBX: ffff895891098790 RCX: 0000000000000000
[   17.609387] RDX: 0000000000000080 RSI: ffffa3714455fa90 RDI: ffff895889550000
[   17.609511] RBP: ffffa3714455f8c0 R08: ffff895891098790 R09: 0000000000000001
[   17.609637] R10: ffffa3714455f8d8 R11: ffffa3714455f950 R12: ffffa3714455fa58
[   17.609763] R13: ffff895891098790 R14: ffffa3714455fa58 R15: ffff895889550000
[   17.609888] FS:  0000000000000000(0000) GS:ffff896757c00000(0000) knlGS:0000000000000000
[   17.610015] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   17.610143] CR2: 0000716b8cda2750 CR3: 0000000914e22000 CR4: 0000000000f50ef0
[   17.610272] PKRU: 55555554
[   17.610403] Call Trace:
[   17.610535]  <TASK>
[   17.610662]  ? __die_body.cold+0x19/0x27
[   17.610791]  ? die+0x2e/0x50
[   17.610918]  ? do_trap+0xca/0x110
[   17.611049]  ? do_error_trap+0x6a/0x90
[   17.611178]  ? bch2_lookup_indirect_extent+0x252/0x290 [bcachefs c42b95c23facdfe11d39755520127cd771dddec2]
[   17.611331]  ? exc_invalid_op+0x50/0x70
[   17.611468]  ? bch2_lookup_indirect_extent+0x252/0x290 [bcachefs c42b95c23facdfe11d39755520127cd771dddec2]
[   17.611620]  ? asm_exc_invalid_op+0x1a/0x20
[   17.611757]  ? bch2_lookup_indirect_extent+0x252/0x290 [bcachefs c42b95c23facdfe11d39755520127cd771dddec2]
[   17.611911]  ? bch2_move_data_btree+0x58a/0x6c0 [bcachefs c42b95c23facdfe11d39755520127cd771dddec2]
[   17.612084]  bch2_move_data_btree+0x58a/0x6c0 [bcachefs c42b95c23facdfe11d39755520127cd771dddec2]
[   17.612256]  ? __pfx_rebalance_pred+0x10/0x10 [bcachefs c42b95c23facdfe11d39755520127cd771dddec2]
[   17.612431]  ? bch2_move_extent+0x3d7/0x6e0 [bcachefs c42b95c23facdfe11d39755520127cd771dddec2]
[   17.612607]  ? __bch2_move_data+0xea/0x200 [bcachefs c42b95c23facdfe11d39755520127cd771dddec2]
[   17.612782]  __bch2_move_data+0xea/0x200 [bcachefs c42b95c23facdfe11d39755520127cd771dddec2]
[   17.612959]  ? __pfx_rebalance_pred+0x10/0x10 [bcachefs c42b95c23facdfe11d39755520127cd771dddec2]
[   17.613149]  do_rebalance+0x517/0x8d0 [bcachefs c42b95c23facdfe11d39755520127cd771dddec2]
[   17.613342]  ? local_clock_noinstr+0xd/0xd0
[   17.613518]  ? local_clock+0x15/0x30
[   17.613693]  ? __bch2_trans_get+0x152/0x300 [bcachefs c42b95c23facdfe11d39755520127cd771dddec2]
[   17.613890]  ? __pfx_bch2_rebalance_thread+0x10/0x10 [bcachefs c42b95c23facdfe11d39755520127cd771dddec2]
[   17.614090]  bch2_rebalance_thread+0x66/0xb0 [bcachefs c42b95c23facdfe11d39755520127cd771dddec2]

The offset_into_extent bit was copied from the read path, but it's
unnecessary here, where we always want to read and move the entire
indirect extent, and it causes the assertion pop - because we're using a
non-extents iterator, which always points to the end of the reflink
pointer.

Reported-by: Maël Kerbiriou <mael.kerbiriou@free.fr>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:34 -04:00
Eric Biggers
71fbb0b86e bcachefs: use sha256() instead of crypto_shash API
Just use sha256() instead of the clunky crypto API.  This is much
simpler.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:34 -04:00
Eric Biggers
39abc73b59 bcachefs: Remove unnecessary softdeps on crc32c and crc64
Since bcachefs does not access crc32c and crc64 through the crypto API,
there is no need to use module softdeps to ensure they are loaded.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:34 -04:00
Kent Overstreet
9b39835e93 bcachefs: #if 0 out (enable|disable)_encryption()
These weren't hooked up, but they probably should be - add some comments
for context.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:34 -04:00
Kent Overstreet
9962cb7748 bcachefs: Improve can_write_extent()
This fixes another "rebalance spinning and doing no work" issue;
rebalance was reading extents it wanted to move, but then failing in
bch2_write() -> bch2_alloc_sectors_start() due to being unable to
allocate sufficient replicas.

This was triggered by a user playing with the durability settings, the
foreground device was an NVME device with durability=2, and originally
he'd set the background device to durability=2 as well, but changed it
back to 1 (the default) after seeing IO errors.

That meant that with replicas=2, we want to move data off the NVME
device which satisfies that constraint, but with a single durability=1
device on the background target there's no way to move the extent to
that target while satisfiying the "required replicas" constraint.

The solution for now is for bch2_data_update_init() to check for this,
and return an error - before kicking off the read.

bch2_data_update_init() already had two different checks for "will we be
able to write this extent", with partially duplicated code, so this
patch combines and improves that logic.

Additionally, we now always bail out and return an error if there's
insufficient space on the destination target. Previously, we only did
this for BCH_WRITE_alloc_nowait moves, because it might be the case that
copygc just needs to free up space on the destination target.

But we really shouldn't kick off a move if the destination is full, we
can't currently distinguish between "really full" and "just need to wait
for copygc", and if we are going to wait on copygc it'd be better to do
that before kicking off the move.

This will additionally fix "rebalance spinning" issues caused by a
filesystem that has more data than can fit in background_target - which
is a valid scenario, since we don't exclude foreground/cache devices
when calculating filesystem capacity.

Reported-by: Maël Kerbiriou <mael.kerbiriou@free.fr>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:34 -04:00
Kent Overstreet
fb8a9a32cc bcachefs: trace_io_move_write_fail
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:34 -04:00
Alan Huang
76bc6e51cd bcachefs: Increase blacklist range
Now there are 16 journal buffers, 8 is too small to be enough.

Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:34 -04:00
Kent Overstreet
35de2abc22 bcachefs: __bch2_read() now takes a btree_trans
Next patch will be checking if the extent we're reading from matches the
IO failure we saw before marking the failure.

For this to work, __bch2_read() needs to take the same transaction
context that bch2_rbio_retry() uses to do that check.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:34 -04:00
Kent Overstreet
3fb8bacb14 bcachefs: BCH_READ_data_update -> bch_read_bio.data_update
Read flags are codepath dependent and change as they're passed around,
while the fields in rbio._state are mostly fixed properties of that
particular object.

Losing track of BCH_READ_data_update would be bad, and previously it was
not obvious if it was always correctly set in the rbio, so this is a
safety cleanup.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:34 -04:00
Uros Bizjak
8a3c392388 percpu: use TYPEOF_UNQUAL() in variable declarations
Use TYPEOF_UNQUAL() to declare variables as a corresponding type without
named address space qualifier to avoid "`__seg_gs' specified for auto
variable `var'" errors.

Link: https://lkml.kernel.org/r/20250127160709.80604-4-ubizjak@gmail.com
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Acked-by: Nadav Amit <nadav.amit@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16 22:05:53 -07:00
Kent Overstreet
be31e412ac bcachefs: Checksum errors get additional retries
It's possible for checksum errors to be transient - e.g. flakey
controller or cable, thus we need additional retries (besides retrying
from different replicas) before we can definitely return an error.

This is particularly important for the next patch, which will allow the
data move path to move extents with checksum errors - we don't want to
accidentally introduce bitrot due to a transient error!

- bch2_bkey_pick_read_device() is substantially reworked, and
  bch2_dev_io_failures is expanded to record more information about the
  type of failure (i.e. number of checksum errors).

  It now returns an error code that describes more precisely the reason
  for the failure - checksum error, io error, or offline device, instead
  of the previous generic "insufficient devices". This is important for
  the next patches that add poisoning, as we only want to poison extents
  when we've got real checksum errors (or perhaps IO errors?) - not
  because a device was offline.

- Add a new option and superblock field for the number of checksum
  retries.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-16 13:47:55 -04:00
Kent Overstreet
ccba9029b0 bcachefs: Print message on successful read retry
Users have been asking for this, and now that errors are returned to the
top level read retry path - we can.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-16 13:47:55 -04:00
Kent Overstreet
de73677ff8 bcachefs: Return errors to top level bch2_rbio_retry()
Next patch will be adding an additional retry loop for checksum errors,
so that we can rule out transient errors before marking an extent as
poisoned.

Prerequisite to this is returning errors to bch2_rbio_retry(); this will
also let us add a "successful retry" message.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-16 13:47:55 -04:00
Kent Overstreet
881b598ef1 bcachefs: BCH_ERR_data_read_buffer_too_small
Now that the read path uses proper error codes, we can get rid of the
weird rbio->hole signalling to the move path that the read didn't
happen.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-16 13:47:55 -04:00
Kent Overstreet
f4b84bac20 bcachefs: Read error message now indicates if it was for an internal move
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-16 13:47:55 -04:00
Kent Overstreet
e75993b0bf bcachefs: Fix BCH_ERR_data_read_csum_err_maybe_userspace in retry path
When we do a read to a buffer that's mapped into userspace, it's
possible to get a spurious checksum error if userspace was modified the
buffer at the same time.

When we retry those, they have to be bounced before we know definitively
whether we're reading corrupt data.

But the retry path propagates read flags differently, so needs special
handling.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-16 13:47:55 -04:00
Kent Overstreet
943f0cfb15 bcachefs: Convert read path to standard error codes
Kill the READ_ERR/READ_RETRY/READ_RETRY_AVOID enums, and add standard
error codes that describe precisely which error occured.

This is going to be used for the data move path, to move but poison
extents with checksum errors.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-16 13:47:55 -04:00
Kent Overstreet
5a06cb8000 bcachefs: Debug params for data corruption injection
dm-flakey is busted, and this is simpler anyways - this lets us test the
checksum error retry ptahs

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-16 13:47:55 -04:00
Kent Overstreet
6d80fca9ef bcachefs: Don't create bch_io_failures unless it's needed
Only needed in retry path, no point in wasting stack space.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-16 13:47:55 -04:00
Kent Overstreet
9ec0089149 bcachefs: bch2_bkey_ptrs_rebalance_opts()
Small optimization for bch2_bkey_sectors_need_rebalance()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-16 13:47:55 -04:00
Kent Overstreet
7c1e2a254f bcachefs: Add a cond_resched() to btree cache teardown
[12308.606480] watchdog: BUG: soft lockup - CPU#18 stuck for 26s! [umount:48479]
[12308.606485] Modules linked in: bcachefs lz4hc_compress lz4_compress lz4_decompress sunrpc overlay nf_conntrack_netlink xt_nat xt_tcpudp veth xt_conntrack xt_MASQUERADE bridge stp llc xfrm_user ip6table_nat ip6table_filter ip6_tables iptable_nat xt_addrtype iptable_filter ip_tables x_tables nfnetlink_cttimeout nfnetlink openvswitch nsh nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 psample ext4 mbcache jbd2 nls_iso8859_1 nls_cp850 vfat fat binfmt_misc skx_edac_common nfit edac_core libnvdimm cbc encrypted_keys intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common ipmi_ssif x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm drivetemp rapl intel_cstate coretemp mgag200 i2c_algo_bit ixgbe drm_shmem_helper drm_kms_helper mdio_devres xfrm_algo mdio drm ptp intel_uncore mei_me efi_pstore evdev uas pl2303 pps_core libphy usb_storage usbserial lpc_ich mei drm_panel_orientation_quirks acpi_power_meter tiny_power_button ipmi_si mfd_core intel_pch_thermal acpi_tad acpi_ipmi ioatdma
[12308.606541]  ipmi_devintf ipmi_msghandler dca wmi button efivarfs polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 sha1_generic xhci_pci xhci_hcd aesni_intel ehci_pci ehci_hcd gf128mul crypto_simd cryptd usbcore hpwdt usb_common
[12308.606557] CPU: 18 UID: 0 PID: 48479 Comm: umount Tainted: G             L     6.14.0-rc6-x86_64-00159-ga09496a03e63 #1
[12308.606560] Tainted: [L]=SOFTLOCKUP
[12308.606561] Hardware name: HPE ProLiant DL380 Gen10/ProLiant DL380 Gen10, BIOS U30 07/20/2023
[12308.606563] RIP: 0010:clear_page_erms+0x7/0x10
[12308.606570] Code: 48 89 47 38 48 8d 7f 40 75 d9 90 c3 cc cc cc cc 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 b9 00 10 00 00 31 c0 <f3> aa c3 cc cc cc cc 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90
[12308.606572] RSP: 0018:ffff9ed5b622fba0 EFLAGS: 00010246
[12308.606574] RAX: 0000000000000000 RBX: ffff90347fffe6c0 RCX: 00000000000004c0
[12308.606575] RDX: ffffe34ea9bec1c0 RSI: 00000000000405f0 RDI: ffff902eafb07b40
[12308.606576] RBP: ffff9ed5b622fbf0 R08: 0000000000000001 R09: 0000000000000006
[12308.606577] R10: 0000000000040001 R11: 0000000000000000 R12: ffffe34ea9bec000
[12308.606578] R13: 0000000000000000 R14: 0000000000000006 R15: ffffe34ea9bed000
[12308.606580] FS:  00007fe704ecfb68(0000) GS:ffff9053fea00000(0000) knlGS:0000000000000000
[12308.606581] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[12308.606582] CR2: 00007f18159068ae CR3: 00000001314d0005 CR4: 00000000007726f0
[12308.606583] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[12308.606584] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[12308.606584] PKRU: 55555554
[12308.606585] Call Trace:
[12308.606587]  <IRQ>
[12308.606590]  ? show_regs.cold+0x19/0x28
[12308.606595]  ? watchdog_timer_fn.cold+0x3d/0x9d
[12308.606598]  ? __pfx_watchdog_timer_fn+0x10/0x10
[12308.606602]  ? __hrtimer_run_queues+0x12e/0x250
[12308.606607]  ? hrtimer_interrupt+0xfd/0x220
[12308.606609]  ? __sysvec_apic_timer_interrupt+0x53/0xe0
[12308.606614]  ? sysvec_apic_timer_interrupt+0x76/0xa0
[12308.606619]  </IRQ>
[12308.606620]  <TASK>
[12308.606620]  ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
[12308.606626]  ? clear_page_erms+0x7/0x10
[12308.606628]  ? __free_pages_ok+0x374/0x640
[12308.606633]  free_frozen_pages+0x34/0x570
[12308.606636]  __folio_put+0x87/0xe0
[12308.606641]  free_large_kmalloc+0x70/0x80
[12308.606645]  kfree+0x2f6/0x390
[12308.606648]  kvfree+0x2d/0x40
[12308.606653]  __btree_node_data_free+0xaf/0xf0 [bcachefs]
[12308.606726]  btree_node_data_free+0x6a/0x80 [bcachefs]
[12308.606778]  bch2_fs_btree_cache_exit+0x262/0x440 [bcachefs]
[12308.606829]  bch2_fs_release+0xe8/0x340 [bcachefs]
[12308.606905]  kobject_put+0x60/0xc0
[12308.606908]  bch2_fs_free+0xdd/0x120 [bcachefs]
[12308.606981]  bch2_kill_sb+0x1e/0x30 [bcachefs]
[12308.607051]  deactivate_locked_super+0x32/0xb0
[12308.607055]  deactivate_super+0x40/0x50
[12308.607057]  cleanup_mnt+0xc3/0x160
[12308.607060]  __cleanup_mnt+0x12/0x20
[12308.607062]  task_work_run+0x5f/0xa0
[12308.607064]  syscall_exit_to_user_mode+0x194/0x1a0
[12308.607066]  do_syscall_64+0x67/0x170
[12308.607068]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[12308.607070] RIP: 0033:0x7fe704e66eed
[12308.607073] Code: 08 49 89 ca b8 a5 00 00 00 0f 05 48 89 c7 e8 8a e6 ff ff 48 83 c4

Reported-by: Stijn Tintel <stijn@linux-ipv6.be>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-16 13:47:55 -04:00
Kent Overstreet
c991fbee8e bcachefs: rebalance, copygc status also print stacktrace
These are commonly needed when debugging, and saves from having to ask
users to dig.

Also, rebalance_status now includes pending rebalance work.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-16 13:47:55 -04:00
Kent Overstreet
8dc4514d58 bcachefs: Kill bch2_remount()
Single caller, so inline it.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:03:16 -04:00
Kent Overstreet
a2e9e68746 bcachefs: Kill a bit of dead code
Found with CC=clang W=1

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:16 -04:00
Thorsten Blum
ff4cb203cc bcachefs: Use max() to improve gen_after()
Use max() to simplify gen_after() and improve its readability.

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:16 -04:00
Thorsten Blum
c073ec6bec bcachefs: Remove unnecessary byte allocation
The extra byte is not used - remove it.

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:16 -04:00
Kent Overstreet
94373026d9 bcachefs: We no longer read stripes into memory at startup
And the stripes heap gets deleted.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:16 -04:00
Kent Overstreet
434a3f2ffa bcachefs: trace_stripe_create
Add a simple tracepoint for stripe creation, we'll want to expand this
later.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:16 -04:00
Kent Overstreet
6c336144b9 bcachefs: get_existing_stripe() uses new stripe lru
Convert to the new persistent stripe LRU.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:16 -04:00
Kent Overstreet
039790cfb5 bcachefs: ec_stripe_delete() uses new stripe lru
Convert to the new persistent stripe LRU.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:16 -04:00
Kent Overstreet
4b0fac4bed bcachefs: journal write path comment
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:16 -04:00
Kent Overstreet
981e380144 bcachefs: Kick devices out after too many write IO errors
We're improving our handling of write errors - we shouldn't write
degraded data just because a write failed once, we should retry it (on
other devices, if possible).

But for this to work, we need to kick devices out when they're only
returning errors - otherwise those retries will loop infinitely.

This adds a configurable timeout - if writes are failing for too long,
we'll set that device read-only.

In the future we should also implement more tracking and another knob
for an "allowed error rate", so that we can kick out drives that are
acting "unhealthy".

Another thing we'll want is a mechanism (likely in userspace) for
bringing a device back in after a transient error - perhaps a cable was
jiggled, or there was a controller reset.

After transient errors we also need a mechanism to walk (from the
journal) recent btree updates that weren't flushed to that device and
treat them as "degraded", since unflushed data may well not have been
written. Out of scope for this patch, but becoming relevant.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:16 -04:00
Kent Overstreet
d71e023376 bcachefs: Change BCH_MEMBER_STATE_failed semantics
Previously, we woudn't try to read at all from a failed device - that
doesn't make much sense, the device may be unhealthy (perhaps taking
longer than it should to service reads), but if it's our only option we
should still try to read from it.

Now, bch2_bkey_pick_read_device() will pick failed devices only if there
are no non-failed replicas to read from.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:16 -04:00
Kent Overstreet
cf164a9106 bcachefs: bch2_dev_get_ioref() may now sleep
The next patch implementing freezing will change bch2_dev_get_ioref() to
sleep if a device is currently frozen.

Add an annotation and fix the journal code accordingly.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:16 -04:00
Kent Overstreet
2efa8397ca bcachefs: Fix btree_node_scan io_ref handling
This was completely fubar; it's now simplified a bit as well.
Note that for_each_online_member() takes and releases io_refs as it
iterates, so we need to release that if we break.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:16 -04:00
Kent Overstreet
d5308203a8 bcachefs: Implement blk_holder_ops
We can't use the standard fs_holder_ops because they're meant for single
device filesystems - fs_bdev_mark_dead() in particular - and they assume
that the blk_holder is the super_block, which also doesn't work for a
multi device filesystem.

These generally follow the standard fs_holder_ops; the
locking/refcounting is a bit simplified because c->ro_ref suffices, and
bch2_fs_bdev_mark_dead() is not necessarily shutting down the entire
filesystem.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:16 -04:00
Kent Overstreet
1fdbe0b184 bcachefs: Make sure c->vfs_sb is set before starting fs
This is necessary for the new blk_holder_ops, which want the vfs
super_block available for synchronization.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:16 -04:00
Kent Overstreet
13fd6be102 bcachefs: Stash a pointer to the filesystem for blk_holder_ops
Note that we open block devices before we allocate bch_fs, but once
attached to a filesystem they will be closed before the bch_fs is torn
down - so stashing a pointer without a refcount looks incorrect but it's
not.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:16 -04:00
Kent Overstreet
b31c070407 bcachefs: Finish bch2_account_io_completion() conversions
More prep work for automatically kicking devices out after too many IO
errors.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:16 -04:00
Kent Overstreet
3526bca36b bcachefs: bch2_account_io_completion()
We need to start accounting successes for every IO, not just failures,
so introduce a unified hook for io completion accounting and convert
io_read.c.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:16 -04:00
Kent Overstreet
3480aecd5f bcachefs: Fix read path io_ref handling
We were using our device pointer after we'd released our ref to it.

Unlikely to be a race that's practical to hit, since actually removing a
member device is a whole process besides just taking it offline, but -
needs to be fixed.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:16 -04:00
Kent Overstreet
7bc5808168 bcachefs: data_update now checks for extents that can't be moved
If a device is ro or failed, we might not have anywhere to move a
replica.

Check for this early, before doing the read and attempting to write.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:15 -04:00
Kent Overstreet
fba513a9ee bcachefs: give bch2_write_super() a proper error code
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:15 -04:00
Kent Overstreet
4a90675cfe bcachefs: bcachefs_metadata_version_extent_flags
This implements a new extent field bitflags that apply to the whole
extent. There's been a couple things we've wanted this for in the past,
but the immediate need is extent poisoning, to solve a rebalance issue.

Unknown extent fields can't be parsed (we won't known their size, so we
can't advance to the next field), so this is an incompat feature, and
using it prevents the filesystem from being mounted by old versions.

This also adds the BCH_EXTENT_poisoned flag; this indicates that the
data is known to be bad (i.e. there was a checksum error, and we had to
write a new checksum) and reads will return errors.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:15 -04:00
Kent Overstreet
6422bf8117 bcachefs: bch2_request_incompat_feature() now returns error code
For future usage, we'll want a dedicated error code for better
debugging.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:15 -04:00
Thorsten Blum
bafd41b435 bcachefs: Fix error type in bch2_alloc_v3_validate()
Use error type alloc_v3_unpack_error in bch2_alloc_v3_validate().

Fixes: b65db750e2 ("bcachefs: Enumerate fsck errors")
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:15 -04:00
Kent Overstreet
fb195fa753 bcachefs: BCH_SB_FEATURES_ALL includes BCH_FEATURE_incompat_verison_field
These features are set on format and incompat upgarde.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:15 -04:00
Kent Overstreet
24d790a7da bcachefs: sysfs internal/trigger_btree_updates
Add a debug knob to manually trigger the btree updates worker.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:15 -04:00
Joshua Ashton
d37c14ac6f bcachefs: bcachefs_metadata_version_casefolding
This patch implements support for case-insensitive file name lookups
in bcachefs.

The implementation uses the same UTF-8 lowering and normalization that
ext4 and f2fs is using.

More information is provided in Documentation/bcachefs/casefolding.rst

Compatibility notes:

This uses the new versioning scheme for incompatible features where an
incompatible feature is tied to a version number: the superblock says
"we may use incompat features up to x" and "incompat features up to x
are in use", disallowing mounting by previous versions.

Additionally, and old style incompat feature bit is used, so that
kernels without utf8 casefolding support know if casefolding
specifically is in use and they're allowed to mount.

Signed-off-by: Joshua Ashton <joshua@froggi.es>
Cc: André Almeida <andrealmeid@igalia.com>
Cc: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:15 -04:00
Joshua Ashton
76872d46b7 bcachefs: Split out dirent alloc and name initialization
Splits out the code that allocates the dirent and initializes the name
to make things easier to implement casefolding in a future commit.

Cc: André Almeida <andrealmeid@igalia.com>
Cc: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Joshua Ashton <joshua@froggi.es>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:15 -04:00
Kent Overstreet
72f4edcf45 bcachefs: Kill dirent_occupied_size() in create path
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:15 -04:00
Kent Overstreet
68171d91ce bcachefs: Kill dirent_occupied_size() in rename path
Cc: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:15 -04:00
Kent Overstreet
6756e385a5 bcachefs: bcachefs_metadata_version_stripe_lru
Add a persistent LRU for stripes, ordered by "number of empty blocks",
i.e. order in which we wish to reuse them.

This will replace the in-memory stripes heap, so we can kill off reading
stripes into memory at startup.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:15 -04:00
Kent Overstreet
88d961b518 bcachefs: bcachefs_metadata_version_stripe_backpointers
Stripes now have backpointers.

This is needed for proper scrub - stripe checksums need to be verified,
separately from extents within the stripe, since a block may not be full
of live extents but it's still needed for reconstruct.

And this will be needed for (efficient) evacuate/repair paths.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:15 -04:00
Kent Overstreet
69bd8a9277 bcachefs: Advance bch_alloc.oldest_gen if no stale pointers
Now that we've got cached backpointers and aren't leaving around stale
pointers on bucket invalidation, we no longer need the periodic (rare)
gc_gens - which recalculates each bucket's oldest gen to avoid wraparound.

We can't delete that code because we've got to support existing
filesystems that will still have stale pointers, but this gets rid of
another scalability limit.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:14 -04:00
Kent Overstreet
942a418c7a bcachefs: Invalidate cached data by backpointers
If we don't leave stale pointers around, we won't have to deal with
bucket gen wraparound.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:14 -04:00
Kent Overstreet
15800f3d4b bcachefs: bcachefs_metadata_version_cached_backpointers
Cached pointers now have backpointers.

This means that we'll be able to kill cached pointers in the
bucket_invalidate path, when invalidating/reusing buckets containing
cached data, instead of leaving them around to be cleaned up by gc_gens
garbago collection - which requires a full metadata scan.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:14 -04:00
Kent Overstreet
65bc7688b8 bcachefs: rework bch2_trans_commit_run_triggers()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:14 -04:00
Kent Overstreet
c7c07bf250 bcachefs: Better trigger ordering
Transactional triggers need to run in a defined ordering, which is not
quite the same as btree ID integer comparison.

Previously this was handled in a hacky way in
bch2_trans_commit_run_triggers(), since it was only the alloc btree that
needed special handling, but upcoming stripe btree changes are going to
require more ordering changes - so, define that ordering.

Next patch will change the transaction commit path to use it.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:14 -04:00
Kent Overstreet
cc297dfb41 bcachefs: bch2_trigger_stripe_ptr() no longer uses ec_stripes_heap_lock
Introduce per-entry locks, like with struct bucket - the stripes heap is
going away.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:14 -04:00
Kent Overstreet
bc76ba70d2 bcachefs: Rework bch2_check_lru_key()
It's now easier to add new LRU types.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:14 -04:00
Kent Overstreet
3aff608b86 bcachefs: decouple bch2_lru_check_set() from alloc btree
Pass in the backpointer explicitly, instead of assuming 'referring_k' is
an alloc key and calculating it.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:14 -04:00
Kent Overstreet
b8e37c1645 bcachefs: s/BCH_LRU_FRAGMENTATION_START/BCH_LRU_BUCKET_FRAGMENTATION/
FRAGMENTATION_START was incorrect, there's currently only one
fragmentation LRU (at the end of the reserved bits for LRU type), and
we're getting ready to add a stripe fragmentation lru - so give it a
better name.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:14 -04:00
Kent Overstreet
e130496707 bcachefs: bch2_lru_change() checks for no-op
Minor cleanup, no reason for the caller to have to this.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:14 -04:00
Kent Overstreet
cb87f623c1 bcachefs: minor journal errcode cleanup
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:14 -04:00