Don't put btree locking asserts behind CONFIG_BCACHEFS_DEBUG, put them
behind a module parameter.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We'd like users to be able to debug without building custom kernels, so
this will help us get rid of CONFIG_BCACHEFS_DEBUG, at least for most
things.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
New superblock section for statistics on recovery passes - last time
ran (successfully), last runtime.
This will be used by self healing code to determine when to kick off
potentially expensive recovery passes.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
More self healing work: we're going to be calling
check_bucket_backpointer_mismatch() at runtime, outside of fsck.
Then when we need to we'll kick off the full
check_extents_to_backpointers recovery pass.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add 'opts.snapshot_deletion_enabled', enabled by default.
This may be turned off so that the new sysfs knob,
'internal/trigger_delete_dead_snapshots', may be used instead - this
will allow snapshot deletion to be profiled more easily.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Fast device removal, that uses backpointers to find pointers to the
device being removed instead of a full metadata scan.
This requires BCH_SB_MEMBER_DELETED_UUID, which is an incompatible
change - hence the version number bump. We don't fully trust
backpointers, so we don't want to reuse device indexes until after a
fsck has verified that there aren't any pointers to removed devices.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Currently, device removal has to scan all metadata for pointers to the
device being removed.
Add a new method, with the same interface as bch2_dev_data_drop(), that
scans by backpointers instead - this will drastically speed up device
removal.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add a sentinal value for devices that have been removed, but don't want
to reuse their index until a fsck has completed.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Since extents, dirents and xattrs require an inode with the
corresponding snapshot ID to exists, we can avoid a lot of scanning by
only scanning those trees for keys to process if the correspending inode
exists.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We're going to be speeding up snapshot deletion, by only having it
process the extents/dirents/xattrs btrees if an inode of a given
snapshot ID was present.
This raises the possibility of 'bkey_in_missing_snapshot' errors popping
up, if we ever accidentally don't do the corresponding inode update, or
if the new algorithm has bugs.
So instead of deleting snapshot IDs, add a new deleted flag, so that
'key in missing snapshot' errors can more definitively tell what
happened and automatically repair.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We're going to be speeding up snapshot deletion, by only having it
process the extents/dirents/xattrs btrees if an inode of a given
snapshot ID was present.
This raises the possibility of 'bkey_in_missing_snapshot' errors popping
up, if we ever accidentally don't do the corresponding inode update, or
if the new algorithm has bugs.
So we'll want to be able to differentiate more definitively between
'snapshot went missing' (and perhaps needs to be reconstructed), and
'key in snapshot that was deleted'.
So instead of deleting snapshot IDs, we'll be adding a new deleted flag
and leaving them permanently.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Don't scan keys in inodes for which the snapshot tree doesn't match any
we're deleting from.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We're going to be doing some snapshot deletion performance improvements,
and those will strictly require that if an extent/dirent/xattr is
present, an inode is present in that snapshot ID.
We already check for this, but we don't repair it on disk: this patch
adds that repair and turns it into a real fsck_err().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The next patch is going to change lookup_inode_for_snapshot to
rigorously require that a extent/dirent/xattr keys have a corresponding
inode key present - whiteouts included, so this simplifies the checks
lookup_inode_for_snapshot() will have to do.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
In 'bcachefs_metadata_extent_flags', we stopped requireding members_v1
to be present - only that either v1 or v2 is present.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The flexible array contains name and value, the x_name is misleading.
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Device add doesn't get the devide index and attach to the filesystem
until after attaching the block device, and setting the device name from
the block device name - these needs some minor tweaks.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
-Wflex-array-member-not-at-end was introduced in GCC-14, and we are
getting ready to enable it, globally.
Refactor a couple of structs that contain flexible arrays in the
middle by replacing them with unions.
So, with these changes, fix the following warnings:
fs/bcachefs/disk_accounting.c:429:51: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]
fs/bcachefs/ec_types.h:8:41: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Allow btree_insert_entry.ip_allocated to be passed in, so we get better
info on where alloc updates are coming from.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If we detect an error that requires running a recovery pass, and we're
not in recovery, we won't be able to fix it until the next mount - make
sure we're noting in the superblock that it needs to run.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Like we just did with the data read path, emit a single error message
per btree node reads, nicely formatted, with all the actions we took
grouped together.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Part of the ongoing project to improve error messages by building them
up in printbufs and emitting them all at once, so that we can easily see
what events are related in the log.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
No longer has users, so we can kill it and rename
bch2_run_explicit_recovery_pass_persistent_locked().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Instead of emitting a message immediately when we get an error in the
read path, and then another at the end if we successfully retry - emit
one single log message before returning from bch2_rbio_retry().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If the final line in in the message to be printed is blang, don't print
it.
This happens with indented printbufs - after a newline we emit spaces up
to the indent level.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add async objs list for
- promote_op
- bch_read_bio
- btree_read_bio
- btree_write_bio
This gets us introspection on in-flight async ops, and because under the
hood it uses fast_lists (percpu slot buffer on top of a radix tree),
it'll be fast enough to enable in production.
This will be very helpful for debugging "something got stuck" issues,
which have been cropping up from time to time (in the CI, especially
with folio writeback).
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Debugging infrastructure for async objs: this lets us easily create
fast_lists for various object types so they'll be visible in debugfs.
Add new object types to the BCH_ASYNC_OBJS_TYPES() enum, and drop a
pretty-printer wrapper in async_objs.c.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
A fast "list" data structure, which is actually a radix tree, with an
IDA for slot allocation and a percpu buffer on top of that.
Items cannot be added or moved to the head or tail, only added at some
(arbitrary) position and removed. The advantage is that adding, removing
and iteration is generally lockless, only hitting the lock in ida when
the percpu buffer is full or empty.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Pretty printer for struct bio, to be used for async object debugging.
This is pretty minimal, we'll add more to it as we discover what we
need.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Convert device IO refs to enumerated_refs, for easier debugging of
refcount issues.
Simple conversion: enumerate all users and convert to the new helpers.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Factor out the debug code for rw filesystem refs into a small library.
In release mode an enumerated ref is a normal percpu refcount, but in
debug mode all enumerated users of the ref get their own atomic_long_t
ref - making it much easier to chase down refcount usage bugs for when a
refcount has many users.
For debugging, we have enumerated_ref_to_text(), which prints the
current value of each different user.
Additionally, in debug mode enumerated_ref_stop() has a 10 second
timeout, after which it will dump outstanding refcounts.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This pops up when buliding in userspace.
The structs aren't actually variable length, but no way to tell the
compiler that...
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If a filesystem is going to only be used read-only, and will be a
deployable image, we can strip out alloc info for a substantial
reduction in metadata size - around half, due to backpointers.
Alloc info will be regenerated on first read-write mount.
Remounting RW is disallowed for now, since we don't yet have
check_allocations running in RW mode.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If the root inode/subvolume is unreadable we can repair automatically -
but only if we're still in recovery, so that we can rewind to the
appropriate recovery pass.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Instead of going emegency read only with a bch2_fs_inconsistent() call,
log the error and recovery pass appropriately.
If we're still in recovery it'll be repaired immediately, otherwise
it'll be repaired on the next mount.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_print_string_as_lines() is a low level helper that allows messages
longer than 1k to be printed without truncation.
But we should always be printing with the helpers that take a filesystem
object, if we're in fsck they direct output to the userspace process
controlling fsck instead of the dmesg log.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Part of the ongoing project to kill off bch2_(fs|trans)_inconsistent
calls - they generally need to be replaced with either
- a fsck_err() call that can repair the error, or
- logging an error of the appropriate type in the superblock, and
flagging the appropriate recovery pass to repair the error
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We prefer helpers that emit log messages to printbufs rather than
printing them directly; that way, we can ensure that different log
messages from the same event are grouped together and formatted
appropriately in the dmesg log.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
version_upgrade is now a runtime option.
In the future we'll want to add compatible upgrades at runtime, and call
the full check_version_upgrade() when the option changes, but we don't
have compatible optional upgrades just yet.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The helpers are now:
- bch2_opt_hook_pre_set()
- bch2_opts_hooks_pre_set()
- bch2_opt_hook_post_set
Fix a bug where the filesystem discard option would incorrectly be
changed when setting the device option, and don't trigger rebalance
scans unnecessarily (when options aren't changing).
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Single device filesystems are now identified by the block device name,
not the UUID - and single device filesystems with the same UUID can be
mounted simultaneously, without any special options.
This allocates a new bit in the superblock, BCH_SB_MULTI_DEVICE, which
indicates whether a filesystem has ever been multi device.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
On single device filesystems, c->name contains the block device name,
not the UUID.
Initialize this earlier, so that single device mode can use it for
initializing sysfs/debugfs.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Make the output slightly clearer, and include a counter for "nodes we
couldn't free because we would have gone under our reserve".
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kill 'opts.very_degraded', and make 'opts.degraded' a persistent option,
stored in the superblock.
It's now an enum, with available choices ask/yes/very/no.
"ask" mode will be handled by the mount helper, for prompting the user
(on a machine used interactively) for whether to do a degraded mount.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This patch uses printbuf_indent_add_nextline() to set a consistent
indentation level for error messages of invalid compression.
In my previous patch [1], the newline is added by using '\n' in
the argument of prt_str(). This patch replaces prt_str() with
prt_printf() to make indentation level work correctly.
[1] Link: https://lore.kernel.org/20250406152659.205997-2-integral@archlinuxcn.org
Signed-off-by: Integral <integral@archlinuxcn.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When an invalid compression type or level is passed as an argument
to `--compression`, two error messages are squashed into one line:
> bcachefs format --compression=lzo bcachefs-comp.img
invalid option: invalid compression typecompression: parse error
> bcachefs format --compression=lz4:16 bcachefs-comp.img
invalid option: invalid compression levelcompression: parse error
To resolve this issue, add a newline character at the end of the
first error message to separate them into two lines.
Signed-off-by: Integral <integral@archlinuxcn.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Currently, when passing a negative integer as argument, the error
message is "too big" due to casting to an unsigned integer:
> bcachefs format --block_size=-1 bcachefs.img
invalid option: block_size: too big (max 65536)
When negative value in argument detected, return early before
calling bch2_opt_validate().
A new error code `BCH_ERR_option_negative` is added.
Signed-off-by: Integral <integral@archlinuxcn.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Defer memory allocations only needed in RW mode until we actually go RW.
This is part of improved support for RO images.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
_init_early() is for initialization that cannot fail, and often must
happen for teardown partway through initialization to work.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add a better helper for check_snapshot_exists().
create_snapids() can't be changed to use this, unfortunately, because
the transaction that creates new snapshot will also be inserting other
keys (e.g. root inode) that reference that snapshot ID, and they expect
the snapshot table to already be updated.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
More stack usage improvements: instead of creating a new alloc_request
(currently on the stack), save/restore just the fields we need to reuse.
This is a bit tricky, because we're doing a normal alloc_foreground.c
allocation, which calls into ec.c to get a stripe, which then does more
normal allocations - some of the fields get reused, and used
differently.
So we have to save and restore them - but the stack usage improvements
will be well worth it.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add a struct for common state for satisfying an on disk allocation,
instead of passing the same long list of items to every function.
This will help with stack usage, performance, and perhaps enable some
code cleanups.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We're occasionally seeing the WARN_ON() for bump allocator usage
exceeding BTREE_TRANS_MEM_MAX; add some tracing so we can see what's
going on.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This was achieved before by zero-ing out the source buffer and then
copying the bytes into the destination buffer. This can also be done with
memcpy_and_pad which will zero out only the destination buffer if its
size is bigger than the size of the source buffer. This is already used in
the same way in journal_transaction_name().
Moreover, zero-ing the source buffer was done twice, first in
__bch2_fs_log_msg() and then in bch2_trans_log_msg(). And this method
may also require allocating some extra memory for the source buffer.
In conclusion, using memcpy_and_pad is better even tough the result is
the same because it brings uniformity with what's already used in
journal_transaction_name, it avoids code duplication and reallocating
extra memory.
Signed-off-by: Roxana Nicolescu <nicolescu.roxana@protonmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Strncpy is now deprecated.
The buffer destination is not required to be NULL-terminated, but we also
want to zero out the rest of the buffer as it is already done in other
places.
Link: https://github.com/KSPP/linux/issues/90
Signed-off-by: Roxana Nicolescu <nicolescu.roxana@protonmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Let's not move poisoned extents unnecessarily, since we can't guard
against introducing more bitrot.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Now, if an extent is poisoned we can move it even if there was a
checksum error. We'll have to give it a new checksum, but the poison bit
means that userspace will still see the appropriate error when they try
to read it.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Copygc needs to be able to move extents that have bitrotted. We don't
want to delete them - in the future we'll have an API for "read me the
data even if there's checksum errors", and in general we don't want to
delete anything unless the user asks us to.
That will require writing it with a new checksum, which means we can't
forget that there was a checksum error so we return the correct error to
userspace.
Rebalance also wants to skip bad extents; we can now use the poison flag
for that.
This is currently disabled by default, as we want read fua support so
that we can distinguish between transient and permanent errors from the
device. It may be enabled with the module parameter:
poison_extents_on_checksum_error
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If the extent we're reading from changes, due to be being overwritten or
moved (possibly partially) - we need to reset bch_io_failures so that we
don't accidentally mark a new extent as poisoned prematurely.
This means we have to separately track (in the retry path) the extent we
previously read from.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Check for mismatches between casefold dirents and casefold directories.
A mismatch will cause lookups to fail, as we'll be doing the lookup with
the casefolded name, which won't match the non-casefolded dirent, and
vice versa.
Reported-by: Christopher Snowhill <chris@kode54.net>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_dirent_create_snapshot(), used in fsck, neglected to create a
casefolded dirent.
Just move this into dirent_create_key().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Changing the casefold option requires extra checks/work - factor out a
helper from bch2_fileattr_set() for the xattr code to use.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
btree_key_cache_fill() will allocate and traverse another path (for the
underlying btree), so we can't hold pointers to paths across a call - we
have to pass indices.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
fsck_err() needs the btree transaction passed to it if there is one - so
that it can unlock/relock around prompting userspace for fixing the
error.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Fsck wants to do transaction commits from an outer context; it may have
other repair to do (i.e. duplicate backpointers).
But when calling backpointer_not_found() from runtime code, i.e. runtime
self healing, we should be doing the commit - the outer context expects
to just be doing lookups.
This fixes bugs where we get stuck spinning, reported as "RCU lock hold
time warnings.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Since bch2_seek_pagecache_data() searches for dirty data, we only want
to call it for holes in the extents btree - otherwise we have an
accidental O(n^2), as we repeatedly search the same range.
Reported-by: Marcin Mirosław <marcin@mejor.pl>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
set_should_be_locked() needs to be called before peek_key_cache(), which
traverses other paths and may do a trans unlock/relock.
This fixes an assertion pop in path_peek_slot(), when the path we're
using is unexpectedly not uptodate.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Before invoking bch2_accounting_mem_mod_locked in
bch2_gc_accounting_done, we already write locked mark_lock,
in bch2_accounting_mem_insert, we lock mark_lock again.
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Prevent jobs that do lots of scanning (i.e. evacuatee, scrub) from
causing OOMs.
The shrinker code seems to be having issues when it doesn't do any
freeing because it's just flipping off the acccessed bit - and the
accessed bit shouldn't be set on first use anyways.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When the journal is low on space, we might do discards from
journal_res_get() -> journal_entry_open().
Make sure we set j->can_discard correctly, so that if we're low on space
but not because discards aren't keeping up we don't livelock.
Fixes: 8e4d28036c ("bcachefs: Don't aggressively discard the journal")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This fixes btree locking assert pops users were seeing during evacuate:
https://github.com/koverstreet/bcachefs/issues/878
May 09 22:45:02 sharon kernel: bcachefs (68116e25-fa2d-4c6f-86c7-e8b431d792ae): bch2_btree_insert_node(): node not locked at level 1
May 09 22:45:02 sharon kernel: bch2_btree_node_rewrite [bcachefs]: watermark=btree no_check_rw alloc l=0-1 mode=none nodes_written=0 cl.remaining=2 journal_seq=0
May 09 22:45:02 sharon kernel: path: idx 1 ref 1:0 S B btree=alloc level=0 pos 0:3699637:0 0:3698012:1-0:3699637:0 bch2_move_btree.isra.0+0x1db/0x490 [bcachefs] uptodate 0 locks_want 2
May 09 22:45:02 sharon kernel: l=0 locks intent seq 4 node ffff8bd700c93600
May 09 22:45:02 sharon kernel: l=1 locks unlocked seq 1712 node ffff8bd6fd5e7a00
May 09 22:45:02 sharon kernel: l=2 locks unlocked seq 2295 node ffff8bd6cc725400
May 09 22:45:02 sharon kernel: l=3 locks unlocked seq 0 node 0000000000000000
Evacuate walks btree nodes with bch2_btree_iter_next_node() and rewrites
them, bch2_btree_update_start() upgrades the path to take intent locks
as far as it needs to.
But next_node() does low level unlock/relock calls on individual nodes,
and didn't handle the case where a path is supposed to be holding
multiple intent locks. If a path has locks_want > 1, it needs to be
either holding locks on all the btree nodes (at each level) requested,
or none of them.
Fix this with a bch2_btree_path_downgrade().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Fix bch2_bkey_clear_needs_rebalance(): indirect extents are never
supposed to have bch_extent_rebalance stripped off, because that's how
we get the IO path options when we don't have the original inode it
belonged to.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Now that the ChaCha state matrix is strongly-typed, add a helper
function chacha_zeroize_state() which zeroizes it. Then convert all
applicable callers to use it instead of direct memzero_explicit. No
functional changes.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The ChaCha state matrix is 16 32-bit words. Currently it is represented
in the code as a raw u32 array, or even just a pointer to u32. This
weak typing is error-prone. Instead, introduce struct chacha_state:
struct chacha_state {
u32 x[16];
};
Convert all ChaCha and HChaCha functions to use struct chacha_state.
No functional changes.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Deduplicate the same functionality implemented in several places by
moving the cmp_int() helper macro into linux/sort.h.
The macro performs a three-way comparison of the arguments mostly useful
in different sorting strategies and algorithms.
Link: https://lkml.kernel.org/r/20250427201451.900730-1-pchelkin@ispras.ru
Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
Suggested-by: Darrick J. Wong <djwong@kernel.org>
Acked-by: Kent Overstreet <kent.overstreet@linux.dev>
Acked-by: Coly Li <colyli@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Carlos Maiolino <cem@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Coly Li <colyli@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
We frequently use 'bcachefs list_journal -a' for debugging, as it
provides a record of all btree transactions, and a history of what
happened.
But it's not so useful if we immediately discard journal buckets right
after they're no longer dirty.
This tweaks journal reclaim to only discard when we're low on space,
keeping the journal mostly un-discarded.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When we go emergency read-only, make sure we do a final write_super() to
persist counters and error counts - this can be critical for piecing
together what fsck was doing.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We often filter out EROFS errors to avoid log spew after an emergency
shutdown - journal_shutdown is just another emergency shutdown error.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This reverts
1fdbe0b184 bcachefs: Make sure c->vfs_sb is set before starting fs
switched up bch2_fs_get_tree() so that we got a superblock before
calling bch2_fs_start, so that c->vfs_sb would always be initialized
while the filesystem was active.
This turned out not to be necessary, because blk_holder_ops were
implemented using our own locking, not vfs locking.
And this had the side effect of creating a super_block and doing our
full recovery (including potentially fsck) before setting SB_BORN, which
causes things like sync calls to hang until our recovery is finished.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
wake_up() doesn't require a barrier - but wake_up_bit() does.
This only affected non x86, and primarily lead to lost wakeups after
btree node reads.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
There was a buggy version of bcachefs-tools which picked misaligned
bucket sizes when formatting, and we're also about to do dynamic block
sizes - which will allow picking logical block size or physical block
size of the device per-write, allowing for better compression ratios at
the cost of slightly worse write performance (i.e. forcing the device to
do RMW or extra buffering).
To account for this, tweak bch2_alloc_sectors_start() to properly align
open_buckets to the blocksize of the write we're about to do.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If promote target isn't set, rebalance should still leave a cached copy
on the faster device.
Fall back to foreground_target if it's set, or allow a cached copy on
any device if neither are set.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_stdio_redirect_vprintf() was missing a check for stdio->done, i.e.
exiting.
This caused the thread attempting to print to spin, and since it was
being called from the kthread ran by thread_with_stdio, the userspace
side hung as well.
Change it to return -EPIPE - i.e. writing to a pipe that's been closed.
Reported-by: Jan Solanti <jhs@psonet.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_sb_disk_groups_to_cpu() goes off of the superblock member info, so
we need to set that first.
Reported-by: Stijn Tintel <stijn@linux-ipv6.be>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We won't be root causing this in the immediate future, and it's fairly
innocuous - so just log it in the superblock.
https://github.com/koverstreet/bcachefs/issues/869
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
- Don't call bch2_trans_relock() after dir_emit(); taking a transaction
restart here will cause us to emit the same dirent to userspace twice
- Fix incorrect checking of the return value on dir_emit(): "true" means
success, keep going, but bch2_dir_emit() needs to return true when
we're finished iterating.
https://github.com/koverstreet/bcachefs/issues/867
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We can hit this limit fairly easy when we have to reconstuct large
amounts of alloc info on large filesystems.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If loosing a btree won't cause data loss - i.e. it's an alloc btree, or
we can easily reconstruct it - we shouldn't require user action to
continue repair.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Restart handling in the previous patch was incorrect, so: move btree
operations into a separate helper, and run it with a lockrestart_do().
Additionally, clarify whether pagecache or the btree takes precedence.
Right now, the btree takes precedence: this is incorrect, but it's
needed to pass fstests. Add a giant comment explaining why.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bcachefs currently populates fiemap data from the extents btree.
This works correctly when the fiemap sync flag is provided, but if
not, it skips all delalloc extents that have not yet been flushed.
This is because delalloc extents from buffered writes are first
stored as reservation in the pagecache, and only become resident in
the extents btree after writeback completes.
Update the fiemap implementation to process holes between extents by
scanning pagecache for data, via seek data/hole. If a valid data
range is found over a hole in the extent btree, fake up an extent
key and flag the extent as delalloc for reporting to userspace.
Note that this does not necessarily change behavior for the case
where there is dirty pagecache over already written extents, where
when in COW mode, writeback will allocate new blocks for the
underlying ranges. The existing behavior is consistent with btrfs
and it is recommended to use the sync flag for the most up to date
extent state from fiemap.
Signed-off-by: Brian Foster <bfoster@redhat.com>
The bulk of the loop in bch2_fiemap() involves processing the
current extent key from the iter, including following indirections
and trimming the extent size and such. This patch makes a few
changes to reduce the size of the loop and facilitate future changes
to support delalloc extents.
Define a new bch_fiemap_extent structure to wrap the bkey buffer
that holds the extent key to report to userspace along with
associated fiemap flags. Update bch2_fill_extent() to take the
bch_fiemap_extent as a param instead of the individual fields.
Finally, lift the bulk of the extent processing into a
bch2_fiemap_extent() helper that takes the current key and formats
the bch_fiemap_extent appropriately for the fill function.
No functional changes intended by this patch.
Signed-off-by: Brian Foster <bfoster@redhat.com>
FIEMAP_FLAG_SYNC handling was deliberately moved into core code in
commit 45dd052e67 ("fs: handle FIEMAP_FLAG_SYNC in fiemap_prep"),
released in kernel v5.8. Update bcachefs accordingly.
Signed-off-by: Brian Foster <bfoster@redhat.com>
At the end of the inode, on an extents iterator, peek_slot() has to
advance to the next position to avoid returning a 0 size extent, which
is not allowed.
Changing iter->pos confuses peek_prev(), but we don't need to call
peek_slot() in this case.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The issue this assert is guarding against is that in
BTREE_ITER_filter_snapshots mode we only want to be iterating within a
single inode number - if we iterate into another inode number with keys
for a different snapshot tree, we'll loop arbitrarily long before
finding a key we can return.
This comes up in the unit tests, where we're using inode 0 for our test
keys.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If we aren't mounting with the correct degraded option, it's helpful to
know that before we fail to mount degraded.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
casefolding results in additional aliases on lookup for the
non-casefolded names - these need invalidating on unlink.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add casefolding to bch2_lookup_trans:
During the delay between when casefolding was written and when it was
merged, the main filesystem lookup path grew self healing - which meant
it was no longer using bch2_dirent_lookup_trans(), where casefolding on
lookups happens.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
inode_operations.fileattr_(get|set) didn't exist when the various flag
ioctls where implemented - but they do now, which means we can delete a
bunch of ioctl code in favor of standard VFS level wrappers.
Closes: https://lore.kernel.org/linux-bcachefs/7ltgrgqgfummyrlvw7hnfhnu42rfiamoq3lpcvrjnlyytldmzp@yazbhusnztqn/
Cc: Petr Vorel <pvorel@suse.cz>
Cc: Andrea Cervesato <andrea.cervesato@suse.de>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We had a buggy release of bcachefs-tools that wasn't properly aligning
bucket sizes.
We can't ask users to reformat - and it's easy to teach the allocator to
make sure writes are properly aligned.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Previously, copygc and rebalance weren't started until the very end of
mounting, after all recvoery passes have finished.
But copygc really should be started earlier, since it may be needed for
allocations to make forward progress. Additionally, we've been seeing
occasional bug reports where starting the kthread fails due to a pending
signal - i.e. we're getting timed out by systemd (during a version
upgrade), but we're not seeing the signal until mount is about to
complete.
Additionally, we now have copygc/rebalance explicitly wait for
check_snapshots to complete (if being run); they require that for
snapshot_is_ancestor() in the data move path.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Don't use a continue; this simplifies the next patch where
run_recovery_passes() will be responsible for waking up copygc and
rebalance at the appropriate time.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We don't require that bucket size is block size aligned (although it
should be!) - so we need to handle this in the journal code.
This fixes an assertion pop in jorunal_entry_close(), where the journal
entry overruns available space - after rounding it up to block size.
Fixes: https://github.com/koverstreet/bcachefs/issues/854
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Syzbot managed to come up with a filesystem where check/repair got
rather confused at finding a reflink pointer in the inodes btree.
Currently, the "key allowed in this btree" checks only apply at commit
time, not read time - for forwards compatibility. It seems this is too
loose.
Now, strict key type allowed checks apply:
- at commit time (no forward compatibility issues)
- for btree node pointers
- if it's a known btree, known key type, and the key type has the
"BKEY_TYPE_strict_btree_checks" flag.
This means we still have the option of using generic key types - e.g.
KEY_TYPE_error, KEY_TYPE_set - on more existing btrees in the future,
while most key types that are intended for only a specific btree get
stricter checks.
Reported-by: syzbot+baee8591f336cab0958b@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We now more often do repair automatically, without the user invoking
fsck - and sometimes that can involve fixing lots of errors, so let's
avoid flooding the dmesg log.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Don't set JOURNAL_running until we're also calling
journal_space_available() for the first time.
If JOURNAL_running is set, shutdown will write an empty journal entry -
but this will hit an assert in journal_entry_open() if we've never
called journal_space_available().
Reported-by: syzbot+53bb24d476ef8368a7f0@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Usual set of small fixes/logging improvements.
One bigger user reported fix, for inode <-> dirent inconsistencies
reported in fsck, after moving a subvolume that had been snapshotted.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmgBYaIACgkQE6szbY3K
bnYcGQ/+K+LsEvGAZ5wtTwUN4KqJIYWhcHYcuLS2mHKf8PMbgYhL7TjmCwb9VWyr
0+GFQcJgfLsl++kX4j7CjG4gHd22aLiwbhMDmSt3r6c4aF29rG+zCpe4W1+7o60k
UIKokfbLUV6b+0vF5bA/W3PmtXK7S8E0yAPMfWxv4/sACu8RUvrUJtrUCKEWwLzC
bcrRGsN91456qNhCrOp3e4t3yZjiGtZIz+SbPYIxdNrZYIMURlGUm+f9sLH3O+2R
NKsi41sggo/TmgUyspH3KCtMT88IDbN07F7O9/zcxgtdfzfC9l9FI6HnvRVSHDOV
boFaH/NdRaIbg+O5kqZXYul+/EPXsYp5B77TL6KQ3jhv3q16uwpv9EL4v6HIwvz9
BTDOfI2y/+YWHMfrtzXgh3C9dZDPS7qxFFWjSjCs/lXwKVz46RjBWVmtQoTJSEmb
Ee29kBGMpkwmH8fqr5KQheJUIeYewpyTVeB6orgtshnrr+aezS6zunIbk7fJ6+Ng
Tc08H/Aqc2KGcyBS3KTLhbReQ1clQKGOqWJymeb1p2V3SMXfABMbh61B1VU1XulC
Al5B7/w/WPwb+T2XZIM2qbmeoRJ8OBara5RWkx4HN8pcYuWV8H6GWJtRJQD/eKSO
pOT5bz8z9N2n/otwrfLT5lfO2fNW1mULCAamn6iSzR+EDHyuaMU=
=pi0D
-----END PGP SIGNATURE-----
Merge tag 'bcachefs-2025-04-17' of git://evilpiepirate.org/bcachefs
Pull bcachefs fixes from Kent Overstreet:
"Usual set of small fixes/logging improvements.
One bigger user reported fix, for inode <-> dirent inconsistencies
reported in fsck, after moving a subvolume that had been snapshotted"
* tag 'bcachefs-2025-04-17' of git://evilpiepirate.org/bcachefs:
bcachefs: Fix snapshotting a subvolume, then renaming it
bcachefs: Add missing READ_ONCE() for metadata replicas
bcachefs: snapshot_node_missing is now autofix
bcachefs: Log message when incompat version requested but not enabled
bcachefs: Print version_incompat_allowed on startup
bcachefs: Silence extent_poisoned error messages
bcachefs: btree_root_unreadable_and_scan_found_nothing now AUTOFIX
bcachefs: fix bch2_dev_usage_full_read_fast()
bcachefs: Don't print data read retry success on non-errors
bcachefs: Add missing error handling
bcachefs: Prevent granting write refs when filesystem is read-only
Subvolume roots and the dirents that point to them are special; they
don't obey the normal snapshot versioning rules because they cross
snapshot boundaries.
We don't keep around older versions of subvolume dirents on rename - we
don't need to, because subvolume dirents are only visible in the parent
subvolume, and we wouldn't be able to match up the different dirent and
inode versions due to crossing the snapshot ID boundary.
That means that when we rename a subvolume, that's been snapshotted, the
older version of the subvolume root will become dangling - it won't have
a dirent that points to it.
That's expected, we just need to tell fsck that this is ok.
Fixes: https://github.com/koverstreet/bcachefs/issues/856
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If we race with the user changing the metadata_replicas setting, this
could cause us to get an incorrectly sized disk reservation.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
extent poisoning is partly so that we don't keep spewing the dmesg log
when we've got unreadable data - we don't want to print these.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This will likely mean that the btree had only one node - there was
nothing or almost nothing in it, and we should reconstruct and continue.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
One reference to bch_dev_usage wasn't updated, which meant we weren't
reading the full bch_dev_usage_full - oops.
Fixes: 955ba7b5ea ("bcachefs: bch_dev_usage_full")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We may end up in the data read retry path when reading cached data and
racing with invalidation, or on checksum error when we were reading into
a userspace buffer that might have been modified while the read was in
flight.
These aren't real errors, so we shouldn't print the 'retry success'
message.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Fix a shutdown WARNING in bch2_dev_free caused by active write I/O
references (ca->io_ref[WRITE]) on a device being freed.
The problem occurs when:
- The filesystem is marked read-only (BCH_FS_rw clear in c->flags).
- A subsequent operation (e.g., error handling for device removal)
incorrectly tries to grant write references back to a device.
- During final shutdown, the read-only flag causes the system to skip
stopping write I/O references (bch2_dev_io_ref_stop(ca, WRITE)).
- The leftover active write reference triggers the WARN_ON in
bch2_dev_free.
Prevent this by checking if the filesystem is read-only before
attempting to grant write references to a device in the problematic
code path. Ensure consistency between the filesystem state flag
and the device I/O reference state during shutdown.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Mostly minor fixes.
Eric Biggers' crypto API conversion is included because of long standing
sporadic crashes - mostly, but not entirely syzbot - in the crypto API
code when calling poly1305, which have been nigh impossible to reproduce
and debug. His rework deletes the code where we've seen the crashes, so
either it'll be a fix or we'll end up with backtraces we can debug.
(Thanks Eric!).
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmf4Yq8ACgkQE6szbY3K
bnb1fxAAu68Ll/4PLWr3xHVp7ETWgEZSzuwRqA87fqs/Q0jNRC2aDIO03Wmj28qM
ckEM3PFPuiQubLhOmU21Osta/sFU6GxL0IggMoEC50F5XiVlcKRiNSWhRLnr07Qp
v7sc5MQ1HDGnZXQMcdRymzRWixn2hzdRMwKOvbhBHuj0YUPd4yk5I/+AtJi5SAnT
ri+dGIQLBRmG7J2pd4AZNza0YZ5pkTAFj9/4wRVoJdKX2pPzf40e7qzYNPt4/6rK
A6P9ecU3TDDDQEE4S7s8Dng4rXwsa+9qTUcpXnTTC1L6YbbnZd/IQYzWI4b+FUsS
wqnUD+aE7UEMANZh891QlJpGj3ih/6z8opUP4T6RdsVuJwt9X1vFJY99CsOTEf1o
7jAcssL+ueEWPZj8tBoN1niujikyFsXM+xKiUOMZxbuM6BhE40j/WrA77sRhI5+I
7DXlf5s8SDh+gw0IGUboBJe3ofGisRXnfxeZAKQHGHgtEFboY4bDAURcGW4MbIqE
uN5Cd+5IJlcKmJdXLCbHMb5KktfBNWu9/VrOMcZ2QHhIuOfd3fFgLzE0ZEroj4lN
kTWxpzKeNDt3bPF4esYnvduafHDbzClwfkTt5TBgcOeE4TcIL2mOmweLE2LTKIwW
xr5Xhqx1/9//PeaOTwxbCoeZ26G0Q9B8L1+eUZgjPS0FcRdZH9w=
=DpNC
-----END PGP SIGNATURE-----
Merge tag 'bcachefs-2025-04-10' of git://evilpiepirate.org/bcachefs
Pull bcachefs fixes from Kent Overstreet:
"Mostly minor fixes.
Eric Biggers' crypto API conversion is included because of long
standing sporadic crashes - mostly, but not entirely syzbot - in the
crypto API code when calling poly1305, which have been nigh impossible
to reproduce and debug.
His rework deletes the code where we've seen the crashes, so either
it'll be a fix or we'll end up with backtraces we can debug. (Thanks
Eric!)"
* tag 'bcachefs-2025-04-10' of git://evilpiepirate.org/bcachefs:
bcachefs: Use sort_nonatomic() instead of sort()
bcachefs: Remove unnecessary softdep on xxhash
bcachefs: use library APIs for ChaCha20 and Poly1305
bcachefs: Fix duplicate "ro,read_only" in opts at startup
bcachefs: Fix UAF in bchfs_read()
bcachefs: Use cpu_to_le16 for dirent lengths
bcachefs: Fix type for parameter in journal_advance_devs_to_next_bucket
bcachefs: Fix escape sequence in prt_printf
Finish cleaning up the CRC kconfig options by removing the remaining
unnecessary prompts and an unnecessary 'default y', removing
CONFIG_LIBCRC32C, and documenting all the CRC library options.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCZ/P7QhQcZWJpZ2dlcnNA
Z29vZ2xlLmNvbQAKCRDzXCl4vpKOKyoOAQCynFcS1dWuD27S+SdUREmBjMAoZo5M
zdsIvlPv9KLycgD/QX5lXjW3KIYY6jQ8vHUuLVwfDl/JEp4GJS9dLGU+agg=
=0R1T
-----END PGP SIGNATURE-----
Merge tag 'crc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux
Pull CRC cleanups from Eric Biggers:
"Finish cleaning up the CRC kconfig options by removing the remaining
unnecessary prompts and an unnecessary 'default y', removing
CONFIG_LIBCRC32C, and documenting all the CRC library options"
* tag 'crc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux:
lib/crc: remove CONFIG_LIBCRC32C
lib/crc: document all the CRC library kconfig options
lib/crc: remove unnecessary prompt for CONFIG_CRC_ITU_T
lib/crc: remove unnecessary prompt for CONFIG_CRC_T10DIF
lib/crc: remove unnecessary prompt for CONFIG_CRC16
lib/crc: remove unnecessary prompt for CONFIG_CRC_CCITT
lib/crc: remove unnecessary prompt for CONFIG_CRC32 and drop 'default y'
Fixes "task out to lunch" warnings during recovery on large machines
with lots of dirty data in the journal.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
As with the other algorithms, bcachefs does not access xxhash through
the crypto API. So there is no need to use a module softdep to ensure
that it is loaded.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Just use the ChaCha20 and Poly1305 libraries instead of the clunky
crypto API. This is much simpler. It is also slightly faster, since
the libraries provide more direct access to the same
architecture-optimized ChaCha20 and Poly1305 code.
I've tested that existing encrypted bcachefs filesystems can be continue
to be accessed with this patch applied.
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Commit 3ba0240a87 fixed a bug in the read retry path in __bch2_read(),
and changed bchfs_read() to match - to avoid a landmine if
bch2_read_extent() ever starts returning transaction restarts.
But that was incorrect, because bchfs_read() doesn't use a separate
stack allocated bvec_iter, it uses the one in the rbio being submitted.
Add a comment explaining the issue, and revert the buggy change.
Fixes: 3ba0240a87 ("bcachefs: Fix silent short reads in data read retry path")
Reported-by: syzbot+2deb10b8dc9aae6fab67@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Prevent incorrect byte ordering for big-endian systems.
Signed-off-by: Gabriel Shahrouzi <gshahrouzi@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Replace u64 with __le64 to match the expected parameter type. Ensure consistency both in function calls and within the function itself.
Signed-off-by: Gabriel Shahrouzi <gshahrouzi@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Remove backslash before format specifier. Ensure correct output.
Signed-off-by: Gabriel Shahrouzi <gshahrouzi@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
timer_delete[_sync]() replaces del_timer[_sync](). Convert the whole tree
over and remove the historical wrapper inlines.
Conversion was done with coccinelle plus manual fixups where necessary.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that LIBCRC32C does nothing besides select CRC32, make every option
that selects LIBCRC32C instead select CRC32 directly. Then remove
LIBCRC32C.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20250401221600.24878-8-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
More notable fixes:
- Fix for striping behaviour on tiering filesystems where replicas
exceeds durability on destination target
- Fix a race in device removal where deleting alloc info races with the
discard worker
- Some small stack usage improvements: this is just enough for KMSAN
builds to not blow the stack, more is queued up for 6.16.
-----BEGIN PGP SIGNATURE-----
iQIyBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmfu7pMACgkQE6szbY3K
bnY6rw/3W4dho57OPjOoHUbQ7A7IK1hI4SFvGtDDb3vX1RjF2r+RbdBupMAd0zGj
T+5SzhYCQLGbyfBa6MW+iPqiHokZsG904+3mRogOf9cpz2Mup9ZOq/vV+Z7ndaF8
2i9wpQTb7GShkSaXkeTvQqnx3YAUxVcRB2ExraTXmv4wIxr1SYyJEeakmMBDasJB
UanXXVHzrKo9WLiqWz0JSZCiuQW2v03P84zZo1d/GyMKlTxYDt5aAteos77lJBef
5CWVr4/HKKozt/vI2qHQ+3LJXktLjvb07zoENXwadmgQawYA3nQ+9jLT3Q0FKjXG
bK28AHTtiXgWsYsbCs5sVh1+WLPdEj0UBBoFZGWo++TzaN2hXhoMsFTQfuddhaEh
W63MWtelv4TGIVOEFk+ayHRgPL6ajhCsa1boHS9EKdosl2nl9Vk9Nq0i++hYZDGW
KhWqENT9E5EpVCnZ6H4m1tsXprWavNqXnkOJzXW0T2F3t8+94zp1n6YXkwDdgLfs
l+xTEEAL5J8lvlfSS6dW7QcMSMtMKbo3+qlerpH8J4zBZJBbb2nF1ggCtpYg6zFt
4Jgs5FPQLVqWsPXQr4CaSF2UIt3zMPnNIawL1cEpRBU1j35qo0e/kxIjEpS0Pnjt
mX67gBlodY54/pwGGLfc/Vkw4xqh//dqTmYIdHkibdAEvKf0dg==
=2TfM
-----END PGP SIGNATURE-----
Merge tag 'bcachefs-2025-04-03' of git://evilpiepirate.org/bcachefs
Pull more bcachefs updates from Kent Overstreet:
"More notable fixes:
- Fix for striping behaviour on tiering filesystems where replicas
exceeds durability on destination target
- Fix a race in device removal where deleting alloc info races with
the discard worker
- Some small stack usage improvements: this is just enough for KMSAN
builds to not blow the stack, more is queued up for 6.16"
* tag 'bcachefs-2025-04-03' of git://evilpiepirate.org/bcachefs:
bcachefs: Fix "journal stuck" during recovery
bcachefs: backpointer_get_key: check for null from peek_slot()
bcachefs: Fix null ptr deref in invalidate_one_bucket()
bcachefs: Fix check_snapshot_exists() restart handling
bcachefs: use nonblocking variant of print_string_as_lines in error path
bcachefs: Fix scheduling while atomic from logging changes
bcachefs: Add error handling for zlib_deflateInit2()
bcachefs: add missing selection of XARRAY_MULTI
bcachefs: bch_dev_usage_full
bcachefs: Kill btree_iter.trans
bcachefs: do_trace_key_cache_fill()
bcachefs: Split up bch_dev.io_ref
bcachefs: fix ref leak in btree_node_read_all_replicas
bcachefs: Fix null ptr deref in bch2_write_endio()
bcachefs: Fix field spanning write warning
bcachefs: Fix striping behaviour
If we crash when the journal pin fifo is completely full - i.e. we're
at the maximum number of dirty journal entries - that may put us in a
sticky situation in recovery, as journal replay will need to be able to
open new journal entries in order to get going.
bch2_fs_journal_start() already had provisions for resizing the journal
pin fifo if needed, but it needs a fudge factor to ensure there's room
for journal replay.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
peek_slot() doesn't normally return bkey_s_c_null - except when we ask
for a key at a btree level that doesn't exist, which can happen here.
We might want to revisit this, but we'll have to look over all the
places where we use peek_slot() on interior nodes.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_backpointer_get_key() returns bkey_s_c_null when the target isn't
found.
backpointer_get_key() flags the error, so there's nothing else to do
here - just skip it and move on.
Link: https://github.com/koverstreet/bcachefs/issues/847
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Codepaths that create entries in the snapshots btree currently call
bch2_mark_snapshot(), which updates the in-memory snapshot table, before
transaction commit.
This is because bch2_mark_snapshot() is an atomic trigger, run with
btree write locks held, and isn't allowed to fail - but it might need to
reallocate the table, hence we call it early when we're still allowed to
fail.
This is generally harmless - if we fail, we'll have left an entry in the
snapshots table around, but nothing will reference it and it'll get
overwritten if reused by another transaction.
But check_snapshot_exists(), which reconstructs snapshots when the
snapshots btree has been corrupted or lost, was erronously rechecking if
the snapshot exists inside the transaction commit loop - so on
transaction restart (in this case mem_realloced), the second iteration
would return without repairing.
This code needs some cleanup: splitting out a "maybe realloc snapshots
table" helper would have avoided this, that will be in the next patch.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The inconsistency error path calls print_string_as_lines, which calls
console_lock, which is a potentially-sleeping function and so can't be
called in an atomic context.
Replace calls to it with the nonblocking variant which is safe to call.
Signed-off-by: Bharadwaj Raju <bharadwaj.raju777@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Two fixes from the recent logging changes:
bch2_inconsistent(), bch2_fs_inconsistent() be called from interrupt
context, or with rcu_read_lock() held.
The one syzbot found is in
bch2_bkey_pick_read_device
bch2_dev_rcu
bch2_fs_inconsistent
We're starting to switch to lift the printbufs up to higher levels so we
can emit better log messages and print them all in one go (avoid
garbling), so that conversion will help with spotting these in the
future; when we declare a printbuf it must be flagged if we're in an
atomic context.
Secondly, in btree_node_write_endio:
00085 BUG: sleeping function called from invalid context at include/linux/sched/mm.h:321
00085 in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 618, name: bch-reclaim/fa6
00085 preempt_count: 10001, expected: 0
00085 RCU nest depth: 0, expected: 0
00085 4 locks held by bch-reclaim/fa6/618:
00085 #0: ffffff80d7ccad68 (&j->reclaim_lock){+.+.}-{4:4}, at: bch2_journal_reclaim_thread+0x84/0x198
00085 #1: ffffff80d7c84218 (&c->btree_trans_barrier){.+.+}-{0:0}, at: __bch2_trans_get+0x1c0/0x440
00085 #2: ffffff80cd3f8140 (bcachefs_btree){+.+.}-{0:0}, at: __bch2_trans_get+0x22c/0x440
00085 #3: ffffff80c3823c20 (&vblk->vqs[i].lock){-.-.}-{3:3}, at: virtblk_done+0x58/0x130
00085 irq event stamp: 328
00085 hardirqs last enabled at (327): [<ffffffc080073a14>] finish_task_switch.isra.0+0xbc/0x2a0
00085 hardirqs last disabled at (328): [<ffffffc080971a10>] el1_interrupt+0x20/0x60
00085 softirqs last enabled at (0): [<ffffffc08002f920>] copy_process+0x7c8/0x2118
00085 softirqs last disabled at (0): [<0000000000000000>] 0x0
00085 Preemption disabled at:
00085 [<ffffffc08003ada0>] irq_enter_rcu+0x18/0x90
00085 CPU: 8 UID: 0 PID: 618 Comm: bch-reclaim/fa6 Not tainted 6.14.0-rc6-ktest-g04630bde23e8 #18798
00085 Hardware name: linux,dummy-virt (DT)
00085 Call trace:
00085 show_stack+0x1c/0x30 (C)
00085 dump_stack_lvl+0x84/0xc0
00085 dump_stack+0x14/0x20
00085 __might_resched+0x180/0x288
00085 __might_sleep+0x4c/0x88
00085 __kmalloc_node_track_caller_noprof+0x34c/0x3e0
00085 krealloc_noprof+0x1a0/0x2d8
00085 bch2_printbuf_make_room+0x9c/0x120
00085 bch2_prt_printf+0x60/0x1b8
00085 btree_node_write_endio+0x1b0/0x2d8
00085 bio_endio+0x138/0x1f0
00085 btree_node_write_endio+0xe8/0x2d8
00085 bio_endio+0x138/0x1f0
00085 blk_update_request+0x220/0x4c0
00085 blk_mq_end_request+0x28/0x148
00085 virtblk_request_done+0x64/0xe8
00085 blk_mq_complete_request+0x34/0x40
00085 virtblk_done+0x78/0x130
00085 vring_interrupt+0x6c/0xb0
00085 __handle_irq_event_percpu+0x8c/0x2e0
00085 handle_irq_event+0x50/0xb0
00085 handle_fasteoi_irq+0xc4/0x250
00085 handle_irq_desc+0x44/0x60
00085 generic_handle_domain_irq+0x20/0x30
00085 gic_handle_irq+0x54/0xc8
00085 call_on_irq_stack+0x24/0x40
Reported-by: syzbot+c82cd2906e2f192410bb@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
In attempt_compress(), the return value of zlib_deflateInit2() needs to be
checked. A proper implementation can be found in pstore_compress().
Add an error check and return 0 immediately if the initialzation fails.
Fixes: 986e9842fb ("bcachefs: Compression levels")
Signed-off-by: Wentao Liang <vulab@iscas.ac.cn>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When CONFIG_XARRAY_MULTI is not set, reading from a bcachefs file hits
the 'BUG_ON(order > 0);' in xas_set_order(), because it tries to insert
a large folio in the page cache. Fix this by making bcachefs select
XARRAY_MULTI.
Fixes: be212d86b1 ("bcachefs: bs > ps support")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
All the fastpaths that need device usage don't need the sector totals or
fragmentation, just bucket counts.
Split bch_dev_usage up into two different versions, the normal one with
just bucket counts.
This is also a stack usage improvement, since we have a bch_dev_usage on
the stack in the allocation path.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This was planned to be done ages ago, now finally completed; there are
places where we have quite a few btree_trans objects on the stack, so
this reduces stack usage somewhat.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We now have separate per device io_refs for read and write access.
This fixes a device removal bug where the discard workers were still
running while we're removing alloc info for that device.
It's also a bit of hardening; we no longer allow writes to devices that
are read-only.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Uros Bizjak uses x86 named address space qualifiers to provide
compile-time checking of percpu area accesses.
This has caused a small amount of fallout - two or three issues were
reported. In all cases the calling code was founf to be incorrect.
- The 4 patch series "Some cleanup for memcg" from Chen Ridong
implements some relatively monir cleanups for the memcontrol code.
- The 17 patch series "mm: fixes for device-exclusive entries (hmm)"
from David Hildenbrand fixes a boatload of issues which David found then
using device-exclusive PTE entries when THP is enabled. More work is
needed, but this makes thins better - our own HMM selftests now succeed.
- The 2 patch series "mm: zswap: remove z3fold and zbud" from Yosry
Ahmed remove the z3fold and zbud implementations. They have been
deprecated for half a year and nobody has complained.
- The 5 patch series "mm: further simplify VMA merge operation" from
Lorenzo Stoakes implements numerous simplifications in this area. No
runtime effects are anticipated.
- The 4 patch series "mm/madvise: remove redundant mmap_lock operations
from process_madvise()" from SeongJae Park rationalizes the locking in
the madvise() implementation. Performance gains of 20-25% were observed
in one MADV_DONTNEED microbenchmark.
- The 12 patch series "Tiny cleanup and improvements about SWAP code"
from Baoquan He contains a number of touchups to issues which Baoquan
noticed when working on the swap code.
- The 2 patch series "mm: kmemleak: Usability improvements" from Catalin
Marinas implements a couple of improvements to the kmemleak user-visible
output.
- The 2 patch series "mm/damon/paddr: fix large folios access and
schemes handling" from Usama Arif provides a couple of fixes for DAMON's
handling of large folios.
- The 3 patch series "mm/damon/core: fix wrong and/or useless
damos_walk() behaviors" from SeongJae Park fixes a few issues with the
accuracy of kdamond's walking of DAMON regions.
- The 3 patch series "expose mapping wrprotect, fix fb_defio use" from
Lorenzo Stoakes changes the interaction between framebuffer deferred-io
and core MM. No functional changes are anticipated - this is
preparatory work for the future removal of page structure fields.
- The 4 patch series "mm/damon: add support for hugepage_size DAMOS
filter" from Usama Arif adds a DAMOS filter which permits the filtering
by huge page sizes.
- The 4 patch series "mm: permit guard regions for file-backed/shmem
mappings" from Lorenzo Stoakes extends the guard region feature from its
present "anon mappings only" state. The feature now covers shmem and
file-backed mappings.
- The 4 patch series "mm: batched unmap lazyfree large folios during
reclamation" from Barry Song cleans up and speeds up the unmapping for
pte-mapped large folios.
- The 18 patch series "reimplement per-vma lock as a refcount" from
Suren Baghdasaryan puts the vm_lock back into the vma. Our reasons for
pulling it out were largely bogus and that change made the code more
messy. This patchset provides small (0-10%) improvements on one
microbenchmark.
- The 5 patch series "Docs/mm/damon: misc DAMOS filters documentation
fixes and improves" from SeongJae Park does some maintenance work on the
DAMON docs.
- The 27 patch series "hugetlb/CMA improvements for large systems" from
Frank van der Linden addresses a pile of issues which have been observed
when using CMA on large machines.
- The 2 patch series "mm/damon: introduce DAMOS filter type for unmapped
pages" from SeongJae Park enables users of DMAON/DAMOS to filter my the
page's mapped/unmapped status.
- The 19 patch series "zsmalloc/zram: there be preemption" from Sergey
Senozhatsky teaches zram to run its compression and decompression
operations preemptibly.
- The 12 patch series "selftests/mm: Some cleanups from trying to run
them" from Brendan Jackman fixes a pile of unrelated issues which
Brendan encountered while runnimg our selftests.
- The 2 patch series "fs/proc/task_mmu: add guard region bit to pagemap"
from Lorenzo Stoakes permits userspace to use /proc/pid/pagemap to
determine whether a particular page is a guard page.
- The 7 patch series "mm, swap: remove swap slot cache" from Kairui Song
removes the swap slot cache from the allocation path - it simply wasn't
being effective.
- The 5 patch series "mm: cleanups for device-exclusive entries (hmm)"
from David Hildenbrand implements a number of unrelated cleanups in this
code.
- The 5 patch series "mm: Rework generic PTDUMP configs" from Anshuman
Khandual implements a number of preparatoty cleanups to the
GENERIC_PTDUMP Kconfig logic.
- The 8 patch series "mm/damon: auto-tune aggregation interval" from
SeongJae Park implements a feedback-driven automatic tuning feature for
DAMON's aggregation interval tuning.
- The 5 patch series "Fix lazy mmu mode" from Ryan Roberts fixes some
issues in powerpc, sparc and x86 lazy MMU implementations. Ryan did
this in preparation for implementing lazy mmu mode for arm64 to optimize
vmalloc.
- The 2 patch series "mm/page_alloc: Some clarifications for migratetype
fallback" from Brendan Jackman reworks some commentary to make the code
easier to follow.
- The 3 patch series "page_counter cleanup and size reduction" from
Shakeel Butt cleans up the page_counter code and fixes a size increase
which we accidentally added late last year.
- The 3 patch series "Add a command line option that enables control of
how many threads should be used to allocate huge pages" from Thomas
Prescher does that. It allows the careful operator to significantly
reduce boot time by tuning the parallalization of huge page
initialization.
- The 3 patch series "Fix calculations in trace_balance_dirty_pages()
for cgwb" from Tang Yizhou fixes the tracing output from the dirty page
balancing code.
- The 9 patch series "mm/damon: make allow filters after reject filters
useful and intuitive" from SeongJae Park improves the handling of allow
and reject filters. Behaviour is made more consistent and the
documention is updated accordingly.
- The 5 patch series "Switch zswap to object read/write APIs" from Yosry
Ahmed updates zswap to the new object read/write APIs and thus permits
the removal of some legacy code from zpool and zsmalloc.
- The 6 patch series "Some trivial cleanups for shmem" from Baolin Wang
does as it claims.
- The 20 patch series "fs/dax: Fix ZONE_DEVICE page reference counts"
from Alistair Popple regularizes the weird ZONE_DEVICE page refcount
handling in DAX, permittig the removal of a number of special-case
checks.
- The 4 patch series "refactor mremap and fix bug" from Lorenzo Stoakes
is a preparatoty refactoring and cleanup of the mremap() code.
- The 20 patch series "mm: MM owner tracking for large folios (!hugetlb)
+ CONFIG_NO_PAGE_MAPCOUNT" from David Hildenbrand reworks the manner in
which we determine whether a large folio is known to be mapped
exclusively into a single MM.
- The 8 patch series "mm/damon: add sysfs dirs for managing DAMOS
filters based on handling layers" from SeongJae Park adds a couple of
new sysfs directories to ease the management of DAMON/DAMOS filters.
- The 13 patch series "arch, mm: reduce code duplication in mem_init()"
from Mike Rapoport consolidates many per-arch implementations of
mem_init() into code generic code, where that is practical.
- The 13 patch series "mm/damon/sysfs: commit parameters online via
damon_call()" from SeongJae Park continues the cleaning up of sysfs
access to DAMON internal data.
- The 3 patch series "mm: page_ext: Introduce new iteration API" from
Luiz Capitulino reworks the page_ext initialization to fix a boot-time
crash which was observed with an unusual combination of compile and
cmdline options.
- The 8 patch series "Buddy allocator like (or non-uniform) folio split"
from Zi Yan reworks the code to split a folio into smaller folios. The
main benefit is lessened memory consumption: fewer post-split folios are
generated.
- The 2 patch series "Minimize xa_node allocation during xarry split"
from Zi Yan reduces the number of xarray xa_nodes which are generated
during an xarray split.
- The 2 patch series "drivers/base/memory: Two cleanups" from Gavin Shan
performs some maintenance work on the drivers/base/memory code.
- The 3 patch series "Add tracepoints for lowmem reserves, watermarks
and totalreserve_pages" from Martin Liu adds some more tracepoints to
the page allocator code.
- The 4 patch series "mm/madvise: cleanup requests validations and
classifications" from SeongJae Park cleans up some warts which SeongJae
observed during his earlier madvise work.
- The 3 patch series "mm/hwpoison: Fix regressions in memory failure
handling" from Shuai Xue addresses two quite serious regressions which
Shuai has observed in the memory-failure implementation.
- The 5 patch series "mm: reliable huge page allocator" from Johannes
Weiner makes huge page allocations cheaper and more reliable by reducing
fragmentation.
- The 5 patch series "Minor memcg cleanups & prep for memdescs" from
Matthew Wilcox is preparatory work for the future implementation of
memdescs.
- The 4 patch series "track memory used by balloon drivers" from Nico
Pache introduces a way to track memory used by our various balloon
drivers.
- The 2 patch series "mm/damon: introduce DAMOS filter type for active
pages" from Nhat Pham permits users to filter for active/inactive pages,
separately for file and anon pages.
- The 2 patch series "Adding Proactive Memory Reclaim Statistics" from
Hao Jia separates the proactive reclaim statistics from the direct
reclaim statistics.
- The 2 patch series "mm/vmscan: don't try to reclaim hwpoison folio"
from Jinjiang Tu fixes our handling of hwpoisoned pages within the
reclaim code.
-----BEGIN PGP SIGNATURE-----
iHQEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZ+nZaAAKCRDdBJ7gKXxA
jsOWAPiP4r7CJHMZRK4eyJOkvS1a1r+TsIarrFZtjwvf/GIfAQCEG+JDxVfUaUSF
Ee93qSSLR1BkNdDw+931Pu0mXfbnBw==
=Pn2K
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2025-03-30-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- The series "Enable strict percpu address space checks" from Uros
Bizjak uses x86 named address space qualifiers to provide
compile-time checking of percpu area accesses.
This has caused a small amount of fallout - two or three issues were
reported. In all cases the calling code was found to be incorrect.
- The series "Some cleanup for memcg" from Chen Ridong implements some
relatively monir cleanups for the memcontrol code.
- The series "mm: fixes for device-exclusive entries (hmm)" from David
Hildenbrand fixes a boatload of issues which David found then using
device-exclusive PTE entries when THP is enabled. More work is
needed, but this makes thins better - our own HMM selftests now
succeed.
- The series "mm: zswap: remove z3fold and zbud" from Yosry Ahmed
remove the z3fold and zbud implementations. They have been deprecated
for half a year and nobody has complained.
- The series "mm: further simplify VMA merge operation" from Lorenzo
Stoakes implements numerous simplifications in this area. No runtime
effects are anticipated.
- The series "mm/madvise: remove redundant mmap_lock operations from
process_madvise()" from SeongJae Park rationalizes the locking in the
madvise() implementation. Performance gains of 20-25% were observed
in one MADV_DONTNEED microbenchmark.
- The series "Tiny cleanup and improvements about SWAP code" from
Baoquan He contains a number of touchups to issues which Baoquan
noticed when working on the swap code.
- The series "mm: kmemleak: Usability improvements" from Catalin
Marinas implements a couple of improvements to the kmemleak
user-visible output.
- The series "mm/damon/paddr: fix large folios access and schemes
handling" from Usama Arif provides a couple of fixes for DAMON's
handling of large folios.
- The series "mm/damon/core: fix wrong and/or useless damos_walk()
behaviors" from SeongJae Park fixes a few issues with the accuracy of
kdamond's walking of DAMON regions.
- The series "expose mapping wrprotect, fix fb_defio use" from Lorenzo
Stoakes changes the interaction between framebuffer deferred-io and
core MM. No functional changes are anticipated - this is preparatory
work for the future removal of page structure fields.
- The series "mm/damon: add support for hugepage_size DAMOS filter"
from Usama Arif adds a DAMOS filter which permits the filtering by
huge page sizes.
- The series "mm: permit guard regions for file-backed/shmem mappings"
from Lorenzo Stoakes extends the guard region feature from its
present "anon mappings only" state. The feature now covers shmem and
file-backed mappings.
- The series "mm: batched unmap lazyfree large folios during
reclamation" from Barry Song cleans up and speeds up the unmapping
for pte-mapped large folios.
- The series "reimplement per-vma lock as a refcount" from Suren
Baghdasaryan puts the vm_lock back into the vma. Our reasons for
pulling it out were largely bogus and that change made the code more
messy. This patchset provides small (0-10%) improvements on one
microbenchmark.
- The series "Docs/mm/damon: misc DAMOS filters documentation fixes and
improves" from SeongJae Park does some maintenance work on the DAMON
docs.
- The series "hugetlb/CMA improvements for large systems" from Frank
van der Linden addresses a pile of issues which have been observed
when using CMA on large machines.
- The series "mm/damon: introduce DAMOS filter type for unmapped pages"
from SeongJae Park enables users of DMAON/DAMOS to filter my the
page's mapped/unmapped status.
- The series "zsmalloc/zram: there be preemption" from Sergey
Senozhatsky teaches zram to run its compression and decompression
operations preemptibly.
- The series "selftests/mm: Some cleanups from trying to run them" from
Brendan Jackman fixes a pile of unrelated issues which Brendan
encountered while runnimg our selftests.
- The series "fs/proc/task_mmu: add guard region bit to pagemap" from
Lorenzo Stoakes permits userspace to use /proc/pid/pagemap to
determine whether a particular page is a guard page.
- The series "mm, swap: remove swap slot cache" from Kairui Song
removes the swap slot cache from the allocation path - it simply
wasn't being effective.
- The series "mm: cleanups for device-exclusive entries (hmm)" from
David Hildenbrand implements a number of unrelated cleanups in this
code.
- The series "mm: Rework generic PTDUMP configs" from Anshuman Khandual
implements a number of preparatoty cleanups to the GENERIC_PTDUMP
Kconfig logic.
- The series "mm/damon: auto-tune aggregation interval" from SeongJae
Park implements a feedback-driven automatic tuning feature for
DAMON's aggregation interval tuning.
- The series "Fix lazy mmu mode" from Ryan Roberts fixes some issues in
powerpc, sparc and x86 lazy MMU implementations. Ryan did this in
preparation for implementing lazy mmu mode for arm64 to optimize
vmalloc.
- The series "mm/page_alloc: Some clarifications for migratetype
fallback" from Brendan Jackman reworks some commentary to make the
code easier to follow.
- The series "page_counter cleanup and size reduction" from Shakeel
Butt cleans up the page_counter code and fixes a size increase which
we accidentally added late last year.
- The series "Add a command line option that enables control of how
many threads should be used to allocate huge pages" from Thomas
Prescher does that. It allows the careful operator to significantly
reduce boot time by tuning the parallalization of huge page
initialization.
- The series "Fix calculations in trace_balance_dirty_pages() for cgwb"
from Tang Yizhou fixes the tracing output from the dirty page
balancing code.
- The series "mm/damon: make allow filters after reject filters useful
and intuitive" from SeongJae Park improves the handling of allow and
reject filters. Behaviour is made more consistent and the documention
is updated accordingly.
- The series "Switch zswap to object read/write APIs" from Yosry Ahmed
updates zswap to the new object read/write APIs and thus permits the
removal of some legacy code from zpool and zsmalloc.
- The series "Some trivial cleanups for shmem" from Baolin Wang does as
it claims.
- The series "fs/dax: Fix ZONE_DEVICE page reference counts" from
Alistair Popple regularizes the weird ZONE_DEVICE page refcount
handling in DAX, permittig the removal of a number of special-case
checks.
- The series "refactor mremap and fix bug" from Lorenzo Stoakes is a
preparatoty refactoring and cleanup of the mremap() code.
- The series "mm: MM owner tracking for large folios (!hugetlb) +
CONFIG_NO_PAGE_MAPCOUNT" from David Hildenbrand reworks the manner in
which we determine whether a large folio is known to be mapped
exclusively into a single MM.
- The series "mm/damon: add sysfs dirs for managing DAMOS filters based
on handling layers" from SeongJae Park adds a couple of new sysfs
directories to ease the management of DAMON/DAMOS filters.
- The series "arch, mm: reduce code duplication in mem_init()" from
Mike Rapoport consolidates many per-arch implementations of
mem_init() into code generic code, where that is practical.
- The series "mm/damon/sysfs: commit parameters online via
damon_call()" from SeongJae Park continues the cleaning up of sysfs
access to DAMON internal data.
- The series "mm: page_ext: Introduce new iteration API" from Luiz
Capitulino reworks the page_ext initialization to fix a boot-time
crash which was observed with an unusual combination of compile and
cmdline options.
- The series "Buddy allocator like (or non-uniform) folio split" from
Zi Yan reworks the code to split a folio into smaller folios. The
main benefit is lessened memory consumption: fewer post-split folios
are generated.
- The series "Minimize xa_node allocation during xarry split" from Zi
Yan reduces the number of xarray xa_nodes which are generated during
an xarray split.
- The series "drivers/base/memory: Two cleanups" from Gavin Shan
performs some maintenance work on the drivers/base/memory code.
- The series "Add tracepoints for lowmem reserves, watermarks and
totalreserve_pages" from Martin Liu adds some more tracepoints to the
page allocator code.
- The series "mm/madvise: cleanup requests validations and
classifications" from SeongJae Park cleans up some warts which
SeongJae observed during his earlier madvise work.
- The series "mm/hwpoison: Fix regressions in memory failure handling"
from Shuai Xue addresses two quite serious regressions which Shuai
has observed in the memory-failure implementation.
- The series "mm: reliable huge page allocator" from Johannes Weiner
makes huge page allocations cheaper and more reliable by reducing
fragmentation.
- The series "Minor memcg cleanups & prep for memdescs" from Matthew
Wilcox is preparatory work for the future implementation of memdescs.
- The series "track memory used by balloon drivers" from Nico Pache
introduces a way to track memory used by our various balloon drivers.
- The series "mm/damon: introduce DAMOS filter type for active pages"
from Nhat Pham permits users to filter for active/inactive pages,
separately for file and anon pages.
- The series "Adding Proactive Memory Reclaim Statistics" from Hao Jia
separates the proactive reclaim statistics from the direct reclaim
statistics.
- The series "mm/vmscan: don't try to reclaim hwpoison folio" from
Jinjiang Tu fixes our handling of hwpoisoned pages within the reclaim
code.
* tag 'mm-stable-2025-03-30-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (431 commits)
mm/page_alloc: remove unnecessary __maybe_unused in order_to_pindex()
x86/mm: restore early initialization of high_memory for 32-bits
mm/vmscan: don't try to reclaim hwpoison folio
mm/hwpoison: introduce folio_contain_hwpoisoned_page() helper
cgroup: docs: add pswpin and pswpout items in cgroup v2 doc
mm: vmscan: split proactive reclaim statistics from direct reclaim statistics
selftests/mm: speed up split_huge_page_test
selftests/mm: uffd-unit-tests support for hugepages > 2M
docs/mm/damon/design: document active DAMOS filter type
mm/damon: implement a new DAMOS filter type for active pages
fs/dax: don't disassociate zero page entries
MM documentation: add "Unaccepted" meminfo entry
selftests/mm: add commentary about 9pfs bugs
fork: use __vmalloc_node() for stack allocation
docs/mm: Physical Memory: Populate the "Zones" section
xen: balloon: update the NR_BALLOON_PAGES state
hv_balloon: update the NR_BALLOON_PAGES state
balloon_compaction: update the NR_BALLOON_PAGES state
meminfo: add a per node counter for balloon drivers
mm: remove references to folio in __memcg_kmem_uncharge_page()
...
All bugfixes and logging improvements.
Minor merge conflict, see:
https://lore.kernel.org/linux-next/20250331092816.778a7c83@canb.auug.org.au/T/#u
CI says the fs-next tree is good:
https://evilpiepirate.org/~testdashboard/ci?user=fs-next&branch=master
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmfqyxsACgkQE6szbY3K
bnY4aQ/7BylMgHZsAG2OLRRtegCsuFZ5fZt148TObofSGTTDPcVKYQWcz249Hlao
RZzv9nbqq2M7fJrUK5Xloc4DA0ICuWIh9n+uRf+5od7JgtygjJpXqMRz9HrGBtTo
QZJE/wAzsa1A8xBORjWVki4koHT3YivaMW2zdgbHIHWTjDJso5Es7RW0/WZYv3lW
cFLFEfOnBCEXckhEtK7TAPnJpHEPw+/d0bMFU/PIHbokUwTjxCR0bmRL/RecKUXa
5U1o1x7gAo1iPi3XPGLJVVxXWgjmxzQlF/3aXva+DYaeLgxPMqKxUlC6hkV4f6Oc
9lH/w1pEiCMcANbbp2E3Q91sDFRlafFCgvsKhEz79W5WoNq+vSrxLhLaynyuBT/K
lfoiig6IFRTWJDYHu2L6YHFMmp8JOxgJSJ0+dcgyVRnaDJQeGgbuv1tEldonQLsg
9DT8iRJpVDomffwPUoVhujlvJOqUi8zFkxyMCgVWExFzC3ief2B5s3D4uLXcpApO
nZfb01W0ElW7qBMQxjyD0Vy+wY8EryzTht9ZKJq5Id1T/LWc9Qi+jPaY86OBC9/w
GJgW9OcYLFjYdsDokk5XkwOd/IAXz6fU+vHGtahFJPVfH4T8zzdBnxfPbiR2mXo8
4EfeNmRevZP/oK7/2l2cqIzY7tYBJBUK1gFyvz1+7bcuFwVI8rc=
=Udka
-----END PGP SIGNATURE-----
Merge tag 'bcachefs-2025-03-31' of git://evilpiepirate.org/bcachefs
Pull more bcachefs updates from Kent Overstreet:
"All bugfixes and logging improvements"
* tag 'bcachefs-2025-03-31' of git://evilpiepirate.org/bcachefs: (35 commits)
bcachefs: fix bch2_write_point_to_text() units
bcachefs: Log original key being moved in data updates
bcachefs: BCH_JSET_ENTRY_log_bkey
bcachefs: Reorder error messages that include journal debug
bcachefs: Don't use designated initializers for disk_accounting_pos
bcachefs: Silence errors after emergency shutdown
bcachefs: fix units in rebalance_status
bcachefs: bch2_ioctl_subvolume_destroy() fixes
bcachefs: Clear fs_path_parent on subvolume unlink
bcachefs: Change btree_insert_node() assertion to error
bcachefs: Better printing of inconsistency errors
bcachefs: bch2_count_fsck_err()
bcachefs: Better helpers for inconsistency errors
bcachefs: Consistent indentation of multiline fsck errors
bcachefs: Add an "ignore unknown" option to bch2_parse_mount_opts()
bcachefs: bch2_time_stats_init_no_pcpu()
bcachefs: Fix bch2_fs_get_tree() error path
bcachefs: fix logging in journal_entry_err_msg()
bcachefs: add missing newline in bch2_trans_updates_to_text()
bcachefs: print_string_as_lines: fix extra newline
...
This was previously hard to hit since it requires racing with device
removal, but splitting up io_ref uncovered it.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
For striping across devices, we maintain "clocks", and we advance them
by the inverse of "how much free space this device has left", so that we
round robin biased in favor of devices with more free space.
This code was originally trying to do EWMA-ish stuff when originally
written, ~10 years ago, and was never properly cleaned up when it was
realized that an EWMA is not the right approach here.
That left a bug, when we rescale to keep all the clocks in the correct
range and prevent overflow.
It was assumed that we'd always be allocated from the device with the
smallest clock hand, but that's actually not correct: with the target
options, allocations will be first tried from a subset of devices, and
then the entire filesystem if that fails.
Thus, the rescale from the first allocation - allocating from a subset
of devices - can pick the wrong rescale value and cause the rest of the
clocks to go to 0, losing information.
This resuls in incorrect striping behaviour when the desired number of
replicas doesn't fit on the foreground target.
Link: https://www.reddit.com/r/bcachefs/comments/1jn3t26/replica_allocation_not_evenly_distributed_among/
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
There's something going on with the data move path; log the original key
being moved for debugging.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add a journal entry type for logging - but logging a bkey, not a string;
to be used for data move path debugging.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Not all compilers fully initialize these - they're not guaranteed to
because of the union shenanigans.
Fixes: https://github.com/koverstreet/bcachefs/issues/844
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We don't care about errors from asynchronous ops that were because we
did an emergency shutdown; silence them.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_evict_subvolume_inodes() was getting stuck - due to incorrectly
pruning the dcache.
Also, fix missing permissions checks.
Reported-by: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This fixes recursive subvolume removal.
Subvolume deletion is asynchronous; fs_path_parent, and thus the entry
in the subvolume_children btree, need to be cleared when the subvolume
is unlinked from the fs heirarchy - else we'll spuriously think a
subvolume has children and deletion will fail.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Build up and emit the error message for an inconsistency error all at
once, instead of spread over multiple printk calls, so they're not
jumbled in the dmesg log.
Also, add better indenting.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Factor out a helper from __bch2_fsck_err(), for counting the error in
the superblock and deciding whether to print or ratelimit - will be used
to replace some log_fsck_err() calls, where we want to lift out printing
the error message.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
An inconsistency error often happens as part of an event with multiple
error messages, and we want to build up one single error message with
proper indenting to produce more readable log messages that don't get
garbled.
Add new helpers that emit messages to a printbuf instead of printing
them directly, next patch will convert to use them.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add the new helper printbuf_indent_add_nextline(), and use it in
__bch2_fsck_err() to centralize setting the indentation of multiline
fsck errors.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
To be used by the mount helper in userspace, where we still have options
to be parsed by other layers.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add a mode to disable automatic switching to percpu mode, useful when a
time_stats will only be used by one thread and we don't want to have to
flush the percpu buffers.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When a filesystem is mounted read-only, subsequent attempts to mount it
as read-write fail with EBUSY. Previously, the error path in
bch2_fs_get_tree() would unconditionally call __bch2_fs_stop(),
improperly freeing resources for a filesystem that was still actively
mounted. This change modifies the error path to only call
__bch2_fs_stop() if the superblock has no valid root dentry, ensuring
resources are not cleaned up prematurely when the filesystem is in use.
Signed-off-by: Florian Albrechtskirchinger <falbrechtskirchinger@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Don't print a newline on empty string; this was causing us to also print
an extra newline when we got to the end of th string.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
syzbot discovered that this one is possible: we have pointers, but none
of them are to valid devices.
Reported-by: syzbot+336a6e6a2dbb7d4dba9a@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The hole we find in the btree might be fully dirty in the page cache. If
so, keep searching.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We can't call bch2_seek_pagecache_hole(), and block on page locks, with
btree locks held.
This is easily fixed because we're at the end of the transaction - we
can just unlock, we don't need a drop_locks_do().
Reported-by: https://github.com/nagalun
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
state_lock guards against devices coming or leaving, changing state, or
the filesystem changing between ro <-> rw.
But it's not necessary for running recovery passes, and holding it
blocks asynchronous events that would cause us to go RO or kick out
devices.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
There's no reason for this not to be world readable - it provides the
currently supported on disk format version.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
On disk format is now soft frozen: no more required/automatic are
anticipated before taking off the experimental label.
Major changes/features since 6.14:
- Scrub
- Blocksize greater than page size support
- A number of "rebalance spinning and doing no work" issues have been
fixed; we now check if the write allocation will succeed in
bch2_data_update_init(), before kicking off the read.
There's still more work to do in this area. Later we may want to add
another bitset btree, like rebalance_work, to track "extents that
rebalance was requested to move but couldn't", e.g. due to destination
target having insufficient online devices.
- We can now support scaling well into the petabyte range: latest
bcachefs-tools will pick an appropriate bucket size at format time to
ensure fsck can run in available memory (e.g. a server with 256GB of
ram and 100PB of storage would want 16MB buckets).
On disk format changes:
- 1.21: cached backpointers (scalability improvement)
Cached replicas now get backpointers, which means we no longer rely on
incrementing bucket generation numbers to invalidate cached data: this
lets us get rid of the bucket generation number garbage collection,
which had to periodically rescan all extents to recompute bucket
oldest_gen.
Bucket generation numbers are now only used as a consistency check,
but they're quite useful for that.
- 1.22: stripe backpointers
Stripes now have backpointers: erasure coded stripes have their own
checksums, separate from the checksums for the extents they contain
(and stripe checksums also cover the parity blocks). This is required
for implementing scrub for stripes.
- 1.23: stripe lru (scalability improvement)
Persistent lru for stripes, ordered by "number of empty blocks". This
is used by the stripe creation path, which depending on free space
may create a new stripe out of a partially empty existing stripe
instead of starting a brand new stripe.
This replaces an in-memory heap, and means we no longer have to read
in the stripes btree at startup.
- 1.24: casefolding
Case insensitive directory support, courtesy of Valve.
This is an incompatible feature, to enable mount with
-o version_upgrade=incompatible
- 1.25: extent_flags
Another incompatible feature requiring explicit opt-in to enable.
This adds a flags entry to extents, and a flag bit that marks extents
as poisoned.
A poisoned extent is an extent that was unreadable due to checksum
errors. We can't move such extents without giving them a new checksum,
and we may have to move them (for e.g. copygc or device evacuate).
We also don't want to delete them: in the future we'll have an API
that lets userspace ignore checksum errors and attempt to deal with
simple bitrot itself. Marking them as poisoned lets us continue to
return the correct error to userspace on normal read calls.
Other changes/features:
- BCH_IOCTL_QUERY_COUNTERS: this is used by the new 'bcachefs fs top'
command, which shows a live view of all internal filesystem counters.
- Improved journal pipelining: we can now have 16 journal writes in
flight concurrently, up from 4. We're logging significantly more to
the journal than we used to with all the recent disk accounting
changes and additions, so some users should see a performance
increase on some workloads.
- BCH_MEMBER_STATE_failed: previously, we would do no IO at all to
devices marked as failed. Now we will attempt to read from them, but
only if we have no better options.
- New option, write_error_timeout: devices will be kicked out of the
filesystem if all writes have been failing for x number of seconds.
We now also kick devices out when notified by blk_holder_ops that
they've gone offline.
- Device option handling improvements: the discard option should now be
working as expected (additionally, in -tools, all device options that
can be set at format time can now be set at device add time, i.e.
data_allowed, state).
- We now try harder to read data after a checksum error: we'll do
additional retries if necessary to a device after after it gave us
data with a checksum error.
- More self healing work: the full inode <-> dirent consistency checks
that are currently run by fsck are now also run every time we do a
lookup, meaning we'll be able to correct errors at runtime. Runtime
self healing will be flipped on after the new changes have seen more
testing, currently they're just checking for consistency.
- KMSAN fixes: our KMSAN builds should be nearly clean now, which will
put a massive dent in the syzbot dashboard.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmfhbnsACgkQE6szbY3K
bnY6ew/9FXh3m71BvVpuqTYcUGzIC7gVrnkFy6n4W96v07OjSOoTNHOVVovajxc3
P9LvA77BHC4Xro3H7ORpsIurOZUc6yx18ZizzulVbQFuYa7LY/kNri4ZBtGHcRiV
pIdQDLSNmwFjPA4x2S1qTFSF1c586lad+UNQiLam5ophBwQPEO6vG51ZEHa4wld9
+OhWTDYfrvij4D3Lt1ppvhuDP+PQBjhu/QFc0bGjHvKOjfV6sw9XU91sCYKOJIzd
qzpsiQd5sepnX717Br3f5SLdxMq2lJYvRp9756vltOCaMBvJYJtHqtXCglHQEkFw
yjhmPjk4r3VlKTF8K+wEJfAHwbC2kEn7csJNbt0+Nko5PPtFyrb8ok6QUbHCKscL
L0VMnzaXHVqvG2VgYa31temfdz7HM/zHjQ8Al3eQPaqTHIoTXIBQxOQSea/apVMt
TIlastvLoHfR8W7+LrwOmTjnBJGCJ+MrdcJzJDVk2tQmmcMA0boeZvl4aSklFuyB
zNN5fxp0VMsxNyIHLJjQ3UcwVqHXC5w+f5H1ByQLUyQh+m/xaAaz7S+BTVdVbFPa
1Z1xDuvuHOTnjIOamnOD1l36afJnhq5RciPCXCNtQSB819mc+AfNGQNQTVNOTReC
iTiUCcNxu0/DIPlPmeJzAlukVJUgz+/knOI/6zPs3eI7/o88ZGg=
=k3cV
-----END PGP SIGNATURE-----
Merge tag 'bcachefs-2025-03-24' of git://evilpiepirate.org/bcachefs
Pull bcachefs updates from Kent Overstreet:
"On disk format is now soft frozen: no more required/automatic are
anticipated before taking off the experimental label.
Major changes/features since 6.14:
- Scrub
- Blocksize greater than page size support
- A number of "rebalance spinning and doing no work" issues have been
fixed; we now check if the write allocation will succeed in
bch2_data_update_init(), before kicking off the read.
There's still more work to do in this area. Later we may want to
add another bitset btree, like rebalance_work, to track "extents
that rebalance was requested to move but couldn't", e.g. due to
destination target having insufficient online devices.
- We can now support scaling well into the petabyte range: latest
bcachefs-tools will pick an appropriate bucket size at format time
to ensure fsck can run in available memory (e.g. a server with
256GB of ram and 100PB of storage would want 16MB buckets).
On disk format changes:
- 1.21: cached backpointers (scalability improvement)
Cached replicas now get backpointers, which means we no longer rely
on incrementing bucket generation numbers to invalidate cached
data: this lets us get rid of the bucket generation number garbage
collection, which had to periodically rescan all extents to
recompute bucket oldest_gen.
Bucket generation numbers are now only used as a consistency check,
but they're quite useful for that.
- 1.22: stripe backpointers
Stripes now have backpointers: erasure coded stripes have their own
checksums, separate from the checksums for the extents they contain
(and stripe checksums also cover the parity blocks). This is
required for implementing scrub for stripes.
- 1.23: stripe lru (scalability improvement)
Persistent lru for stripes, ordered by "number of empty blocks".
This is used by the stripe creation path, which depending on free
space may create a new stripe out of a partially empty existing
stripe instead of starting a brand new stripe.
This replaces an in-memory heap, and means we no longer have to
read in the stripes btree at startup.
- 1.24: casefolding
Case insensitive directory support, courtesy of Valve.
This is an incompatible feature, to enable mount with
-o version_upgrade=incompatible
- 1.25: extent_flags
Another incompatible feature requiring explicit opt-in to enable.
This adds a flags entry to extents, and a flag bit that marks
extents as poisoned.
A poisoned extent is an extent that was unreadable due to checksum
errors. We can't move such extents without giving them a new
checksum, and we may have to move them (for e.g. copygc or device
evacuate). We also don't want to delete them: in the future we'll
have an API that lets userspace ignore checksum errors and attempt
to deal with simple bitrot itself. Marking them as poisoned lets us
continue to return the correct error to userspace on normal read
calls.
Other changes/features:
- BCH_IOCTL_QUERY_COUNTERS: this is used by the new 'bcachefs fs top'
command, which shows a live view of all internal filesystem
counters.
- Improved journal pipelining: we can now have 16 journal writes in
flight concurrently, up from 4. We're logging significantly more to
the journal than we used to with all the recent disk accounting
changes and additions, so some users should see a performance
increase on some workloads.
- BCH_MEMBER_STATE_failed: previously, we would do no IO at all to
devices marked as failed. Now we will attempt to read from them,
but only if we have no better options.
- New option, write_error_timeout: devices will be kicked out of the
filesystem if all writes have been failing for x number of seconds.
We now also kick devices out when notified by blk_holder_ops that
they've gone offline.
- Device option handling improvements: the discard option should now
be working as expected (additionally, in -tools, all device options
that can be set at format time can now be set at device add time,
i.e. data_allowed, state).
- We now try harder to read data after a checksum error: we'll do
additional retries if necessary to a device after after it gave us
data with a checksum error.
- More self healing work: the full inode <-> dirent consistency
checks that are currently run by fsck are now also run every time
we do a lookup, meaning we'll be able to correct errors at runtime.
Runtime self healing will be flipped on after the new changes have
seen more testing, currently they're just checking for consistency.
- KMSAN fixes: our KMSAN builds should be nearly clean now, which
will put a massive dent in the syzbot dashboard"
* tag 'bcachefs-2025-03-24' of git://evilpiepirate.org/bcachefs: (180 commits)
bcachefs: Kill unnecessary bch2_dev_usage_read()
bcachefs: btree node write errors now print btree node
bcachefs: Fix race in print_chain()
bcachefs: btree_trans_restart_foreign_task()
bcachefs: bch2_disk_accounting_mod2()
bcachefs: zero init journal bios
bcachefs: Eliminate padding in move_bucket_key
bcachefs: Fix a KMSAN splat in btree_update_nodes_written()
bcachefs: kmsan asserts
bcachefs: Fix kmsan warnings in bch2_extent_crc_pack()
bcachefs: Disable asm memcpys when kmsan enabled
bcachefs: Handle backpointers with unknown data types
bcachefs: Count BCH_DATA_parity backpointers correctly
bcachefs: Run bch2_check_dirent_target() at lookup time
bcachefs: Refactor bch2_check_dirent_target()
bcachefs: Move bch2_check_dirent_target() to namei.c
bcachefs: fs-common.c -> namei.c
bcachefs: EIO cleanup
bcachefs: bch2_write_prep_encoded_data() now returns errcode
bcachefs: Simplify bch2_write_op_error()
...
Fixes "task out to lunch" warnings during recovery on large machines
with lots of dirty data in the journal.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
btree node scan has to wait on kthread workers that scan each device,
potentially for awhile.
We would like this to be interruptible, but we may need a different
mechanism than signals for that - we've had bugs in the past where
mounts were failing due to checking for signals, and no explanation on
where they came from.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Data move -> move_get_io_opts -> bch2_get_update_rebalance_opts
requires a not_extents iterator; this fixes the path where we're walking
the extents btree and chase a reflink pointer into the reflink btree.
bch2_lookup_indirect_extent() requires working with an extents iterator
(due to peek_slot() semantics), so we implement
bch2_lookup_indirect_extent_for_move().
This is simplified because there's no need to report
indirect_extent_missing_errors here, that can be deferred until fsck or
when a user reads that data.
Reported-by: Maël Kerbiriou <mael.kerbiriou@free.fr>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
There's various checks for "are we going to compress this" - but we're
not going to compress if we know it's incompressible.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We weren't checking that accounting keys have the expected number of
accounters. Originally we probably wanted to be flexible on this, but it
doesn't look like that will be required - accounting is extended by
adding new counter types, not more counters to an existing type.
This means we can drop a BUG_ON() that popped once in automated testing,
and the new validation will make that bug easier to track down.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Also, improve the message in prep_encoded_data() - it now prints
good/bad checksums, and checksum type.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
__bch2_read, before calling __bch2_read_extent(), sets bvec_iter.bi_size
to "the size we can read from the current extent" with a swap, and
restores it to "the size for the total read" after the read_extent call
with another swap.
But we neglected to do the restore before the "if (ret) goto err;" -
which is a problem if we're retrying those errors.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If we're moving an extent that was partially overwritten,
bch2_write_rechecksum() will trim it to the currenty live range.
If we then also want to compress it, it'll be decrypted - but the nonce
has been advanced for the overwritten start of the extent that we
dropped, and we were using the nonce we calculated before rechecksum().
Reported-by: Gabriel de Perthuis <g2p.code@gmail.com>
Fixes: 127d90d282 ("bcachefs: bch2_write_prep_encoded_data() now returns errcode")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZ90rxAAKCRCRxhvAZXjc
ooIPAQCwMjDjtWegvBy8kefiRw+fa4z3ZWHrwRT9DJrD/K9WyAD+JVd0ou27SVpQ
jKpRSRct2eTbyxdYiGydHQGm5F5sLg4=
=0FyQ
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.15-rc1.pagesize' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs pagesize updates from Christian Brauner:
"This enables block sizes greater than the page size for block devices.
With this we can start supporting block devices with logical block
sizes larger than 4k.
It also allows to lift the device cache sector size support to 64k.
This allows filesystems which can use larger sector sizes up to 64k to
ensure that the filesystem will not generate writes that are smaller
than the specified sector size"
* tag 'vfs-6.15-rc1.pagesize' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
bdev: add back PAGE_SIZE block size validation for sb_set_blocksize()
bdev: use bdev_io_min() for statx block size
block/bdev: lift block size restrictions to 64k
block/bdev: enable large folio support for large logical block sizes
fs/buffer fs/mpage: remove large folio restriction
fs/mpage: use blocks_per_folio instead of blocks_per_page
fs/mpage: avoid negative shift for large blocksize
fs/buffer: remove batching from async read
fs/buffer: simplify block_read_full_folio() with bh_offset()
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZ90rNwAKCRCRxhvAZXjc
onBJAP9Z8Ywmlb5KQ1E3HvDmkwyY6yOSyZ9/CmbzrkCJ8ywYkQD/d9/xt0EP/O/q
N8YtzXArHWt7u0YbcVpy9WK3F72BdwU=
=VJgY
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.15-rc1.async.dir' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs async dir updates from Christian Brauner:
"This contains cleanups that fell out of the work from async directory
handling:
- Change kern_path_locked() and user_path_locked_at() to never return
a negative dentry. This simplifies the usability of these helpers
in various places
- Drop d_exact_alias() from the remaining place in NFS where it is
still used. This also allows us to drop the d_exact_alias() helper
completely
- Drop an unnecessary call to fh_update() from nfsd_create_locked()
- Change i_op->mkdir() to return a struct dentry
Change vfs_mkdir() to return a dentry provided by the filesystems
which is hashed and positive. This allows us to reduce the number
of cases where the resulting dentry is not positive to very few
cases. The code in these places becomes simpler and easier to
understand.
- Repack DENTRY_* and LOOKUP_* flags"
* tag 'vfs-6.15-rc1.async.dir' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
doc: fix inline emphasis warning
VFS: Change vfs_mkdir() to return the dentry.
nfs: change mkdir inode_operation to return alternate dentry if needed.
fuse: return correct dentry for ->mkdir
ceph: return the correct dentry on mkdir
hostfs: store inode in dentry after mkdir if possible.
Change inode_operations.mkdir to return struct dentry *
nfsd: drop fh_update() from S_IFDIR branch of nfsd_create_locked()
nfs/vfs: discard d_exact_alias()
VFS: add common error checks to lookup_one_qstr_excl()
VFS: change kern_path_locked() and user_path_locked_at() to never return negative dentry
VFS: repack LOOKUP_ bit flags.
VFS: repack DENTRY_ flags.