When we do a read to a buffer that's mapped into userspace, it's
possible to get a spurious checksum error if userspace was modified the
buffer at the same time.
When we retry those, they have to be bounced before we know definitively
whether we're reading corrupt data.
But the retry path propagates read flags differently, so needs special
handling.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kill the READ_ERR/READ_RETRY/READ_RETRY_AVOID enums, and add standard
error codes that describe precisely which error occured.
This is going to be used for the data move path, to move but poison
extents with checksum errors.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
dm-flakey is busted, and this is simpler anyways - this lets us test the
checksum error retry ptahs
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
These are commonly needed when debugging, and saves from having to ask
users to dig.
Also, rebalance_status now includes pending rebalance work.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Use max() to simplify gen_after() and improve its readability.
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The extra byte is not used - remove it.
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We're improving our handling of write errors - we shouldn't write
degraded data just because a write failed once, we should retry it (on
other devices, if possible).
But for this to work, we need to kick devices out when they're only
returning errors - otherwise those retries will loop infinitely.
This adds a configurable timeout - if writes are failing for too long,
we'll set that device read-only.
In the future we should also implement more tracking and another knob
for an "allowed error rate", so that we can kick out drives that are
acting "unhealthy".
Another thing we'll want is a mechanism (likely in userspace) for
bringing a device back in after a transient error - perhaps a cable was
jiggled, or there was a controller reset.
After transient errors we also need a mechanism to walk (from the
journal) recent btree updates that weren't flushed to that device and
treat them as "degraded", since unflushed data may well not have been
written. Out of scope for this patch, but becoming relevant.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Previously, we woudn't try to read at all from a failed device - that
doesn't make much sense, the device may be unhealthy (perhaps taking
longer than it should to service reads), but if it's our only option we
should still try to read from it.
Now, bch2_bkey_pick_read_device() will pick failed devices only if there
are no non-failed replicas to read from.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The next patch implementing freezing will change bch2_dev_get_ioref() to
sleep if a device is currently frozen.
Add an annotation and fix the journal code accordingly.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This was completely fubar; it's now simplified a bit as well.
Note that for_each_online_member() takes and releases io_refs as it
iterates, so we need to release that if we break.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We can't use the standard fs_holder_ops because they're meant for single
device filesystems - fs_bdev_mark_dead() in particular - and they assume
that the blk_holder is the super_block, which also doesn't work for a
multi device filesystem.
These generally follow the standard fs_holder_ops; the
locking/refcounting is a bit simplified because c->ro_ref suffices, and
bch2_fs_bdev_mark_dead() is not necessarily shutting down the entire
filesystem.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This is necessary for the new blk_holder_ops, which want the vfs
super_block available for synchronization.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Note that we open block devices before we allocate bch_fs, but once
attached to a filesystem they will be closed before the bch_fs is torn
down - so stashing a pointer without a refcount looks incorrect but it's
not.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We need to start accounting successes for every IO, not just failures,
so introduce a unified hook for io completion accounting and convert
io_read.c.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We were using our device pointer after we'd released our ref to it.
Unlikely to be a race that's practical to hit, since actually removing a
member device is a whole process besides just taking it offline, but -
needs to be fixed.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If a device is ro or failed, we might not have anywhere to move a
replica.
Check for this early, before doing the read and attempting to write.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This implements a new extent field bitflags that apply to the whole
extent. There's been a couple things we've wanted this for in the past,
but the immediate need is extent poisoning, to solve a rebalance issue.
Unknown extent fields can't be parsed (we won't known their size, so we
can't advance to the next field), so this is an incompat feature, and
using it prevents the filesystem from being mounted by old versions.
This also adds the BCH_EXTENT_poisoned flag; this indicates that the
data is known to be bad (i.e. there was a checksum error, and we had to
write a new checksum) and reads will return errors.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Use error type alloc_v3_unpack_error in bch2_alloc_v3_validate().
Fixes: b65db750e2 ("bcachefs: Enumerate fsck errors")
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This patch implements support for case-insensitive file name lookups
in bcachefs.
The implementation uses the same UTF-8 lowering and normalization that
ext4 and f2fs is using.
More information is provided in Documentation/bcachefs/casefolding.rst
Compatibility notes:
This uses the new versioning scheme for incompatible features where an
incompatible feature is tied to a version number: the superblock says
"we may use incompat features up to x" and "incompat features up to x
are in use", disallowing mounting by previous versions.
Additionally, and old style incompat feature bit is used, so that
kernels without utf8 casefolding support know if casefolding
specifically is in use and they're allowed to mount.
Signed-off-by: Joshua Ashton <joshua@froggi.es>
Cc: André Almeida <andrealmeid@igalia.com>
Cc: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Splits out the code that allocates the dirent and initializes the name
to make things easier to implement casefolding in a future commit.
Cc: André Almeida <andrealmeid@igalia.com>
Cc: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Joshua Ashton <joshua@froggi.es>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add a persistent LRU for stripes, ordered by "number of empty blocks",
i.e. order in which we wish to reuse them.
This will replace the in-memory stripes heap, so we can kill off reading
stripes into memory at startup.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Stripes now have backpointers.
This is needed for proper scrub - stripe checksums need to be verified,
separately from extents within the stripe, since a block may not be full
of live extents but it's still needed for reconstruct.
And this will be needed for (efficient) evacuate/repair paths.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Now that we've got cached backpointers and aren't leaving around stale
pointers on bucket invalidation, we no longer need the periodic (rare)
gc_gens - which recalculates each bucket's oldest gen to avoid wraparound.
We can't delete that code because we've got to support existing
filesystems that will still have stale pointers, but this gets rid of
another scalability limit.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Cached pointers now have backpointers.
This means that we'll be able to kill cached pointers in the
bucket_invalidate path, when invalidating/reusing buckets containing
cached data, instead of leaving them around to be cleaned up by gc_gens
garbago collection - which requires a full metadata scan.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Transactional triggers need to run in a defined ordering, which is not
quite the same as btree ID integer comparison.
Previously this was handled in a hacky way in
bch2_trans_commit_run_triggers(), since it was only the alloc btree that
needed special handling, but upcoming stripe btree changes are going to
require more ordering changes - so, define that ordering.
Next patch will change the transaction commit path to use it.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Pass in the backpointer explicitly, instead of assuming 'referring_k' is
an alloc key and calculating it.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
FRAGMENTATION_START was incorrect, there's currently only one
fragmentation LRU (at the end of the reserved bits for LRU type), and
we're getting ready to add a stripe fragmentation lru - so give it a
better name.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
A user has been seeing the "error verifying existing checksum while
rewriting existing data (memory corruption?)" error.
This generally indicates a hardware issue (and that may be the case
here), but it might also indicate a bug, in which case we need more
information to look for patterns.
Reported-by: Roland Vet <vet.roland@protonmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The eytzinger code was previously relying on the following wrap-around
properties and their "eytzinger0" equivalents:
eytzinger1_prev(0, size) == eytzinger1_last(size)
eytzinger1_next(0, size) == eytzinger1_first(size)
However, these properties are no longer relied upon and no longer
necessary, so remove the corresponding asserts and forbid the use of
eytzinger1_prev(0, size) and eytzinger1_next(0, size).
This allows to further simplify the code in eytzinger1_next() and
eytzinger1_prev(): where the left shifting happens, eytzinger1_next() is
trying to move i to the lowest child on the left, which is equivalent to
doubling i until the next doubling would cause it to be greater than
size. This is implemented by shifting i to the left so that the most
significant bits align and then shifting i to the right by one if the
result is greater than size.
Likewise, eytzinger1_prev() is trying to move to the lowest child on the
right; the same applies here.
The 1-offset in (size - 1) in eytzinger1_next() isn't needed at all, but
the equivalent offset in eytzinger1_prev() is surprisingly needed to
preserve the 'eytzinger1_prev(0, size) == eytzinger1_last(size)'
property. However, since we no longer support that property, we can get
rid of these offsets as well. This saves one addition in each function
and makes the code less confusing.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
In this second step, transform the eytzinger indexes i, j, and k in
eytzinger1_sort_r() from 0-based to 1-based. This step looks a bit
messy, but the resulting code is slightly better.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
In this first step, convert the eytzinger sort functions to use 1-based
primitives.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Several of the algorithms on eytzinger trees are implemented in terms of
the eytzinger0 primitives. However, those algorithms can just as easily
be expressed in terms of the eytzinger1 primitives, and that leads to
better and easier to understand code. Start by converting
eytzinger0_find().
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Function eytzinger0_find() isn't currently covered, so add a self test.
We can rely on eytzinger0_find_le() here because it is being
tested independently.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add an eytzinger0_find_ge() self test similar to eytzinger0_find_gt().
Note that this test requires eytzinger0_find_ge() to return the first
matching element in the array in case of duplicates. To prevent
bisection errors, we only add this test after strenghening the original
implementation (see the previous commit).
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Implement eytzinger0_find_ge() directly instead of implementing it in
terms of eytzinger0_find_le() and adjusting the result.
This turns eytzinger0_find_ge() into a minimum search, so when there are
duplicate elements, the result of eytzinger0_find_ge() will now always
point at the first matching element.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Instead of implementing eytzinger0_find_gt() in terms of
eytzinger0_find_le() and adjusting the result, implement it directly.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add an eytzinger0_find_gt() self test similar to eytzinger0_find_le().
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Replace the over-complicated implementation of eytzinger0_find_le() by
an equivalent, simpler version.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
eytzinger0_find_le() is also easy to concert to 1-based eytzinger (but
see the next commit).
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Rename eytzinger0_find_test_val() to eytzinger0_find_test_le() and add a
new eytzinger0_find_test_val() wrapper that calls it.
We have already established that the array is sorted in eytzinger order,
so we can use the eytzinger iterator functions and check the boundary
conditions to verify the result of eytzinger0_find_le().
Only scan the entire array if we get an incorrect result. When we need
to scan, use eytzinger0_for_each_prev() so that we'll stop at the
highest matching element in the array in case there are duplicates;
going through the array linearly wouldn't give us that.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add an eytzinger0_for_each_prev() macro for iterating through an
eytzinger array in reverse.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
In eytzinger0_find_test(), remember the smallest element seen so far
instead of comparing adjacent array elements.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
In eytzinger[01]_test(), make sure that eytzinger[01]_for_each()
iterates over all array elements.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
pr_info() format strings need to be newline terminated.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The iterator variable of eytzinger0_for_each() loops has been changed to
be locally scoped at some point, so remove variables defined outside the
loop that are now unused. In addition and for clarity, use a different
variable inside those loops where an outside variable would be shadowed.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When EYTZINGER_DEBUG is defined, <linux/bug.h> needs to be included.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Use an eytzinger0_for_each() loop here.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We have other metadata IO types covered, this was missing.
Note: this includes the time until completion, i.e. including parent
pointer update.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Prep work for stripe backpointers: this path previously would get very
confused at being asked to process (remove redundant replicas) stripes.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Since we're increasing the number of 'struct journal_bufs', we don't
want them all permanently holding onto buffers for the journal data -
that'd be 16 * 2MB = 32MB, or potentially more.
Add a single-element mempool (open coded, since buffer size varies),
this also means we won't be hitting the memory allocator every time we
open and close a journal entry/buffer.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This is a small optimization, reducing the number of cachelines we touch
in the fast path - and it's also necessary for the next patch that
increases JOURNAL_BUF_NR.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This code needs quite a bit of work: we don't want to be walking all
metadata in the filesystem, we should just be walking backpointers, and
it should be switched to a data ioctl that can report progress via a
file descriptor, not the system console.
But that'll take more work - before we can safely walk only backpointers
we need to change device add to not reuse device indexes, since with
that change accounting being wrong introduces the possibility of
removing a device that still has pointers.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
the backpointers code has progress indicators; these aren't great, since
they print to the dmesg console and we much prefer to have progress
indicators reporting to a specific userspace program so they're not
spamming the system console.
But not all codepaths that need progress indicators support that yet,
and we don't want users to think "this is hung".
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
we're starting to use error messages with paths in fsck_errors(), where
we do not want nested transaction restart handling, so let's prepare for
that.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We want all error messages converted to print paths, not just inode
numbers - users want this information, and it speeds up debugging too.
Auditing and converting all error messages is going to be a big project,
so for the moment we're just doing this incrementally.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Iterating over backpointers on a specific device is potentially much
cheaper than walking all filesystem data.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Reorganize counters a bit, grouping related counters together.
New counters:
- io_read_inline
- io_read_hole
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When ancestor is less than IS_ANCESTOR_BITMAP, we would get an incorrect
result.
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add a new data op to walk all data and metadata in a filesystem,
checking if it can be read successfully, and on error repairing from
another copy if possible.
- New helper: bch2_dev_idx_is_online(), so that we can bail out and
report to userspace when we're unable to scrub because the device is
offline
- data_update_opts, which controls the data move path, now understands
scrub: data is only read, not written. The read path is responsible
for rewriting on read error, as with other reads.
- scrub_pred skips data extents that don't have checksums
- bch_ioctl_data has a new scrub member, which has a data_types field
for data types to check - i.e. all data types, or only metadata.
- Add new entries to bch_move_stats so that we can report numbers for
corrected and uncorrected errors
- Add a new enum to bch_ioctl_data_event for explicitly reporting
completion and return code (i.e. device offline)
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add a function for scrubbing btree nodes - reading them in, and kicking
off a rewrite if there's an error.
The btree_node_read_done() checks have to be duplicated because we're
not using a pointer to a struct btree - the btree node might already be
in cache, and we need to check a specific replica, which might not be
the one we previously read from.
This will be used in the next patch implementing high-level scrub.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add a new helper for rewriting a btree node given a just the key, not a
pointer to the node itself.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We may not need to pull in a btree node when walking backpointers -
don't do so unnecessarily when using backpointer_get_key().
It'll still fall back to backpointer_get_node() in a few situations,
including btree roots (where an iterator can't point at just the key),
and races due to the interior update path not having deleted a
backpointer to an old node yet.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Rework the read path so that BCH_READ_NODECODE reads now also self-heal
after a read error and a successful retry - prerequisite for scrub.
- __bch2_read_endio() now handles a read that's both BCH_READ_NODECODE
and a bounce.
Normally, we don't want a BCH_READ_NODECODE read to ever allocate a
split bch_read_bio: we want to maintain the relationship between the
bch_read_bio and the data_update it's embedded in.
But correcting read errors requires allocating a split/bounce rbio
that's embedded in a promote_op. We do still have a 1-1 relationship,
i.e. we only allocate a single split/bounce if it's a
BCH_READ_NODECODE, so things hopefully don't get too crazy.
- __bch2_read_extent() now is allowed to allocate the promote_op for
rewriting after a failed read, even if it's BCH_READ_NODECODE.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
we don't want to block completion of the read - starting a promote calls
into the write path, which will block.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If a data update doesn't want to block on allocations (promotes, self
healing on read error) - check if the allocation would fail before
kicking off the data update and calling into the write path.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Initialize the write op first, so that in the next patch we can check if
the allocator would block (for BCH_WRITE_alloc_nowait ops) and bail out
before taking nocow locks/dev refs.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If a drive is failing and we're moving data off of it, we can't
necessairly depend on capacity/disk reservation calculations to avoid
deadlocking/blocking on the allocator.
And, we don't want to queue up infinite self healing moves anyways.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Promotes, like most other internal moves, should only go to the
specified target and not fall back to allocating from the full
filesystem.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Now that data_update embeds bch_read_bio, BCH_READ_NODECODE means that
the read is embedded in a a data_update - and we can check in the retry
path if the extent has changed and bail out.
This likely fixes some subtle bugs with read errors and data moves.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add a new helper for bch2_moving_ctxt_to_text(), which may be used to
debug if moving_ios are getting stuck.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add an ioctl for querying counters, the same ones provided in
/sys/fs/bcachefs/<uuid>/counters/, but more suitable for a 'bcachefs
top' command.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Persistent counters, like recovery passes, include a stable enum in
their definition - but this was never correctly plumbed.
This allows us to add new counters and properly organize them with a
non-stable "presentation order", which can also be used in userspace by
the new 'bcachefs fs top' tool.
Fortunatel, since we haven't yet added any new counters where
presentation order ID doesn't match stable ID, this won't cause any
reordering issues.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We've got per-writepoint statistics to see how well the writepoint index
update threads are pipelining; this separates running vs. runnable so we
can see at a glance if they're blocking.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This makes 'bcachefs fs top' more useful; we can now see at a glance
whether the IO to the device is being done for user reads/writes, or
copygc/rebalance.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Early version of 'bcachefs_metadata_version_cached_backpointers' was
creating backpointers for stale cached pointers - whoops. Now we have to
repair those.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Factor out get_iter_to_node() and use it for
btree_node_rewrite_get_iter(), to be used for fixing btree node write
error behaviour.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bcachefs removed most PAGE_SIZE references long ago, so this is easy;
only readpage_bio_extend() has to be tweaked to respect the minimum
order.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bare 64 bit divides not allowed, whoops
arm-linux-gnueabi-ld: drivers/char/random.o: in function `__get_random_u64_below':
drivers/char/random.c:602:(.text+0xc70): undefined reference to `__aeabi_uldivmod'
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We just had a report of the assert for "btree in write buffer for
non-write buffer btree" popping during the 6.14 upgrade.
- 150TB filesystem, after a reboot the upgrade was able to continue from
where it left off, so no major damage.
But with 6.14 about to come out we want to get this tracked down asap,
and need more data if other users hit this.
Convert the BUG_ON() to an emergency read-only, and print out btree, the
key itself, and stack trace from the original write buffer update (which
did not have this check before).
Reported-by: Stijn Tintel <stijn@linux-ipv6.be>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
get_random_u32_below() has a better algorithm than bch2_rand_range(),
it just didn't exist at the time.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The commit titled "block/bdev: lift block size restrictions to 64k"
lifted the block layer's max supported block size to 64k inside the
helper blk_validate_block_size() now that we support large folios.
However in lifting the block size we also removed the silly use
cases many filesystems have to use sb_set_blocksize() to *verify*
that the block size <= PAGE_SIZE. The call to sb_set_blocksize() was
used to check the block size <= PAGE_SIZE since historically we've
always supported userspace to create for example 64k block size
filesystems even on 4k page size systems, but what we didn't allow
was mounting them. Older filesystems have been using the check with
sb_set_blocksize() for years.
While, we could argue that such checks should be filesystem specific,
there are much more users of sb_set_blocksize() than LBS enabled
filesystem on upstream, so just do the easier thing and bring back
the PAGE_SIZE check for sb_set_blocksize() users and only skip it
for LBS enabled filesystems.
This will ensure that tests such as generic/466 when run in a loop
against say, ext4, won't try to try to actually mount a filesystem with
a block size larger than your filesystem supports given your PAGE_SIZE
and in the worst case crash.
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20250307020403.3068567-1-mcgrof@kernel.org
Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
There's no point in doing copygc on non-rw devices: the fragmentation
doesn't matter if we're not writing to them, and we may not have
anywhere to put the data on our other devices.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Previously, we fixed journal resize spuriousl failing with
-BCH_ERR_open_buckets_empty, but initial journal allocation was missed
because it didn't invoke the "block on allocator" loop at all.
Factor out the "loop on allocator" code to fix that.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We shouldn't be setting incompatible bits or the incompatible version
field unless explicitly request or allowed - otherwise we break mounting
with old kernels or userspace.
Reported-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Some filesystems, such as NFS, cifs, ceph, and fuse, do not have
complete control of sequencing on the actual filesystem (e.g. on a
different server) and may find that the inode created for a mkdir
request already exists in the icache and dcache by the time the mkdir
request returns. For example, if the filesystem is mounted twice the
directory could be visible on the other mount before it is on the
original mount, and a pair of name_to_handle_at(), open_by_handle_at()
calls could instantiate the directory inode with an IS_ROOT() dentry
before the first mkdir returns.
This means that the dentry passed to ->mkdir() may not be the one that
is associated with the inode after the ->mkdir() completes. Some
callers need to interact with the inode after the ->mkdir completes and
they currently need to perform a lookup in the (rare) case that the
dentry is no longer hashed.
This lookup-after-mkdir requires that the directory remains locked to
avoid races. Planned future patches to lock the dentry rather than the
directory will mean that this lookup cannot be performed atomically with
the mkdir.
To remove this barrier, this patch changes ->mkdir to return the
resulting dentry if it is different from the one passed in.
Possible returns are:
NULL - the directory was created and no other dentry was used
ERR_PTR() - an error occurred
non-NULL - this other dentry was spliced in
This patch only changes file-systems to return "ERR_PTR(err)" instead of
"err" or equivalent transformations. Subsequent patches will make
further changes to some file-systems to return a correct dentry.
Not all filesystems reliably result in a positive hashed dentry:
- NFS, cifs, hostfs will sometimes need to perform a lookup of
the name to get inode information. Races could result in this
returning something different. Note that this lookup is
non-atomic which is what we are trying to avoid. Placing the
lookup in filesystem code means it only happens when the filesystem
has no other option.
- kernfs and tracefs leave the dentry negative and the ->revalidate
operation ensures that lookup will be called to correctly populate
the dentry. This could be fixed but I don't think it is important
to any of the users of vfs_mkdir() which look at the dentry.
The recommendation to use
d_drop();d_splice_alias()
is ugly but fits with current practice. A planned future patch will
change this.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: NeilBrown <neilb@suse.de>
Link: https://lore.kernel.org/r/20250227013949.536172-2-neilb@suse.de
Signed-off-by: Christian Brauner <brauner@kernel.org>
__bch_truncate_folio() may return 1 to indicate dirtyness of the folio
being truncated, needed for fpunch to get the i_size writes correct.
But truncate was forgetting to clear ret, and sometimes returning it as
an error.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This turned out to have several bugs, which were missed because the fsck
code wasn't properly reporting errors - whoops.
Kicking it out for now, hopefully it can make 6.15.
Cc: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The fix alone doesn't fix [1], but should be applied before debugging
that.
[1] https://syzkaller.appspot.com/bug?extid=38a0cbd267eff2d286ff
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
"nonce inconstancy" is popping up again, causing us to go emergency
read-only.
This one looks less serious, i.e. specific to the encryption path and
not indicative of a data corruption bug. But we'll need more info to
track it down.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We don't want to be holding the srcu lock while waiting on btree write
completions - easily fixed.
Reported-by: Janpieter Sollie <janpieter.sollie@edpnet.be>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We had some error handling confusion here;
-BCH_ERR_missing_indirect_extent is thrown by
trans_trigger_reflink_p_segment(); at this point we haven't decide
whether we're generating an error.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Error handling was wrong, causing unhandled transaction restart errors.
check_directory_size() was also inefficient, since keys in multiple
snapshots would be iterated over once for every snapshot. Convert it to
the same scheme used for i_sectors and subdir count checking.
Cc: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
No callers of kern_path_locked() or user_path_locked_at() want a
negative dentry. So change them to return -ENOENT instead. This
simplifies callers.
This results in a subtle change to bcachefs in that an ioctl will now
return -ENOENT in preference to -EXDEV. I believe this restores the
behaviour to what it was prior to
Commit bbe6a7c899 ("bch2_ioctl_subvolume_destroy(): fix locking")
Signed-off-by: NeilBrown <neilb@suse.de>
Link: https://lore.kernel.org/r/20250217003020.3170652-2-neilb@suse.de
Acked-by: Paul Moore <paul@paul-moore.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
_orig_restart_count is unused now, according to the logic, trans_was_restarted
should be using _orig_restart_count.
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Incorrectly handled transaction restarts can be a source of heisenbugs;
add a mode where we randomly inject them to shake them out.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
want_new_bset() returns the address of a new bset to initialize if we
wish to do so in a btree node - either because the previous one is too
big, or because it's been written.
The case for 'previous bset was written' was wrong: it's only supposed
to check for if we have space in the node for one more block, but
because it subtracted the header from the space available it would never
initialize a new bset if we were down to the last block in a node.
Fixing this results in fewer btree node splits/compactions, which fixes
a bug with flushing the journal to go read-only sometimes not
terminating or taking excessively long.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This lets us flush the journal to go read-only more effectively.
Flushing the journal and going read-only requires halting mutually
recursive processes, which strictly speaking are not guaranteed to
terminate.
Flushing btree node journal pins will kick off a btree node write, and
btree node writes on completion must do another btree update to the
parent node to update the 'sectors_written' field for that node's key.
If the parent node is full and requires a split or compaction, that's
going to generate a whole bunch of additional btree updates - alloc
info, LRU btree, and more - which then have to be flushed, and the cycle
repeats.
This process will terminate much more effectively if we tweak journal
reclaim to flush btree updates leaf to root: i.e., don't flush updates
for a given btree node (kicking off a write, and consuming space within
that node up to the next block boundary) if there might still be
unflushed updates in child nodes.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
reflink pointers to missing indirect extents aren't deleted, they just
have an error bit set - in case the indirect extent somehow reappears.
fsck/mark and sweep thus needs to ignore these errors.
Also, they can be marked AUTOFIX now.
Reported-by: Roland Vet <vet.roland@protonmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Previously, bch2_bkey_sectors_need_rebalance() called
bch2_target_accepts_data(), checking whether the target is writable.
However, this means that adding or removing devices from a target would
change the value of bch2_bkey_sectors_need_rebalance() for an existing
extent; this needs to be invariant so that the extent trigger can
correctly maintain rebalance_work accounting.
Instead, check target_accepts_data() in io_opts_to_rebalance_opts(),
before creating the bch_extent_rebalance entry.
This fixes (one?) cause of rebalance_work accounting being off.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The discard path is supposed to issue journal flushes when there's too
many buckets empty buckets that need a journal commit before they can be
written to again, but at some point this code seems to have been lost.
Bring it back with a new optimization to make sure we don't issue too
many journal flushes: the journal now tracks the sequence number of the
most recent flush in progress, which the discard path uses when deciding
which buckets need a journal flush.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
In the previous commit b3d82c2f27, code was added to prevent journal sequence
overflow. Among them, the code added to journal_entry_open() uses the
bch2_fs_fatal_err_on() function to handle errors.
However, __journal_res_get() , which calls journal_entry_open() , calls
journal_entry_open() while holding journal->lock , but bch2_fs_fatal_err_on()
internally tries to acquire journal->lock , which results in a deadlock.
So we need to add a locked helper to handle fatal errors even when the
journal->lock is held.
Fixes: b3d82c2f27 ("bcachefs: Guard against journal seq overflow")
Signed-off-by: Jeongjun Park <aha310510@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
For some unknown reason, checks on struct bkey_s_c_snapshot and struct
bkey_s_c_snapshot_tree pointers are missing.
Therefore, I think it would be appropriate to fix the incorrect pointer checking
through this patch.
Fixes: 4bd06f07bc ("bcachefs: Fixes for snapshot_tree.master_subvol")
Signed-off-by: Jeongjun Park <aha310510@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZ5yJdgAKCRBZ7Krx/gZQ
69W4AQDwgxceiQ6icx3rFhCWQigne4jdMO84kd8tNaa+xHGe1AD/WnkeChc5DqjQ
wZWZxAAzml9SS01IcSiHWaF5fgrjlA0=
=rXOq
-----END PGP SIGNATURE-----
Merge tag 'pull-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs cleanups from Al Viro:
"Two unrelated patches - one is a removal of long-obsolete include in
overlayfs (it used to need fs/internal.h, but the extern it wanted has
been moved back to include/linux/namei.h) and another introduces
convenience helper constructing struct qstr by a NUL-terminated
string"
* tag 'pull-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
add a string-to-qstr constructor
fs/overlayfs/namei.c: get rid of include ../internal.h
- second half of a fix for a bug that'd been causing oopses on
filesystems using snapshots with memory pressure (key cache fills for
snaphots btrees are tricky)
- build fix for strange compiler configurations that double stack frame
size
- "journal stuck timeout" now takes into account device latency: this
fixes some spurious warnings, and the main remaining source of SRCU
lock hold time warnings (I'm no longer seeing this in my CI, so any
users still seeing this should definitely ping me)
- fix for slow/hanging unmounts (" Improve journal pin flushing")
- some more tracepoint fixes/improvements, to chase down the "rebalance
isn't making progress" issues
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmeajBsACgkQE6szbY3K
bna2BA/9E/+WBDFQHkLJ4kNQBxKL4u1xfav5kKGZ79mUlqruhr3AckLPFzWmQO21
eOJE0NeyvpsvLewDXMGZ8w/Nm3Vdc53X6ATKkQaF/UoTYVWbubmF62sXBzSS8TUh
YIM6s24/CbCi8lT49JAuIaG3OC21KH0X0zOcvyepfmn1aiPNLr4y7zOWKynOhgCt
mAt374ayUDDTgoQmXqPIrGp8eD/C+vUjo1ief+DIGMQmDj4uHpb5iBbmjXm8FF9x
4TcrP1UjWpiWPcHeb98H/CBWOnjDSgOFhYmxhVOvkDpC6XbtPSgKQIOs7tSJ0Nuo
IOzrGuBPVfd2m+wgXsn7zbn0HNOjS76sCo92K1lAdS86k0eqRfXmCxkU6FUphNkA
WCG8WrK0RjHL132iR97dtv36No8ji5mZN1ILPk/h4KRkoKC+9fA8BaJdAGVt+6NP
wZLtZxZkV8BqgXF41HwzHt54YftRPn2kR47Jfu1rPimSUd4Uqy8Yjw2J/fUT7eAd
6JdfiadhAtMRWnFGzmVs4LsEWJ7Ja7GnG7jhjzlACsqbXsTU8k16Wq38IchC6mi+
p+hqq9pLAeosW9Lk/QTGFrq52aQfyOzdUjq1pyCcEYtZFNqjj8GmmHVejxZWiRTo
C6dTEkSIMcBx+9QP8BJ5o+xMR02KABn+8x43TzQxQ2DXj0QamTA=
=QJHL
-----END PGP SIGNATURE-----
Merge tag 'bcachefs-2025-01-29' of git://evilpiepirate.org/bcachefs
Pull bcachefs fixes from Kent Overstreet:
- second half of a fix for a bug that'd been causing oopses on
filesystems using snapshots with memory pressure (key cache fills for
snaphots btrees are tricky)
- build fix for strange compiler configurations that double stack frame
size
- "journal stuck timeout" now takes into account device latency: this
fixes some spurious warnings, and the main remaining source of SRCU
lock hold time warnings (I'm no longer seeing this in my CI, so any
users still seeing this should definitely ping me)
- fix for slow/hanging unmounts (" Improve journal pin flushing")
- some more tracepoint fixes/improvements, to chase down the "rebalance
isn't making progress" issues
* tag 'bcachefs-2025-01-29' of git://evilpiepirate.org/bcachefs:
bcachefs: Improve trace_move_extent_finish
bcachefs: Fix trace_copygc
bcachefs: Journal writes are now IOPRIO_CLASS_RT
bcachefs: Improve journal pin flushing
bcachefs: fix bch2_btree_node_flags
bcachefs: rebalance, copygc enabled are runtime opts
bcachefs: Improve decompression error messages
bcachefs: bset_blacklisted_journal_seq is now AUTOFIX
bcachefs: "Journal stuck" timeout now takes into account device latency
bcachefs: Reduce stack frame size of __bch2_str_hash_check_key()
bcachefs: Fix btree_trans_peek_key_cache()
Quite a few places want to build a struct qstr by given string;
it would be convenient to have a primitive doing that, rather
than open-coding it via QSTR_INIT().
The closest approximation was in bcachefs, but that expands to
initializer list - {.len = strlen(string), .name = string}.
It would be more useful to have it as compound literal -
(struct qstr){.len = strlen(string), .name = string}.
Unlike initializer list it's a valid expression. What's more,
it's a valid lvalue - it's an equivalent of anonymous local
variable with such initializer, so the things like
path->dentry = d_alloc_pseudo(mnt->mnt_sb, &QSTR(name));
are valid. It can also be used as initializer, with identical
effect -
struct qstr x = (struct qstr){.name = s, .len = strlen(s)};
is equivalent to
struct qstr anon_variable = {.name = s, .len = strlen(s)};
struct qstr x = anon_variable;
// anon_variable is never used after that point
and any even remotely sane compiler will manage to collapse that
into
struct qstr x = {.name = s, .len = strlen(s)};
What compound literals can't be used for is initialization of
global variables, but those are covered by QSTR_INIT().
This commit lifts definition(s) of QSTR() into linux/dcache.h,
converts it to compound literal (all bcachefs users are fine
with that) and converts assorted open-coded instances to using
that.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We're currently debugging issues with rebalance, where it's not making
progress as quickly as it should be (or sometimes not at all).
Add the full data_update to the move_extent_finish tracepoint, so we can
check that the replicas we wrote match what we were supposed to do.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
System performance is particularly sensitive to journal write latency,
the number of outstanding journal writes is bounded and we can't issue
journal flushes until other journal writes have completed.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Running the preempt tiering tests with a lower than normal journal
reclaim delay turned up a shutdown hang - a lost wakeup, caused because
flushing a journal pin (e.g. key cache/write buffer) can generate a new
journal pin.
The "simple" fix of adding the correct wakeup didn't work because of
ordering issues; if we flush btree node pins too aggressively before
other pins have completed, we end up spinning where each flush iteration
generates new work.
So to fix this correctly:
- The list of flushed journal pins is now broken out by type, so that
we can wait for key cache/write buffer pin flushing to complete
before flushing dirty btree nodes
- A new closure_waitlist is added for bch2_journal_flush_pins; this one
is only used under or when we're taking the journal lock, so it's
pretty cheap to add rigorously correct wakeups to journal_pin_set()
and journal_pin_drop().
Additionally, bch2_journal_seq_pins_to_text() is moved to
journal_reclaim.c, where it belongs, along with a bit of other small
renaming and refactoring.
Besides fixing the hang, the better ordering between key cache/write
buffer flushing and btree node flushing should help or fix the "unmount
taking excessively long" a few users have been noticing.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Ratelimit them, and use the new bch2_write_op_error() helper that prints
path and file offset.
Reported-by: https://github.com/koverstreet/bcachefs/issues/819
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
- Reorganize the architecture-optimized CRC32 and CRC-T10DIF code to be
directly accessible via the library API, instead of requiring the
crypto API. This is much simpler and more efficient.
- Convert some users such as ext4 to use the CRC32 library API instead
of the crypto API. More conversions like this will come later.
- Add a KUnit test that tests and benchmarks multiple CRC variants.
Remove older, less-comprehensive tests that are made redundant by
this.
- Add an entry to MAINTAINERS for the kernel's CRC library code. I'm
volunteering to maintain it. I have additional cleanups and
optimizations planned for future cycles.
These patches have been in linux-next since -rc1.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCZ418ZRQcZWJpZ2dlcnNA
Z29vZ2xlLmNvbQAKCRDzXCl4vpKOKyJYAP9kBlpm8W9/XY6N8SpjKaXE/vKQYHQl
Nobhak06Us8uJwEAkcUTymWP4IwQj5A9jgBAPRw53FQcNVKIc+01C7gRHw0=
=mqSH
-----END PGP SIGNATURE-----
Merge tag 'crc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux
Pull CRC updates from Eric Biggers:
- Reorganize the architecture-optimized CRC32 and CRC-T10DIF code to be
directly accessible via the library API, instead of requiring the
crypto API. This is much simpler and more efficient.
- Convert some users such as ext4 to use the CRC32 library API instead
of the crypto API. More conversions like this will come later.
- Add a KUnit test that tests and benchmarks multiple CRC variants.
Remove older, less-comprehensive tests that are made redundant by
this.
- Add an entry to MAINTAINERS for the kernel's CRC library code. I'm
volunteering to maintain it. I have additional cleanups and
optimizations planned for future cycles.
* tag 'crc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux: (31 commits)
MAINTAINERS: add entry for CRC library
powerpc/crc: delete obsolete crc-vpmsum_test.c
lib/crc32test: delete obsolete crc32test.c
lib/crc16_kunit: delete obsolete crc16_kunit.c
lib/crc_kunit.c: add KUnit test suite for CRC library functions
powerpc/crc-t10dif: expose CRC-T10DIF function through lib
arm64/crc-t10dif: expose CRC-T10DIF function through lib
arm/crc-t10dif: expose CRC-T10DIF function through lib
x86/crc-t10dif: expose CRC-T10DIF function through lib
crypto: crct10dif - expose arch-optimized lib function
lib/crc-t10dif: add support for arch overrides
lib/crc-t10dif: stop wrapping the crypto API
scsi: target: iscsi: switch to using the crc32c library
f2fs: switch to using the crc32 library
jbd2: switch to using the crc32c library
ext4: switch to using the crc32c library
lib/crc32: make crc32c() go directly to lib
bcachefs: Explicitly select CRYPTO from BCACHEFS_FS
x86/crc32: expose CRC32 functions through lib
x86/crc32: update prototype for crc32_pclmul_le_16()
...
If a block device (e.g. your typical consumer SSD) is taking multiple
seconds for IOs (typically flushes), we don't want to emit the "journal
stuck" message prematurely.
Also, make sure to drop the btree_trans srcu lock if we're blocking for
more than a second.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
BTREE_ITER_cached_nofill has some tricky corner cases; it's used
internally for iterators that aren't walking the key cache, but need to
be coherent with the key cache.
It tells traverse to look up and lock the key cache entry if present,
but don't create one if it doesn't exist.
That means we have to have a BTREE_ITER_UPTODATE path (because after
traverse the path has to be UPTODATE, or we pop assertions) that doesn't
point to anything (which is the less bad option, taken by the previous
fix).
The previous fix for this path missed an issue that can happen in
bch2_trans_peek_key_cache(): we can't set should_be_locked on a path
that doesn't point to anything and doesn't hold locks.
Fixes: bd5b09727f ("bcachefs: Don't set btree_path to updtodate if we don't fill")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmeL6hoQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgppw2EADQV8nDgLRggZR+il4U03yKHXcQEdAX1GrB
Erowx+dasIJuh6kp3n6qRe9QD/pRqt1DKyLvXoWF8Qfuwq85j7oDnDDYxutNYT27
hDgrLJriJ3VeKYtTu+andHWt8P29b5h57UayInDOUJurEPA6rXyFZ5YVIti8n21K
uDOrQXiACG3qRWS2+p2f3UNhX0MkFNFdN/lxi13WMIJtRWF5bXAP+JOgIWCID4Ze
QuSY6rQD4dp4Q6M2erpX6tn0YZb7Hvw3rPjsd91n6jvYfTUVLH375zg8jCBpi6Wi
Syufbb8xcTtriVPTDRNu0ekjebkc8wD8ax/h86g0z9v3Ua4DlNmsx9eXrtv6r5nu
YXqDODOad6stI0+owFquW2vas0gHmfNSfyfGdlk2g24PMtP5Yx0V6FIEvwIeqnje
ghgxQvBuKUsdhqakByfNnc+XvXi3+RUJek8kvMeUSUQWT1IyMQqPOOk0yp9WdyWD
bY1f2ECP5BR1b37zYOyawewsI5xTupHUswn5a4r4qtGn3O15rGDkX98Nab5aLCnR
rW/DvX7+wT6gW9EwrRHiwjwfNDZbsJ9Ggu3lMhtUl5GUWdk58yTiVgKaHJLnlX9/
CKFKfyyIR1Vl8+gYIpemyFhhcoN+dCSf06ISkrg0jeS0/tYwydaAaCBPL5J4kxZA
h3Rtbh+Pgg==
=EXYs
-----END PGP SIGNATURE-----
Merge tag 'for-6.14/block-20250118' of git://git.kernel.dk/linux
Pull block updates from Jens Axboe:
- NVMe pull requests via Keith:
- Target support for PCI-Endpoint transport (Damien)
- TCP IO queue spreading fixes (Sagi, Chaitanya)
- Target handling for "limited retry" flags (Guixen)
- Poll type fix (Yongsoo)
- Xarray storage error handling (Keisuke)
- Host memory buffer free size fix on error (Francis)
- MD pull requests via Song:
- Reintroduce md-linear (Yu Kuai)
- md-bitmap refactor and fix (Yu Kuai)
- Replace kmap_atomic with kmap_local_page (David Reaver)
- Quite a few queue freeze and debugfs deadlock fixes
Ming introduced lockdep support for this in the 6.13 kernel, and it
has (unsurprisingly) uncovered quite a few issues
- Use const attributes for IO schedulers
- Remove bio ioprio wrappers
- Fixes for stacked device atomic write support
- Refactor queue affinity helpers, in preparation for better supporting
isolated CPUs
- Cleanups of loop O_DIRECT handling
- Cleanup of BLK_MQ_F_* flags
- Add rotational support for null_blk
- Various fixes and cleanups
* tag 'for-6.14/block-20250118' of git://git.kernel.dk/linux: (106 commits)
block: Don't trim an atomic write
block: Add common atomic writes enable flag
md/md-linear: Fix a NULL vs IS_ERR() bug in linear_add()
block: limit disk max sectors to (LLONG_MAX >> 9)
block: Change blk_stack_atomic_writes_limits() unit_min check
block: Ensure start sector is aligned for stacking atomic writes
blk-mq: Move more error handling into blk_mq_submit_bio()
block: Reorder the request allocation code in blk_mq_submit_bio()
nvme: fix bogus kzalloc() return check in nvme_init_effects_log()
md/md-bitmap: move bitmap_{start, end}write to md upper layer
md/raid5: implement pers->bitmap_sector()
md: add a new callback pers->bitmap_sector()
md/md-bitmap: remove the last parameter for bimtap_ops->endwrite()
md/md-bitmap: factor behind write counters out from bitmap_{start/end}write()
md: Replace deprecated kmap_atomic() with kmap_local_page()
md: reintroduce md-linear
partitions: ldm: remove the initial kernel-doc notation
blk-cgroup: rwstat: fix kernel-doc warnings in header file
blk-cgroup: fix kernel-doc warnings in header file
nbd: fix partial sending
...
We've got a problem with bch_stripe that is going to take an on disk
format rev to fix - we can't access the block sector counts if the
checksum type is unknown.
Document it for now, there are a few other things to fix as well.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The transaction is going to abort, so there will be no cycle involving
this transaction anymore.
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When the cycle doesn't involve the initiator of the cycle detection,
we might choose a transaction that is not involved in the cycle to abort.
It shouldn't be that since it won't break the cycle, this patch
therefore chooses the transaction in the cycle to abort.
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This patch introduces a helper function called lock_graph_pop_from,
it pops the graph from i.
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If the transaction chose itself as a victim before and restarted, it
might request a no fail lock request this time. But it might be added to
others' lock graph and be chose as the victim again, it's no longer safe
without additional check. We can also convert the cycle detector to be
fully RCU-based to solve that unsoundness, but the latency added to trans_put
and additional memory required may not worth it.
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If the lock has been acquired and unlocked, we don't have to do clear
and wakeup again, though harmless since we hold the intent lock. Merge
the condition might be clearer.
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This reverts commit 62448afee7.
six_lock_tryupgrade fails only if there is an intent lock held,
it won't fail no matter how many read locks are held.
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This adds another metadata version for accounting directory size.
For the new version of the filesystem, when new subdirectory items
are created or deleted, the parent directory's size will change
accordingly. For the old version of the existed file system, running
fsck will automatically upgrade the metadata version, and it will
do the check and recalculationg of the directory size.
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The isize of directory is 0 in bcachefs if the directory is empty.
With more child dirents created, its size ought to change. Many
other filesystems changed as that (ie. xfs and btrfs). And many of
them changed as the size of child dirent name. Although the directory
size may not seem to convey much, we can still give it some meaning.
The formula of dentry size as follow:
occupied_size = 40 + ALIGN(9 + namelen, 8)
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
check_unreachable_inodes does work in online mode, with the one caveat
that it assumes check_dirents has also run - and check_dirents is not
PASS_ONLINE yet.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Factor out a version of bch2_btree_pos_to_text() that doesn't take a
pointer to a in-memory btree node, to be used for btree node scrub.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
New helper for dropping all write locks; which is distinct from the
helper the transaction commit path uses, which is faster and only
touches updates.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This is needed for the interior update locking rework, where we'll be
holding node write locks for the duration of the update - which is
needed for synchronizing with online check_allocations.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_inum_path() should work even if the filesystem is corrupted - we
don't want it to cause fsck to fail.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If the btree_path's lock seq is wrong, the next bch2_trans_relock()
operation is guaranteed to fail and we take an unnecessary transaction
restart.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When evicting, we shouldn't leave a pointer to the key cache entry lying
around - that screws up btree path asserts we're adding.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Ensure that snapshot_tree.master_subvol is cleared when we delete the
master subvolume in a tree of snapshots, and allow for snapshot trees
that don't have a master subvolume in fsck.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Previously, fsck used the snapshot tree's master subvol for finding the
root inode number - but the master subvol might have been deleting, and
setting a new one should be a user operation; meaning we can't rely on
it existing.
Fortunately, for finding the root inode number in a tree of snapshots,
finding any associated subvolume works.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add a version of kvmalloc() that doesn't have the INT_MAX limit; large
filesystems do hit this.
We'll want to get rid of the in-memory bucket gens array, but we're not
there quite yet.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Locking considerations (possibly no longer relevant?) mean that when an
accounting update needs a new superblock replicas entry to be created,
it's deferred to the transaction commit error path.
But accounting updates for gc/fcsk aren't done from the transaction
commit path - so we need to handle
-BCH_ERR_btree_insert_need_mark_replicas locally.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_btree_iter_flags() now takes a level parameter; this fixes a bug
where using a node iterator on a leaf wouldn't set
BTREE_ITER_with_key_cache, leading to fun cache coherency bugs.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The btree node read error path already calls topology error, so this is
entirely redundant, and we're not specific enough about our error codes
- this was triggering for bucket_ref_update() errors.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Checking for writing past i_size after unlocking the folio and clearing
the dirty bit is racy, and we already check it at the start.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
In bcachefs, io_read and io_write counter record the amount
of data which has been read and written. They increase in
unit of sector, so to display correctly, they need to be
shifted to the left by the size of a sector. Other counters
like io_move, move_extent_{read, write, finish} also have
this problem.
In order to support different unit, we add extra column to
mark the counter type by using TYPE_COUNTER and TYPE_SECTORS
in BCH_PERSISTENT_COUNTERS().
Fixes: 1c6fdbd8f2 ("bcachefs: Initial commit")
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
It's time to make self healing the default: change the error action for
old filesystems to fix_safe, matching the default for current
filesystems.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Persistent cursors for inode allocation.
A free inodes btree would add substantial overhead to inode allocation
and freeing - a "next num to allocate" cursor is always going to be
faster.
We just need it to be persistent, to avoid scanning the inodes btree
from the start on startup.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This adds a new inode field, bi_depth, for directory inodes: this allows
us to make the check_directory_structure pass much more efficient.
Currently, to ensure the filesystem is fully connect and has no loops,
for every directory we follow backpointers until we find the root. But
by adding a depth counter, it sufficies to only check the parent of each
directory, and check that the parent's bi_depth is smaller.
(fsck doesn't require that bi_depth = parent->bi_depth + 1; if a rename
causes bi_depth off, but the chain to the root is still strictly
decreasing, then the algorithm still works and there's no need for fsck
to fixup the bi_depth fields).
We've already checked backpointers, so we know that every directory
(excluding the root)has a valid parent: if bi_depth is always
decreasing, every chain must terminate, and terminate at the root
directory.
bi_depth will not necessarily be correct when fsck runs, due to
directory renames - we can't change bi_depth on every child directory
when renaming a directory. That's ok; fsck will silently fix the
bi_depth field as needed, and future fsck runs will be much faster.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Now that bch2_move_get_io_opts() re-propagates changed inode io options
to bch_extent_rebalance, we can properly suport changing IO path options
for reflinked data.
Changing a per-file IO path option, either via the xattr interface or
via the BCHFS_IOC_REINHERIT_ATTRS ioctl, will now trigger a scan (the
inode number is marked as needing a scan, via
bch2_set_rebalance_needs_scan()), and rebalance will use
bch2_move_data(), which will walk the inode number and pick up the new
options.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Previously, io path option changes on a file would be picked up
automatically and applied to existing data - but not for reflinked data,
as we had no way of doing this safely. A user may have had permission to
copy (and reflink) a given file, but not write to it, and if so they
shouldn't be allowed to change e.g. nr_replicas or other options.
This uses the incompat feature mechanism in the previous patch to add a
new incompatible flag to bch_reflink_p, indicating whether a given
reflink pointer may propagate io path option changes back to the
indirect extent.
In this initial patch we're only setting it for the source extents.
We'd like to set it for the destination in a reflink copy, when the user
has write access to the source, but that requires mnt_idmap which is not
curretly plumbed up to remap_file_range.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We've been getting away from feature bits: they don't have any kind of
ordering, and thus it's possible for people to enable weird combinations
of features that were never tested or intended to be run.
Much better to just give every new feature, compatible or incompatible,
a version number.
Additionally, we probably won't ever rev the major version number: major
version numbers represent incompatible versions, but that doesn't really
fit with how we actually roll out incompatible features - we need a
better way of rolling out incompatible features.
So, this patch adds two new superblock fields:
- BCH_SB_VERSION_INCOMPAT
- BCH_SB_VERSION_INCOMPAT_ALLOWED
BCH_SB_VERSION_INCOMPAT_ALLOWED indicates that incompatible features up
to version number x are allowed to be used without user prompting, but
it does not by itself deny old versions from mounting.
BCH_SB_VERSION_INCOMPAT does deny old versions from mounting, and must
be <= BCH_SB_VERSION_INCOMPAT_ALLOWED.
BCH_SB_VERSION_INCOMPAT will only be set when a codepath attempts to use
an incompatible feature, so as to not unnecessarily break compatibility
with old versions.
bch2_request_incompat_feature() is the new interface to check if an
incompatible feature may be used.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The backpointers passes, check_backpointers_to_extents() and
check_extents_to_backpointers() are the most expensive fsck passes.
Now that we're running the same check and repair code when using a
backpointer at runtime (via bch2_backpointer_get_key()) that fsck does,
there's no reason fsck needs to - except to verify that the filesystem
really has no errors in debug mode.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Continuing on with the self healing theme, we should be running any
check and repair code at runtime that we can - instead of declaring the
filesystemt inconsistent.
This will also let us skip running the backpointers -> extents fsck pass
except in debug mode.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Instead of walking every extent and every backpointer it points to,
first sum up backpointers in each bucket and check for mismatches, and
only look for missing backpointers if mismatches were detected, and only
check extents in those buckets.
This is a major fsck scalability improvement, since the two backpointers
passes (backpointers -> extents and extents -> backpointers) are the
most expensive fsck passes by far.
Additionally, to speed up the upgrade for backpointer bucket gens, or in
situations when we have to rebuild alloc info, add a special case for
when no backpointers are found in a bucket - don't check each individual
backpointer (in particular, avoiding the write buffer flushes), just
recreate them.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
In an upcoming patch bch2_backpointer_get_key() will be repairing when
it finds a dangling backpointer; it will need to flush the btree write
buffer before it can definitively say there's an error.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch_backpointer no longer contains the bucket_offset field, it's just a
direct LBA mapping (with low bits to account for compressed extent
splitting), so we don't need to refer to the device to construct it
anymore.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Fix sort order for disk accounting keys, in order to fix a regression on
mount times.
The typetag is now the most significant byte of the key, meaning disk
accounting keys of the same type now sort together.
This lets us skip over disk accounting keys that aren't mirrored in
memory when reading accounting at startup, instead of having them
interleaved with other counter types.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
New on disk format version: backpointers new include the generation
number of the bucket they refer to, and the obsolete bucket_offset field
(no longer needed because we no longer store backpointers in alloc keys)
is gone.
This is an expensive forced upgrade - hopefully the last; we have to run
the extents_to_backpointers recovery pass to regenerate backpointers.
It's a forced incompatible upgrade because the alternative would've been
permamently making backpointers bigger, and as one of the biggest btrees
(along with the extents btree) that's not an ideal option.
It's worth it though, because this allows us to make the
check_extents_to_backpointers pass drastically cheaper: an upcoming
patch changes it to sum up backpointers in a bucket and check the sum
against the sector counts for that bucket, only looking for missing
backpointers if they don't match (and then only for specific buckets).
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Since commit 43b62ce3ff ("block: move bio io prio to a new field"), macro
bio_set_prio() does nothing but set bio->bi_ioprio. All other places just
set bio->bi_ioprio directly, so replace bio_set_prio() remaining
callsites with setting bio->bi_ioprio directly and delete that macro.
Signed-off-by: John Garry <john.g.garry@oracle.com>
Acked-by: Jack Wang <jinpu.wang@ionos.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20241202111957.2311683-3-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Switch to generating a private list of interior nodes to delete, instead
of using the equivalence class in the global data structure.
This eliminates possible races with snapshot creation, and is much
cleaner - it'll let us delete a lot of janky code for calculating and
maintaining the equivalence classes.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When deleting dead snapshots, we move keys from redundant interior
snapshot nodes to child nodes - unless there's already a key, in which
case the ancestor key is deleted.
Previously, we tracked via equiv_seen whether the child snapshot had a
key, but this was tricky w.r.t. transaction restarts, and not
transactionally safe w.r.t. updates in the child snapshot.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This breaks when the trigger is inserting updates for the same btree, as
the inode trigger now does.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Originally, we ran insert triggers before overwrite so that if an extent
was being moved (by fallocate insert/collapse range), the bucket sector
count wouldn't hit 0 partway through, and so we don't trigger state
changes caused by that too soon.
But this is better solved by just moving the data type change to the
alloc trigger itself, where it's already called.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Normally, whitouts (KEY_TYPE_whitout) are filtered from btree lookups,
since they exist only to represent deletions of keys in ancestor
snapshots - except, they should not be filtered in
BTREE_ITER_all_snapshots mode, so that e.g. snapshot deletion can clean
them up.
This means that that the key cache has to store whiteouts, and key cache
fills cannot filter them.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
In BTREE_ITER_all_snapshots mode, we're required to only return keys
where the snapshot field matches the iterator position -
BTREE_ITER_filter_snapshots requires pulling keys into the key cache
from ancestor snapshots, so we have to check for that.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Clang 18 and newer warns (or errors with CONFIG_WERROR=y):
fs/bcachefs/str_hash.c:164:2: error: label followed by a declaration is a C23 extension [-Werror,-Wc23-extensions]
164 | struct bch_inode_unpacked inode;
| ^
In Clang 17 and prior, this is an unconditional hard error:
fs/bcachefs/str_hash.c:164:2: error: expected expression
164 | struct bch_inode_unpacked inode;
| ^
fs/bcachefs/str_hash.c:165:30: error: use of undeclared identifier 'inode'
165 | ret = bch2_inode_unpack(k, &inode);
| ^
fs/bcachefs/str_hash.c:169:55: error: use of undeclared identifier 'inode'
169 | struct bch_hash_info hash2 = bch2_hash_info_init(c, &inode);
| ^
fs/bcachefs/str_hash.c:171:40: error: use of undeclared identifier 'inode'
171 | ret = repair_inode_hash_info(trans, &inode);
| ^
Add an empty statement between the label and the declaration to fix the
warning/error without disturbing the code too much.
Fixes: 2519d3b0d656 ("bcachefs: bch2_str_hash_check_key() now checks inode hash info")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202412092339.QB7hffGC-lkp@intel.com/
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_snapshot_equiv() is going away; convert users that just wanted to
know if the snapshot exists to something better
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Versions of the same inode in different snapshots must have the same
hash info; this is critical for lookups to work correctly.
We're going to be running the str_hash checks online, at readdir or
xattr list time, so we now need str_hash_check_key() to check for inode
hash seed mismatches, since it won't be run right after check_inodes().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Bkey validation checks that inodes are well-formed and unpack
successfully, so an unpack error should always indicate memory
corruption or some other kind of hardware bug - but these are still
errors we can recover from.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Since we added per-inode counters there's now far too many counters to
show in one shot - if we want this in the future, it'll have to be in
debugfs.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We can't hold mark_lock while calling fsck_err() - that's a deadlock,
mark_lock is meant to be a leaf node lock.
It's also unnecessary for gc_bucket() and bucket_gen(); rcu suffices
since the bucket_gens array describes its size, and we can't race with
device removal or resize during gc/fsck since that takes state lock.
Reported-by: syzbot+38641fcbda1aaffefdd4@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This fixes a deadlock during journal replay when btree node read errors
kick off a ton of rewrites: we don't want them competing with journal
replay.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
For each bucket we track when the bucket became nonempty and when it
became empty again: if we can ensure that there will be no journal
flushes in the range [nonempty, empty) (possibly because they occured at
the same journal sequence number), then it's safe to reuse the bucket
without waiting for a journal commit.
This is a major performance optimization for erasure coding, where
writes are initially replicated, but the extra replicas are quickly
dropped: if those buckets are reused and overwritten without issuing a
cache flush to the underlying device, then they only cost bus bandwidth.
But there's a tricky corner case when there's multiple empty -> nonempty
-> empty transitions in quick succession, i.e. when data is getting
overwritten immediately as it's being written.
If this happens and the previous empty transition hasn't been flushed,
we need to continue tracking the previous nonempty transition - not
start a new one.
Fixing this means we now need to track both the nonempty and empty
transitions in bch_alloc_v4.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Harder to screw up if we're explicit about the range, and more correct
as journal reservations can be outstanding on multiple journal entries
simultaneously.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This lets us print the exact location in the journal if it was found in
the journal, or correctly print if it was found in the superblock.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Fix an O(n^2) issue when we find many overlapping (overwritten) btree
nodes - especially when one node overwrites many smaller nodes.
This was discovered to be an issue with the bcachefs
merge_torture_flakey test - if we had a large btree that was then
emptied, the number of difficult overwrites can be unbounded.
Cc: Kuan-Wei Chiu <visitorckw@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Check open buckets and buckets waiting for journal commit before doing
other expensive lookups.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Originally, btree splits always succeeded once we got to the point of
recursing to the btree_insert_node() call.
But that changed when we switched to not taking intent locks all the way
up to the root, and that introduced a bug, because
bch2_btree_interior_update_will_free_node() cancels paending writes and
reparents a node that's going to be made visible on disk by another
btree update to the current btree update.
This was discovered in recent backpointers work, because
bch2_btree_interior_update_will_free_node() also clears the
will_make_reachable flag, causing backpointer target lookup to
spuriously thing it had found a dangling backpointer (when the
backpointer just hadn't been created yet by
btree_update_nodes_written()).
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We should always signal to rewind if the requested pass hasn't been run,
even if called multiple times.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This lets us use darray macros on dev_alloc_list (and it will become a
darray eventually, when we increase the maximum number of devices).
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When allocating a journal write fails, then retries after doing
discards, we were failing to count already allocated replicas.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add a tracepoint for inserting new accounting entries: we're seeing odd
spinning behaviour in accounting read.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The validate late path was iterating over accounting entries in
eytzinger order, which is unnecessarily tricky when we may have to
remove entries.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
we wish to use the logged ops btree for other items that aren't strictly
logged ops: cursors for inode allocation
There's no reason to create another cached btree for inode allocator
cursors - so reserve different parts of the keyspace for different
purposes.
Older versions will ignore or delete the cursors.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Introduce a typedef to handle the difference between unsigned
long/struct urcu_gp_poll_state.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When tracing is disabled, there is no point in asking the user about
enabling extra btree_path tracepoints in bcachefs.
Fixes: 32ed4a620c ("bcachefs: Btree path tracepoints")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The "journal space available" calculations didn't take into account
mismatched bucket sizes; we need to take the minimum space available out
of our devices.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add a method to flush btree node rewrites at the end of recovery, to
ensure that corrected errors are persisted.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Ensure that "invalid bkey" repair gets persisted, so that it doesn't
repeatedly spam the logs.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add a function for walking backpointers to find a path from a given
inode number, and convert various error messages to use it.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The function bch2_bucket_alloc_trans() lacked a description for the
nowait parameter in its documentation comment block. This patch adds the
missing description to ensure all parameters are properly documented.
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=12179
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When calling check_discard_freeespace_key from the allocator, we can't
repair without recursing - run it asynchronously instead.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When not compressed, these must be equal - this fixes an assertion pop
in bch2_rechecksum_bio().
Reported-by: syzbot+50d3544c9b8db9c99fd2@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We should add support for cryptographic macs on the superblock - and it
won't be hard, but it'll need an incompatible feature bit (and we have a
new incompatible feature versioning scheme coming).
For now, just add a guard to avoid a dull ptr deref in gen_poly_key().
Reported-by: syzbot+dd3d9835055dacb66f35@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This fixes an assertion pop in bch2_journal_noflush_seq() - log the
error to the superblock and continue instead.
Reported-by: syzbot+85700120f75fc10d4e18@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
transaction commits invalidate pointers to btree values, and they also
downgrade intent locks.
This breaks the interior btree update path, which takes intent locks and
then calls into the allocator.
This isn't an ideal solution: we can't unconditionally issue a restart
after a transaction commit, because that would break other codepaths.
Reported-by: syzbot+78d82470c16a49702682@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Wraparound is impractical to handle since in various places we use 0 as
a sentinal value - but 64 bits (or 56, because the btree write buffer
steals a few bits) is enough for all practical purposes.
Reported-by: syzbot+73ed43fbe826227bd4e0@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
These repair paths are well tested, we can repair them without explicit
user intervention
This also tweaks bch2_topology_error() so that we run topology repair if
we're in recovery, not just fsck.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add a new parameter to bkey validate functions, and use it to improve
invalid bkey error messages: we can now print the btree and depth it
came from, or if it came from the journal, or is a btree root.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
There's no reason to treat them as errors: just ignore them, and go with
a previous btree root if we had one.
Reported-by: syzbot+e22007d6acb9c87c2362@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Historically, we required that all btree node roots point to a valid
(possibly fake) node, but we're improving our ability to continue in the
presence of errors.
Reported-by: syzbot+e22007d6acb9c87c2362@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Previously, when mounting read-write after a clean shutdown, we wouldn't
go read-write until after all the recovery passes completed.
Now, go RW early in recovery, the same as any other situation we'll need
to go read-write. This fixes a bug where we discover unlinked inodes
after a clean shutdown: repair fails because we're read only.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Fix an assertion pop from the recent btree cache freelist fixes.
Fixes: baefd3f849 ("bcachefs: btree_cache.freeable list fixes")
Reported-by: Tyler <th020394@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6.11 had a bug where we'd sometimes create disk accounting keys with
version 0, which causes issues for journal replay - but we don't need to
delete existing accounting keys with version 0.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If a btree node says it's encrypted, but the superblock never had an
encryptino key - whoops, that needs to be handled.
Reported-by: syzbot+026f1857b12f5eb3f9e9@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The early-early allocation path, bch2_bucket_alloc_new_fs(), is no
longer needed - and inconsistencies around new_fs_bucket_idx have been a
frequent source of bugs.
Reported-by: syzbot+592425844580a6598410@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
btree_root entries for unknown btree IDs are created during recovery,
before reading those btree roots.
But btree_node_scan may find btree nodes with unknown btree IDs when we
haven't seen roots for those btrees.
Reported-by: syzbot+1f202d4da221ec6ebf8e@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If we rewind recovery to run topology repair, that causes
accounting_read to run twice.
This fixes accounting being double counted.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Accounting keys that reference invalid devices are corrected by fsck,
they shouldn't cause an emergency shutdown.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Instead of throwing standard error codes, we should be throwing
dedicated private error codes, this greatly improves debugability.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The discard bucket fastpath previously was using its own code for
discarding buckets and clearing them in the need_discard btree, which
didn't have any of the consistency checks of the main discard path.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Per reports of performance issues on mixed multi device filesystems
where we're issuing too much IO to the spinning rust - tweak this
algorithm.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
- bch2_backpointer_del()
- bch2_backpointer_maybe_flush()
Kill a bit of open coding and make sure we're properly handling the
btree write buffer.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch_backpointer.bucket_offset is going away - it's no longer needed
since we no longer store backpointers in alloc keys, the same
information is in the key position itself.
And we'll be reclaiming the space in bch_backpointer for the bucket
generation number.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_get_btree_in_memory_pos() will return positions that refer directly
to the btree it's checking will fit in memory - i.e. backpointer
positions, not buckets.
This also means check_bp_exists() no longer has to refer to the device,
and we can delete some code.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Since we no longer store backpointers in alloc keys, there's no reason
not to pass around bkey_i_backpointers; this means we don't have to pass
the bucket pos separately.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
_noerror means don't produce inconsistent errors, so it should be using
bch2_dev_rcu_noerror().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
86a494c8eef9 ("bcachefs: Kill bch2_get_next_backpointer()") dropped some
things the tracepoint emitted because bch2_evacuate_bucket() no longer
looks at the alloc key - but we did want at least some of that.
We still no longer look at the alloc key so we can't report on the
fragmentation number, but that's a direct function of dirty_sectors and
a copygc concern anyways - copygc should get its own tracepoint that
includes information from the fragmentation LRU.
But we can report on the number of sectors we moved and the bucket size.
Co-developed-by: Piotr Zalewski <pZ010001011111@proton.me>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The journal_keys array can't be substantially modified after we go RW,
because lookups need to be able to check it locklessly - thus we're
limited on what we can do when a key in the journal has been
overwritten.
This is a problem when there's many overwrites to skip over for peek()
operations. To fix this, add tracking of ranges of overwrites: we create
a range entry when there's more than one contiguous whiteout.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
To help ameloriate issues with peek operations having to skip over
deletions in the journal - just bail out if all we're doing is
prefetching btree nodes.
Since btree node prefetching runs every time we iterate to a new node,
and has to sequentially scan ahead, this avoids another O(n^2).
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
There's an unavoidable issue with btree lookups when we're overlaying
journal keys and the journal has many deletions for keys present in the
btree - peek operations will have to iterate over all those deletions to
find the next live key to return.
This is mainly a problem for lookups in interior nodes, if we have to
traverse to a leaf. Looking up an insert position in a leaf (for journal
replay) doesn't have to find the next live key, but walking down the
btree does.
So to ameloriate this, change journal key sort ordering so that we
replay keys from roots and interior nodes first.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We don't allocate the mempools for compression/decompression unless we
need them - but that means there's an inconsistency to check for.
Reported-by: syzbot+cb3fbcfb417448cfd278@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
gzip and zstd require different decompress workspace sizes, and if we
start with one and then start using the other at runtime we may not get
the correct size
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
type includes lz4 and lz4_old, which do not get different compression
workspaces, and incompressible, a fake type - BCH_COMPRESSION_OPTS() is
the correct enum to use.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Since for quite some time backpointers have only been stored in the
backpointers btree, not alloc keys (an aborted experiment, support for
which has been removed) - we can replace get_next_backpointer() with
simple btree iteration.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
try_alloc_bucket() has a "safety" check, which avoids allocating a
bucket if there's any backpointers present.
But backpointers are not the source of truth for live data in a bucket,
the bucket sector counts are; this check was fairly useless, and we're
also deferring backpointers checks from fsck to runtime in the near
future.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
With extents and snapshots, for slightly different reasons, we may have
to search forwards to find a key that compares equal to iter->pos (i.e.
a key that peek_prev() should return, as it returns keys <= iter->pos).
peek_slot() does this, and is an easy way to fix this case.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
A user contributed a filessytem dump, where the dump was actually
corrupted (due to being taken while the filesystem was online), but
which exposed an interesting bug in fsck - reconstruct_inode().
When itearting in BTREE_ITER_filter_snapshots mode, it's required to
give an end position for the iteration and it can't span inode numbers;
continuing into the next inode might mean we start seeing keys from a
different snapshot tree, that the is_ancestor() checks always filter,
thus we're never able to return a key and stop iterating.
Backwards iteration never implemented the end position because nothing
else needed it - except for reconstuct_inode().
Additionally, backwards iteration is now able to overlay keys from the
journal, which will be useful if we ever decide to start doing journal
replay in the background.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Factor out a common helper, need_discard_or_freespace_err(), which is
now used by both fsck and the runtime checks, and can repair.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
check_discard_freespace_key() was doing all the same checks as
try_alloc_bucket(), but with repair.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Change it to a normal fsck_err() - meaning it'll get repaired at runtime
when that's flipped on.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
To avoid tragic loss in the event of transient errors (i.e., a btree
node topology error that was later corrected by btree node scan), we
can't delete reflink pointers to correct errors.
This adds a new error bit to bch_reflink_p, indicating that it is known
to point to a missing indirect extent, and the error has already been
reported.
Indirect extent lookups now use bch2_lookup_indirect_extent(), which on
error reports it as a fsck_err() and sets the error bit, and clears it
if necessary on succesful lookup.
This also gets rid of the bch2_inconsistent_error() call in
__bch2_read_indirect_extent, and in the reflink_p trigger: part of the
online self healing project.
An on disk format change isn't necessary here: setting the error bit
will be interpreted by older versions as pointing to a different index,
which will also be missing - which is fine.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Better repair for reflink pointers, as well as propagating new inode
options to indirect extents, are going to require a few extra bits
bch_reflink_p: so claim a few from the high end of the destination
index.
Also add some missing bounds checking.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If we find an error that indicates that we need to run fsck, we can
specify that directly with run_explicit_recovery_pass().
These are now log_fsck_err() calls: we're just logging in the superblock
that an error occurred - and possibly doing an emergency shutdown,
depending on policy.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
alloc key validation ensures that if a bucket is in need_discard state
the sector counts are all zero - we don't have to check for that.
The NEED_INC_GEN check appears to be dead code, as well: we only see
buckets in the need_discard btree, and it's an error if they aren't in
the need_discard state.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
`inode->v.i_ino` has been initialized to `inum.inum`. If `inum.inum` and
`bi->bi_inum` are not equal, BUG_ON() is triggered in
bch2_inode_update_after_write().
Signed-off-by: Youling Tang <tangyouling@kylinos.cn>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
__filemap_get_folio the return value cannot be NULL, so unnecessary checks
are removed.
Signed-off-by: Youling Tang <tangyouling@kylinos.cn>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Use super_set_uuid() to set `sb->s_uuid_len` to avoid returning `-ENOTTY`
with sb->s_uuid_len being 0.
Original patch link:
[1]: https://lore.kernel.org/all/20240207025624.1019754-2-kent.overstreet@linux.dev/
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Youling Tang <tangyouling@kylinos.cn>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Here is the patch which uses existing constant table:
Currently, when using bcachefs-tools to set options, bool-type options
can only accept 1 or 0. Add support for accepting true/false and yes/no
for these options.
Signed-off-by: Integral <integral@murena.io>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Acked-by: David Howells <dhowells@redhat.com>
Collapse all the BTREE_ITER_filter_snapshots handling down into a single
block; btree iteration is much simpler in the !filter_snapshots case.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We're not allowed to have a dirty key in the key cache if the key
doesn't exist at all in the btree - creation has to bypass the key
cache, so that iteration over the btree can check if the key is present
in the key cache.
Things break in subtle ways if cache coherency is broken, so this needs
an assert.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We're ramping up on checking transaction restart handling correctness -
so, in debug mode we now save a backtrace for where the restart was
emitted, which makes it much easier to track down the incorrect
handling.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Generally, releasing a transaction within a transaction restart means an
unhandled transaction restart: but this can happen legitimately within
the move code, e.g. when bch2_move_ratelimit() tells us to exit before
we've retried.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
On interior btree node updates, we always verify that we're not
introducing topology errors: child nodes should exactly span the range
of the parent node.
single_device.ktest small_nodes has been popping this assert: change it
to give us more information.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We unconditionally go read-write, if we're going to do so, before
journal replay: lazy_rw is obsolete.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The journal replay keys mechanism can only be used for updates in early
recovery, when still single threaded.
Add some asserts to make sure we never accidentally use it elsewhere.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The sysfs attribute definition has been wrapped into macro:
rw_attribute, read_attribute and write_attribute, we can
use these helpers to uniform the attribute definition.
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The gc_gens_pos is used to show the status of bucket gen gc.
There is no need to assign write permissions for this attribute.
Here we can use read_attribute helper to define this attribute.
```
[Before]
$ ll internal/gc_gens_pos
-rw-r--r-- 1 root root 4096 Oct 28 15:27 internal/gc_gens_pos
[After]
$ ll internal/gc_gens_pos
-r--r--r-- 1 root root 4096 Oct 28 17:27 internal/gc_gens_pos
```
Fixes: ac516d0e7d ("bcachefs: Add the status of bucket gen gc to sysfs")
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Since bch2_move_get_io_opts() now synchronizes io_opts with options from
bch_extent_rebalance, delete the ad-hoc logic in rebalance.c that
previously did this.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_move_get_io_opts() now synchronizes options loaded from the
filesystem and inode (if present, i.e. not walking the reflink btree
directly) with options from the bch_extent_rebalance_entry, updating the
extent if necessary.
Since bch_extent_rebalance tracks where its option came from we can
preserve "inode options override filesystem options", even for indirect
extents where we don't have access to the inode the options came from.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Previously, BCHFS_IOC_REINHERIT_ATTRS didn't trigger rebalance scans
when changing rebalance options - it had been missed, only the xattr
interface triggered them.
Ideally they'd be done by the transactional trigger, but unpacking the
inode to get the options is too heavy to be done in the low level
trigger - the inode trigger is run on every extent update, since the
bch_inode.bi_journal_seq has to be updated for fsync.
bch2_write_inode() is a good compromise, it already unpacks and repacks
and is not run in any super-fast paths.
Additionally, creating the new rebalance entry to trigger the scan is
now done in the same transaction as the inode update that changed the
options.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
- Add more io path options to bch_extent_rebalance
- For each option, track whether it came from the filesystem or the
inode
This will be used for improved rebalance support for reflinked data.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This is going to be used in the bch_extent_rebalance improvements, which
propagate io_path options into the extent (important for rebalance,
which needs something present in the extent for transactionally tagging
them in the rebalance_work btree, and also for indirect extents).
By tracking in bch_extent_rebalance whether the option came from the
filesystem or the inode we can correctly handle options being changed on
indirect extents.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add the __counted_by compiler attribute to the flexible array member b
to improve access bounds-checking via CONFIG_UBSAN_BOUNDS and
CONFIG_FORTIFY_SOURCE.
Use struct_size() to calculate the number of bytes to be allocated.
Update bucket_gens->nbuckets and bucket_gens->nbuckets_minus_first when
resizing.
Compile-tested only.
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Remove hard-coded strings by using the str_write_read() helper function.
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Remove hard-coded strings by using the helper function str_write_read().
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Remove hard-coded strings by using the helper function str_write_read().
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
A user popped up with a very old (0.11) filesystem that needed repair
and wasn't recently backed up.
Reported-by: Manoa <manoa@mail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This will help with some of the btree_trans srcu lock hold time warnings
that are still turning up; submit_bio() can block for awhile if the
device is sufficiently congested.
It's not a perfect solution since blk_plug bios are submitted when
scheduling; we might want a way to disable the "submit on context
switch" behaviour, or switch to our own plugging in the future.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This part of addressing
https://github.com/koverstreet/bcachefs/issues/656
where we're getting stuck in bch2_journal_meta() in the dump tool.
We shouldn't be invoking the journal without a ref on c->writes (if
we're not RW), and there's no reason for the dump tool to be going
read-write.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
-o norecovery (used by the dump tool) should be doing the absolute
minimum amount of work to get the filesystem up and readable; we
shouldn't be running check and repair code, or going read-write.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We need to add a path for reshaping existing stripes (for e.g. device
removal), and this new path won't necessarily use ec_stripe_head.
Refactor the code to avoid unnecessary references to it for clarity.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Recovery can rewind in certain situations - when we discover we need to
run a pass that doesn't normally run.
This can happen from another thread for btree node read errors, so we
need a bit of locking.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
They can be regenerated by fsck and don't require a btree node scan,
like other alloc btrees.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
if we're not in recovery then there's no way to rewind recovery - give
this a different errcode so that any error messages will give us a
better idea of what happened.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Use the existing FOREACH_ACL_ENTRY() macro to iterate over POSIX acl
entries and remove the custom acl_for_each_entry() macro.
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The header files dirent_format.h and disk_groups_format.h are included
twice. Remove the redundant includes and the following warnings reported
by make includecheck:
disk_groups_format.h is included more than once
dirent_format.h is included more than once
Reviewed-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
hash_lookup() used to return an errorcode, and a peek_slot() call was
required to get the key it looked up. But we're adding fault injection
for transaction restarts, so fix this old unconverted code.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
A series posted previously moved all of the `struct xattr_handler`
tables to .rodata for each filesystem [1].
However, this appears to have been done shortly before bcachefs was
merged, so bcachefs was missed at that time.
Link: https://lkml.kernel.org/r/20230930050033.41174-1-wedsonaf@gmail.com [1]
Cc: Wedson Almeida Filho <wedsonaf@gmail.com>
Signed-off-by: Thomas Bertschinger <tahbertschinger@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
lock_fail_root_changed has not been used since commit
0d7009d7ca ("bcachefs: Delete old deadlock avoidance code")
Remove it.
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Also, fix a minor bug in the revert path, where we weren't checking the
journal entry type correctly.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
In userspace we don't (yet) have an SRCU implementation, so call_srcu()
recurses.
But we don't want to be invoking it under the lock anyways.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
There are a several statements with two following semicolons, replace
these with just one semicolon.
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
performs some cleanups in the resource management code.
- The series "Improve the copy of task comm" from Yafang Shao addresses
possible race-induced overflows in the management of task_struct.comm[].
- The series "Remove unnecessary header includes from
{tools/}lib/list_sort.c" from Kuan-Wei Chiu adds some cleanups and a
small fix to the list_sort library code and to its selftest.
- The series "Enhance min heap API with non-inline functions and
optimizations" also from Kuan-Wei Chiu optimizes and cleans up the
min_heap library code.
- The series "nilfs2: Finish folio conversion" from Ryusuke Konishi
finishes off nilfs2's folioification.
- The series "add detect count for hung tasks" from Lance Yang adds more
userspace visibility into the hung-task detector's activity.
- Apart from that, singelton patches in many places - please see the
individual changelogs for details.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZ0L6lQAKCRDdBJ7gKXxA
jmEIAPwMSglNPKRIOgzOvHh8MUJW1Dy8iKJ2kWCO3f6QTUIM2AEA+PazZbUd/g2m
Ii8igH0UBibIgva7MrCyJedDI1O23AA=
=8BIU
-----END PGP SIGNATURE-----
Merge tag 'mm-nonmm-stable-2024-11-24-02-05' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton:
- The series "resource: A couple of cleanups" from Andy Shevchenko
performs some cleanups in the resource management code
- The series "Improve the copy of task comm" from Yafang Shao addresses
possible race-induced overflows in the management of
task_struct.comm[]
- The series "Remove unnecessary header includes from
{tools/}lib/list_sort.c" from Kuan-Wei Chiu adds some cleanups and a
small fix to the list_sort library code and to its selftest
- The series "Enhance min heap API with non-inline functions and
optimizations" also from Kuan-Wei Chiu optimizes and cleans up the
min_heap library code
- The series "nilfs2: Finish folio conversion" from Ryusuke Konishi
finishes off nilfs2's folioification
- The series "add detect count for hung tasks" from Lance Yang adds
more userspace visibility into the hung-task detector's activity
- Apart from that, singelton patches in many places - please see the
individual changelogs for details
* tag 'mm-nonmm-stable-2024-11-24-02-05' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (71 commits)
gdb: lx-symbols: do not error out on monolithic build
kernel/reboot: replace sprintf() with sysfs_emit()
lib: util_macros_kunit: add kunit test for util_macros.h
util_macros.h: fix/rework find_closest() macros
Improve consistency of '#error' directive messages
ocfs2: fix uninitialized value in ocfs2_file_read_iter()
hung_task: add docs for hung_task_detect_count
hung_task: add detect count for hung tasks
dma-buf: use atomic64_inc_return() in dma_buf_getfile()
fs/proc/kcore.c: fix coccinelle reported ERROR instances
resource: avoid unnecessary resource tree walking in __region_intersects()
ocfs2: remove unused errmsg function and table
ocfs2: cluster: fix a typo
lib/scatterlist: use sg_phys() helper
checkpatch: always parse orig_commit in fixes tag
nilfs2: convert metadata aops from writepage to writepages
nilfs2: convert nilfs_recovery_copy_block() to take a folio
nilfs2: convert nilfs_page_count_clean_buffers() to take a folio
nilfs2: remove nilfs_writepage
nilfs2: convert checkpoint file to be folio-based
...
This runs on extents that haven't yet been validated, so we don't want
to assert that we have a valid entry type.
Reported-by: syzbot+4f29c3f12f864d8a8d17@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If the jset_entry_dev_usage is malformed, and too small, our nr_entries
calculation will be incorrect - just bail out.
Reported-by: syzbot+05d7520be047c9be86e0@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We can't assume that btrees only contain keys of a given type - even if
they only have a single key type listed in the allowed key types for
that btree; this is a forwards compatibility issue.
Reported-by: syzbot+a27c3aaa3640dd3e1dfb@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We silence btree errors in btree_node_scan, since it's probing and
errors are expected: add a fake pass so that btree_node_scan is no
longer recovery pass 0, and we don't think we're in btree node scan when
reading btree roots.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When we truncate a bset (due to it extending past the end of the btree
node), we can't skip the rest of the validation for e.g. the packed
format (if it's the first bset in the node).
Reported-by: syzbot+4d722d3c539d77c7bc82@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This fixes an assertion pop where we try to navigate to the target of
the backpointer, and the path level isn't what we expect.
Reported-by: syzbot+b17df21b4d370f2dc330@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The write buffer needs to be specifically flushed when going RO: keys in
the journal that haven't yet been moved to the write buffer don't have a
journal pin yet.
This fixes numerous syzbot bugs, all with symptoms of still doing writes
after we've got RO.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If we error in data_update_init() after adding to the rhashtable of
outstanding promotes, kfree_rcu() is required.
Reported-by: Reed Riley <reed@riley.engineer>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Change OPT_STR max value to be 1 less than the "ARRAY_SIZE" of "_choices"
array. As a result, remove -1 from (opt->max-1) in bch2_opt_to_text.
The "_choices" array is a null-terminated array, so computing the maximum
using "ARRAY_SIZE" without subtracting 1 yields an incorrect result. Since
bch2_opt_validate don't subtract 1, as bch2_opt_to_text does, values
bigger than the actual maximum would pass through option validation.
Reported-by: syzbot+bee87a0c3291c06aa8c6@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=bee87a0c3291c06aa8c6
Fixes: 63c4b25453 ("bcachefs: Better superblock opt validation")
Suggested-by: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Piotr Zalewski <pZ010001011111@proton.me>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When allocating new btree nodes, we were leaving them on the freeable
list - unlocked - allowing them to be reclaimed: ouch.
Additionally, bch2_btree_node_free_never_used() ->
bch2_btree_node_hash_remove was putting it on the freelist, while
bch2_btree_node_free_never_used() was putting it back on the btree
update reserve list - ouch.
Originally, the code was written to always keep btree nodes on a list -
live or freeable - and this worked when new nodes were kept locked.
But now with the cycle detector, we can't keep nodes locked that aren't
tracked by the cycle detector; and this is fine as long as they're not
reachable.
We also have better and more robust leak detection now, with memory
allocation profiling, so the original justification no longer applies.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The perf_test does not check the number of iterations and threads
when it is zero. If nr_thread is 0, the perf test will keep
waiting for wakekup. If iteration is 0, it will cause exception
of division by zero. This can be reproduced by:
echo "rand_insert 0 1" > /sys/fs/bcachefs/${uuid}/perf_test
or
echo "rand_insert 1 0" > /sys/fs/bcachefs/${uuid}/perf_test
Fixes: 1c6fdbd8f2 ("bcachefs: Initial commit")
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bio_kmalloc may return NULL, will cause NULL pointer dereference.
Add check NULL return for bio_kmalloc in journal_read_bucket.
Signed-off-by: Pei Xiao <xiaopei01@kylinos.cn>
Fixes: ac10a9611d ("bcachefs: Some fixes for building in userspace")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If BCH_FS_may_go_rw is not yet set, it indicates to the transaction
commit path that updates should be done via the list of journal replay
keys.
This must be set before multithreaded use commences.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If a btree split picks a pivot that's being deleted by a btree node
merge, we're going to have problems.
Fix this by checking if the pivot is being deleted, the same as we check
for deletions in journal replay keys.
Found by single_devic.ktest small_nodes.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Syzbot found an assertion pop, by generating an ancient filesystem
version with an invalid bkey_format (with fields that can overflow) as
well as packed keys that aren't representable unpacked.
This breaks key comparisons in all sorts of painful ways.
Filesystems have been automatically rewriting nodes with such invalid
formats for years; we can safely drop support for them.
Reported-by: syzbot+8a0109511de9d4b61217@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bucket_gen() checks if we're lookup up a valid bucket and returns NULL
otherwise, but bucket_gen_get() was failing to check; other callers were
correct.
Also do a bit of cleanup on callers.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Replace the swp function pointer in the min_heap_callbacks of bcachefs
with NULL, allowing direct usage of the default builtin swap
implementation. This modification simplifies the code and improves
performance by removing unnecessary function indirection.
Link: https://lkml.kernel.org/r/20241020040200.939973-10-visitorckw@gmail.com
Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw>
Cc: Coly Li <colyli@suse.de>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: "Liang, Kan" <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Sakai <msakai@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Refactor the bcachefs code to remove multiple redundant declarations of
min_heap_callbacks, ensuring that each unique declaration appears only
once.
Link: https://lore.kernel.org/20241017095520.GV16066@noisy.programming.kicks-ass.net
Link: https://lkml.kernel.org/r/20241020040200.939973-9-visitorckw@gmail.com
Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw>
Cc: Coly Li <colyli@suse.de>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: "Liang, Kan" <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Sakai <msakai@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "Enhance min heap API with non-inline functions and
optimizations", v2.
Add non-inline versions of the min heap API functions in lib/min_heap.c
and updates all users outside of kernel/events/core.c to use these
non-inline versions. To mitigate the performance impact of indirect
function calls caused by the non-inline versions of the swap and compare
functions, a builtin swap has been introduced that swaps elements based on
their size. Additionally, it micro-optimizes the efficiency of the min
heap by pre-scaling the counter, following the same approach as in
lib/sort.c. Documentation for the min heap API has also been added to the
core-api section.
This patch (of 10):
All current min heap API functions are marked with '__always_inline'.
However, as the number of users increases, inlining these functions
everywhere leads to a increase in kernel size.
In performance-critical paths, such as when perf events are enabled and
min heap functions are called on every context switch, it is important to
retain the inline versions for optimal performance. To balance this, the
original inline functions are kept, and additional non-inline versions of
the functions have been added in lib/min_heap.c.
Link: https://lkml.kernel.org/r/20241020040200.939973-1-visitorckw@gmail.com
Link: https://lore.kernel.org/20240522161048.8d8bbc7b153b4ecd92c50666@linux-foundation.org
Link: https://lkml.kernel.org/r/20241020040200.939973-2-visitorckw@gmail.com
Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw>
Cc: Coly Li <colyli@suse.de>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Kuan-Wei Chiu <visitorckw@gmail.com>
Cc: "Liang, Kan" <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Sakai <msakai@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Various syzbot fixes, and the more notable ones:
- Fix for pointers in an extent overflowing the max (16) on a filesystem
with many devices: we were creating too many cached copies when moving
data around. Now, we only create at most one cached copy if there's a
promote target set.
Caching will be a bit broken for reflinked data until 6.13: I have
larger series queued up which significantly improves the plumbing for
data options down into the extent (bch_extent_rebalance) to fix this.
- Fix for deadlock on -ENOSPC on tiny filesystems
Allocation from the partial open_bucket list wasn't correctly
accounting partial open_buckets as free: this fixes the main cause of
tests timing out in the automated tests.
-----BEGIN PGP SIGNATURE-----
iQJOBAABCAA4FiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmckUrIaHGtlbnQub3Zl
cnN0cmVldEBsaW51eC5kZXYACgkQE6szbY3KbnaFYA//Qd9SD8v+ypavnaogWhqk
3bufCO8YJDV5DRQVuX/z36fia8zOKzWQGRAYvq0vF0mmgagwBE+AcBh6vvfDCxqZ
m1937IXcv/hHh2FFQau9gWEItTH9dGwyQjeDjB3xaTL5ZTGsAdA9558ygf8GAVOe
wD+W8Z8Qj09hAErnNS7y50t/PGbZDuG7AV2Dy2unp+fp6U0FVrZ3Z0bhFuhxcR7/
e3j49DoW4EZL7Gu1svn7nzehjWK4wx1wX7QhynPgSOVIhdj2Fc3XG76b3mBsuZF6
A/cBRKmSZsYL9MBK0vferqizqeuwlIJsvwpo/6zzukpyf8QOl+0IqPuAXFoz8vg3
vrdp9cdvzWvQNexTD2+7PYosCKoUswOvo0oIy8Iopkg4VGSreZib1sZeCPzw2FBK
AZcQaQSBLKojWpYsn9Dl2AlqEHHTvnopjr5wRXiimqKe/OcA3ugIvebUw2UE2ACp
/Z2ZQu615BtRYQM+dRIJJQ2CAy0F3EZxIXEXwc/yrH7kL2VBay8QCKp/k/9YYy4e
Nlxxw7alb/XGgT8GQgu24tho3yMKT621dLFOaAZ7x2HtLP8T56zL/L/wKWsocW/V
R8Kwqot6F1EVb3Q0LECUJottYQ+5I1Et7ZpVyOPxfqF1y7KOsuxKOmZFLO7i3Spc
fg0gOt/fyKrAF3zuSmWXne8=
=hzm/
-----END PGP SIGNATURE-----
Merge tag 'bcachefs-2024-10-31' of git://evilpiepirate.org/bcachefs
Pull bcachefs fixes from Kent Overstreet:
"Various syzbot fixes, and the more notable ones:
- Fix for pointers in an extent overflowing the max (16) on a
filesystem with many devices: we were creating too many cached
copies when moving data around. Now, we only create at most one
cached copy if there's a promote target set.
Caching will be a bit broken for reflinked data until 6.13: I have
larger series queued up which significantly improves the plumbing
for data options down into the extent (bch_extent_rebalance) to fix
this.
- Fix for deadlock on -ENOSPC on tiny filesystems
Allocation from the partial open_bucket list wasn't correctly
accounting partial open_buckets as free: this fixes the main cause
of tests timing out in the automated tests"
* tag 'bcachefs-2024-10-31' of git://evilpiepirate.org/bcachefs:
bcachefs: Fix NULL ptr dereference in btree_node_iter_and_journal_peek
bcachefs: fix possible null-ptr-deref in __bch2_ec_stripe_head_get()
bcachefs: Fix deadlock on -ENOSPC w.r.t. partial open buckets
bcachefs: Don't filter partial list buckets in open_buckets_to_text()
bcachefs: Don't keep tons of cached pointers around
bcachefs: init freespace inited bits to 0 in bch2_fs_initialize
bcachefs: Fix unhandled transaction restart in fallocate
bcachefs: Fix UAF in bch2_reconstruct_alloc()
bcachefs: fix null-ptr-deref in have_stripes()
bcachefs: fix shift oob in alloc_lru_idx_fragmentation
bcachefs: Fix invalid shift in validate_sb_layout()
Add NULL check for key returned from bch2_btree_and_journal_iter_peek in
btree_node_iter_and_journal_peek to avoid NULL ptr dereference in
bch2_bkey_buf_reassemble.
When key returned from bch2_btree_and_journal_iter_peek is NULL it means
that btree topology needs repair. Print topology error message with
position at which node wasn't found, its parent node information and
btree_id with level.
Return error code returned by bch2_topology_error to ensure that topology
error is handled properly by recovery.
Reported-by: syzbot+005ef9aa519f30d97657@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=005ef9aa519f30d97657
Fixes: 5222a4607c ("bcachefs: BTREE_ITER_WITH_JOURNAL")
Suggested-by: Alan Huang <mmpgouride@gmail.com>
Suggested-by: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Piotr Zalewski <pZ010001011111@proton.me>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The function ec_new_stripe_head_alloc() returns nullptr if kzalloc()
fails. It is crucial to verify its return value before dereferencing
it to avoid a potential nullptr dereference.
Fixes: 035d72f72c ("bcachefs: bch2_ec_stripe_head_get() now checks for change in rw devices")
Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Open buckets on the partial list should not count as allocated when
we're trying to allocate from the partial list.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We had a bug report where the data update path was creating an extent
that failed to validate because it had too many pointers; almost all of
them were cached.
To fix this, we have:
- want_cached_ptr(), a new helper that checks if we even want a cached
pointer (is on appropriate target, device is readable).
- bch2_extent_set_ptr_cached() now only sets a pointer cached if we want
it.
- bch2_extent_normalize_by_opts() now ensures that we only have a single
cached pointer that we want.
While working on this, it was noticed that this doesn't work well with
reflinked data and per-file options. Another patch series is coming that
plumbs through additional io path options through bch_extent_rebalance,
with improved option handling.
Reported-by: Reed Riley <reed@riley.engineer>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Initialize freespace_initialized bits to 0 in member's flags and update
member's cached version for each device in bch2_fs_initialize.
It's possible for the bits to be set to 1 before fs is initialized and if
call to bch2_trans_mark_dev_sbs (just before bch2_fs_freespace_init) fails
bits remain to be 1 which can later indirectly trigger BUG condition in
bch2_bucket_alloc_freelist during shutdown.
Reported-by: syzbot+2b6a17991a6af64f9489@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=2b6a17991a6af64f9489
Fixes: bbe682c767 ("bcachefs: Ensure devices are always correctly initialized")
Suggested-by: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Piotr Zalewski <pZ010001011111@proton.me>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
c->btree_roots_known[i].b can be NULL. In this case, a NULL pointer dereference
occurs, so you need to add code to check the variable.
Reported-by: syzbot+b468b9fef56949c3b528@syzkaller.appspotmail.com
Fixes: 7773df19c3 ("bcachefs: metadata version bucket_stripe_sectors")
Signed-off-by: Jeongjun Park <aha310510@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The size of a.data_type is set abnormally large, causing shift-out-of-bounds.
To fix this, we need to add validation on a.data_type in
alloc_lru_idx_fragmentation().
Reported-by: syzbot+7f45fa9805c40db3f108@syzkaller.appspotmail.com
Fixes: 260af1562e ("bcachefs: Kill alloc_v4.fragmentation_lru")
Signed-off-by: Jeongjun Park <aha310510@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Lots of hotfixes:
- transaction restart injection has been shaking out a few things
- fix a data corruption in the buffered write path on -ENOSPC, found by
xfstests generic/299
- Some small show_options fixes
- Repair mismatches in inode hash type, seed: different snapshot
versions of an inode must have the same hash/type seed, used for
directory entries and xattrs. We were checking the hash seed, but not
the type, and a user contributed a filesystem where the hash type on
one inode had somehow been flipped; these fixes allow his filesystem
to repair.
Additionally, the hash type flip made some directory entries
invisible, which were then recreated by userspace; so the hash check
code now checks for duplicate non dangling dirents, and renames one of
them if necessary.
- Don't use wait_event_interruptible() in recovery: this fixes some
filesystems failing to mount with -ERESTARTSYS
- Workaround for kvmalloc not supporting > INT_MAX allocations, causing
an -ENOMEM when allocating the sorted array of journal keys: this
allows a 75 TB filesystem to mount
- Make sure bch_inode_unpacked.bi_snapshot is set in the old inode
compat path: this alllows Marcin's filesystem (in use since before
6.7) to repair and mount.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmcX4vYACgkQE6szbY3K
bnbywxAArBfIJfshWq5Wk9WztenzUmyUmV2HIgntT/iN4ty4eIpZ26VSvHcGvgkU
j3wx+OuxMTPBGc3fjUS+gALf/BGcQEgh6oPZCV+6M3kasTzNzG2jYOCkLqKbpcO1
V5n/Le/SM1X2grkgTm/H+TulGHNgG9gJ2U4kjihroJrTbTesZhzcW/qlz6RWo7U1
02NvLop4WE9M6WaW9RzsHK2llRUAl2Z3oRMuwNz3IIijCpm98STGD4gyvGoMV2b8
qNsXjy7b2lkYObKI29yWF0caRzWK1LRz79afRlnNVSJb6DK1QB83ms5Qa8rprCU4
uOq0wsGWyg6lzwQ19X+2TvUYABopVk2HXLlzTO/lJrWeMTuYJVPZ7KZi3l6ubw5T
GIsAD5qMdCm8E5nXX8hG//0rOIl6QK288+zMQyRCvAkCL+iN2k0TU8qKAEEC44de
vj6ZyNqbuLR39LLz9K09ZhzIZGk09ELpxOJ2Wwwj4ZFriwphWDtFgBtBUpNo/KWA
inBfq2lZJsmNjfns9vCqOmNOStOJxXnyMOR25sTv7wM69QPGkl41dPY3oeuG8lRk
cU/qJQKlpTKJbFeXiEKWKDnMzWxOnovqLFC0tKu2qAYM6vAz+AtwTXgthVFGh21U
QoUDbsnQCCixMkS2AksCo7nivLrxmV/EeYm5pgeiU38VdA5ofBM=
=OpYN
-----END PGP SIGNATURE-----
Merge tag 'bcachefs-2024-10-22' of https://github.com/koverstreet/bcachefs
Pull bcachefs fixes from Kent Overstreet:
"Lots of hotfixes:
- transaction restart injection has been shaking out a few things
- fix a data corruption in the buffered write path on -ENOSPC, found
by xfstests generic/299
- Some small show_options fixes
- Repair mismatches in inode hash type, seed: different snapshot
versions of an inode must have the same hash/type seed, used for
directory entries and xattrs. We were checking the hash seed, but
not the type, and a user contributed a filesystem where the hash
type on one inode had somehow been flipped; these fixes allow his
filesystem to repair.
Additionally, the hash type flip made some directory entries
invisible, which were then recreated by userspace; so the hash
check code now checks for duplicate non dangling dirents, and
renames one of them if necessary.
- Don't use wait_event_interruptible() in recovery: this fixes some
filesystems failing to mount with -ERESTARTSYS
- Workaround for kvmalloc not supporting > INT_MAX allocations,
causing an -ENOMEM when allocating the sorted array of journal
keys: this allows a 75 TB filesystem to mount
- Make sure bch_inode_unpacked.bi_snapshot is set in the old inode
compat path: this alllows Marcin's filesystem (in use since before
6.7) to repair and mount"
* tag 'bcachefs-2024-10-22' of https://github.com/koverstreet/bcachefs: (26 commits)
bcachefs: Set bch_inode_unpacked.bi_snapshot in old inode path
bcachefs: Mark more errors as AUTOFIX
bcachefs: Workaround for kvmalloc() not supporting > INT_MAX allocations
bcachefs: Don't use wait_event_interruptible() in recovery
bcachefs: Fix __bch2_fsck_err() warning
bcachefs: fsck: Improve hash_check_key()
bcachefs: bch2_hash_set_or_get_in_snapshot()
bcachefs: Repair mismatches in inode hash seed, type
bcachefs: Add hash seed, type to inode_to_text()
bcachefs: INODE_STR_HASH() for bch_inode_unpacked
bcachefs: Run in-kernel offline fsck without ratelimit errors
bcachefs: skip mount option handle for empty string.
bcachefs: fix incorrect show_options results
bcachefs: Fix data corruption on -ENOSPC in buffered write path
bcachefs: bch2_folio_reservation_get_partial() is now better behaved
bcachefs: fix disk reservation accounting in bch2_folio_reservation_get()
bcachefS: ec: fix data type on stripe deletion
bcachefs: Don't use commit_do() unnecessarily
bcachefs: handle restarts in bch2_bucket_io_time_reset()
bcachefs: fix restart handling in __bch2_resume_logged_op_finsert()
...
This fixes a fsck bug on a very old filesystem (pre mainline merge).
Fixes: 72350ee0ea ("bcachefs: Kill snapshot arg to fsck_write_inode()")
Reported-by: Marcin Mirosław <marcin@mejor.pl>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
kvmalloc() doesn't support allocations > INT_MAX, but vmalloc() does -
the limit should be lifted, but we can work around this for now.
A user with a 75 TB filesystem reported the following journal replay
error:
https://github.com/koverstreet/bcachefs/issues/769
In journal replay we have to sort and dedup all the keys from the
journal, which means we need a large contiguous allocation. Given that
the user has 128GB of ram, the 2GB limit on allocation size has become
far too small.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Fix a bug where mount was failing with -ERESTARTSYS:
https://github.com/koverstreet/bcachefs/issues/741
We only want the interruptible wait when called from fsync.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
hash_check_key() checks and repairs the hash table btrees: dirents and
xattrs are open addressing hash tables.
We recently had a corruption reported where the hash type on an inode
somehow got flipped, which made the existing dirents invisible and
allowed new ones to be created with the same name.
Now, hash_check_key() can repair duplicates: it will delete one of them,
if it has an xattr or dangling dirent, but if it has two valid dirents
one of them gets renamed.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Different versions of the same inode (same inode number, different
snapshot ID) must have the same hash seed and type - lookups require
this, since they see keys from different snapshots simultaneously.
To repair we only need to make the inodes consistent, hash_check_key()
will do the rest.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This helped with discovering some filesystem corruption fsck has having
trouble with: the str_hash type had gotten flipped on one snapshot's
version of an inode.
All versions of a given inode number have the same hash seed and hash
type, since lookups will be done with a single hash/seed and type and
see dirents/xattrs from multiple snapshots.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The options parse in get_tree will split the options buffer, it will
get the empty string for last one by strsep(). After commit
ea0eeb89b1d5 ("bcachefs: reject unknown mount options") is merged,
unknown mount options is not allowed (here is empty string), and this
causes this errors. This can be reproduced just by the following steps:
bcachefs format /dev/loop
mount -t bcachefs -o metadata_target=loop1 /dev/loop1 /mnt/bcachefs/
Fixes: ea0eeb89b1d5 ("bcachefs: reject unknown mount options")
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When call show_options in bcachefs, the options buffer is appeneded
to the seq variable. In fact, it requires an additional comma to be
appended first. This will affect the remount process when reading
existing mount options.
Fixes: 9305cf91d05e ("bcachefs: bch2_opts_to_text()")
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Found by generic/299: When we have to truncate a write due to -ENOSPC,
we may have to read in the folio we're writing to if we're now no longer
doing a complete write to a !uptodate folio.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_folio_reservation_get_partial(), on partial success, will now
return a reservation that's aligned to the filesystem blocksize.
This is a partial fix for fstests generic/299 - fio verify is badly
behaved in the presence of short writes that aren't aligned to its
blocksize.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_disk_reservation_put() zeroes out the reservation - oops.
This fixes a disk reservation leak when getting a quota reservation
returned an error.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Using commit_do() to call alloc_sectors_start_trans() breaks when we're
randomly injecting transaction restarts - the restart in the commit
causes us to leak the lock that alloc_sectorS_start_trans() takes.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_bucket_io_time_reset() doesn't need to succeed, which is why it
didn't previously retry on transaction restart - but we're now treating
these as errors.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This is ugly:
We may discover in alloc_write_key that the data type we calculated is
wrong, because BCH_DATA_need_discard is checked/set elsewhere, and the
disk accounting counters we calculated need to be updated.
But bch2_alloc_key_to_dev_counters(..., BTREE_TRIGGER_gc) is not safe
w.r.t. transaction restarts, so we need to propagate the fixup back to
our gc state in case we take a transaction restart.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
this one is fairly harmless since the invalidate worker will just run
again later if it needs to, but still worth fixing
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This should be impossible to hit in practice; the first lookup within a
transaction won't return a restart due to lock ordering, but we're
adding fault injection for transaction restarts and shaking out bugs.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
- New metadata version inode_has_child_snapshots
This fixes bugs with handling of unlinked inodes + snapshots, in
particular when an inode is reattached after taking a snapshot;
deleted inodes now get correctly cleaned up across snapshots.
- Disk accounting rewrite fixes
- validation fixes for when a device has been removed
- fix journal replay failing with "journal_reclaim_would_deadlock"
- Some more small fixes for erasure coding + device removal
- Assorted small syzbot fixes
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmcNw4UACgkQE6szbY3K
bnbSzBAAmSCCQCqRwnFSp4OdNSlBK9q1e5WsbKOqHgtoXZU/mOUBe/5bnPPqm6Mg
GkTc7FqVOs/95/rEDKXw2LneFgxRrt8MriJCUdXZvV5fC2R4Kdl0TkwABtMtm2Ae
wp37n6iQO81j4uZHfOj67RzC2NRo7dMdun5HnQPRBTKzyuDaZXqwjMmF2LmaeODh
oiBFUvD5nFBo5XvXPABBin6xpdquHO+6ZWf6SFD4+iRe11NrJAOAIS/crJvxsFfr
I/X152Z+gzKPE+NhANKMxlHyNnVGo7iHUqhUjVuI4SSaXb9Ap6k4sXgfoIzncR17
GA5qWtaNS1W72+awT3R2EaF9Tqi+Vng2RVfxxQ04giImnBq0eziOjlZ26enOE0LU
0ZZrBFzqpItqYbNnzPissHuKb1mAQGPWy6kxoGIrqDKbichA7lzyWDz2lgEE85Sx
E1mvHwYbKhUuLC4c4460hueGVUgMWmjqM3E8oex+oNDpauPB+/bnYkcgZEG2RBla
+ZlDL28fg4fxtqlUrOQeonQ1RecGNdRMJz7xiGnkYU9rQpUuv8QwFiBZGAbLP6zn
6fbFZGxS/pO95sY7GmAtKz7ZgKxJQCzII4s+Oht5AgOvoBlPjAiol1UbwYadYQxz
HKF+WBaPC9z/L6JjP+gx+uUzTWRIfBmhHylhWbKr4vLGfx3Jc1g=
=Rkq2
-----END PGP SIGNATURE-----
Merge tag 'bcachefs-2024-10-14' of git://evilpiepirate.org/bcachefs
Pull bcachefs fixes from Kent Overstreet:
- New metadata version inode_has_child_snapshots
This fixes bugs with handling of unlinked inodes + snapshots, in
particular when an inode is reattached after taking a snapshot;
deleted inodes now get correctly cleaned up across snapshots.
- Disk accounting rewrite fixes
- validation fixes for when a device has been removed
- fix journal replay failing with "journal_reclaim_would_deadlock"
- Some more small fixes for erasure coding + device removal
- Assorted small syzbot fixes
* tag 'bcachefs-2024-10-14' of git://evilpiepirate.org/bcachefs: (27 commits)
bcachefs: Fix sysfs warning in fstests generic/730,731
bcachefs: Handle race between stripe reuse, invalidate_stripe_to_dev
bcachefs: Fix kasan splat in new_stripe_alloc_buckets()
bcachefs: Add missing validation for bch_stripe.csum_granularity_bits
bcachefs: Fix missing bounds checks in bch2_alloc_read()
bcachefs: fix uaf in bch2_dio_write_done()
bcachefs: Improve check_snapshot_exists()
bcachefs: Fix bkey_nocow_lock()
bcachefs: Fix accounting replay flags
bcachefs: Fix invalid shift in member_to_text()
bcachefs: Fix bch2_have_enough_devs() for BCH_SB_MEMBER_INVALID
bcachefs: __wait_for_freeing_inode: Switch to wait_bit_queue_entry
bcachefs: Check if stuck in journal_res_get()
closures: Add closure_wait_event_timeout()
bcachefs: Fix state lock involved deadlock
bcachefs: Fix NULL pointer dereference in bch2_opt_to_text
bcachefs: Release transaction before wake up
bcachefs: add check for btree id against max in try read node
bcachefs: Disk accounting device validation fixes
bcachefs: bch2_inode_or_descendents_is_open()
...
sysfs warns if we're removing a symlink from a directory that's no
longer in sysfs; this is triggered by fstests generic/730, which
simulates hot removal of a block device.
This patch is however not a correct fix, since checking
kobj->state_in_sysfs on a kobj owned by another subsystem is racy.
A better fix would be to add the appropriate check to
sysfs_remove_link() - and sysfs_create_link() as well.
But kobject_add_internal()/kobject_del() do not as of today have locking
that would support that.
Note that the block/holder.c code appears to be subject to this race as
well.
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When creating a new stripe, we may reuse an existing stripe that has
some empty and some nonempty blocks.
Generally, the existing stripe won't change underneath us - except for
block sector counts, which we copy to the new key in
ec_stripe_key_update.
But the device removal path can now invalidate stripe pointers to a
device, and that can race with stripe reuse.
Change ec_stripe_key_update() to check for and resolve this
inconsistency.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We were checking that the alloc key was for a valid device, but not a
valid bucket.
This is the upgrade path from versions prior to bcachefs being mainlined.
Reported-by: syzbot+a1b59c8e1a3f022fd301@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Check if we have snapshot_trees or subvolumes that refer to the snapshot
node being reconstructed, and use them.
With this, the kill_btree_root test that blows away the snapshots btree
now passes, and we're able to successfully reconstruct.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
BCH_TRANS_COMMIT_journal_reclaim without BCH_WATERMARK_reclaim means
"return an error if low on journal space" - but accounting replay must
succeed.
Fixes https://github.com/koverstreet/bcachefs/issues/656
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
this was likely originally cribbed, and has been dead code, and Al is
working on removing it from the tree.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Like how we already do when the allocator seems to be stuck, check if
we're waiting too long for a journal reservation and print some debug
info.
This is specifically to track down
https://github.com/koverstreet/bcachefs/issues/656
which is showing up in userspace where we don't have sysfs/debugfs to
get the journal debug info.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This patch adds a bounds check to the bch2_opt_to_text function to prevent
NULL pointer dereferences when accessing the opt->choices array. This
ensures that the index used is within valid bounds before dereferencing.
The new version enhances the readability.
Reported-and-tested-by: syzbot+37186860aa7812b331d5@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=37186860aa7812b331d5
Signed-off-by: Mohammed Anees <pvmohammedanees2003@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We will get this if we wake up first:
Kernel panic - not syncing: btree_node_write_done leaked btree_trans
since there are still transactions waiting for cycle detectors after
BTREE_NODE_write_in_flight is cleared.
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
- Fix failure to validate that accounting replicas entries point to
valid devices: this wasn't a real bug since they'd be cleaned up by
GC, but is still something we should know about
- Fix failure to validate that dev_data_type entries point to valid
devices: this does fix a real bug, since bch2_accounting_read() would
then try to copy the counters to that device and pop an inconsistent
error when the device didn't exist
- Remove accounting entries that are zeroed or invalid: if we're not
validating them we need to get rid of them: they might not exist in
the superblock, so we need the to trigger the superblock mark path
when they're readded.
This fixes the replication.ktest rereplicate test, which was failing
with "superblock not marked for replicas..."
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
fsck can now correctly check if inodes in interior snapshot nodes are
open/in use.
- Tweak the vfs inode rhashtable so that the subvolume ID isn't hashed,
meaning inums in different subvolumes will hash to the same slot. Note
that this is a hack, and will cause problems if anyone ever has the
same file in many different snapshots open all at the same time.
- Then check if any of those subvolumes is a descendent of the snapshot
ID being checked
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
There's an inherent race in taking a snapshot while an unlinked file is
open, and then reattaching it in the child snapshot.
In the interior snapshot node the file will appear unlinked, as though
it should be deleted - it's not referenced by anything in that snapshot
- but we can't delete it, because the file data is referenced by the
child snapshot.
This was being handled incorrectly with
propagate_key_to_snapshot_leaves() - but that doesn't resolve the
fundamental inconsistency of "this file looks like it should be deleted
according to normal rules, but - ".
To fix this, we need to fix the rule for when an inode is deleted. The
previous rule, ignoring snapshots (there was no well-defined rule
for with snapshots) was:
Unlinked, non open files are deleted, either at recovery time or
during online fsck
The new rule is:
Unlinked, non open files, that do not exist in child snapshots, are
deleted.
To make this work transactionally, we add a new inode flag,
BCH_INODE_has_child_snapshot; it overrides BCH_INODE_unlinked when
considering whether to delete an inode, or put it on the deleted list.
For transactional consistency, clearing it handled by the inode trigger:
when deleting an inode we check if there are parent inodes which can now
have the BCH_INODE_has_child_snapshot flag cleared.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Patch series "remove PF_MEMALLOC_NORECLAIM" v3.
This patch (of 2):
bch2_new_inode relies on PF_MEMALLOC_NORECLAIM to try to allocate a new
inode to achieve GFP_NOWAIT semantic while holding locks. If this
allocation fails it will drop locks and use GFP_NOFS allocation context.
We would like to drop PF_MEMALLOC_NORECLAIM because it is really
dangerous to use if the caller doesn't control the full call chain with
this flag set. E.g. if any of the function down the chain needed
GFP_NOFAIL request the PF_MEMALLOC_NORECLAIM would override this and
cause unexpected failure.
While this is not the case in this particular case using the scoped gfp
semantic is not really needed bacause we can easily pus the allocation
context down the chain without too much clutter.
[akpm@linux-foundation.org: fix kerneldoc warnings]
Link: https://lkml.kernel.org/r/20240926172940.167084-1-mhocko@kernel.org
Link: https://lkml.kernel.org/r/20240926172940.167084-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz> # For vfs changes
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: James Morris <jmorris@namei.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Serge E. Hallyn <serge@hallyn.com>
Cc: Yafang Shao <laoar.shao@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
BCH_INODE_i_size_dirty dates from before we had logged operations for
truncate (as well as finsert) - it hasn't been needed since before
bcachefs was mainlined.
BCH_INODE_i_sectors_dirty hasn't been needed since we started always
updating i_sectors transactionally - it's been unused for even longer.
BCH_INODE_backptr_untrusted also hasn't been used since prior to
mainlining; when unlinking a hardling, we zero out the backpointer
fields if they're for the dirent being removed.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When we find an unreachable inode, we now reattach it in the oldest
version that needs to be reattached (thus avoiding redundant work
reattaching every single version), and we now fix up inode -> dirent
backpointers in newer versions as needed - or white out the reattaching
dirent in newer versions, if the newer version isn't supposed to be
reattached.
This results in the second verify fsck now passing cleanly after
repairing on a user-provided filesystem image with thousands of
different snapshots.
Reported-by: Christopher Snowhill <chris@kode54.net>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
With inode backpointers, we can write a very simple
check_unreachable_inodes() pass that only looks for non-unlinked inodes
that are missing backpointers, and reattaches them.
This simplifies check_directory_structure() so that it's now only
checking for directory structure loops,
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
A lot of little fixes, bigger ones include:
- bcachefs's __wait_on_freeing_inode() was broken in rc1 due to vfs
changes, now fixed along with another lost wakeup
- fragmentation LRU fixes; fsck now repairs successfully (this is the
data structure copygc uses); along with some nice simplification.
- Rework logged op error handling, so that if logged op replay errors
(due to another filesystem error) we delete the logged op instead of
going into an infinite loop)
- Various small filesystem connectivitity repair fixes
The final part of this patch series, fixing snapshots + unlinked file
handling, is now out on the list - I'm giving that part of the series
more time for user testing.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmcBhkIACgkQE6szbY3K
bnYt8RAAqZo6RcN91sgz6xGsJkUvE6DS4Rtj1J4vlVAmuiIa5NUhRhqFnS6j8V9A
AWZw63JwTizrglbLk4Z4knfiViT4GOeiKX4sttaJk7cLW7bxwCUddlho1G5Q7q0I
PFurYevqG1ltcl5oZpD6LhZiqEhndQI3XnkpEvKsmoXy9TSB4KEqaU8Y+cewjq4q
KCFuxTBhbmatxP9eTGuDhd6uWw5h0EVDGQyMitEcSutIaernGlSsBQ8gZ5n9dWSd
lP91qFT5iypmCMo9Arf8Fq1YBvOpV6P91eq8YPa4A3sKDfzHn3CCzsSyjUiGK0RM
Wcl+kNwqYJa7Fwtb7aGgTVhaMkqLzPTI+XYye3FXrXjJ6B0JKpl2QvvDoFhDxop9
ZPb57QyRgRBtOvofvFz8fWQOr67n+HNvaMbeG1iwGvqm6/MrgdSLsN6OaRh80uAE
5P0qX7rwTTOfJj5T6dKLxr3KuXKXNrM5AAIG0MjOMsha232+XUAZvofYNmqx7BMi
juJvqZc9/GXrcXqdPTYDyBs4UXDkwHsKdr744ooZ64VNiIYFs6eTvXp7V0XuajYH
ExLrEEjhO2UGPM5N9R9jw9AMsEhJstexgylHQsiiADtdi+jY4LKa/NZAJSJQQC+C
QQyE3Q7ZCpzRPiGPkkpIY/D7IRoIHL2H+LhbXV/K3oMGdbA7hS4=
=XnG4
-----END PGP SIGNATURE-----
Merge tag 'bcachefs-2024-10-05' of git://evilpiepirate.org/bcachefs
Pull bcachefs fixes from Kent Overstreet:
"A lot of little fixes, bigger ones include:
- bcachefs's __wait_on_freeing_inode() was broken in rc1 due to vfs
changes, now fixed along with another lost wakeup
- fragmentation LRU fixes; fsck now repairs successfully (this is the
data structure copygc uses); along with some nice simplification.
- Rework logged op error handling, so that if logged op replay errors
(due to another filesystem error) we delete the logged op instead
of going into an infinite loop)
- Various small filesystem connectivitity repair fixes"
* tag 'bcachefs-2024-10-05' of git://evilpiepirate.org/bcachefs:
bcachefs: Rework logged op error handling
bcachefs: Add warn param to subvol_get_snapshot, peek_inode
bcachefs: Kill snapshot arg to fsck_write_inode()
bcachefs: Check for unlinked, non-empty dirs in check_inode()
bcachefs: Check for unlinked inodes with dirents
bcachefs: Check for directories with no backpointers
bcachefs: Kill alloc_v4.fragmentation_lru
bcachefs: minor lru fsck fixes
bcachefs: Mark more errors AUTOFIX
bcachefs: Make sure we print error that causes fsck to bail out
bcachefs: bkey errors are only AUTOFIX during read
bcachefs: Create lost+found in correct snapshot
bcachefs: Fix reattach_inode()
bcachefs: Add missing wakeup to bch2_inode_hash_remove()
bcachefs: Fix trans_commit disk accounting revert
bcachefs: Fix bch2_inode_is_open() check
bcachefs: Fix return type of dirent_points_to_inode_nowarn()
bcachefs: Fix bad shift in bch2_read_flag_list()
Initially it was thought that we just wanted to ignore errors from
logged op replay, but it turns out we do need to catch -EROFS, or we'll
go into an infinite loop.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
These shouldn't always be fatal errors - logged op resume, in
particular, and we want it as a parameter there.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
It was initially believed that it would be better to be explicit about
the snapshot we're updating when writing inodes in fsck; however, it
turns out that passing around the snapshot separately is more error
prone and we're usually updating the inode in the same snapshow we read
it from.
This is different from normal filesystem paths, where we do the update
in the snapshot of the subvolume we're in.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We want to check for this early so it can be reattached if necessary in
check_unreachable_inodes(); better than letting it be deleted and having
the children reattached, losing their filenames.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
link count works differently in bcachefs - it's only nonzero for files
with multiple hardlinks, which means we can also avoid checking it
except for files that are known to have hardlinks.
That means we need a few different checks instead; in particular, we
don't want fsck to delet a file that has a dirent pointing to it.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
It's legal for regular files to have missing backpointers (due to
hardlinks), and fsck should automatically add them, but for directories
this is an error that should be flagged.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The fragmentation_lru field hasn't been needed since we reworked the LRU
btrees to use the btree write buffer; previously it was used to resolve
collisions, but the revised LRU btree uses the backpointer (the bucket)
as part of the key.
It should have been deleted at the time of the LRU rework; since it
wasn't, that left places for bugs to hide, in check/repair.
This fixes LRU fsck on a filesystem image helpfully provided by a user
who disappeared before I could get his name for the reported-by.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
check_lru_key() wasn't using write buffer updates for deleting bad lru
entries - dating from before the lru btree used the btree write buffer.
And when possibly flushing the btree write buffer (to make sure we're
seeing a real inconsistency), we need to be using the modern
bch2_btree_write_buffer_maybe_flush().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Newly generated keys, in the transaction commit path or write path,
should not be AUTOFIX; those indicate bugs that we need to fail fast
for.
Fixes: 5612daafb7 ("bcachefs: Fix fsck warnings from bkey validation")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Ensure a copy of the lost+found inode exists in the snapshot that we're
reattaching, so that we don't trigger warnings in
lookup_inode_for_snapshot() later.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This fixes two different bugs:
- Looser locking with the rhashtable means we need to recheck if the
inode is still hashed after prepare_to_wait(), and add a corresponding
wakeup after removing from the hash table.
- da18ecbf0f ("fs: add i_state helpers") changed the bit waitqueues
used for inodes, and bcachefs wasn't updated and thus broke; this
updates bcachefs to the new helper.
Fixes: 112d21fd1a ("bcachefs: switch to rhashtable for vfs inodes hash")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We only are applying JSET_ENTRY_TYPE_write_buffer_keys, revert path was
missed.
Fixes: a3581ca35d ("bcachefs: Fix BCH_TRANS_COMMIT_skip_accounting_apply")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
we're returning an error code now, not a bool
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZv3NAgAKCRBZ7Krx/gZQ
68kbAP0YzQxUgl0/o7Soda8XwKSPZTM9ls6kRk1UHTTG/i4ZigEA/G+i/mBQctL0
AB911kK8mxfXppfOXzstFBjoJSqiigQ=
=IE7D
-----END PGP SIGNATURE-----
Merge tag 'pull-work.unaligned' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull generic unaligned.h cleanups from Al Viro:
"Get rid of architecture-specific <asm/unaligned.h> includes, replacing
them with a single generic <linux/unaligned.h> header file.
It's the second largest (after asm/io.h) class of asm/* includes, and
all but two architectures actually end up using exact same file.
Massage the remaining two (arc and parisc) to do the same and just
move the thing to from asm-generic/unaligned.h to linux/unaligned.h"
[ This is one of those things that we're better off doing outside the
merge window, and would only cause extra conflict noise if it was in
linux-next for the next release due to all the trivial #include line
updates. Rip off the band-aid. - Linus ]
* tag 'pull-work.unaligned' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
move asm/unaligned.h to linux/unaligned.h
arc: get rid of private asm/unaligned.h
parisc: get rid of private asm/unaligned.h
asm/unaligned.h is always an include of asm-generic/unaligned.h;
might as well move that thing to linux/unaligned.h and include
that - there's nothing arch-specific in that header.
auto-generated by the following:
for i in `git grep -l -w asm/unaligned.h`; do
sed -i -e "s/asm\/unaligned.h/linux\/unaligned.h/" $i
done
for i in `git grep -l -w asm-generic/unaligned.h`; do
sed -i -e "s/asm-generic\/unaligned.h/linux\/unaligned.h/" $i
done
git mv include/asm-generic/unaligned.h include/linux/unaligned.h
git mv tools/include/asm-generic/unaligned.h tools/include/linux/unaligned.h
sed -i -e "/unaligned.h/d" include/asm-generic/Kbuild
sed -i -e "s/__ASM_GENERIC/__LINUX/" include/linux/unaligned.h tools/include/linux/unaligned.h
Builds on big endian systems fail as follows.
fs/bcachefs/bkey.h: In function 'bch2_bkey_format_add_key':
fs/bcachefs/bkey.h:557:41: error:
'const struct bkey' has no member named 'bversion'
The original commit only renamed the variable for little endian builds.
Rename it for big endian builds as well to fix the problem.
Fixes: cf49f8a8c2 ("bcachefs: rename version -> bversion")
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Assorted minor syzbot fixes, and for bigger stuff:
- Fix two disk accounting rewrite bugs
- Disk accounting keys use the version field of bkey so that journal
replay can tell which updates have been applied to the btree. This is
set in the transaction commit path, after we've gotten our journal
reservation (and our time ordering), but the
BCH_TRANS_COMMIT_skip_accounting_apply flag that journal replay uses
was incorrectly skipping this for new updates generated prior to
journal replay.
This fixes the underlying cause of an assertion pop in
disk_accounting_read.
- A couple fixes for disk accounting + device removal. Checking if
acocunting replicas entries were marked in the superblock was being
done at the wrong point, when deltas in the journal could still zero
them out, and then additionally we'd try to add a missing replicas
entry to the superblock without checking if it referred to an invalid
(removed) device.
- A whole slew of repair fixes
- fix infinite loop in propagate_key_to_snapshot_leaves(), this fixes
an infinite loop when repairing a filesystem with many snapshots
- fix incorrect transaction restart handling leading to occasional
"fsck counted ..." warnings"
- fix warning in __bch2_fsck_err() for bkey fsck errors
- check_inode() in fsck now correctly checks if the filesystem was
clean
- there shouldn't be pending logged ops if the fs was clean, we now
check for this
- remove_backpointer() doesn't remove a dirent that doesn't actually
point to the inode
- many more fsck errors are AUTOFIX
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmb4QtsACgkQE6szbY3K
bnYx4A//bhGgZYgP55FxduuxUH8XjX2eOnXwuPv/MmYO/4oCok5VBa9bRDTVXhIK
PtY4pP2IJZ3+u963mwbwJAawsPA01AEEty9tE+AdXbltDRQ03I33OEuIy0HFIso2
s8VBkVPbru6yU4RCCvYNIVvRG/9GOL+J0GgrR1t05zHVyKXe1FuS00Yq5+z3niNP
HtuGTsD273Nnhikz47bqyD+M6VizU+uzSUFLgnB3zrzpb+gPSGETSwgc4ggajlM4
2P10Vc4L/Nb3KYV9RW+C3WpRfUR/o8BZA3wjJfNo0JeA4iDaUbltSjpCA07EcAnA
3D6Omzqkm4aobL2WlvioT0UhZx4t8X/8x5t5F9HyX52i1k+g87oMT9/KIKec1Dzd
8vQCwCdXFfWaLSZoOJsHyIljip7BuRLKhWwKosdzzLIAnRQy5StxAhsG99fNStu6
JOWICPNCn1b6SkktnoKou1unL+K5RczeNfAxMAjcJjTD7IIAmytLe4mdRbP9q+Oa
x8no7pttbb4JnoRvfo42GVz8KWQR07oN/Zy7mH3K4Y0Ix+xDOrLqlfLIDLGpxMNv
HZz+UPchdlfpYJO+nTLoAOGXZWnKDqg70SAEcWKDc82Ri4vNOhraYDZvXrzl9qE+
63RPzqDbg3uXGxLYMvujjPe610QkPxS9zKKyDvUZZx0ZiUX4CjI=
=cdrz
-----END PGP SIGNATURE-----
Merge tag 'bcachefs-2024-09-28' of git://evilpiepirate.org/bcachefs
Pull more bcachefs updates from Kent Overstreet:
"Assorted minor syzbot fixes, and for bigger stuff:
Fix two disk accounting rewrite bugs:
- Disk accounting keys use the version field of bkey so that journal
replay can tell which updates have been applied to the btree.
This is set in the transaction commit path, after we've gotten our
journal reservation (and our time ordering), but the
BCH_TRANS_COMMIT_skip_accounting_apply flag that journal replay
uses was incorrectly skipping this for new updates generated prior
to journal replay.
This fixes the underlying cause of an assertion pop in
disk_accounting_read.
- A couple of fixes for disk accounting + device removal.
Checking if acocunting replicas entries were marked in the
superblock was being done at the wrong point, when deltas in the
journal could still zero them out, and then additionally we'd try
to add a missing replicas entry to the superblock without checking
if it referred to an invalid (removed) device.
A whole slew of repair fixes:
- fix infinite loop in propagate_key_to_snapshot_leaves(), this fixes
an infinite loop when repairing a filesystem with many snapshots
- fix incorrect transaction restart handling leading to occasional
"fsck counted ..." warnings
- fix warning in __bch2_fsck_err() for bkey fsck errors
- check_inode() in fsck now correctly checks if the filesystem was
clean
- there shouldn't be pending logged ops if the fs was clean, we now
check for this
- remove_backpointer() doesn't remove a dirent that doesn't actually
point to the inode
- many more fsck errors are AUTOFIX"
* tag 'bcachefs-2024-09-28' of git://evilpiepirate.org/bcachefs: (35 commits)
bcachefs: check_subvol_path() now prints subvol root inode
bcachefs: remove_backpointer() now checks if dirent points to inode
bcachefs: dirent_points_to_inode() now warns on mismatch
bcachefs: Fix lost wake up
bcachefs: Check for logged ops when clean
bcachefs: BCH_FS_clean_recovery
bcachefs: Convert disk accounting BUG_ON() to WARN_ON()
bcachefs: Fix BCH_TRANS_COMMIT_skip_accounting_apply
bcachefs: Check for accounting keys with bversion=0
bcachefs: rename version -> bversion
bcachefs: Don't delete unlinked inodes before logged op resume
bcachefs: Fix BCH_SB_ERRS() so we can reorder
bcachefs: Fix fsck warnings from bkey validation
bcachefs: Move transaction commit path validation to as late as possible
bcachefs: Fix disk accounting attempting to mark invalid replicas entry
bcachefs: Fix unlocked access to c->disk_sb.sb in bch2_replicas_entry_validate()
bcachefs: Fix accounting read + device removal
bcachefs: bch_accounting_mode
bcachefs: fix transaction restart handling in check_extents(), check_dirents()
bcachefs: kill inode_walker_entry.seen_this_pos
...
if an inode backpointer points to a dirent that doesn't point back,
that's an error we should warn about.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If the reader acquires the read lock and then the writer enters the slow
path, while the reader proceeds to the unlock path, the following scenario
can occur without the change:
writer: pcpu_read_count(lock) return 1 (so __do_six_trylock will return 0)
reader: this_cpu_dec(*lock->readers)
reader: smp_mb()
reader: state = atomic_read(&lock->state) (there is no waiting flag set)
writer: six_set_bitmask()
then the writer will sleep forever.
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add a filesystem flag to indicate whether we did a clean recovery -
using c->sb.clean after we've got rw is incorrect, since c->sb is
updated whenever we write the superblock.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We had a bug where disk accounting keys didn't always have their version
field set in journal replay; change the BUG_ON() to a WARN(), and
exclude this case since it's now checked for elsewhere (in the bkey
validate function).
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This was added to avoid double-counting accounting keys in journal
replay. But applied incorrectly (easily done since it applies to the
transaction commit, not a particular update), it leads to skipping
in-mem accounting for real accounting updates, and failure to give them
a version number - which leads to journal replay becoming very confused
the next time around.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Previously, check_inode() would delete unlinked inodes if they weren't
on the deleted list - this code dating from before there was a deleted
list.
But, if we crash during a logged op (truncate or finsert/fcollapse) of
an unlinked file, logged op resume will get confused if the inode has
already been deleted - instead, just add it to the deleted list if it
needs to be there; delete_dead_inodes runs after logged op resume.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
BCH_SB_ERRS() has a field for the actual enum val so that we can reorder
to reorganize, but the way BCH_SB_ERR_MAX was defined didn't allow for
this.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
__bch2_fsck_err() warns if the current task has a btree_trans object and
it wasn't passed in, because if it has to prompt for user input it has
to be able to unlock it.
But plumbing the btree_trans through bkey_validate(), as well as
transaction restarts, is problematic - so instead make bkey fsck errors
FSCK_AUTOFIX, which doesn't need to warn.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
In order to check for accounting keys with version=0, we need to run
validation after they've been assigned version numbers.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
accounting read was checking if accounting replicas entries were marked
in the superblock prior to applying accounting from the journal,
which meant that a recently removed device could spuriously trigger a
"not marked in superblocked" error (when journal entries zero out the
offending counter).
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Minor refactoring - replace multiple bool arguments with an enum; prep
work for fixing a bug in accounting read.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Dealing with outside state within a btree transaction is always tricky.
check_extents() and check_dirents() have to accumulate counters for
i_sectors and i_nlink (for subdirectories). There were two bugs:
- transaction commit may return a restart; therefore we have to commit
before accumulating to those counters
- get_inode_all_snapshots() may return a transaction restart, before
updating w->last_pos; then, on the restart,
check_i_sectors()/check_subdir_count() would see inodes that were not
for w->last_pos
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Returning a positive integer instead of an error code causes error paths
to become very confused.
Closes: syzbot+c0360e8367d6d8d04a66@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The pointer clean points the memory allocated by kmemdup, when the
return value of bch2_sb_clean_validate_late is not zero. The memory
pointed by clean is leaked. So we should free it in this case.
Fixes: a37ad1a3ab ("bcachefs: sb-clean.c")
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
In downgrade_table_extra, the return value is needed. When it
return failed, we should exit immediately.
Fixes: 7773df19c3 ("bcachefs: metadata version bucket_stripe_sectors")
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
check_topology doesn't need the srcu lock and doesn't use normal btree
transactions - we can just drop the srcu lock.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
fsck_err() jumps to the fsck_err label when bailing out; need to make
sure bp_iter was initialized...
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This fixes a kasan splat in propagate_key_to_snapshot_leaves() -
varint_decode_fast() does reads (that it never uses) up to 7 bytes past
the end of the integer.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Most or all errors will be autofix in the future, we're currently just
doing the ones that we know are well tested.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
no_llseek had been defined to NULL two years ago, in commit 868941b144
("fs: remove no_llseek")
To quote that commit,
At -rc1 we'll need do a mechanical removal of no_llseek -
git grep -l -w no_llseek | grep -v porting.rst | while read i; do
sed -i '/\<no_llseek\>/d' $i
done
would do it.
Unfortunately, that hadn't been done. Linus, could you do that now, so
that we could finally put that thing to rest? All instances are of the
form
.llseek = no_llseek,
so it's obviously safe.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As we iterate we need to mark that we no longer need iterators -
otherwise we'll infinite loop via the "too many iters" check when
there's many snapshots.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
if it doesn't get set we'll never be able to flush the btree write
buffer; this only happens in fake rw mode, but prevents us from shutting
down.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
rcu_pending, btree key cache rework: this solves lock contenting in the
key cache, eliminating the biggest source of the srcu lock hold time
warnings, and drastically improving performance on some metadata heavy
workloads - on multithreaded creates we're now 3-4x faster than xfs.
We're now using an rhashtable instead of the system inode hash table;
this is another significant performance improvement on multithreaded
metadata workloads, eliminating more lock contention.
for_each_btree_key_in_subvolume_upto(): new helper for iterating over
keys within a specific subvolume, eliminating a lot of open coded
"subvolume_get_snapshot()" and also fixing another source of srcu lock
time warnings, by running each loop iteration in its own transaction (as
the existing for_each_btree_key() does).
More work on btree_trans locking asserts; we now assert that we don't
hold btree node locks when trans->locked is false, which is important
because we don't use lockdep for tracking individual btree node locks.
Some cleanups and improvements in the bset.c btree node lookup code,
from Alan.
Rework of btree node pinning, which we use in backpointers fsck. The old
hacky implementation, where the shrinker just skipped over nodes in the
pinned range, was causing OOMs; instead we now use another shrinker with
a much higher seeks number for pinned nodes.
Rebalance now uses BCH_WRITE_ONLY_SPECIFIED_DEVS; this fixes an issue
where rebalance would sometimes fall back to allocating from the full
filesystem, which is not what we want when it's trying to move data to a
specific target.
Use __GFP_ACCOUNT, GFP_RECLAIMABLE for btree node, key cache
allocations.
Idmap mounts are now supported - Hongbo.
Rename whiteouts are now supported - Hongbo.
Erasure coding can now handle devices being marked as failed, or
forcibly removed. We still need the evacuate path for erasure coding,
but it's getting very close to ready for people to start using.
Status, and when will we be taking off experimental:
----------------------------------------------------
Going by critical, user facing bugs getting found and fixed, we're
nearly there. There are a couple key items that need to be finished
before we can take off the experimental label:
- The end-user experience is still pretty painful when the root
filesystem needs a fsck; we need some form of limited self healing so
that necessary repair gets run automatically. Errors (by type) are
recorded in the superblock, so what we need to do next is convert
remaining inconsistent() errors to fsck() errors (so that all runtime
inconsistencies are logged in the superblock), and we need to go
through the list of fsck errors and classify them by which fsck passes
are needed to repair them.
- We need comprehensive torture testing for all our repair paths, to
shake out remaining bugs there. Thomas has been working on the tooling
for this, so this is coming soonish.
Slightly less critical items:
- We need to improve the end-user experience for degraded mounts: right
now, a degraded root filesystem means dropping to an initramfs shell
or somehow inputting mount options manually (we don't want to allow
degraded mounts without some form of user input, except on unattended
servers) - we need the mount helper to prompt the user to allow
mounting degraded, and make sure this works with systemd.
- Scalabiity: we have users running 100TB+ filesystems, and that's
effectively the limit right now due to fsck times. We have some
reworks in the pipeline to address this, we're aiming to make petabyte
sized filesystems practical.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmbvHQoACgkQE6szbY3K
bnYfAw/+IXQ43/O+Jzs0MLD7pKZnrlbHiX9FqYLazD40vWvkyRTQOwgTn8pVNhq3
4YWmtuZyqh036YC+bGqYFOhz20YetS5UdgbClpwmc99JJ6xsY+Z1mdpYfz5oq1Dw
/pBX5iYb3rAt8UbQoZ8lcWM+GpT3GKJVgJuiLB2gRp9gATFesuh+0qU42oIVVVU5
4y3VhDBUmRk4XqEnk8hr7EIDMW0wWP3aptxYMZzeUPW0x1cEQ+FWrJo5D6lXv2KK
dKv3MogvA0FFNi/eNexclPiu2pXtI7vrxT7umsxAICHLt41rWpV5ttE6io3bC4ZN
qvwF9w2CpmKPKchFru9PO+QrWHVR7e6bphwf3TzyoKZ7tTn42f1RQlub7gBzI3bz
ai5ZwGRIvpUoPVBj+CO+Ipog81uUb23Ma+gXg1akEFBOAb+o7I3KOOSBh5l+0cHj
3Ov1n0TLcsoO2cqoqfsV2QubW9YcWEZ76g5mKwQnUn8Cs6Fp0wWaIyK9aNkIAxcr
tNDPGtH1gKitxUvju5i/LyI7y1UoeFvqJFee0VsU6QnixHn1ySzhePsJt6UEnIJT
Ia3C96Igqu2mV9FxhfGHj/qi7TGjqqkZHa8+B610cDpgf15cx7Ps2DYjkuQMFCqZ
Q3Q1o5De9roRq5xF2hLiYJCbzJKqd5ichFsBtLQuX572ICxbICg=
=oVCy
-----END PGP SIGNATURE-----
Merge tag 'bcachefs-2024-09-21' of git://evilpiepirate.org/bcachefs
Pull bcachefs updates from Kent Overstreet:
- rcu_pending, btree key cache rework: this solves lock contenting in
the key cache, eliminating the biggest source of the srcu lock hold
time warnings, and drastically improving performance on some metadata
heavy workloads - on multithreaded creates we're now 3-4x faster than
xfs.
- We're now using an rhashtable instead of the system inode hash table;
this is another significant performance improvement on multithreaded
metadata workloads, eliminating more lock contention.
- for_each_btree_key_in_subvolume_upto(): new helper for iterating over
keys within a specific subvolume, eliminating a lot of open coded
"subvolume_get_snapshot()" and also fixing another source of srcu
lock time warnings, by running each loop iteration in its own
transaction (as the existing for_each_btree_key() does).
- More work on btree_trans locking asserts; we now assert that we don't
hold btree node locks when trans->locked is false, which is important
because we don't use lockdep for tracking individual btree node
locks.
- Some cleanups and improvements in the bset.c btree node lookup code,
from Alan.
- Rework of btree node pinning, which we use in backpointers fsck. The
old hacky implementation, where the shrinker just skipped over nodes
in the pinned range, was causing OOMs; instead we now use another
shrinker with a much higher seeks number for pinned nodes.
- Rebalance now uses BCH_WRITE_ONLY_SPECIFIED_DEVS; this fixes an issue
where rebalance would sometimes fall back to allocating from the full
filesystem, which is not what we want when it's trying to move data
to a specific target.
- Use __GFP_ACCOUNT, GFP_RECLAIMABLE for btree node, key cache
allocations.
- Idmap mounts are now supported (Hongbo Li)
- Rename whiteouts are now supported (Hongbo Li)
- Erasure coding can now handle devices being marked as failed, or
forcibly removed. We still need the evacuate path for erasure coding,
but it's getting very close to ready for people to start using.
* tag 'bcachefs-2024-09-21' of git://evilpiepirate.org/bcachefs: (99 commits)
bcachefs: return err ptr instead of null in read sb clean
bcachefs: Remove duplicated include in backpointers.c
bcachefs: Don't drop devices with stripe pointers
bcachefs: bch2_ec_stripe_head_get() now checks for change in rw devices
bcachefs: bch_fs.rw_devs_change_count
bcachefs: bch2_dev_remove_stripes()
bcachefs: bch2_trigger_ptr() calculates sectors even when no device
bcachefs: improve error messages in bch2_ec_read_extent()
bcachefs: improve error message on too few devices for ec
bcachefs: improve bch2_new_stripe_to_text()
bcachefs: ec_stripe_head.nr_created
bcachefs: bch_stripe.disk_label
bcachefs: stripe_to_mem()
bcachefs: EIO errcode cleanup
bcachefs: Rework btree node pinning
bcachefs: split up btree cache counters for live, freeable
bcachefs: btree cache counters should be size_t
bcachefs: Don't count "skipped access bit" as touched in btree cache scan
bcachefs: Failed devices no longer require mounting in degraded mode
bcachefs: bch2_dev_rcu_noerror()
...
Syzbot reports a problem that a warning is triggered due to suspicious
use of rcu_dereference_check(). That is triggered by a call of
bch2_snapshot_tree_oldest_subvol().
The cause of the warning is that inside
bch2_snapshot_tree_oldest_subvol(), snapshot_t() is called which calls
rcu_dereference() that requires a read lock to be held. Also, the call
of bch2_snapshot_tree_next() eventually calls snapshot_t().
To fix this, call rcu_read_lock() before calling snapshot_t(). Then,
release the lock after the termination of the while loop.
Reported-by: <syzbot+f7c41a878676b72c16a6@syzkaller.appspotmail.com>
Signed-off-by: Ahmed Ehab <bottaawesome633@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The header files bbpos.h is included twice in backpointers.c,
so one inclusion of each can be removed.
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=10783
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This factors out ec_strie_head_devs_update(), which initializes the
bitmap of devices we're allocating from, and runs it every time
c->rw_devs_change_count changes.
We also cancel pending, not allocated stripes, since they may refer to
devices that are no longer available.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add a counter that's incremented whenever rw devices change; this will
be used for erasure coding so that it can keep ec_stripe_head in sync
and not deadlock on a new stripe when a device it wants goes away.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We can now correctly force-remove a device that has stripes on it; this
uses the new BCH_SB_MEMBER_INVALID sentinal value.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When reshaping existing stripes, we should keep them on the same target
that they were allocated on; to do this, we need to add a field to the
btree stripe type.
This is a tad awkward, because we only have 8 bits left, and targets are
16 bits - but we only need to store a label, not a full target.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
In backpointers fsck, we do a seqential scan of one btree, and check
references to another: extents <-> backpointers
Checking references generates random lookups, so we want to pin that
btree in memory (or only a range, if it doesn't fit in ram).
Previously, this was done with a simple check in the shrinker - "if
btree node is in range being pinned, don't free it" - but this generated
OOMs, as our shrinker wasn't well behaved if there was less memory
available than expected.
Instead, we now have two different shrinkers and lru lists; the second
shrinker being for pinned nodes, with seeks set much higher than normal
- so they can still be freed if necessary, but we'll prefer not to.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
32 bits won't overflow any time soon, but size_t is the correct type for
counting objects in memory.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Fix the following compilation error:
```
fs/bcachefs/sb-members.c: In function ‘bch2_sb_member_alloc’:
fs/bcachefs/sb-members.c:508:2: error: a label can only be part of a statement and a declaration is not a statement
508 | unsigned nr_devices = max_t(unsigned, dev_idx + 1, c->sb.nr_devices);
```
Fixes: a7d364a133c7 ("bcachefs: bch2_sb_member_alloc()")
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This adds mount options for specifying recovery passes to run, or
exclude; the immediate need for this is that backpointers fsck is having
trouble completing, so we need a way to skip it.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When freeing in a shrinker callback, we need to notify memory reclaim,
so it knows forward progress has been made.
Normally this is done in e.g. slab code, but we're not freeing through
slab - or rather we are, but these allocations are big, and use the
kmalloc_large() path.
This is really a bug in the slub code, but we're working around it here
for now.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This is needed for overlayfs, which is used by container managers.
Signed-off-by: Sasha Finkelstein <fnkl.kernel@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
this was an oversight: rebalance is moving data to a specific device, so
we don't want it falling back to the full filesystem
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
rebalance writes must be BCH_WRITE_ALLOC_NOWAIT because they don't
allocate from the full filesystem - but we don't want spurious
allocation failures due to open buckets.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
like the previous patch, kill use of bare arrays; the encryption code
likes to work in big batches, so this is a small performance
improvement.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
- Implement the SCHED_DEADLINE server infrastructure - Daniel Bristot de Oliveira's
last major contribution to the kernel:
"SCHED_DEADLINE servers can help fixing starvation issues of low priority
tasks (e.g., SCHED_OTHER) when higher priority tasks monopolize CPU
cycles. Today we have RT Throttling; DEADLINE servers should be able to
replace and improve that."
(Daniel Bristot de Oliveira, Peter Zijlstra, Joel Fernandes,
Youssef Esmat, Huang Shijie)
- Preparatory changes for sched_ext integration:
- Use set_next_task(.first) where required
- Fix up set_next_task() implementations
- Clean up DL server vs. core sched
- Split up put_prev_task_balance()
- Rework pick_next_task()
- Combine the last put_prev_task() and the first set_next_task()
- Rework dl_server
- Add put_prev_task(.next)
(Peter Zijlstra, with a fix by Tejun Heo)
- Complete the EEVDF transition and refine EEVDF scheduling:
- Implement delayed dequeue
- Allow shorter slices to wakeup-preempt
- Use sched_attr::sched_runtime to set request/slice suggestion
- Document the new feature flags
- Remove unused and duplicate-functionality fields
- Simplify & unify pick_next_task_fair()
- Misc debuggability enhancements
(Peter Zijlstra, with fixes/cleanups by Dietmar Eggemann,
Valentin Schneider and Chuyi Zhou)
- Initialize the vruntime of a new task when it is first enqueued,
resulting in significant decrease in latency of newly woken tasks.
(Zhang Qiao)
- Introduce SM_IDLE and an idle re-entry fast-path in __schedule()
(K Prateek Nayak, Peter Zijlstra)
- Clean up and clarify the usage of Clean up usage of rt_task()
(Qais Yousef)
- Preempt SCHED_IDLE entities in strict cgroup hierarchies
(Tianchen Ding)
- Clarify the documentation of time units for deadline scheduler
parameters. (Christian Loehle)
- Remove the HZ_BW chicken-bit feature flag introduced a year ago,
the original change seems to be working fine.
(Phil Auld)
- Misc fixes and cleanups (Chen Yu, Dan Carpenter, Huang Shijie,
Peilin He, Qais Yousefm and Vincent Guittot)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmbr8qcRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1gdbw/+Mj3zWfYP+dtUkfgrR2FClPAJoo1/9Dz0
LYD8XgYHu8rEJ0Aq+VbdkgYGUt9utvzUFPIxvWFDcldQl57KwhF4hp9Ir+PqJyYC
NolQ1q8ddo1hnslxnEg6SgHVzQq/4FqMM0nDNUkQETCx6zTyFFeRf+q7o/2c2m5B
uI9dSU1Wrx7XrXm2D3kB8+xP+ZRy+qhbFN5Pfuz96mhelfklylgKMfPzgAiCT/7T
JTbQhQ2HdcCNgiLoSrWsHBDy2UYpouP4zb4jyd+lDQzhSUJrj3u4Xy4vVmuTKq+y
sTgWlgKB+MTuh9UuJ4UYzSnMqg161UlMvtXeH84ABmAqDNGHRPtOKrrlcLtJ3D4x
m1SPhNnsvpjOu2pH0XLIS8al3VUesWND5S+rucHRYSq6Nvhivf4MTvRJlicXXurL
Mt2APnIlhGJuKBNWnmyZovVdtO0ZUUPlaZWfr3rCS4txAVo+HwWhsm3uhtTycQqN
gazsCiuGh6Jds90ZqA/BvdLWG+DY8J0xLlV3ex4pCXuQ/HFrabVWTyThJsULhrZ2
5mTdWIsocPctNMO9/RHMy7vJI7G7ljgHEquWVn5kiGGzXhK6VwVwKAMpfgXGw+YA
yVP6/M7a7g2yEzj69gXkcDa8k/kedMVquJ/G/8YhZM7u7sPqsMjpmaGsqsJRfnpT
ChngAzap+kA=
=TEC6
-----END PGP SIGNATURE-----
Merge tag 'sched-core-2024-09-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
- Implement the SCHED_DEADLINE server infrastructure - Daniel Bristot
de Oliveira's last major contribution to the kernel:
"SCHED_DEADLINE servers can help fixing starvation issues of low
priority tasks (e.g., SCHED_OTHER) when higher priority tasks
monopolize CPU cycles. Today we have RT Throttling; DEADLINE
servers should be able to replace and improve that."
(Daniel Bristot de Oliveira, Peter Zijlstra, Joel Fernandes, Youssef
Esmat, Huang Shijie)
- Preparatory changes for sched_ext integration:
- Use set_next_task(.first) where required
- Fix up set_next_task() implementations
- Clean up DL server vs. core sched
- Split up put_prev_task_balance()
- Rework pick_next_task()
- Combine the last put_prev_task() and the first set_next_task()
- Rework dl_server
- Add put_prev_task(.next)
(Peter Zijlstra, with a fix by Tejun Heo)
- Complete the EEVDF transition and refine EEVDF scheduling:
- Implement delayed dequeue
- Allow shorter slices to wakeup-preempt
- Use sched_attr::sched_runtime to set request/slice suggestion
- Document the new feature flags
- Remove unused and duplicate-functionality fields
- Simplify & unify pick_next_task_fair()
- Misc debuggability enhancements
(Peter Zijlstra, with fixes/cleanups by Dietmar Eggemann, Valentin
Schneider and Chuyi Zhou)
- Initialize the vruntime of a new task when it is first enqueued,
resulting in significant decrease in latency of newly woken tasks
(Zhang Qiao)
- Introduce SM_IDLE and an idle re-entry fast-path in __schedule()
(K Prateek Nayak, Peter Zijlstra)
- Clean up and clarify the usage of Clean up usage of rt_task()
(Qais Yousef)
- Preempt SCHED_IDLE entities in strict cgroup hierarchies
(Tianchen Ding)
- Clarify the documentation of time units for deadline scheduler
parameters (Christian Loehle)
- Remove the HZ_BW chicken-bit feature flag introduced a year ago,
the original change seems to be working fine (Phil Auld)
- Misc fixes and cleanups (Chen Yu, Dan Carpenter, Huang Shijie,
Peilin He, Qais Yousefm and Vincent Guittot)
* tag 'sched-core-2024-09-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (64 commits)
sched/cpufreq: Use NSEC_PER_MSEC for deadline task
cpufreq/cppc: Use NSEC_PER_MSEC for deadline task
sched/deadline: Clarify nanoseconds in uapi
sched/deadline: Convert schedtool example to chrt
sched/debug: Fix the runnable tasks output
sched: Fix sched_delayed vs sched_core
kernel/sched: Fix util_est accounting for DELAY_DEQUEUE
kthread: Fix task state in kthread worker if being frozen
sched/pelt: Use rq_clock_task() for hw_pressure
sched/fair: Move effective_cpu_util() and effective_cpu_util() in fair.c
sched/core: Introduce SM_IDLE and an idle re-entry fast-path in __schedule()
sched: Add put_prev_task(.next)
sched: Rework dl_server
sched: Combine the last put_prev_task() and the first set_next_task()
sched: Rework pick_next_task()
sched: Split up put_prev_task_balance()
sched: Clean up DL server vs core sched
sched: Fixup set_next_task() implementations
sched: Use set_next_task(.first) where required
sched/fair: Properly deactivate sched_delayed task upon class change
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZuQEvgAKCRCRxhvAZXjc
ou77AQD3U1KjbdgzbUi6kaUmiiWOPhfYTlm8mho8dBjqvTCB+AD/XTWSFCWWhHB4
KyQZTbjRD81xmVNbKjASazp0EA6Ahwc=
=gIsD
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.12.folio' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs
Pull vfs folio updates from Christian Brauner:
"This contains work to port write_begin and write_end to rely on folios
for various filesystems.
This converts ocfs2, vboxfs, orangefs, jffs2, hostfs, fuse, f2fs,
ecryptfs, ntfs3, nilfs2, reiserfs, minixfs, qnx6, sysv, ufs, and
squashfs.
After this series lands a bunch of the filesystems in this list do not
mention struct page anymore"
* tag 'vfs-6.12.folio' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (61 commits)
Squashfs: Ensure all readahead pages have been used
Squashfs: Rewrite and update squashfs_readahead_fragment() to not use page->index
Squashfs: Update squashfs_readpage_block() to not use page->index
Squashfs: Update squashfs_readahead() to not use page->index
Squashfs: Update page_actor to not use page->index
jffs2: Use a folio in jffs2_garbage_collect_dnode()
jffs2: Convert jffs2_do_readpage_nolock to take a folio
buffer: Convert __block_write_begin() to take a folio
ocfs2: Convert ocfs2_write_zero_page to use a folio
fs: Convert aops->write_begin to take a folio
fs: Convert aops->write_end to take a folio
vboxsf: Use a folio in vboxsf_write_end()
orangefs: Convert orangefs_write_begin() to use a folio
orangefs: Convert orangefs_write_end() to use a folio
jffs2: Convert jffs2_write_begin() to use a folio
jffs2: Convert jffs2_write_end() to use a folio
hostfs: Convert hostfs_write_end() to use a folio
fuse: Convert fuse_write_begin() to use a folio
fuse: Convert fuse_write_end() to use a folio
f2fs: Convert f2fs_write_begin() to use a folio
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZuQEGwAKCRCRxhvAZXjc
ojIuAQC433+hBkvjvmQ7H0r5rgZSjUuCTG3bSmdU7RJmPHUHhwEA85v/NGq53f+W
IhandK6t+Cf0JYpFZ3N0bT88hDYVhQQ=
=9zGL
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.12.misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"This contains the usual pile of misc updates:
Features:
- Add F_CREATED_QUERY fcntl() that allows userspace to query whether
a file was actually created. Often userspace wants to know whether
an O_CREATE request did actually create a file without using
O_EXCL. The current logic is that to first attempts to open the
file without O_CREAT | O_EXCL and if ENOENT is returned userspace
tries again with both flags. If that succeeds all is well. If it
now reports EEXIST it retries.
That works fairly well but some corner cases make this more
involved. If this operates on a dangling symlink the first openat()
without O_CREAT | O_EXCL will return ENOENT but the second openat()
with O_CREAT | O_EXCL will fail with EEXIST.
The reason is that openat() without O_CREAT | O_EXCL follows the
symlink while O_CREAT | O_EXCL doesn't for security reasons. So
it's not something we can really change unless we add an explicit
opt-in via O_FOLLOW which seems really ugly.
All available workarounds are really nasty (fanotify, bpf lsm etc)
so add a simple fcntl().
- Try an opportunistic lookup for O_CREAT. Today, when opening a file
we'll typically do a fast lookup, but if O_CREAT is set, the kernel
always takes the exclusive inode lock. This was likely done with
the expectation that O_CREAT means that we always expect to do the
create, but that's often not the case. Many programs set O_CREAT
even in scenarios where the file already exists (see related
F_CREATED_QUERY patch motivation above).
The series contained in the pr rearranges the pathwalk-for-open
code to also attempt a fast_lookup in certain O_CREAT cases. If a
positive dentry is found, the inode_lock can be avoided altogether
and it can stay in rcuwalk mode for the last step_into.
- Expose the 64 bit mount id via name_to_handle_at()
Now that we provide a unique 64-bit mount ID interface in statx(2),
we can now provide a race-free way for name_to_handle_at(2) to
provide a file handle and corresponding mount without needing to
worry about racing with /proc/mountinfo parsing or having to open a
file just to do statx(2).
While this is not necessary if you are using AT_EMPTY_PATH and
don't care about an extra statx(2) call, users that pass full paths
into name_to_handle_at(2) need to know which mount the file handle
comes from (to make sure they don't try to open_by_handle_at a file
handle from a different filesystem) and switching to AT_EMPTY_PATH
would require allocating a file for every name_to_handle_at(2) call
- Add a per dentry expire timeout to autofs
There are two fairly well known automounter map formats, the autofs
format and the amd format (more or less System V and Berkley).
Some time ago Linux autofs added an amd map format parser that
implemented a fair amount of the amd functionality. This was done
within the autofs infrastructure and some functionality wasn't
implemented because it either didn't make sense or required extra
kernel changes. The idea was to restrict changes to be within the
existing autofs functionality as much as possible and leave changes
with a wider scope to be considered later.
One of these changes is implementing the amd options:
1) "unmount", expire this mount according to a timeout (same as
the current autofs default).
2) "nounmount", don't expire this mount (same as setting the
autofs timeout to 0 except only for this specific mount) .
3) "utimeout=<seconds>", expire this mount using the specified
timeout (again same as setting the autofs timeout but only for
this mount)
To implement these options per-dentry expire timeouts need to be
implemented for autofs indirect mounts. This is because all map
keys (mounts) for autofs indirect mounts use an expire timeout
stored in the autofs mount super block info. structure and all
indirect mounts use the same expire timeout.
Fixes:
- Fix missing fput for FSCONFIG_SET_FD in autofs
- Use param->file for FSCONFIG_SET_FD in coda
- Delete the 'fs/netfs' proc subtreee when netfs module exits
- Make sure that struct uid_gid_map fits into a single cacheline
- Don't flush in-flight wb switches for superblocks without cgroup
writeback
- Correcting the idmapping mount example in the idmapping
documentation
- Fix a race between evice_inodes() and find_inode() and iput()
- Refine the show_inode_state() macro definition in writeback code
- Prevent dump_mapping() from accessing invalid dentry.d_name.name
- Show actual source for debugfs in /proc/mounts
- Annotate data-race of busy_poll_usecs in eventpoll
- Don't WARN for racy path_noexec check in exec code
- Handle OOM on mnt_warn_timestamp_expiry()
- Fix some spelling in the iomap design documentation
- Fix typo in procfs comment
- Fix typo in fs/namespace.c comment
Cleanups:
- Add the VFS git tree to the MAINTAINERS file
- Move FMODE_UNSIGNED_OFFSET to fop_flags freeing up another f_mode
bit in struct file bringing us to 5 free f_mode bits
- Remove the __I_DIO_WAKEUP bit from i_state flags as we can simplify
the wait mechanism
- Remove the unused path_put_init() helper
- Replace a __u32 with u32 for s_fsnotify_mask as __u32 is uapi
specific
- Replace the unsigned long i_state member with a u32 i_state member
in struct inode freeing up 4 bytes in struct inode. Instead of
using the bit based wait apis we're now using the var event apis
and using the individual bytes of the i_state member to wait on
state changes
- Explain how per-syscall AT_* flags should be allocated
- Use in_group_or_capable() helper to simplify the posix acl mode
update code
- Switch to LIST_HEAD() in fsync_buffers_list() to simplify the code
- Removed comment about d_rcu_to_refcount() as that function doesn't
exist anymore
- Add kernel documentation for lookup_fast()
- Don't re-zero evenpoll fields
- Remove outdated comment after close_fd()
- Fix imprecise wording in comment about the pipe filesystem
- Drop GFP_NOFAIL mode from alloc_page_buffers
- Missing blank line warnings and struct declaration improved in
file_table
- Annotate struct poll_list with __counted_by()
- Remove the unused read parameter in percpu-rwsem
- Remove linux/prefetch.h include from direct-io code
- Use kmemdup_array instead of kmemdup for multiple allocation in
mnt_idmapping code
- Remove unused mnt_cursor_del() declaration
Performance tweaks:
- Dodge smp_mb in break_lease and break_deleg in the common case
- Only read fops once in fops_{get,put}()
- Use RCU in ilookup()
- Elide smp_mb in iversion handling in the common case
- Drop one lock trip in evict()"
* tag 'vfs-6.12.misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (58 commits)
uidgid: make sure we fit into one cacheline
proc: Fix typo in the comment
fs/pipe: Correct imprecise wording in comment
fhandle: expose u64 mount id to name_to_handle_at(2)
uapi: explain how per-syscall AT_* flags should be allocated
fs: drop GFP_NOFAIL mode from alloc_page_buffers
writeback: Refine the show_inode_state() macro definition
fs/inode: Prevent dump_mapping() accessing invalid dentry.d_name.name
mnt_idmapping: Use kmemdup_array instead of kmemdup for multiple allocation
netfs: Delete subtree of 'fs/netfs' when netfs module exits
fs: use LIST_HEAD() to simplify code
inode: make i_state a u32
inode: port __I_LRU_ISOLATING to var event
vfs: fix race between evice_inodes() and find_inode()&iput()
inode: port __I_NEW to var event
inode: port __I_SYNC to var event
fs: reorder i_state bits
fs: add i_state helpers
MAINTAINERS: add the VFS git tree
fs: s/__u32/u32/ for s_fsnotify_mask
...
Add the __counted_by compiler attribute to the flexible array members
devs to improve access bounds-checking via CONFIG_UBSAN_BOUNDS and
CONFIG_FORTIFY_SOURCE.
Increment nr_devs before adding a new device to the devs array and
adjust the array indexes accordingly. Add a helper macro for adding a
new device.
In bch2_journal_read(), explicitly set nr_devs to 0.
Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We enable idmapped mounts for bcachefs. Here, we just pass down
the user_namespace argument from the VFS methods to the relevant
helpers.
The idmap test in bcachefs is as following:
```
1. losetup /dev/loop1 bcachefs.img
2. ./bcachefs format /dev/loop1
3. mount -t bcachefs /dev/loop1 /mnt/bcachefs/
4. ./mount-idmapped --map-mount b:0:1000:1 /mnt/bcachefs /mnt/idmapped1/
ll /mnt/bcachefs
total 2
drwx------. 2 root root 0 Jun 14 14:10 lost+found
-rw-r--r--. 1 root root 1945 Jun 14 14:12 profile
ll /mnt/idmapped1/
total 2
drwx------. 2 1000 1000 0 Jun 14 14:10 lost+found
-rw-r--r--. 1 1000 1000 1945 Jun 14 14:12 profile
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add the __counted_by compiler attribute to the flexible array member
x_name to improve access bounds-checking via CONFIG_UBSAN_BOUNDS and
CONFIG_FORTIFY_SOURCE.
Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Use jiffies macros instead of using jiffies directly to handle wraparound.
Signed-off-by: Chen Yufan <chenyufan@vivo.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_bset_fix_lookup_table is too complicated to be easily understood,
the comment "l now > where" there is also incorrect when where ==
t->end_offset. This patch therefore refactor the function, the idea is
that when where >= rw_aux_tree(b, t)[t->size - 1].offset, we don't need
to adjust the rw aux tree.
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We rely on the trans->locked to know if a trans has nodes locked for
assertions about deadlocks; there can't be more than one trans in the
same process that is locked.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
folio_has_private() is an attractive nuisance; filesystem authors
generally don't realise that it actually checks two flags (one of which
is never set by bcachefs). There's no need to check the private flag at
all; for folios owned by bcachefs, we know that folio->private is NULL
when the private flag is clear and non-NULL when the private flag is set.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
It's really not needed: the only locks used here are the btree cache
lock, which we drop for GFP_WAIT allocations, and btree node locks - but
we also drop those for GFP_WAIT allocations.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Use helper functions to make code more readable.
Similar to commit a5488f2983 ("fs: simplify ->listxattr() implementation")
Signed-off-by: Youling Tang <tangyouling@kylinos.cn>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Remove struct nop_posix_acl_{access,default} for bcachefs filesystem
that don't depend on the xattr handler in their inode->i_op->listxattr()
method in any way. There's nothing more to do than to simply remove the
handler. It's been effectively unused ever since we introduced the new
posix acl api. See [1] for details.
Link [1]: https://patchwork.kernel.org/project/linux-fsdevel/cover/20230125-fs-acl-remove-generic-xattr-handlers-v3-0-f760cc58967d@kernel.org/
Signed-off-by: Youling Tang <tangyouling@kylinos.cn>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>