When reshaping existing stripes, we should keep them on the same target
that they were allocated on; to do this, we need to add a field to the
btree stripe type.
This is a tad awkward, because we only have 8 bits left, and targets are
16 bits - but we only need to store a label, not a full target.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bkey_fsck_err() was added as an interface that looks like fsck_err(),
but previously all it did was ensure that the appropriate error counter
was incremented in the superblock.
This is a cleanup and bugfix patch that converts it to a wrapper around
fsck_err(). This is needed to fix an issue with the upgrade path to
disk_accounting_v3, where the "silent fix" error list now includes
bkey_fsck errors; fsck_err() handles this in a unified way, and since we
need to change printing of bkey fsck errors from the caller to the inner
bkey_fsck_err() calls, this ends up being a pretty big change.
Als,, rename .invalid() methods to .validate(), for clarity, while we're
changing the function signature anyways (to drop the printbuf argument).
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kuan-Wei Chiu has significantly reworked the min_heap library code and
has taught bcachefs to use the new more generic implementation.
- Yury Norov's series "Cleanup cpumask.h inclusion in core headers"
reworks the cpumask and nodemask headers to make things generally more
rational.
- Kuan-Wei Chiu has sent along some maintenance work against our sorting
library code in the series "lib/sort: Optimizations and cleanups".
- More library maintainance work from Christophe Jaillet in the series
"Remove usage of the deprecated ida_simple_xx() API".
- Ryusuke Konishi continues with the nilfs2 fixes and clanups in the
series "nilfs2: eliminate the call to inode_attach_wb()".
- Kuan-Ying Lee has some fixes to the gdb scripts in the series "Fix GDB
command error".
- Plus the usual shower of singleton patches all over the place. Please
see the relevant changelogs for details.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZp2GvwAKCRDdBJ7gKXxA
jlf/AP48xP5ilIHbtpAKm2z+MvGuTxJQ5VSC0UXFacuCbc93lAEA+Yo+vOVRmh6j
fQF2nVKyKLYfSz7yqmCyAaHWohIYLgg=
=Stxz
-----END PGP SIGNATURE-----
Merge tag 'mm-nonmm-stable-2024-07-21-15-07' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton:
- In the series "treewide: Refactor heap related implementation",
Kuan-Wei Chiu has significantly reworked the min_heap library code
and has taught bcachefs to use the new more generic implementation.
- Yury Norov's series "Cleanup cpumask.h inclusion in core headers"
reworks the cpumask and nodemask headers to make things generally
more rational.
- Kuan-Wei Chiu has sent along some maintenance work against our
sorting library code in the series "lib/sort: Optimizations and
cleanups".
- More library maintainance work from Christophe Jaillet in the series
"Remove usage of the deprecated ida_simple_xx() API".
- Ryusuke Konishi continues with the nilfs2 fixes and clanups in the
series "nilfs2: eliminate the call to inode_attach_wb()".
- Kuan-Ying Lee has some fixes to the gdb scripts in the series "Fix
GDB command error".
- Plus the usual shower of singleton patches all over the place. Please
see the relevant changelogs for details.
* tag 'mm-nonmm-stable-2024-07-21-15-07' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (98 commits)
ia64: scrub ia64 from poison.h
watchdog/perf: properly initialize the turbo mode timestamp and rearm counter
tsacct: replace strncpy() with strscpy()
lib/bch.c: use swap() to improve code
test_bpf: convert comma to semicolon
init/modpost: conditionally check section mismatch to __meminit*
init: remove unused __MEMINIT* macros
nilfs2: Constify struct kobj_type
nilfs2: avoid undefined behavior in nilfs_cnt32_ge macro
math: rational: add missing MODULE_DESCRIPTION() macro
lib/zlib: add missing MODULE_DESCRIPTION() macro
fs: ufs: add MODULE_DESCRIPTION()
lib/rbtree.c: fix the example typo
ocfs2: add bounds checking to ocfs2_check_dir_entry()
fs: add kernel-doc comments to ocfs2_prepare_orphan_dir()
coredump: simplify zap_process()
selftests/fpu: add missing MODULE_DESCRIPTION() macro
compiler.h: simplify data_race() macro
build-id: require program headers to be right after ELF header
resource: add missing MODULE_DESCRIPTION()
...
Rewrite fsck/gc for the new accounting scheme.
This adds a second set of in-memory accounting counters for gc to use;
like with other parts of gc we run all trigger in TRIGGER_GC mode, then
compare what we calculated to existing in-memory accounting at the end.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Main part of the disk accounting rewrite.
This is a wholesale rewrite of the existing disk space accounting, which
relies on percepu counters that are sharded by journal buffer, and
rolled up and added to each journal write.
With the new scheme, every set of counters is a distinct key in the
accounting btree; this fixes scaling limitations of the old scheme,
where counters took up space in each journal entry and required multiple
percpu counters.
Now, in memory accounting requires a single set of percpu counters - not
multiple for each in flight journal buffer - and in the future we'll
probably also have counters that don't use in memory percpu counters,
they're not strictly required.
An accounting update is now a normal btree update, using the btree write
buffer path. At transaction commit time, we apply accounting updates to
the in memory counters, which are percpu counters indexed in an
eytzinger tree by the accounting key.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Drop the heap-related macros from bcachefs and replacing them with the
generic min_heap implementation from include/linux. By doing so, code
readability is improved by using functions instead of macros. Moreover,
the min_heap implementation in include/linux adopts a bottom-up variation
compared to the textbook version currently used in bcachefs. This
bottom-up variation allows for approximately 50% reduction in the number
of comparison operations during heap siftdown, without changing the number
of swaps, thus making it more efficient.
[visitorckw@gmail.com: fix missing assignment of minimum element]
Link: https://lkml.kernel.org/r/20240602174828.1955320-1-visitorckw@gmail.com
Link: https://lkml.kernel.org/ioyfizrzq7w7mjrqcadtzsfgpuntowtjdw5pgn4qhvsdp4mqqg@nrlek5vmisbu
Link: https://lkml.kernel.org/r/20240524152958.919343-17-visitorckw@gmail.com
Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
Reviewed-by: Ian Rogers <irogers@google.com>
Acked-by: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Bagas Sanjaya <bagasdotme@gmail.com>
Cc: Brian Foster <bfoster@redhat.com>
Cc: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw>
Cc: Coly Li <colyli@suse.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Sakai <msakai@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Compatibility fix - we no longer have a separate table for which order
gc walks btrees in, and special case the stripes btree directly.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We're about to start using bch_validate_flags for superblock section
validation - it's no longer bkey specific.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Wrapper around bch2_dev_have_ref() for open_buckets; we do guarantee
that the device an open_bucket points to exists.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Currently, the reflink_p gc trigger does repair as well - turning a
reflink_p key into an error key if the reflink_v it points to doesn't
exist.
This won't work with online check/repair, because the repair path once
online will be subject to transaction restarts, but BTREE_TRIGGER_gc is
not idempotant - we can't run it multiple times if we get a transaction
restart.
So we need to split these paths; to do so this patch calls
check_fix_ptrs() by a new general path - a new trigger type,
BTREE_TRIGGER_check_repair.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If we hit an inconsistency when updating allocation information, we
don't want to fail the update if it's for a deletion - only if it's for
a new key.
Rename check_bucket_ref() -> bucket_ref_update() so we can centralize
the logic to do this.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This eliminates some duplicated logic, and the gc path now handles
stripe updates and deletions - we need this since soon we're bringing
back runtime gc.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Start to work on unifying mark_stripe_bucket() and
trans_mark_stripe_bucket(); first, clean up all the unnecessary and
gratuitious differences.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We're working on potentially unifying bch2_check_bucket_ref() and
bch2_check_fix_ptrs() - or at least eliminating gratuitious differences.
Most immediately, there's a bunch of cleanups to be done regarding
BCH_DATA_stripe.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Combine iter/update/trigger/str_hash flags into a single enum, and
x-macroize them for a to_text() function later.
These flags are all for a specific iter/key/update context, so it makes
sense to group them together - iter/update/trigger flags were already
given distinct bits, this cleans up and unifies that handling.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Previosuly, the transaction commit path would have to add keys to the
btree write buffer as a separate operation, requiring additional global
synchronization.
This patch introduces a new journal entry type, which indicates that the
keys need to be copied into the btree write buffer prior to being
written out. We switch the journal entry type back to
JSET_ENTRY_btree_keys prior to write, so this is not an on disk format
change.
Flushing the btree write buffer may require pulling keys out of journal
entries yet to be written, and quiescing outstanding journal
reservations; we previously added journal->buf_lock for synchronization
with the journal write path.
We also can't put strict bounds on the number of keys in the journal
destined for the write buffer, which means we might overflow the size of
the preallocated buffer and have to reallocate - this introduces a
potentially fatal memory allocation failure. This is something we'll
have to watch for, if it becomes an issue in practice we can do
additional mitigation.
The transaction commit path no longer has to explicitly check if the
write buffer is full and wait on flushing; this is another performance
optimization. Instead, when the btree write buffer is close to full we
change the journal watermark, so that only reservations for journal
reclaim are allowed.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
for_each_btree_key() handles transaction restarts, like
for_each_btree_key2(), but only calls bch2_trans_begin() after a
transaction restart - for_each_btree_key2() wraps every loop iteration
in a transaction.
The for_each_btree_key() behaviour is problematic when it leads to
holding the SRCU lock that prevents key cache reclaim for an unbounded
amount of time - there's no real need to keep it around.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
__bch2_btree_write_buffer_flush() now assumes a write ref is already
held (as called by the transaction commit path); and the wrappers
bch2_write_buffer_flush() and flush_sync() take an explicit write ref.
This means internally the write buffer code can always use
BTREE_INSERT_NOCHECK_RW, instead of in the previous code passing flags
around and hoping the NOCHECK_RW flag was always carried around
correctly.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Now we can print out filesystem flags in sysfs, useful for debugging
various "what's my filesystem doing" issues.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We can't create stripes if we don't have enough devices - this
manifested as an integer underflow bug later.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We now include the name of the device in the error message - and also
increment the number of checksum errors on that device.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We're not supposed to have more than one btree_trans at a time in a
given thread - that causes recursive locking deadlocks.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This patch adds a superblock error counter for every distinct fsck
error; this means that when analyzing filesystems out in the wild we'll
be able to see what sorts of inconsistencies are being found and repair,
and hence what bugs to look for.
Errors validating bkeys are not yet considered distinct fsck errors, but
this patch adds a new helper, bkey_fsck_err(), in order to add distinct
error types for them as well.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We now track IO errors per device since filesystem creation.
IO error counts can be viewed in sysfs, or with the 'bcachefs
show-super' command.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We're using more stack than we'd like in a number of functions, and
btree_trans is the biggest object that we stack allocate.
But we have to do a heap allocatation to initialize it anyways, so
there's no real downside to heap allocating the entire thing.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
More reorganization, this splits up io.c into
- io_read.c
- io_misc.c - fallocate, fpunch, truncate
- io_write.c
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
clang had a few more warnings about enum conversion, and also didn't
like the opts.c initializer.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
As part of the forward compatibility patch series, we need to allow for
new key types without complaining loudly when running an old version.
This patch changes the flags parameter of bkey_invalid to an enum, and
adds a new flag to indicate we're being called from the transaction
commit path.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
- endianness fixes
- mark some things static
- fix a few __percpu annotations
- fix silent enum conversions
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add two new helpers for printing error messages with __func__ and
bch2_err_str():
- bch_err_fn
- bch_err_msg
Also kill the old error strings in the recovery path, which were causing
us to incorrectly report memory allocation failures - they're not needed
anymore.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add two new helpers for allocating memory with btree locks held: The
idea is to first try the allocation with GFP_NOWAIT|__GFP_NOWARN, then
if that fails - unlock, retry with GFP_KERNEL, and then call
trans_relock().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
GFP_NOIO dates from the bcache days, when we operated under the block
layer. Now, GFP_NOFS is more appropriate, so switch all GFP_NOIO uses to
GFP_NOFS.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Introduce new helpers for a common pattern:
bch2_trans_iter_init();
bch2_btree_iter_peek_slot();
- bch2_bkey_get_iter_type() returns -ENOENT if it doesn't find a key of
the correct type
- bch2_bkey_get_val_typed() copies the val out of the btree to a
(typically stack allocated) variable; it handles the case where the
value in the btree is smaller than the current version of the type,
zeroing out the remainder.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This adds a new field to bkey_ops for the minimum size of the value,
which standardizes that check and also enforces the new rule (previously
done somewhat ad-hoc) that we can extend value types by adding new
fields on to the end.
To make that work we do _not_ initialize min_val_size with sizeof,
instead we initialize it to the size of the first version of those
values.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We don't store backpointers in alloc keys anymore, since we gained the
btree write buffer.
This patch drops support for backpointers in alloc keys, and revs the on
disk format version so that we know a fsck is required.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This adds a flags param to bch2_backpointer_get_key() so that we can
pass BTREE_ITER_INTENT, since ec_stripe_update_extent() is updating the
extent immediately.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
A workqueue resource deadlock has been observed when running fsck
on a filesystem with a full/stuck journal. fsck is not currently
able to repair the fs due to fairly rapid emergency shutdown, but
rather than exit gracefully the fsck process hangs during the
shutdown sequence. Fortunately this is easily recoverable from
userspace, but the root cause involves code shared between the
kernel and userspace and so should be addressed.
The deadlock scenario involves the main task in the bch2_fs_stop()
-> bch2_fs_read_only() path waiting on write references to drain
with the fs state lock held. A bch2_read_only_work() workqueue task
is scheduled on the system_long_wq, blocked on the state lock.
Finally, various other write ref holding workqueue tasks are
scheduled to run on the same workqueue and must complete in order to
release references that the initial task is waiting on.
To avoid this problem, we can split the dependent workqueue tasks
across different workqueues. It's a bit of a waste to create a
dedicated wq for the read-only worker, but there are several tasks
throughout the fs that follow the pattern of acquiring a write
reference and then scheduling to the system wq. Use a local wq
for such tasks to break the subtle dependency between these and the
read-only worker.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This implements a new shutdown path for erasure coding, which is needed
for the upcoming BCH_WRITE_WAIT_FOR_EC write path.
The process is:
- Cancel new stripes being built up
- Close out/cancel open buckets on write points or the partial list
that are for stripes
- Shutdown rebalance/copygc
- Then wait for in flight new stripes to finish
With BCH_WRITE_WAIT_FOR_EC, move ops will be waiting on stripes to fill
up before they complete; the new ec shutdown path is needed for shutting
down copygc/rebalance without deadlocking.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This adds private error codes for most (but not all) of our ENOMEM uses,
which makes it easier to track down assorted allocation failures.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If we errored out on a new stripe before fully allocating it, we
shouldn't be zeroing out unwritten data.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This is not technically correct - it's subject to a race if we ever end
up with a stripe with all empty blocks (that needs to be deleted) being
held open. But the "correct" version was much too inefficient, and soon
we'll be adding a stripes LRU.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This will be used for move writes, which will be waiting until the
stripe is created to do the index update. They need to prevent the
stripe from being reclaimed until their index update is done, so we need
another refcount that just keeps the stripe open.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
# Conflicts:
# fs/bcachefs/ec.c
# fs/bcachefs/io.c
- __bch2_bkey_drop_ptr() -> bch2_bkey_drop_ptr_noerror(), now available
outside extents.
- Split bch2_bkey_has_device() and bch2_bkey_has_device_c(), const and
non const versions
- bch2_extent_has_ptr() now returns the pointer it found
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Now, any open_bucket can go on the partial list: allocating from the
partial list has been moved to its own dedicated function,
open_bucket_add_bucets() -> bucket_alloc_set_partial().
In particular, this means that erasure coded buckets can safely go on
the partial list; the new location works with the "allocate an ec bucket
first, then the rest" logic.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
It's possible that we reuse a stripe that doesn't have quite the same
configuration as the stripe_head we're allocating from. In that case, we
have to make sure that the new stripe uses the settings from the stripe
we resue, not the stripe head, and make sure the buffer is allocated
correctly.
This fixes the ec_mixed_tiers test.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Rework stripe creation path - new algorithm for deciding when to create
new stripes or reuse existing stripes.
We add a new allocation watermark, RESERVE_stripe, above RESERVE_none.
Then we always try to create a new stripe by doing RESERVE_stripe
allocations; if this fails, we reuse an existing stripe and allocate
buckets for it with the reserve watermark for the given write
(RESERVE_none or RESERVE_movinggc).
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Occasionally, we won't write to an entire bucket. This fixes the EC code
to handle this case, zeroing out the rest of the bucket as needed.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
It's possible when shutting down to for a stripe head to have a new
stripe that doesn't yet have any blocks allocated - we just need to free
it.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>