There are several spelling mistakes in error messages. Fix these.
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add a new btree for long running logged operations - i.e. for logging
operations that we can't do within a single btree transaction, so that
they can be resumed if we crash.
Keys in the logged operations btree will represent operations in
progress, with the state of the operation stored in the value.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
subvolume.c has gotten a bit large; this splits out a separate file just
for managing snapshot trees - BTREE_ID_snapshots.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Split out a new file from recovery.c for managing the list of keys we
read from the journal: before journal replay finishes the btree iterator
code needs to be able to iterate over and return keys from the journal
as well, so there's a fair bit of code here.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Awhile back, we changed bkey_format generation to ensure that the packed
representation could never represent fields larger than the unpacked
representation.
This was to ensure that bkey_packed_successor() always gave a sensible
result, but in the current code bkey_packed_successor() is only used in
a debug assertion - not for anything important.
This kills the requirement that we've gotten rid of those weird bkey
formats, and instead changes the assertion to check if we're dealing
with an old weird bkey format.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This fixes should_restart_for_topology_repair() - previously it was
returning false if the btree io path had already selected topology
repair to run, even if it hadn't run yet.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We want to ensure that fsck actually fixed all the errors it found - the
second fsck run should be clean.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This adds bch2_run_explicit_recovery_pass(), for rewinding recovery and
explicitly running a specific recovery pass - this is a more general
replacement for how we were running topology repair before.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This introduces bch2_run_explicit_recovery_pass() and uses it for when
fsck detects that we need to re-run dead snapshot cleanup, and makes
dead snapshot cleanup more like a normal recovery pass.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Before, it was parsed as a bool but internally it was really an enum:
this lets us pass in all the possible values.
But we special case the option parsing: no supplied value is parsed as
FSCK_FIX_yes, to match the previous behaviour.
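
A minimal sketch of that special case, not the real option parser: the
enum value names other than FSCK_FIX_yes and the parse_fix_errors()
helper are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Value names other than "yes" are assumed for illustration. */
enum fsck_fix {
	FSCK_FIX_exit,
	FSCK_FIX_yes,
	FSCK_FIX_no,
	FSCK_FIX_ask,
	FSCK_FIX_NR
};

static const char * const fsck_fix_names[FSCK_FIX_NR] = {
	"exit", "yes", "no", "ask",
};

/* Returns 0 and stores the parsed value in *res, or -1 if unknown. */
static int parse_fix_errors(const char *val, enum fsck_fix *res)
{
	assert(res);

	/* Option given with no value: keep the old bool-style meaning. */
	if (!val || !*val) {
		*res = FSCK_FIX_yes;
		return 0;
	}

	for (size_t i = 0; i < FSCK_FIX_NR; i++)
		if (!strcmp(val, fsck_fix_names[i])) {
			*res = (enum fsck_fix) i;
			return 0;
		}
	return -1;
}
```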
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This extends KEY_TYPE_snapshot to include some new fields:
- depth, to indicate depth of this particular node from the root
- skip[3], skiplist entries for quickly walking back up to the root
These are to improve bch2_snapshot_is_ancestor(), making it O(ln(n))
instead of O(n) in the snapshot tree depth.
Skiplist nodes are picked at random from the set of ancestor nodes, not
some fixed fraction.
This introduces bcachefs_metadata_version 1.1, snapshot_skiplists.
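
As a rough illustration of how depth and skip[3] can shorten the walk
to the root, here is a simplified sketch. The struct snap_node layout,
the array indexed by ID, and snap_is_ancestor() are assumptions for this
example; the real code works on snapshot IDs and differs in detail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical in-memory node; ID 0 is reserved to mean "none". */
struct snap_node {
	uint32_t	parent;		/* 0 for the root */
	uint32_t	depth;		/* distance from the root */
	uint32_t	skip[3];	/* some ancestors; 0 if unused */
};

/* Is @ancestor equal to @id, or somewhere on @id's path to the root? */
static bool snap_is_ancestor(const struct snap_node *t,
			     uint32_t id, uint32_t ancestor)
{
	assert(id && ancestor);

	while (id != ancestor && t[id].depth > t[ancestor].depth) {
		uint32_t next = t[id].parent;

		/* Take the longest skip that does not jump past @ancestor. */
		for (int i = 0; i < 3; i++) {
			uint32_t s = t[id].skip[i];

			if (s &&
			    t[s].depth >= t[ancestor].depth &&
			    t[s].depth < t[next].depth)
				next = s;
		}
		id = next;
	}
	return id == ancestor;
}
```

Because every skip entry is itself an ancestor, jumping to one never
leaves the path to the root; the depth check only prevents overshooting.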
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Now that we've got forward compatibility sorted out, we should be doing
more frequent version upgrades in the future.
To avoid having to run a full fsck for every version upgrade, this
improves the BCH_METADATA_VERSIONS() table to explicitly specify a
bitmask of recovery passes to run when upgrading to or past a given
version.
This means we can also delete PASS_UPGRADE().
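
The idea can be sketched as follows. The table contents, version
numbers and the upgrade_passes() helper are made up for illustration;
only the "bitmask of passes per version" scheme comes from this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table: passes required when upgrading to or past a version. */
struct version_entry {
	uint16_t	version;
	uint64_t	passes;		/* bitmask of recovery passes */
};

static const struct version_entry version_table[] = {
	{ 10, 0x0 },
	{ 11, 0x6 },
	{ 12, 0x1 },
	{ 13, 0x0 },
};

/* Union of the passes for every version in (old_version, new_version]. */
static uint64_t upgrade_passes(uint16_t old_version, uint16_t new_version)
{
	uint64_t passes = 0;
	size_t nr = sizeof(version_table) / sizeof(version_table[0]);

	assert(old_version <= new_version);

	for (size_t i = 0; i < nr; i++)
		if (version_table[i].version > old_version &&
		    version_table[i].version <= new_version)
			passes |= version_table[i].passes;
	return passes;
}
```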
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This introduces major/minor versioning to the superblock version number.
Major version number changes indicate incompatible releases; we can move
forward to a new major version number, but not backwards. Minor version
numbers indicate compatible changes - these add features, but can still
be mounted and used by old versions.
With the recent patches that make it possible to roll out new btrees and
key types without breaking compatibility, we should be able to roll out
most new features without incompatible changes.
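
A small sketch of how a major:minor pair can be packed into a single
version number. The 10 bit split and the helper names below are
assumptions for illustration, not necessarily the on disk encoding.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VERSION_MINOR_BITS	10	/* assumed split, for illustration */
#define VERSION_MINOR_MASK	((1U << VERSION_MINOR_BITS) - 1)

static uint16_t version_make(unsigned major, unsigned minor)
{
	assert(minor <= VERSION_MINOR_MASK);
	assert(major < (1U << (16 - VERSION_MINOR_BITS)));
	return (uint16_t) ((major << VERSION_MINOR_BITS) | minor);
}

static unsigned version_major(uint16_t v)
{
	return v >> VERSION_MINOR_BITS;
}

static unsigned version_minor(uint16_t v)
{
	return v & VERSION_MINOR_MASK;
}

/*
 * Code that understands up to @ours can use a filesystem at @on_disk as
 * long as the on disk major number is not newer; a newer minor is fine.
 */
static bool version_mountable(uint16_t ours, uint16_t on_disk)
{
	return version_major(on_disk) <= version_major(ours);
}
```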
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Recovery and fsck have many different passes/jobs to do, which always
run in the same order - but not all of them run all the time. Some are
for fsck, some for unclean shutdown, some for version upgrades.
This adds some new structure: a defined list of recovery passes that we
can run in a loop, as well as consolidating the log messages.
The main benefit is consolidating the "should run this recovery pass"
logic, as well as cleaning up the "this recovery pass has finished"
state; instead of having a bunch of ad-hoc state bits in c->flags, we've
now got c->curr_recovery_pass.
By consolidating the "should run this recovery pass" logic, in the
future on disk format upgrades will be able to say "upgrading to this
version requires x passes to run", instead of forcing all of fsck to
run.
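
The structure described above might look roughly like this. The pass
names, the WHEN_* flags and the struct fs fields other than
curr_recovery_pass are illustrative assumptions, not the real table.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define WHEN_ALWAYS	(1U << 0)
#define WHEN_FSCK	(1U << 1)
#define WHEN_UNCLEAN	(1U << 2)

/* Hypothetical pass list, in the order the passes must run. */
#define RECOVERY_PASSES()				\
	x(alloc_read,		WHEN_ALWAYS)		\
	x(check_topology,	WHEN_FSCK)		\
	x(journal_replay,	WHEN_ALWAYS)		\
	x(check_inodes,		WHEN_FSCK)		\
	x(fix_accounting,	WHEN_UNCLEAN)

enum recovery_pass {
#define x(name, when)	PASS_##name,
	RECOVERY_PASSES()
#undef x
	PASS_NR
};

static const unsigned pass_when[PASS_NR] = {
#define x(name, when)	[PASS_##name] = (when),
	RECOVERY_PASSES()
#undef x
};

struct fs {
	bool			fsck;
	bool			unclean_shutdown;
	uint64_t		passes_explicit; /* e.g. from an upgrade */
	uint64_t		passes_ran;	 /* stand-in for real work */
	enum recovery_pass	curr_recovery_pass;
};

/* All of the "should this pass run" logic lives in one place. */
static bool should_run_pass(const struct fs *c, enum recovery_pass pass)
{
	assert(pass < PASS_NR);

	if (c->passes_explicit & (1ULL << pass))
		return true;
	if ((pass_when[pass] & WHEN_FSCK) && c->fsck)
		return true;
	if ((pass_when[pass] & WHEN_UNCLEAN) && c->unclean_shutdown)
		return true;
	return (pass_when[pass] & WHEN_ALWAYS) != 0;
}

static void run_recovery_passes(struct fs *c)
{
	for (c->curr_recovery_pass = 0;
	     c->curr_recovery_pass < PASS_NR;
	     c->curr_recovery_pass++)
		if (should_run_pass(c, c->curr_recovery_pass))
			c->passes_ran |= 1ULL << c->curr_recovery_pass;
}
```

"Has pass X finished" then becomes a comparison against
curr_recovery_pass instead of a dedicated flag bit.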
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
For the upcoming enumeration of recovery passes, we need all recovery
passes to be called the same way - including journal replay.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This folds bch2_bucket_gens_read() into bch2_alloc_read(), doing the
version check there.
This is prep work for enumerating all recovery passes: we need some
cleanup first to make calling all the recovery passes consistent.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The version_upgrade parameter is now an enum, not a bool, and it's
persistent in the superblock:
- compatible (default): upgrade to the latest compatible version
- incompatible: upgrade to latest incompatible version
- none
Currently all upgrades are incompatible upgrades, but the next release
will introduce major:minor versions.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Version upgrades are not atomic operations: when we do a version upgrade
we need to update the superblock before we start using new features, and
then when the upgrade completes we need to update the superblock again.
This adds a new superblock field so we can detect and handle incomplete
version upgrades.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Now that we have distinct error codes for different memory allocation
failures, the early init log messages are no longer needed.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
- endianness fixes
- mark some things static
- fix a few __percpu annotations
- fix silent enum conversions
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We need to allow filesystems with metadata from newer versions to be
mountable and usable by older versions.
This patch enables us to roll out new btrees without a new major version
number; we can now handle btree roots for unknown btree types.
The unknown btree roots will be retained, and fsck (including
backpointers) will check them, the same as other btree types.
We add a dynamic array for the extra, unknown btree roots, in addition
to the fixed size btree root array, and add new helpers for looking up
btree roots.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This unifies JOURNAL_WATERMARK with BCH_WATERMARK; we're working towards
specifying watermarks once in the transaction commit path.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add two new helpers for printing error messages with __func__ and
bch2_err_str():
- bch_err_fn
- bch_err_msg
Also kill the old error strings in the recovery path, which were causing
us to incorrectly report memory allocation failures - they're not needed
anymore.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
As with previous conversions, replace -ENOENT uses with more informative
private error codes.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This adds a new btree which gets us a persistent per-snapshot-tree
identifier.
- BTREE_ID_snapshot_trees
- KEY_TYPE_snapshot_tree
- bch_snapshot now has a field that points to a snapshot_tree
This is going to be used to designate one snapshot ID/subvolume out of a
given tree of snapshots as the "main" subvolume, so that we can do quota
accounting in that subvolume and not the rest.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Introduce new helpers for a common pattern:
bch2_trans_iter_init();
bch2_btree_iter_peek_slot();
- bch2_bkey_get_iter_type() returns -ENOENT if it doesn't find a key of
the correct type
- bch2_bkey_get_val_typed() copies the val out of the btree to a
(typically stack allocated) variable; it handles the case where the
value in the btree is smaller than the current version of the type,
zeroing out the remainder.
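
The size handling in the second helper can be sketched like this;
copy_val_typed() is a hypothetical stand-in that only shows the
copy-then-zero step, not the btree lookup.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy a stored value of @src_bytes into a destination of @dst_bytes.
 * If the stored value is shorter (written by an older version), the
 * tail of the destination is zeroed; if it is longer, the extra bytes
 * are ignored.
 */
static void copy_val_typed(void *dst, size_t dst_bytes,
			   const void *src, size_t src_bytes)
{
	size_t n = src_bytes < dst_bytes ? src_bytes : dst_bytes;

	assert(dst && src);

	memcpy(dst, src, n);
	memset((char *) dst + n, 0, dst_bytes - n);
}
```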
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We're just doing cpu work here and it could take a while, so a
cond_resched() is definitely needed.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We don't store backpointers in alloc keys anymore, since we gained the
btree write buffer.
This patch drops support for backpointers in alloc keys, and revs the on
disk format version so that we know a fsck is required.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If we block on journal reservation attempting to log journal
messages during recovery, particularly for the first message(s)
before we start doing actual work, chances are the filesystem ends
up deadlocked.
Allow logged messages to use reserved journal space to mitigate this
problem. In the worst case where no space is available whatsoever,
this at least allows the fs to recognize that the journal is stuck
and fail the mount gracefully.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We may end up in a situation where allocating the buffer for the sorted
journal_keys fails - but it would likely succeed, post compaction where
we drop duplicates.
We've had reports of this allocation failing, so this adds a slowpath to
do the compaction incrementally.
This is only a band-aid fix; we need to look at limiting the number of
keys in the journal based on the amount of system RAM.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This adds private error codes for most (but not all) of our ENOMEM uses,
which makes it easier to track down assorted allocation failures.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Rust bindgen doesn't cope well with anonymous structs and unions. This
patch drops the fancy anonymous structs & unions in bkey_i that let us
use the same helpers for bkey_i and bkey_packed; since bkey_packed is an
internal type that's never exposed to outside code, it's only a minor
inconvenience.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Now that we have a separate data structure for tracking open stripes,
the stripes heap can track all existing stripes, which is a nice
simplification.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Now that we have much more efficient updates to the LRU btree, this
patch adds a new LRU that indexes buckets by fragmentation.
This means copygc no longer has to scan every bucket to find buckets
that need to be evacuated.
Changes:
- A new field in bch_alloc_v4, fragmentation_lru - this corresponds to
the bucket's position in the fragmentation LRU. We add a new field
for this instead of calculating it as needed because we may make the
fragmentation LRU optional; this field indicates whether a bucket is
on the fragmentation LRU.
Also, zoned devices will introduce variable bucket sizes; explicitly
recording the LRU position will be safer for them.
- A new copygc path for using the fragmentation LRU instead of
scanning every bucket and building up an in-memory heap.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If we failed to read a btree root - or if we're not using a btree root,
because of the reconstruct_alloc option - make sure we update the
corresponding info for the key/level for the root on disk.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This patch changes how the LRU index works:
Instead of using KEY_TYPE_lru where the bucket the lru entry points to
is part of the value, this switches to KEY_TYPE_set and encoding the
bucket we refer to in the low bits of the key.
This means that we no longer have to check for collisions when inserting
LRU entries. We'll be making use of this in the next patch, which adds
a btree write buffer - a pure write buffer for btree updates, where
updates are appended to a simple array and then periodically sorted and
batch inserted.
This is a new on disk format version, and a forced upgrade.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
To improve mount times, add a btree for just bucket gens, 256 of them
per key: this means we'll have to scan drastically less metadata at
startup.
This adds
- trigger for keeping it in sync with the alloc btree
- initialization code, for filesystems from previous versions
- new path for reading bucket gens
- new fsck code
And a new on disk format version.
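
With 256 gens per key, mapping a bucket number to a key and a slot
within it is simple bit arithmetic. The struct and helper names below
are assumptions for illustration:

```c
#include <assert.h>
#include <stdint.h>

#define BUCKET_GENS_BITS	8
#define BUCKET_GENS_NR		(1U << BUCKET_GENS_BITS)	/* 256 */

/* Hypothetical key: one key covers 256 consecutive buckets. */
struct bucket_gens_key {
	uint64_t	offset;			/* bucket >> 8 */
	uint8_t		gens[BUCKET_GENS_NR];
};

static uint64_t bucket_gens_key_offset(uint64_t bucket)
{
	return bucket >> BUCKET_GENS_BITS;
}

static unsigned bucket_gens_idx(uint64_t bucket)
{
	return bucket & (BUCKET_GENS_NR - 1);
}

static uint8_t bucket_gen_lookup(const struct bucket_gens_key *k,
				 uint64_t bucket)
{
	assert(k->offset == bucket_gens_key_offset(bucket));
	return k->gens[bucket_gens_idx(bucket)];
}
```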
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Move bi_size and bi_sectors into the non-varint portion of the inode, so
that the write path can update them without going through the relatively
expensive unpack/pack operations.
Other changes:
- Add a field for the offset of the varint section, so we can add new
non-varint fields without needing a new inode type, like alloc_v3
- Move bi_mode into the flags field, so that the varint section can be
u64 aligned
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_gc may require snapshots to be started - the repair path when
checking the reflink btree may do updates to the extents btree.
This moves bch2_fs_initialize_subvolumes() and bch2_fs_snapshots_start()
to before bch2_gc() - since we haven't gone RW yet, the updates in
bch2_fs_initialize_subvolumes() are done via the journal replay keys
list, so it's fine to do this before bch2_gc().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This patch adds backpointers: we now have a reverse index from device
and offset on that device (specifically, offset within a bucket) back to
btree nodes and (non cached) data extents.
The first 40 backpointers within a bucket are stored in the alloc key;
after that backpointers spill over to the next backpointers btree. This
is to help avoid performance regressions from additional btree updates
on large streaming workloads.
This patch adds all the code for creating, checking and repairing
backpointers. The next patch in the series is going to use backpointers
for copygc - finally getting rid of the need to scan all extents to do
copygc.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
It's possible to do btree updates before going RW by adding them to the
list of updates for journal replay to do, but this is limited by what
fits in RAM. This patch switches the second alloc info phase to run
after going RW - btree_gc has already ensured the alloc btree itself is
correct - and tweaks the allocation path to deal with the potential
small inconsistencies.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
In the distant past, it wasn't possible to start copygc until after
journal replay had finished. Now, the btree iterator code overlays keys
from the journal, so there's no reason not to start it earlier - and it
solves a rare deadlock.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This patch
- Adds a mechanism for queuing up journal entries prior to the journal
being started, which will be used for early journal log messages
- Adds bch2_fs_log_msg() and improves bch2_trans_log_msg(), which now
take format strings. bch2_fs_log_msg() can be used before or after
the journal has been started, and will use the appropriate mechanism.
- Deletes the now obsolete bch2_journal_log_msg()
- And adds more log messages to the recovery path - messages for
journal/filesystem started, journal entries being blacklisted, and
journal replay starting/finishing.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If it so happens that we crash while dirty, meaning we don't have the
superblock clean section, and we erroneously mark a journal entry we
wrote as blacklisted, we won't be able to recover.
This patch fixes this by adding a fallback: if we've got no superblock
clean section, and no non-ignored journal entries, we try the most
recent ignored journal entry.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_journal_keys_peek_upto() was comparing against btree_id & level
incorrectly - fix this by using __journal_key_cmp().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This tweaks the recovery and journal paths so that we don't error out
before we need to: the list_journal command should work, even if we
wouldn't be able to replay successfully.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This patch introduces
- bpos_eq()
- bpos_lt()
- bpos_le()
- bpos_gt()
- bpos_ge()
and equivalent replacements for bkey_cmp().
Looking at the generated assembly these could probably be improved
further, but we already see a significant code size improvement with
this patch.
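
For reference, a simplified sketch of what such helpers can look like.
The struct bpos here is reduced to three plain fields, which is an
assumption; the real layout and implementation may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified position: ordered by (inode, offset, snapshot). */
struct bpos {
	uint64_t	inode;
	uint64_t	offset;
	uint32_t	snapshot;
};

static inline bool bpos_eq(struct bpos l, struct bpos r)
{
	return l.inode == r.inode &&
	       l.offset == r.offset &&
	       l.snapshot == r.snapshot;
}

static inline bool bpos_lt(struct bpos l, struct bpos r)
{
	return l.inode != r.inode   ? l.inode < r.inode :
	       l.offset != r.offset ? l.offset < r.offset :
				      l.snapshot < r.snapshot;
}

/* The remaining predicates follow from bpos_lt(). */
static inline bool bpos_gt(struct bpos l, struct bpos r)
{
	return bpos_lt(r, l);
}

static inline bool bpos_le(struct bpos l, struct bpos r)
{
	return !bpos_lt(r, l);
}

static inline bool bpos_ge(struct bpos l, struct bpos r)
{
	return !bpos_lt(l, r);
}

/* Call sites change from "cmp(a, b) < 0" to "bpos_lt(a, b)". */
static void bpos_selftest(void)
{
	struct bpos a = { 1, 10, 0 }, b = { 1, 10, 5 }, c = { 2, 0, 0 };

	assert(bpos_lt(a, b) && bpos_lt(b, c) && bpos_gt(c, a));
	assert(bpos_le(a, a) && bpos_ge(a, a) && bpos_eq(a, a));
	assert(!bpos_eq(a, b) && !bpos_ge(a, b));
}
```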
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
- Marking a non-static function as inline doesn't actually work and is
now causing problems - drop that
- Introduce BCACHEFS_LOG_PREFIX for when we want to prefix log messages
with bcachefs (filesystem name)
- Userspace doesn't have real percpu variables (maybe we can get this
fixed someday), put an #ifdef around bch2_disk_reservation_add()
fastpath
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
- In the btree iterator code that overlays keys from the journal, we
were incorrectly specifying level=0 instead of the btree_path's
current level in a few places
- When we didn't do journal replay, we shouldn't free the journal keys:
this fixes cmd_list and cmd_dump, which run in norecovery mode
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
checkpatch.pl gives lots of warnings that we don't want - suggested
ignore list:
  ASSIGN_IN_IF
  UNSPECIFIED_INT     - bcachefs coding style prefers single token
                        type names
  NEW_TYPEDEFS        - typedefs are occasionally good
  FUNCTION_ARGUMENTS  - we prefer to look at functions in .c files
                        (hopefully with docbook documentation), not
                        .h file prototypes
  MULTISTATEMENT_MACRO_USE_DO_WHILE
                      - we have _many_ x-macros and other macros
                        where we can't do this
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This removes an optimization that didn't actually save us any memory,
due to alignment, but did make the code more complicated than it needed
to be. We were also seeing a bug where journal_seq_base wasn't getting
correctly initialized, so hopefully it'll fix that too.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We can rebuild alloc info if these btree roots are missing - no need to
bail out and say the filesystem is unrecoverable
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
fsck doesn't want to run while we're cleaning up deleted snapshots - if
that work needs to be done, we want it to have finished before fsck
runs, otherwise fsck will get confused when it finds multiple keys in
the same snapshot ID equivalence class (i.e. the mechanism that
snapshot deletion uses for cleaning up redundant keys).
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This converts bcachefs to the modern printbuf interface/implementation,
synced with the version to be submitted upstream.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We had a bug where btree_and_journal_iter would return the same key
twice - after deleting it (perhaps because it was present in both the
btree and the journal?)
This reworks btree_and_journal_iter to track the current position, much
like btree_paths, which makes the logic considerably simpler and more
robust.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
cmd_list_journal wasn't correctly listing the most recent journal
entries as blacklisted - because in the recovery path when just reading
the journal, we were failing to add those to the blacklist table.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Previously, on every btree_iter_peek() operation we were searching the
journal keys, doing a full binary search - which was slow.
This patch fixes that by saving our position in the journal keys, so
that we only do a full binary search when moving our position backwards
or a large jump forwards.
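
A sketch of the idea on a plain sorted array of integers. The types,
jkeys_peek() and the MAX_LINEAR_STEPS threshold are assumptions made
for this example, not the actual journal keys code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct jkeys {
	const uint64_t	*d;	/* sorted ascending */
	size_t		nr;
};

struct jiter {
	size_t		idx;		/* position saved by the last peek */
	unsigned	full_searches;	/* only here to show the effect */
};

#define MAX_LINEAR_STEPS	8

/* Index of the first key >= @pos, by binary search. */
static size_t jkeys_bsearch(const struct jkeys *k, uint64_t pos)
{
	size_t l = 0, r = k->nr;

	while (l < r) {
		size_t m = l + (r - l) / 2;

		if (k->d[m] < pos)
			l = m + 1;
		else
			r = m;
	}
	return l;
}

/* Index of the first key >= @pos (k->nr if none), reusing @it->idx. */
static size_t jkeys_peek(const struct jkeys *k, struct jiter *it,
			 uint64_t pos)
{
	assert(k && it);

	/* The saved index is usable only if every key before it is < pos. */
	if (it->idx <= k->nr &&
	    (it->idx == 0 || k->d[it->idx - 1] < pos)) {
		size_t i = it->idx;
		unsigned steps = 0;

		while (i < k->nr && k->d[i] < pos &&
		       steps < MAX_LINEAR_STEPS) {
			i++;
			steps++;
		}
		if (i == k->nr || k->d[i] >= pos) {
			it->idx = i;
			return i;
		}
	}

	/* Moved backwards, or too far forwards: do the full search. */
	it->full_searches++;
	it->idx = jkeys_bsearch(k, pos);
	return it->idx;
}
```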
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
- Drop old unneeded parameter for whether we're in initial GC - which
was from when btree updates had to be done differently before we
went RW.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
journal_iters_fix() was incorrectly rewinding iterators past keys they
had already returned, leading to those keys being double counted in the
bch2_gc() path - oops.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
btree updates before going RW are expensive if they're in random order,
since they use the list of keys for journal replay to insert, which is
just a gap buffer.
This patch improves the bucket invalidate path so that if
bch2_check_lrus() hasn't finished it only prints warnings instead of
doing an emergency shutdown, which means we can now set BCH_FS_MAY_GO_RW
before bch2_check_lrus().
Also, the filesystem state bits are reorganized a bit.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This option was useful when the replicas mechanism was new and still
being debugged, but hasn't been used in ages - let's delete it.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
In journal replay, we weren't immediately dropping journal pins when we
started doing updates that weren't from journal replay - leading to
journal reclaim getting stuck.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
When many journal replay keys have been overwritten,
bch2_journal_keys_peek() was taking excessively long to scan before it
found a key to return.
Fix this by introducing bch2_journal_keys_peek_upto() which takes a
parameter for the end of the range we want, so that we can terminate the
search much sooner, and replace all uses of bch2_journal_keys_peek()
with peek_upto() or peek_slot().
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Previously, the journal read path used a linked list for storing the
journal entries we read from disk. But there's been a bug that's been
causing journal_flush_delay to incorrectly be set to 0, leading to far
more journal entries than is normal being written out, which then means
filesystems are no longer able to start due to the O(n^2) behaviour of
inserting into/searching that linked list.
Fix this by switching to a radix tree.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
When there aren't any keys in the journal there's no need to allocate
the buffer - and attempting to do so causes a spurious -ENOMEM.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Previously, we were missing accounting for buckets in need_gc_gens and
need_discard states. This matters because buckets in those states need
other btree operations done before they can be used, so they can't be
counted when checking current number of free buckets against the
allocation watermark.
Also, we weren't directly counting free buckets at all. Now, data type 0
== BCH_DATA_free, and free buckets are counted; this means we can get
rid of the separate (poorly defined) count of unavailable buckets.
This is a new on disk format version, with upgrade and fsck required for
the accounting changes.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
- Move checks for whether the device & bucket are valid from the
.key_invalid method to bch2_check_alloc_key(). This is because
.key_invalid() is called on keys that may no longer exist (post
journal replay), which is a problem when removing/resizing devices.
- We weren't checking the need_discard btree to ensure that every set
bucket has a corresponding alloc key. This refactors the code for
checking the freespace btree, so that it now checks both.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Btree updates before we go RW work by inserting into the array of keys
that journal replay will insert - but inserting into a flat array is
O(n), meaning if btree_gc needs to update many alloc keys, we're O(n^2).
Fortunately, the updates btree_gc does happen in sequential order,
which means a gap buffer works nicely here - this patch implements a gap
buffer for journal keys.
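
For readers unfamiliar with the data structure, a minimal fixed-size
gap buffer might look like this. All names are hypothetical and the
real code also has to grow the buffer, which is omitted here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Live elements occupy [0, gap) and [gap + gap_len, size); the unused
 * slots in between are the gap. Inserting where the gap already is
 * costs O(1); moving the gap costs one memmove.
 */
struct gap_buf {
	uint64_t	*d;
	size_t		size;	/* allocated slots */
	size_t		nr;	/* live elements */
	size_t		gap;	/* index where the gap starts */
};

static size_t gap_len(const struct gap_buf *b)
{
	return b->size - b->nr;
}

static void gap_move(struct gap_buf *b, size_t idx)
{
	size_t len = gap_len(b);

	if (idx < b->gap)	/* shift [idx, gap) up past the gap */
		memmove(b->d + idx + len, b->d + idx,
			(b->gap - idx) * sizeof(*b->d));
	else if (idx > b->gap)	/* shift elements after the gap down */
		memmove(b->d + b->gap, b->d + b->gap + len,
			(idx - b->gap) * sizeof(*b->d));
	b->gap = idx;
}

/* Insert @v at logical index @idx; returns -1 if the buffer is full. */
static int gap_insert(struct gap_buf *b, size_t idx, uint64_t v)
{
	assert(idx <= b->nr);

	if (b->nr == b->size)
		return -1;

	gap_move(b, idx);
	b->d[b->gap++] = v;
	b->nr++;
	return 0;
}

static uint64_t gap_get(const struct gap_buf *b, size_t idx)
{
	assert(idx < b->nr);
	return b->d[idx < b->gap ? idx : idx + gap_len(b)];
}
```

Sequential inserts leave the gap right where the next insert goes, so
no elements need to be moved.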
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
All code using the in-memory bucket array, excluding GC, has now been
converted to use the alloc btree directly - so we can finally delete it.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Now that we have new persistent data structures for the allocator, this
patch converts the allocator to use them.
Now, foreground bucket allocation uses the freespace btree to find
buckets to allocate, instead of popping buckets off the freelist.
The background allocator threads are no longer needed and are deleted,
as well as the allocator freelists. Now we only need background tasks
for invalidating buckets containing cached data (when we are low on
empty buckets), and for issuing discards.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This adds two new btrees for the upcoming allocator rewrite: an extents
btree of free buckets, and a btree for buckets awaiting discards.
We also add a new trigger for alloc keys to keep the new btrees up to
date, and a compatibility path to initialize them on existing
filesystems.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Since journal reclaim -> btree key cache flushing may require the
allocation of new btree nodes, it has an implicit dependency on copygc
in order to make forward progress - so we should avoid blocking copygc
unless the journal is really close to full.
This introduces watermarks to replace our single MAY_GET_UNRESERVED bit
in the journal, and adds a watermark for copygc and plumbs it through.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This adds bch2_journal_log_msg(), which just logs a message to the
journal, and uses it to mark startup and when journal replay finishes.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This patch changes printbufs to dynamically allocate and reallocate a
buffer as needed. Stack usage has become a bit of a problem, and a major
cause of that has been static size string buffers on the stack.
The most involved part of this refactoring is that printbufs must now be
exited with printbuf_exit().
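
The general shape of a heap-backed print buffer, as a userspace sketch:
struct pbuf, pbuf_printf() and pbuf_exit() are hypothetical stand-ins
built on realloc()/vsnprintf(), not the real printbuf code.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

struct pbuf {
	char	*buf;
	size_t	size;			/* allocated bytes */
	size_t	pos;			/* bytes used, excluding the NUL */
	bool	allocation_failure;
};

static void pbuf_printf(struct pbuf *out, const char *fmt, ...)
{
	va_list args;
	int len;

	assert(out && fmt);

	/* First pass: measure how much room the output needs. */
	va_start(args, fmt);
	len = vsnprintf(NULL, 0, fmt, args);
	va_end(args);
	if (len < 0)
		return;

	if (out->pos + (size_t) len + 1 > out->size) {
		size_t new_size = (out->pos + (size_t) len + 1) * 2;
		char *n = realloc(out->buf, new_size);

		if (!n) {
			out->allocation_failure = true;
			return;
		}
		out->buf  = n;
		out->size = new_size;
	}

	/* Second pass: format into the (possibly grown) buffer. */
	va_start(args, fmt);
	vsnprintf(out->buf + out->pos, out->size - out->pos, fmt, args);
	va_end(args);
	out->pos += (size_t) len;
}

/* Every user must call this once done, or the buffer leaks. */
static void pbuf_exit(struct pbuf *out)
{
	free(out->buf);
	out->buf  = NULL;
	out->size = 0;
	out->pos  = 0;
}
```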
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This improves __bch2_trans_commit - early in the recovery process, when
we're running btree_gc and before we want to go RW, it now uses
bch2_journal_key_insert() to add the update to the list of updates for
journal replay to do, instead of btree_gc having to use separate
interfaces depending on whether we're running at bringup or, later,
runtime.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This patch improves the superblock .to_text() methods and adds methods
for all types that were missing them. It also improves printbufs by
allowing them to specify what units we want to be printing in, and adds
new wrapper methods for unifying our kernel and userspace environments.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Add an option that tells recovery to only read the journal, to be used
by the list_journal command.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This changes the btree_gc code to only use the second bucket array, the
one dedicated to GC. On completion, it compares what's in its in memory
bucket array to the allocation information in the btree and writes it
directly, instead of updating the main in-memory bucket array and
writing that.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Previously, when doing updates and running triggers before journal
replay completes, triggers would see the incorrect key for the old key
being overwritten - this patch updates the trigger code to check the
journal keys when necessary, needed for the upcoming allocator rewrite.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This reverts commit f95b61228efd04c9c158123da5827c96e9773b29.
It turns out, we're seeing filesystems in the wild end up with
blacklisted btree node bsets - this should not be happening, and until
we understand why and fix it we need to keep this code around.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
- Add a shim uuid_unparse_lower() in the kernel, since %pU doesn't work
in userspace
- We don't need to print the bcachefs: or the filesystem name prefix in
userspace
- Improve a few error messages
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
With BTREE_ITER_WITH_JOURNAL, there are no longer any restrictions on the
order we have to replay keys from the journal in, and we can also start
up journal reclaim right away - and delete a bunch of code.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This adds a new btree iterator flag, BTREE_ITER_WITH_JOURNAL, that is
automatically enabled when initializing a btree iterator before journal
replay has completed - it overlays the contents of the journal with the
btree.
This lets us delete bch2_btree_and_journal_walk() and just use the
normal btree iterator interface instead - which also lets us delete a
significant amount of duplicated code.
Note that BTREE_ITER_WITH_JOURNAL is still unoptimized in this patch -
we're redoing the binary search over keys in the journal every time we
call bch2_btree_iter_peek().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If we're not running fsck we still want to set BCH_FS_FSCK_DONE, so that
bch2_fsck_err() calls are interpreted as bch2_inconsistent_error()
calls.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Add a flag to indicate whether a journal replay key has been
overwritten, and set/test it with appropriate btree locks held.
This fixes a race between the allocator - invalidating buckets, and
doing btree updates - and journal replay, which before this patch could
clobber the allocator thread's update with an older version of the key
from the journal.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This adds a _to_text() pretty printer for journal entries - including
every subtype - which will shortly be used by the 'bcachefs
list_journal' subcommand.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
The upcoming BTREE_ITER_WITH_JOURNAL patch will require journal keys to
stay in sorted order, so the btree iterator code can overlay them over
btree keys.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
In the recovery path, we scan for old btree nodes if we don't have
certain compat bits set. If we do this, we should be doing it after we
upgraded to the newest on disk format.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Since metadata version bcachefs_metadata_version_btree_ptr_sectors_written,
we haven't needed the journal seq blacklist mechanism for ignoring
blacklisted btree node writes - we now only need it for ignoring journal
entries that were written after the newest flush journal entry, and then
we only need to keep those blacklist entries around until journal replay
is finished.
That means we can delete the code for scanning btree nodes to GC
journal_seq_blacklist entries.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
bch2_journal_key_insert() used to assume that the key passed to it was
allocated with kmalloc(), and on success took ownership. This patch
deletes that behaviour, making it more similar to
bch2_trans_update()/bch2_trans_commit().
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
If the allocator threads start before journal replay has finished
replaying alloc keys, journal replay might overwrite the allocator's
btree updates.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Now that bch2_bucket_alloc_new_fs() isn't looking at bucket marks to
decide what buckets are eligible to allocate, we can clean up the
filesystem initialization and device add paths. Previously, we had to
use ancient code to mark superblock/journal buckets in the in memory
bucket marks as we allocated them, and then zero that out and re-do that
marking using the newer transactional bucket mark paths. Now, we can
simply delete the in-memory bucket marking.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This changes bch2_bucket_alloc_new_fs() to a simple bump allocator that
doesn't need to use the in memory bucket array, part of a larger patch
series to entirely get rid of the in memory bucket array, except for
gc/fsck.
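
For illustration, a bump allocator of this kind might look like the
sketch below. Everything here is an assumption made for the example:
struct dev_sketch, its fields, and the bucket_is_reserved() placeholder
are hypothetical and do not mirror the real bch2_bucket_alloc_new_fs().

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified per-device state for early allocation. */
struct dev_sketch {
	uint64_t	first_bucket;		/* first bucket usable for data */
	uint64_t	nbuckets;		/* total buckets on the device */
	uint64_t	new_fs_bucket_idx;	/* next bucket to hand out */
};

/*
 * Placeholder predicate: real code would check whether a bucket holds a
 * superblock or journal. Here buckets 0 and 1 stand in as "reserved".
 */
static bool bucket_is_reserved(uint64_t b)
{
	return b < 2;
}

/*
 * Bump allocation: hand out buckets in increasing order and never look
 * at an in-memory bucket array. Returns a bucket number, or -1 once the
 * device has no buckets left.
 */
static int64_t bucket_alloc_new_fs_sketch(struct dev_sketch *ca)
{
	while (ca->new_fs_bucket_idx < ca->nbuckets) {
		uint64_t b = ca->new_fs_bucket_idx++;

		if (b >= ca->first_bucket && !bucket_is_reserved(b))
			return (int64_t) b;
	}

	return -1;
}
```

The only state is a single index per device, which is what makes the
in-memory bucket marks unnecessary on this path.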
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This adds a new helper, much like the one we have for inode updates,
that allocates the packed alloc key, packs it and calls
bch2_trans_update.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We have two radix trees of stripes - one that mirrors some information
from the stripes btree in normal operation, and another that GC uses to
recalculate block usage counts.
The normal one is now only used for finding partially empty stripes in
order to reuse them - the normal stripes radix tree and the GC stripes
radix tree are used significantly differently, so this patch splits them
into separate types.
In an upcoming patch we'll be replacing c->stripes with a btree that
indexes stripes by the order we want to reuse them.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
- bch2_journal_halt() was unconditionally overwriting j->err_seq, the
sequence number that we failed to write
- journal_write_done was updating seq_ondisk and flushed_seq_ondisk even
for writes that errored, which broke the way bch2_journal_flush_seq_async()
locklessly checked for completions.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Change log messages in userspace to be closer to what they are in kernel
space, and include the device name - it's also useful in userspace.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This consolidates duplicated code in journal replay - it's only a few
flags that are different for replaying alloc keys.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Add fields to inode & alloc keys that record the journal sequence number
when they were most recently modified.
For alloc keys, this is needed to know what journal sequence number we
have to flush before the bucket can be reused. Currently this is tracked
in memory, but we'll be getting rid of the in memory bucket array.
For inodes, this is needed for fsync when the inode has been evicted
from the vfs cache. Currently we use a bloom filter per outstanding
journal buf - but that mechanism has been broken since we added the
ability to not issue a flush/fua for every journal write.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This helps to unify the interface between bch2_mark_key() and
bch2_trans_mark_key() - and it also gives access to the journal
reservation and journal seq in the mark_key path.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
nochanges mode is often used for getting data off of otherwise
nonrecoverable filesystems, which is often because of errors hit during
fsck.
Don't force version upgrade & fsck in nochanges mode, so that it's more
likely to mount.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This changes the on disk format for dirents that point to subvols so
that they also record the subvolid of the parent subvol, so that we can
filter them out in other subvolumes.
This also updates the dirent code to do that filtering, and in
particular tweaks the rename code - we need to ensure that there's only
ever one dirent (counting multiplicities in different snapshots) that
points to a subvolume.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
We had a bug where reflink_p pointers weren't being initialized to 0,
and when we started using the second word, things broke badly.
This patch revs the on disk format version and adds cleanup code to zero
out the second word of reflink_p pointers before we start using it.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This will cause the compat code to be run that creates entries in the
subvolumes and snapshots btrees.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This is the final patch in the patch series implementing snapshots.
This patch implements two new ioctls that work like creation and
deletion of directories, but fancier.
- BCH_IOCTL_SUBVOLUME_CREATE, for creating new subvolumes and snapshots
- BCH_IOCTL_SUBVOLUME_DESTROY, for deleting subvolumes and snapshots
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
To implement snapshots, we need every filesystem btree operation (every
btree operation without a subvolume) to start by looking up the
subvolume and getting the current snapshot ID, with
bch2_subvolume_get_snapshot() - then, that snapshot ID is used for doing
btree lookups in BTREE_ITER_FILTER_SNAPSHOTS mode.
This patch adds those bch2_subvolume_get_snapshot() calls, and also
switches to passing around a subvol_inum instead of just an inode
number.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This patch adds subvolume.c - support for the subvolumes and snapshots
btrees and related data types and on disk data structures. The next
patches will start hooking up this new code to existing code.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This splits btree_iter into two components: btree_iter is now the
externally visible component, and it points to a btree_path which is now
reference counted.
This means we no longer have to clone iterators up front if they might
be mutated - btree_path can be shared by multiple iterators, and cloned
if an iterator would mutate a shared btree_path. This will help us use
iterators more efficiently, as well as slimming down the main long lived
state in btree_trans, and significantly cleans up the logic for iterator
lifetimes.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This is prep work for splitting btree_path out from btree_iter -
btree_path will not have a pointer to btree_trans.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This adds progress stats to sysfs for copygc, rebalance, recovery, and the
cmd_job ioctls.
Signed-off-by: Brett Holman <bholman.devel@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This fixes a bad ptr deref on recovery from unclean shutdown in
bch2_btree_node_get_noiter().
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This closes a significant hole (and last known hole) in our ability to
verify metadata. Previously, since btree nodes are log structured, we
couldn't detect lost btree writes that weren't the first write to a
given node. Additionally, this seems to have led to some significant
metadata corruption on multi device filesystems with metadata
replication: since a write may have made it to one device and not
another, if we read that btree node back from the replica that did have
that write and started appending after that point, the other replica
would have a gap in the bset entries and reading from that replica
wouldn't find the rest of the bsets.
But, since updates to interior btree nodes are now journalled, we can
close this hole by updating pointers to btree nodes after every write
with the currently written number of sectors, without negatively
affecting performance. This means we will always detect lost or corrupt
metadata - it also means that our btree is now a curious hybrid of COW
and non COW btrees, with all the benefits of both (excluding
complexity).
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Adding iter->should_be_locked introduced a regression where it ended up
not being set on the iterator passed to bch2_btree_update_start(), which
is definitely not what we want.
This patch requires it to be set when calling bch2_trans_update(), and
adds various fixups to make that happen.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
If a filesystem on disk was used by a version with a larger BCH_DATA_NR
than the currently running version, we don't want this to cause a buffer
overrun.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
We really need debug mode assertions that ca->ref and ca->io_ref are
used correctly.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This splits out btree topology repair into a separate pass, and makes
some improvements:
- When we have to pick which of two overlapping nodes to drop keys
from, we use the btree node header sequence number to preserve the
newer node
- the gc code has been changed so that it doesn't bail out if we're
continuing/ignoring on fsck error - this way the dump tool can skip
running the repair pass but still walk all reachable metadata
- add a new superblock flag indicating when a filesystem is known to
have btree topology issues, and the topology repair pass should be
run
- changing the start/end of a node might mean keys in that node have to
be deleted: this patch handles that better by splitting it out into a
separate function and running it explicitly in the topology repair
code, previously those keys were only being dropped when the btree
node was read in.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This lets us simplify fsck quite a bit, which we need for making fsck
snapshot aware.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We've had BCH_FEATURE_atomic_nlink for quite some time, we can drop this
now.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This patch starts treating the bpos.snapshot field like part of the key
in the btree code:
* bpos_successor() and bpos_predecessor() now include the snapshot field
* Keys in btrees that will be using snapshots (extents, inodes, dirents
and xattrs) now always have their snapshot field set to U32_MAX
The btree iterator code gets a new flag, BTREE_ITER_ALL_SNAPSHOTS, that
determines whether we're iterating over keys in all snapshots or not -
internally, this controls whether bkey_(successor|predecessor)
increment/decrement the snapshot field, or only the higher bits of the
key.
We add a new member to struct btree_iter, iter->snapshot: when
BTREE_ITER_ALL_SNAPSHOTS is not set, iter->pos.snapshot should always
equal iter->snapshot, which will be 0 for btrees that don't use
snapshots, and always U32_MAX for btrees that will use snapshots
(until we enable snapshot creation).
This patch also introduces a new metadata version number, and compat
code for reading from/writing to older versions - this isn't a forced
upgrade (yet).
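
As a rough sketch of the two successor behaviours described above,
assuming a simplified position type (struct bpos_sketch and both
function names are hypothetical, and overflow of the inode field is
deliberately not handled):

```c
#include <stdint.h>

/* Simplified stand-in for struct bpos: snapshot is the lowest field. */
struct bpos_sketch {
	uint64_t	inode;
	uint64_t	offset;
	uint32_t	snapshot;
};

/*
 * Successor when iterating over all snapshots: the snapshot field is
 * part of the key, so it is incremented first and carries into offset,
 * then inode.
 */
static struct bpos_sketch bpos_successor_sketch(struct bpos_sketch p)
{
	if (!++p.snapshot && !++p.offset)
		++p.inode;	/* inode overflow not handled in this sketch */

	return p;
}

/*
 * Successor when not iterating over all snapshots: only the higher
 * fields advance, and the snapshot field is reset.
 */
static struct bpos_sketch bpos_nosnap_successor_sketch(struct bpos_sketch p)
{
	p.snapshot = 0;

	if (!++p.offset)
		++p.inode;	/* inode overflow not handled in this sketch */

	return p;
}
```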
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
With snapshots, we're going to need to differentiate between comparisons
that should and shouldn't include the snapshot field. bpos_cmp is now
the comparison function that does include the snapshot field, used by
core btree code.
Upper level filesystem code generally does _not_ want to compare against
the snapshot field - that code wants keys to compare as equal even when
one of them is in an ancestor snapshot.
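
The distinction can be sketched as follows. The struct and the
_sketch-suffixed helpers are simplified, hypothetical stand-ins used
only to show which fields each comparison looks at:

```c
#include <stdint.h>

/* Simplified stand-in for struct bpos. */
struct bpos_sketch {
	uint64_t	inode;
	uint64_t	offset;
	uint32_t	snapshot;
};

/* Three-way compare that cannot overflow: returns -1, 0 or 1. */
static int cmp_u64(uint64_t l, uint64_t r)
{
	return (l > r) - (l < r);
}

/* Filesystem-level comparison: the snapshot field is ignored. */
static int bkey_cmp_sketch(struct bpos_sketch l, struct bpos_sketch r)
{
	int c = cmp_u64(l.inode, r.inode);

	return c ? c : cmp_u64(l.offset, r.offset);
}

/* Core btree comparison: the snapshot field is the final tiebreaker. */
static int bpos_cmp_sketch(struct bpos_sketch l, struct bpos_sketch r)
{
	int c = bkey_cmp_sketch(l, r);

	return c ? c : cmp_u64(l.snapshot, r.snapshot);
}
```

Two positions that differ only in their snapshot field compare as equal
under the first function and as distinct under the second.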
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This is mkfs's job. Also, clean up the handling of feature bits some.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The bkey compat code wasn't being run for btree roots in the superblock
clean section - this patch fixes it to use the journal entry validate
code.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bcachefs has been aggressively migrating filesystems and btree nodes to
the new format for quite some time - this shouldn't affect anyone
anymore, and lets us delete a _lot_ of code. Also, it frees up
KEY_TYPE_discard for a new whiteout key type for snapshots.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This is useful for the filesystem dump debugging tool - when we're
hitting bugs we want to skip as much of the recovery process as
possible, and the dump tool only needs to know where metadata lives.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This is to generate strings for them, so that we can print them out.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Having a packed format that can represent a field larger than the
unpacked type breaks bkey_packed_successor() assertions - we need to fix
this to start using the snapshot field.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We dropped support for !BTREE_NODE_NEW_EXTENT_OVERWRITE but it turned
out there were people who still had filesystems with btree nodes in that
format in the wild. This adds a new compat feature that indicates we've
scanned for and rewritten nodes in the old format, and does that scan at
mount time if the option isn't set.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When snapshots arrive, we won't necessarily be able to arbitrarily split
existing extents - when we need to split an existing extent, we'll have to check
if the extent was overwritten in child snapshots and if so emit a
whiteout for the split in the child snapshot.
Because extents couldn't span btree nodes previously, journal replay
would sometimes have to split existing extents. That's no good anymore,
but fortunately since extent handling has already been lifted above most
of the btree code there's no real need for that rule anymore.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We're using BCH_FEATURE_alloc_v2 to also gate journalling updates to dev
usage - we don't have the code for reconstructing this from buckets
anymore, so we need to run fsck if it's not set.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This eliminates the need to scan every bucket to regenerate dev_usage at
mount time.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Originally, bcachefs - going back to bcache - stored, for each bucket, a
16 bit counter corresponding to how long it had been since the bucket
was read from. But, this required periodically rescaling counters on
every bucket to avoid wraparound. That wasn't an issue in bcache, where
we'd periodically rewrite the per bucket metadata all at once, but in
bcachefs we're trying to avoid having to walk every single bucket.
This patch switches to persisting 64 bit io clocks, corresponding to the
64 bit bucket timestamps introduced in the previous patch with
KEY_TYPE_alloc_v2.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When we walk the btrees during recovery, part of that is checking that
btree topology is correct: for every interior btree node, its child
nodes should exactly span the range the parent node covers.
Previously, we had checks for this, but not repair code. Now that we
have the ability to do btree updates during initial GC, this patch adds
that repair code.
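
The invariant being checked can be sketched like this, assuming keys are
plain integers and each node covers an inclusive range. struct
node_range and children_span_parent() are hypothetical names for the
example, and only the check is shown, not the repair:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* A node covers the inclusive key range [min_key, max_key]. */
struct node_range {
	uint64_t	min_key;
	uint64_t	max_key;
};

/*
 * Returns true if the children, in order, exactly tile the parent's
 * range: the first starts at the parent's min, each one starts right
 * after the previous one ends, and the last ends at the parent's max.
 */
static bool children_span_parent(struct node_range parent,
				 const struct node_range *child, size_t nr)
{
	uint64_t expected_min = parent.min_key;

	if (!nr)
		return false;

	for (size_t i = 0; i < nr; i++) {
		if (child[i].min_key != expected_min ||
		    child[i].max_key < child[i].min_key)
			return false;

		if (i + 1 < nr) {
			/* A non-final child must end before the parent does. */
			if (child[i].max_key >= parent.max_key)
				return false;

			expected_min = child[i].max_key + 1;
		}
	}

	return child[nr - 1].max_key == parent.max_key;
}
```

A gap or an overlap between siblings both show up as a child whose
min_key is not the expected one.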
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Some errors may need to be fixed in order for GC to successfully run -
walk and mark all metadata. But we can't start the allocators and do
normal btree updates until after GC has completed, and allocation
information is known to be consistent, so we need a different method of
doing btree updates.
Fortunately, we already have code for walking the btree while overlaying
keys from the journal to be replayed. This patch adds an update path
that adds keys to the list of keys to be replayed by journal replay, and
also fixes up iterators.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This was useful before we had transactional updates to interior btree
nodes - but now, it's just extra unneeded complexity.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This fixes a bug where mark and sweep gc incorrectly was clearing out
the stripes heap and causing assertions to fire later - simpler to just
create the stripes heap after gc has finished.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_btree_and_journal_walk() walks the btree overlaying keys from the
journal; it was introduced so that we could read in the alloc btree
prior to journal replay being done, when journalling of updates to
interior btree nodes was introduced.
But it didn't have btree node prefetching, which introduced a severe
regression with mount times, particularly on spinning rust. This patch
implements btree node prefetching for the btree + journal walk,
hopefully fixing that.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Alloc info isn't stored on a particular device, it makes no sense to
only be writing it out for rw members - this was causing fsck to not fix
alloc info errors, oops.
Also, make sure we write out alloc info in other repair paths.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
With various newer key types - stripe keys, inline data extents - the
old approach of calculating the maximum size of the value is becoming
more and more error prone. Better to switch to bkey_on_stack, which can
dynamically allocate if necessary to handle any size bkey.
In particular we also want to get rid of BKEY_EXTENT_VAL_U64s_MAX.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This is needed to fix a bug where we're overflowing iterators within a
btree transaction, because we're updating the stripes btree (to update
block counts) and the stripes btree trigger is unnecessarily updating
the alloc btree - it doesn't need to update the alloc btree when the
pointers within a stripe aren't changing.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This patch adds a flag to journal entries which, if set, indicates that
they weren't done as flush/fua writes.
- non flush/fua journal writes don't update last_seq (i.e. they don't
free up space in the journal), thus the journal free space
calculations now check whether nonflush journal writes are currently
allowed (i.e. are we low on free space, or would doing a flush write
free up a lot of space in the journal)
- write_delay_ms, the user configurable option for when open journal
entries are automatically written, is now interpreted as the max
delay between flush journal writes (default 1 second).
- bch2_journal_flush_seq_async is changed to ensure a flush write >=
the requested sequence number has happened
- journal read/replay must now ignore, and blacklist, any journal
entries newer than the most recent flush entry in the journal. Also,
the way the read_entire_journal option is handled has been improved;
struct journal_replay now has an entry, 'ignore', for entries that
were read but should not be used.
- assorted refactoring and improvements related to journal read in
journal_io.c and recovery.c
Previously, we'd have to issue a flush/fua write every time we
accumulated a full journal entry - typically the bucket size. Now we
need to issue them much less frequently: when an fsync is requested, or
it's been more than write_delay_ms since the last flush, or when we need
to free up space in the journal. This is a significant performance
improvement on many write heavy workloads.
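
A small sketch of the two rules above, under stated assumptions: the
struct, the function names, and the choice to ignore every entry when
no flush entry exists at all are illustrative and not taken from the
patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of a journal entry found at read time. */
struct jentry_sketch {
	uint64_t	seq;
	bool		noflush;	/* written without flush/fua */
	bool		ignore;		/* read, but must not be replayed */
};

/*
 * Write side: a flush write is needed on fsync, when we are low on
 * journal space, or when more than write_delay_ms has passed since the
 * last flush write.
 */
static bool journal_write_needs_flush(bool fsync_requested, bool low_on_space,
				      uint64_t now_ms, uint64_t last_flush_ms,
				      uint64_t write_delay_ms)
{
	return fsync_requested ||
		low_on_space ||
		now_ms - last_flush_ms > write_delay_ms;
}

/*
 * Read side: entries must be sorted by ascending seq. Everything newer
 * than the most recent flush entry is marked ignored. Returns the
 * number of ignored entries. Assumption for this sketch: with no flush
 * entry at all, nothing is known to be durable, so all are ignored.
 */
static size_t ignore_after_last_flush(struct jentry_sketch *e, size_t nr)
{
	size_t first_ignored = 0, ignored = 0;

	for (size_t i = nr; i-- > 0;)
		if (!e[i].noflush) {
			first_ignored = i + 1;
			break;
		}

	for (size_t i = first_ignored; i < nr; i++) {
		e[i].ignore = true;
		ignored++;
	}

	return ignored;
}
```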
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This patch increases the maximum journal buffers in flight from 2 to 4 -
this will be particularly helpful when in the future we stop requiring
flush+fua for every journal write.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Since we now always preallocate the maximum number of iterators when we
initialize a btree transaction, getting an iterator never fails - we can
delete a fair amount of error path code.
This patch also simplifies the iterator allocation code a bit.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Introducing the journal+btree iter introduced a regression where we
stopped using BTREE_ITER_PREFETCH - this is a performance regression on
rotating disks.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We can't run journal reclaim until we've finished replaying updates to
interior btree nodes - the check for this was in the wrong place though,
leading to journal reclaim spinning before it was allowed to proceed.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
extent_replay_key dates from before putting iterators was required -
fixed.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Previous varint implementation used by the inode code was not nearly as
fast as it could have been; partly because it was attempting to encode
integers up to 96 bits (for timestamps) but this meant that encoding and
decoding the length required a table lookup.
Instead, we'll just encode timestamps greater than 64 bits as two
separate varints; this will make decoding/encoding of inodes
significantly faster overall.
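
To illustrate why capping values at 64 bits avoids a table lookup, here
is one possible length-prefixed varint scheme. It is a sketch written
for this example and should not be read as the exact on-disk encoding
used by the patch: the length is stored as a run of set low bits in the
first byte, so decoding only has to count them.

```c
#include <stdint.h>

/* Number of significant bits in v, treating 0 as needing one bit. */
static unsigned bits_needed(uint64_t v)
{
	unsigned bits = 1;

	while (v >>= 1)
		bits++;

	return bits;
}

/*
 * Encode v into out (which must have room for 9 bytes); returns the
 * number of bytes written. For lengths 1..8 the first byte has
 * (length - 1) low bits set followed by a zero bit, then the value.
 * Values needing more than 56 bits use 0xff plus 8 raw bytes.
 */
static int varint_encode_sketch(uint8_t *out, uint64_t v)
{
	unsigned bytes = (bits_needed(v) + 6) / 7;

	if (bytes <= 8) {
		v <<= bytes;
		v |= ((uint64_t) 1 << (bytes - 1)) - 1;

		for (unsigned i = 0; i < bytes; i++)
			out[i] = (uint8_t) (v >> (8 * i));

		return (int) bytes;
	}

	out[0] = 0xff;
	for (unsigned i = 0; i < 8; i++)
		out[1 + i] = (uint8_t) (v >> (8 * i));

	return 9;
}

/* Decode from in into *v; returns the number of bytes consumed. */
static int varint_decode_sketch(const uint8_t *in, uint64_t *v)
{
	unsigned bytes = 1;
	uint64_t r = 0;

	/* The length is one more than the count of trailing set bits. */
	while (bytes < 9 && (in[0] & (1u << (bytes - 1))))
		bytes++;

	if (bytes <= 8) {
		for (unsigned i = 0; i < bytes; i++)
			r |= (uint64_t) in[i] << (8 * i);
		r >>= bytes;
	} else {
		for (unsigned i = 0; i < 8; i++)
			r |= (uint64_t) in[1 + i] << (8 * i);
	}

	*v = r;
	return (int) bytes;
}
```

In a scheme like this, a timestamp wider than 64 bits would be stored as
two such varints back to back rather than as one wider integer.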
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This fixes a bug where we'd pop an assertion due to replaying a key for
an interior btree node when that node no longer exists.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Now that we've got transactional alloc info updates (and have for
awhile), we don't need to write it out on shutdown, and we don't need to
write it out on startup except when GC found errors - this is a big
improvement to mount/unmount performance.
This patch also fixes a few bugs where we weren't writing out alloc
info (on new filesystems, and new devices) and should have been.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Previously, we would start doing btree updates before writing the first
journal entry; if this was after an unclean shutdown, this could cause
those btree updates to not be blacklisted.
Also, move some code to headers for userspace debug tools.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
There is a bug where we can end up clearing the data_has field in the
superblock members section, which causes us to skip reading the journal
and thus journal replay fails. This option tells the recovery path to
not trust those fields.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This is prep work for reworking the triggers machinery - we have
triggers that need to know both the old and the new key.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
To be used by the debug tool that dumps the contents of the journal.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Before we were setting features after allocating btree nodes, which
meant we were using the old btree pointer format.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We now update the alloc info (bucket sector counts) atomically with
journalling the update to the interior btree nodes, and we also set new
btree roots atomically with the journalled part of the btree update.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When updates to interior nodes started being journalled, that meant that
after an unclean shutdown, until journal replay is done we can't walk
the btree without overlaying the updates from the journal.
The initial btree gc was changed to walk the btree overlaying keys from
the journal - but bch2_alloc_read() and bch2_stripes_read() were missed.
Major whoops...
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Extent btrees no longer have weird special behaviour for min_key.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This will be used by the userspace debug tools.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This slightly modifies the journal replay code so that it can replay
updates to interior nodes.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This is needed so that users can roll back to before "d9bb516b2d
bcachefs: Move extent overwrite handling out of core btree code", which
it appears may still be buggy.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Ever since the btree code was first written, handling of overwriting
existing extents - including partially overwriting and splitting existing
extents - was handled as part of the core btree insert path. The modern
transaction and iterator infrastructure didn't exist then, so that was
the only way for it to be done.
This patch moves that outside of the core btree code to a pass that runs
at transaction commit time.
This is a significant simplification to the btree code and overall
reduction in code size, but more importantly it gets us much closer to
the core btree code being completely independent of extents and is
important prep work for snapshots.
This introduces a new feature bit; the old and new extent update models
are incompatible when the filesystem needs journal replay.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
These are created by the new extent update path, but not used yet by the
recovery code and they break the existing recovery code, so we can just
skip them.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This fixes a bug where we end up spinning in journal replay - in theory
this shouldn't be necessary though, transaction reset should be
re-traversing all iterators.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
BCH_FEATURE_btree_ptr_v2 wasn't getting set on new filesystems, oops
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add a new btree ptr type which contains the sequence number (random 64
bit cookie, actually) for that btree node - this lets us verify that
when we read in a btree node it really is the btree node we wanted.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Introduce a new iterator that iterates over keys in the btree with keys
from the journal overlaid on top. This factors out what the erasure
coding init code was doing manually.
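
The overlay can be pictured as a merge of two sorted key lists where the
journal wins on ties. The sketch below assumes flat sorted arrays and
hypothetical names (struct kv, struct overlay_iter, overlay_next());
the real iterator walks btree nodes rather than arrays.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct kv {
	uint64_t	pos;
	uint64_t	val;
};

/* Both arrays must be sorted by ascending pos, without duplicates. */
struct overlay_iter {
	const struct kv	*btree;
	size_t		btree_nr;
	size_t		b;		/* next unread btree key */
	const struct kv	*journal;
	size_t		journal_nr;
	size_t		j;		/* next unread journal key */
};

/*
 * Returns the next key in sorted order, or NULL at the end. When both
 * sources have a key at the same position, the journal key is returned
 * and the older btree key is skipped.
 */
static const struct kv *overlay_next(struct overlay_iter *it)
{
	bool have_b = it->b < it->btree_nr;
	bool have_j = it->j < it->journal_nr;

	if (!have_b && !have_j)
		return NULL;

	if (have_j &&
	    (!have_b || it->journal[it->j].pos <= it->btree[it->b].pos)) {
		if (have_b && it->btree[it->b].pos == it->journal[it->j].pos)
			it->b++;	/* journal key shadows the btree key */

		return &it->journal[it->j++];
	}

	return &it->btree[it->b++];
}
```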
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The trigger flags really belong with individual btree_insert_entries,
not the transaction commit flags - this splits out those flags and
unifies them with the BCH_BUCKET_MARK flags. Todo - split out
btree_trigger.c from buckets.c
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>