This fixes a bug where mark and sweep gc incorrectly was clearing out
the stripes heap and causing assertions to fire later - simpler to just
create the stripes heap after gc has finished.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Alloc info isn't stored on a particular device, it makes no sense to
only be writing it out for rw members - this was causing fsck to not fix
alloc info errors, oops.
Also, make sure we write out alloc info in other repair paths.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
With various newer key types - stripe keys, inline data extents - the
old approach of calculating the maximum size of the value is becoming
more and more error prone. Better to switch to bkey_on_stack, which can
dynamically allocate if necessary to handle any size bkey.
In particular we also want to get rid of BKEY_EXTENT_VAL_U64s_MAX.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Previously, we were using BTREE_INSERT_RESERVE in a lot of places where
it no longer makes sense.
- we now have more open_buckets than we used to, and the reserves work
better, so we shouldn't need to use BTREE_INSERT_RESERVE just because
we're holding open_buckets pinned anymore.
- We have the btree key cache for updates to the alloc btree, meaning
we no longer need the btree reserve to ensure the allocator can make
forward progress.
This means that we should only need a reserve for btree updates to
ensure that copygc can make forward progress.
Since it's now just for copygc, we can also fold RESERVE_BTREE into
RESERVE_MOVINGGC (the allocator's freelist reserve).
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Various filesystem usage counters are kept in percpu counters, with one
set per in flight journal buffer. Right now all the code that deals with
it assumes that there's only two buffers/sets of counters, but the
number of journal bufs is getting increased to 4 in the next patch - so
refactor that code to not assume a constant.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
It's not used much anymore, the module paramter interface is better.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Now that we've got transactional alloc info updates (and have for
awhile), we don't need to write it out on shutdown, and we don't need to
write it out on startup except when GC found errors - this is a big
improvement to mount/unmount performance.
This patch also fixes a few bugs where we weren't writing out alloc
info (on new filesystems, and new devices) and should have been.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Awhile back, gcing of stale pointers was split out from full
mark-and-sweep gc - but, the bit to actually drop those stale pointers
wasn't implemnted. Whoops.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Awhile back the mechanism for garbage collecting unused replicas entries
was significantly improved, but some cleanup was missed - this patch
does that now.
This is also prep work for a patch to account for erasure coded parity
blocks separately - we need to consolidate the logic for
checking/marking the various replicas entries from one bkey into a
single function.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Soon we'll be able to modify existing stripes - replacing empty blocks
with new blocks and new p/q blocks. This patch updates the trigger code
to handle pointers changing in an existing stripe; also, it
significantly improves how the stripes heap works, which means we can
get rid of the stripe creation/deletion lock.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Full mark and sweep gc doesn't (yet?) work with the new btree key cache
code, but it also blocks updates to interior btree nodes for the
duration and isn't really necessary in practice; we aren't currently
attempting to repair errors in allocation info at runtime.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We now update the alloc info (bucket sector counts) atomically with
journalling the update to the interior btree nodes, and we also set new
btree roots atomically with the journalled part of the btree update.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Not legal to block on a journal prereservation with btree locks held.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When initial btree gc was changed to overlay journal keys as it walks
the btree, it also stopped checking btree topology.
Previously, checking btree topology was a fairly complicated affair -
but it's much easier now that btree_ptr_v2 has min_key in the pointer.
This rewrites the old range_checks code and uses it in both runtime and
initial gc.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Previously, BTREE_ID_INODES was special - inodes were indexed by the
inode field, which meant the offset field of struct bpos wasn't used,
which led to special cases in e.g. the btree iterator code.
Now, inodes in the inodes btree are indexed by the offset field.
Also: prevously min_key was special for extents btrees, min_key for
extents would equal max_key for the previous node. Now, min_key =
bkey_successor() of the previous node, same as non extent btrees.
This means we can completely get rid of
btree_type_sucessor/predecessor.
Also make some improvements to the metadata IO validate/compat code.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Ever since the btree code was first written, handling of overwriting
existing extents - including partially overwriting and splittin existing
extents - was handled as part of the core btree insert path. The modern
transaction and iterator infrastructure didn't exist then, so that was
the only way for it to be done.
This patch moves that outside of the core btree code to a pass that runs
at transaction commit time.
This is a significant simplification to the btree code and overall
reduction in code size, but more importantly it gets us much closer to
the core btree code being completely independent of extents and is
important prep work for snapshots.
This introduces a new feature bit; the old and new extent update models
are incompatible when the filesystem needs journal replay.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The trigger flags really belong with individual btree_insert_entries,
not the transaction commit flags - this splits out those flags and
unifies them with the BCH_BUCKET_MARK flags. Todo - split out
btree_trigger.c from buckets.c
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
For upcoming inline data extents, we're going to need to be able to
shorten the value of existing bkeys in the btree - and to make that work
we're going to be able to need to pad out the space the value previously
took up with something.
This patch changes the various code that iterates over bkeys to handle
k->u64s == 0 as meaning "skip the next 8 bytes".
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We have to free the old (in memory) btree node _before_ unlocking the
new nodes - else, some other thread with a read lock on the old node
could see stale data after another thread has already updated the new
node.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Running the filesystem under valgrind exposed a path where the max_stale
variable in bch2_gc_btree() might not be initialized before use in a
rare case when there are no btree nodes in a transaction.
Signed-off-by: Justin Husted <sigstop@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Importantly, we don't want to use bch2_fs_inconsistent_on() for errors
that fsck can repair, becuase that will just put us in RO mode and
prevent fsck from actually fixing stuff. Probably want to get rid of it
in the future.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Major simplification - gets rid of the need for marking buckets as
dirty, instead we write buckets if the in memory mark is different from
what's in the btree.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This is prep work for the btree key cache: btree iterators will point to
either struct btree, or a new struct bkey_cached.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
this lets us get rid of a lot of extra switch statements - in a lot of
places we dispatch on the btree node type, and then the key type, so
this is a nice cleanup across a lot of code.
Also improve the on disk format versioning stuff.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Hit an assertion, probably spurious, indicating an iterator was unlocked
when it shouldn't have been (spurious because it wasn't locked at all
when the caller called btree_insert_at()).
Add a flag, BTREE_ITER_NOUNLOCK, and tighten up the assertions
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This means we can now use gc to verify the allocation information -
important for testing persistant alloc info
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_alloc_sectors_start() was a nightmare to work with - it's got some
tricky stuff to do, since it wants to use the buckets the writepoint
already has, unless they're not in the target it wants to write to,
unless it can't allocate from any other devices in which case it will
use those buckets if it has to - et cetera.
This restructures the code to start with a new empty list of open
buckets we're going to use for the new allocation, pulling buckets from
the write point's list as we decide that we really are going to use
them - making the code somewhat more functional and drastically easier
to understand.
Also fixes a bug where we could end up waiting on c->freelist_wait
(because allocating from one device failed) but return success from
bch2_bucket_alloc(), because allocating from a different device
succeeded.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This lifts the restriction that 0 size extents must not overlap with
other extents, which means we can now sort extents and non extents the
same way, and will let us simplify a bunch of other stuff as well.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Initially forked from drivers/md/bcache, bcachefs is a new copy-on-write
filesystem with every feature you could possibly want.
Website: https://bcachefs.org
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>