The trigger flags really belong with individual btree_insert_entries,
not the transaction commit flags - this splits out those flags and
unifies them with the BCH_BUCKET_MARK flags. Todo - split out
btree_trigger.c from buckets.c
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This should be private to btree_update_leaf.c, and we might end up
removing it.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Return an error instead (still work in progress...)
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The main optimization here is that if we let
bch2_replicas_delta_list_apply() fail, we can completely skip calling
bch2_bkey_replicas_marked_locked().
And assorted other small optimizations.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This lets us avoid a cache miss in the write path.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Disk space accounting for erasure coding + compression was completely
broken - we need to calculate the parity sectors delta the same way we
calculate disk_sectors, by calculating the old and new usage and
subtracting to get the difference.
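
A minimal sketch of that delta-style accounting, assuming invented types and
an illustrative parity formula - none of this is the actual bcachefs code:

    #include <stdio.h>

    /* Illustrative only - not the real bcachefs usage types. */
    struct usage_sketch {
        long long data_sectors;    /* compressed sectors taken by the data */
        long long parity_sectors;  /* share of parity attributed to the extent */
    };

    /* Hypothetical helper: usage implied by one version of an extent. */
    static struct usage_sketch extent_usage(long long compressed_sectors,
                                            unsigned nr_data, unsigned nr_parity)
    {
        return (struct usage_sketch) {
            .data_sectors   = compressed_sectors,
            .parity_sectors = compressed_sectors * nr_parity / nr_data,
        };
    }

    int main(void)
    {
        /* Old version: 64 compressed sectors in a 6+2 stripe. */
        struct usage_sketch old_u = extent_usage(64, 6, 2);
        /* New version after an overwrite: 32 compressed sectors, same stripe. */
        struct usage_sketch new_u = extent_usage(32, 6, 2);

        /* Both deltas come from "new usage minus old usage", never incrementally. */
        printf("data delta:   %lld\n", new_u.data_sectors   - old_u.data_sectors);
        printf("parity delta: %lld\n", new_u.parity_sectors - old_u.parity_sectors);
        return 0;
    }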
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This makes the disk accounting code saner, and it's not clear why we'd
ever want the same data to be in multiple stripes simultaneously.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If an extent only contains cached or erasure coded pointers, there
won't be any devices in the normal dirty replicas list or an entry to
update.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The btree_trans struct needs to memoize/cache btree iterators, so that
on transaction restart we don't have to completely redo btree lookups,
and so that we can do them all at once in the correct order when the
transaction had to restart to avoid a deadlock.
This switches the btree iterator lookups to work based on iterator
position, instead of trying to match them up based on the stack trace.
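
A rough sketch of position-based memoization, with invented types and a
fixed-size table standing in for whatever btree_trans actually keeps:

    #include <string.h>

    enum sketch_btree_id { SKETCH_BTREE_EXTENTS, SKETCH_BTREE_INODES };

    struct sketch_pos   { unsigned long long inode, offset; };
    struct sketch_iter  { enum sketch_btree_id btree; struct sketch_pos pos; int in_use; };
    struct sketch_trans { struct sketch_iter iters[8]; };

    /*
     * Look the iterator up by (btree, position) rather than by call site: on
     * transaction restart the same lookup returns the cached iterator, and the
     * cached set can be re-traversed in order to avoid deadlocks.
     */
    static struct sketch_iter *sketch_trans_get_iter(struct sketch_trans *trans,
                                                     enum sketch_btree_id btree,
                                                     struct sketch_pos pos)
    {
        for (unsigned i = 0; i < 8; i++) {
            struct sketch_iter *it = &trans->iters[i];

            if (it->in_use && it->btree == btree &&
                !memcmp(&it->pos, &pos, sizeof(pos)))
                return it;    /* memoized: reuse the existing iterator */
        }

        for (unsigned i = 0; i < 8; i++) {
            struct sketch_iter *it = &trans->iters[i];

            if (!it->in_use) {
                *it = (struct sketch_iter) { .btree = btree, .pos = pos, .in_use = 1 };
                return it;    /* newly allocated slot */
            }
        }

        return NULL;    /* table full - real code would grow or restart */
    }

    int main(void)
    {
        struct sketch_trans trans = { 0 };
        struct sketch_pos pos = { .inode = 42, .offset = 0 };

        struct sketch_iter *a = sketch_trans_get_iter(&trans, SKETCH_BTREE_EXTENTS, pos);
        struct sketch_iter *b = sketch_trans_get_iter(&trans, SKETCH_BTREE_EXTENTS, pos);

        /* Same btree, same position: the second lookup reuses the first iterator. */
        return a == b ? 0 : 1;
    }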
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Move extents instead of copying them - this way, we can iterate over
only live extents, not the entire keyspace. Also, this means we can
mostly skip running triggers.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_mark_update() was correct, but bch2_trans_mark_update() wasn't
respecting BTREE_INSERT_NOMARK_OVERWRITES - key marking/triggers really
need to be cleaned up.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Importantly, we don't want to use bch2_fs_inconsistent_on() for errors
that fsck can repair, because that will just put us in RO mode and
prevent fsck from actually fixing stuff. Probably want to get rid of it
in the future.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The continue statement in bch2_trans_mark_extent() was wrong - by
bailing out early, we'd be constructing the wrong replicas list to
update. Also, the assertion in update_replicas() was wrong - due to
rounding with compressed extents, it is possible for sectors to be 0
sometimes.
Also, change extent_to_replicas() in replicas.c to match the replicas
list we construct in buckets.c.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Major simplification - gets rid of the need for marking buckets as
dirty; instead, we write out buckets if the in-memory mark differs from
what's in the btree.
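
The comparison-driven write-out can be sketched like this (the field names
are invented; the real bucket mark carries more state):

    #include <stdbool.h>

    /* Invented stand-in for a bucket's allocation information. */
    struct mark_sketch {
        unsigned char  gen;
        unsigned short dirty_sectors;
        unsigned short cached_sectors;
    };

    /*
     * No per-bucket dirty bit: whether a bucket needs to be written back to
     * the alloc btree is derived by comparing the in-memory mark with the
     * key currently stored in the btree.
     */
    static bool bucket_needs_write(struct mark_sketch in_memory,
                                   struct mark_sketch in_btree)
    {
        return in_memory.gen            != in_btree.gen ||
               in_memory.dirty_sectors  != in_btree.dirty_sectors ||
               in_memory.cached_sectors != in_btree.cached_sectors;
    }

    int main(void)
    {
        struct mark_sketch mem   = { .gen = 3, .dirty_sectors = 100 };
        struct mark_sketch btree = { .gen = 3, .dirty_sectors =  80 };

        return bucket_needs_write(mem, btree) ? 0 : 1;  /* differs, so write it out */
    }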
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This fixes a bug in the journal replay -> extent_replay_key ->
split_compressed path, when we do an update that changes alloc info but
the alloc info in the btree isn't up to date yet.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
More prep work for reflink: for extents, we're not looking for an exact
match on pos, but rather checking that pos falls within the range of the
key the iterator points to.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
mark_lock is a frequently taken lock, and there's also potential for
deadlocks, since bch2_clear_page_bits, which is called from memory
reclaim, currently has to take it to drop disk reservations.
The disk reservation get path takes it when it recalculates the number
of sectors known to be available, but it's not really needed for
consistency. We just want to make sure we only have one thread updating
the sectors_available count, which we can do with a dedicated mutex.
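
A userspace-flavoured sketch of the idea, using pthreads instead of kernel
primitives and invented helper names:

    #include <pthread.h>
    #include <stdatomic.h>

    static atomic_llong sectors_available;
    static pthread_mutex_t sectors_available_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Stand-in for walking per-device counters to recount free space. */
    static long long recount_free_sectors(void)
    {
        return 1 << 20;
    }

    static void recalc_sectors_available(void)
    {
        /*
         * Only one thread recomputes sectors_available at a time; nothing
         * here needs the heavily contended mark_lock, and readers of the
         * counter never block on it.
         */
        pthread_mutex_lock(&sectors_available_lock);
        atomic_store(&sectors_available, recount_free_sectors());
        pthread_mutex_unlock(&sectors_available_lock);
    }

    int main(void)
    {
        recalc_sectors_available();
        return atomic_load(&sectors_available) > 0 ? 0 : 1;
    }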
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
- Does not persist alloc info for stripes yet
- Also does not yet include filesystem block/sector counts from
  struct fs_usage
- Not made use of just yet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This lets us get rid of a lot of extra switch statements - in a lot of
places we dispatch on the btree node type, and then the key type, so
this is a nice cleanup across a lot of code.
Also improve the on disk format versioning stuff.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This means we can now use gc to verify the allocation information -
important for testing persistent alloc info
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
It's now possible to create and use a filesystem on a 512k device with
4k buckets (though at that size we still waste almost half to internal
reserves)
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Due to compression, the different replicas of a replicated extent don't
necessarily have to take up the same amount of space - so replicated
data sector counts shouldn't be stored divided by the number of
replicas.
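
A worked example with made-up numbers: with compression, the two copies of
the same logical 16-sector extent can occupy different amounts of space on
disk, so "total divided by nr_replicas" matches neither copy.

    #include <stdio.h>

    int main(void)
    {
        unsigned replica_sectors[] = { 8, 16 };  /* compressed copy, uncompressed copy */
        unsigned nr_replicas = 2, total = 0;

        for (unsigned i = 0; i < nr_replicas; i++)
            total += replica_sectors[i];

        printf("total on-disk sectors: %u\n", total);               /* 24: what we want to store */
        printf("total / nr_replicas:   %u\n", total / nr_replicas); /* 12: matches neither replica */
        return 0;
    }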
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Prep work for persistent alloc information. Refactoring also lets us
make free_inc much smaller, which means a lot fewer buckets stranded on
freelists.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
journal_buf_switch is called from the foreground when getting a journal
reservation and thus is somewhat latency sensitive;
bch2_bucket_seq_cleanup has to run infrequently but is a bit expensive
when it does run.
Call it from the journal write path instead, and punt the journal write
to workqueue context.
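
The shape of the change, sketched with a plain thread standing in for the
kernel workqueue - all names here are invented for the example:

    #include <pthread.h>
    #include <stdio.h>

    static void bucket_seq_cleanup_sketch(void)
    {
        /* infrequent but comparatively expensive maintenance */
        puts("cleaning up bucket sequence numbers");
    }

    static void journal_res_get_sketch(void)
    {
        /* foreground, latency sensitive: no cleanup call here any more */
        puts("got journal reservation");
    }

    static void *journal_write_worker(void *arg)
    {
        (void)arg;
        /* deferred context: do the cleanup, then the actual write */
        bucket_seq_cleanup_sketch();
        puts("writing journal buffer");
        return NULL;
    }

    int main(void)
    {
        pthread_t worker;

        journal_res_get_sketch();               /* fast path stays fast */
        pthread_create(&worker, NULL, journal_write_worker, NULL);
        pthread_join(&worker, NULL);            /* the write is punted off the foreground */
        return 0;
    }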
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Initially forked from drivers/md/bcache, bcachefs is a new copy-on-write
filesystem with every feature you could possibly want.
Website: https://bcachefs.org
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>