After commit 230e9fc286 ("slab: add SLAB_ACCOUNT flag"), we need to mark
the inode cache as SLAB_ACCOUNT, similar to commit 5d097056c9 ("kmemcg:
account for certain kmem allocations to memcg")
Signed-off-by: Youling Tang <tangyouling@kylinos.cn>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We don't have a way to flush a timer that's executing the callback, and
this is simple and limited enough in scope that we can just use the lock
instead.
Needed for the next patch that adds direct wakeups from the allocator to
copygc, where we're now more frequently calling io_timer_del() on an
expiring timer.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
fragmentation_lru derives from dirty_sectors, and wasn't being checked.
Co-developed-by: Daniel Hill <daniel@gluo.nz>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We need to make sure we're not missing any fragmenation entries in the
LRU BTREE after repairing ALLOC BTREE
Also, use the new bch2_btree_write_buffer_maybe_flush() helper; this was
only working without it before since bucket invalidation (usually)
wasn't happening while fsck was running.
Co-developed-by: Daniel Hill <daniel@gluo.nz>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add a new helper for checking references to write buffer btrees, where
we need a flush before we definitively know we have an inconsistency.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
BCH_READ_NODECODE mode - used by the move paths - really wants to use
only the original rbio, but the retry path really wants to clone - oof.
Make sure to copy the crc of the pointer we read from back to the
original rbio, or we'll see spurious checksum errors later.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
On a new filesystem or device we have to allocate the journal with a
bump allocator, because allocation info isn't ready yet - but when
hot-adding a device that doesn't have a journal, we don't want to use
that path.
Reported-by: syzbot+24a867cb90d8315cccff@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
the unlock is now in read_extent, this fixes an assertion pop in
read_from_stale_dirty_pointer()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
There's no reason for discards to be single threaded across all devices;
this will improve performance on multi device setups.
Additionally, making them per-device simplifies the refcounting on
bch_dev->io_ref; we now hold it for the duration that the discard path
is running, which fixes a race between the discard path and device
removal.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This series fix the shift-out-of-bounds issue in
bch2_blacklist_entries_gc().
Instead of passing 0 to eytzinger0_first() when iterating the entries,
we explicitly check 0 and initialize i to be 0.
syzbot has tested the proposed patch and the reproducer did not trigger
any issue:
Reported-and-tested-by: syzbot+835d255ad6bc7f29ee12@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=835d255ad6bc7f29ee12
Signed-off-by: Pei Li <peili.dev@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Drop the heap-related macros from bcachefs and replacing them with the
generic min_heap implementation from include/linux. By doing so, code
readability is improved by using functions instead of macros. Moreover,
the min_heap implementation in include/linux adopts a bottom-up variation
compared to the textbook version currently used in bcachefs. This
bottom-up variation allows for approximately 50% reduction in the number
of comparison operations during heap siftdown, without changing the number
of swaps, thus making it more efficient.
[visitorckw@gmail.com: fix missing assignment of minimum element]
Link: https://lkml.kernel.org/r/20240602174828.1955320-1-visitorckw@gmail.com
Link: https://lkml.kernel.org/ioyfizrzq7w7mjrqcadtzsfgpuntowtjdw5pgn4qhvsdp4mqqg@nrlek5vmisbu
Link: https://lkml.kernel.org/r/20240524152958.919343-17-visitorckw@gmail.com
Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
Reviewed-by: Ian Rogers <irogers@google.com>
Acked-by: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Bagas Sanjaya <bagasdotme@gmail.com>
Cc: Brian Foster <bfoster@redhat.com>
Cc: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw>
Cc: Coly Li <colyli@suse.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Sakai <msakai@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This fixes a rare deadlock when we're doing an emergency shutdown due to
failure to do a journal write.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The debug code relies on btree_trans_list being ordered so that it can
resume on subsequent calls or lock restarts.
However, it was using trans->locknig_wait.task.pid, which is incorrect
since btree_trans objects are cached and reused - typically by different
tasks.
Fix this by switching to pointer order, and also sort them lazily when
required - speeding up the btree_trans_get() fastpath.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
debug.c was using closure_get() on a different thread's closure where
the we don't know if the object being refcounted is alive.
We keep btree_trans objects on a list so they can be printed by debug
code, and because it is cost prohibitive to touch the btree_trans list
every time we allocate and free btree_trans objects, cached objects are
also on this list.
However, we do not want the debug code to see cached but not in use
btree_trans objects - critically because the btree_paths array will have
been freed (if it was reallocated).
closure_get() is also incorrect to use when that get may race with it
hitting zero, i.e. we must already have a ref on the object or know the
ref can't currently hit 0 for other reasons (as used in the cycle
detector).
to fix this, use the previously introduced closure_get_not_zero(),
closure_return_sync(), and closure_init_stack_release(); the debug code
now can only take a ref on a trans object if it's alive and in use.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
btree_deadlock_to_text() searches the list of btree transactions to find
a deadlock - when it finds one it's done; it's not like other *_read()
functions that's printing each object.
Factor out btree_deadlock_to_text() to make this clearer.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We were grabbing the sequence number before unlock incremented it - fix
this by moving the increment to seqmutex_lock() (so the seqmutex_relock()
failure path skips the mutex_trylock()), and returning the sequence
number from unlock(), to make the API simpler and safer.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
`inode->ei_flags` setting and cleaning should be done after initialization,
otherwise the operation is invalid.
Fixes: 9ca4853b98 ("bcachefs: Fix quota support for snapshots")
Signed-off-by: Youling Tang <tangyouling@kylinos.cn>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
write_super() may reallocate the superblock buffer - but
bch_sb_field_ext was referencing it; don't use it after the write_super
call.
Reported-by: syzbot+8992fc10a192067b8d8a@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
printk strings get truncated to 1024 bytes; if we have a long error
message (journal debug info) we need to use a helper.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
discard_new_inode() is the correct interface for tearing down an indoe
that was fully created but not made visible to other threads, but it
expects I_NEW to be set, which we don't use.
Reported-by: https://github.com/koverstreet/bcachefs/issues/690
Fixes: bcachefs: Fix race path in bch2_inode_insert()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Incorrect bucket state transition in the discard path; when incrementing
a bucket's generation number that had already been discarded, we were
forgetting to check if it should be need_gc_gens, not free.
This was caught by the .invalid checks in the transaction commit path,
causing us to go emergency read only.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
With CONFIG_READ_ONLY_THP_FOR_FS, the Linux kernel supports using THPs
for read-only mmapped files, such as shared libraries. However, the
kernel makes no attempt to actually align those mappings on 2MB
boundaries, which makes it impossible to use those THPs most of the
time. This issue applies to general file mapping THP as well as
existing setups using CONFIG_READ_ONLY_THP_FOR_FS. This is easily
fixed by using thp_get_unmapped_area for the unmapped_area function
in bcachefs, which is what ext2, ext4, fuse, xfs and btrfs all use.
Similar to commit b0c582233a ("btrfs: fix alignment of VMA for
memory mapped files on THP").
Signed-off-by: Youling Tang <tangyouling@kylinos.cn>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
i.e. the start of automatic self healing:
If errors=continue or fix_safe, we now automatically fix simple errors
without user intervention.
New error action option: fix_safe
This replaces the existing errors=ro option, which gets a new slot, i.e.
existing errors=ro users now get errors=fix_safe.
This is currently only enabled for a limited set of errors - initially
just disk accounting; errors we would never not want to fix, and we
don't want to require user intervention (i.e. to make sure a bug report
gets filed).
Errors will still be counted in the superblock, so we (developers) will
still know they've been occuring if a bug report gets filed (as bug
reports typically include the errors superblock section).
Eventually we'll be enabling this for a much wider set of errors, after
we've done thorough error injection testing.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
reference: https://github.com/koverstreet/bcachefs/issues/692
trans->ref is the reference used by the cycle detector, which walks
btree_trans objects of other threads to walk the graph of held locks and
issue wakeups when an abort is required.
We have to wait for the ref to go to 1 before freeing trans->paths or
clearing trans->locking_wait.task.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
LRUs only have 48 bits for the time field (i.e. LRU order); thus we need
overflow checks and guards.
Reported-by: syzbot+df3bf3f088dcaa728857@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We've been moving away from going RW lazily; if we want to go RW we do
that in set_may_go_rw(), and if we didn't go RW we don't need to delete
dead snapshots.
Reported-by: syzbot+4366624c0b5aac4906cf@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We shouln't be running the journal shutdown sequence if we never fully
initialized the journal.
Reported-by: syzbot+ffd2270f0bca3322ee00@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We can only handle btree IDs up to 62, since the btree id (plus the type
for interior btree nodes) has to fit ito a 64 bit bitmask - check for
invalid ones to avoid invalid shifts later.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We can't discard a bucket while it's still open; this needs the
bucket_is_open_safe() version, which takes the open_buckets lock.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We use 0 size arrays as markers, but ubsan doesn't know that - cast them
to a pointer to fix the splat.
Also, make sure this code gets tested a bit more.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
btree_iter_init() needs to happen before key_cache_init(), to initialize
btree_trans_barrier
Reported-by: syzbot+3cca837c2183f8f6fcaf@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
inode_init_always started setting the field to 0.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://lore.kernel.org/r/20240611120626.513952-5-mjguzik@gmail.com
Acked-by: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Christian Brauner <brauner@kernel.org>
The bucket_gens array and gc_buckets array known their own size; we
should be using those members, and returning an error.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
fsck_err() does a goto fsck_err on error; factor out check_fix_ptr() so
that our error label can drop our device ref.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We count objects as freed when we move them to the srcu-pending lists
because we're doing the equivalent of a kfree_srcu(); the only
difference is managing the pending list ourself means we can allocate
from the pending list.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Since the key cache shrinker walks the rhashtable, a mostly empty
rhashtable leads to really nasty reclaim performance issues.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
fsck.c always runs top of the stack so we're not too concerned here;
noinline_for_stack is sufficient
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
for forwards compat we now explicitly allow mounting and using
filesystems with unknown btrees, and we have to walk them for fsck.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
error handling here is slightly odd, which is why we were accidently
calling evict() on an error pointer
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Split the workqueues for btree read completions and btree write
submissions; we don't want concurrency control on btree read
completions, but we do want concurrency control on write submissions,
else blocking in submit_bio() will cause a ton of kworkers to be
allocated.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
in bch2_move_data_btree, we might start with the trans unlocked from a
previous loop iteration - we need a trans_begin() before iter_init().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This fixes an issue where setting a device to durability=0 after it's
been used makes it impossible to remove.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
- two downgrade fixes
- a couple snapshot deletion and repair fixes, thanks to noradtux for
finding these and providing the image to debug them
- a couple assert fixes
- convert to folio helper, from Matthew
- some improved error messages
- bit of code reorganization (just moving things around); doing this
while things are quiet so I'm not rebasing fixes past reorgs
- don't return -EROFS on inconsistency error in recovery, this confuses
util-linux and has it retry the mount
- fix failure to return error on misaligned dio write; reported as an
issue with coreutils shred
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmZYntUACgkQE6szbY3K
bnbp7hAAvMgBanBT7qq3ac+W3vtgLuIk6gXNB7eRl+QNff7bJ+BzJH4UhCGhbo5g
WzzQAQ2Zta6NwxbdAcZdL91qe4QDI3ITdIeKBZYtN/C8FySOeEk14K+CNhfQjYgd
fJP2bx4LuUnyMri1pw8ZF3L/YXMOKhzTF8jLH04etty8Sbxss+zh9Dz6LFXqvloq
3v0EmbzrgB3KH+zflJ+yxTFUO3/tNYJhZHGXD452AlJYs29bECAAzJ/5gUq43CqQ
/q+omBqqqf7oJZ84dHIu2piZrUhUJqotLdcIkzlkxDg+hN/BPeY4hv+dw5GNffz7
hgD6ieWm+0PQrf2WSBGRy7l3DglrwknUgrFSb8PlUAbOsg0TNsN7qjW6LVZSWMZ/
tBWiUQ95VYtlP8KzwLrIZ+BcP/Jm0X5hIAxui0Diz+exh7onDiY7Gxsp8/r0krYI
x0s7uLhl73Jb/TO3pX9BS6U+Y0bUu0GJb+TThOLNX961Vg900BmpZvLave6y3U0i
E09JRetWGK50wgPPvNt7M+s8lhs0Jg+Q+AuHAUd3x8eb1NSMibAvYGzV4oVpElrT
YAP7vrJSgVdCCpI6qqCt+SgxatNUCSa/sHraJz2XeVGFyE6iLlXylBHabxKPn5P2
d8jyJ9cEHzumx6tHjLgm09UvoCBg00+ameiNOpjNKbPw6iJXfuw=
=HDxx
-----END PGP SIGNATURE-----
Merge tag 'bcachefs-2024-05-30' of https://evilpiepirate.org/git/bcachefs
Pull bcachefs fixes from Kent Overstreet:
"Assorted odds and ends...
- two downgrade fixes
- a couple snapshot deletion and repair fixes, thanks to noradtux for
finding these and providing the image to debug them
- a couple assert fixes
- convert to folio helper, from Matthew
- some improved error messages
- bit of code reorganization (just moving things around); doing this
while things are quiet so I'm not rebasing fixes past reorgs
- don't return -EROFS on inconsistency error in recovery, this
confuses util-linux and has it retry the mount
- fix failure to return error on misaligned dio write; reported as an
issue with coreutils shred"
* tag 'bcachefs-2024-05-30' of https://evilpiepirate.org/git/bcachefs: (21 commits)
bcachefs: Fix failure to return error on misaligned dio write
bcachefs: Don't return -EROFS from mount on inconsistency error
bcachefs: Fix uninitialized var warning
bcachefs: Split out sb-errors_format.h
bcachefs: Split out journal_seq_blacklist_format.h
bcachefs: Split out replicas_format.h
bcachefs: Split out disk_groups_format.h
bcachefs: split out sb-downgrade_format.h
bcachefs: split out sb-members_format.h
bcachefs: Better fsck error message for key version
bcachefs: btree_gc can now handle unknown btrees
bcachefs: add missing MODULE_DESCRIPTION()
bcachefs: Fix setting of downgrade recovery passes/errors
bcachefs: Run check_key_has_snapshot in snapshot_delete_keys()
bcachefs: Refactor delete_dead_snapshots()
bcachefs: Fix locking assert
bcachefs: Fix lookup_first_inode() when inode_generations are present
bcachefs: Plumb bkey into __btree_err()
bcachefs: Use copy_folio_from_iter_atomic()
bcachefs: Fix sb-downgrade validation
...
We were accidentally returning -EROFS during recovery on filesystem
inconsistency - since this is what the journal returns on emergency
shutdown.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Compatibility fix - we no longer have a separate table for which order
gc walks btrees in, and special case the stripes btree directly.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Fix the 'make W=1' warning:
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/bcachefs/mean_and_variance_test.o
Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_check_version_downgrade() was setting c->sb.version, which
bch2_sb_set_downgrade() expects to be at the previous version; and it
shouldn't even have been set directly because c->sb.version is updated
by write_super().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
delete_dead_snapshots now runs before the main fsck.c passes which check
for keys for invalid snapshots; thus, it needs those checks as well.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Consolidate per-key work into delete_dead_snapshots_process_key(), so we
now walk all keys once, not twice.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We now track whether a transaction is locked, and verify that we don't
have nodes locked when the transaction isn't locked; reorder relocks to
not pop the new assert.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This function is used for finding the hash seed (which is the same in
all versions of an inode in different snapshots): ff an inode has been
deleted in a child snapshot we need to iterate until we find a live
version.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
copy_page_from_iter_atomic() will be removed at some point.
Also fixup a comment for folios.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Superblock downgrade entries are only two byte aligned, but section
sizes are 8 byte aligned, which means we have to be careful about
overrun checks; an entry that crosses the end of the section is allowed
(and ignored) as long as it has zero errors.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Just a few syzbot fixes
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmZQk0cACgkQE6szbY3K
bna7gA/+MSY3I95CwaJ4bBq5SCxOaRcrX099LFh8Zrj+OF+DWE2PtVo1LhhgnYrQ
KpZrS2Q9Qgb2yVqYzOY6LBfH4il1O/WwvloMG0MbuYiQFu9/JL/6CEK9uFyiGmaC
fdiFEN3u+8AK6phTFaqUU2ncG0XFQ1Ple5zmFXo4Y3ZJeNaubJeEDac+kbRvOwYh
rQ6Iy0FNoQymv0BzmuM7g2NsbhdAgHTv7rhGbfpNBZv3lu0yDXsfZZgWTr2oXMSP
FMhm4bcTGAFp5hbwq9k56ND8oSFpamsH7SwS4bDlEe1CNOfMI1JjnrvSEuDrocAE
1Jn2J2Gv9NXnEHKamVzzpUILG67buEtYzJyDQk51N4kulgThdpRzjm+11ylD5U0U
wzIK1HXsKHtRdUiIhQGLCLW61FXM+0QBIk2eXhPq88jsM2zTL7iMbXR3P/nvgUDy
8ia8g5Q+nKxcb223M8WmK0rBwlaNasE/hXiFT54ntt8bK5nmVJjPMxVXUmYth3hw
7STkuT0k5jVsMG1NqLkg+wSupj1AuWbD2hIcas7GkxarEYAULbQcClHYGpMll3Tw
+pJfLjAtBOkcE4TwWDLflVBhwWtdmPNhk51Q3iLVRp0Gm7t0rhE2vE6TjpsIFnrg
rUAgaqQqQ2WXfsRaGa2wx0tRKoW+8Iigq13ndn1AZIrfEtQkYUs=
=vuNC
-----END PGP SIGNATURE-----
Merge tag 'bcachefs-2024-05-24' of https://evilpiepirate.org/git/bcachefs
Pull bcachefs fixes from Kent Overstreet:
"Nothing exciting, just syzbot fixes (except for the one
FMODE_CAN_ODIRECT patch).
Looks like syzbot reports have slowed down; this is all catch up from
two weeks of conferences.
Next hardening project is using Thomas's error injection tooling to
torture test repair"
* tag 'bcachefs-2024-05-24' of https://evilpiepirate.org/git/bcachefs:
bcachefs: Fix race path in bch2_inode_insert()
bcachefs: Ensure we're RW before journalling
bcachefs: Fix shutdown ordering
bcachefs: Fix unsafety in bch2_dirent_name_bytes()
bcachefs: Fix stack oob in __bch2_encrypt_bio()
bcachefs: Fix btree_trans leak in bch2_readahead()
bcachefs: Fix bogus verify_replicas_entry() assert
bcachefs: Check for subvolues with bogus snapshot/inode fields
bcachefs: bch2_checksum() returns 0 for unknown checksum type
bcachefs: Fix bch2_alloc_ciphers()
bcachefs: Add missing guard in bch2_snapshot_has_children()
bcachefs: Fix missing parens in drop_locks_do()
bcachefs: Improve bch2_assert_pos_locked()
bcachefs: Fix shift overflows in replicas.c
bcachefs: Fix shift overflow in btree_lost_data()
bcachefs: Fix ref in trans_mark_dev_sbs() error path
bcachefs: set FMODE_CAN_ODIRECT instead of a dummy direct_IO method
bcachefs: Fix rcu splat in check_fix_ptrs()
__destroy_new_inode() is appropriate when we have _just_allocated the
inode, but not when it's been fully initialized and on i_sb_list.
Reported-by: syzbot+a0ddc9873c280a4cb18f@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
With the rework of how the __string() handles dynamic strings where it
saves off the source string in field in the helper structure[1], the
assignment of that value to the trace event field is stored in the helper
value and does not need to be passed in again.
This means that with:
__string(field, mystring)
Which use to be assigned with __assign_str(field, mystring), no longer
needs the second parameter and it is unused. With this, __assign_str()
will now only get a single parameter.
There's over 700 users of __assign_str() and because coccinelle does not
handle the TRACE_EVENT() macro I ended up using the following sed script:
git grep -l __assign_str | while read a ; do
sed -e 's/\(__assign_str([^,]*[^ ,]\) *,[^;]*/\1)/' $a > /tmp/test-file;
mv /tmp/test-file $a;
done
I then searched for __assign_str() that did not end with ';' as those
were multi line assignments that the sed script above would fail to catch.
Note, the same updates will need to be done for:
__assign_str_len()
__assign_rel_str()
__assign_rel_str_len()
I tested this with both an allmodconfig and an allyesconfig (build only for both).
[1] https://lore.kernel.org/linux-trace-kernel/20240222211442.634192653@goodmis.org/
Link: https://lore.kernel.org/linux-trace-kernel/20240516133454.681ba6a0@rorschach.local.home
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Julia Lawall <Julia.Lawall@inria.fr>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: Jani Nikula <jani.nikula@intel.com>
Acked-by: Christian König <christian.koenig@amd.com> for the amdgpu parts.
Acked-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> #for
Acked-by: Rafael J. Wysocki <rafael@kernel.org> # for thermal
Acked-by: Takashi Iwai <tiwai@suse.de>
Acked-by: Darrick J. Wong <djwong@kernel.org> # xfs
Tested-by: Guenter Roeck <linux@roeck-us.net>
the btree key cache uses the srcu struct created/destroyed by
btree_iter.c; btree_iter needs to be exited last.
Reported-by: syzbot+3af9daea347788b15213@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
verify_replicas_entry() is only for newly created replicas entries -
existing entries on disk may have unknown data types, and we have real
verifiers for them.
Reported-by: syzbot+73414091bd382684ee2b@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Replacement of bdev->bd_inode with sane(r) set of primitives.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZkwjlgAKCRBZ7Krx/gZQ
66OmAP9nhZLASn/iM2+979I6O0GW+vid+uLh48uW3d+LbsmVIgD9GYpR+cuLQ/xj
mJESWfYKOVSpFFSrqlzKg9PQlU/GFgs=
=6LRp
-----END PGP SIGNATURE-----
Merge tag 'pull-bd_inode-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull bdev bd_inode updates from Al Viro:
"Replacement of bdev->bd_inode with sane(r) set of primitives by me and
Yu Kuai"
* tag 'pull-bd_inode-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
RIP ->bd_inode
dasd_format(): killing the last remaining user of ->bd_inode
nilfs_attach_log_writer(): use ->bd_mapping->host instead of ->bd_inode
block/bdev.c: use the knowledge of inode/bdev coallocation
gfs2: more obvious initializations of mapping->host
fs/buffer.c: massage the remaining users of ->bd_inode to ->bd_mapping
blk_ioctl_{discard,zeroout}(): we only want ->bd_inode->i_mapping here...
grow_dev_folio(): we only want ->bd_inode->i_mapping there
use ->bd_mapping instead of ->bd_inode->i_mapping
block_device: add a pointer to struct address_space (page cache of bdev)
missing helpers: bdev_unhash(), bdev_drop()
block: move two helpers into bdev.c
block2mtd: prevent direct access of bd_inode
dm-vdo: use bdev_nr_bytes(bdev) instead of i_size_read(bdev->bd_inode)
blkdev_write_iter(): saner way to get inode and bdev
bcachefs: remove dead function bdev_sectors()
ext4: remove block_device_ejected()
erofs_buf: store address_space instead of inode
erofs: switch erofs_bread() to passing offset instead of block number
This fixes an assertion pop in btree_iter.c that checks for forgetting
to pass a snapshot ID when iterating over snapshots btrees.
Reported-by: syzbot+0dfe05235e38653e2aee@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This fixes missing guards on trying to calculate a checksum with an
invalid/unknown checksum type; moving the guards up to e.g. btree_io.c
might be "more correct", but doesn't buy us anything - an unknown
checksum type will always be flagged as at least a checksum error so we
aren't losing any safety doing it this way and it makes it less likely
to accidentally pop an assert we don't want.
Reported-by: syzbot+e951ad5349f3a34a715a@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Don't put error pointers in bch_fs, that's gross.
This fixes (?) the check in bch2_checksum_type_valid() - depending on
our error paths, or depending on what our error paths are doing it at
least makes the code saner.
Reported-by: syzbot+2e3cb81b5d1fe18a374b@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We additionally need to be going inconsistent if passed an invalid
snapshot ID; that patch will need more thorough testing.
Reported-by: syzbot+1c9fca23fe478633b305@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We can't disallow unknown data_types in verify() - we have to preserve
them unchanged for backwards compat; that means we have to add a few
more guards.
Reported-by: syzbot+249018ea545364f78d04@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Since commit a2ad63daa8 ("VFS: add FMODE_CAN_ODIRECT file flag") file
systems can just set the FMODE_CAN_ODIRECT flag at open time instead of
wiring up a dummy direct_IO method to indicate support for direct I/O.
Do that for bcachefs so that noop_direct_IO can eventually be removed.
Similar to commit b294349993 ("xfs: set FMODE_CAN_ODIRECT instead of
a dummy direct_IO method").
Signed-off-by: Youling Tang <tangyouling@kylinos.cn>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
- More safety fixes, primarily found by syzbot
- Run the upgrade/downgrade paths in nochnages mode. Nochanges mode is
primarily for testing fsck/recovery in dry run mode, so it shouldn't
change anything besides disabling writes and holding dirty metadata in
memory.
The idea here was to reduce the amount of activity if we can't write
anything out, so that bringing up a filesystem in "super ro" mode
would be more lilkely to work for data recovery - but norecovery is
the correct option for this.
- btree_trans->locked; we now track whether a btree_trans has any btree
nodes locked, and this is used for improved assertions related to
trans_unlock() and trans_relock(). We'll also be using it for
improving how we work with lockdep in the future: we don't want
lockdep to be tracking individual btree node locks because we take too
many for lockdep to track, and it's not necessary since we have a
cycle detector.
- Trigger improvements that are prep work for online fsck
- BTREE_TRIGGER_check_repair; this regularizes how we do some repair
work for extents that goes with running triggers in fsck, and fixes
some subtle issues with transaction restarts there.
- bch2_snapshot_equiv() has now been ripped out of fsck.c; snapshot
equivalence classes are for when snapshot deletion leaves behind
redundant snapshot nodes, but snapshot deletion now cleans this up
right away, so the abstraction doesn't need to leak.
- Improvements to how we resume writing to the journal in recovery. The
code for picking the new place to write when reading the journal is
greatly simplified and we also store the position in the superblock
for when we don't read the journal; this means that we preserve more
of the journal for list_journal debugging.
- Improvements to sysfs btree_cache and btree_node_cache, for debugging
memory reclaim.
- We now detect when we've blocked for 10 seconds on the allocator in
the write path and dump some useful info.
- Safety fixes for devices references: this is a big series that changes
almost all device lookups to properly check if the device exists and
take a reference to it.
Previously we assumed that if a bkey exists that references a device
then the device must exist, and this was enforced in .invalid methods,
but this was incorrect because it meant device removal relied on
accounting being correct to not leave keys pointing to invalid
devices, and that's not something we can assume.
Getting the "pointer to invalid device" checks out of our .invalid()
methods fixes some long standing device removal bugs; the only
outstanding bug with device removal now is a race between the discard
path and deleting alloc info, which should be easily fixed.
- The allocator now prefers not to expand the new
member_info.btree_allocated bitmap, meaning if repair ever requires
scanning for btree nodes (because of a corrupt interior nodes) we
won't have to scan the whole device(s).
- New coding style document, which among other things talks about the
correct usage of assertions
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmZKJQgACgkQE6szbY3K
bnZETg//SU9H0OHnBSMB/cteF6PKo9QR+dhT+3n+gWTxl0o/egbGTqwbzVqGtd2f
J6II1BsDk8VoTOb/gFfLRShlmJfnj2jpRThU265faR/7LQYeSaqndDPkjOpTayAD
Nj/DJyiSUTL753rZh3yUhOpOIHf7iapH6wuaZCPfhdfk+yvZNW8iz07JHjHLKRp8
I2cFH0r6kN916NdRkt9oDCz68WouT8eWTqwcKra04XsLEZjNJHxLpKMq4M8UdPc7
YynJPVt+aP8+VduGIq6pV8Co3afCP2oUywo11JpRmvLsw4tex/59wxOYtpMfgn6k
4H+9WqiBwkbmnLDrfFHWRameS6F/7+GRAOVuz9nkmfk61UPU15gLjSRffqZ6u2YC
7vbrXgebId/sZXtBpQd83RMMX52BnEJah0upNJ54IsSqfDYkU9lwl6CEyYpcX1hf
YNBGBTbspZztc3AB13b3ow421FMhaySUg0FDmntMR9O8Z6/BXk7Ykc7b8DPEfrFs
W6JY7q+ARBxr+EgFcV74fvMCf7NJTAhyv80AKryo7NFU2JZOyyaTxcTGSnolX4Mi
lyHiOgicmOX+vy3vbC1dZoDcmIDJ4Uc0vixYcpKiZqxlR8XJ+wpevC50TEhxrcW+
ZO4SloQvgyjI34xu/gZgjRYb3BhXK3x+ougVFpRG8V8zQ/+ccWg=
=MKrF
-----END PGP SIGNATURE-----
Merge tag 'bcachefs-2024-05-19' of https://evilpiepirate.org/git/bcachefs
Pull bcachefs updates from Kent Overstreet:
- More safety fixes, primarily found by syzbot
- Run the upgrade/downgrade paths in nochnages mode. Nochanges mode is
primarily for testing fsck/recovery in dry run mode, so it shouldn't
change anything besides disabling writes and holding dirty metadata
in memory.
The idea here was to reduce the amount of activity if we can't write
anything out, so that bringing up a filesystem in "super ro" mode
would be more lilkely to work for data recovery - but norecovery is
the correct option for this.
- btree_trans->locked; we now track whether a btree_trans has any btree
nodes locked, and this is used for improved assertions related to
trans_unlock() and trans_relock(). We'll also be using it for
improving how we work with lockdep in the future: we don't want
lockdep to be tracking individual btree node locks because we take
too many for lockdep to track, and it's not necessary since we have a
cycle detector.
- Trigger improvements that are prep work for online fsck
- BTREE_TRIGGER_check_repair; this regularizes how we do some repair
work for extents that goes with running triggers in fsck, and fixes
some subtle issues with transaction restarts there.
- bch2_snapshot_equiv() has now been ripped out of fsck.c; snapshot
equivalence classes are for when snapshot deletion leaves behind
redundant snapshot nodes, but snapshot deletion now cleans this up
right away, so the abstraction doesn't need to leak.
- Improvements to how we resume writing to the journal in recovery. The
code for picking the new place to write when reading the journal is
greatly simplified and we also store the position in the superblock
for when we don't read the journal; this means that we preserve more
of the journal for list_journal debugging.
- Improvements to sysfs btree_cache and btree_node_cache, for debugging
memory reclaim.
- We now detect when we've blocked for 10 seconds on the allocator in
the write path and dump some useful info.
- Safety fixes for devices references: this is a big series that
changes almost all device lookups to properly check if the device
exists and take a reference to it.
Previously we assumed that if a bkey exists that references a device
then the device must exist, and this was enforced in .invalid
methods, but this was incorrect because it meant device removal
relied on accounting being correct to not leave keys pointing to
invalid devices, and that's not something we can assume.
Getting the "pointer to invalid device" checks out of our .invalid()
methods fixes some long standing device removal bugs; the only
outstanding bug with device removal now is a race between the discard
path and deleting alloc info, which should be easily fixed.
- The allocator now prefers not to expand the new
member_info.btree_allocated bitmap, meaning if repair ever requires
scanning for btree nodes (because of a corrupt interior nodes) we
won't have to scan the whole device(s).
- New coding style document, which among other things talks about the
correct usage of assertions
* tag 'bcachefs-2024-05-19' of https://evilpiepirate.org/git/bcachefs: (155 commits)
bcachefs: add no_invalid_checks flag
bcachefs: add counters for failed shrinker reclaim
bcachefs: Fix sb_field_downgrade validation
bcachefs: Plumb bch_validate_flags to sb_field_ops.validate()
bcachefs: s/bkey_invalid_flags/bch_validate_flags
bcachefs: fsync() should not return -EROFS
bcachefs: Invalid devices are now checked for by fsck, not .invalid methods
bcachefs: kill bch2_dev_bkey_exists() in bch2_check_fix_ptrs()
bcachefs: kill bch2_dev_bkey_exists() in bch2_read_endio()
bcachefs: bch2_dev_get_ioref() checks for device not present
bcachefs: bch2_dev_get_ioref2(); io_read.c
bcachefs: bch2_dev_get_ioref2(); debug.c
bcachefs: bch2_dev_get_ioref2(); journal_io.c
bcachefs: bch2_dev_get_ioref2(); io_write.c
bcachefs: bch2_dev_get_ioref2(); btree_io.c
bcachefs: bch2_dev_get_ioref2(); backpointers.c
bcachefs: bch2_dev_get_ioref2(); alloc_background.c
bcachefs: for_each_bset() declares loop iter
bcachefs: Move BCACHEFS_STATFS_MAGIC value to UAPI magic.h
bcachefs: Improve sysfs internal/btree_cache
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZj3HuwAKCRCRxhvAZXjc
orYvAQCZOr68uJaEaXAArYTdnMdQ6HIzG+FVlwrqtrhz0BV07wEAqgmtSR9XKh+L
0+DNepg4R8PZOHH371eSSsLNRCUCkAs=
=SVsU
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.10.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"This contains the usual miscellaneous features, cleanups, and fixes
for vfs and individual fses.
Features:
- Free up FMODE_* bits. I've freed up bits 6, 7, 8, and 24. That
means we now have six free FMODE_* bits in total (but bit #6
already got used for FMODE_WRITE_RESTRICTED)
- Add FOP_HUGE_PAGES flag (follow-up to FMODE_* cleanup)
- Add fd_raw cleanup class so we can make use of automatic cleanup
provided by CLASS(fd_raw, f)(fd) for O_PATH fds as well
- Optimize seq_puts()
- Simplify __seq_puts()
- Add new anon_inode_getfile_fmode() api to allow specifying f_mode
instead of open-coding it in multiple places
- Annotate struct file_handle with __counted_by() and use
struct_size()
- Warn in get_file() whether f_count resurrection from zero is
attempted (epoll/drm discussion)
- Folio-sophize aio
- Export the subvolume id in statx() for both btrfs and bcachefs
- Relax linkat(AT_EMPTY_PATH) requirements
- Add F_DUPFD_QUERY fcntl() allowing to compare two file descriptors
for dup*() equality replacing kcmp()
Cleanups:
- Compile out swapfile inode checks when swap isn't enabled
- Use (1 << n) notation for FMODE_* bitshifts for clarity
- Remove redundant variable assignment in fs/direct-io
- Cleanup uses of strncpy in orangefs
- Speed up and cleanup writeback
- Move fsparam_string_empty() helper into header since it's currently
open-coded in multiple places
- Add kernel-doc comments to proc_create_net_data_write()
- Don't needlessly read dentry->d_flags twice
Fixes:
- Fix out-of-range warning in nilfs2
- Fix ecryptfs overflow due to wrong encryption packet size
calculation
- Fix overly long line in xfs file_operations (follow-up to FMODE_*
cleanup)
- Don't raise FOP_BUFFER_{R,W}ASYNC for directories in xfs (follow-up
to FMODE_* cleanup)
- Don't call xfs_file_open from xfs_dir_open (follow-up to FMODE_*
cleanup)
- Fix stable offset api to prevent endless loops
- Fix afs file server rotations
- Prevent xattr node from overflowing the eraseblock in jffs2
- Move fdinfo PTRACE_MODE_READ procfs check into the .permission()
operation instead of .open() operation since this caused userspace
regressions"
* tag 'vfs-6.10.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (39 commits)
afs: Fix fileserver rotation getting stuck
selftests: add F_DUPDFD_QUERY selftests
fcntl: add F_DUPFD_QUERY fcntl()
file: add fd_raw cleanup class
fs: WARN when f_count resurrection is attempted
seq_file: Simplify __seq_puts()
seq_file: Optimize seq_puts()
proc: Move fdinfo PTRACE_MODE_READ check into the inode .permission operation
fs: Create anon_inode_getfile_fmode()
xfs: don't call xfs_file_open from xfs_dir_open
xfs: drop fop_flags for directories
xfs: fix overly long line in the file_operations
shmem: Fix shmem_rename2()
libfs: Add simple_offset_rename() API
libfs: Fix simple_offset_rename_exchange()
jffs2: prevent xattr node from overflowing the eraseblock
vfs, swap: compile out IS_SWAPFILE() on swapless configs
vfs: relax linkat() AT_EMPTY_PATH - aka flink() - requirements
fs/direct-io: remove redundant assignment to variable retval
fs/dcache: Re-use value stored to dentry->d_flags instead of re-reading
...
Setting this flag on a filesystem results in validity checks being
skipped when writing bkeys. This flag will be used by tooling that
deliberately injects corruption into a filesystem in order to exercise
fsck. It shouldn't be set outside of testing/debugging code.
Signed-off-by: Thomas Bertschinger <tahbertschinger@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This adds distinct counters for every reason the btree node shrinker can
fail to free an object - if our shrinker isn't making progress, this
will tell us why.
Signed-off-by: Daniel Hill <daniel@gluo.nz>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
- bch2_sb_downgrade_validate() wasn't checking for a downgrade entry
extending past the end of the superblock section
- for_each_downgrade_entry() is used in to_text() and needs to work on
malformed input; it also was missing a check for a field extending
past the end of the section
Reported-by: syzbot+e49ccab73449180bc9be@syzkaller.appspotmail.com
Fixes: 84f1638795 ("bcachefs: bch_sb_field_downgrade")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We're about to start using bch_validate_flags for superblock section
validation - it's no longer bkey specific.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
fsync has a slightly odd usage of -EROFS, where it means "does not
support fsync". I didn't choose it...
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Move BCACHEFS_STATFS_MAGIC value to UAPI <linux/magic.h> under
BCACHEFS_SUPER_MAGIC definition (use common approach for name) and reuse the
definition in bcachefs_format.h BCACHEFS_STATFS_MAGIC.
There are other bcachefs magic definitions: BCACHE_MAGIC, BCHFS_MAGIC,
which use UUID_INIT() and are used only in libbcachefs. Therefore move
only BCACHEFS_STATFS_MAGIC value, which can be used outside of
libbcachefs for f_type field in struct statfs in statfs() or fstatfs().
Suggested-by: Su Yue <glass.su@suse.com>
Signed-off-by: Petr Vorel <pvorel@suse.cz>
Acked-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We now have a small bitmap in the member info section of the superblock
for "regions that have btree nodes", so that if we ever have to scan for
btree nodes in repair we don't have to scan the whole device(s).
This tweaks the allocator to prefer allocating from regions that are
already marked in this bitmap.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Tracepoints are garbage, and perf trace even cuts off some of our
fields.
Much nicer to just trace a string, and then we can build nicely
formatted output with printbufs.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This fixes a lifetime issue; bch2_nocow_write_unlock() uses
PTR_BUCKET_POS(), which needs the device - but we drop our ref to the
device in bch2_write_endio().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_dev_bkey_exists() is going away; bch2_dev_have_ref() documents that
we're looking up a device without checking if it's present because we
have a reference to it already.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Wrapper around bch2_dev_have_ref() for open_buckets; we do guarantee
that the device an open_bucket points to exists.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This is needed for the next patch - the write submit path has to be able
to allocate a replica bio even when we weren't able to get a ref on the
device.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Most uses of bch2_dev_bkey_exists() are going away, where we assume that
because a key references a device the device most exist - instead, we'll
be explicitly checking if the device exists and getting a reference to
it.
This adds the new helpers.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Eliminating bch2_dev_bkey_exists() uses and replacing them with proper
checks; this one was unnecessary since the caller already has it.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If we block on the allocator for more than 10 seconds, print out some
useful debugging info.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We can't strictly guarantee that no pointers refer to nonexistent
devices - we attempt to, but we need to be safe when the filesystem is
corrupt.
Therefore, change device_add to try to pick a slot that's never been
used, or the slot that's been unused the longest.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Now explicitly allocate and free the buckets_nouse bitmap - this is
going to be used for online fsck.
To go RW when we haven't check allocations, we'll do a much slimmed down
version that just initializes the buckets_nouse bitmaps.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
It should be FS_IOC32_GETFLAGS instead of FS_IOC_GETFLAGS in
compat ioctl.
Signed-off-by: Youling Tang <tangyouling@kylinos.cn>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
In bch2_link_trans(), if bch2_inode_nlink_inc() fails, it needs to
call bch2_trans_iter_exit() in the error path.
Signed-off-by: Youling Tang <tangyouling@kylinos.cn>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The vfs[1] documentation describes free_inode as follows:
```
free_inode
this method is called from RCU callback. If you use call_rcu()
in ->destroy_inode to free ‘struct inode’ memory, then it’s
better to release memory in this method.
```
free_inode will be called by the RCU callback, so it might be better
to move the inode free operation to destroy_inode.
Similar to commit ae6b47b565 ("fs/ntfs3: Change destroy_inode to
free_inode").
Link:
[1]: https://www.kernel.org/doc/html/latest/filesystems/vfs.html
Signed-off-by: Youling Tang <tangyouling@kylinos.cn>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
On recovery from clean shutdown we don't typically read the journal, but
we still want to avoid overwriting existing entries in the journal for
list_journal debugging.
Thus, add some fields to the member info section so we can remember
where we left off.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When compiling the bcachefs-tools, the following compilation warning
is reported:
libbcachefs/snapshot.c: In function ‘bch2_reconstruct_snapshots’:
libbcachefs/snapshot.c:915:19: warning: ‘tree_id’ may be used uninitialized in this function [-Wmaybe-uninitialized]
915 | snapshot->v.tree = cpu_to_le32(tree_id);
libbcachefs/snapshot.c:903:6: note: ‘tree_id’ was declared here
903 | u32 tree_id;
| ^~~~~~~
This is a false alert, because @tree_id is changed in
bch2_snapshot_tree_create after it returns 0. And if this function
returns other value, @tree_id wouldn't be used. Thus there should
be nothing wrong in logical.
Although the report itself is a false alert, we can still make it more
explicit by setting the initial value of @tree_id to 0 (an invalid
tree ID).
Fixes: a292be3b68 ("bcachefs: Reconstruct missing snapshot nodes")
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When building for a 32-bit target, for which 'size_t' is 'unsigned int',
there are two warnings around mismatched format specifiers and argument
types:
In file included from fs/bcachefs/vstructs.h:5,
from fs/bcachefs/bcachefs_format.h:79,
from fs/bcachefs/bcachefs.h:207,
from fs/bcachefs/btree_key_cache.c:3:
fs/bcachefs/btree_key_cache.c: In function 'bch2_btree_key_cache_to_text':
fs/bcachefs/btree_key_cache.c:1046:25: error: format '%lu' expects argument of type 'long unsigned int', but argument 3 has type 'size_t' {aka 'unsigned int'} [-Werror=format=]
1046 | prt_printf(out, "nonpcpu freelist:\t%lu\r\n", bc->nr_freed_nonpcpu);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~
| |
| size_t {aka unsigned int}
fs/bcachefs/util.h:192:63: note: in definition of macro 'prt_printf'
192 | #define prt_printf(_out, ...) bch2_prt_printf(_out, __VA_ARGS__)
| ^~~~~~~~~~~
fs/bcachefs/btree_key_cache.c:1046:47: note: format string is defined here
1046 | prt_printf(out, "nonpcpu freelist:\t%lu\r\n", bc->nr_freed_nonpcpu);
| ~~^
| |
| long unsigned int
| %u
fs/bcachefs/btree_key_cache.c:1047:25: error: format '%lu' expects argument of type 'long unsigned int', but argument 3 has type 'size_t' {aka 'unsigned int'} [-Werror=format=]
1047 | prt_printf(out, "pcpu freelist:\t%lu\r\n", bc->nr_freed_pcpu);
| ^~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~
| |
| size_t {aka unsigned int}
fs/bcachefs/util.h:192:63: note: in definition of macro 'prt_printf'
192 | #define prt_printf(_out, ...) bch2_prt_printf(_out, __VA_ARGS__)
| ^~~~~~~~~~~
fs/bcachefs/btree_key_cache.c:1047:44: note: format string is defined here
1047 | prt_printf(out, "pcpu freelist:\t%lu\r\n", bc->nr_freed_pcpu);
| ~~^
| |
| long unsigned int
| %u
cc1: all warnings being treated as error
Use the proper 'size_t' specifier, '%zu', to clear up the warnings for
these platforms.
Fixes: f2d47ec26af5 ("bcachefs: Btree key cache instrumentation")
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When building with clang's -Wincompatible-function-pointer-types-strict
(a warning designed to catch potential kCFI failures at build time),
there are several warnings along the lines of:
fs/bcachefs/bkey_methods.c:118:2: error: incompatible function pointer types initializing 'int (*)(struct btree_trans *, enum btree_id, unsigned int, struct bkey_s_c, struct bkey_s, enum btree_iter_update_trigger_flags)' with an expression of type 'int (struct btree_trans *, enum btree_id, unsigned int, struct bkey_s_c, struct bkey_s, unsigned int)' [-Werror,-Wincompatible-function-pointer-types-strict]
118 | BCH_BKEY_TYPES()
| ^~~~~~~~~~~~~~~~
fs/bcachefs/bcachefs_format.h:394:2: note: expanded from macro 'BCH_BKEY_TYPES'
394 | x(inode, 8) \
| ^~~~~~~~~~~~~~~~~~~~~~~~~~
fs/bcachefs/bkey_methods.c:117:41: note: expanded from macro 'x'
117 | #define x(name, nr) [KEY_TYPE_##name] = bch2_bkey_ops_##name,
| ^~~~~~~~~~~~~~~~~~~~
<scratch space>:277:1: note: expanded from here
277 | bch2_bkey_ops_inode
| ^~~~~~~~~~~~~~~~~~~
fs/bcachefs/inode.h:26:13: note: expanded from macro 'bch2_bkey_ops_inode'
26 | .trigger = bch2_trigger_inode, \
| ^~~~~~~~~~~~~~~~~~
There are several functions that did not have their flags parameter
converted to 'enum btree_iter_update_trigger_flags' in the recent
unification, which will cause kCFI failures at runtime because the
types, while ABI compatible (hence no warning from the non-strict
version of this warning), do not match exactly.
Fix up these functions (as well as a few other obvious functions that
should have it, even if there are no warnings currently) to resolve the
warnings and potential kCFI runtime failures.
Fixes: 31e4ef3280c8 ("bcachefs: iter/update/trigger/str_hash flag cleanup")
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This unifies the online and offline btree gc passes; we're not yet
running it online.
We now iterate over one level of the btree at a time - the same as
check_extents_to_backpointers(); this ordering preserves order of keys
regardless of btree splits and merges, which will be important when we
re-enable online gc.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Currently, the reflink_p gc trigger does repair as well - turning a
reflink_p key into an error key if the reflink_v it points to doesn't
exist.
This won't work with online check/repair, because the repair path once
online will be subject to transaction restarts, but BTREE_TRIGGER_gc is
not idempotant - we can't run it multiple times if we get a transaction
restart.
So we need to split these paths; to do so this patch calls
check_fix_ptrs() by a new general path - a new trigger type,
BTREE_TRIGGER_check_repair.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
looping when we change a bucket gen is not ideal - it means we risk
failing if we'd go into an infinite loop, and it's better to make
forward progress even if fsck doesn't fix everything.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If we hit an inconsistency when updating allocation information, we
don't want to fail the update if it's for a deletion - only if it's for
a new key.
Rename check_bucket_ref() -> bucket_ref_update() so we can centralize
the logic to do this.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This eliminates some duplicated logic, and the gc path now handles
stripe updates and deletions - we need this since soon we're bringing
back runtime gc.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Start to work on unifying mark_stripe_bucket() and
trans_mark_stripe_bucket(); first, clean up all the unnecessary and
gratuitious differences.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We're working on potentially unifying bch2_check_bucket_ref() and
bch2_check_fix_ptrs() - or at least eliminating gratuitious differences.
Most immediately, there's a bunch of cleanups to be done regarding
BCH_DATA_stripe.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
There's no need to be using new_inode(); we can skip all that
indirection and make the code easier to follow.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Since btree_ptr_v2, we no longer require the journal seq blacklist table
for skipping blacklisted bsets (btree node entries); the pointer to a
given node indicates how much data is present.
Therefore there's no longer any need for journal seq blacklist gc to
walk the btree - we can prune entries older than journal last_seq.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
prep work for making the allocator try to keep btree nodes within the
existing member info btree allocated bitmap
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
It turns out the btree key cache shrinker wasn't actually reclaiming
anything, prior to the previous patch. This adds instrumentation so that
if we have further issues we can see what's going on.
Specifically, sysfs internal/btree_key_cache is greatly expanded with
new counters, and the SRCU sequence numbers of the first 10 entries on
each pending freelist, and we also add trigger_btree_key_cache_shrink
for testing without having to prune all the system caches.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Common code doesn't test the error flag, so we don't need to set it in
bcachefs. We can use folio_end_read() to combine the setting (or not)
of the uptodate flag and clearing the lock flag.
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Brian Foster <bfoster@redhat.com>
Cc: linux-bcachefs@vger.kernel.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This is a nice cleanup - and we've also been having problems with
kthread creation in the mount path.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Make things easier when we're debugging long fsck runs - persist the
work that successful recovery passes did.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Originally, when deleting snapshots we didn't collapse redundant
snapshot nodes; thus, the notion of a class of equivalent snapshot nodes
leaked into fsck.
Now we do, so snapshot ID equivalence classes are purely local to
snapshot deletion.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
With the new assertions, we shouldn't be holding locks when
trans->locked is false, thus, we shouldn't use relock when we just want
to check if we can relock.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add a field for tracking whether a transaction object holds btree locks,
and assertions to verify state.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We're starting to be more strict about transaction locked state, and
multiple transactions in a task.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We're about to add new asserts for btree_trans locking consistency, and
part of that requires that aren't using the btree_trans while it's
unlocked.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
In the key cache fill path, we use path_upgrade() on a path that isn't
uptodate yet but should be locked.
This change makes bch2_btree_path_upgrade() slightly looser so we can
use it in key cache upgrade, instead of the __ version.
Also, make the related assert - that path->uptodate implies nodes_locked
- slightly clearer.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Commit ec9cc18fc2 ("bcachefs: Add checks for invalid snapshot IDs")
intends to check the sanity of a snapshot and panic when
BCACHEFS_DEBUG is set, but that conditional has a typo.
Fix the typo to refer to the actual existing Kconfig symbol.
This was found with ./scripts/checkkconfigsymbols.py.
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Since commit 43a7206b09 ("driver core: class: make class_register() take
a const *"), the driver core allows for struct class to be in read-only
memory, so move the bch_chardev_class structure to be declared at build
time placing it into read-only memory, instead of having to be dynamically
allocated at boot time. Also, correctly clean up after failing paths in
bch2_chardev_init().
Cc: Hongbo Li <lihongbo22@huawei.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add a new helper that calls dir_emit() and updates ctx->pos on success;
this lets us convert bch2_readdir() to drop_locks_do().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Combine iter/update/trigger/str_hash flags into a single enum, and
x-macroize them for a to_text() function later.
These flags are all for a specific iter/key/update context, so it makes
sense to group them together - iter/update/trigger flags were already
given distinct bits, this cleans up and unifies that handling.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Consolidate mark_superblock() and trans_mark_superblock(), like we did
with the other trigger paths.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>