linux-loongson

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson synced 2025-09-01 23:46:45 +00:00

Author	SHA1	Message	Date
Kent Overstreet	9c09e59cc5	bcachefs: fix wrong arg to fsck_err() fsck_err() needs the btree transaction passed to it if there is one - so that it can unlock/relock around prompting userspace for fixing the error. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-05-14 18:59:15 -04:00
Kent Overstreet	d1041d8eab	bcachefs: Fix missing commit in backpointer to missing target Fsck wants to do transaction commits from an outer context; it may have other repair to do (i.e. duplicate backpointers). But when calling backpointer_not_found() from runtime code, i.e. runtime self healing, we should be doing the commit - the outer context expects to just be doing lookups. This fixes bugs where we get stuck spinning, reported as "RCU lock hold time" warnings. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-05-14 17:05:19 -04:00
Kent Overstreet	a12cb6f758	bcachefs: Fix accidental O(n^2) in fiemap Since bch2_seek_pagecache_data() searches for dirty data, we only want to call it for holes in the extents btree - otherwise we have an accidental O(n^2), as we repeatedly search the same range. Reported-by: Marcin Mirosław <marcin@mejor.pl> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-05-14 17:05:19 -04:00
Kent Overstreet	43b9fece2d	bcachefs: Fix set_should_be_locked() call in peek_slot() set_should_be_locked() needs to be called before peek_key_cache(), which traverses other paths and may do a trans unlock/relock. This fixes an assertion pop in path_peek_slot(), when the path we're using is unexpectedly not uptodate. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-05-14 17:05:19 -04:00
Alan Huang	61198e6287	bcachefs: Fix self deadlock Before invoking bch2_accounting_mem_mod_locked in bch2_gc_accounting_done, we have already write-locked mark_lock; bch2_accounting_mem_insert then locks mark_lock again. Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-05-14 17:05:19 -04:00
Kent Overstreet	19b22d04cd	bcachefs: Don't set btree nodes as accessed on fill Prevent jobs that do lots of scanning (i.e. evacuate, scrub) from causing OOMs. The shrinker code seems to be having issues when it doesn't do any freeing because it's just flipping off the accessed bit - and the accessed bit shouldn't be set on first use anyways. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-05-14 17:05:19 -04:00
Kent Overstreet	7b6759b199	bcachefs: Fix livelock in journal_entry_open() When the journal is low on space, we might do discards from journal_res_get() -> journal_entry_open(). Make sure we set j->can_discard correctly, so that if we're low on space but not because discards aren't keeping up we don't livelock. Fixes: `8e4d28036c` ("bcachefs: Don't aggressively discard the journal") Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-05-14 17:05:19 -04:00
Kent Overstreet	b1c71cb492	bcachefs: Fix broken btree_path lock invariants in next_node() This fixes btree locking assert pops users were seeing during evacuate: https://github.com/koverstreet/bcachefs/issues/878 May 09 22:45:02 sharon kernel: bcachefs (68116e25-fa2d-4c6f-86c7-e8b431d792ae): bch2_btree_insert_node(): node not locked at level 1 May 09 22:45:02 sharon kernel: bch2_btree_node_rewrite [bcachefs]: watermark=btree no_check_rw alloc l=0-1 mode=none nodes_written=0 cl.remaining=2 journal_seq=0 May 09 22:45:02 sharon kernel: path: idx 1 ref 1:0 S B btree=alloc level=0 pos 0:3699637:0 0:3698012:1-0:3699637:0 bch2_move_btree.isra.0+0x1db/0x490 [bcachefs] uptodate 0 locks_want 2 May 09 22:45:02 sharon kernel: l=0 locks intent seq 4 node ffff8bd700c93600 May 09 22:45:02 sharon kernel: l=1 locks unlocked seq 1712 node ffff8bd6fd5e7a00 May 09 22:45:02 sharon kernel: l=2 locks unlocked seq 2295 node ffff8bd6cc725400 May 09 22:45:02 sharon kernel: l=3 locks unlocked seq 0 node 0000000000000000 Evacuate walks btree nodes with bch2_btree_iter_next_node() and rewrites them, bch2_btree_update_start() upgrades the path to take intent locks as far as it needs to. But next_node() does low level unlock/relock calls on individual nodes, and didn't handle the case where a path is supposed to be holding multiple intent locks. If a path has locks_want > 1, it needs to be either holding locks on all the btree nodes (at each level) requested, or none of them. Fix this with a bch2_btree_path_downgrade(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-05-14 17:05:19 -04:00
Kent Overstreet	cd52cc3544	bcachefs: Don't strip rebalance_opts from indirect extents Fix bch2_bkey_clear_needs_rebalance(): indirect extents are never supposed to have bch_extent_rebalance stripped off, because that's how we get the IO path options when we don't have the original inode it belonged to. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-05-14 17:05:19 -04:00
Eric Biggers	607c92141c	crypto: lib/chacha - add strongly-typed state zeroization Now that the ChaCha state matrix is strongly-typed, add a helper function chacha_zeroize_state() which zeroizes it. Then convert all applicable callers to use it instead of direct memzero_explicit. No functional changes. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2025-05-12 13:32:53 +08:00
Eric Biggers	98066f2f89	crypto: lib/chacha - strongly type the ChaCha state The ChaCha state matrix is 16 32-bit words. Currently it is represented in the code as a raw u32 array, or even just a pointer to u32. This weak typing is error-prone. Instead, introduce struct chacha_state: struct chacha_state { u32 x[16]; }; Convert all ChaCha and HChaCha functions to use struct chacha_state. No functional changes. Signed-off-by: Eric Biggers <ebiggers@google.com> Acked-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2025-05-12 13:32:53 +08:00
Fedor Pchelkin	f3def8270c	sort.h: hoist cmp_int() into generic header file Deduplicate the same functionality implemented in several places by moving the cmp_int() helper macro into linux/sort.h. The macro performs a three-way comparison of the arguments mostly useful in different sorting strategies and algorithms. Link: https://lkml.kernel.org/r/20250427201451.900730-1-pchelkin@ispras.ru Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru> Suggested-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Kent Overstreet <kent.overstreet@linux.dev> Acked-by: Coly Li <colyli@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Carlos Maiolino <cem@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Coly Li <colyli@kernel.org> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-05-11 17:54:12 -07:00
Ingo Molnar	aad823aa3a	treewide, timers: Rename destroy_timer_on_stack() as timer_destroy_on_stack() Move this API to the canonical timer_*() namespace. Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250507175338.672442-10-mingo@kernel.org	2025-05-08 19:49:33 +02:00
Kent Overstreet	8e4d28036c	bcachefs: Don't aggressively discard the journal We frequently use 'bcachefs list_journal -a' for debugging, as it provides a record of all btree transactions, and a history of what happened. But it's not so useful if we immediately discard journal buckets right after they're no longer dirty. This tweaks journal reclaim to only discard when we're low on space, keeping the journal mostly un-discarded. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-05-07 17:10:10 -04:00
Kent Overstreet	da18dabc37	bcachefs: Ensure superblock gets written when we go ERO When we go emergency read-only, make sure we do a final write_super() to persist counters and error counts - this can be critical for piecing together what fsck was doing. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-05-07 17:09:59 -04:00
Kent Overstreet	2fea3aa76e	bcachefs: Filter out harmless EROFS error messages These just indicate that we're shutting down. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-05-07 16:58:32 -04:00
Kent Overstreet	473f09f362	bcachefs: journal_shutdown is EROFS, not EIO We often filter out EROFS errors to avoid log spew after an emergency shutdown - journal_shutdown is just another emergency shutdown error. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-05-07 16:58:26 -04:00
Kent Overstreet	9c61856099	bcachefs: Call bch2_fs_start before getting vfs superblock This reverts `1fdbe0b184` ("bcachefs: Make sure c->vfs_sb is set before starting fs"), which switched up bch2_fs_get_tree() so that we got a superblock before calling bch2_fs_start, so that c->vfs_sb would always be initialized while the filesystem was active. This turned out not to be necessary, because blk_holder_ops were implemented using our own locking, not vfs locking. And this had the side effect of creating a super_block and doing our full recovery (including potentially fsck) before setting SB_BORN, which causes things like sync calls to hang until our recovery is finished. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-05-05 16:06:35 -04:00
Kent Overstreet	aed4ccbf45	bcachefs: fix hung task timeout in journal read Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-05-05 14:21:28 -04:00
Kent Overstreet	7a69fa6571	bcachefs: Add missing barriers before wake_up_bit() wake_up() doesn't require a barrier - but wake_up_bit() does. This only affected non-x86, and primarily led to lost wakeups after btree node reads. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-05-05 14:19:10 -04:00
Kent Overstreet	50a7b899a0	bcachefs: Ensure proper write alignment There was a buggy version of bcachefs-tools which picked misaligned bucket sizes when formatting, and we're also about to do dynamic block sizes - which will allow picking logical block size or physical block size of the device per-write, allowing for better compression ratios at the cost of slightly worse write performance (i.e. forcing the device to do RMW or extra buffering). To account for this, tweak bch2_alloc_sectors_start() to properly align open_buckets to the blocksize of the write we're about to do. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-05-05 14:19:01 -04:00
Kent Overstreet	844f766e02	bcachefs: Improve want_cached_ptr() If promote target isn't set, rebalance should still leave a cached copy on the faster device. Fall back to foreground_target if it's set, or allow a cached copy on any device if neither are set. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-05-05 14:16:20 -04:00
Kent Overstreet	df2e19a883	bcachefs: thread_with_stdio: fix spinning instead of exiting bch2_stdio_redirect_vprintf() was missing a check for stdio->done, i.e. exiting. This caused the thread attempting to print to spin, and since it was being called from the kthread run by thread_with_stdio, the userspace side hung as well. Change it to return -EPIPE - i.e. writing to a pipe that's been closed. Reported-by: Jan Solanti <jhs@psonet.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-05-04 14:00:14 -04:00
Alan Huang	6846100b00	bcachefs: Remove incorrect __counted_by annotation This actually reverts `86e92eeeb2` ("bcachefs: Annotate struct bch_xattr with __counted_by()"). After the x_name, there is a value. According to the discussion[1], __counted_by assumes that the flexible array member contains exactly the amount of elements that are specified. Users have now come across a false positive detection of an out of bounds write caused by the __counted_by here[2], so revert that. [1] https://lore.kernel.org/lkml/Zv8VDKWN1GzLRT-_@archlinux/T/#m0ce9541c5070146320efd4f928cc1ff8de69e9b2 [2] https://privatebin.net/?a0d4e97d590d71e1#9bLmp2Kb5NU6X6cZEucchDcu88HzUQwHUah8okKPReEt Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-05-01 16:38:58 -04:00
Kent Overstreet	28580052e6	bcachefs: add missing sched_annotate_sleep() 00594 ------------[ cut here ]------------ 00594 do not call blocking ops when !TASK_RUNNING; state=2 set at [<000000003e51ef4a>] prepare_to_wait_event+0x5c/0x1c0 00594 WARNING: CPU: 12 PID: 1117 at kernel/sched/core.c:8741 __might_sleep+0x74/0x88 00594 Modules linked in: 00594 CPU: 12 UID: 0 PID: 1117 Comm: umount Not tainted 6.15.0-rc4-ktest-g3a72e369412d #21845 PREEMPT 00594 Hardware name: linux,dummy-virt (DT) 00594 pstate: 60001005 (nZCv daif -PAN -UAO -TCO -DIT +SSBS BTYPE=--) 00594 pc : __might_sleep+0x74/0x88 00594 lr : __might_sleep+0x74/0x88 00594 sp : ffffff80c8d67a90 00594 x29: ffffff80c8d67a90 x28: ffffff80f5903500 x27: 0000000000000000 00594 x26: 0000000000000000 x25: ffffff80cf5002a0 x24: ffffffc087dad000 00594 x23: ffffff80c8d67b40 x22: 0000000000000000 x21: 0000000000000000 00594 x20: 0000000000000242 x19: ffffffc080b92020 x18: 00000000ffffffff 00594 x17: 30303c5b20746120 x16: 74657320323d6574 x15: 617473203b474e49 00594 x14: 0000000000000001 x13: 00000000000c0000 x12: ffffff80facc0000 00594 x11: 0000000000000001 x10: 0000000000000001 x9 : ffffffc0800b0774 00594 x8 : c0000000fffbffff x7 : ffffffc087dac670 x6 : 00000000015fffa8 00594 x5 : ffffff80facbffa8 x4 : ffffff80fbd30b90 x3 : 0000000000000000 00594 x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffffff80f5903500 00594 Call trace: 00594 __might_sleep+0x74/0x88 (P) 00594 __mutex_lock+0x64/0x8d8 00594 mutex_lock_nested+0x28/0x38 00594 bch2_fs_ec_flush+0xf8/0x128 00594 __bch2_fs_read_only+0x54/0x1d8 00594 bch2_fs_read_only+0x3e0/0x438 00594 __bch2_fs_stop+0x5c/0x250 00594 bch2_put_super+0x18/0x28 00594 generic_shutdown_super+0x6c/0x140 00594 bch2_kill_sb+0x1c/0x38 00594 deactivate_locked_super+0x54/0xd0 00594 deactivate_super+0x70/0x90 00594 cleanup_mnt+0xec/0x188 00594 __cleanup_mnt+0x18/0x28 00594 task_work_run+0x90/0xd8 00594 do_notify_resume+0x138/0x148 00594 el0_svc+0x9c/0xa0 00594 el0t_64_sync_handler+0x104/0x130 00594 el0t_64_sync+0x154/0x158 Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-05-01 13:54:58 -04:00
Kent Overstreet	e2699274d5	bcachefs: Fix __bch2_dev_group_set() bch2_sb_disk_groups_to_cpu() goes off of the superblock member info, so we need to set that first. Reported-by: Stijn Tintel <stijn@linux-ipv6.be> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-05-01 12:22:10 -04:00
Kent Overstreet	e660d7ca74	bcachefs: Kill ERO for i_blocks check in truncate Replace with logging the error in the superblock. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-05-01 06:19:58 -04:00
Kent Overstreet	3a72e36941	bcachefs: check for inode.bi_sectors underflow Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-05-01 06:19:58 -04:00
Kent Overstreet	05450c48a3	bcachefs: Kill ERO in __bch2_i_sectors_acct() We won't be root causing this in the immediate future, and it's fairly innocuous - so just log it in the superblock. https://github.com/koverstreet/bcachefs/issues/869 Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-05-01 06:19:58 -04:00
Kent Overstreet	5e63d579e7	bcachefs: readdir fixes - Don't call bch2_trans_relock() after dir_emit(); taking a transaction restart here will cause us to emit the same dirent to userspace twice - Fix incorrect checking of the return value on dir_emit(): "true" means success, keep going, but bch2_dir_emit() needs to return true when we're finished iterating. https://github.com/koverstreet/bcachefs/issues/867 Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-30 11:49:34 -04:00
Kent Overstreet	2feaa92c7c	bcachefs: improve missing journal write device error message Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-30 11:49:28 -04:00
Kent Overstreet	dbe4674802	bcachefs: Topology error after insert is now an ERO A user hit this, and this will naturally be easier to debug if we don't panic. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-28 22:42:17 -04:00
Kent Overstreet	9a4a858c9b	bcachefs: Use bch2_kvmalloc() for journal keys array We can hit this limit fairly easily when we have to reconstruct large amounts of alloc info on large filesystems. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-28 22:42:17 -04:00
Kent Overstreet	e5a3b8cf33	bcachefs: More informative error message when shutting down due to error Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-28 22:42:17 -04:00
Kent Overstreet	652dd6558b	bcachefs: btree_root_unreadable_and_scan_found_nothing autofix for non data btrees If losing a btree won't cause data loss - i.e. it's an alloc btree, or we can easily reconstruct it - we shouldn't require user action to continue repair. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-28 22:42:17 -04:00
Kent Overstreet	c366b1672d	bcachefs: btree_node_data_missing is now autofix Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-28 16:46:13 -04:00
Kent Overstreet	eca5b56ccf	bcachefs: Don't generate alloc updates to invalid buckets Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-28 16:46:13 -04:00
Kent Overstreet	e7f1a52849	bcachefs: Improve bch2_dev_bucket_missing() More useful error message. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-28 16:46:13 -04:00
Kent Overstreet	002466446a	bcachefs: fix bch2_dev_buckets_resize() The resize memcpy path was totally busted. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-28 16:46:13 -04:00
Kent Overstreet	9e9c28acfd	bcachefs: Add upgrade table entry from 0.14 There are a few errors that needed to be marked as autofix. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-28 16:46:12 -04:00
Kent Overstreet	3c24020119	bcachefs: Run BCH_RECOVERY_PASS_reconstruct_snapshots on missing subvol -> snapshot Fix this repair path. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-28 16:46:12 -04:00
Kent Overstreet	bdc32a10a2	bcachefs: Add missing utf8_unload() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-28 16:46:12 -04:00
Kent Overstreet	70c3d89f49	bcachefs: Emit unicode version message on startup fstests expects this Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-28 16:46:12 -04:00
Kent Overstreet	c83311c5b9	bcachefs: Use generic_set_sb_d_ops for standard casefolding d_ops Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-28 16:46:12 -04:00
Kent Overstreet	a2f546330e	bcachefs: Fix losing return code in next_fiemap_extent() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-28 16:46:12 -04:00
Kent Overstreet	d1b0f9aa73	bcachefs: Rework fiemap transaction restart handling Restart handling in the previous patch was incorrect, so: move btree operations into a separate helper, and run it with a lockrestart_do(). Additionally, clarify whether pagecache or the btree takes precedence. Right now, the btree takes precedence: this is incorrect, but it's needed to pass fstests. Add a giant comment explaining why. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-24 19:10:29 -04:00
Brian Foster	b9b0494017	bcachefs: add fiemap delalloc extent detection bcachefs currently populates fiemap data from the extents btree. This works correctly when the fiemap sync flag is provided, but if not, it skips all delalloc extents that have not yet been flushed. This is because delalloc extents from buffered writes are first stored as reservation in the pagecache, and only become resident in the extents btree after writeback completes. Update the fiemap implementation to process holes between extents by scanning pagecache for data, via seek data/hole. If a valid data range is found over a hole in the extent btree, fake up an extent key and flag the extent as delalloc for reporting to userspace. Note that this does not necessarily change behavior for the case where there is dirty pagecache over already written extents, where when in COW mode, writeback will allocate new blocks for the underlying ranges. The existing behavior is consistent with btrfs and it is recommended to use the sync flag for the most up to date extent state from fiemap. Signed-off-by: Brian Foster <bfoster@redhat.com>	2025-04-24 19:10:29 -04:00
Brian Foster	2d55a63709	bcachefs: refactor fiemap processing into extent helper and struct The bulk of the loop in bch2_fiemap() involves processing the current extent key from the iter, including following indirections and trimming the extent size and such. This patch makes a few changes to reduce the size of the loop and facilitate future changes to support delalloc extents. Define a new bch_fiemap_extent structure to wrap the bkey buffer that holds the extent key to report to userspace along with associated fiemap flags. Update bch2_fill_extent() to take the bch_fiemap_extent as a param instead of the individual fields. Finally, lift the bulk of the extent processing into a bch2_fiemap_extent() helper that takes the current key and formats the bch_fiemap_extent appropriately for the fill function. No functional changes intended by this patch. Signed-off-by: Brian Foster <bfoster@redhat.com>	2025-04-24 19:10:29 -04:00
Brian Foster	d020a9fb11	bcachefs: track current fiemap offset in start variable Signed-off-by: Brian Foster <bfoster@redhat.com>	2025-04-24 19:10:28 -04:00
Brian Foster	28d2d19ccc	bcachefs: drop duplicate fiemap sync flag FIEMAP_FLAG_SYNC handling was deliberately moved into core code in commit `45dd052e67` ("fs: handle FIEMAP_FLAG_SYNC in fiemap_prep"), released in kernel v5.8. Update bcachefs accordingly. Signed-off-by: Brian Foster <bfoster@redhat.com>	2025-04-24 19:10:28 -04:00
Kent Overstreet	353739f1d1	bcachefs: Fix btree_iter_peek_prev() at end of inode At the end of the inode, on an extents iterator, peek_slot() has to advance to the next position to avoid returning a 0 size extent, which is not allowed. Changing iter->pos confuses peek_prev(), but we don't need to call peek_slot() in this case. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-24 19:09:52 -04:00
Kent Overstreet	c4f89a1d35	bcachefs: Make btree_iter_peek_prev() assert more precise The issue this assert is guarding against is that in BTREE_ITER_filter_snapshots mode we only want to be iterating within a single inode number - if we iterate into another inode number with keys for a different snapshot tree, we'll loop arbitrarily long before finding a key we can return. This comes up in the unit tests, where we're using inode 0 for our test keys. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-24 19:09:52 -04:00
Kent Overstreet	394ef278e1	bcachefs: Unit test fixes The peek_end() tests expect an empty btree. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-24 19:09:52 -04:00
Kent Overstreet	caab547686	bcachefs: Print mount opts earlier If we aren't mounting with the correct degraded option, it's helpful to know that before we fail to mount degraded. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-24 19:09:52 -04:00
Kent Overstreet	7cb85324c4	bcachefs: unlink: casefold d_invalidate casefolding results in additional aliases on lookup for the non-casefolded names - these need invalidating on unlink. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-24 19:09:52 -04:00
Kent Overstreet	9cdde3c7aa	bcachefs: Fix casefold lookups Add casefolding to bch2_lookup_trans: During the delay between when casefolding was written and when it was merged, the main filesystem lookup path grew self healing - which meant it was no longer using bch2_dirent_lookup_trans(), where casefolding on lookups happens. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-24 19:09:52 -04:00
Kent Overstreet	b9e1f873d2	bcachefs: Casefold is now a regular opts.h option Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-24 19:09:00 -04:00
Kent Overstreet	7a4a86618e	bcachefs: Implement fileattr_(get|set) inode_operations.fileattr_(get|set) didn't exist when the various flag ioctls were implemented - but they do now, which means we can delete a bunch of ioctl code in favor of standard VFS level wrappers. Closes: https://lore.kernel.org/linux-bcachefs/7ltgrgqgfummyrlvw7hnfhnu42rfiamoq3lpcvrjnlyytldmzp@yazbhusnztqn/ Cc: Petr Vorel <pvorel@suse.cz> Cc: Andrea Cervesato <andrea.cervesato@suse.de> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-21 19:50:56 -04:00
Kent Overstreet	4ede80a9a8	bcachefs: Allocator now copes with unaligned buckets We had a buggy release of bcachefs-tools that wasn't properly aligning bucket sizes. We can't ask users to reformat - and it's easy to teach the allocator to make sure writes are properly aligned. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-21 19:36:45 -04:00
Kent Overstreet	387df33129	bcachefs: Start copygc, rebalance threads earlier Previously, copygc and rebalance weren't started until the very end of mounting, after all recovery passes have finished. But copygc really should be started earlier, since it may be needed for allocations to make forward progress. Additionally, we've been seeing occasional bug reports where starting the kthread fails due to a pending signal - i.e. we're getting timed out by systemd (during a version upgrade), but we're not seeing the signal until mount is about to complete. Additionally, we now have copygc/rebalance explicitly wait for check_snapshots to complete (if being run); they require that for snapshot_is_ancestor() in the data move path. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-21 11:57:24 -04:00
Kent Overstreet	d64e8e842b	bcachefs: Refactor bch2_run_recovery_passes() Don't use a continue; this simplifies the next patch where run_recovery_passes() will be responsible for waking up copygc and rebalance at the appropriate time. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-21 11:56:43 -04:00
Kent Overstreet	10e42b6f25	bcachefs: bch2_copygc_wakeup() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-20 20:01:48 -04:00
Kent Overstreet	bfbb76ec98	bcachefs: Fix ref leak in write_super() found with the new enumerated_ref code Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-20 19:41:38 -04:00
Kent Overstreet	4c327d03d7	bcachefs: Change __journal_entry_close() assert to ERO We've got some reports of this happening in the wild, and need a bit more info to debug it: https://github.com/koverstreet/bcachefs/issues/854 https://www.reddit.com/r/bcachefs/comments/1k28kjm/surprise_soft_lockup/ Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-20 19:41:38 -04:00
Kent Overstreet	6468aef231	bcachefs: Ensure journal space is block size aligned We don't require that bucket size is block size aligned (although it should be!) - so we need to handle this in the journal code. This fixes an assertion pop in journal_entry_close(), where the journal entry overruns available space - after rounding it up to block size. Fixes: https://github.com/koverstreet/bcachefs/issues/854 Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-20 19:41:38 -04:00
Kent Overstreet	71f8e806a5	bcachefs: Stricter checks on "key allowed in this btree" Syzbot managed to come up with a filesystem where check/repair got rather confused at finding a reflink pointer in the inodes btree. Currently, the "key allowed in this btree" checks only apply at commit time, not read time - for forwards compatibility. It seems this is too loose. Now, strict key type allowed checks apply: - at commit time (no forward compatibility issues) - for btree node pointers - if it's a known btree, known key type, and the key type has the "BKEY_TYPE_strict_btree_checks" flag. This means we still have the option of using generic key types - e.g. KEY_TYPE_error, KEY_TYPE_set - on more existing btrees in the future, while most key types that are intended for only a specific btree get stricter checks. Reported-by: syzbot+baee8591f336cab0958b@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-20 19:41:38 -04:00
Kent Overstreet	417f01e726	bcachefs: Error ratelimiting is no longer only during fsck We now more often do repair automatically, without the user invoking fsck - and sometimes that can involve fixing lots of errors, so let's avoid flooding the dmesg log. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-20 19:41:38 -04:00
Kent Overstreet	aa6a591f0f	bcachefs: Fix null ptr deref in bch2_snapshot_tree_oldest_subvol() Reported-by: syzbot+baee8591f336cab0958b@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-20 19:41:38 -04:00
Kent Overstreet	4c0d2c67ac	bcachefs: Fix early startup error path Don't set JOURNAL_running until we're also calling journal_space_available() for the first time. If JOURNAL_running is set, shutdown will write an empty journal entry - but this will hit an assert in journal_entry_open() if we've never called journal_space_available(). Reported-by: syzbot+53bb24d476ef8368a7f0@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-20 19:41:38 -04:00
Linus Torvalds	9e99c1accb	bcachefs fixes for 6.15-rc3 Usual set of small fixes/logging improvements. One bigger user reported fix, for inode <-> dirent inconsistencies reported in fsck, after moving a subvolume that had been snapshotted. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmgBYaIACgkQE6szbY3K bnYcGQ/+K+LsEvGAZ5wtTwUN4KqJIYWhcHYcuLS2mHKf8PMbgYhL7TjmCwb9VWyr 0+GFQcJgfLsl++kX4j7CjG4gHd22aLiwbhMDmSt3r6c4aF29rG+zCpe4W1+7o60k UIKokfbLUV6b+0vF5bA/W3PmtXK7S8E0yAPMfWxv4/sACu8RUvrUJtrUCKEWwLzC bcrRGsN91456qNhCrOp3e4t3yZjiGtZIz+SbPYIxdNrZYIMURlGUm+f9sLH3O+2R NKsi41sggo/TmgUyspH3KCtMT88IDbN07F7O9/zcxgtdfzfC9l9FI6HnvRVSHDOV boFaH/NdRaIbg+O5kqZXYul+/EPXsYp5B77TL6KQ3jhv3q16uwpv9EL4v6HIwvz9 BTDOfI2y/+YWHMfrtzXgh3C9dZDPS7qxFFWjSjCs/lXwKVz46RjBWVmtQoTJSEmb Ee29kBGMpkwmH8fqr5KQheJUIeYewpyTVeB6orgtshnrr+aezS6zunIbk7fJ6+Ng Tc08H/Aqc2KGcyBS3KTLhbReQ1clQKGOqWJymeb1p2V3SMXfABMbh61B1VU1XulC Al5B7/w/WPwb+T2XZIM2qbmeoRJ8OBara5RWkx4HN8pcYuWV8H6GWJtRJQD/eKSO pOT5bz8z9N2n/otwrfLT5lfO2fNW1mULCAamn6iSzR+EDHyuaMU= =pi0D -----END PGP SIGNATURE----- Merge tag 'bcachefs-2025-04-17' of git://evilpiepirate.org/bcachefs Pull bcachefs fixes from Kent Overstreet: "Usual set of small fixes/logging improvements. One bigger user reported fix, for inode <-> dirent inconsistencies reported in fsck, after moving a subvolume that had been snapshotted" * tag 'bcachefs-2025-04-17' of git://evilpiepirate.org/bcachefs: bcachefs: Fix snapshotting a subvolume, then renaming it bcachefs: Add missing READ_ONCE() for metadata replicas bcachefs: snapshot_node_missing is now autofix bcachefs: Log message when incompat version requested but not enabled bcachefs: Print version_incompat_allowed on startup bcachefs: Silence extent_poisoned error messages bcachefs: btree_root_unreadable_and_scan_found_nothing now AUTOFIX bcachefs: fix bch2_dev_usage_full_read_fast() bcachefs: Don't print data read retry success on non-errors bcachefs: Add missing error handling bcachefs: Prevent granting write refs when filesystem is read-only	2025-04-17 15:08:29 -07:00
Kent Overstreet	261592ba06	bcachefs: Fix snapshotting a subvolume, then renaming it Subvolume roots and the dirents that point to them are special; they don't obey the normal snapshot versioning rules because they cross snapshot boundaries. We don't keep around older versions of subvolume dirents on rename - we don't need to, because subvolume dirents are only visible in the parent subvolume, and we wouldn't be able to match up the different dirent and inode versions due to crossing the snapshot ID boundary. That means that when we rename a subvolume that's been snapshotted, the older version of the subvolume root will become dangling - it won't have a dirent that points to it. That's expected, we just need to tell fsck that this is ok. Fixes: https://github.com/koverstreet/bcachefs/issues/856 Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-17 14:17:16 -04:00
Kent Overstreet	8dd3804bf4	bcachefs: Add missing READ_ONCE() for metadata replicas If we race with the user changing the metadata_replicas setting, this could cause us to get an incorrectly sized disk reservation. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-16 18:31:47 -04:00
Kent Overstreet	72b5259053	bcachefs: snapshot_node_missing is now autofix Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-15 17:37:39 -04:00
Kent Overstreet	c3b02e6d67	bcachefs: Log message when incompat version requested but not enabled Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-15 11:35:05 -04:00
Kent Overstreet	14bcf982f4	bcachefs: Print version_incompat_allowed on startup Let users know if incompatible features aren't enabled Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-15 11:35:05 -04:00
Kent Overstreet	a06459657e	bcachefs: Silence extent_poisoned error messages extent poisoning is partly so that we don't keep spewing the dmesg log when we've got unreadable data - we don't want to print these. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-15 11:34:37 -04:00
Kent Overstreet	8692c7db9a	bcachefs: btree_root_unreadable_and_scan_found_nothing now AUTOFIX This will likely mean that the btree had only one node - there was nothing or almost nothing in it, and we should reconstruct and continue. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-13 21:17:55 -04:00
Kent Overstreet	345731a389	bcachefs: fix bch2_dev_usage_full_read_fast() One reference to bch_dev_usage wasn't updated, which meant we weren't reading the full bch_dev_usage_full - oops. Fixes: `955ba7b5ea` ("bcachefs: bch_dev_usage_full") Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-13 14:54:34 -04:00
Kent Overstreet	7dfd42a07a	bcachefs: Don't print data read retry success on non-errors We may end up in the data read retry path when reading cached data and racing with invalidation, or on checksum error when we were reading into a userspace buffer that might have been modified while the read was in flight. These aren't real errors, so we shouldn't print the 'retry success' message. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-13 10:17:00 -04:00
Alan Huang	806776ad9c	bcachefs: Add missing error handling Reported-by: syzbot+d10151bf01574a09a915@syzkaller.appspotmail.com Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-13 07:56:28 -04:00
Gabriel Shahrouzi	d62922ba3c	bcachefs: Prevent granting write refs when filesystem is read-only Fix a shutdown WARNING in bch2_dev_free caused by active write I/O references (ca->io_ref[WRITE]) on a device being freed. The problem occurs when: - The filesystem is marked read-only (BCH_FS_rw clear in c->flags). - A subsequent operation (e.g., error handling for device removal) incorrectly tries to grant write references back to a device. - During final shutdown, the read-only flag causes the system to skip stopping write I/O references (bch2_dev_io_ref_stop(ca, WRITE)). - The leftover active write reference triggers the WARN_ON in bch2_dev_free. Prevent this by checking if the filesystem is read-only before attempting to grant write references to a device in the problematic code path. Ensure consistency between the filesystem state flag and the device I/O reference state during shutdown. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-13 07:56:28 -04:00
Linus Torvalds	ef77858826	bcachefs fixes for 6.15-rc2 Mostly minor fixes. Eric Biggers' crypto API conversion is included because of long standing sporadic crashes - mostly, but not entirely syzbot - in the crypto API code when calling poly1305, which have been nigh impossible to reproduce and debug. His rework deletes the code where we've seen the crashes, so either it'll be a fix or we'll end up with backtraces we can debug. (Thanks Eric!). -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmf4Yq8ACgkQE6szbY3K bnb1fxAAu68Ll/4PLWr3xHVp7ETWgEZSzuwRqA87fqs/Q0jNRC2aDIO03Wmj28qM ckEM3PFPuiQubLhOmU21Osta/sFU6GxL0IggMoEC50F5XiVlcKRiNSWhRLnr07Qp v7sc5MQ1HDGnZXQMcdRymzRWixn2hzdRMwKOvbhBHuj0YUPd4yk5I/+AtJi5SAnT ri+dGIQLBRmG7J2pd4AZNza0YZ5pkTAFj9/4wRVoJdKX2pPzf40e7qzYNPt4/6rK A6P9ecU3TDDDQEE4S7s8Dng4rXwsa+9qTUcpXnTTC1L6YbbnZd/IQYzWI4b+FUsS wqnUD+aE7UEMANZh891QlJpGj3ih/6z8opUP4T6RdsVuJwt9X1vFJY99CsOTEf1o 7jAcssL+ueEWPZj8tBoN1niujikyFsXM+xKiUOMZxbuM6BhE40j/WrA77sRhI5+I 7DXlf5s8SDh+gw0IGUboBJe3ofGisRXnfxeZAKQHGHgtEFboY4bDAURcGW4MbIqE uN5Cd+5IJlcKmJdXLCbHMb5KktfBNWu9/VrOMcZ2QHhIuOfd3fFgLzE0ZEroj4lN kTWxpzKeNDt3bPF4esYnvduafHDbzClwfkTt5TBgcOeE4TcIL2mOmweLE2LTKIwW xr5Xhqx1/9//PeaOTwxbCoeZ26G0Q9B8L1+eUZgjPS0FcRdZH9w= =DpNC -----END PGP SIGNATURE----- Merge tag 'bcachefs-2025-04-10' of git://evilpiepirate.org/bcachefs Pull bcachefs fixes from Kent Overstreet: "Mostly minor fixes. Eric Biggers' crypto API conversion is included because of long standing sporadic crashes - mostly, but not entirely syzbot - in the crypto API code when calling poly1305, which have been nigh impossible to reproduce and debug. His rework deletes the code where we've seen the crashes, so either it'll be a fix or we'll end up with backtraces we can debug. 
(Thanks Eric!)" * tag 'bcachefs-2025-04-10' of git://evilpiepirate.org/bcachefs: bcachefs: Use sort_nonatomic() instead of sort() bcachefs: Remove unnecessary softdep on xxhash bcachefs: use library APIs for ChaCha20 and Poly1305 bcachefs: Fix duplicate "ro,read_only" in opts at startup bcachefs: Fix UAF in bchfs_read() bcachefs: Use cpu_to_le16 for dirent lengths bcachefs: Fix type for parameter in journal_advance_devs_to_next_bucket bcachefs: Fix escape sequence in prt_printf	2025-04-10 19:38:22 -07:00
Linus Torvalds	97c484ccb8	CRC cleanups for 6.15 Finish cleaning up the CRC kconfig options by removing the remaining unnecessary prompts and an unnecessary 'default y', removing CONFIG_LIBCRC32C, and documenting all the CRC library options. -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCZ/P7QhQcZWJpZ2dlcnNA Z29vZ2xlLmNvbQAKCRDzXCl4vpKOKyoOAQCynFcS1dWuD27S+SdUREmBjMAoZo5M zdsIvlPv9KLycgD/QX5lXjW3KIYY6jQ8vHUuLVwfDl/JEp4GJS9dLGU+agg= =0R1T -----END PGP SIGNATURE----- Merge tag 'crc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux Pull CRC cleanups from Eric Biggers: "Finish cleaning up the CRC kconfig options by removing the remaining unnecessary prompts and an unnecessary 'default y', removing CONFIG_LIBCRC32C, and documenting all the CRC library options" * tag 'crc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux: lib/crc: remove CONFIG_LIBCRC32C lib/crc: document all the CRC library kconfig options lib/crc: remove unnecessary prompt for CONFIG_CRC_ITU_T lib/crc: remove unnecessary prompt for CONFIG_CRC_T10DIF lib/crc: remove unnecessary prompt for CONFIG_CRC16 lib/crc: remove unnecessary prompt for CONFIG_CRC_CCITT lib/crc: remove unnecessary prompt for CONFIG_CRC32 and drop 'default y'	2025-04-08 12:09:28 -07:00
Kent Overstreet	55fd97fbc4	bcachefs: Use sort_nonatomic() instead of sort() Fixes "task out to lunch" warnings during recovery on large machines with lots of dirty data in the journal. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-06 19:33:53 -04:00
Eric Biggers	c92896ffb7	bcachefs: Remove unnecessary softdep on xxhash As with the other algorithms, bcachefs does not access xxhash through the crypto API. So there is no need to use a module softdep to ensure that it is loaded. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-06 19:33:53 -04:00
Eric Biggers	4bf4b5046d	bcachefs: use library APIs for ChaCha20 and Poly1305 Just use the ChaCha20 and Poly1305 libraries instead of the clunky crypto API. This is much simpler. It is also slightly faster, since the libraries provide more direct access to the same architecture-optimized ChaCha20 and Poly1305 code. I've tested that existing encrypted bcachefs filesystems can continue to be accessed with this patch applied. Reviewed-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-06 19:33:53 -04:00
Kent Overstreet	1ec94a9f6d	bcachefs: Fix duplicate "ro,read_only" in opts at startup Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-06 19:33:53 -04:00
Kent Overstreet	34b47e3d73	bcachefs: Fix UAF in bchfs_read() Commit `3ba0240a87` fixed a bug in the read retry path in __bch2_read(), and changed bchfs_read() to match - to avoid a landmine if bch2_read_extent() ever starts returning transaction restarts. But that was incorrect, because bchfs_read() doesn't use a separate stack allocated bvec_iter, it uses the one in the rbio being submitted. Add a comment explaining the issue, and revert the buggy change. Fixes: `3ba0240a87` ("bcachefs: Fix silent short reads in data read retry path") Reported-by: syzbot+2deb10b8dc9aae6fab67@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-06 19:13:43 -04:00
Gabriel Shahrouzi	4a22a73323	bcachefs: Use cpu_to_le16 for dirent lengths Prevent incorrect byte ordering for big-endian systems. Signed-off-by: Gabriel Shahrouzi <gshahrouzi@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-06 19:13:43 -04:00
Gabriel Shahrouzi	afc5444e4d	bcachefs: Fix type for parameter in journal_advance_devs_to_next_bucket Replace u64 with __le64 to match the expected parameter type. Ensure consistency both in function calls and within the function itself. Signed-off-by: Gabriel Shahrouzi <gshahrouzi@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-06 19:13:43 -04:00
Gabriel Shahrouzi	f5cd27ec71	bcachefs: Fix escape sequence in prt_printf Remove backslash before format specifier. Ensure correct output. Signed-off-by: Gabriel Shahrouzi <gshahrouzi@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-06 19:13:43 -04:00
Thomas Gleixner	8fa7292fee	treewide: Switch/rename to timer_delete[_sync]() timer_delete[_sync]() replaces del_timer[_sync](). Convert the whole tree over and remove the historical wrapper inlines. Conversion was done with coccinelle plus manual fixups where necessary. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>	2025-04-05 10:30:12 +02:00
Eric Biggers	b261d22220	lib/crc: remove CONFIG_LIBCRC32C Now that LIBCRC32C does nothing besides select CRC32, make every option that selects LIBCRC32C instead select CRC32 directly. Then remove LIBCRC32C. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com> Acked-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20250401221600.24878-8-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>	2025-04-04 11:31:42 -07:00
Linus Torvalds	56770e24f6	bcachefs fixes for 6.15-rc1 More notable fixes: - Fix for striping behaviour on tiering filesystems where replicas exceeds durability on destination target - Fix a race in device removal where deleting alloc info races with the discard worker - Some small stack usage improvements: this is just enough for KMSAN builds to not blow the stack, more is queued up for 6.16. -----BEGIN PGP SIGNATURE----- iQIyBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmfu7pMACgkQE6szbY3K bnY6rw/3W4dho57OPjOoHUbQ7A7IK1hI4SFvGtDDb3vX1RjF2r+RbdBupMAd0zGj T+5SzhYCQLGbyfBa6MW+iPqiHokZsG904+3mRogOf9cpz2Mup9ZOq/vV+Z7ndaF8 2i9wpQTb7GShkSaXkeTvQqnx3YAUxVcRB2ExraTXmv4wIxr1SYyJEeakmMBDasJB UanXXVHzrKo9WLiqWz0JSZCiuQW2v03P84zZo1d/GyMKlTxYDt5aAteos77lJBef 5CWVr4/HKKozt/vI2qHQ+3LJXktLjvb07zoENXwadmgQawYA3nQ+9jLT3Q0FKjXG bK28AHTtiXgWsYsbCs5sVh1+WLPdEj0UBBoFZGWo++TzaN2hXhoMsFTQfuddhaEh W63MWtelv4TGIVOEFk+ayHRgPL6ajhCsa1boHS9EKdosl2nl9Vk9Nq0i++hYZDGW KhWqENT9E5EpVCnZ6H4m1tsXprWavNqXnkOJzXW0T2F3t8+94zp1n6YXkwDdgLfs l+xTEEAL5J8lvlfSS6dW7QcMSMtMKbo3+qlerpH8J4zBZJBbb2nF1ggCtpYg6zFt 4Jgs5FPQLVqWsPXQr4CaSF2UIt3zMPnNIawL1cEpRBU1j35qo0e/kxIjEpS0Pnjt mX67gBlodY54/pwGGLfc/Vkw4xqh//dqTmYIdHkibdAEvKf0dg== =2TfM -----END PGP SIGNATURE----- Merge tag 'bcachefs-2025-04-03' of git://evilpiepirate.org/bcachefs Pull more bcachefs updates from Kent Overstreet: "More notable fixes: - Fix for striping behaviour on tiering filesystems where replicas exceeds durability on destination target - Fix a race in device removal where deleting alloc info races with the discard worker - Some small stack usage improvements: this is just enough for KMSAN builds to not blow the stack, more is queued up for 6.16" * tag 'bcachefs-2025-04-03' of git://evilpiepirate.org/bcachefs: bcachefs: Fix "journal stuck" during recovery bcachefs: backpointer_get_key: check for null from peek_slot() bcachefs: Fix null ptr deref in invalidate_one_bucket() bcachefs: Fix check_snapshot_exists() restart handling bcachefs: use nonblocking 
variant of print_string_as_lines in error path bcachefs: Fix scheduling while atomic from logging changes bcachefs: Add error handling for zlib_deflateInit2() bcachefs: add missing selection of XARRAY_MULTI bcachefs: bch_dev_usage_full bcachefs: Kill btree_iter.trans bcachefs: do_trace_key_cache_fill() bcachefs: Split up bch_dev.io_ref bcachefs: fix ref leak in btree_node_read_all_replicas bcachefs: Fix null ptr deref in bch2_write_endio() bcachefs: Fix field spanning write warning bcachefs: Fix striping behaviour	2025-04-03 15:39:47 -07:00
Kent Overstreet	77ad1df82b	bcachefs: Fix "journal stuck" during recovery If we crash when the journal pin fifo is completely full - i.e. we're at the maximum number of dirty journal entries - that may put us in a sticky situation in recovery, as journal replay will need to be able to open new journal entries in order to get going. bch2_fs_journal_start() already had provisions for resizing the journal pin fifo if needed, but it needs a fudge factor to ensure there's room for journal replay. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-03 12:11:43 -04:00
Kent Overstreet	2581f89ac8	bcachefs: backpointer_get_key: check for null from peek_slot() peek_slot() doesn't normally return bkey_s_c_null - except when we ask for a key at a btree level that doesn't exist, which can happen here. We might want to revisit this, but we'll have to look over all the places where we use peek_slot() on interior nodes. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-03 12:11:43 -04:00
Kent Overstreet	39ebd74864	bcachefs: Fix null ptr deref in invalidate_one_bucket() bch2_backpointer_get_key() returns bkey_s_c_null when the target isn't found. backpointer_get_key() flags the error, so there's nothing else to do here - just skip it and move on. Link: https://github.com/koverstreet/bcachefs/issues/847 Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-03 12:11:43 -04:00
Kent Overstreet	83d539b1b0	bcachefs: Fix check_snapshot_exists() restart handling Codepaths that create entries in the snapshots btree currently call bch2_mark_snapshot(), which updates the in-memory snapshot table, before transaction commit. This is because bch2_mark_snapshot() is an atomic trigger, run with btree write locks held, and isn't allowed to fail - but it might need to reallocate the table, hence we call it early when we're still allowed to fail. This is generally harmless - if we fail, we'll have left an entry in the snapshots table around, but nothing will reference it and it'll get overwritten if reused by another transaction. But check_snapshot_exists(), which reconstructs snapshots when the snapshots btree has been corrupted or lost, was erroneously rechecking if the snapshot exists inside the transaction commit loop - so on transaction restart (in this case mem_realloced), the second iteration would return without repairing. This code needs some cleanup: splitting out a "maybe realloc snapshots table" helper would have avoided this, that will be in the next patch. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-03 12:11:43 -04:00
Bharadwaj Raju	570f5050bb	bcachefs: use nonblocking variant of print_string_as_lines in error path The inconsistency error path calls print_string_as_lines, which calls console_lock, which is a potentially-sleeping function and so can't be called in an atomic context. Replace calls to it with the nonblocking variant which is safe to call. Signed-off-by: Bharadwaj Raju <bharadwaj.raju777@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-03 12:11:42 -04:00
Kent Overstreet	b2ffadcc7f	bcachefs: Fix scheduling while atomic from logging changes Two fixes from the recent logging changes: bch2_inconsistent(), bch2_fs_inconsistent() be called from interrupt context, or with rcu_read_lock() held. The one syzbot found is in bch2_bkey_pick_read_device bch2_dev_rcu bch2_fs_inconsistent We're starting to switch to lift the printbufs up to higher levels so we can emit better log messages and print them all in one go (avoid garbling), so that conversion will help with spotting these in the future; when we declare a printbuf it must be flagged if we're in an atomic context. Secondly, in btree_node_write_endio: 00085 BUG: sleeping function called from invalid context at include/linux/sched/mm.h:321 00085 in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 618, name: bch-reclaim/fa6 00085 preempt_count: 10001, expected: 0 00085 RCU nest depth: 0, expected: 0 00085 4 locks held by bch-reclaim/fa6/618: 00085 #0: ffffff80d7ccad68 (&j->reclaim_lock){+.+.}-{4:4}, at: bch2_journal_reclaim_thread+0x84/0x198 00085 #1: ffffff80d7c84218 (&c->btree_trans_barrier){.+.+}-{0:0}, at: __bch2_trans_get+0x1c0/0x440 00085 #2: ffffff80cd3f8140 (bcachefs_btree){+.+.}-{0:0}, at: __bch2_trans_get+0x22c/0x440 00085 #3: ffffff80c3823c20 (&vblk->vqs[i].lock){-.-.}-{3:3}, at: virtblk_done+0x58/0x130 00085 irq event stamp: 328 00085 hardirqs last enabled at (327): [<ffffffc080073a14>] finish_task_switch.isra.0+0xbc/0x2a0 00085 hardirqs last disabled at (328): [<ffffffc080971a10>] el1_interrupt+0x20/0x60 00085 softirqs last enabled at (0): [<ffffffc08002f920>] copy_process+0x7c8/0x2118 00085 softirqs last disabled at (0): [<0000000000000000>] 0x0 00085 Preemption disabled at: 00085 [<ffffffc08003ada0>] irq_enter_rcu+0x18/0x90 00085 CPU: 8 UID: 0 PID: 618 Comm: bch-reclaim/fa6 Not tainted 6.14.0-rc6-ktest-g04630bde23e8 #18798 00085 Hardware name: linux,dummy-virt (DT) 00085 Call trace: 00085 show_stack+0x1c/0x30 (C) 00085 dump_stack_lvl+0x84/0xc0 00085 
dump_stack+0x14/0x20 00085 __might_resched+0x180/0x288 00085 __might_sleep+0x4c/0x88 00085 __kmalloc_node_track_caller_noprof+0x34c/0x3e0 00085 krealloc_noprof+0x1a0/0x2d8 00085 bch2_printbuf_make_room+0x9c/0x120 00085 bch2_prt_printf+0x60/0x1b8 00085 btree_node_write_endio+0x1b0/0x2d8 00085 bio_endio+0x138/0x1f0 00085 btree_node_write_endio+0xe8/0x2d8 00085 bio_endio+0x138/0x1f0 00085 blk_update_request+0x220/0x4c0 00085 blk_mq_end_request+0x28/0x148 00085 virtblk_request_done+0x64/0xe8 00085 blk_mq_complete_request+0x34/0x40 00085 virtblk_done+0x78/0x130 00085 vring_interrupt+0x6c/0xb0 00085 __handle_irq_event_percpu+0x8c/0x2e0 00085 handle_irq_event+0x50/0xb0 00085 handle_fasteoi_irq+0xc4/0x250 00085 handle_irq_desc+0x44/0x60 00085 generic_handle_domain_irq+0x20/0x30 00085 gic_handle_irq+0x54/0xc8 00085 call_on_irq_stack+0x24/0x40 Reported-by: syzbot+c82cd2906e2f192410bb@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-03 12:11:42 -04:00
Wentao Liang	9364f17ba4	bcachefs: Add error handling for zlib_deflateInit2() In attempt_compress(), the return value of zlib_deflateInit2() needs to be checked. A proper implementation can be found in pstore_compress(). Add an error check and return 0 immediately if the initialization fails. Fixes: `986e9842fb` ("bcachefs: Compression levels") Signed-off-by: Wentao Liang <vulab@iscas.ac.cn> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-03 12:11:42 -04:00
Eric Biggers	a07c43e6c2	bcachefs: add missing selection of XARRAY_MULTI When CONFIG_XARRAY_MULTI is not set, reading from a bcachefs file hits the 'BUG_ON(order > 0);' in xas_set_order(), because it tries to insert a large folio in the page cache. Fix this by making bcachefs select XARRAY_MULTI. Fixes: `be212d86b1` ("bcachefs: bs > ps support") Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-02 10:24:34 -04:00
Kent Overstreet	955ba7b5ea	bcachefs: bch_dev_usage_full All the fastpaths that need device usage don't need the sector totals or fragmentation, just bucket counts. Split bch_dev_usage up into two different versions, the normal one with just bucket counts. This is also a stack usage improvement, since we have a bch_dev_usage on the stack in the allocation path. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-02 10:24:34 -04:00
Kent Overstreet	9180ad2e16	bcachefs: Kill btree_iter.trans This was planned to be done ages ago, now finally completed; there are places where we have quite a few btree_trans objects on the stack, so this reduces stack usage somewhat. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-02 10:24:34 -04:00
Kent Overstreet	1c8f4587d2	bcachefs: do_trace_key_cache_fill() Reducing stack frame usage; this moves the printbuf out of the main stack frame. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-02 10:24:34 -04:00
Kent Overstreet	dcffc3b1ae	bcachefs: Split up bch_dev.io_ref We now have separate per device io_refs for read and write access. This fixes a device removal bug where the discard workers were still running while we're removing alloc info for that device. It's also a bit of hardening; we no longer allow writes to devices that are read-only. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-02 10:24:34 -04:00
Linus Torvalds	eb0ece1602	- The 6 patch series "Enable strict percpu address space checks" from Uros Bizjak uses x86 named address space qualifiers to provide compile-time checking of percpu area accesses. This has caused a small amount of fallout - two or three issues were reported. In all cases the calling code was found to be incorrect. - The 4 patch series "Some cleanup for memcg" from Chen Ridong implements some relatively minor cleanups for the memcontrol code. - The 17 patch series "mm: fixes for device-exclusive entries (hmm)" from David Hildenbrand fixes a boatload of issues which David found when using device-exclusive PTE entries when THP is enabled. More work is needed, but this makes things better - our own HMM selftests now succeed. - The 2 patch series "mm: zswap: remove z3fold and zbud" from Yosry Ahmed removes the z3fold and zbud implementations. They have been deprecated for half a year and nobody has complained. - The 5 patch series "mm: further simplify VMA merge operation" from Lorenzo Stoakes implements numerous simplifications in this area. No runtime effects are anticipated. - The 4 patch series "mm/madvise: remove redundant mmap_lock operations from process_madvise()" from SeongJae Park rationalizes the locking in the madvise() implementation. Performance gains of 20-25% were observed in one MADV_DONTNEED microbenchmark. - The 12 patch series "Tiny cleanup and improvements about SWAP code" from Baoquan He contains a number of touchups to issues which Baoquan noticed when working on the swap code. - The 2 patch series "mm: kmemleak: Usability improvements" from Catalin Marinas implements a couple of improvements to the kmemleak user-visible output. - The 2 patch series "mm/damon/paddr: fix large folios access and schemes handling" from Usama Arif provides a couple of fixes for DAMON's handling of large folios.
- The 3 patch series "mm/damon/core: fix wrong and/or useless damos_walk() behaviors" from SeongJae Park fixes a few issues with the accuracy of kdamond's walking of DAMON regions. - The 3 patch series "expose mapping wrprotect, fix fb_defio use" from Lorenzo Stoakes changes the interaction between framebuffer deferred-io and core MM. No functional changes are anticipated - this is preparatory work for the future removal of page structure fields. - The 4 patch series "mm/damon: add support for hugepage_size DAMOS filter" from Usama Arif adds a DAMOS filter which permits the filtering by huge page sizes. - The 4 patch series "mm: permit guard regions for file-backed/shmem mappings" from Lorenzo Stoakes extends the guard region feature from its present "anon mappings only" state. The feature now covers shmem and file-backed mappings. - The 4 patch series "mm: batched unmap lazyfree large folios during reclamation" from Barry Song cleans up and speeds up the unmapping for pte-mapped large folios. - The 18 patch series "reimplement per-vma lock as a refcount" from Suren Baghdasaryan puts the vm_lock back into the vma. Our reasons for pulling it out were largely bogus and that change made the code more messy. This patchset provides small (0-10%) improvements on one microbenchmark. - The 5 patch series "Docs/mm/damon: misc DAMOS filters documentation fixes and improves" from SeongJae Park does some maintenance work on the DAMON docs. - The 27 patch series "hugetlb/CMA improvements for large systems" from Frank van der Linden addresses a pile of issues which have been observed when using CMA on large machines. - The 2 patch series "mm/damon: introduce DAMOS filter type for unmapped pages" from SeongJae Park enables users of DAMON/DAMOS to filter by the page's mapped/unmapped status. - The 19 patch series "zsmalloc/zram: there be preemption" from Sergey Senozhatsky teaches zram to run its compression and decompression operations preemptibly.
- The 12 patch series "selftests/mm: Some cleanups from trying to run them" from Brendan Jackman fixes a pile of unrelated issues which Brendan encountered while running our selftests. - The 2 patch series "fs/proc/task_mmu: add guard region bit to pagemap" from Lorenzo Stoakes permits userspace to use /proc/pid/pagemap to determine whether a particular page is a guard page. - The 7 patch series "mm, swap: remove swap slot cache" from Kairui Song removes the swap slot cache from the allocation path - it simply wasn't being effective. - The 5 patch series "mm: cleanups for device-exclusive entries (hmm)" from David Hildenbrand implements a number of unrelated cleanups in this code. - The 5 patch series "mm: Rework generic PTDUMP configs" from Anshuman Khandual implements a number of preparatory cleanups to the GENERIC_PTDUMP Kconfig logic. - The 8 patch series "mm/damon: auto-tune aggregation interval" from SeongJae Park implements a feedback-driven automatic tuning feature for DAMON's aggregation interval tuning. - The 5 patch series "Fix lazy mmu mode" from Ryan Roberts fixes some issues in powerpc, sparc and x86 lazy MMU implementations. Ryan did this in preparation for implementing lazy mmu mode for arm64 to optimize vmalloc. - The 2 patch series "mm/page_alloc: Some clarifications for migratetype fallback" from Brendan Jackman reworks some commentary to make the code easier to follow. - The 3 patch series "page_counter cleanup and size reduction" from Shakeel Butt cleans up the page_counter code and fixes a size increase which we accidentally added late last year. - The 3 patch series "Add a command line option that enables control of how many threads should be used to allocate huge pages" from Thomas Prescher does that. It allows the careful operator to significantly reduce boot time by tuning the parallelization of huge page initialization.
- The 3 patch series "Fix calculations in trace_balance_dirty_pages() for cgwb" from Tang Yizhou fixes the tracing output from the dirty page balancing code. - The 9 patch series "mm/damon: make allow filters after reject filters useful and intuitive" from SeongJae Park improves the handling of allow and reject filters. Behaviour is made more consistent and the documentation is updated accordingly. - The 5 patch series "Switch zswap to object read/write APIs" from Yosry Ahmed updates zswap to the new object read/write APIs and thus permits the removal of some legacy code from zpool and zsmalloc. - The 6 patch series "Some trivial cleanups for shmem" from Baolin Wang does as it claims. - The 20 patch series "fs/dax: Fix ZONE_DEVICE page reference counts" from Alistair Popple regularizes the weird ZONE_DEVICE page refcount handling in DAX, permitting the removal of a number of special-case checks. - The 4 patch series "refactor mremap and fix bug" from Lorenzo Stoakes is a preparatory refactoring and cleanup of the mremap() code. - The 20 patch series "mm: MM owner tracking for large folios (!hugetlb) + CONFIG_NO_PAGE_MAPCOUNT" from David Hildenbrand reworks the manner in which we determine whether a large folio is known to be mapped exclusively into a single MM. - The 8 patch series "mm/damon: add sysfs dirs for managing DAMOS filters based on handling layers" from SeongJae Park adds a couple of new sysfs directories to ease the management of DAMON/DAMOS filters. - The 13 patch series "arch, mm: reduce code duplication in mem_init()" from Mike Rapoport consolidates many per-arch implementations of mem_init() into generic code, where that is practical. - The 13 patch series "mm/damon/sysfs: commit parameters online via damon_call()" from SeongJae Park continues the cleaning up of sysfs access to DAMON internal data.
- The 3 patch series "mm: page_ext: Introduce new iteration API" from Luiz Capitulino reworks the page_ext initialization to fix a boot-time crash which was observed with an unusual combination of compile and cmdline options. - The 8 patch series "Buddy allocator like (or non-uniform) folio split" from Zi Yan reworks the code to split a folio into smaller folios. The main benefit is lessened memory consumption: fewer post-split folios are generated. - The 2 patch series "Minimize xa_node allocation during xarry split" from Zi Yan reduces the number of xarray xa_nodes which are generated during an xarray split. - The 2 patch series "drivers/base/memory: Two cleanups" from Gavin Shan performs some maintenance work on the drivers/base/memory code. - The 3 patch series "Add tracepoints for lowmem reserves, watermarks and totalreserve_pages" from Martin Liu adds some more tracepoints to the page allocator code. - The 4 patch series "mm/madvise: cleanup requests validations and classifications" from SeongJae Park cleans up some warts which SeongJae observed during his earlier madvise work. - The 3 patch series "mm/hwpoison: Fix regressions in memory failure handling" from Shuai Xue addresses two quite serious regressions which Shuai has observed in the memory-failure implementation. - The 5 patch series "mm: reliable huge page allocator" from Johannes Weiner makes huge page allocations cheaper and more reliable by reducing fragmentation. - The 5 patch series "Minor memcg cleanups & prep for memdescs" from Matthew Wilcox is preparatory work for the future implementation of memdescs. - The 4 patch series "track memory used by balloon drivers" from Nico Pache introduces a way to track memory used by our various balloon drivers. - The 2 patch series "mm/damon: introduce DAMOS filter type for active pages" from Nhat Pham permits users to filter for active/inactive pages, separately for file and anon pages. 
- The 2 patch series "Adding Proactive Memory Reclaim Statistics" from Hao Jia separates the proactive reclaim statistics from the direct reclaim statistics. - The 2 patch series "mm/vmscan: don't try to reclaim hwpoison folio" from Jinjiang Tu fixes our handling of hwpoisoned pages within the reclaim code. -----BEGIN PGP SIGNATURE----- iHQEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZ+nZaAAKCRDdBJ7gKXxA jsOWAPiP4r7CJHMZRK4eyJOkvS1a1r+TsIarrFZtjwvf/GIfAQCEG+JDxVfUaUSF Ee93qSSLR1BkNdDw+931Pu0mXfbnBw== =Pn2K -----END PGP SIGNATURE----- Merge tag 'mm-stable-2025-03-30-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - The series "Enable strict percpu address space checks" from Uros Bizjak uses x86 named address space qualifiers to provide compile-time checking of percpu area accesses. This has caused a small amount of fallout - two or three issues were reported. In all cases the calling code was found to be incorrect. - The series "Some cleanup for memcg" from Chen Ridong implements some relatively minor cleanups for the memcontrol code. - The series "mm: fixes for device-exclusive entries (hmm)" from David Hildenbrand fixes a boatload of issues which David found when using device-exclusive PTE entries when THP is enabled. More work is needed, but this makes things better - our own HMM selftests now succeed. - The series "mm: zswap: remove z3fold and zbud" from Yosry Ahmed removes the z3fold and zbud implementations. They have been deprecated for half a year and nobody has complained. - The series "mm: further simplify VMA merge operation" from Lorenzo Stoakes implements numerous simplifications in this area. No runtime effects are anticipated. - The series "mm/madvise: remove redundant mmap_lock operations from process_madvise()" from SeongJae Park rationalizes the locking in the madvise() implementation. Performance gains of 20-25% were observed in one MADV_DONTNEED microbenchmark.
- The series "Tiny cleanup and improvements about SWAP code" from Baoquan He contains a number of touchups to issues which Baoquan noticed when working on the swap code. - The series "mm: kmemleak: Usability improvements" from Catalin Marinas implements a couple of improvements to the kmemleak user-visible output. - The series "mm/damon/paddr: fix large folios access and schemes handling" from Usama Arif provides a couple of fixes for DAMON's handling of large folios. - The series "mm/damon/core: fix wrong and/or useless damos_walk() behaviors" from SeongJae Park fixes a few issues with the accuracy of kdamond's walking of DAMON regions. - The series "expose mapping wrprotect, fix fb_defio use" from Lorenzo Stoakes changes the interaction between framebuffer deferred-io and core MM. No functional changes are anticipated - this is preparatory work for the future removal of page structure fields. - The series "mm/damon: add support for hugepage_size DAMOS filter" from Usama Arif adds a DAMOS filter which permits the filtering by huge page sizes. - The series "mm: permit guard regions for file-backed/shmem mappings" from Lorenzo Stoakes extends the guard region feature from its present "anon mappings only" state. The feature now covers shmem and file-backed mappings. - The series "mm: batched unmap lazyfree large folios during reclamation" from Barry Song cleans up and speeds up the unmapping for pte-mapped large folios. - The series "reimplement per-vma lock as a refcount" from Suren Baghdasaryan puts the vm_lock back into the vma. Our reasons for pulling it out were largely bogus and that change made the code more messy. This patchset provides small (0-10%) improvements on one microbenchmark. - The series "Docs/mm/damon: misc DAMOS filters documentation fixes and improves" from SeongJae Park does some maintenance work on the DAMON docs. 
- The series "hugetlb/CMA improvements for large systems" from Frank van der Linden addresses a pile of issues which have been observed when using CMA on large machines. - The series "mm/damon: introduce DAMOS filter type for unmapped pages" from SeongJae Park enables users of DAMON/DAMOS to filter by the page's mapped/unmapped status. - The series "zsmalloc/zram: there be preemption" from Sergey Senozhatsky teaches zram to run its compression and decompression operations preemptibly. - The series "selftests/mm: Some cleanups from trying to run them" from Brendan Jackman fixes a pile of unrelated issues which Brendan encountered while running our selftests. - The series "fs/proc/task_mmu: add guard region bit to pagemap" from Lorenzo Stoakes permits userspace to use /proc/pid/pagemap to determine whether a particular page is a guard page. - The series "mm, swap: remove swap slot cache" from Kairui Song removes the swap slot cache from the allocation path - it simply wasn't being effective. - The series "mm: cleanups for device-exclusive entries (hmm)" from David Hildenbrand implements a number of unrelated cleanups in this code. - The series "mm: Rework generic PTDUMP configs" from Anshuman Khandual implements a number of preparatory cleanups to the GENERIC_PTDUMP Kconfig logic. - The series "mm/damon: auto-tune aggregation interval" from SeongJae Park implements a feedback-driven automatic tuning feature for DAMON's aggregation interval. - The series "Fix lazy mmu mode" from Ryan Roberts fixes some issues in powerpc, sparc and x86 lazy MMU implementations. Ryan did this in preparation for implementing lazy mmu mode for arm64 to optimize vmalloc. - The series "mm/page_alloc: Some clarifications for migratetype fallback" from Brendan Jackman reworks some commentary to make the code easier to follow.
- The series "page_counter cleanup and size reduction" from Shakeel Butt cleans up the page_counter code and fixes a size increase which we accidentally added late last year. - The series "Add a command line option that enables control of how many threads should be used to allocate huge pages" from Thomas Prescher does that. It allows the careful operator to significantly reduce boot time by tuning the parallelization of huge page initialization. - The series "Fix calculations in trace_balance_dirty_pages() for cgwb" from Tang Yizhou fixes the tracing output from the dirty page balancing code. - The series "mm/damon: make allow filters after reject filters useful and intuitive" from SeongJae Park improves the handling of allow and reject filters. Behaviour is made more consistent and the documentation is updated accordingly. - The series "Switch zswap to object read/write APIs" from Yosry Ahmed updates zswap to the new object read/write APIs and thus permits the removal of some legacy code from zpool and zsmalloc. - The series "Some trivial cleanups for shmem" from Baolin Wang does as it claims. - The series "fs/dax: Fix ZONE_DEVICE page reference counts" from Alistair Popple regularizes the weird ZONE_DEVICE page refcount handling in DAX, permitting the removal of a number of special-case checks. - The series "refactor mremap and fix bug" from Lorenzo Stoakes is a preparatory refactoring and cleanup of the mremap() code. - The series "mm: MM owner tracking for large folios (!hugetlb) + CONFIG_NO_PAGE_MAPCOUNT" from David Hildenbrand reworks the manner in which we determine whether a large folio is known to be mapped exclusively into a single MM. - The series "mm/damon: add sysfs dirs for managing DAMOS filters based on handling layers" from SeongJae Park adds a couple of new sysfs directories to ease the management of DAMON/DAMOS filters.
- The series "arch, mm: reduce code duplication in mem_init()" from Mike Rapoport consolidates many per-arch implementations of mem_init() into generic code, where that is practical. - The series "mm/damon/sysfs: commit parameters online via damon_call()" from SeongJae Park continues the cleaning up of sysfs access to DAMON internal data. - The series "mm: page_ext: Introduce new iteration API" from Luiz Capitulino reworks the page_ext initialization to fix a boot-time crash which was observed with an unusual combination of compile and cmdline options. - The series "Buddy allocator like (or non-uniform) folio split" from Zi Yan reworks the code to split a folio into smaller folios. The main benefit is lessened memory consumption: fewer post-split folios are generated. - The series "Minimize xa_node allocation during xarry split" from Zi Yan reduces the number of xarray xa_nodes which are generated during an xarray split. - The series "drivers/base/memory: Two cleanups" from Gavin Shan performs some maintenance work on the drivers/base/memory code. - The series "Add tracepoints for lowmem reserves, watermarks and totalreserve_pages" from Martin Liu adds some more tracepoints to the page allocator code. - The series "mm/madvise: cleanup requests validations and classifications" from SeongJae Park cleans up some warts which SeongJae observed during his earlier madvise work. - The series "mm/hwpoison: Fix regressions in memory failure handling" from Shuai Xue addresses two quite serious regressions which Shuai has observed in the memory-failure implementation. - The series "mm: reliable huge page allocator" from Johannes Weiner makes huge page allocations cheaper and more reliable by reducing fragmentation. - The series "Minor memcg cleanups & prep for memdescs" from Matthew Wilcox is preparatory work for the future implementation of memdescs.
- The series "track memory used by balloon drivers" from Nico Pache introduces a way to track memory used by our various balloon drivers. - The series "mm/damon: introduce DAMOS filter type for active pages" from Nhat Pham permits users to filter for active/inactive pages, separately for file and anon pages. - The series "Adding Proactive Memory Reclaim Statistics" from Hao Jia separates the proactive reclaim statistics from the direct reclaim statistics. - The series "mm/vmscan: don't try to reclaim hwpoison folio" from Jinjiang Tu fixes our handling of hwpoisoned pages within the reclaim code. * tag 'mm-stable-2025-03-30-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (431 commits) mm/page_alloc: remove unnecessary __maybe_unused in order_to_pindex() x86/mm: restore early initialization of high_memory for 32-bits mm/vmscan: don't try to reclaim hwpoison folio mm/hwpoison: introduce folio_contain_hwpoisoned_page() helper cgroup: docs: add pswpin and pswpout items in cgroup v2 doc mm: vmscan: split proactive reclaim statistics from direct reclaim statistics selftests/mm: speed up split_huge_page_test selftests/mm: uffd-unit-tests support for hugepages > 2M docs/mm/damon/design: document active DAMOS filter type mm/damon: implement a new DAMOS filter type for active pages fs/dax: don't disassociate zero page entries MM documentation: add "Unaccepted" meminfo entry selftests/mm: add commentary about 9pfs bugs fork: use __vmalloc_node() for stack allocation docs/mm: Physical Memory: Populate the "Zones" section xen: balloon: update the NR_BALLOON_PAGES state hv_balloon: update the NR_BALLOON_PAGES state balloon_compaction: update the NR_BALLOON_PAGES state meminfo: add a per node counter for balloon drivers mm: remove references to folio in __memcg_kmem_uncharge_page() ...	2025-04-01 09:29:18 -07:00
Kent Overstreet	f1350c2c74	bcachefs: fix ref leak in btree_node_read_all_replicas Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-04-01 02:47:41 -04:00
Linus Torvalds	98fb679d19	bcachefs updates for 6.15, part 2 All bugfixes and logging improvements. Minor merge conflict, see: https://lore.kernel.org/linux-next/20250331092816.778a7c83@canb.auug.org.au/T/#u CI says the fs-next tree is good: https://evilpiepirate.org/~testdashboard/ci?user=fs-next&branch=master -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmfqyxsACgkQE6szbY3K bnY4aQ/7BylMgHZsAG2OLRRtegCsuFZ5fZt148TObofSGTTDPcVKYQWcz249Hlao RZzv9nbqq2M7fJrUK5Xloc4DA0ICuWIh9n+uRf+5od7JgtygjJpXqMRz9HrGBtTo QZJE/wAzsa1A8xBORjWVki4koHT3YivaMW2zdgbHIHWTjDJso5Es7RW0/WZYv3lW cFLFEfOnBCEXckhEtK7TAPnJpHEPw+/d0bMFU/PIHbokUwTjxCR0bmRL/RecKUXa 5U1o1x7gAo1iPi3XPGLJVVxXWgjmxzQlF/3aXva+DYaeLgxPMqKxUlC6hkV4f6Oc 9lH/w1pEiCMcANbbp2E3Q91sDFRlafFCgvsKhEz79W5WoNq+vSrxLhLaynyuBT/K lfoiig6IFRTWJDYHu2L6YHFMmp8JOxgJSJ0+dcgyVRnaDJQeGgbuv1tEldonQLsg 9DT8iRJpVDomffwPUoVhujlvJOqUi8zFkxyMCgVWExFzC3ief2B5s3D4uLXcpApO nZfb01W0ElW7qBMQxjyD0Vy+wY8EryzTht9ZKJq5Id1T/LWc9Qi+jPaY86OBC9/w GJgW9OcYLFjYdsDokk5XkwOd/IAXz6fU+vHGtahFJPVfH4T8zzdBnxfPbiR2mXo8 4EfeNmRevZP/oK7/2l2cqIzY7tYBJBUK1gFyvz1+7bcuFwVI8rc= =Udka -----END PGP SIGNATURE----- Merge tag 'bcachefs-2025-03-31' of git://evilpiepirate.org/bcachefs Pull more bcachefs updates from Kent Overstreet: "All bugfixes and logging improvements" * tag 'bcachefs-2025-03-31' of git://evilpiepirate.org/bcachefs: (35 commits) bcachefs: fix bch2_write_point_to_text() units bcachefs: Log original key being moved in data updates bcachefs: BCH_JSET_ENTRY_log_bkey bcachefs: Reorder error messages that include journal debug bcachefs: Don't use designated initializers for disk_accounting_pos bcachefs: Silence errors after emergency shutdown bcachefs: fix units in rebalance_status bcachefs: bch2_ioctl_subvolume_destroy() fixes bcachefs: Clear fs_path_parent on subvolume unlink bcachefs: Change btree_insert_node() assertion to error bcachefs: Better printing of inconsistency errors bcachefs: bch2_count_fsck_err() bcachefs: Better helpers for 
inconsistency errors bcachefs: Consistent indentation of multiline fsck errors bcachefs: Add an "ignore unknown" option to bch2_parse_mount_opts() bcachefs: bch2_time_stats_init_no_pcpu() bcachefs: Fix bch2_fs_get_tree() error path bcachefs: fix logging in journal_entry_err_msg() bcachefs: add missing newline in bch2_trans_updates_to_text() bcachefs: print_string_as_lines: fix extra newline ...	2025-03-31 18:33:51 -07:00
Kent Overstreet	de39965858	bcachefs: Fix null ptr deref in bch2_write_endio() This was previously hard to hit since it requires racing with device removal, but splitting up io_ref uncovered it. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-31 17:39:27 -04:00
Kent Overstreet	7f10fde38f	bcachefs: Fix field spanning write warning Struct with embedded VLA... memcpy: detected field-spanning write (size 8) of single field "&gc->r.e" at fs/bcachefs/ec.c:465 (size 3) WARNING: CPU: 1 PID: 936 at fs/bcachefs/ec.c:465 bch2_trigger_stripe+0x706/0x730 Modules linked in: CPU: 1 UID: 0 PID: 936 Comm: mount.bcachefs Not tainted 6.14.0-rc6-ktest-00236-gefb0b5c62dbc #55 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 RIP: 0010:bch2_trigger_stripe+0x706/0x730 Code: b4 00 01 b9 03 00 00 00 48 89 fb 48 c7 c7 33 54 da 81 48 89 d6 49 89 d6 48 c7 c2 c3 36 db 81 e8 60 54 c5 ff 48 89 df 4c 89 f2 <0f> 0b e9 5c fd ff ff e8 fe 5e 4e 00 bf 10 00 00 00 48 c7 c6 ff ff RSP: 0018:ffff88817081f680 EFLAGS: 00010246 RAX: f8fe7dd1c56b5600 RBX: ffff888101265368 RCX: 0000000000000027 RDX: 0000000000000008 RSI: 00000000fffbffff RDI: ffff888101265368 RBP: 0000000000000000 R08: 000000000003ffff R09: ffff88817f1fe000 R10: 00000000000bfffd R11: 0000000000000004 R12: ffff8881012652c0 R13: 0000000000000000 R14: 0000000000000008 R15: ffff88817081f6c9 FS: 00007fc428bc7c80(0000) GS:ffff888179280000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007ffd3ee4a038 CR3: 000000010a9bc000 CR4: 0000000000750eb0 PKRU: 55555554 Call Trace: <TASK> ? __warn+0xce/0x1b0 ? bch2_trigger_stripe+0x706/0x730 ? report_bug+0x11b/0x1a0 ? bch2_trigger_stripe+0x706/0x730 ? handle_bug+0x5e/0x90 ? exc_invalid_op+0x1a/0x50 ? asm_exc_invalid_op+0x1a/0x20 ? bch2_trigger_stripe+0x706/0x730 bch2_gc_mark_key+0x2cf/0x430 bch2_check_allocations+0x1a64/0x1ed0 ? vsnprintf+0x1ad/0x420 ? bch2_check_allocations+0x191f/0x1ed0 bch2_run_recovery_passes+0x13b/0x2b0 bch2_fs_recovery+0x9b7/0x1290 ? __bch2_print+0xb2/0xf0 ? bch2_printbuf_exit+0x1e/0x30 ? 
print_mount_opts+0x153/0x180 bch2_fs_start+0x274/0x3b0 bch2_fs_get_tree+0x516/0x6e0 vfs_get_tree+0x21/0xa0 do_new_mount+0x153/0x350 __x64_sys_mount+0x16c/0x1f0 do_syscall_64+0x6c/0x140 ? arch_exit_to_user_mode_prepare+0x9/0x40 entry_SYSCALL_64_after_hwframe+0x4b/0x53 Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-31 17:39:10 -04:00
Kent Overstreet	f540876f4e	bcachefs: Fix striping behaviour For striping across devices, we maintain "clocks", and we advance them by the inverse of "how much free space this device has left", so that we round robin biased in favor of devices with more free space. This code was originally trying to do EWMA-ish stuff when originally written, ~10 years ago, and was never properly cleaned up when it was realized that an EWMA is not the right approach here. That left a bug, when we rescale to keep all the clocks in the correct range and prevent overflow. It was assumed that we'd always be allocating from the device with the smallest clock hand, but that's actually not correct: with the target options, allocations will be first tried from a subset of devices, and then the entire filesystem if that fails. Thus, the rescale from the first allocation - allocating from a subset of devices - can pick the wrong rescale value and cause the rest of the clocks to go to 0, losing information. This results in incorrect striping behaviour when the desired number of replicas doesn't fit on the foreground target. Link: https://www.reddit.com/r/bcachefs/comments/1jn3t26/replica_allocation_not_evenly_distributed_among/ Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-31 17:39:10 -04:00
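Note on f540876f4e: the rescale problem described in this commit message can be reproduced with a toy model. This is a minimal sketch in Python, not the bcachefs allocator. The names `allocate`, `clocks`, `free` and `SCALE` are hypothetical, and the non-buggy branch (rescaling by the minimum over all devices) is only one plausible reading of the message, not a description of the actual patch.

```python
SCALE = 1000  # hypothetical fixed-point scale for "inverse of free space"

def allocate(clocks, free, candidates, buggy=False):
    """Pick the candidate with the smallest clock, advance it, rescale all clocks."""
    dev = min(candidates, key=lambda d: clocks[d])
    # Buggy variant: rescale by the chosen device's clock. That is only the
    # global minimum when every device was a candidate.
    base = clocks[dev] if buggy else min(clocks.values())
    # Devices with less free space advance faster, so they are picked less often.
    clocks[dev] += SCALE // free[dev]
    for d in clocks:
        clocks[d] = max(0, clocks[d] - base)  # keep clocks small, clamp at 0
    return dev

free = {"a": 10, "b": 10, "c": 10}

# Allocate from a subset of devices (only "a"), as a target option might force.
fixed = {"a": 100, "b": 10, "c": 40}
allocate(fixed, free, candidates=["a"])

broken = {"a": 100, "b": 10, "c": 40}
allocate(broken, free, candidates=["a"], buggy=True)
```

In this model `fixed` ends up as `{"a": 190, "b": 0, "c": 30}`, so the relative order of `b` and `c` survives the rescale. `broken` ends up as `{"a": 100, "b": 0, "c": 0}`: the clocks of the devices outside the subset collapse to 0, which is the loss of information the message describes.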
Kent Overstreet	650f5353dc	bcachefs: fix bch2_write_point_to_text() units Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-30 20:04:16 -04:00
Kent Overstreet	7fdc3fa3cb	bcachefs: Log original key being moved in data updates There's something going on with the data move path; log the original key being moved for debugging. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-30 18:25:12 -04:00
Kent Overstreet	edaed8ee8c	bcachefs: BCH_JSET_ENTRY_log_bkey Add a journal entry type for logging - but logging a bkey, not a string; to be used for data move path debugging. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-30 18:25:12 -04:00
Kent Overstreet	2b47102b93	bcachefs: Reorder error messages that include journal debug Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-30 16:36:27 -04:00
Kent Overstreet	393a05a741	bcachefs: Don't use designated initializers for disk_accounting_pos Not all compilers fully initialize these - they're not guaranteed to because of the union shenanigans. Fixes: https://github.com/koverstreet/bcachefs/issues/844 Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-30 16:35:13 -04:00
Kent Overstreet	f548db4d31	bcachefs: Silence errors after emergency shutdown We don't care about errors from asynchronous ops that occurred because we did an emergency shutdown; silence them. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-30 16:35:13 -04:00
Kent Overstreet	458e2ef882	bcachefs: fix units in rebalance_status Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-30 16:35:13 -04:00
Kent Overstreet	707549600c	bcachefs: bch2_ioctl_subvolume_destroy() fixes bch2_evict_subvolume_inodes() was getting stuck - due to incorrectly pruning the dcache. Also, fix missing permissions checks. Reported-by: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-30 16:33:22 -04:00
Kent Overstreet	b3981564ca	bcachefs: Clear fs_path_parent on subvolume unlink This fixes recursive subvolume removal. Subvolume deletion is asynchronous; fs_path_parent, and thus the entry in the subvolume_children btree, need to be cleared when the subvolume is unlinked from the fs hierarchy - else we'll spuriously think a subvolume has children and deletion will fail. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-29 20:16:49 -04:00
Kent Overstreet	63c3b8f616	bcachefs: Change btree_insert_node() assertion to error Debug for https://github.com/koverstreet/bcachefs/issues/843 Print useful debug info and go emergency read-only. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-29 14:24:48 -04:00
Kent Overstreet	6d77ce4a27	bcachefs: Better printing of inconsistency errors Build up and emit the error message for an inconsistency error all at once, instead of spread over multiple printk calls, so they're not jumbled in the dmesg log. Also, add better indenting. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-29 13:26:13 -04:00
Kent Overstreet	7337f9f14e	bcachefs: bch2_count_fsck_err() Factor out a helper from __bch2_fsck_err(), for counting the error in the superblock and deciding whether to print or ratelimit - will be used to replace some log_fsck_err() calls, where we want to lift out printing the error message. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-29 13:26:13 -04:00
Kent Overstreet	b00750c2e5	bcachefs: Better helpers for inconsistency errors An inconsistency error often happens as part of an event with multiple error messages, and we want to build up one single error message with proper indenting to produce more readable log messages that don't get garbled. Add new helpers that emit messages to a printbuf instead of printing them directly, next patch will convert to use them. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-28 22:31:47 -04:00
Kent Overstreet	1ece53237e	bcachefs: Consistent indentation of multiline fsck errors Add the new helper printbuf_indent_add_nextline(), and use it in __bch2_fsck_err() to centralize setting the indentation of multiline fsck errors. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-28 22:31:47 -04:00
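Note on 6d77ce4a27, b00750c2e5 and 1ece53237e: the idea of building a multiline error in a buffer and emitting it in one go can be sketched as follows. This is a hedged illustration in Python, not the kernel printbuf API. `MessageBuffer` and its methods are hypothetical names, the error text is made up, and it is an assumption that "add_nextline" means indentation that takes effect from the next line on.

```python
class MessageBuffer:
    """Collects a multiline message so it can be emitted with a single call."""

    def __init__(self):
        self.parts = []
        self.indent = 0
        self.at_line_start = True

    def indent_add_nextline(self, spaces):
        # The current line keeps its old indentation; later lines get the new one.
        if self.at_line_start:
            self.parts.append(" " * self.indent)
            self.at_line_start = False
        self.indent += spaces

    def write(self, text):
        for line in text.splitlines(keepends=True):
            if self.at_line_start:
                self.parts.append(" " * self.indent)
            self.parts.append(line)
            self.at_line_start = line.endswith("\n")

    def getvalue(self):
        return "".join(self.parts)

buf = MessageBuffer()
buf.indent_add_nextline(2)
buf.write("inode 42 has wrong i_nlink\n")   # made-up error text
buf.write("expected 2\ngot 1\n")
message = buf.getvalue()
```

Because the whole message lives in one string, the caller can hand it to the logger once, so lines from other contexts cannot be interleaved with it.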
Kent Overstreet	a7cdf2276e	bcachefs: Add an "ignore unknown" option to bch2_parse_mount_opts() To be used by the mount helper in userspace, where we still have options to be parsed by other layers. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-28 22:31:47 -04:00
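Note on a7cdf2276e: an "ignore unknown" mode for option parsing can be sketched like this. The sketch is in Python and purely illustrative. `parse_mount_opts`, `KNOWN_OPTS` and the option names `fs_opt_a` and `fs_opt_b` are placeholders, not the signature or option table of bch2_parse_mount_opts().

```python
KNOWN_OPTS = {"fs_opt_a", "fs_opt_b"}  # placeholder names, not real options

def parse_mount_opts(optstr, ignore_unknown=False):
    """Split a comma-separated option string into known options and leftovers."""
    parsed, leftover = {}, []
    for item in filter(None, optstr.split(",")):
        name, _, value = item.partition("=")
        if name in KNOWN_OPTS:
            parsed[name] = value or True   # bare options act as flags
        elif ignore_unknown:
            leftover.append(item)          # keep it for another layer to parse
        else:
            raise ValueError(f"unknown mount option: {name}")
    return parsed, leftover

parsed, leftover = parse_mount_opts("fs_opt_a=2,noatime,fs_opt_b",
                                    ignore_unknown=True)
```

Returning the unrecognized items instead of dropping them is a design choice of this sketch: it lets a caller such as a mount helper pass them on to whichever layer does understand them.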
Kent Overstreet	daa771332e	bcachefs: bch2_time_stats_init_no_pcpu() Add a mode to disable automatic switching to percpu mode, useful when a time_stats will only be used by one thread and we don't want to have to flush the percpu buffers. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-28 22:31:47 -04:00
Florian Albrechtskirchinger	7c4cb50e1a	bcachefs: Fix bch2_fs_get_tree() error path When a filesystem is mounted read-only, subsequent attempts to mount it as read-write fail with EBUSY. Previously, the error path in bch2_fs_get_tree() would unconditionally call __bch2_fs_stop(), improperly freeing resources for a filesystem that was still actively mounted. This change modifies the error path to only call __bch2_fs_stop() if the superblock has no valid root dentry, ensuring resources are not cleaned up prematurely when the filesystem is in use. Signed-off-by: Florian Albrechtskirchinger <falbrechtskirchinger@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-28 13:51:09 -04:00
Kent Overstreet	6b1e0b9e18	bcachefs: fix logging in journal_entry_err_msg() We want to log errors all at once, not spread across multiple printks. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-28 12:36:32 -04:00
Kent Overstreet	ff4e0f7de6	bcachefs: add missing newline in bch2_trans_updates_to_text() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-28 12:36:32 -04:00
Kent Overstreet	35a11506a3	bcachefs: print_string_as_lines: fix extra newline Don't print a newline on empty string; this was causing us to also print an extra newline when we got to the end of the string. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-28 12:36:32 -04:00
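Note on 35a11506a3: the extra-newline problem is easy to reproduce outside the kernel. The sketch below is in Python and does not mirror the kernel function. `string_as_lines` is a hypothetical helper that returns the lines that would be printed, stopping as soon as the remaining string is empty.

```python
def string_as_lines(text):
    """Return the lines to print, producing nothing for an empty remainder."""
    lines = []
    while text:                      # an empty string yields no line at all
        line, _, text = text.partition("\n")
        lines.append(line)
    return lines

good = string_as_lines("first\nsecond\n")
naive = "first\nsecond\n".split("\n")   # leaves a trailing empty entry
```

Here `good` is `["first", "second"]`, while the naive split also yields a final empty string, which would be printed as an extra blank line. Blank lines in the middle of the text are still kept.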
Kent Overstreet	3c72d3eea9	bcachefs: Fix WARN() in bch2_bkey_pick_read_device() syzbot discovered that this one is possible: we have pointers, but none of them are to valid devices. Reported-by: syzbot+336a6e6a2dbb7d4dba9a@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-28 12:35:05 -04:00
Kent Overstreet	af3d4c276a	bcachefs: Don't return 0 size holes from bch2_seek_hole() The hole we find in the btree might be fully dirty in the page cache. If so, keep searching. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-28 11:30:14 -04:00
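Note on af3d4c276a: the "keep searching" rule can be shown with a toy model in which a file is described by two lists of half-open byte ranges. This is a sketch under stated assumptions, written in Python. `seek_hole`, `skip_dirty`, `extents` (ranges present in the btree) and `dirty` (sorted, non-overlapping ranges that are dirty in the page cache) are hypothetical stand-ins, not the bcachefs data structures.

```python
def skip_dirty(dirty, lo, hi):
    """First offset in [lo, hi) not covered by a dirty range, or hi if none."""
    if lo >= hi:
        return hi
    pos = lo
    for d_start, d_end in sorted(dirty):
        if d_start <= pos < d_end:
            pos = d_end
    return min(pos, hi)

def seek_hole(extents, dirty, start, file_size):
    """First offset >= start that is neither in an extent nor dirty in cache."""
    pos = start
    # A zero-length sentinel extent at EOF lets the loop examine the final gap.
    for ext_start, ext_end in sorted(extents) + [(file_size, file_size)]:
        if ext_end <= pos:
            continue
        candidate = skip_dirty(dirty, pos, ext_start)  # gap before this extent
        if candidate < ext_start:
            return candidate          # a real, non-empty hole
        pos = max(pos, ext_end)       # gap was empty or fully dirty: keep going
    return file_size

extents = [(0, 4096), (8192, 12288)]
with_dirty = seek_hole(extents, [(4096, 8192)], 0, 16384)
without_dirty = seek_hole(extents, [], 0, 16384)
```

With no dirty data the first hole starts at 4096. When that whole gap is dirty in the page cache, the search moves on and returns 12288 instead of reporting a zero-size hole at 4096.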
Kent Overstreet	1f4bb8254c	bcachefs: Fix bch2_seek_hole() locking We can't call bch2_seek_pagecache_hole(), and block on page locks, with btree locks held. This is easily fixed because we're at the end of the transaction - we can just unlock, we don't need a drop_locks_do(). Reported-by: https://github.com/nagalun Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-28 11:30:14 -04:00
Kent Overstreet	2dd202dbaf	bcachefs: Recovery no longer holds state_lock state_lock guards against devices coming or leaving, changing state, or the filesystem changing between ro <-> rw. But it's not necessary for running recovery passes, and holding it blocks asynchronous events that would cause us to go RO or kick out devices. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-28 11:13:25 -04:00
Kent Overstreet	c6c6a39109	bcachefs: Fix permissions on version modparam There's no reason for this not to be world readable - it provides the currently supported on disk format version. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-28 11:13:23 -04:00
Linus Torvalds	4a4b30ea80	bcachefs updates for 6.15 On disk format is now soft frozen: no more required/automatic are anticipated before taking off the experimental label. Major changes/features since 6.14: - Scrub - Blocksize greater than page size support - A number of "rebalance spinning and doing no work" issues have been fixed; we now check if the write allocation will succeed in bch2_data_update_init(), before kicking off the read. There's still more work to do in this area. Later we may want to add another bitset btree, like rebalance_work, to track "extents that rebalance was requested to move but couldn't", e.g. due to destination target having insufficient online devices. - We can now support scaling well into the petabyte range: latest bcachefs-tools will pick an appropriate bucket size at format time to ensure fsck can run in available memory (e.g. a server with 256GB of ram and 100PB of storage would want 16MB buckets). On disk format changes: - 1.21: cached backpointers (scalability improvement) Cached replicas now get backpointers, which means we no longer rely on incrementing bucket generation numbers to invalidate cached data: this lets us get rid of the bucket generation number garbage collection, which had to periodically rescan all extents to recompute bucket oldest_gen. Bucket generation numbers are now only used as a consistency check, but they're quite useful for that. - 1.22: stripe backpointers Stripes now have backpointers: erasure coded stripes have their own checksums, separate from the checksums for the extents they contain (and stripe checksums also cover the parity blocks). This is required for implementing scrub for stripes. - 1.23: stripe lru (scalability improvement) Persistent lru for stripes, ordered by "number of empty blocks". This is used by the stripe creation path, which depending on free space may create a new stripe out of a partially empty existing stripe instead of starting a brand new stripe. 
This replaces an in-memory heap, and means we no longer have to read in the stripes btree at startup. - 1.24: casefolding Case insensitive directory support, courtesy of Valve. This is an incompatible feature, to enable mount with -o version_upgrade=incompatible - 1.25: extent_flags Another incompatible feature requiring explicit opt-in to enable. This adds a flags entry to extents, and a flag bit that marks extents as poisoned. A poisoned extent is an extent that was unreadable due to checksum errors. We can't move such extents without giving them a new checksum, and we may have to move them (for e.g. copygc or device evacuate). We also don't want to delete them: in the future we'll have an API that lets userspace ignore checksum errors and attempt to deal with simple bitrot itself. Marking them as poisoned lets us continue to return the correct error to userspace on normal read calls. Other changes/features: - BCH_IOCTL_QUERY_COUNTERS: this is used by the new 'bcachefs fs top' command, which shows a live view of all internal filesystem counters. - Improved journal pipelining: we can now have 16 journal writes in flight concurrently, up from 4. We're logging significantly more to the journal than we used to with all the recent disk accounting changes and additions, so some users should see a performance increase on some workloads. - BCH_MEMBER_STATE_failed: previously, we would do no IO at all to devices marked as failed. Now we will attempt to read from them, but only if we have no better options. - New option, write_error_timeout: devices will be kicked out of the filesystem if all writes have been failing for x number of seconds. We now also kick devices out when notified by blk_holder_ops that they've gone offline. - Device option handling improvements: the discard option should now be working as expected (additionally, in -tools, all device options that can be set at format time can now be set at device add time, i.e. data_allowed, state). 
- We now try harder to read data after a checksum error: we'll do additional retries if necessary to a device after it gave us data with a checksum error. - More self healing work: the full inode <-> dirent consistency checks that are currently run by fsck are now also run every time we do a lookup, meaning we'll be able to correct errors at runtime. Runtime self healing will be flipped on after the new changes have seen more testing, currently they're just checking for consistency. - KMSAN fixes: our KMSAN builds should be nearly clean now, which will put a massive dent in the syzbot dashboard. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmfhbnsACgkQE6szbY3K bnY6ew/9FXh3m71BvVpuqTYcUGzIC7gVrnkFy6n4W96v07OjSOoTNHOVVovajxc3 P9LvA77BHC4Xro3H7ORpsIurOZUc6yx18ZizzulVbQFuYa7LY/kNri4ZBtGHcRiV pIdQDLSNmwFjPA4x2S1qTFSF1c586lad+UNQiLam5ophBwQPEO6vG51ZEHa4wld9 +OhWTDYfrvij4D3Lt1ppvhuDP+PQBjhu/QFc0bGjHvKOjfV6sw9XU91sCYKOJIzd qzpsiQd5sepnX717Br3f5SLdxMq2lJYvRp9756vltOCaMBvJYJtHqtXCglHQEkFw yjhmPjk4r3VlKTF8K+wEJfAHwbC2kEn7csJNbt0+Nko5PPtFyrb8ok6QUbHCKscL L0VMnzaXHVqvG2VgYa31temfdz7HM/zHjQ8Al3eQPaqTHIoTXIBQxOQSea/apVMt TIlastvLoHfR8W7+LrwOmTjnBJGCJ+MrdcJzJDVk2tQmmcMA0boeZvl4aSklFuyB zNN5fxp0VMsxNyIHLJjQ3UcwVqHXC5w+f5H1ByQLUyQh+m/xaAaz7S+BTVdVbFPa 1Z1xDuvuHOTnjIOamnOD1l36afJnhq5RciPCXCNtQSB819mc+AfNGQNQTVNOTReC iTiUCcNxu0/DIPlPmeJzAlukVJUgz+/knOI/6zPs3eI7/o88ZGg= =k3cV -----END PGP SIGNATURE----- Merge tag 'bcachefs-2025-03-24' of git://evilpiepirate.org/bcachefs Pull bcachefs updates from Kent Overstreet: "On disk format is now soft frozen: no more required/automatic are anticipated before taking off the experimental label. Major changes/features since 6.14: - Scrub - Blocksize greater than page size support - A number of "rebalance spinning and doing no work" issues have been fixed; we now check if the write allocation will succeed in bch2_data_update_init(), before kicking off the read. There's still more work to do in this area.
Later we may want to add another bitset btree, like rebalance_work, to track "extents that rebalance was requested to move but couldn't", e.g. due to destination target having insufficient online devices. - We can now support scaling well into the petabyte range: latest bcachefs-tools will pick an appropriate bucket size at format time to ensure fsck can run in available memory (e.g. a server with 256GB of ram and 100PB of storage would want 16MB buckets). On disk format changes: - 1.21: cached backpointers (scalability improvement) Cached replicas now get backpointers, which means we no longer rely on incrementing bucket generation numbers to invalidate cached data: this lets us get rid of the bucket generation number garbage collection, which had to periodically rescan all extents to recompute bucket oldest_gen. Bucket generation numbers are now only used as a consistency check, but they're quite useful for that. - 1.22: stripe backpointers Stripes now have backpointers: erasure coded stripes have their own checksums, separate from the checksums for the extents they contain (and stripe checksums also cover the parity blocks). This is required for implementing scrub for stripes. - 1.23: stripe lru (scalability improvement) Persistent lru for stripes, ordered by "number of empty blocks". This is used by the stripe creation path, which depending on free space may create a new stripe out of a partially empty existing stripe instead of starting a brand new stripe. This replaces an in-memory heap, and means we no longer have to read in the stripes btree at startup. - 1.24: casefolding Case insensitive directory support, courtesy of Valve. This is an incompatible feature, to enable mount with -o version_upgrade=incompatible - 1.25: extent_flags Another incompatible feature requiring explicit opt-in to enable. This adds a flags entry to extents, and a flag bit that marks extents as poisoned. A poisoned extent is an extent that was unreadable due to checksum errors. 
We can't move such extents without giving them a new checksum, and we may have to move them (for e.g. copygc or device evacuate). We also don't want to delete them: in the future we'll have an API that lets userspace ignore checksum errors and attempt to deal with simple bitrot itself. Marking them as poisoned lets us continue to return the correct error to userspace on normal read calls. Other changes/features: - BCH_IOCTL_QUERY_COUNTERS: this is used by the new 'bcachefs fs top' command, which shows a live view of all internal filesystem counters. - Improved journal pipelining: we can now have 16 journal writes in flight concurrently, up from 4. We're logging significantly more to the journal than we used to with all the recent disk accounting changes and additions, so some users should see a performance increase on some workloads. - BCH_MEMBER_STATE_failed: previously, we would do no IO at all to devices marked as failed. Now we will attempt to read from them, but only if we have no better options. - New option, write_error_timeout: devices will be kicked out of the filesystem if all writes have been failing for x number of seconds. We now also kick devices out when notified by blk_holder_ops that they've gone offline. - Device option handling improvements: the discard option should now be working as expected (additionally, in -tools, all device options that can be set at format time can now be set at device add time, i.e. data_allowed, state). - We now try harder to read data after a checksum error: we'll do additional retries if necessary to a device after it gave us data with a checksum error. - More self healing work: the full inode <-> dirent consistency checks that are currently run by fsck are now also run every time we do a lookup, meaning we'll be able to correct errors at runtime. Runtime self healing will be flipped on after the new changes have seen more testing, currently they're just checking for consistency.
- KMSAN fixes: our KMSAN builds should be nearly clean now, which will put a massive dent in the syzbot dashboard" * tag 'bcachefs-2025-03-24' of git://evilpiepirate.org/bcachefs: (180 commits) bcachefs: Kill unnecessary bch2_dev_usage_read() bcachefs: btree node write errors now print btree node bcachefs: Fix race in print_chain() bcachefs: btree_trans_restart_foreign_task() bcachefs: bch2_disk_accounting_mod2() bcachefs: zero init journal bios bcachefs: Eliminate padding in move_bucket_key bcachefs: Fix a KMSAN splat in btree_update_nodes_written() bcachefs: kmsan asserts bcachefs: Fix kmsan warnings in bch2_extent_crc_pack() bcachefs: Disable asm memcpys when kmsan enabled bcachefs: Handle backpointers with unknown data types bcachefs: Count BCH_DATA_parity backpointers correctly bcachefs: Run bch2_check_dirent_target() at lookup time bcachefs: Refactor bch2_check_dirent_target() bcachefs: Move bch2_check_dirent_target() to namei.c bcachefs: fs-common.c -> namei.c bcachefs: EIO cleanup bcachefs: bch2_write_prep_encoded_data() now returns errcode bcachefs: Simplify bch2_write_op_error() ...	2025-03-27 13:20:07 -07:00
Kent Overstreet	d0fb2a266a	bcachefs: cond_resched() in journal_key_sort_cmp() Fixes "task out to lunch" warnings during recovery on large machines with lots of dirty data in the journal. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-26 16:26:35 -04:00
Kent Overstreet	ef488bb5d0	bcachefs: Fix 'hung task' messages in btree node scan btree node scan has to wait on kthread workers that scan each device, potentially for a while. We would like this to be interruptible, but we may need a different mechanism than signals for that - we've had bugs in the past where mounts were failing due to checking for signals, with no explanation of where they came from. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-26 16:26:35 -04:00
Kent Overstreet	9314e2fb26	bcachefs: Fix btree iter flags in data move (2) Data move -> move_get_io_opts -> bch2_get_update_rebalance_opts requires a not_extents iterator; this fixes the path where we're walking the extents btree and chase a reflink pointer into the reflink btree. bch2_lookup_indirect_extent() requires working with an extents iterator (due to peek_slot() semantics), so we implement bch2_lookup_indirect_extent_for_move(). This is simplified because there's no need to report indirect_extent_missing_errors here, that can be deferred until fsck or when a user reads that data. Reported-by: Maël Kerbiriou <mael.kerbiriou@free.fr> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-26 16:26:35 -04:00
Kent Overstreet	19ff84b20d	bcachefs: Don't unnecessarily decrypt data when moving There's various checks for "are we going to compress this" - but we're not going to compress if we know it's incompressible. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-26 16:26:35 -04:00
Kent Overstreet	a44e4f8f00	bcachefs: Document disk accounting keys and counters Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-26 16:26:35 -04:00
Kent Overstreet	9c893face2	bcachefs: Validate number of counters for accounting keys We weren't checking that accounting keys have the expected number of counters. Originally we probably wanted to be flexible on this, but it doesn't look like that will be required - accounting is extended by adding new counter types, not more counters to an existing type. This means we can drop a BUG_ON() that popped once in automated testing, and the new validation will make that bug easier to track down. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-26 16:26:35 -04:00
Kent Overstreet	e1e50a6330	bcachefs: Use print_string_as_lines() for journal stuck messages They were being truncated, printk has a 1k limit per call Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-25 11:49:46 -04:00
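The truncation fix above can be pictured with a small userspace sketch: rather than handing one very long string to a single print call, the message is split at newlines and each line is emitted separately. This is only an illustration of the idea. `emit_as_lines()`, `line_sink_fn`, and `track_longest()` are hypothetical names, and this is not the kernel's print_string_as_lines() implementation.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sink: receives one line (not NUL-terminated) at a time. */
typedef void (*line_sink_fn)(const char *line, size_t len, void *ctx);

/*
 * Split `msg` at '\n' and pass each line to `sink` separately, so that no
 * single call has to carry the whole message. Empty lines are passed through
 * with len == 0. Returns the number of lines emitted.
 */
static size_t emit_as_lines(const char *msg, line_sink_fn sink, void *ctx)
{
	size_t nr = 0;

	while (*msg) {
		const char *nl = strchr(msg, '\n');
		size_t len = nl ? (size_t)(nl - msg) : strlen(msg);

		sink(msg, len, ctx);
		nr++;

		msg += len;
		if (*msg == '\n')
			msg++;		/* skip the separator itself */
	}
	return nr;
}

/* Example sink: remembers the longest line it was handed. */
static void track_longest(const char *line, size_t len, void *ctx)
{
	size_t *longest = ctx;

	(void)line;
	if (len > *longest)
		*longest = len;
}
```

With a sink that forwards each line to the real logging call, the per-call size is bounded by the longest line rather than by the whole message.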
Kent Overstreet	a76db26a96	bcachefs: Fix duplicate checksum error messages in write path Also, improve the message in prep_encoded_data() - it now prints good/bad checksums, and checksum type. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-25 11:49:43 -04:00
Kent Overstreet	3ba0240a87	bcachefs: Fix silent short reads in data read retry path __bch2_read, before calling __bch2_read_extent(), sets bvec_iter.bi_size to "the size we can read from the current extent" with a swap, and restores it to "the size for the total read" after the read_extent call with another swap. But we neglected to do the restore before the "if (ret) goto err;" - which is a problem if we're retrying those errors. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-25 11:49:16 -04:00
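The swap/restore ordering described in the short-read fix can be sketched in plain C. The struct and function names below are hypothetical stand-ins (the real code operates on a bvec_iter inside __bch2_read()), and the per-extent read is simulated by a return code passed in by the caller. The point is only that the restoring swap happens before the error check, so a retry sees the full remaining size.

```c
#include <stddef.h>

/* Stand-in for struct bvec_iter; only the remaining-size field is modelled. */
struct read_iter {
	size_t size;
};

static void swap_size(size_t *a, size_t *b)
{
	size_t tmp = *a;

	*a = *b;
	*b = tmp;
}

/*
 * One iteration of a read loop. `extent_bytes` is how much the current extent
 * can satisfy (assumed <= it->size). `extent_ret` simulates the result of the
 * per-extent read: 0 on success, nonzero on error.
 */
static int read_extent_step(struct read_iter *it, size_t extent_bytes, int extent_ret)
{
	size_t bytes = extent_bytes;
	int ret;

	swap_size(&it->size, &bytes);	/* it->size = extent size, bytes = total */
	ret = extent_ret;		/* the per-extent read would run here */
	swap_size(&it->size, &bytes);	/* restore first, so a retry sees the total */

	if (ret)
		return ret;

	it->size -= bytes;		/* success: advance past this extent */
	return 0;
}
```

If the error check were placed before the second swap, a failed step would leave `it->size` at the extent size, and a retry would read less than was requested.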
Kent Overstreet	5af61dbd96	bcachefs: Fix nonce inconsistency in bch2_write_prep_encoded_data() If we're moving an extent that was partially overwritten, bch2_write_rechecksum() will trim it to the currently live range. If we then also want to compress it, it'll be decrypted - but the nonce has been advanced for the overwritten start of the extent that we dropped, and we were using the nonce we calculated before rechecksum(). Reported-by: Gabriel de Perthuis <g2p.code@gmail.com> Fixes: `127d90d282` ("bcachefs: bch2_write_prep_encoded_data() now returns errcode") Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-25 11:40:35 -04:00
Linus Torvalds	e41170cc5e	vfs-6.15-rc1.pagesize -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZ90rxAAKCRCRxhvAZXjc ooIPAQCwMjDjtWegvBy8kefiRw+fa4z3ZWHrwRT9DJrD/K9WyAD+JVd0ou27SVpQ jKpRSRct2eTbyxdYiGydHQGm5F5sLg4= =0FyQ -----END PGP SIGNATURE----- Merge tag 'vfs-6.15-rc1.pagesize' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs pagesize updates from Christian Brauner: "This enables block sizes greater than the page size for block devices. With this we can start supporting block devices with logical block sizes larger than 4k. It also allows to lift the device cache sector size support to 64k. This allows filesystems which can use larger sector sizes up to 64k to ensure that the filesystem will not generate writes that are smaller than the specified sector size" * tag 'vfs-6.15-rc1.pagesize' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: bdev: add back PAGE_SIZE block size validation for sb_set_blocksize() bdev: use bdev_io_min() for statx block size block/bdev: lift block size restrictions to 64k block/bdev: enable large folio support for large logical block sizes fs/buffer fs/mpage: remove large folio restriction fs/mpage: use blocks_per_folio instead of blocks_per_page fs/mpage: avoid negative shift for large blocksize fs/buffer: remove batching from async read fs/buffer: simplify block_read_full_folio() with bh_offset()	2025-03-24 12:01:29 -07:00
Linus Torvalds	26d8e43079	vfs-6.15-rc1.async.dir -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZ90rNwAKCRCRxhvAZXjc onBJAP9Z8Ywmlb5KQ1E3HvDmkwyY6yOSyZ9/CmbzrkCJ8ywYkQD/d9/xt0EP/O/q N8YtzXArHWt7u0YbcVpy9WK3F72BdwU= =VJgY -----END PGP SIGNATURE----- Merge tag 'vfs-6.15-rc1.async.dir' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs async dir updates from Christian Brauner: "This contains cleanups that fell out of the work from async directory handling: - Change kern_path_locked() and user_path_locked_at() to never return a negative dentry. This simplifies the usability of these helpers in various places - Drop d_exact_alias() from the remaining place in NFS where it is still used. This also allows us to drop the d_exact_alias() helper completely - Drop an unnecessary call to fh_update() from nfsd_create_locked() - Change i_op->mkdir() to return a struct dentry Change vfs_mkdir() to return a dentry provided by the filesystems which is hashed and positive. This allows us to reduce the number of cases where the resulting dentry is not positive to very few cases. The code in these places becomes simpler and easier to understand. - Repack DENTRY_* and LOOKUP_* flags" * tag 'vfs-6.15-rc1.async.dir' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: doc: fix inline emphasis warning VFS: Change vfs_mkdir() to return the dentry. nfs: change mkdir inode_operation to return alternate dentry if needed. fuse: return correct dentry for ->mkdir ceph: return the correct dentry on mkdir hostfs: store inode in dentry after mkdir if possible. Change inode_operations.mkdir to return struct dentry * nfsd: drop fh_update() from S_IFDIR branch of nfsd_create_locked() nfs/vfs: discard d_exact_alias() VFS: add common error checks to lookup_one_qstr_excl() VFS: change kern_path_locked() and user_path_locked_at() to never return negative dentry VFS: repack LOOKUP_ bit flags. VFS: repack DENTRY_ flags.	2025-03-24 10:47:14 -07:00
Kent Overstreet	d8bdc8daac	bcachefs: Kill unnecessary bch2_dev_usage_read() bch2_dev_usage_read() is fairly expensive, we should optimize this more. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-24 09:50:37 -04:00
Kent Overstreet	2adfa46734	bcachefs: btree node write errors now print btree node It turned out a user was wondering why we were going read-only after a write error, and he didn't realize he didn't have replication enabled - this will make that more obvious, and we should be printing it anyways. Link: https://www.reddit.com/r/bcachefs/comments/1jf9akl/large_data_transfers_switched_bcachefs_to_readonly/ Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-24 09:50:37 -04:00
Kent Overstreet	739200c573	bcachefs: Fix race in print_chain() 00636 Unable to handle kernel NULL pointer dereference at virtual address 00000000000000b0 00636 Mem abort info: 00636 ESR = 0x0000000096000005 00636 EC = 0x25: DABT (current EL), IL = 32 bits 00636 SET = 0, FnV = 0 00636 EA = 0, S1PTW = 0 00636 FSC = 0x05: level 1 translation fault 00636 Data abort info: 00636 ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000 00636 CM = 0, WnR = 0, TnD = 0, TagAccess = 0 00636 GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 00636 user pgtable: 4k pages, 39-bit VAs, pgdp=0000000101b10000 00636 [00000000000000b0] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000 00636 Internal error: Oops: 0000000096000005 [#1] SMP 00636 Modules linked in: 00636 CPU: 12 UID: 0 PID: 79369 Comm: cat Not tainted 6.14.0-rc6-ktest-g3783b8973ab7 #17757 00636 Hardware name: linux,dummy-virt (DT) 00636 pstate: 20001005 (nzCv daif -PAN -UAO -TCO -DIT +SSBS BTYPE=--) 00636 pc : print_chain+0xb8/0x170 00636 lr : print_chain+0xa0/0x170 00636 sp : ffffff80d9c1bbb0 00636 x29: ffffff80d9c1bbb0 x28: 0000000000000002 x27: ffffff80c1be8250 00636 x26: ffffff80dd9b0000 x25: 0000000000000020 x24: 000000000000002d 00636 x23: 000000000000003c x22: ffffffc080a54518 x21: ffffff80da6e00d0 00636 x20: ffffff80da6e0170 x19: ffffff80c1a1d240 x18: 00000000ffffffff 00636 x17: 3535303937202d3c x16: 203139202d3c2035 x15: 00000000ffffffff 00636 x14: 0000000000000000 x13: ffffff80d71b63f1 x12: 0000000000000006 00636 x11: ffffffc080beb1c0 x10: 0000000000000020 x9 : 00000000000134cc 00636 x8 : 0000000000000020 x7 : 0000000000000004 x6 : 0000000000000020 00636 x5 : ffffff80d71b63f7 x4 : ffffffc080a5451b x3 : 0000000000000000 00636 x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000000000000 00636 Call trace: 00636 print_chain+0xb8/0x170 (P) 00636 bch2_check_for_deadlock+0x444/0x5a0 00636 bch2_btree_deadlock_read+0xb4/0x1c8 00636 full_proxy_read+0x74/0xd8 00636 vfs_read+0x90/0x300 Signed-off-by: Kent Overstreet 
<kent.overstreet@linux.dev>	2025-03-24 09:50:37 -04:00
Kent Overstreet	0b4fd56726	bcachefs: btree_trans_restart_foreign_task() In debug mode, we save the call stack on transaction restart - but there's no locking, so we can't touch it if we're issuing the restart from another thread. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-24 09:50:37 -04:00
Kent Overstreet	f4a584f4bf	bcachefs: bch2_disk_accounting_mod2() We're hitting some issues with uninitialized struct padding, flagged by kmsan. They appear to be false positives, otherwise bch2_accounting_validate() would have flagged them as "junk at end". But for now, we'll need to initialize disk_accounting_pos with memset(). This adds a new helper, bch2_disk_accounting_mod2(), that initializes a disk_accounting_pos and does the accounting mod all at once - so overall things actually get slightly more ergonomic. BCH_DISK_ACCOUNTING_replicas keys are left for now; KMSAN isn't warning about them and they're a bit special. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-24 09:50:37 -04:00
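As a rough illustration of why the memset() matters: a struct with internal padding can carry leftover bytes in that padding when only its named fields are assigned, and a tool that tracks initialization byte by byte will notice. The struct and helper below are hypothetical and much simpler than disk_accounting_pos. The sketch also relies on the common compiler behaviour that a plain member store leaves padding bytes alone, which the C standard does not strictly guarantee.

```c
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical key: on typical ABIs there are 7 padding bytes between
 * `type` and `inum`.
 */
struct acct_pos {
	uint8_t		type;
	uint64_t	inum;
};

/*
 * Zero the whole object first, padding included, then fill in the fields.
 * Afterwards every byte of *p has a defined value, so the struct can be
 * compared or copied as raw bytes.
 */
static void acct_pos_init(struct acct_pos *p, uint8_t type, uint64_t inum)
{
	memset(p, 0, sizeof(*p));
	p->type = type;
	p->inum = inum;
}
```

Two objects initialized this way with the same field values compare equal under memcmp(), regardless of what the underlying memory held before.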
Kent Overstreet	5ae6f33053	bcachefs: zero init journal bios fix a kmsan splat Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-24 09:50:37 -04:00
Kent Overstreet	9ea24b287b	bcachefs: Eliminate padding in move_bucket_key We appear to be tripping over a compiler/kmsan bug with padding fields - this is an easy workaround. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-24 09:50:37 -04:00
Kent Overstreet	1f88c35674	bcachefs: Fix a KMSAN splat in btree_update_nodes_written() We may sometimes read from uninitialized memory; we know, and that's ok. We check if a btree node has been reused before waiting on any outstanding IO. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-24 09:50:37 -04:00
Kent Overstreet	28aa859b6b	bcachefs: kmsan asserts Catching these early makes them a lot easier to track down. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-24 09:50:36 -04:00
Kent Overstreet	53cf2a3daa	bcachefs: Fix kmsan warnings in bch2_extent_crc_pack() We store to all fields, so the kmsan warnings were spurious - but initializing via stores to bitfields appear to have been giving the compiler/kmsan trouble, and they're not necessary. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-24 09:50:36 -04:00
Kent Overstreet	9c3a2c9b47	bcachefs: Disable asm memcpys when kmsan enabled kmsan doesn't know about inline assembly, obviously; this will close a ton of syzbot bugs. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-24 09:50:36 -04:00
Kent Overstreet	962322475b	bcachefs: Handle backpointers with unknown data types New data types might be added later, so we don't want to disallow unknown data types - that'll be a compatibility hassle later. Instead, ignore them. Reported-by: syzbot+3a290f5ff67ca3023834@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-24 09:50:36 -04:00
Kent Overstreet	6a9f681ef6	bcachefs: Count BCH_DATA_parity backpointers correctly These are counted as stripe data in the corresponding alloc keys. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-24 09:50:36 -04:00
Kent Overstreet	04e90891be	bcachefs: Run bch2_check_dirent_target() at lookup time More on the "full online self healing" project: We now run most of the dirent <-> inode consistency checks, with repair code, at runtime - the exact same check and repair code that fsck runs. This will allow us to repair the "dirent points to inode that does not point back" inconsistencies that have been popping up at runtime. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-24 09:50:36 -04:00
Kent Overstreet	9b0d00a369	bcachefs: Refactor bch2_check_dirent_target() Prep work for calling bch2_check_dirent_target() from bch2_lookup(). - Add an inline wrapper, if the target and backpointer match we can skip the function call. - We don't (yet?) want to remove the dirent we did the lookup from (when we find a directory or subvol with multiple valid dirents pointing to it), we can defer on that until later. For now, add an "are we in fsck?" parameter. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-24 09:50:36 -04:00
Kent Overstreet	758ea4ff81	bcachefs: Move bch2_check_dirent_target() to namei.c We're gradually running more and more fsck.c checks at runtime, wherever applicable; when we do so they get moved out of fsck.c. Next patch will call bch2_check_dirent_target() from bch2_lookup(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-24 09:50:36 -04:00
Kent Overstreet	4fcd4de0a6	bcachefs: fs-common.c -> namei.c name <-> inode, code for managing the relationships between inodes and dirents. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-24 09:50:36 -04:00
Kent Overstreet	8a9f3d0582	bcachefs: EIO cleanup Replace these with proper private error codes, so that when we get an error message we're not sifting through the entire codebase to see where it came from. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-24 09:50:36 -04:00
Kent Overstreet	127d90d282	bcachefs: bch2_write_prep_encoded_data() now returns errcode Prep work for killing off EIO and replacing them with proper private error codes. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-24 09:50:36 -04:00
Kent Overstreet	2fe208303a	bcachefs: Simplify bch2_write_op_error() There's no reason for the caller to do the actual logging, it's all done the same. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-24 09:50:36 -04:00
Kent Overstreet	af2ff37da7	bcachefs: Fix block/btree node size defaults We're fixing option parsing in userspace, it now obeys OPT_SB_FIELD_SECTORS Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-24 09:50:35 -04:00
Alan Huang	5d361ae5af	bcachefs: Add missing smp_rmb() The smp_rmb() guarantees that reads from reservations.counter occur before accessing cur_entry_u64s. It's paired with the atomic64_try_cmpxchg in journal_entry_open. Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-24 09:50:35 -04:00
Kent Overstreet	4a4000b9a6	bcachefs: Kill JOURNAL_ERRORS() Convert these to standard error codes, which means we can pass them outside the journal code, they're easier to pass to tracepoints, etc. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-24 09:50:35 -04:00
Kent Overstreet	80be08cdb5	bcachefs: Filesystem discard option now propagates to devices The discard option is special, because it's both a filesystem and a device option. When set at the filesystem level, it's supposed to propagate to (if set persistently via sysfs) or override (if non persistently as a mount option) the devices - that now works correctly. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-24 09:50:35 -04:00
Kent Overstreet	8d7b7ac367	bcachefs: Device state is now a runtime option Other options can normally be set at runtime via sysfs, no reason for this one not to be as well - it just doesn't support the degraded flags argument this way, that requires the ioctl. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-24 09:50:35 -04:00
Kent Overstreet	7b84d934a1	bcachefs: Setting foreground_target at runtime now triggers rebalance Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-24 09:50:35 -04:00
Kent Overstreet	8b294a9b5c	bcachefs: Device options now use standard sysfs code Device options now use the common code for sysfs, and can superblock fields (in a struct bch_member). This replaces BCH_DEV_OPT_SETTERS(), which was weird and easy to miss. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-24 09:50:35 -04:00
Kent Overstreet	d2bad59255	bcachefs: Kill BCH_DEV_OPT_SETTERS() Previously, device options had their superblock option field listed separately, which was weird and easy to miss when defining options. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-24 09:50:35 -04:00
Alan Huang	dd7ae389ff	bcachefs: Remove spurious smp_mb() The smp_mb() is paired with nothing. Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-24 09:50:35 -04:00
Alan Huang	5cc0ab39fb	bcachefs: Fix incorrect state count atomic64_read(&j->seq) - j->seq_write_started == JOURNAL_STATE_BUF_NR is the condition in journal_entry_open where we return JOURNAL_ERR_max_open, so journal_cur_seq(j) - seq == JOURNAL_STATE_BUF_NR means that the buf corresponding to seq has started to write. Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-24 09:50:35 -04:00
Kent Overstreet	16a8d5d00b	bcachefs: Fix btree iter flags in data move Rebalance requires a not_extents iterator. This wasn't hit before because all_snapshots disables is_extents on snapshots btrees - but has no effect on the reflink btree. Reported-by: Maël Kerbiriou <mael.kerbiriou@free.fr> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-24 09:50:35 -04:00
Kent Overstreet	92c7789a9e	bcachefs: Validate bch_sb.offset field This was missed - but it needs to be correct for the superblock recovery tool that scans the start and end of the device for backup superblocks: we don't want to pick up superblocks that belong to a different partition that starts at a different offset. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-24 09:50:35 -04:00
Kent Overstreet	8bd875ae47	bcachefs: bch2_sb_validate() doesn't need bch_sb_handle Minor refactoring, so that bch2_sb_validate() can be used in the new userspace superblock recovery tool. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-24 09:50:35 -04:00
Kent Overstreet	5e67243ea6	bcachefs: Add missing random.h includes Fix build in userspace, and good hygiene. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-24 09:50:34 -04:00
Kent Overstreet	2eb985c549	bcachefs: Better incompat version/feature error messages If we can't mount because of an incompatibility, print what's supported and unsupported - to help solve PEBKAC issues. Reported-by: Roland Vet <vet.roland@protonmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-24 09:50:34 -04:00
Kent Overstreet	6aa446c05a	bcachefs: Fix offset_into_extent in data move path Fixes the following: [ 17.607394] kernel BUG at fs/bcachefs/reflink.c:261! [ 17.608316] Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI [ 17.608485] CPU: 0 UID: 0 PID: 564 Comm: bch-rebalance/3 Tainted: G OE 6.14.0-rc6-arch1-gfcb0bd9609d2 #7 0efd7a8f4a00afeb2c5fb6e7ecb1aec8ddcbb1e1 [ 17.608616] Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE [ 17.608736] Hardware name: Micro-Star International Co., Ltd. MS-7D75/MAG B650 TOMAHAWK WIFI (MS-7D75), BIOS 1.74 08/01/2023 [ 17.608855] RIP: 0010:bch2_lookup_indirect_extent+0x252/0x290 [bcachefs] [ 17.609006] Code: 00 00 00 00 e8 7f 51 f5 ff 89 c3 85 c0 74 52 48 8b 7d b0 4c 89 ee e8 4d 4b f4 ff 48 63 d3 48 89 d0 31 d2 e9 2e ff ff ff 0f 0b <0f> 0b 48 8b 7d b0 4c 89 ee 48 89 55 a8 e8 2c 4b f4 ff 4c 8b 55 a8 [ 17.609136] RSP: 0018:ffffa3714455f850 EFLAGS: 00010246 [ 17.609261] RAX: 0000000000000080 RBX: ffff895891098790 RCX: 0000000000000000 [ 17.609387] RDX: 0000000000000080 RSI: ffffa3714455fa90 RDI: ffff895889550000 [ 17.609511] RBP: ffffa3714455f8c0 R08: ffff895891098790 R09: 0000000000000001 [ 17.609637] R10: ffffa3714455f8d8 R11: ffffa3714455f950 R12: ffffa3714455fa58 [ 17.609763] R13: ffff895891098790 R14: ffffa3714455fa58 R15: ffff895889550000 [ 17.609888] FS: 0000000000000000(0000) GS:ffff896757c00000(0000) knlGS:0000000000000000 [ 17.610015] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 17.610143] CR2: 0000716b8cda2750 CR3: 0000000914e22000 CR4: 0000000000f50ef0 [ 17.610272] PKRU: 55555554 [ 17.610403] Call Trace: [ 17.610535] <TASK> [ 17.610662] ? __die_body.cold+0x19/0x27 [ 17.610791] ? die+0x2e/0x50 [ 17.610918] ? do_trap+0xca/0x110 [ 17.611049] ? do_error_trap+0x6a/0x90 [ 17.611178] ? bch2_lookup_indirect_extent+0x252/0x290 [bcachefs c42b95c23facdfe11d39755520127cd771dddec2] [ 17.611331] ? exc_invalid_op+0x50/0x70 [ 17.611468] ? 
bch2_lookup_indirect_extent+0x252/0x290 [bcachefs c42b95c23facdfe11d39755520127cd771dddec2] [ 17.611620] ? asm_exc_invalid_op+0x1a/0x20 [ 17.611757] ? bch2_lookup_indirect_extent+0x252/0x290 [bcachefs c42b95c23facdfe11d39755520127cd771dddec2] [ 17.611911] ? bch2_move_data_btree+0x58a/0x6c0 [bcachefs c42b95c23facdfe11d39755520127cd771dddec2] [ 17.612084] bch2_move_data_btree+0x58a/0x6c0 [bcachefs c42b95c23facdfe11d39755520127cd771dddec2] [ 17.612256] ? __pfx_rebalance_pred+0x10/0x10 [bcachefs c42b95c23facdfe11d39755520127cd771dddec2] [ 17.612431] ? bch2_move_extent+0x3d7/0x6e0 [bcachefs c42b95c23facdfe11d39755520127cd771dddec2] [ 17.612607] ? __bch2_move_data+0xea/0x200 [bcachefs c42b95c23facdfe11d39755520127cd771dddec2] [ 17.612782] __bch2_move_data+0xea/0x200 [bcachefs c42b95c23facdfe11d39755520127cd771dddec2] [ 17.612959] ? __pfx_rebalance_pred+0x10/0x10 [bcachefs c42b95c23facdfe11d39755520127cd771dddec2] [ 17.613149] do_rebalance+0x517/0x8d0 [bcachefs c42b95c23facdfe11d39755520127cd771dddec2] [ 17.613342] ? local_clock_noinstr+0xd/0xd0 [ 17.613518] ? local_clock+0x15/0x30 [ 17.613693] ? __bch2_trans_get+0x152/0x300 [bcachefs c42b95c23facdfe11d39755520127cd771dddec2] [ 17.613890] ? __pfx_bch2_rebalance_thread+0x10/0x10 [bcachefs c42b95c23facdfe11d39755520127cd771dddec2] [ 17.614090] bch2_rebalance_thread+0x66/0xb0 [bcachefs c42b95c23facdfe11d39755520127cd771dddec2] The offset_into_extent bit was copied from the read path, but it's unnecessary here, where we always want to read and move the entire indirect extent, and it causes the assertion pop - because we're using a non-extents iterator, which always points to the end of the reflink pointer. Reported-by: Maël Kerbiriou <mael.kerbiriou@free.fr> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-24 09:50:34 -04:00
Eric Biggers	71fbb0b86e	bcachefs: use sha256() instead of crypto_shash API Just use sha256() instead of the clunky crypto API. This is much simpler. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-24 09:50:34 -04:00
Eric Biggers	39abc73b59	bcachefs: Remove unnecessary softdeps on crc32c and crc64 Since bcachefs does not access crc32c and crc64 through the crypto API, there is no need to use module softdeps to ensure they are loaded. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-24 09:50:34 -04:00
Kent Overstreet	9b39835e93	bcachefs: #if 0 out (enable\|disable)_encryption() These weren't hooked up, but they probably should be - add some comments for context. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-24 09:50:34 -04:00
Kent Overstreet	9962cb7748	bcachefs: Improve can_write_extent() This fixes another "rebalance spinning and doing no work" issue; rebalance was reading extents it wanted to move, but then failing in bch2_write() -> bch2_alloc_sectors_start() due to being unable to allocate sufficient replicas. This was triggered by a user playing with the durability settings, the foreground device was an NVME device with durability=2, and originally he'd set the background device to durability=2 as well, but changed it back to 1 (the default) after seeing IO errors. That meant that with replicas=2, we want to move data off the NVME device which satisfies that constraint, but with a single durability=1 device on the background target there's no way to move the extent to that target while satisfying the "required replicas" constraint. The solution for now is for bch2_data_update_init() to check for this, and return an error - before kicking off the read. bch2_data_update_init() already had two different checks for "will we be able to write this extent", with partially duplicated code, so this patch combines and improves that logic. Additionally, we now always bail out and return an error if there's insufficient space on the destination target. Previously, we only did this for BCH_WRITE_alloc_nowait moves, because it might be the case that copygc just needs to free up space on the destination target. But we really shouldn't kick off a move if the destination is full, we can't currently distinguish between "really full" and "just need to wait for copygc", and if we are going to wait on copygc it'd be better to do that before kicking off the move. This will additionally fix "rebalance spinning" issues caused by a filesystem that has more data than can fit in background_target - which is a valid scenario, since we don't exclude foreground/cache devices when calculating filesystem capacity. 
Reported-by: Maël Kerbiriou <mael.kerbiriou@free.fr> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-24 09:50:34 -04:00
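The constraint described in the can_write_extent() commit can be sketched as a pre-check that runs before any read is issued: add up the durability of the destination devices that still have free space, and compare the total against the required replica count. The types and names below are hypothetical and the model is deliberately simplified. It assumes durability is additive across devices and ignores everything else the real check considers.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of a device in the destination target. */
struct target_dev {
	unsigned	durability;	/* replicas this device counts for */
	bool		has_free_space;
};

/*
 * Returns true if the devices in the target could hold an extent with
 * `replicas_required` replicas. Devices without free space contribute nothing.
 */
static bool target_can_hold(const struct target_dev *devs, size_t nr,
			    unsigned replicas_required)
{
	unsigned total = 0;

	for (size_t i = 0; i < nr; i++)
		if (devs[i].has_free_space)
			total += devs[i].durability;

	return total >= replicas_required;
}
```

In the scenario from the commit message, a single durability=1 device checked against replicas=2 fails this test, so the move can be refused up front instead of reading the extent and then failing at write time.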
Kent Overstreet	fb8a9a32cc	bcachefs: trace_io_move_write_fail Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-24 09:50:34 -04:00
Alan Huang	76bc6e51cd	bcachefs: Increase blacklist range Now there are 16 journal buffers, 8 is too small to be enough. Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-24 09:50:34 -04:00
Kent Overstreet	35de2abc22	bcachefs: __bch2_read() now takes a btree_trans Next patch will be checking if the extent we're reading from matches the IO failure we saw before marking the failure. For this to work, __bch2_read() needs to take the same transaction context that bch2_rbio_retry() uses to do that check. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-24 09:50:34 -04:00
Kent Overstreet	3fb8bacb14	bcachefs: BCH_READ_data_update -> bch_read_bio.data_update Read flags are codepath dependent and change as they're passed around, while the fields in rbio._state are mostly fixed properties of that particular object. Losing track of BCH_READ_data_update would be bad, and previously it was not obvious if it was always correctly set in the rbio, so this is a safety cleanup. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-24 09:50:34 -04:00
Uros Bizjak	8a3c392388	percpu: use TYPEOF_UNQUAL() in variable declarations Use TYPEOF_UNQUAL() to declare variables as a corresponding type without named address space qualifier to avoid "`__seg_gs' specified for auto variable `var'" errors. Link: https://lkml.kernel.org/r/20250127160709.80604-4-ubizjak@gmail.com Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Acked-by: Nadav Amit <nadav.amit@gmail.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Arnd Bergmann <arnd@arndb.de> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Will Deacon <will@kernel.org> Cc: Waiman Long <longman@redhat.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-03-16 22:05:53 -07:00
Kent Overstreet	be31e412ac	bcachefs: Checksum errors get additional retries It's possible for checksum errors to be transient - e.g. flakey controller or cable, thus we need additional retries (besides retrying from different replicas) before we can definitely return an error. This is particularly important for the next patch, which will allow the data move path to move extents with checksum errors - we don't want to accidentally introduce bitrot due to a transient error! - bch2_bkey_pick_read_device() is substantially reworked, and bch2_dev_io_failures is expanded to record more information about the type of failure (i.e. number of checksum errors). It now returns an error code that describes more precisely the reason for the failure - checksum error, io error, or offline device, instead of the previous generic "insufficient devices". This is important for the next patches that add poisoning, as we only want to poison extents when we've got real checksum errors (or perhaps IO errors?) - not because a device was offline. - Add a new option and superblock field for the number of checksum retries. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-16 13:47:55 -04:00
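A toy model of the retry policy described above, under stated assumptions: each replica is tried in turn, a checksum error earns that replica up to `csum_retries` extra attempts, and the caller gets back the most specific failure reason rather than a generic one. All names are hypothetical, the replica model is invented for the example, and the real bch2_bkey_pick_read_device() logic is considerably more involved (it may also order retries differently).

```c
#include <stdbool.h>
#include <stddef.h>

/* Ordered so that a larger value is a more specific failure. */
enum read_err {
	READ_OK = 0,
	READ_ERR_OFFLINE,
	READ_ERR_IO,
	READ_ERR_CSUM,
};

/* Toy replica: returns a checksum error `transient_csum_errs` times, then succeeds. */
struct replica {
	bool		offline;
	unsigned	transient_csum_errs;
};

static enum read_err read_replica(struct replica *r)
{
	if (r->offline)
		return READ_ERR_OFFLINE;
	if (r->transient_csum_errs) {
		r->transient_csum_errs--;
		return READ_ERR_CSUM;
	}
	return READ_OK;
}

/* Assumes nr > 0. Returns READ_OK, or the most specific error seen. */
static enum read_err read_with_csum_retries(struct replica *replicas, size_t nr,
					    unsigned csum_retries)
{
	enum read_err worst = READ_ERR_OFFLINE;

	for (size_t i = 0; i < nr; i++) {
		enum read_err ret;

		/* One initial attempt plus up to csum_retries extra attempts. */
		for (unsigned attempt = 0; ; attempt++) {
			ret = read_replica(&replicas[i]);
			if (ret != READ_ERR_CSUM || attempt == csum_retries)
				break;
		}

		if (ret == READ_OK)
			return READ_OK;
		if (ret > worst)
			worst = ret;
	}
	return worst;
}
```

Keeping "offline" distinct from "checksum error" in the return value is what lets a caller decide, as the commit message puts it, to poison an extent only for real checksum errors and not because a device happened to be offline.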
Kent Overstreet	ccba9029b0	bcachefs: Print message on successful read retry Users have been asking for this, and now that errors are returned to the top level read retry path - we can. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-16 13:47:55 -04:00
Kent Overstreet	de73677ff8	bcachefs: Return errors to top level bch2_rbio_retry() Next patch will be adding an additional retry loop for checksum errors, so that we can rule out transient errors before marking an extent as poisoned. Prerequisite to this is returning errors to bch2_rbio_retry(); this will also let us add a "successful retry" message. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-16 13:47:55 -04:00
Kent Overstreet	881b598ef1	bcachefs: BCH_ERR_data_read_buffer_too_small Now that the read path uses proper error codes, we can get rid of the weird rbio->hole signalling to the move path that the read didn't happen. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-16 13:47:55 -04:00
Kent Overstreet	f4b84bac20	bcachefs: Read error message now indicates if it was for an internal move Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-16 13:47:55 -04:00
Kent Overstreet	e75993b0bf	bcachefs: Fix BCH_ERR_data_read_csum_err_maybe_userspace in retry path When we do a read to a buffer that's mapped into userspace, it's possible to get a spurious checksum error if userspace was modifying the buffer at the same time. When we retry those, they have to be bounced before we know definitively whether we're reading corrupt data. But the retry path propagates read flags differently, so needs special handling. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-16 13:47:55 -04:00
Kent Overstreet	943f0cfb15	bcachefs: Convert read path to standard error codes Kill the READ_ERR/READ_RETRY/READ_RETRY_AVOID enums, and add standard error codes that describe precisely which error occurred. This is going to be used for the data move path, to move but poison extents with checksum errors. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-16 13:47:55 -04:00
Kent Overstreet	5a06cb8000	bcachefs: Debug params for data corruption injection dm-flakey is busted, and this is simpler anyways - this lets us test the checksum error retry paths Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-16 13:47:55 -04:00
Kent Overstreet	6d80fca9ef	bcachefs: Don't create bch_io_failures unless it's needed Only needed in retry path, no point in wasting stack space. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-16 13:47:55 -04:00
Kent Overstreet	9ec0089149	bcachefs: bch2_bkey_ptrs_rebalance_opts() Small optimization for bch2_bkey_sectors_need_rebalance() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-16 13:47:55 -04:00
Kent Overstreet	7c1e2a254f	bcachefs: Add a cond_resched() to btree cache teardown [12308.606480] watchdog: BUG: soft lockup - CPU#18 stuck for 26s! [umount:48479] [12308.606485] Modules linked in: bcachefs lz4hc_compress lz4_compress lz4_decompress sunrpc overlay nf_conntrack_netlink xt_nat xt_tcpudp veth xt_conntrack xt_MASQUERADE bridge stp llc xfrm_user ip6table_nat ip6table_filter ip6_tables iptable_nat xt_addrtype iptable_filter ip_tables x_tables nfnetlink_cttimeout nfnetlink openvswitch nsh nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 psample ext4 mbcache jbd2 nls_iso8859_1 nls_cp850 vfat fat binfmt_misc skx_edac_common nfit edac_core libnvdimm cbc encrypted_keys intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common ipmi_ssif x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm drivetemp rapl intel_cstate coretemp mgag200 i2c_algo_bit ixgbe drm_shmem_helper drm_kms_helper mdio_devres xfrm_algo mdio drm ptp intel_uncore mei_me efi_pstore evdev uas pl2303 pps_core libphy usb_storage usbserial lpc_ich mei drm_panel_orientation_quirks acpi_power_meter tiny_power_button ipmi_si mfd_core intel_pch_thermal acpi_tad acpi_ipmi ioatdma [12308.606541] ipmi_devintf ipmi_msghandler dca wmi button efivarfs polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 sha1_generic xhci_pci xhci_hcd aesni_intel ehci_pci ehci_hcd gf128mul crypto_simd cryptd usbcore hpwdt usb_common [12308.606557] CPU: 18 UID: 0 PID: 48479 Comm: umount Tainted: G L 6.14.0-rc6-x86_64-00159-ga09496a03e63 #1 [12308.606560] Tainted: [L]=SOFTLOCKUP [12308.606561] Hardware name: HPE ProLiant DL380 Gen10/ProLiant DL380 Gen10, BIOS U30 07/20/2023 [12308.606563] RIP: 0010:clear_page_erms+0x7/0x10 [12308.606570] Code: 48 89 47 38 48 8d 7f 40 75 d9 90 c3 cc cc cc cc 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 b9 00 10 00 00 31 c0 <f3> aa c3 cc cc cc cc 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 [12308.606572] RSP: 
0018:ffff9ed5b622fba0 EFLAGS: 00010246 [12308.606574] RAX: 0000000000000000 RBX: ffff90347fffe6c0 RCX: 00000000000004c0 [12308.606575] RDX: ffffe34ea9bec1c0 RSI: 00000000000405f0 RDI: ffff902eafb07b40 [12308.606576] RBP: ffff9ed5b622fbf0 R08: 0000000000000001 R09: 0000000000000006 [12308.606577] R10: 0000000000040001 R11: 0000000000000000 R12: ffffe34ea9bec000 [12308.606578] R13: 0000000000000000 R14: 0000000000000006 R15: ffffe34ea9bed000 [12308.606580] FS: 00007fe704ecfb68(0000) GS:ffff9053fea00000(0000) knlGS:0000000000000000 [12308.606581] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [12308.606582] CR2: 00007f18159068ae CR3: 00000001314d0005 CR4: 00000000007726f0 [12308.606583] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [12308.606584] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [12308.606584] PKRU: 55555554 [12308.606585] Call Trace: [12308.606587] <IRQ> [12308.606590] ? show_regs.cold+0x19/0x28 [12308.606595] ? watchdog_timer_fn.cold+0x3d/0x9d [12308.606598] ? __pfx_watchdog_timer_fn+0x10/0x10 [12308.606602] ? __hrtimer_run_queues+0x12e/0x250 [12308.606607] ? hrtimer_interrupt+0xfd/0x220 [12308.606609] ? __sysvec_apic_timer_interrupt+0x53/0xe0 [12308.606614] ? sysvec_apic_timer_interrupt+0x76/0xa0 [12308.606619] </IRQ> [12308.606620] <TASK> [12308.606620] ? asm_sysvec_apic_timer_interrupt+0x1b/0x20 [12308.606626] ? clear_page_erms+0x7/0x10 [12308.606628] ? 
__free_pages_ok+0x374/0x640 [12308.606633] free_frozen_pages+0x34/0x570 [12308.606636] __folio_put+0x87/0xe0 [12308.606641] free_large_kmalloc+0x70/0x80 [12308.606645] kfree+0x2f6/0x390 [12308.606648] kvfree+0x2d/0x40 [12308.606653] __btree_node_data_free+0xaf/0xf0 [bcachefs] [12308.606726] btree_node_data_free+0x6a/0x80 [bcachefs] [12308.606778] bch2_fs_btree_cache_exit+0x262/0x440 [bcachefs] [12308.606829] bch2_fs_release+0xe8/0x340 [bcachefs] [12308.606905] kobject_put+0x60/0xc0 [12308.606908] bch2_fs_free+0xdd/0x120 [bcachefs] [12308.606981] bch2_kill_sb+0x1e/0x30 [bcachefs] [12308.607051] deactivate_locked_super+0x32/0xb0 [12308.607055] deactivate_super+0x40/0x50 [12308.607057] cleanup_mnt+0xc3/0x160 [12308.607060] __cleanup_mnt+0x12/0x20 [12308.607062] task_work_run+0x5f/0xa0 [12308.607064] syscall_exit_to_user_mode+0x194/0x1a0 [12308.607066] do_syscall_64+0x67/0x170 [12308.607068] entry_SYSCALL_64_after_hwframe+0x76/0x7e [12308.607070] RIP: 0033:0x7fe704e66eed [12308.607073] Code: 08 49 89 ca b8 a5 00 00 00 0f 05 48 89 c7 e8 8a e6 ff ff 48 83 c4 Reported-by: Stijn Tintel <stijn@linux-ipv6.be> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-16 13:47:55 -04:00
Kent Overstreet	c991fbee8e	bcachefs: rebalance, copygc status also print stacktrace These are commonly needed when debugging, and printing them saves having to ask users to dig. Also, rebalance_status now includes pending rebalance work. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-16 13:47:55 -04:00
Kent Overstreet	8dc4514d58	bcachefs: Kill bch2_remount() Single caller, so inline it. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:03:16 -04:00
Kent Overstreet	a2e9e68746	bcachefs: Kill a bit of dead code Found with CC=clang W=1 Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:16 -04:00
Thorsten Blum	ff4cb203cc	bcachefs: Use max() to improve gen_after() Use max() to simplify gen_after() and improve its readability. Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:16 -04:00
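As a rough illustration of the gen_after() change described above, the sketch below assumes generation numbers are 8-bit counters compared with wraparound, and shows an open-coded clamp next to a max()-style form. The names gen_cmp_sketch, gen_after_old, gen_after_new and the local MAX_INT macro are illustrative stand-ins, not the kernel's definitions.

```c
#include <stdint.h>

/* Local stand-in for the kernel's max(); assumption for this sketch only. */
#define MAX_INT(a, b) ((a) > (b) ? (a) : (b))

/* Signed distance between two 8-bit generations, tolerant of wraparound. */
static inline int gen_cmp_sketch(uint8_t a, uint8_t b)
{
	return (int8_t)(a - b);
}

/* Open-coded form: how far a is after b, or 0 if it is not after. */
static inline int gen_after_old(uint8_t a, uint8_t b)
{
	int r = gen_cmp_sketch(a, b);

	return r > 0 ? r : 0;
}

/* Same result expressed as a clamp at zero using a max() helper. */
static inline int gen_after_new(uint8_t a, uint8_t b)
{
	return MAX_INT(0, gen_cmp_sketch(a, b));
}
```

Both forms return the same value for every pair of inputs; the max() form only states the clamp more directly.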
Thorsten Blum	c073ec6bec	bcachefs: Remove unnecessary byte allocation The extra byte is not used - remove it. Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:16 -04:00
Kent Overstreet	94373026d9	bcachefs: We no longer read stripes into memory at startup And the stripes heap gets deleted. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:16 -04:00
Kent Overstreet	434a3f2ffa	bcachefs: trace_stripe_create Add a simple tracepoint for stripe creation, we'll want to expand this later. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:16 -04:00
Kent Overstreet	6c336144b9	bcachefs: get_existing_stripe() uses new stripe lru Convert to the new persistent stripe LRU. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:16 -04:00
Kent Overstreet	039790cfb5	bcachefs: ec_stripe_delete() uses new stripe lru Convert to the new persistent stripe LRU. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:16 -04:00
Kent Overstreet	4b0fac4bed	bcachefs: journal write path comment Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:16 -04:00
Kent Overstreet	981e380144	bcachefs: Kick devices out after too many write IO errors We're improving our handling of write errors - we shouldn't write degraded data just because a write failed once, we should retry it (on other devices, if possible). But for this to work, we need to kick devices out when they're only returning errors - otherwise those retries will loop infinitely. This adds a configurable timeout - if writes are failing for too long, we'll set that device read-only. In the future we should also implement more tracking and another knob for an "allowed error rate", so that we can kick out drives that are acting "unhealthy". Another thing we'll want is a mechanism (likely in userspace) for bringing a device back in after a transient error - perhaps a cable was jiggled, or there was a controller reset. After transient errors we also need a mechanism to walk (from the journal) recent btree updates that weren't flushed to that device and treat them as "degraded", since unflushed data may well not have been written. Out of scope for this patch, but becoming relevant. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:16 -04:00
Kent Overstreet	d71e023376	bcachefs: Change BCH_MEMBER_STATE_failed semantics Previously, we wouldn't try to read at all from a failed device - that doesn't make much sense, the device may be unhealthy (perhaps taking longer than it should to service reads), but if it's our only option we should still try to read from it. Now, bch2_bkey_pick_read_device() will pick failed devices only if there are no non-failed replicas to read from. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:16 -04:00
Kent Overstreet	cf164a9106	bcachefs: bch2_dev_get_ioref() may now sleep The next patch implementing freezing will change bch2_dev_get_ioref() to sleep if a device is currently frozen. Add an annotation and fix the journal code accordingly. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:16 -04:00
Kent Overstreet	2efa8397ca	bcachefs: Fix btree_node_scan io_ref handling This was completely fubar; it's now simplified a bit as well. Note that for_each_online_member() takes and releases io_refs as it iterates, so we need to release that if we break. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:16 -04:00
Kent Overstreet	d5308203a8	bcachefs: Implement blk_holder_ops We can't use the standard fs_holder_ops because they're meant for single device filesystems - fs_bdev_mark_dead() in particular - and they assume that the blk_holder is the super_block, which also doesn't work for a multi device filesystem. These generally follow the standard fs_holder_ops; the locking/refcounting is a bit simplified because c->ro_ref suffices, and bch2_fs_bdev_mark_dead() is not necessarily shutting down the entire filesystem. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:16 -04:00
Kent Overstreet	1fdbe0b184	bcachefs: Make sure c->vfs_sb is set before starting fs This is necessary for the new blk_holder_ops, which want the vfs super_block available for synchronization. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:16 -04:00
Kent Overstreet	13fd6be102	bcachefs: Stash a pointer to the filesystem for blk_holder_ops Note that we open block devices before we allocate bch_fs, but once attached to a filesystem they will be closed before the bch_fs is torn down - so stashing a pointer without a refcount looks incorrect but it's not. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:16 -04:00
Kent Overstreet	b31c070407	bcachefs: Finish bch2_account_io_completion() conversions More prep work for automatically kicking devices out after too many IO errors. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:16 -04:00
Kent Overstreet	3526bca36b	bcachefs: bch2_account_io_completion() We need to start accounting successes for every IO, not just failures, so introduce a unified hook for io completion accounting and convert io_read.c. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:16 -04:00
Kent Overstreet	3480aecd5f	bcachefs: Fix read path io_ref handling We were using our device pointer after we'd released our ref to it. Unlikely to be a race that's practical to hit, since actually removing a member device is a whole process besides just taking it offline, but - needs to be fixed. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:16 -04:00
Kent Overstreet	7bc5808168	bcachefs: data_update now checks for extents that can't be moved If a device is ro or failed, we might not have anywhere to move a replica. Check for this early, before doing the read and attempting to write. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:15 -04:00
Kent Overstreet	fba513a9ee	bcachefs: give bch2_write_super() a proper error code Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:15 -04:00
Kent Overstreet	4a90675cfe	bcachefs: bcachefs_metadata_version_extent_flags This implements a new extent field bitflags that apply to the whole extent. There's been a couple things we've wanted this for in the past, but the immediate need is extent poisoning, to solve a rebalance issue. Unknown extent fields can't be parsed (we won't know their size, so we can't advance to the next field), so this is an incompat feature, and using it prevents the filesystem from being mounted by old versions. This also adds the BCH_EXTENT_poisoned flag; this indicates that the data is known to be bad (i.e. there was a checksum error, and we had to write a new checksum) and reads will return errors. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:15 -04:00
Kent Overstreet	6422bf8117	bcachefs: bch2_request_incompat_feature() now returns error code For future usage, we'll want a dedicated error code for better debugging. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:15 -04:00
Thorsten Blum	bafd41b435	bcachefs: Fix error type in bch2_alloc_v3_validate() Use error type alloc_v3_unpack_error in bch2_alloc_v3_validate(). Fixes: `b65db750e2` ("bcachefs: Enumerate fsck errors") Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:15 -04:00
Kent Overstreet	fb195fa753	bcachefs: BCH_SB_FEATURES_ALL includes BCH_FEATURE_incompat_verison_field These features are set on format and incompat upgrade. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:15 -04:00
Kent Overstreet	24d790a7da	bcachefs: sysfs internal/trigger_btree_updates Add a debug knob to manually trigger the btree updates worker. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:15 -04:00
Joshua Ashton	d37c14ac6f	bcachefs: bcachefs_metadata_version_casefolding This patch implements support for case-insensitive file name lookups in bcachefs. The implementation uses the same UTF-8 lowering and normalization that ext4 and f2fs are using. More information is provided in Documentation/bcachefs/casefolding.rst Compatibility notes: This uses the new versioning scheme for incompatible features where an incompatible feature is tied to a version number: the superblock says "we may use incompat features up to x" and "incompat features up to x are in use", disallowing mounting by previous versions. Additionally, an old-style incompat feature bit is used, so that kernels without utf8 casefolding support know if casefolding specifically is in use and they're allowed to mount. Signed-off-by: Joshua Ashton <joshua@froggi.es> Cc: André Almeida <andrealmeid@igalia.com> Cc: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:15 -04:00
Joshua Ashton	76872d46b7	bcachefs: Split out dirent alloc and name initialization Splits out the code that allocates the dirent and initializes the name to make things easier to implement casefolding in a future commit. Cc: André Almeida <andrealmeid@igalia.com> Cc: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Joshua Ashton <joshua@froggi.es> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:15 -04:00
Kent Overstreet	72f4edcf45	bcachefs: Kill dirent_occupied_size() in create path Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:15 -04:00
Kent Overstreet	68171d91ce	bcachefs: Kill dirent_occupied_size() in rename path Cc: Hongbo Li <lihongbo22@huawei.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:15 -04:00
Kent Overstreet	6756e385a5	bcachefs: bcachefs_metadata_version_stripe_lru Add a persistent LRU for stripes, ordered by "number of empty blocks", i.e. order in which we wish to reuse them. This will replace the in-memory stripes heap, so we can kill off reading stripes into memory at startup. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:15 -04:00
Kent Overstreet	88d961b518	bcachefs: bcachefs_metadata_version_stripe_backpointers Stripes now have backpointers. This is needed for proper scrub - stripe checksums need to be verified, separately from extents within the stripe, since a block may not be full of live extents but it's still needed for reconstruct. And this will be needed for (efficient) evacuate/repair paths. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:15 -04:00
Kent Overstreet	69bd8a9277	bcachefs: Advance bch_alloc.oldest_gen if no stale pointers Now that we've got cached backpointers and aren't leaving around stale pointers on bucket invalidation, we no longer need the periodic (rare) gc_gens - which recalculates each bucket's oldest gen to avoid wraparound. We can't delete that code because we've got to support existing filesystems that will still have stale pointers, but this gets rid of another scalability limit. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:14 -04:00
Kent Overstreet	942a418c7a	bcachefs: Invalidate cached data by backpointers If we don't leave stale pointers around, we won't have to deal with bucket gen wraparound. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:14 -04:00
Kent Overstreet	15800f3d4b	bcachefs: bcachefs_metadata_version_cached_backpointers Cached pointers now have backpointers. This means that we'll be able to kill cached pointers in the bucket_invalidate path, when invalidating/reusing buckets containing cached data, instead of leaving them around to be cleaned up by gc_gens garbage collection - which requires a full metadata scan. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:14 -04:00
Kent Overstreet	65bc7688b8	bcachefs: rework bch2_trans_commit_run_triggers() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:14 -04:00
Kent Overstreet	c7c07bf250	bcachefs: Better trigger ordering Transactional triggers need to run in a defined ordering, which is not quite the same as btree ID integer comparison. Previously this was handled in a hacky way in bch2_trans_commit_run_triggers(), since it was only the alloc btree that needed special handling, but upcoming stripe btree changes are going to require more ordering changes - so, define that ordering. Next patch will change the transaction commit path to use it. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:14 -04:00
Kent Overstreet	cc297dfb41	bcachefs: bch2_trigger_stripe_ptr() no longer uses ec_stripes_heap_lock Introduce per-entry locks, like with struct bucket - the stripes heap is going away. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:14 -04:00
Kent Overstreet	bc76ba70d2	bcachefs: Rework bch2_check_lru_key() It's now easier to add new LRU types. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:14 -04:00
Kent Overstreet	3aff608b86	bcachefs: decouple bch2_lru_check_set() from alloc btree Pass in the backpointer explicitly, instead of assuming 'referring_k' is an alloc key and calculating it. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:14 -04:00
Kent Overstreet	b8e37c1645	bcachefs: s/BCH_LRU_FRAGMENTATION_START/BCH_LRU_BUCKET_FRAGMENTATION/ FRAGMENTATION_START was incorrect, there's currently only one fragmentation LRU (at the end of the reserved bits for LRU type), and we're getting ready to add a stripe fragmentation lru - so give it a better name. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:14 -04:00
Kent Overstreet	e130496707	bcachefs: bch2_lru_change() checks for no-op Minor cleanup, no reason for the caller to have to do this. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:14 -04:00
Kent Overstreet	cb87f623c1	bcachefs: minor journal errcode cleanup Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:14 -04:00
Kent Overstreet	1ccbcd3205	bcachefs: bch2_write_op_error() now prints info about data update A user has been seeing the "error verifying existing checksum while rewriting existing data (memory corruption?)" error. This generally indicates a hardware issue (and that may be the case here), but it might also indicate a bug, in which case we need more information to look for patterns. Reported-by: Roland Vet <vet.roland@protonmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:14 -04:00
Kent Overstreet	3faa4647a0	bcachefs: metadata_target is not an inode option This option only applies filesystem wide. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:14 -04:00
Andreas Gruenbacher	f27614652c	bcachefs: eytzinger1_{next,prev} cleanup The eytzinger code was previously relying on the following wrap-around properties and their "eytzinger0" equivalents: eytzinger1_prev(0, size) == eytzinger1_last(size) eytzinger1_next(0, size) == eytzinger1_first(size) However, these properties are no longer relied upon and no longer necessary, so remove the corresponding asserts and forbid the use of eytzinger1_prev(0, size) and eytzinger1_next(0, size). This allows to further simplify the code in eytzinger1_next() and eytzinger1_prev(): where the left shifting happens, eytzinger1_next() is trying to move i to the lowest child on the left, which is equivalent to doubling i until the next doubling would cause it to be greater than size. This is implemented by shifting i to the left so that the most significant bits align and then shifting i to the right by one if the result is greater than size. Likewise, eytzinger1_prev() is trying to move to the lowest child on the right; the same applies here. The 1-offset in (size - 1) in eytzinger1_next() isn't needed at all, but the equivalent offset in eytzinger1_prev() is surprisingly needed to preserve the 'eytzinger1_prev(0, size) == eytzinger1_last(size)' property. However, since we no longer support that property, we can get rid of these offsets as well. This saves one addition in each function and makes the code less confusing. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:14 -04:00
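The in-order successor step described in the eytzinger1_{next,prev} commit above can be sketched in plain C. This is a loop-based illustration, not the kernel's code: it assumes the usual 1-based eytzinger layout (root at index 1, children of i at 2i and 2i+1), uses explicit doubling rather than the shift trick the commit describes, and the function name eytzinger1_next_sketch is hypothetical. As in the commit, calling it with i == 0 is not supported; a return value of 0 means there is no successor.

```c
/*
 * In-order successor in a 1-based eytzinger array of `size` elements.
 * Precondition (as in the commit text): 1 <= i <= size.
 */
static unsigned eytzinger1_next_sketch(unsigned i, unsigned size)
{
	if (2 * i + 1 <= size) {
		/* Step to the right child, then descend to its leftmost
		 * descendant: keep doubling while the result stays in range. */
		i = 2 * i + 1;
		while (2 * i <= size)
			i *= 2;
	} else {
		/* No right child: climb while we are a right child (odd
		 * index), then climb once more to the parent we are left of. */
		while (i & 1)
			i >>= 1;
		i >>= 1;
	}

	return i;
}
```

For size 7, starting from the leftmost element 4, repeated calls visit 4, 2, 5, 1, 6, 3, 7 and then return 0.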
Andreas Gruenbacher	68eb4c5fea	bcachefs: convert eytzinger sort to be 1-based (2) In this second step, transform the eytzinger indexes i, j, and k in eytzinger1_sort_r() from 0-based to 1-based. This step looks a bit messy, but the resulting code is slightly better. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:14 -04:00
Andreas Gruenbacher	3ff0dd28d6	bcachefs: convert eytzinger sort to be 1-based (1) In this first step, convert the eytzinger sort functions to use 1-based primitives. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:14 -04:00
Andreas Gruenbacher	3849bcab4d	bcachefs: convert eytzinger0_find to be 1-based Several of the algorithms on eytzinger trees are implemented in terms of the eytzinger0 primitives. However, those algorithms can just as easily be expressed in terms of the eytzinger1 primitives, and that leads to better and easier to understand code. Start by converting eytzinger0_find(). Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:14 -04:00
Andreas Gruenbacher	956032edd2	bcachefs: Add eytzinger0_find self test Function eytzinger0_find() isn't currently covered, so add a self test. We can rely on eytzinger0_find_le() here because it is being tested independently. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:14 -04:00
Andreas Gruenbacher	63ce189b00	bcachefs: add eytzinger0_find_ge self test Add an eytzinger0_find_ge() self test similar to eytzinger0_find_gt(). Note that this test requires eytzinger0_find_ge() to return the first matching element in the array in case of duplicates. To prevent bisection errors, we only add this test after strengthening the original implementation (see the previous commit). Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:14 -04:00
Andreas Gruenbacher	11223d0e7b	bcachefs: implement eytzinger0_find_ge directly Implement eytzinger0_find_ge() directly instead of implementing it in terms of eytzinger0_find_le() and adjusting the result. This turns eytzinger0_find_ge() into a minimum search, so when there are duplicate elements, the result of eytzinger0_find_ge() will now always point at the first matching element. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:14 -04:00
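To illustrate why a direct "minimum search" lands on the first of several duplicates, here is a small sketch over a 1-based eytzinger array of ints. It is not the kernel implementation: the function name find_ge_sketch, the explicit `best` variable, and the sample array are assumptions made for this example, and the real code works on generic elements with a comparison callback.

```c
/*
 * Sorted values 1,3,3,5,7,9,11 laid out in 1-based eytzinger order.
 * Slot 0 is unused padding so that indexes match the tree positions.
 */
static const int sample_eytz[8] = { 0, 5, 3, 9, 1, 3, 7, 11 };

/*
 * Return the index of the smallest in-order element >= search,
 * or 0 if every element is smaller than search.
 */
static unsigned find_ge_sketch(const int *a, unsigned size, int search)
{
	unsigned n = 1, best = 0;

	while (n <= size) {
		if (a[n] >= search) {
			/* Candidate found; anything earlier in order that
			 * also matches must be in the left subtree. */
			best = n;
			n = 2 * n;
		} else {
			/* Too small: all matches are to the right. */
			n = 2 * n + 1;
		}
	}

	return best;
}
```

Searching the sample for 3 returns index 2 rather than index 5: both hold the value 3, but index 2 comes first in in-order traversal.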
Andreas Gruenbacher	2182f29545	bcachefs: implement eytzinger0_find_gt directly Instead of implementing eytzinger0_find_gt() in terms of eytzinger0_find_le() and adjusting the result, implement it directly. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:13 -04:00
Andreas Gruenbacher	d7cd33f7ef	bcachefs: add eytzinger0_find_gt self test Add an eytzinger0_find_gt() self test similar to eytzinger0_find_le(). Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:13 -04:00
Andreas Gruenbacher	d384dada0e	bcachefs: simplify eytzinger0_find_le Replace the over-complicated implementation of eytzinger0_find_le() by an equivalent, simpler version. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:13 -04:00
Andreas Gruenbacher	d148d804f2	bcachefs: convert eytzinger0_find_le to be 1-based eytzinger0_find_le() is also easy to convert to 1-based eytzinger (but see the next commit). Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:13 -04:00
Andreas Gruenbacher	c722b818a2	bcachefs: improve eytzinger0_find_le self test Rename eytzinger0_find_test_val() to eytzinger0_find_test_le() and add a new eytzinger0_find_test_val() wrapper that calls it. We have already established that the array is sorted in eytzinger order, so we can use the eytzinger iterator functions and check the boundary conditions to verify the result of eytzinger0_find_le(). Only scan the entire array if we get an incorrect result. When we need to scan, use eytzinger0_for_each_prev() so that we'll stop at the highest matching element in the array in case there are duplicates; going through the array linearly wouldn't give us that. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:13 -04:00
Andreas Gruenbacher	dc5ceaaad8	bcachefs: add eytzinger0_for_each_prev Add an eytzinger0_for_each_prev() macro for iterating through an eytzinger array in reverse. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:13 -04:00
Andreas Gruenbacher	e8a0966ffa	bcachefs: eytzinger0_find_test improvement In eytzinger0_find_test(), remember the smallest element seen so far instead of comparing adjacent array elements. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:13 -04:00
Andreas Gruenbacher	ec70103f9b	bcachefs: eytzinger[01]_test improvement In eytzinger[01]_test(), make sure that eytzinger[01]_for_each() iterates over all array elements. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:13 -04:00
Andreas Gruenbacher	0766f5599c	bcachefs: eytzinger self tests: fix cmp_u16 typo Fix an obvious typo in cmp_u16(). Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:13 -04:00
Andreas Gruenbacher	0ede49212a	bcachefs: eytzinger self tests: missing newline termination pr_info() format strings need to be newline terminated. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:13 -04:00
Andreas Gruenbacher	217ad1d7c7	bcachefs: eytzinger self tests: loop cleanups The iterator variable of eytzinger0_for_each() loops has been changed to be locally scoped at some point, so remove variables defined outside the loop that are now unused. In addition and for clarity, use a different variable inside those loops where an outside variable would be shadowed. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:13 -04:00
Andreas Gruenbacher	d54b82ecc4	bcachefs: EYTZINGER_DEBUG fix When EYTZINGER_DEBUG is defined, <linux/bug.h> needs to be included. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:13 -04:00
Andreas Gruenbacher	f7f9be0238	bcachefs: bch2_blacklist_entries_gc cleanup Use an eytzinger0_for_each() loop here. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:13 -04:00
Kent Overstreet	34a493089a	bcachefs: bch2_bkey_ptr_data_type() now correctly returns cached for cached ptrs Necessary for adding backpointers for cached pointers. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:13 -04:00
Kent Overstreet	fd49882f12	bcachefs: Add time_stat for btree writes We have other metadata IO types covered, this was missing. Note: this includes the time until completion, i.e. including parent pointer update. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:13 -04:00
Kent Overstreet	b7f648e2ec	bcachefs: Add comment explaining why asserts in invalidate_one_bucket() are impossible Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:13 -04:00
Kent Overstreet	7606fb4d26	bcachefs: Ignore backpointers to stripes in ec_stripe_update_extents() Prep work for stripe backpointers: this path previously would get very confused at being asked to process (remove redundant replicas) stripes. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:13 -04:00
Kent Overstreet	898bda5b72	bcachefs: Increase JOURNAL_BUF_NR Increase journal pipelining. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:13 -04:00
Kent Overstreet	35282ce9e8	bcachefs: Free journal bufs when not in use Since we're increasing the number of 'struct journal_bufs', we don't want them all permanently holding onto buffers for the journal data - that'd be 16 * 2MB = 32MB, or potentially more. Add a single-element mempool (open coded, since buffer size varies), this also means we won't be hitting the memory allocator every time we open and close a journal entry/buffer. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:13 -04:00
Kent Overstreet	2e853fdbc7	bcachefs: Don't touch journal_buf->data->seq in journal_res_get This is a small optimization, reducing the number of cachelines we touch in the fast path - and it's also necessary for the next patch that increases JOURNAL_BUF_NR. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:13 -04:00
Kent Overstreet	199a3578ed	bcachefs: Kill journal_res.idx More dead code. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:12 -04:00
Kent Overstreet	c2be81d48a	bcachefs: Kill journal_res_state.unwritten_idx Dead code Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:12 -04:00
Kent Overstreet	3eccc02035	bcachefs: add progress indicator to check_allocations Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:12 -04:00
Kent Overstreet	491eda6394	bcachefs: Add a progress indicator to bch2_dev_data_drop() This code needs quite a bit of work: we don't want to be walking all metadata in the filesystem, we should just be walking backpointers, and it should be switched to a data ioctl that can report progress via a file descriptor, not the system console. But that'll take more work - before we can safely walk only backpointers we need to change device add to not reuse device indexes, since with that change accounting being wrong introduces the possibility of removing a device that still has pointers. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:12 -04:00
Kent Overstreet	baabeb4997	bcachefs: Factor out progress.[ch] the backpointers code has progress indicators; these aren't great, since they print to the dmesg console and we much prefer to have progress indicators reporting to a specific userspace program so they're not spamming the system console. But not all codepaths that need progress indicators support that yet, and we don't want users to think "this is hung". Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:12 -04:00
Kent Overstreet	06284963e3	bcachefs: bch2_inum_offset_err_msg_trans() no longer handles transaction restarts we're starting to use error messages with paths in fsck_errors(), where we do not want nested transaction restart handling, so let's prepare for that. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:12 -04:00
Kent Overstreet	45f0e6c838	bcachefs: bch2_indirect_extent_missing_error() prints path, not just inode number We want all error messages converted to print paths, not just inode numbers - users want this information, and it speeds up debugging too. Auditing and converting all error messages is going to be a big project, so for the moment we're just doing this incrementally. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:12 -04:00
Kent Overstreet	e63cf203d7	bcachefs: Convert migrate to move_data_phys() Iterating over backpointers on a specific device is potentially much cheaper than walking all filesystem data. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:12 -04:00
Kent Overstreet	157ea58341	bcachefs: Read/move path counter work Reorganize counters a bit, grouping related counters together. New counters: - io_read_inline - io_read_hole Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:12 -04:00
Alan Huang	7d8321a286	bcachefs: Fix subtraction underflow When ancestor is less than IS_ANCESTOR_BITMAP, we would get an incorrect result. Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:12 -04:00
Kent Overstreet	f269ae55d2	bcachefs: Scrub Add a new data op to walk all data and metadata in a filesystem, checking if it can be read successfully, and on error repairing from another copy if possible. - New helper: bch2_dev_idx_is_online(), so that we can bail out and report to userspace when we're unable to scrub because the device is offline - data_update_opts, which controls the data move path, now understands scrub: data is only read, not written. The read path is responsible for rewriting on read error, as with other reads. - scrub_pred skips data extents that don't have checksums - bch_ioctl_data has a new scrub member, which has a data_types field for data types to check - i.e. all data types, or only metadata. - Add new entries to bch_move_stats so that we can report numbers for corrected and uncorrected errors - Add a new enum to bch_ioctl_data_event for explicitly reporting completion and return code (i.e. device offline) Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:12 -04:00
Kent Overstreet	3e2ad29865	bcachefs: bch2_btree_node_scrub() Add a function for scrubbing btree nodes - reading them in, and kicking off a rewrite if there's an error. The btree_node_read_done() checks have to be duplicated because we're not using a pointer to a struct btree - the btree node might already be in cache, and we need to check a specific replica, which might not be the one we previously read from. This will be used in the next patch implementing high-level scrub. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:12 -04:00
Kent Overstreet	ca24130ee4	bcachefs: bch2_bkey_pick_read_device() can now specify a device To be used for scrub, where we want the read to come from a specific device. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:12 -04:00
Kent Overstreet	2a2f7aaa8d	bcachefs: __bch2_move_data_phys() now uses bch2_btree_node_rewrite_pos() Kill most of the separate logic for btree nodes. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:12 -04:00
Kent Overstreet	987fdbdb40	bcachefs: bch2_move_data_phys() Add a more general version of bch2_evacuate_bucket - to be used for scrub. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:12 -04:00
Kent Overstreet	12188c9e2b	bcachefs: bch2_btree_node_rewrite_pos() Add a new helper for rewriting a btree node given just the key, not a pointer to the node itself. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:12 -04:00
Kent Overstreet	ca16fa6b86	bcachefs: backpointer_get_key() doesn't pull in btree node We may not need to pull in a btree node when walking backpointers - don't do so unnecessarily when using backpointer_get_key(). It'll still fall back to backpointer_get_node() in a few situations, including btree roots (where an iterator can't point at just the key), and races due to the interior update path not having deleted a backpointer to an old node yet. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:12 -04:00
Kent Overstreet	dff6de9518	bcachefs: Internal reads can now correct errors Rework the read path so that BCH_READ_NODECODE reads now also self-heal after a read error and a successful retry - prerequisite for scrub. - __bch2_read_endio() now handles a read that's both BCH_READ_NODECODE and a bounce. Normally, we don't want a BCH_READ_NODECODE read to ever allocate a split bch_read_bio: we want to maintain the relationship between the bch_read_bio and the data_update it's embedded in. But correcting read errors requires allocating a split/bounce rbio that's embedded in a promote_op. We do still have a 1-1 relationship, i.e. we only allocate a single split/bounce if it's a BCH_READ_NODECODE, so things hopefully don't get too crazy. - __bch2_read_extent() now is allowed to allocate the promote_op for rewriting after a failed read, even if it's BCH_READ_NODECODE. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:12 -04:00
Kent Overstreet	7b1d655106	bcachefs: Don't self-heal if a data update is already rewriting Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:12 -04:00
Kent Overstreet	4dfb76e0ad	bcachefs: Don't start promotes from bch2_rbio_free() we don't want to block completion of the read - starting a promote calls into the write path, which will block. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:12 -04:00
Kent Overstreet	7e9ed60f5f	bcachefs: Bail out early on alloc_nowait data updates If a data update doesn't want to block on allocations (promotes, self healing on read error) - check if the allocation would fail before kicking off the data update and calling into the write path. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:11 -04:00
Kent Overstreet	c37d42a0e2	bcachefs: Rework init order in bch2_data_update_init() Initialize the write op first, so that in the next patch we can check if the allocator would block (for BCH_WRITE_alloc_nowait ops) and bail out before taking nocow locks/dev refs. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:11 -04:00
Kent Overstreet	29ad31c780	bcachefs: Self healing writes are BCH_WRITE_alloc_nowait If a drive is failing and we're moving data off of it, we can't necessarily depend on capacity/disk reservation calculations to avoid deadlocking/blocking on the allocator. And, we don't want to queue up infinite self healing moves anyways. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:11 -04:00
Kent Overstreet	8ff92a9e4e	bcachefs: Promotes should use BCH_WRITE_only_specified_devs Promotes, like most other internal moves, should only go to the specified target and not fall back to allocating from the full filesystem. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:11 -04:00
Kent Overstreet	d0148e7169	bcachefs: Be stricter in bch2_read_retry_nodecode() Now that data_update embeds bch_read_bio, BCH_READ_NODECODE means that the read is embedded in a data_update - and we can check in the retry path if the extent has changed and bail out. This likely fixes some subtle bugs with read errors and data moves. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:11 -04:00
Kent Overstreet	6f7111f820	bcachefs: cleanup redundant code around data_update_op initialization Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:11 -04:00
Kent Overstreet	536d789781	bcachefs: bch2_update_unwritten_extent() no longer depends on wbio Prep work for improving bch2_data_update_init(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:11 -04:00
Kent Overstreet	8f97793d67	bcachefs: promote_op uses embedded bch_read_bio Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:11 -04:00
Kent Overstreet	a70bd97630	bcachefs: data_update now embeds bch_read_bio Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:11 -04:00
Kent Overstreet	dfa204b169	bcachefs: rbio_init() cleanup Move more initialization to rbio_init(), to assist in further cleanups. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:11 -04:00
Kent Overstreet	0f856b7228	bcachefs: rbio_init_fragment() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:11 -04:00
Kent Overstreet	14e2523fc5	bcachefs: Rename BCH_WRITE flags for consistency with other x-macros enums The uppercase/lowercase style is nice for making the namespace explicit. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:11 -04:00
Kent Overstreet	9157b3ddfb	bcachefs: x-macroize BCH_READ flags Will be adding a bch2_read_bio_to_text(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:11 -04:00
Kent Overstreet	9f37016cb2	bcachefs: kill bch_read_bio.devs_have Dead code. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:11 -04:00
Kent Overstreet	3075e68d26	bcachefs: bch2_data_update_inflight_to_text() Add a new helper for bch2_moving_ctxt_to_text(), which may be used to debug if moving_ios are getting stuck. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:11 -04:00
Kent Overstreet	50ca857457	bcachefs: BCH_IOCTL_QUERY_COUNTERS Add an ioctl for querying counters, the same ones provided in /sys/fs/bcachefs/<uuid>/counters/, but more suitable for a 'bcachefs top' command. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:11 -04:00
Kent Overstreet	5ee760f667	bcachefs: BCH_COUNTER_bucket_discard_fast Add a separate counter for fastpath bucket discards, which don't require a journal flush. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:11 -04:00
Kent Overstreet	bbd804f2ad	bcachefs: enum bch_persistent_counters_stable Persistent counters, like recovery passes, include a stable enum in their definition - but this was never correctly plumbed. This allows us to add new counters and properly organize them with a non-stable "presentation order", which can also be used in userspace by the new 'bcachefs fs top' tool. Fortunately, since we haven't yet added any new counters where presentation order ID doesn't match stable ID, this won't cause any reordering issues. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:11 -04:00
Kent Overstreet	999cc1bb68	bcachefs: Separate running/runnable in wp stats We've got per-writepoint statistics to see how well the writepoint index update threads are pipelining; this separates running vs. runnable so we can see at a glance if they're blocking. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:11 -04:00
Kent Overstreet	78c9c6f6cd	bcachefs: Move write_points to debugfs this was hitting the sysfs 4k limit Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:11 -04:00
Kent Overstreet	55a132c37a	bcachefs: Don't inc io_(read|write) counters for moves This makes 'bcachefs fs top' more useful; we can now see at a glance whether the IO to the device is being done for user reads/writes, or copygc/rebalance. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:10 -04:00
Kent Overstreet	e5a63ad343	bcachefs: Fix missing increment of move_extent_write counter Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:10 -04:00
Kent Overstreet	c3c9957c81	bcachefs: check_bp_exists() check for backpointers for stale pointers Early version of 'bcachefs_metadata_version_cached_backpointers' was creating backpointers for stale cached pointers - whoops. Now we have to repair those. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:10 -04:00
Kent Overstreet	2deae55804	bcachefs: btree_node_(rewrite|update_key) cleanup Factor out get_iter_to_node() and use it for btree_node_rewrite_get_iter(), to be used for fixing btree node write error behaviour. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:10 -04:00
Kent Overstreet	be212d86b1	bcachefs: bs > ps support bcachefs removed most PAGE_SIZE references long ago, so this is easy; only readpage_bio_extend() has to be tweaked to respect the minimum order. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 19:45:57 -04:00
Kent Overstreet	1a2b74d0a2	bcachefs: fix build on 32 bit in get_random_u64_below() bare 64 bit divides not allowed, whoops arm-linux-gnueabi-ld: drivers/char/random.o: in function `__get_random_u64_below': drivers/char/random.c:602:(.text+0xc70): undefined reference to `__aeabi_uldivmod' Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 19:45:54 -04:00
Kent Overstreet	90fd9ad5b0	bcachefs: Change btree wb assert to runtime error We just had a report of the assert for "btree in write buffer for non-write buffer btree" popping during the 6.14 upgrade. - 150TB filesystem, after a reboot the upgrade was able to continue from where it left off, so no major damage. But with 6.14 about to come out we want to get this tracked down asap, and need more data if other users hit this. Convert the BUG_ON() to an emergency read-only, and print out btree, the key itself, and stack trace from the original write buffer update (which did not have this check before). Reported-by: Stijn Tintel <stijn@linux-ipv6.be> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 10:25:25 -04:00
Kent Overstreet	9c18ea7ffe	bcachefs: bch2_get_random_u64_below() steal the (clever) algorithm from get_random_u32_below() this fixes a bug where we were passing roundup_pow_of_two() a 64 bit number - we're squaring device latencies now: [ +1.681698] ------------[ cut here ]------------ [ +0.000010] UBSAN: shift-out-of-bounds in ./include/linux/log2.h:57:13 [ +0.000011] shift exponent 64 is too large for 64-bit type 'long unsigned int' [ +0.000011] CPU: 1 UID: 0 PID: 196 Comm: kworker/u32:13 Not tainted 6.14.0-rc6-dave+ #10 [ +0.000012] Hardware name: ASUS System Product Name/PRIME B460I-PLUS, BIOS 1301 07/13/2021 [ +0.000005] Workqueue: events_unbound __bch2_read_endio [bcachefs] [ +0.000354] Call Trace: [ +0.000005] <TASK> [ +0.000007] dump_stack_lvl+0x5d/0x80 [ +0.000018] ubsan_epilogue+0x5/0x30 [ +0.000008] __ubsan_handle_shift_out_of_bounds.cold+0x61/0xe6 [ +0.000011] bch2_rand_range.cold+0x17/0x20 [bcachefs] [ +0.000231] bch2_bkey_pick_read_device+0x547/0x920 [bcachefs] [ +0.000229] __bch2_read_extent+0x1e4/0x18e0 [bcachefs] [ +0.000241] ? bch2_btree_iter_peek_slot+0x3df/0x800 [bcachefs] [ +0.000180] ? bch2_read_retry_nodecode+0x270/0x330 [bcachefs] [ +0.000230] bch2_read_retry_nodecode+0x270/0x330 [bcachefs] [ +0.000230] bch2_rbio_retry+0x1fa/0x600 [bcachefs] [ +0.000224] ? bch2_printbuf_make_room+0x71/0xb0 [bcachefs] [ +0.000243] ? bch2_read_csum_err+0x4a4/0x610 [bcachefs] [ +0.000278] bch2_read_csum_err+0x4a4/0x610 [bcachefs] [ +0.000227] ? __bch2_read_endio+0x58b/0x870 [bcachefs] [ +0.000220] __bch2_read_endio+0x58b/0x870 [bcachefs] [ +0.000268] ? try_to_wake_up+0x31c/0x7f0 [ +0.000011] ? process_one_work+0x176/0x330 [ +0.000008] process_one_work+0x176/0x330 [ +0.000008] worker_thread+0x252/0x390 [ +0.000008] ? __pfx_worker_thread+0x10/0x10 [ +0.000006] kthread+0xec/0x230 [ +0.000011] ? __pfx_kthread+0x10/0x10 [ +0.000009] ret_from_fork+0x31/0x50 [ +0.000009] ? 
__pfx_kthread+0x10/0x10 [ +0.000008] ret_from_fork_asm+0x1a/0x30 [ +0.000012] </TASK> [ +0.000046] ---[ end trace ]--- Reported-by: Roland Vet <vet.roland@protonmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-13 12:40:22 -04:00
Kent Overstreet	69a5a13a22	bcachefs: target_congested -> get_random_u32_below() get_random_u32_below() has a better algorithm than bch2_rand_range(), it just didn't exist at the time. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-13 12:39:21 -04:00
Kent Overstreet	3bcde88d38	bcachefs: fix tiny leak in bch2_dev_add() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-13 00:23:19 -04:00
Kent Overstreet	dbac8feb23	bcachefs: Make sure trans is unlocked when submitting read IO We were still using the trans after the unlock, leading to this bug in the retry path: 00255 ------------[ cut here ]------------ 00255 kernel BUG at fs/bcachefs/btree_iter.c:3348! 00255 Internal error: Oops - BUG: 00000000f2000800 [#1] SMP 00255 bcachefs (0ca38fe8-0a26-41f9-9b5d-6a27796c7803): /fiotest offset 86048768: no device to read from: 00255 u64s 8 type extent 4098:168192:U32_MAX len 128 ver 0: durability: 0 crc: c_size 128 size 128 offset 0 nonce 0 csum crc32c 0:8040a368 compress none ec: idx 83 block 1 ptr: 0:302:128 gen 0 00255 bcachefs (0ca38fe8-0a26-41f9-9b5d-6a27796c7803): /fiotest offset 85983232: no device to read from: 00255 u64s 8 type extent 4098:168064:U32_MAX len 128 ver 0: durability: 0 crc: c_size 128 size 128 offset 0 nonce 0 csum crc32c 0:43311336 compress none ec: idx 83 block 1 ptr: 0:302:0 gen 0 00255 Modules linked in: 00255 CPU: 5 UID: 0 PID: 304 Comm: kworker/u70:2 Not tainted 6.14.0-rc6-ktest-g526aae23d67d #16040 00255 Hardware name: linux,dummy-virt (DT) 00255 Workqueue: events_unbound bch2_rbio_retry 00255 pstate: 60001005 (nZCv daif -PAN -UAO -TCO -DIT +SSBS BTYPE=--) 00255 pc : __bch2_trans_get+0x100/0x378 00255 lr : __bch2_trans_get+0xa0/0x378 00255 sp : ffffff80c865b760 00255 x29: ffffff80c865b760 x28: 0000000000000000 x27: ffffff80d76ed880 00255 x26: 0000000000000018 x25: 0000000000000000 x24: ffffff80f4ec3760 00255 x23: ffffff80f4010140 x22: 0000000000000056 x21: ffffff80f4ec0000 00255 x20: ffffff80f4ec3788 x19: ffffff80d75f8000 x18: 00000000ffffffff 00255 x17: 2065707974203820 x16: 7334367520200a3a x15: 0000000000000008 00255 x14: 0000000000000001 x13: 0000000000000100 x12: 0000000000000006 00255 x11: ffffffc080b47a40 x10: 0000000000000000 x9 : ffffffc08038dea8 00255 x8 : ffffff80d75fc018 x7 : 0000000000000000 x6 : 0000000000003788 00255 x5 : 0000000000003760 x4 : ffffff80c922de80 x3 : ffffff80f18f0000 00255 x2 : ffffff80c922de80 x1 : 
0000000000000130 x0 : 0000000000000006 00255 Call trace: 00255 __bch2_trans_get+0x100/0x378 (P) 00255 bch2_read_io_err+0x98/0x260 00255 bch2_read_endio+0xb8/0x2d0 00255 __bch2_read_extent+0xce8/0xfe0 00255 __bch2_read+0x2a8/0x978 00255 bch2_rbio_retry+0x188/0x318 00255 process_one_work+0x154/0x390 00255 worker_thread+0x20c/0x3b8 00255 kthread+0xf0/0x1b0 00255 ret_from_fork+0x10/0x20 00255 Code: 6b01001f 54ffff01 79408460 3617fec0 (d4210000) 00255 ---[ end trace 0000000000000000 ]--- 00255 Kernel panic - not syncing: Oops - BUG: Fatal exception 00255 SMP: stopping secondary CPUs 00255 Kernel Offset: disabled 00255 CPU features: 0x000,00000070,00000010,8240500b 00255 Memory Limit: none 00255 ---[ end Kernel panic - not syncing: Oops - BUG: Fatal exception ]--- Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-11 11:21:44 -04:00
Roxana Nicolescu	58517f4df8	bcachefs: Initialize from_inode members for bch_io_opts When there is no inode source, all "from_inode" members in the structure bch_io_opts should be set false. Fixes: `7a7c43a0c1` ("bcachefs: Add bch_io_opts fields for indicating whether the opts came from the inode") Reported-by: syzbot+c17ad4b4367b72a853cb@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=c17ad4b4367b72a853cb Signed-off-by: Roxana Nicolescu <nicolescu.roxana@protonmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-11 11:19:33 -04:00
Alan Huang	3a04334d62	bcachefs: Fix b->written overflow When a bset is past the end of the btree node, we should not add its sectors to b->written, since that would overflow b->written. Reported-by: syzbot+3cb3d9e8c3f197754825@syzkaller.appspotmail.com Tested-by: syzbot+3cb3d9e8c3f197754825@syzkaller.appspotmail.com Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-11 09:19:23 -04:00
Luis Chamberlain	a64e5a5960	bdev: add back PAGE_SIZE block size validation for sb_set_blocksize() The commit titled "block/bdev: lift block size restrictions to 64k" lifted the block layer's max supported block size to 64k inside the helper blk_validate_block_size() now that we support large folios. However in lifting the block size we also removed the silly use cases many filesystems have to use sb_set_blocksize() to verify that the block size <= PAGE_SIZE. The call to sb_set_blocksize() was used to check the block size <= PAGE_SIZE since historically we've always supported userspace to create for example 64k block size filesystems even on 4k page size systems, but what we didn't allow was mounting them. Older filesystems have been using the check with sb_set_blocksize() for years. While we could argue that such checks should be filesystem specific, there are many more users of sb_set_blocksize() than LBS enabled filesystems upstream, so just do the easier thing and bring back the PAGE_SIZE check for sb_set_blocksize() users and only skip it for LBS enabled filesystems. This will ensure that tests such as generic/466 when run in a loop against say, ext4, won't try to actually mount a filesystem with a block size larger than your filesystem supports given your PAGE_SIZE and in the worst case crash. Cc: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Link: https://lore.kernel.org/r/20250307020403.3068567-1-mcgrof@kernel.org Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-03-07 12:56:05 +01:00
Kent Overstreet	8ba73f53dc	bcachefs: copygc now skips non-rw devices There's no point in doing copygc on non-rw devices: the fragmentation doesn't matter if we're not writing to them, and we may not have anywhere to put the data on our other devices. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-06 18:15:01 -05:00
Kent Overstreet	33255c161a	bcachefs: Fix bch2_dev_journal_alloc() spuriously failing Previously, we fixed journal resize spuriously failing with -BCH_ERR_open_buckets_empty, but initial journal allocation was missed because it didn't invoke the "block on allocator" loop at all. Factor out the "loop on allocator" code to fix that. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-06 18:15:01 -05:00
Kent Overstreet	4a4f9b5c7c	bcachefs: Don't set BCH_FEATURE_incompat_version_field unless requested We shouldn't be setting incompatible bits or the incompatible version field unless explicitly requested or allowed - otherwise we break mounting with old kernels or userspace. Reported-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-02-28 19:07:33 -05:00
NeilBrown	88d5baf690	Change inode_operations.mkdir to return struct dentry * Some filesystems, such as NFS, cifs, ceph, and fuse, do not have complete control of sequencing on the actual filesystem (e.g. on a different server) and may find that the inode created for a mkdir request already exists in the icache and dcache by the time the mkdir request returns. For example, if the filesystem is mounted twice the directory could be visible on the other mount before it is on the original mount, and a pair of name_to_handle_at(), open_by_handle_at() calls could instantiate the directory inode with an IS_ROOT() dentry before the first mkdir returns. This means that the dentry passed to ->mkdir() may not be the one that is associated with the inode after the ->mkdir() completes. Some callers need to interact with the inode after the ->mkdir completes and they currently need to perform a lookup in the (rare) case that the dentry is no longer hashed. This lookup-after-mkdir requires that the directory remains locked to avoid races. Planned future patches to lock the dentry rather than the directory will mean that this lookup cannot be performed atomically with the mkdir. To remove this barrier, this patch changes ->mkdir to return the resulting dentry if it is different from the one passed in. Possible returns are: NULL - the directory was created and no other dentry was used ERR_PTR() - an error occurred non-NULL - this other dentry was spliced in This patch only changes file-systems to return "ERR_PTR(err)" instead of "err" or equivalent transformations. Subsequent patches will make further changes to some file-systems to return a correct dentry. Not all filesystems reliably result in a positive hashed dentry: - NFS, cifs, hostfs will sometimes need to perform a lookup of the name to get inode information. Races could result in this returning something different. Note that this lookup is non-atomic which is what we are trying to avoid. 
Placing the lookup in filesystem code means it only happens when the filesystem has no other option. - kernfs and tracefs leave the dentry negative and the ->revalidate operation ensures that lookup will be called to correctly populate the dentry. This could be fixed but I don't think it is important to any of the users of vfs_mkdir() which look at the dentry. The recommendation to use d_drop();d_splice_alias() is ugly but fits with current practice. A planned future patch will change this. Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: NeilBrown <neilb@suse.de> Link: https://lore.kernel.org/r/20250227013949.536172-2-neilb@suse.de Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-02-27 20:00:17 +01:00
Christian Brauner	71628584df	Merge patch series "prep patches for my mkdir series" NeilBrown <neilb@suse.de> says: These two patches are cleanups that are dependencies for my mkdir changes and subsequent directory locking changes. * patches from https://lore.kernel.org/r/20250226062135.2043651-1-neilb@suse.de: (2 commits) nfsd: drop fh_update() from S_IFDIR branch of nfsd_create_locked() nfs/vfs: discard d_exact_alias() Link: https://lore.kernel.org/r/20250226062135.2043651-1-neilb@suse.de Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-02-27 09:25:34 +01:00
Kent Overstreet	eb54d2695b	bcachefs: Fix truncate sometimes failing and returning 1 __bch_truncate_folio() may return 1 to indicate dirtyness of the folio being truncated, needed for fpunch to get the i_size writes correct. But truncate was forgetting to clear ret, and sometimes returning it as an error. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-02-26 19:31:05 -05:00
Alan Huang	677bdb7346	bcachefs: Fix deadlock This fixes two deadlocks: 1. the one involving pcpu_alloc_mutex, as pointed out by syzbot[1] 2. a recursion deadlock. The root cause is that we hold the bc lock during alloc_percpu; fix it by following the pattern used by __btree_node_mem_alloc(). [1] https://lore.kernel.org/all/66f97d9a.050a0220.6bad9.001d.GAE@google.com/T/ Reported-by: syzbot+fe63f377148a6371a9db@syzkaller.appspotmail.com Tested-by: syzbot+fe63f377148a6371a9db@syzkaller.appspotmail.com Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-02-26 19:31:05 -05:00
Kent Overstreet	7909d1fb90	bcachefs: Check for -BCH_ERR_open_buckets_empty in journal resize This fixes occasional failures from journal resize. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-02-26 19:31:05 -05:00
Kent Overstreet	4804f3ac26	bcachefs: Revert directory i_size This turned out to have several bugs, which were missed because the fsck code wasn't properly reporting errors - whoops. Kicking it out for now, hopefully it can make 6.15. Cc: Hongbo Li <lihongbo22@huawei.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-02-26 19:30:38 -05:00
Kent Overstreet	cf3e696026	bcachefs: fix bch2_extent_ptr_eq() Reviewed-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-02-23 23:35:33 -05:00
Alan Huang	c522093b02	bcachefs: Fix memmove when move keys down The fix alone doesn't fix [1], but should be applied before debugging that. [1] https://syzkaller.appspot.com/bug?extid=38a0cbd267eff2d286ff Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-02-20 16:40:34 -05:00
Kent Overstreet	68aaa63716	bcachefs: print op->nonce on data update inconsistency "nonce inconstancy" is popping up again, causing us to go emergency read-only. This one looks less serious, i.e. specific to the encryption path and not indicative of a data corruption bug. But we'll need more info to track it down. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-02-20 16:39:28 -05:00
Kent Overstreet	b04974f759	bcachefs: Fix srcu lock warning in btree_update_nodes_written() We don't want to be holding the srcu lock while waiting on btree write completions - easily fixed. Reported-by: Janpieter Sollie <janpieter.sollie@edpnet.be> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-02-19 18:52:42 -05:00
Kent Overstreet	4fd509c10f	bcachefs: Fix bch2_indirect_extent_missing_error() We had some error handling confusion here; -BCH_ERR_missing_indirect_extent is thrown by trans_trigger_reflink_p_segment(); at this point we haven't decided whether we're generating an error. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-02-19 17:33:13 -05:00
Kent Overstreet	b9ddb3e1a8	bcachefs: Fix fsck directory i_size checking Error handling was wrong, causing unhandled transaction restart errors. check_directory_size() was also inefficient, since keys in multiple snapshots would be iterated over once for every snapshot. Convert it to the same scheme used for i_sectors and subdir count checking. Cc: Hongbo Li <lihongbo22@huawei.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-02-19 13:52:27 -05:00
NeilBrown	1c3cb50b58	VFS: change kern_path_locked() and user_path_locked_at() to never return negative dentry No callers of kern_path_locked() or user_path_locked_at() want a negative dentry. So change them to return -ENOENT instead. This simplifies callers. This results in a subtle change to bcachefs in that an ioctl will now return -ENOENT in preference to -EXDEV. I believe this restores the behaviour to what it was prior to Commit `bbe6a7c899` ("bch2_ioctl_subvolume_destroy(): fix locking") Signed-off-by: NeilBrown <neilb@suse.de> Link: https://lore.kernel.org/r/20250217003020.3170652-2-neilb@suse.de Acked-by: Paul Moore <paul@paul-moore.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-02-19 14:08:41 +01:00
Alan Huang	406e445b3c	bcachefs: Reuse transaction bch2_nocow_write_convert_unwritten is already in transaction context: 00191 ========= TEST generic/648 00242 kernel BUG at fs/bcachefs/btree_iter.c:3332! 00242 Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP 00242 Modules linked in: 00242 CPU: 4 UID: 0 PID: 2593 Comm: fsstress Not tainted 6.13.0-rc3-ktest-g345af8f855b7 #14403 00242 Hardware name: linux,dummy-virt (DT) 00242 pstate: 60001005 (nZCv daif -PAN -UAO -TCO -DIT +SSBS BTYPE=--) 00242 pc : __bch2_trans_get+0x120/0x410 00242 lr : __bch2_trans_get+0xcc/0x410 00242 sp : ffffff80d89af600 00242 x29: ffffff80d89af600 x28: ffffff80ddb23000 x27: 00000000fffff705 00242 x26: ffffff80ddb23028 x25: ffffff80d8903fe0 x24: ffffff80ebb30168 00242 x23: ffffff80c8aeb500 x22: 000000000000005d x21: ffffff80d8904078 00242 x20: ffffff80d8900000 x19: ffffff80da9e8000 x18: 0000000000000000 00242 x17: 64747568735f6c61 x16: 6e72756f6a20726f x15: 0000000000000028 00242 x14: 0000000000000004 x13: 000000000000f787 x12: ffffffc081bbcdc8 00242 x11: 0000000000000000 x10: 0000000000000003 x9 : ffffffc08094efbc 00242 x8 : 000000001092c111 x7 : 000000000000000c x6 : ffffffc083c31fc4 00242 x5 : ffffffc083c31f28 x4 : ffffff80c8aeb500 x3 : ffffff80ebb30000 00242 x2 : 0000000000000001 x1 : 0000000000000a21 x0 : 000000000000028e 00242 Call trace: 00242 __bch2_trans_get+0x120/0x410 (P) 00242 bch2_inum_offset_err_msg+0x48/0xb0 00242 bch2_nocow_write_convert_unwritten+0x3d0/0x530 00242 bch2_nocow_write+0xeb0/0x1000 00242 __bch2_write+0x330/0x4e8 00242 bch2_write+0x1f0/0x530 00242 bch2_direct_write+0x530/0xc00 00242 bch2_write_iter+0x160/0xbe0 00242 vfs_write+0x1cc/0x360 00242 ksys_write+0x5c/0xf0 00242 __arm64_sys_write+0x20/0x30 00242 invoke_syscall.constprop.0+0x54/0xe8 00242 do_el0_svc+0x44/0xc0 00242 el0_svc+0x34/0xa0 00242 el0t_64_sync_handler+0x104/0x130 00242 el0t_64_sync+0x154/0x158 00242 Code: 6b01001f 54ffff01 79408460 3617fec0 (d4210000) 00242 ---[ end trace 0000000000000000 
]--- 00242 Kernel panic - not syncing: Oops - BUG: Fatal exception Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-02-12 18:44:50 -05:00
Alan Huang	531323a2ef	bcachefs: Pass _orig_restart_count to trans_was_restarted _orig_restart_count is currently unused; according to the logic, trans_was_restarted should be using it. Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-02-12 18:40:19 -05:00
Kent Overstreet	9cf6b84b71	bcachefs: CONFIG_BCACHEFS_INJECT_TRANSACTION_RESTARTS Incorrectly handled transaction restarts can be a source of heisenbugs; add a mode where we randomly inject them to shake them out. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-02-12 18:40:19 -05:00
Kent Overstreet	9f734cd076	bcachefs: Fix want_new_bset() so we write until the end of the btree node want_new_bset() returns the address of a new bset to initialize if we wish to do so in a btree node - either because the previous one is too big, or because it's been written. The case for 'previous bset was written' was wrong: it's only supposed to check for if we have space in the node for one more block, but because it subtracted the header from the space available it would never initialize a new bset if we were down to the last block in a node. Fixing this results in fewer btree node splits/compactions, which fixes a bug with flushing the journal to go read-only sometimes not terminating or taking excessively long. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-02-11 10:10:32 -05:00
Kent Overstreet	1e690efa72	bcachefs: Split out journal pins by btree level This lets us flush the journal to go read-only more effectively. Flushing the journal and going read-only requires halting mutually recursive processes, which strictly speaking are not guaranteed to terminate. Flushing btree node journal pins will kick off a btree node write, and btree node writes on completion must do another btree update to the parent node to update the 'sectors_written' field for that node's key. If the parent node is full and requires a split or compaction, that's going to generate a whole bunch of additional btree updates - alloc info, LRU btree, and more - which then have to be flushed, and the cycle repeats. This process will terminate much more effectively if we tweak journal reclaim to flush btree updates leaf to root: i.e., don't flush updates for a given btree node (kicking off a write, and consuming space within that node up to the next block boundary) if there might still be unflushed updates in child nodes. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-02-11 10:10:32 -05:00
Alan Huang	1c316eb57c	bcachefs: Fix use after free acc->k.data should be used with the lock held: 00221 ========= TEST generic/187 00221 run fstests generic/187 at 2025-02-09 21:08:10 00221 spectre-v4 mitigation disabled by command-line option 00222 bcachefs (vdc): starting version 1.20: directory_size opts=errors=ro 00222 bcachefs (vdc): initializing new filesystem 00222 bcachefs (vdc): going read-write 00222 bcachefs (vdc): marking superblocks 00222 bcachefs (vdc): initializing freespace 00222 bcachefs (vdc): done initializing freespace 00222 bcachefs (vdc): reading snapshots table 00222 bcachefs (vdc): reading snapshots done 00222 bcachefs (vdc): done starting filesystem 00222 bcachefs (vdc): shutting down 00222 bcachefs (vdc): going read-only 00222 bcachefs (vdc): finished waiting for writes to stop 00223 bcachefs (vdc): flushing journal and stopping allocators, journal seq 6 00223 bcachefs (vdc): flushing journal and stopping allocators complete, journal seq 8 00223 bcachefs (vdc): clean shutdown complete, journal seq 9 00223 bcachefs (vdc): marking filesystem clean 00223 bcachefs (vdc): shutdown complete 00223 bcachefs (vdc): starting version 1.20: directory_size opts=errors=ro 00223 bcachefs (vdc): initializing new filesystem 00223 bcachefs (vdc): going read-write 00223 bcachefs (vdc): marking superblocks 00223 bcachefs (vdc): initializing freespace 00223 bcachefs (vdc): done initializing freespace 00223 bcachefs (vdc): reading snapshots table 00223 bcachefs (vdc): reading snapshots done 00223 bcachefs (vdc): done starting filesystem 00244 hrtimer: interrupt took 123350440 ns 00264 bcachefs (vdc): shutting down 00264 bcachefs (vdc): going read-only 00264 bcachefs (vdc): finished waiting for writes to stop 00264 bcachefs (vdc): flushing journal and stopping allocators, journal seq 97 00265 bcachefs (vdc): flushing journal and stopping allocators complete, journal seq 101 00265 bcachefs (vdc): clean shutdown complete, journal seq 102 00265 bcachefs (vdc): 
marking filesystem clean 00265 bcachefs (vdc): shutdown complete 00265 bcachefs (vdc): starting version 1.20: directory_size opts=errors=ro 00265 bcachefs (vdc): recovering from clean shutdown, journal seq 102 00265 bcachefs (vdc): accounting_read... 00265 ================================================================== 00265 done 00265 BUG: KASAN: slab-use-after-free in bch2_fs_to_text+0x12b4/0x1728 00265 bcachefs (vdc): alloc_read... done 00265 bcachefs (vdc): stripes_read... done 00265 Read of size 4 at addr ffffff80c57eac00 by task cat/7531 00265 bcachefs (vdc): snapshots_read... done 00265 00265 CPU: 6 UID: 0 PID: 7531 Comm: cat Not tainted 6.13.0-rc3-ktest-g16fc6fa3819d #14103 00265 Hardware name: linux,dummy-virt (DT) 00265 Call trace: 00265 show_stack+0x1c/0x30 (C) 00265 dump_stack_lvl+0x6c/0x80 00265 print_report+0xf8/0x5d8 00265 kasan_report+0x90/0xd0 00265 __asan_report_load4_noabort+0x1c/0x28 00265 bch2_fs_to_text+0x12b4/0x1728 00265 bch2_fs_show+0x94/0x188 00265 sysfs_kf_seq_show+0x1a4/0x348 00265 kernfs_seq_show+0x12c/0x198 00265 seq_read_iter+0x27c/0xfd0 00265 kernfs_fop_read_iter+0x390/0x4f8 00265 vfs_read+0x480/0x7f0 00265 ksys_read+0xe0/0x1e8 00265 __arm64_sys_read+0x70/0xa8 00265 invoke_syscall.constprop.0+0x74/0x1e8 00265 do_el0_svc+0xc8/0x1c8 00265 el0_svc+0x20/0x60 00265 el0t_64_sync_handler+0x104/0x130 00265 el0t_64_sync+0x154/0x158 00265 00265 Allocated by task 7510: 00265 kasan_save_stack+0x28/0x50 00265 kasan_save_track+0x1c/0x38 00265 kasan_save_alloc_info+0x3c/0x50 00265 __kasan_kmalloc+0xac/0xb0 00265 __kmalloc_node_noprof+0x168/0x348 00265 __kvmalloc_node_noprof+0x20/0x140 00265 __bch2_darray_resize_noprof+0x90/0x1b0 00265 __bch2_accounting_mem_insert+0x76c/0xb08 00265 bch2_accounting_mem_insert+0x224/0x3b8 00265 bch2_accounting_mem_mod_locked+0x480/0xc58 00265 bch2_accounting_read+0xa94/0x3eb8 00265 bch2_run_recovery_pass+0x80/0x178 00265 bch2_run_recovery_passes+0x340/0x698 00265 bch2_fs_recovery+0x1c98/0x2bd8 00265 
bch2_fs_start+0x240/0x490 00265 bch2_fs_get_tree+0xe1c/0x1458 00265 vfs_get_tree+0x7c/0x250 00265 path_mount+0xe24/0x1648 00265 __arm64_sys_mount+0x240/0x438 00265 invoke_syscall.constprop.0+0x74/0x1e8 00265 do_el0_svc+0xc8/0x1c8 00265 el0_svc+0x20/0x60 00265 el0t_64_sync_handler+0x104/0x130 00265 el0t_64_sync+0x154/0x158 00265 00265 Freed by task 7510: 00265 kasan_save_stack+0x28/0x50 00265 kasan_save_track+0x1c/0x38 00265 kasan_save_free_info+0x48/0x88 00265 __kasan_slab_free+0x48/0x60 00265 kfree+0x188/0x408 00265 kvfree+0x3c/0x50 00265 __bch2_darray_resize_noprof+0xe0/0x1b0 00265 __bch2_accounting_mem_insert+0x76c/0xb08 00265 bch2_accounting_mem_insert+0x224/0x3b8 00265 bch2_accounting_mem_mod_locked+0x480/0xc58 00265 bch2_accounting_read+0xa94/0x3eb8 00265 bch2_run_recovery_pass+0x80/0x178 00265 bch2_run_recovery_passes+0x340/0x698 00265 bch2_fs_recovery+0x1c98/0x2bd8 00265 bch2_fs_start+0x240/0x490 00265 bch2_fs_get_tree+0xe1c/0x1458 00265 vfs_get_tree+0x7c/0x250 00265 path_mount+0xe24/0x1648 00265 bcachefs (vdc): going read-write 00265 __arm64_sys_mount+0x240/0x438 00265 invoke_syscall.constprop.0+0x74/0x1e8 00265 do_el0_svc+0xc8/0x1c8 00265 el0_svc+0x20/0x60 00265 el0t_64_sync_handler+0x104/0x130 00265 el0t_64_sync+0x154/0x158 00265 00265 The buggy address belongs to the object at ffffff80c57eac00 00265 which belongs to the cache kmalloc-128 of size 128 00265 The buggy address is located 0 bytes inside of 00265 freed 128-byte region [ffffff80c57eac00, ffffff80c57eac80) 00265 00265 The buggy address belongs to the physical page: 00265 page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1057ea 00265 head: order:1 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0 00265 flags: 0x8000000000000040(head\|zone=2) 00265 page_type: f5(slab) 00265 raw: 8000000000000040 ffffff80c0002800 dead000000000100 dead000000000122 00265 raw: 0000000000000000 0000000000200020 00000001f5000000 ffffff80c57a6400 00265 head: 8000000000000040 ffffff80c0002800 
dead000000000100 dead000000000122 00265 head: 0000000000000000 0000000000200020 00000001f5000000 ffffff80c57a6400 00265 head: 8000000000000001 fffffffec315fa81 ffffffffffffffff 0000000000000000 00265 head: 0000000000000002 0000000000000000 00000000ffffffff 0000000000000000 00265 page dumped because: kasan: bad access detected 00265 00265 Memory state around the buggy address: 00265 ffffff80c57eab00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00265 ffffff80c57eab80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc 00265 >ffffff80c57eac00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb 00265 ^ 00265 ffffff80c57eac80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc 00265 ffffff80c57ead00: 00 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc 00265 ================================================================== 00265 Kernel panic - not syncing: kasan.fault=panic set ... 00265 CPU: 6 UID: 0 PID: 7531 Comm: cat Not tainted 6.13.0-rc3-ktest-g16fc6fa3819d #14103 00265 Hardware name: linux,dummy-virt (DT) 00265 Call trace: 00265 show_stack+0x1c/0x30 (C) 00265 dump_stack_lvl+0x30/0x80 00265 dump_stack+0x18/0x20 00265 panic+0x4d4/0x518 00265 start_report.constprop.0+0x0/0x90 00265 kasan_report+0xa0/0xd0 00265 __asan_report_load4_noabort+0x1c/0x28 00265 bch2_fs_to_text+0x12b4/0x1728 00265 bch2_fs_show+0x94/0x188 00265 sysfs_kf_seq_show+0x1a4/0x348 00265 kernfs_seq_show+0x12c/0x198 00265 seq_read_iter+0x27c/0xfd0 00265 kernfs_fop_read_iter+0x390/0x4f8 00265 vfs_read+0x480/0x7f0 00265 ksys_read+0xe0/0x1e8 00265 __arm64_sys_read+0x70/0xa8 00265 invoke_syscall.constprop.0+0x74/0x1e8 00265 do_el0_svc+0xc8/0x1c8 00265 el0_svc+0x20/0x60 00265 el0t_64_sync_handler+0x104/0x130 00265 el0t_64_sync+0x154/0x158 00265 SMP: stopping secondary CPUs 00265 Kernel Offset: disabled 00265 CPU features: 0x000,00000070,00000010,8240500b 00265 Memory Limit: none 00265 ---[ end Kernel panic - not syncing: kasan.fault=panic set ... 
]--- 00270 ========= FAILED TIMEOUT generic.187 in 1200s Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-02-11 10:10:32 -05:00
Kent Overstreet	595170d4b6	bcachefs: Fix marking reflink pointers to missing indirect extents reflink pointers to missing indirect extents aren't deleted, they just have an error bit set - in case the indirect extent somehow reappears. fsck/mark and sweep thus needs to ignore these errors. Also, they can be marked AUTOFIX now. Reported-by: Roland Vet <vet.roland@protonmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-02-07 14:49:47 -05:00
Kent Overstreet	4be214c269	bcachefs: bch2_bkey_sectors_need_rebalance() now only depends on bch_extent_rebalance Previously, bch2_bkey_sectors_need_rebalance() called bch2_target_accepts_data(), checking whether the target is writable. However, this means that adding or removing devices from a target would change the value of bch2_bkey_sectors_need_rebalance() for an existing extent; this needs to be invariant so that the extent trigger can correctly maintain rebalance_work accounting. Instead, check target_accepts_data() in io_opts_to_rebalance_opts(), before creating the bch_extent_rebalance entry. This fixes (one?) cause of rebalance_work accounting being off. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-02-06 22:35:11 -05:00
Kent Overstreet	3539880ef1	bcachefs: Fix rcu imbalance in bch2_fs_btree_key_cache_exit() Spotted by sparse. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-02-06 22:35:11 -05:00
Kent Overstreet	9e9033522a	bcachefs: Fix discard path journal flushing The discard path is supposed to issue journal flushes when there are too many empty buckets that need a journal commit before they can be written to again, but at some point this code seems to have been lost. Bring it back with a new optimization to make sure we don't issue too many journal flushes: the journal now tracks the sequence number of the most recent flush in progress, which the discard path uses when deciding which buckets need a journal flush. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-02-06 22:35:11 -05:00
Jeongjun Park	2ef995df0c	bcachefs: fix deadlock in journal_entry_open() In the previous commit `b3d82c2f27`, code was added to prevent journal sequence overflow. As part of that change, journal_entry_open() now uses bch2_fs_fatal_err_on() to handle errors. However, __journal_res_get() calls journal_entry_open() while holding journal->lock, and bch2_fs_fatal_err_on() internally tries to acquire journal->lock, which results in a deadlock. So we need to add a locked helper to handle fatal errors even when journal->lock is held. Fixes: `b3d82c2f27` ("bcachefs: Guard against journal seq overflow") Signed-off-by: Jeongjun Park <aha310510@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-02-06 22:35:11 -05:00
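The "locked helper" pattern that 2ef995df0c refers to can be sketched in plain C. Everything below is an illustrative model, not the bcachefs implementation: `struct journal_sketch`, `sketch_trylock()`, `fatal_error()` and `fatal_error_locked()` are hypothetical names, and the non-recursive lock is modelled by a boolean so that the would-be deadlock shows up as a `false` return instead of hanging.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy journal: lock_held models a non-recursive lock. */
struct journal_sketch {
	bool lock_held;
	bool halted;
};

/* Returns false where a real non-recursive lock would self-deadlock. */
static bool sketch_trylock(struct journal_sketch *j)
{
	if (j->lock_held)
		return false;
	j->lock_held = true;
	return true;
}

static void sketch_unlock(struct journal_sketch *j)
{
	assert(j->lock_held);
	j->lock_held = false;
}

/* Locked variant: the caller must already hold the lock. */
static void fatal_error_locked(struct journal_sketch *j)
{
	assert(j->lock_held);
	j->halted = true;
}

/* Unlocked variant: takes the lock itself, so it must not be called with it held. */
static bool fatal_error(struct journal_sketch *j)
{
	if (!sketch_trylock(j))
		return false;	/* a real lock would hang here */
	fatal_error_locked(j);
	sketch_unlock(j);
	return true;
}

/* Buggy caller: holds the lock, then calls the variant that re-acquires it. */
bool res_get_buggy(struct journal_sketch *j)
{
	bool ok;

	if (!sketch_trylock(j))
		return false;
	ok = fatal_error(j);
	sketch_unlock(j);
	return ok;
}

/* Fixed caller: holds the lock and uses the locked helper. */
bool res_get_fixed(struct journal_sketch *j)
{
	if (!sketch_trylock(j))
		return false;
	fatal_error_locked(j);
	sketch_unlock(j);
	return true;
}
```

In this model `res_get_buggy()` fails to record the fatal error because the inner acquisition cannot succeed, while `res_get_fixed()` records it and releases the lock. Splitting a function into an outer wrapper that locks and an inner `_locked` body is a common way to serve both kinds of callers.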
Jeongjun Park	6b37037d6d	bcachefs: fix incorrect pointer check in __bch2_subvolume_delete() For some unknown reason, checks on struct bkey_s_c_snapshot and struct bkey_s_c_snapshot_tree pointers are missing. Therefore, I think it would be appropriate to fix the incorrect pointer checking through this patch. Fixes: `4bd06f07bc` ("bcachefs: Fixes for snapshot_tree.master_subvol") Signed-off-by: Jeongjun Park <aha310510@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-02-06 22:35:11 -05:00
Linus Torvalds	a86bf2283d	assorted stuff for this merge window -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZ5yJdgAKCRBZ7Krx/gZQ 69W4AQDwgxceiQ6icx3rFhCWQigne4jdMO84kd8tNaa+xHGe1AD/WnkeChc5DqjQ wZWZxAAzml9SS01IcSiHWaF5fgrjlA0= =rXOq -----END PGP SIGNATURE----- Merge tag 'pull-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull misc vfs cleanups from Al Viro: "Two unrelated patches - one is a removal of long-obsolete include in overlayfs (it used to need fs/internal.h, but the extern it wanted has been moved back to include/linux/namei.h) and another introduces convenience helper constructing struct qstr by a NUL-terminated string" * tag 'pull-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: add a string-to-qstr constructor fs/overlayfs/namei.c: get rid of include ../internal.h	2025-02-01 15:07:56 -08:00
Linus Torvalds	8080ff5ac6	bcachefs fixes for 6.14-rc1 - second half of a fix for a bug that'd been causing oopses on filesystems using snapshots with memory pressure (key cache fills for snaphots btrees are tricky) - build fix for strange compiler configurations that double stack frame size - "journal stuck timeout" now takes into account device latency: this fixes some spurious warnings, and the main remaining source of SRCU lock hold time warnings (I'm no longer seeing this in my CI, so any users still seeing this should definitely ping me) - fix for slow/hanging unmounts (" Improve journal pin flushing") - some more tracepoint fixes/improvements, to chase down the "rebalance isn't making progress" issues -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmeajBsACgkQE6szbY3K bna2BA/9E/+WBDFQHkLJ4kNQBxKL4u1xfav5kKGZ79mUlqruhr3AckLPFzWmQO21 eOJE0NeyvpsvLewDXMGZ8w/Nm3Vdc53X6ATKkQaF/UoTYVWbubmF62sXBzSS8TUh YIM6s24/CbCi8lT49JAuIaG3OC21KH0X0zOcvyepfmn1aiPNLr4y7zOWKynOhgCt mAt374ayUDDTgoQmXqPIrGp8eD/C+vUjo1ief+DIGMQmDj4uHpb5iBbmjXm8FF9x 4TcrP1UjWpiWPcHeb98H/CBWOnjDSgOFhYmxhVOvkDpC6XbtPSgKQIOs7tSJ0Nuo IOzrGuBPVfd2m+wgXsn7zbn0HNOjS76sCo92K1lAdS86k0eqRfXmCxkU6FUphNkA WCG8WrK0RjHL132iR97dtv36No8ji5mZN1ILPk/h4KRkoKC+9fA8BaJdAGVt+6NP wZLtZxZkV8BqgXF41HwzHt54YftRPn2kR47Jfu1rPimSUd4Uqy8Yjw2J/fUT7eAd 6JdfiadhAtMRWnFGzmVs4LsEWJ7Ja7GnG7jhjzlACsqbXsTU8k16Wq38IchC6mi+ p+hqq9pLAeosW9Lk/QTGFrq52aQfyOzdUjq1pyCcEYtZFNqjj8GmmHVejxZWiRTo C6dTEkSIMcBx+9QP8BJ5o+xMR02KABn+8x43TzQxQ2DXj0QamTA= =QJHL -----END PGP SIGNATURE----- Merge tag 'bcachefs-2025-01-29' of git://evilpiepirate.org/bcachefs Pull bcachefs fixes from Kent Overstreet: - second half of a fix for a bug that'd been causing oopses on filesystems using snapshots with memory pressure (key cache fills for snaphots btrees are tricky) - build fix for strange compiler configurations that double stack frame size - "journal stuck timeout" now takes into account device latency: this fixes some spurious warnings, 
and the main remaining source of SRCU lock hold time warnings (I'm no longer seeing this in my CI, so any users still seeing this should definitely ping me) - fix for slow/hanging unmounts (" Improve journal pin flushing") - some more tracepoint fixes/improvements, to chase down the "rebalance isn't making progress" issues * tag 'bcachefs-2025-01-29' of git://evilpiepirate.org/bcachefs: bcachefs: Improve trace_move_extent_finish bcachefs: Fix trace_copygc bcachefs: Journal writes are now IOPRIO_CLASS_RT bcachefs: Improve journal pin flushing bcachefs: fix bch2_btree_node_flags bcachefs: rebalance, copygc enabled are runtime opts bcachefs: Improve decompression error messages bcachefs: bset_blacklisted_journal_seq is now AUTOFIX bcachefs: "Journal stuck" timeout now takes into account device latency bcachefs: Reduce stack frame size of __bch2_str_hash_check_key() bcachefs: Fix btree_trans_peek_key_cache()	2025-01-30 08:42:50 -08:00
Al Viro	c1feab95e0	add a string-to-qstr constructor Quite a few places want to build a struct qstr from a given string; it would be convenient to have a primitive doing that, rather than open-coding it via QSTR_INIT(). The closest approximation was in bcachefs, but that expands to an initializer list - {.len = strlen(string), .name = string}. It would be more useful to have it as a compound literal - (struct qstr){.len = strlen(string), .name = string}. Unlike an initializer list, it's a valid expression. What's more, it's a valid lvalue - it's the equivalent of an anonymous local variable with such an initializer, so things like path->dentry = d_alloc_pseudo(mnt->mnt_sb, &QSTR(name)); are valid. It can also be used as an initializer, with identical effect - struct qstr x = (struct qstr){.name = s, .len = strlen(s)}; is equivalent to struct qstr anon_variable = {.name = s, .len = strlen(s)}; struct qstr x = anon_variable; // anon_variable is never used after that point and any even remotely sane compiler will manage to collapse that into struct qstr x = {.name = s, .len = strlen(s)}; What compound literals can't be used for is initialization of global variables, but those are covered by QSTR_INIT(). This commit lifts the definition(s) of QSTR() into linux/dcache.h, converts it to a compound literal (all bcachefs users are fine with that) and converts assorted open-coded instances to use it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-01-27 19:25:45 -05:00
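The difference between an initializer list and a compound literal described in c1feab95e0 can be tried out in userspace. The sketch below assumes a simplified `struct qstr_sketch` (the kernel's struct qstr also carries a hash and has a different layout), and the macros `QSTR_SKETCH_INIT()` and `QSTR_SKETCH()` are hypothetical stand-ins, not the definitions in linux/dcache.h.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified userspace stand-in for the kernel's struct qstr. */
struct qstr_sketch {
	size_t len;
	const char *name;
};

/* Initializer-list form: usable only where an initializer is expected. */
#define QSTR_SKETCH_INIT(s) { .len = strlen(s), .name = (s) }

/* Compound-literal form: an expression, and an lvalue whose address can be taken. */
#define QSTR_SKETCH(s) ((struct qstr_sketch){ .len = strlen(s), .name = (s) })

static size_t name_len(const struct qstr_sketch *q)
{
	assert(q != NULL);
	return q->len;
}

/* The initializer-list macro works in a declaration... */
size_t demo_init_list(const char *s)
{
	struct qstr_sketch x = QSTR_SKETCH_INIT(s);

	return x.len;
}

/* ...but only the compound literal can be passed by address inline. */
size_t demo_lvalue(const char *s)
{
	return name_len(&QSTR_SKETCH(s));
}

/* The compound literal also works as an initializer, with the same effect. */
size_t demo_initializer(const char *s)
{
	struct qstr_sketch x = QSTR_SKETCH(s);

	return x.len;
}
```

Writing `name_len(&QSTR_SKETCH_INIT(s))` would not compile, since a brace-enclosed initializer list is not an expression. A compound literal inside a function body has automatic storage duration, so its address must not outlive the enclosing block.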
Kent Overstreet	5d9ccda9ba	bcachefs: Improve trace_move_extent_finish We're currently debugging issues with rebalance, where it's not making progress as quickly as it should be (or sometimes not at all). Add the full data_update to the move_extent_finish tracepoint, so we can check that the replicas we wrote match what we were supposed to do. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-26 23:02:28 -05:00
Kent Overstreet	0e458a616f	bcachefs: Fix trace_copygc Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-26 23:02:28 -05:00
Kent Overstreet	75474a54ed	bcachefs: Journal writes are now IOPRIO_CLASS_RT System performance is particularly sensitive to journal write latency: the number of outstanding journal writes is bounded, and we can't issue journal flushes until other journal writes have completed. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-26 23:02:28 -05:00
Kent Overstreet	35f5197009	bcachefs: Improve journal pin flushing Running the preempt tiering tests with a lower than normal journal reclaim delay turned up a shutdown hang - a lost wakeup, caused because flushing a journal pin (e.g. key cache/write buffer) can generate a new journal pin. The "simple" fix of adding the correct wakeup didn't work because of ordering issues; if we flush btree node pins too aggressively before other pins have completed, we end up spinning where each flush iteration generates new work. So to fix this correctly: - The list of flushed journal pins is now broken out by type, so that we can wait for key cache/write buffer pin flushing to complete before flushing dirty btree nodes - A new closure_waitlist is added for bch2_journal_flush_pins; this one is only used under or when we're taking the journal lock, so it's pretty cheap to add rigorously correct wakeups to journal_pin_set() and journal_pin_drop(). Additionally, bch2_journal_seq_pins_to_text() is moved to journal_reclaim.c, where it belongs, along with a bit of other small renaming and refactoring. Besides fixing the hang, the better ordering between key cache/write buffer flushing and btree node flushing should help or fix the "unmount taking excessively long" a few users have been noticing. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-25 19:37:43 -05:00
Kent Overstreet	0c74c85bbe	bcachefs: fix bch2_btree_node_flags Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-25 19:33:19 -05:00
Kent Overstreet	37fd6b8176	bcachefs: rebalance, copygc enabled are runtime opts Fix a regression from when these were switched to normal opts.h options. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-25 19:33:19 -05:00
Kent Overstreet	2efbc3518f	bcachefs: Improve decompression error messages Ratelimit them, and use the new bch2_write_op_error() helper that prints path and file offset. Reported-by: https://github.com/koverstreet/bcachefs/issues/819 Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-25 14:43:13 -05:00
Linus Torvalds	37b33c68b0	CRC updates for 6.14 - Reorganize the architecture-optimized CRC32 and CRC-T10DIF code to be directly accessible via the library API, instead of requiring the crypto API. This is much simpler and more efficient. - Convert some users such as ext4 to use the CRC32 library API instead of the crypto API. More conversions like this will come later. - Add a KUnit test that tests and benchmarks multiple CRC variants. Remove older, less-comprehensive tests that are made redundant by this. - Add an entry to MAINTAINERS for the kernel's CRC library code. I'm volunteering to maintain it. I have additional cleanups and optimizations planned for future cycles. These patches have been in linux-next since -rc1. -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCZ418ZRQcZWJpZ2dlcnNA Z29vZ2xlLmNvbQAKCRDzXCl4vpKOKyJYAP9kBlpm8W9/XY6N8SpjKaXE/vKQYHQl Nobhak06Us8uJwEAkcUTymWP4IwQj5A9jgBAPRw53FQcNVKIc+01C7gRHw0= =mqSH -----END PGP SIGNATURE----- Merge tag 'crc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux Pull CRC updates from Eric Biggers: - Reorganize the architecture-optimized CRC32 and CRC-T10DIF code to be directly accessible via the library API, instead of requiring the crypto API. This is much simpler and more efficient. - Convert some users such as ext4 to use the CRC32 library API instead of the crypto API. More conversions like this will come later. - Add a KUnit test that tests and benchmarks multiple CRC variants. Remove older, less-comprehensive tests that are made redundant by this. - Add an entry to MAINTAINERS for the kernel's CRC library code. I'm volunteering to maintain it. I have additional cleanups and optimizations planned for future cycles. 
* tag 'crc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux: (31 commits) MAINTAINERS: add entry for CRC library powerpc/crc: delete obsolete crc-vpmsum_test.c lib/crc32test: delete obsolete crc32test.c lib/crc16_kunit: delete obsolete crc16_kunit.c lib/crc_kunit.c: add KUnit test suite for CRC library functions powerpc/crc-t10dif: expose CRC-T10DIF function through lib arm64/crc-t10dif: expose CRC-T10DIF function through lib arm/crc-t10dif: expose CRC-T10DIF function through lib x86/crc-t10dif: expose CRC-T10DIF function through lib crypto: crct10dif - expose arch-optimized lib function lib/crc-t10dif: add support for arch overrides lib/crc-t10dif: stop wrapping the crypto API scsi: target: iscsi: switch to using the crc32c library f2fs: switch to using the crc32 library jbd2: switch to using the crc32c library ext4: switch to using the crc32c library lib/crc32: make crc32c() go directly to lib bcachefs: Explicitly select CRYPTO from BCACHEFS_FS x86/crc32: expose CRC32 functions through lib x86/crc32: update prototype for crc32_pclmul_le_16() ...	2025-01-22 19:55:08 -08:00
Kent Overstreet	c9c8a17f7a	bcachefs: bset_blacklisted_journal_seq is now AUTOFIX Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-21 23:05:32 -05:00
Kent Overstreet	2c5d8a8347	bcachefs: "Journal stuck" timeout now takes into account device latency If a block device (e.g. your typical consumer SSD) is taking multiple seconds for IOs (typically flushes), we don't want to emit the "journal stuck" message prematurely. Also, make sure to drop the btree_trans srcu lock if we're blocking for more than a second. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-21 18:32:05 -05:00
Kent Overstreet	f917016f69	bcachefs: Reduce stack frame size of __bch2_str_hash_check_key() We don't need all the helpers inlined here. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-21 12:57:48 -05:00
Kent Overstreet	a858175227	bcachefs: Fix btree_trans_peek_key_cache() BTREE_ITER_cached_nofill has some tricky corner cases; it's used internally for iterators that aren't walking the key cache, but need to be coherent with the key cache. It tells traverse to look up and lock the key cache entry if present, but not to create one if it doesn't exist. That means we have to have a BTREE_ITER_UPTODATE path (because after traverse the path has to be UPTODATE, or we pop assertions) that doesn't point to anything (which is the less bad option, taken by the previous fix). The previous fix for this path missed an issue that can happen in bch2_trans_peek_key_cache(): we can't set should_be_locked on a path that doesn't point to anything and doesn't hold locks. Fixes: `bd5b09727f` ("bcachefs: Don't set btree_path to updtodate if we don't fill") Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-21 12:26:25 -05:00
Linus Torvalds	1cbfb828e0	for-6.14/block-20250118 Merge tag 'for-6.14/block-20250118' of git://git.kernel.dk/linux Pull block updates from Jens Axboe: - NVMe pull requests via Keith: - Target support for PCI-Endpoint transport (Damien) - TCP IO queue spreading fixes (Sagi, Chaitanya) - Target handling for "limited retry" flags (Guixen) - Poll type fix (Yongsoo) - Xarray storage error handling (Keisuke) - Host memory buffer free size fix on error (Francis) - MD pull requests via Song: - Reintroduce md-linear (Yu Kuai) - md-bitmap refactor and fix (Yu Kuai) - Replace kmap_atomic with kmap_local_page (David Reaver) - Quite a few queue freeze and debugfs deadlock fixes Ming introduced lockdep support for this in the 6.13 kernel, and it has (unsurprisingly) uncovered quite a few issues - Use const attributes for IO schedulers - Remove bio ioprio wrappers - Fixes for stacked device atomic write support - Refactor queue affinity helpers, in preparation for better supporting isolated CPUs - Cleanups of loop O_DIRECT handling - Cleanup of BLK_MQ_F_* flags - Add rotational support for null_blk - Various fixes and cleanups * tag 'for-6.14/block-20250118' of git://git.kernel.dk/linux: (106 commits) block: Don't trim an atomic write block: Add common atomic writes enable flag md/md-linear: Fix a NULL vs IS_ERR() bug in linear_add() block: limit disk max sectors to (LLONG_MAX >> 9) block: Change blk_stack_atomic_writes_limits() unit_min check block: Ensure start sector is aligned for stacking atomic writes blk-mq: Move more error handling into blk_mq_submit_bio() block: Reorder the request allocation code in blk_mq_submit_bio() nvme: fix bogus kzalloc() return check in nvme_init_effects_log() md/md-bitmap: move bitmap_{start, end}write to md upper layer md/raid5: implement pers->bitmap_sector() md: add a new callback pers->bitmap_sector() md/md-bitmap: remove the last parameter for bimtap_ops->endwrite() md/md-bitmap: factor behind write counters out from bitmap_{start/end}write() md: Replace deprecated kmap_atomic() with kmap_local_page() md: reintroduce md-linear partitions: ldm: remove the initial kernel-doc notation blk-cgroup: rwstat: fix kernel-doc warnings in header file blk-cgroup: fix kernel-doc warnings in header file nbd: fix partial sending ...	2025-01-20 19:38:46 -08:00
Kent Overstreet	ff0b7ed607	bcachefs: Fix check_inode_hash_info_matches_root() Can't use memcmp() when the struct contains padding. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-15 15:28:23 -05:00
Kent Overstreet	a4e11cea27	bcachefs: Document issue with bch_stripe layout We've got a problem with bch_stripe that is going to take an on disk format rev to fix - we can't access the block sector counts if the checksum type is unknown. Document it for now, there are a few other things to fix as well. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-14 10:45:31 -05:00
Kent Overstreet	78423deb51	bcachefs: Fix self healing on read error We were incorrectly checking if there'd been an io error. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-14 10:45:31 -05:00
Alan Huang	5dd21b2712	bcachefs: Pop all the transactions from the abort one The transaction is going to abort, so there will be no cycle involving this transaction anymore. Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-14 10:45:25 -05:00
Alan Huang	b169138d48	bcachefs: Only abort the transactions in the cycle When the cycle doesn't involve the initiator of the cycle detection, we might choose a transaction that is not involved in the cycle to abort. That is the wrong choice, since it won't break the cycle; this patch therefore chooses the transaction in the cycle to abort. Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-14 10:45:18 -05:00
Alan Huang	6853a5e5d4	bcachefs: Introduce lock_graph_pop_from This patch introduces a helper function called lock_graph_pop_from, which pops the graph from i. Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-14 10:45:13 -05:00
Alan Huang	b5c3dcd0db	bcachefs: Convert open-coded lock_graph_pop_all to helper Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-14 10:45:08 -05:00
Alan Huang	0ef9ab34f4	bcachefs: Do not allow no fail lock request to fail If the transaction chose itself as a victim before and restarted, it might request a no fail lock request this time. But it might be added to others' lock graph and be chosen as the victim again, which is no longer safe without an additional check. We can also convert the cycle detector to be fully RCU-based to solve that unsoundness, but the latency added to trans_put and additional memory required may not be worth it. Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-14 10:45:08 -05:00
Alan Huang	cdc419dbf2	bcachefs: Merge the condition to avoid additional invocation If the lock has been acquired and unlocked, we don't have to clear and wake up again, though doing so is harmless since we hold the intent lock. Merging the condition might be clearer. Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-14 10:45:08 -05:00
Alan Huang	9c13cc9c7d	Revert "bcachefs: Fix bch2_btree_node_upgrade()" This reverts commit `62448afee7`. six_lock_tryupgrade fails only if there is an intent lock held, it won't fail no matter how many read locks are held. Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-14 10:45:08 -05:00
Hongbo Li	c72deb03ff	bcachefs: bcachefs_metadata_version_directory_size This adds another metadata version for accounting directory size. For the new version of the filesystem, when new subdirectory items are created or deleted, the parent directory's size will change accordingly. For existing filesystems on the old version, running fsck will automatically upgrade the metadata version, and will check and recalculate the directory size. Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-13 14:58:38 -05:00
Hongbo Li	e614a6c52d	bcachefs: make directory i_size meaningful The i_size of a directory is 0 in bcachefs if the directory is empty. As more child dirents are created, its size ought to change. Many other filesystems do this (e.g. xfs and btrfs), and many of them base the change on the size of the child dirent names. Although the directory size may not seem to convey much, we can still give it some meaning. The formula for dentry size is as follows: occupied_size = 40 + ALIGN(9 + namelen, 8) Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-13 14:58:38 -05:00
Kent Overstreet	4204e3bf63	bcachefs: check_unreachable_inodes is not actually PASS_ONLINE yet check_unreachable_inodes does work in online mode, with the one caveat that it assumes check_dirents has also run - and check_dirents is not PASS_ONLINE yet. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-09 23:38:42 -05:00
Kent Overstreet	ae153f2e11	bcachefs: Don't use BTREE_ITER_cached when walking alloc btree during fsck No need to pull the whole alloc btree into the btree key cache. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-09 23:38:42 -05:00
Kent Overstreet	15734b5e6f	bcachefs: Check for dirents to overwritten inodes This fixes various "dirent to missing inode" errors. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-09 23:38:42 -05:00
Kent Overstreet	d3d0fac57d	bcachefs: bch2_btree_iter_peek_slot() handles navigating to nonexistent depth Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-09 23:38:42 -05:00
Kent Overstreet	bd5b09727f	bcachefs: Don't set btree_path to updtodate if we don't fill This fixes various locking asserts, and a null ptr deref in bch2_btree_iter_peek_path(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-09 23:38:42 -05:00
Kent Overstreet	cf67f46641	bcachefs: __bch2_btree_pos_to_text() Factor out a version of bch2_btree_pos_to_text() that doesn't take a pointer to a in-memory btree node, to be used for btree node scrub. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-09 23:38:42 -05:00
Kent Overstreet	0a46ea9d46	bcachefs: printbuf_reset() handles tabstops Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-09 23:38:42 -05:00
Kent Overstreet	5906dcb993	bcachefs: Silence read-only errors when deleting snapshots Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-09 23:38:42 -05:00
Kent Overstreet	8b1f46bff3	bcachefs: Dropped superblock write is no longer a fatal error Just emit a warning if errors=continue or fix_safe. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-09 23:38:42 -05:00
Kent Overstreet	8cfdc6ce1f	bcachefs: bch2_trans_node_drop() Factor out a small common helper. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-09 23:38:42 -05:00
Kent Overstreet	0971a72c3d	bcachefs: bch2_trans_unlock_write() New helper for dropping all write locks; which is distinct from the helper the transaction commit path uses, which is faster and only touches updates. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-09 23:38:42 -05:00
Kent Overstreet	e1911d7a69	bcachefs: btree_node_unlock() can now drop write locks Prep work for reworking btree node locking during interior btree updates. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-09 23:38:41 -05:00
Kent Overstreet	9a5232ef0a	bcachefs: six locks: write locks can now be held recursively This is needed for the interior update locking rework, where we'll be holding node write locks for the duration of the update - which is needed for synchronizing with online check_allocations. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-09 23:38:41 -05:00
Kent Overstreet	8f3aaa5d5d	bcachefs: bch2_fs_btree_gc_init() Now returns errors, prep work for check_allocations_done_lock Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-09 23:38:41 -05:00
Kent Overstreet	cb3f34982c	bcachefs: Assert that btree write buffer only touches the right btrees More asserts, more better. Also, clean up the per-btree flags a bit. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-09 23:38:41 -05:00
Kent Overstreet	bdedae70f5	bcachefs: bch2_inum_path() now crosses subvolumes correctly The dirent that points to a subvolume root is in the parent subvolume. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-09 23:38:41 -05:00
Kent Overstreet	ce9a21713b	bcachefs: bch2_inum_path() no longer returns an error for disconnected inums bch2_inum_path() should work even if the filesystem is corrupted - we don't want it to cause fsck to fail. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-09 23:38:41 -05:00
Kent Overstreet	6adc5af50a	bcachefs: btree_path_very_locks(): verify lock seq If the btree_path's lock seq is wrong, the next bch2_trans_relock() operation is guaranteed to fail and we take an unnecessary transaction restart. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-09 23:38:41 -05:00
Kent Overstreet	f908eacc34	bcachefs: fix bch2_btree_key_cache_drop() When evicting, we shouldn't leave a pointer to the key cache entry lying around - that screws up btree path asserts we're adding. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-09 23:38:41 -05:00
Kent Overstreet	bc6fce7870	bcachefs: bch2_btree_node_write_trans() Avoid screwing up path->lock_seq. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-09 23:38:41 -05:00
Kent Overstreet	4bd06f07bc	bcachefs: Fixes for snapshot_tree.master_subvol Ensure that snapshot_tree.master_subvol is cleared when we delete the master subvolume in a tree of snapshots, and allow for snapshot trees that don't have a master subvolume in fsck. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-09 23:38:41 -05:00
Kent Overstreet	b5e4cd0871	bcachefs: Don't rely on snapshot_tree.master_subvol for reattaching Previously, fsck used the snapshot tree's master subvol for finding the root inode number - but the master subvol might have been deleted, and setting a new one should be a user operation; meaning we can't rely on it existing. Fortunately, for finding the root inode number in a tree of snapshots, finding any associated subvolume works. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-09 23:38:41 -05:00
Kent Overstreet	4541408391	bcachefs: bch2_kvmalloc() Add a version of kvmalloc() that doesn't have the INT_MAX limit; large filesystems do hit this. We'll want to get rid of the in-memory bucket gens array, but we're not there quite yet. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-09 23:38:41 -05:00
Kent Overstreet	fa3e5135e4	bcachefs: Fix assert for online fsck We can't check if we're racing with fsck ending until mark_lock is held. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-09 23:38:41 -05:00
Kent Overstreet	cf3da2d627	bcachefs: Handle -BCH_ERR_need_mark_replicas in gc Locking considerations (possibly no longer relevant?) mean that when an accounting update needs a new superblock replicas entry to be created, it's deferred to the transaction commit error path. But accounting updates for gc/fsck aren't done from the transaction commit path - so we need to handle -BCH_ERR_btree_insert_need_mark_replicas locally. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-09 23:38:41 -05:00
Kent Overstreet	861cd0f606	bcachefs: Write lock btree node in key cache fills this addresses a key cache coherency bug Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-09 23:38:41 -05:00
Kent Overstreet	baf13d8344	bcachefs: kill __bch2_btree_iter_flags() bch2_btree_iter_flags() now takes a level parameter; this fixes a bug where using a node iterator on a leaf wouldn't set BTREE_ITER_with_key_cache, leading to fun cache coherency bugs. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-09 23:38:41 -05:00
Kent Overstreet	30e32692d6	bcachefs: Drop redundant "read error" call from btree_gc The btree node read error path already calls topology error, so this is entirely redundant, and we're not specific enough about our error codes - this was triggering for bucket_ref_update() errors. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-09 23:38:41 -05:00
Kent Overstreet	6542afe299	bcachefs: Drop racy warning Checking for writing past i_size after unlocking the folio and clearing the dirty bit is racy, and we already check it at the start. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-09 23:38:41 -05:00
Kent Overstreet	0475c7639e	bcachefs: better check_bp_exists() error message Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-09 23:38:41 -05:00
Hongbo Li	d01ea14da7	bcachefs: add counter_flags for counters In bcachefs, the io_read and io_write counters record the amount of data which has been read and written. They increase in units of sectors, so to display correctly, they need to be shifted to the left by the size of a sector. Other counters like io_move, move_extent_{read, write, finish} also have this problem. In order to support different units, we add an extra column to mark the counter type by using TYPE_COUNTER and TYPE_SECTORS in BCH_PERSISTENT_COUNTERS(). Fixes: `1c6fdbd8f2` ("bcachefs: Initial commit") Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-09 23:38:41 -05:00
Kent Overstreet	3db3084a86	bcachefs: bcachefs_metadata_version_autofix_errors It's time to make self healing the default: change the error action for old filesystems to fix_safe, matching the default for current filesystems. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-09 23:38:41 -05:00
Kent Overstreet	df448ca355	bcachefs: bcachefs_metadata_version_persistent_inode_cursors Persistent cursors for inode allocation. A free inodes btree would add substantial overhead to inode allocation and freeing - a "next num to allocate" cursor is always going to be faster. We just need it to be persistent, to avoid scanning the inodes btree from the start on startup. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-09 23:38:41 -05:00
Kent Overstreet	59c50511f7	bcachefs: bcachefs_metadata_version_inode_depth This adds a new inode field, bi_depth, for directory inodes: this allows us to make the check_directory_structure pass much more efficient. Currently, to ensure the filesystem is fully connected and has no loops, for every directory we follow backpointers until we find the root. But by adding a depth counter, it suffices to only check the parent of each directory, and check that the parent's bi_depth is smaller. (fsck doesn't require that bi_depth = parent->bi_depth + 1; if a rename causes bi_depth to be off, but the chain to the root is still strictly decreasing, then the algorithm still works and there's no need for fsck to fixup the bi_depth fields). We've already checked backpointers, so we know that every directory (excluding the root) has a valid parent: if bi_depth is always decreasing, every chain must terminate, and terminate at the root directory. bi_depth will not necessarily be correct when fsck runs, due to directory renames - we can't change bi_depth on every child directory when renaming a directory. That's ok; fsck will silently fix the bi_depth field as needed, and future fsck runs will be much faster. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-29 13:30:39 -05:00
Kent Overstreet	80c6352c2c	bcachefs: Option changes now get propagated to reflinked data Now that bch2_move_get_io_opts() re-propagates changed inode io options to bch_extent_rebalance, we can properly support changing IO path options for reflinked data. Changing a per-file IO path option, either via the xattr interface or via the BCHFS_IOC_REINHERIT_ATTRS ioctl, will now trigger a scan (the inode number is marked as needing a scan, via bch2_set_rebalance_needs_scan()), and rebalance will use bch2_move_data(), which will walk the inode number and pick up the new options. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-29 13:30:39 -05:00
Kent Overstreet	ea4f9e75ec	bcachefs: bcachefs_metadata_version_reflink_p_may_update_opts Previously, io path option changes on a file would be picked up automatically and applied to existing data - but not for reflinked data, as we had no way of doing this safely. A user may have had permission to copy (and reflink) a given file, but not write to it, and if so they shouldn't be allowed to change e.g. nr_replicas or other options. This uses the incompat feature mechanism in the previous patch to add a new incompatible flag to bch_reflink_p, indicating whether a given reflink pointer may propagate io path option changes back to the indirect extent. In this initial patch we're only setting it for the source extents. We'd like to set it for the destination in a reflink copy, when the user has write access to the source, but that requires mnt_idmap which is not currently plumbed up to remap_file_range. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-29 13:30:39 -05:00
Kent Overstreet	a36d8f0e0e	bcachefs: BCH_SB_VERSION_INCOMPAT We've been getting away from feature bits: they don't have any kind of ordering, and thus it's possible for people to enable weird combinations of features that were never tested or intended to be run. Much better to just give every new feature, compatible or incompatible, a version number. Additionally, we probably won't ever rev the major version number: major version numbers represent incompatible versions, but that doesn't really fit with how we actually roll out incompatible features - we need a better way of rolling out incompatible features. So, this patch adds two new superblock fields: - BCH_SB_VERSION_INCOMPAT - BCH_SB_VERSION_INCOMPAT_ALLOWED BCH_SB_VERSION_INCOMPAT_ALLOWED indicates that incompatible features up to version number x are allowed to be used without user prompting, but it does not by itself deny old versions from mounting. BCH_SB_VERSION_INCOMPAT does deny old versions from mounting, and must be <= BCH_SB_VERSION_INCOMPAT_ALLOWED. BCH_SB_VERSION_INCOMPAT will only be set when a codepath attempts to use an incompatible feature, so as to not unnecessarily break compatibility with old versions. bch2_request_incompat_feature() is the new interface to check if an incompatible feature may be used. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-29 13:30:39 -05:00
Kent Overstreet	d884cf189a	bcachefs: Only run check_backpointers_to_extents in debug mode The backpointers passes, check_backpointers_to_extents() and check_extents_to_backpointers() are the most expensive fsck passes. Now that we're running the same check and repair code when using a backpointer at runtime (via bch2_backpointer_get_key()) that fsck does, there's no reason fsck needs to - except to verify that the filesystem really has no errors in debug mode. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-29 13:30:39 -05:00
Kent Overstreet	7611d6b5d1	bcachefs: better backpointer_target_not_found() error message Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-29 13:30:39 -05:00
Kent Overstreet	c2c2a4d642	bcachefs: bch2_backpointer_get_key() now repairs dangling backpointers Continuing on with the self healing theme, we should be running any check and repair code at runtime that we can - instead of declaring the filesystem inconsistent. This will also let us skip running the backpointers -> extents fsck pass except in debug mode. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-29 13:30:39 -05:00
Kent Overstreet	c738866e47	bcachefs: check_extents_to_backpointers() now only checks buckets with mismatches Instead of walking every extent and every backpointer it points to, first sum up backpointers in each bucket and check for mismatches, and only look for missing backpointers if mismatches were detected, and only check extents in those buckets. This is a major fsck scalability improvement, since the two backpointers passes (backpointers -> extents and extents -> backpointers) are the most expensive fsck passes by far. Additionally, to speed up the upgrade for backpointer bucket gens, or in situations when we have to rebuild alloc info, add a special case for when no backpointers are found in a bucket - don't check each individual backpointer (in particular, avoiding the write buffer flushes), just recreate them. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-29 13:30:39 -05:00
Kent Overstreet	056cae1c00	bcachefs: Add write buffer flush param to backpointer_get_key() In an upcoming patch bch2_backpointer_get_key() will be repairing when it finds a dangling backpointer; it will need to flush the btree write buffer before it can definitively say there's an error. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-29 13:30:39 -05:00
Kent Overstreet	7171b1fd27	bcachefs: kill __bch2_extent_ptr_to_bp() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-29 13:30:39 -05:00
Kent Overstreet	aca7a26f7f	bcachefs: bch2_extent_ptr_to_bp() no longer depends on device bch_backpointer no longer contains the bucket_offset field, it's just a direct LBA mapping (with low bits to account for compressed extent splitting), so we don't need to refer to the device to construct it anymore. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-29 13:30:39 -05:00
Kent Overstreet	ba9752e5f4	bcachefs: bcachefs_metadata_version_disk_accounting_big_endian Fix sort order for disk accounting keys, in order to fix a regression on mount times. The typetag is now the most significant byte of the key, meaning disk accounting keys of the same type now sort together. This lets us skip over disk accounting keys that aren't mirrored in memory when reading accounting at startup, instead of having them interleaved with other counter types. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-29 13:30:39 -05:00
Kent Overstreet	ebdca07268	bcachefs: bcachefs_metadata_version_backpointer_bucket_gen New on disk format version: backpointers now include the generation number of the bucket they refer to, and the obsolete bucket_offset field (no longer needed because we no longer store backpointers in alloc keys) is gone. This is an expensive forced upgrade - hopefully the last; we have to run the extents_to_backpointers recovery pass to regenerate backpointers. It's a forced incompatible upgrade because the alternative would've been permanently making backpointers bigger, and as one of the biggest btrees (along with the extents btree) that's not an ideal option. It's worth it though, because this allows us to make the check_extents_to_backpointers pass drastically cheaper: an upcoming patch changes it to sum up backpointers in a bucket and check the sum against the sector counts for that bucket, only looking for missing backpointers if they don't match (and then only for specific buckets). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-29 13:30:39 -05:00
Kent Overstreet	6679e363f4	bcachefs: bch2_btree_path_peek_slot() doesn't return errors Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-29 13:30:39 -05:00
Kent Overstreet	07c1a6fa90	bcachefs: trace_key_cache_fill Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-29 13:30:39 -05:00
Kent Overstreet	17d678bcdd	bcachefs: Log message in journal for snapshot deletion Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-29 13:30:39 -05:00
Kent Overstreet	54c9b92fc7	bcachefs: bch2_trans_log_msg() Export a helper for logging to the journal when we're already in a transaction context. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-29 13:30:39 -05:00
Kent Overstreet	d0855e2106	bcachefs: Kill snapshot_t->equiv Now entirely dead code. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-29 13:30:39 -05:00
John Garry	19206d3f5e	block: Delete bio_set_prio() Since commit `43b62ce3ff` ("block: move bio io prio to a new field"), macro bio_set_prio() does nothing but set bio->bi_ioprio. All other places just set bio->bi_ioprio directly, so replace bio_set_prio() remaining callsites with setting bio->bi_ioprio directly and delete that macro. Signed-off-by: John Garry <john.g.garry@oracle.com> Acked-by: Jack Wang <jinpu.wang@ionos.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20241202111957.2311683-3-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-12-23 08:17:23 -07:00
Kent Overstreet	35c5609abf	bcachefs: Snapshot deletion no longer uses snapshot_t->equiv Switch to generating a private list of interior nodes to delete, instead of using the equivalence class in the global data structure. This eliminates possible races with snapshot creation, and is much cleaner - it'll let us delete a lot of janky code for calculating and maintaining the equivalence classes. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:23 -05:00
Kent Overstreet	85c060f62d	bcachefs: Kill equiv_seen arg to delete_dead_snapshots_process_key() When deleting dead snapshots, we move keys from redundant interior snapshot nodes to child nodes - unless there's already a key, in which case the ancestor key is deleted. Previously, we tracked via equiv_seen whether the child snapshot had a key, but this was tricky w.r.t. transaction restarts, and not transactionally safe w.r.t. updates in the child snapshot. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:23 -05:00
Kent Overstreet	92e31d4251	bcachefs: Don't run overwrite triggers before insert This breaks when the trigger is inserting updates for the same btree, as the inode trigger now does. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:23 -05:00
Kent Overstreet	f859bc945e	bcachefs: alloc_data_type_set() happens in alloc trigger Originally, we ran insert triggers before overwrite so that if an extent was being moved (by fallocate insert/collapse range), the bucket sector count wouldn't hit 0 partway through, and so we don't trigger state changes caused by that too soon. But this is better solved by just moving the data type change to the alloc trigger itself, where it's already called. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:23 -05:00
Kent Overstreet	b9a37144da	bcachefs: Fix key cache + BTREE_ITER_all_snapshots Normally, whiteouts (KEY_TYPE_whiteout) are filtered from btree lookups, since they exist only to represent deletions of keys in ancestor snapshots - except, they should not be filtered in BTREE_ITER_all_snapshots mode, so that e.g. snapshot deletion can clean them up. This means that the key cache has to store whiteouts, and key cache fills cannot filter them. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:23 -05:00
Kent Overstreet	7e320a4063	bcachefs: Fix btree_trans_peek_key_cache() BTREE_ITER_all_snapshots In BTREE_ITER_all_snapshots mode, we're required to only return keys where the snapshot field matches the iterator position - BTREE_ITER_filter_snapshots requires pulling keys into the key cache from ancestor snapshots, so we have to check for that. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:23 -05:00
Kent Overstreet	c50341be4e	bcachefs: tidy btree_trans_peek_journal() Change to match bch2_btree_trans_peek_updates() calling convention. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:23 -05:00
Kent Overstreet	68eb4fdd8c	bcachefs: tidy up __bch2_btree_iter_peek() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:23 -05:00
Kent Overstreet	25a3123a67	bcachefs: check_indirect_extents can run online Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:23 -05:00
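
The dirent size formula quoted in commit e614a6c52d above, occupied_size = 40 + ALIGN(9 + namelen, 8), can be worked through with a small sketch. This is an illustration only, not the code from that commit: the helper names `ALIGN_UP`, `dirent_occupied_size()` and `dirent_size_selfcheck()` are hypothetical, and ALIGN is assumed to round up to the next multiple of 8, as the kernel's ALIGN() macro does.

```c
#include <assert.h>
#include <stddef.h>

/* Round x up to the next multiple of a; a must be a power of two.
 * Hypothetical stand-in for the kernel's ALIGN() macro. */
#define ALIGN_UP(x, a) (((x) + ((size_t)(a) - 1)) & ~((size_t)(a) - 1))

/* Hypothetical helper applying the formula from the commit message:
 * occupied_size = 40 + ALIGN(9 + namelen, 8) */
static size_t dirent_occupied_size(size_t namelen)
{
	return 40 + ALIGN_UP(9 + namelen, 8);
}

/* Worked values: names of 1..7 characters all round up to the same size,
 * and an 8-character name moves to the next 8-aligned step. */
static void dirent_size_selfcheck(void)
{
	assert(dirent_occupied_size(1) == 56);	/* 40 + ALIGN(10, 8) = 40 + 16 */
	assert(dirent_occupied_size(7) == 56);	/* 40 + ALIGN(16, 8) = 40 + 16 */
	assert(dirent_occupied_size(8) == 64);	/* 40 + ALIGN(17, 8) = 40 + 24 */
}
```

Under this formula the per-dirent contribution grows in steps of 8 as the name gets longer, so a directory's size is the sum of these values over its entries, assuming each entry is counted once.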

5280 Commits