linux-loongson

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson synced 2025-09-02 16:44:59 +00:00

Author	SHA1	Message	Date
Kent Overstreet	e75993b0bf	bcachefs: Fix BCH_ERR_data_read_csum_err_maybe_userspace in retry path When we do a read to a buffer that's mapped into userspace, it's possible to get a spurious checksum error if userspace was modified the buffer at the same time. When we retry those, they have to be bounced before we know definitively whether we're reading corrupt data. But the retry path propagates read flags differently, so needs special handling. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-16 13:47:55 -04:00
Kent Overstreet	943f0cfb15	bcachefs: Convert read path to standard error codes Kill the READ_ERR/READ_RETRY/READ_RETRY_AVOID enums, and add standard error codes that describe precisely which error occured. This is going to be used for the data move path, to move but poison extents with checksum errors. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-16 13:47:55 -04:00
Kent Overstreet	5a06cb8000	bcachefs: Debug params for data corruption injection dm-flakey is busted, and this is simpler anyways - this lets us test the checksum error retry ptahs Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-16 13:47:55 -04:00
Kent Overstreet	6d80fca9ef	bcachefs: Don't create bch_io_failures unless it's needed Only needed in retry path, no point in wasting stack space. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-16 13:47:55 -04:00
Kent Overstreet	9ec0089149	bcachefs: bch2_bkey_ptrs_rebalance_opts() Small optimization for bch2_bkey_sectors_need_rebalance() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-16 13:47:55 -04:00
Kent Overstreet	7c1e2a254f	bcachefs: Add a cond_resched() to btree cache teardown [12308.606480] watchdog: BUG: soft lockup - CPU#18 stuck for 26s! [umount:48479] [12308.606485] Modules linked in: bcachefs lz4hc_compress lz4_compress lz4_decompress sunrpc overlay nf_conntrack_netlink xt_nat xt_tcpudp veth xt_conntrack xt_MASQUERADE bridge stp llc xfrm_user ip6table_nat ip6table_filter ip6_tables iptable_nat xt_addrtype iptable_filter ip_tables x_tables nfnetlink_cttimeout nfnetlink openvswitch nsh nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 psample ext4 mbcache jbd2 nls_iso8859_1 nls_cp850 vfat fat binfmt_misc skx_edac_common nfit edac_core libnvdimm cbc encrypted_keys intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common ipmi_ssif x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm drivetemp rapl intel_cstate coretemp mgag200 i2c_algo_bit ixgbe drm_shmem_helper drm_kms_helper mdio_devres xfrm_algo mdio drm ptp intel_uncore mei_me efi_pstore evdev uas pl2303 pps_core libphy usb_storage usbserial lpc_ich mei drm_panel_orientation_quirks acpi_power_meter tiny_power_button ipmi_si mfd_core intel_pch_thermal acpi_tad acpi_ipmi ioatdma [12308.606541] ipmi_devintf ipmi_msghandler dca wmi button efivarfs polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 sha1_generic xhci_pci xhci_hcd aesni_intel ehci_pci ehci_hcd gf128mul crypto_simd cryptd usbcore hpwdt usb_common [12308.606557] CPU: 18 UID: 0 PID: 48479 Comm: umount Tainted: G L 6.14.0-rc6-x86_64-00159-ga09496a03e63 #1 [12308.606560] Tainted: [L]=SOFTLOCKUP [12308.606561] Hardware name: HPE ProLiant DL380 Gen10/ProLiant DL380 Gen10, BIOS U30 07/20/2023 [12308.606563] RIP: 0010:clear_page_erms+0x7/0x10 [12308.606570] Code: 48 89 47 38 48 8d 7f 40 75 d9 90 c3 cc cc cc cc 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 b9 00 10 00 00 31 c0 <f3> aa c3 cc cc cc cc 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 [12308.606572] RSP: 0018:ffff9ed5b622fba0 EFLAGS: 00010246 [12308.606574] RAX: 0000000000000000 RBX: ffff90347fffe6c0 RCX: 00000000000004c0 [12308.606575] RDX: ffffe34ea9bec1c0 RSI: 00000000000405f0 RDI: ffff902eafb07b40 [12308.606576] RBP: ffff9ed5b622fbf0 R08: 0000000000000001 R09: 0000000000000006 [12308.606577] R10: 0000000000040001 R11: 0000000000000000 R12: ffffe34ea9bec000 [12308.606578] R13: 0000000000000000 R14: 0000000000000006 R15: ffffe34ea9bed000 [12308.606580] FS: 00007fe704ecfb68(0000) GS:ffff9053fea00000(0000) knlGS:0000000000000000 [12308.606581] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [12308.606582] CR2: 00007f18159068ae CR3: 00000001314d0005 CR4: 00000000007726f0 [12308.606583] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [12308.606584] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [12308.606584] PKRU: 55555554 [12308.606585] Call Trace: [12308.606587] <IRQ> [12308.606590] ? show_regs.cold+0x19/0x28 [12308.606595] ? watchdog_timer_fn.cold+0x3d/0x9d [12308.606598] ? __pfx_watchdog_timer_fn+0x10/0x10 [12308.606602] ? __hrtimer_run_queues+0x12e/0x250 [12308.606607] ? hrtimer_interrupt+0xfd/0x220 [12308.606609] ? __sysvec_apic_timer_interrupt+0x53/0xe0 [12308.606614] ? sysvec_apic_timer_interrupt+0x76/0xa0 [12308.606619] </IRQ> [12308.606620] <TASK> [12308.606620] ? asm_sysvec_apic_timer_interrupt+0x1b/0x20 [12308.606626] ? clear_page_erms+0x7/0x10 [12308.606628] ? __free_pages_ok+0x374/0x640 [12308.606633] free_frozen_pages+0x34/0x570 [12308.606636] __folio_put+0x87/0xe0 [12308.606641] free_large_kmalloc+0x70/0x80 [12308.606645] kfree+0x2f6/0x390 [12308.606648] kvfree+0x2d/0x40 [12308.606653] __btree_node_data_free+0xaf/0xf0 [bcachefs] [12308.606726] btree_node_data_free+0x6a/0x80 [bcachefs] [12308.606778] bch2_fs_btree_cache_exit+0x262/0x440 [bcachefs] [12308.606829] bch2_fs_release+0xe8/0x340 [bcachefs] [12308.606905] kobject_put+0x60/0xc0 [12308.606908] bch2_fs_free+0xdd/0x120 [bcachefs] [12308.606981] bch2_kill_sb+0x1e/0x30 [bcachefs] [12308.607051] deactivate_locked_super+0x32/0xb0 [12308.607055] deactivate_super+0x40/0x50 [12308.607057] cleanup_mnt+0xc3/0x160 [12308.607060] __cleanup_mnt+0x12/0x20 [12308.607062] task_work_run+0x5f/0xa0 [12308.607064] syscall_exit_to_user_mode+0x194/0x1a0 [12308.607066] do_syscall_64+0x67/0x170 [12308.607068] entry_SYSCALL_64_after_hwframe+0x76/0x7e [12308.607070] RIP: 0033:0x7fe704e66eed [12308.607073] Code: 08 49 89 ca b8 a5 00 00 00 0f 05 48 89 c7 e8 8a e6 ff ff 48 83 c4 Reported-by: Stijn Tintel <stijn@linux-ipv6.be> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-16 13:47:55 -04:00
Kent Overstreet	c991fbee8e	bcachefs: rebalance, copygc status also print stacktrace These are commonly needed when debugging, and saves from having to ask users to dig. Also, rebalance_status now includes pending rebalance work. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-16 13:47:55 -04:00
Kent Overstreet	8dc4514d58	bcachefs: Kill bch2_remount() Single caller, so inline it. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:03:16 -04:00
Kent Overstreet	a2e9e68746	bcachefs: Kill a bit of dead code Found with CC=clang W=1 Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:16 -04:00
Thorsten Blum	ff4cb203cc	bcachefs: Use max() to improve gen_after() Use max() to simplify gen_after() and improve its readability. Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:16 -04:00
Thorsten Blum	c073ec6bec	bcachefs: Remove unnecessary byte allocation The extra byte is not used - remove it. Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:16 -04:00
Kent Overstreet	94373026d9	bcachefs: We no longer read stripes into memory at startup And the stripes heap gets deleted. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:16 -04:00
Kent Overstreet	434a3f2ffa	bcachefs: trace_stripe_create Add a simple tracepoint for stripe creation, we'll want to expand this later. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:16 -04:00
Kent Overstreet	6c336144b9	bcachefs: get_existing_stripe() uses new stripe lru Convert to the new persistent stripe LRU. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:16 -04:00
Kent Overstreet	039790cfb5	bcachefs: ec_stripe_delete() uses new stripe lru Convert to the new persistent stripe LRU. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:16 -04:00
Kent Overstreet	4b0fac4bed	bcachefs: journal write path comment Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:16 -04:00
Kent Overstreet	981e380144	bcachefs: Kick devices out after too many write IO errors We're improving our handling of write errors - we shouldn't write degraded data just because a write failed once, we should retry it (on other devices, if possible). But for this to work, we need to kick devices out when they're only returning errors - otherwise those retries will loop infinitely. This adds a configurable timeout - if writes are failing for too long, we'll set that device read-only. In the future we should also implement more tracking and another knob for an "allowed error rate", so that we can kick out drives that are acting "unhealthy". Another thing we'll want is a mechanism (likely in userspace) for bringing a device back in after a transient error - perhaps a cable was jiggled, or there was a controller reset. After transient errors we also need a mechanism to walk (from the journal) recent btree updates that weren't flushed to that device and treat them as "degraded", since unflushed data may well not have been written. Out of scope for this patch, but becoming relevant. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:16 -04:00
Kent Overstreet	d71e023376	bcachefs: Change BCH_MEMBER_STATE_failed semantics Previously, we woudn't try to read at all from a failed device - that doesn't make much sense, the device may be unhealthy (perhaps taking longer than it should to service reads), but if it's our only option we should still try to read from it. Now, bch2_bkey_pick_read_device() will pick failed devices only if there are no non-failed replicas to read from. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:16 -04:00
Kent Overstreet	cf164a9106	bcachefs: bch2_dev_get_ioref() may now sleep The next patch implementing freezing will change bch2_dev_get_ioref() to sleep if a device is currently frozen. Add an annotation and fix the journal code accordingly. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:16 -04:00
Kent Overstreet	2efa8397ca	bcachefs: Fix btree_node_scan io_ref handling This was completely fubar; it's now simplified a bit as well. Note that for_each_online_member() takes and releases io_refs as it iterates, so we need to release that if we break. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:16 -04:00
Kent Overstreet	d5308203a8	bcachefs: Implement blk_holder_ops We can't use the standard fs_holder_ops because they're meant for single device filesystems - fs_bdev_mark_dead() in particular - and they assume that the blk_holder is the super_block, which also doesn't work for a multi device filesystem. These generally follow the standard fs_holder_ops; the locking/refcounting is a bit simplified because c->ro_ref suffices, and bch2_fs_bdev_mark_dead() is not necessarily shutting down the entire filesystem. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:16 -04:00
Kent Overstreet	1fdbe0b184	bcachefs: Make sure c->vfs_sb is set before starting fs This is necessary for the new blk_holder_ops, which want the vfs super_block available for synchronization. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:16 -04:00
Kent Overstreet	13fd6be102	bcachefs: Stash a pointer to the filesystem for blk_holder_ops Note that we open block devices before we allocate bch_fs, but once attached to a filesystem they will be closed before the bch_fs is torn down - so stashing a pointer without a refcount looks incorrect but it's not. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:16 -04:00
Kent Overstreet	b31c070407	bcachefs: Finish bch2_account_io_completion() conversions More prep work for automatically kicking devices out after too many IO errors. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:16 -04:00
Kent Overstreet	3526bca36b	bcachefs: bch2_account_io_completion() We need to start accounting successes for every IO, not just failures, so introduce a unified hook for io completion accounting and convert io_read.c. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:16 -04:00
Kent Overstreet	3480aecd5f	bcachefs: Fix read path io_ref handling We were using our device pointer after we'd released our ref to it. Unlikely to be a race that's practical to hit, since actually removing a member device is a whole process besides just taking it offline, but - needs to be fixed. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:16 -04:00
Kent Overstreet	7bc5808168	bcachefs: data_update now checks for extents that can't be moved If a device is ro or failed, we might not have anywhere to move a replica. Check for this early, before doing the read and attempting to write. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:15 -04:00
Kent Overstreet	fba513a9ee	bcachefs: give bch2_write_super() a proper error code Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:15 -04:00
Kent Overstreet	4a90675cfe	bcachefs: bcachefs_metadata_version_extent_flags This implements a new extent field bitflags that apply to the whole extent. There's been a couple things we've wanted this for in the past, but the immediate need is extent poisoning, to solve a rebalance issue. Unknown extent fields can't be parsed (we won't known their size, so we can't advance to the next field), so this is an incompat feature, and using it prevents the filesystem from being mounted by old versions. This also adds the BCH_EXTENT_poisoned flag; this indicates that the data is known to be bad (i.e. there was a checksum error, and we had to write a new checksum) and reads will return errors. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:15 -04:00
Kent Overstreet	6422bf8117	bcachefs: bch2_request_incompat_feature() now returns error code For future usage, we'll want a dedicated error code for better debugging. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:15 -04:00
Thorsten Blum	bafd41b435	bcachefs: Fix error type in bch2_alloc_v3_validate() Use error type alloc_v3_unpack_error in bch2_alloc_v3_validate(). Fixes: `b65db750e2` ("bcachefs: Enumerate fsck errors") Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:15 -04:00
Kent Overstreet	fb195fa753	bcachefs: BCH_SB_FEATURES_ALL includes BCH_FEATURE_incompat_verison_field These features are set on format and incompat upgarde. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:15 -04:00
Kent Overstreet	24d790a7da	bcachefs: sysfs internal/trigger_btree_updates Add a debug knob to manually trigger the btree updates worker. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:15 -04:00
Joshua Ashton	d37c14ac6f	bcachefs: bcachefs_metadata_version_casefolding This patch implements support for case-insensitive file name lookups in bcachefs. The implementation uses the same UTF-8 lowering and normalization that ext4 and f2fs is using. More information is provided in Documentation/bcachefs/casefolding.rst Compatibility notes: This uses the new versioning scheme for incompatible features where an incompatible feature is tied to a version number: the superblock says "we may use incompat features up to x" and "incompat features up to x are in use", disallowing mounting by previous versions. Additionally, and old style incompat feature bit is used, so that kernels without utf8 casefolding support know if casefolding specifically is in use and they're allowed to mount. Signed-off-by: Joshua Ashton <joshua@froggi.es> Cc: André Almeida <andrealmeid@igalia.com> Cc: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:15 -04:00
Joshua Ashton	76872d46b7	bcachefs: Split out dirent alloc and name initialization Splits out the code that allocates the dirent and initializes the name to make things easier to implement casefolding in a future commit. Cc: André Almeida <andrealmeid@igalia.com> Cc: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Joshua Ashton <joshua@froggi.es> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:15 -04:00
Kent Overstreet	72f4edcf45	bcachefs: Kill dirent_occupied_size() in create path Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:15 -04:00
Kent Overstreet	68171d91ce	bcachefs: Kill dirent_occupied_size() in rename path Cc: Hongbo Li <lihongbo22@huawei.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:15 -04:00
Kent Overstreet	6756e385a5	bcachefs: bcachefs_metadata_version_stripe_lru Add a persistent LRU for stripes, ordered by "number of empty blocks", i.e. order in which we wish to reuse them. This will replace the in-memory stripes heap, so we can kill off reading stripes into memory at startup. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:15 -04:00
Kent Overstreet	88d961b518	bcachefs: bcachefs_metadata_version_stripe_backpointers Stripes now have backpointers. This is needed for proper scrub - stripe checksums need to be verified, separately from extents within the stripe, since a block may not be full of live extents but it's still needed for reconstruct. And this will be needed for (efficient) evacuate/repair paths. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:15 -04:00
Kent Overstreet	69bd8a9277	bcachefs: Advance bch_alloc.oldest_gen if no stale pointers Now that we've got cached backpointers and aren't leaving around stale pointers on bucket invalidation, we no longer need the periodic (rare) gc_gens - which recalculates each bucket's oldest gen to avoid wraparound. We can't delete that code because we've got to support existing filesystems that will still have stale pointers, but this gets rid of another scalability limit. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:14 -04:00
Kent Overstreet	942a418c7a	bcachefs: Invalidate cached data by backpointers If we don't leave stale pointers around, we won't have to deal with bucket gen wraparound. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:14 -04:00
Kent Overstreet	15800f3d4b	bcachefs: bcachefs_metadata_version_cached_backpointers Cached pointers now have backpointers. This means that we'll be able to kill cached pointers in the bucket_invalidate path, when invalidating/reusing buckets containing cached data, instead of leaving them around to be cleaned up by gc_gens garbago collection - which requires a full metadata scan. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:14 -04:00
Kent Overstreet	65bc7688b8	bcachefs: rework bch2_trans_commit_run_triggers() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:14 -04:00
Kent Overstreet	c7c07bf250	bcachefs: Better trigger ordering Transactional triggers need to run in a defined ordering, which is not quite the same as btree ID integer comparison. Previously this was handled in a hacky way in bch2_trans_commit_run_triggers(), since it was only the alloc btree that needed special handling, but upcoming stripe btree changes are going to require more ordering changes - so, define that ordering. Next patch will change the transaction commit path to use it. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:14 -04:00
Kent Overstreet	cc297dfb41	bcachefs: bch2_trigger_stripe_ptr() no longer uses ec_stripes_heap_lock Introduce per-entry locks, like with struct bucket - the stripes heap is going away. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:14 -04:00
Kent Overstreet	bc76ba70d2	bcachefs: Rework bch2_check_lru_key() It's now easier to add new LRU types. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:14 -04:00
Kent Overstreet	3aff608b86	bcachefs: decouple bch2_lru_check_set() from alloc btree Pass in the backpointer explicitly, instead of assuming 'referring_k' is an alloc key and calculating it. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:14 -04:00
Kent Overstreet	b8e37c1645	bcachefs: s/BCH_LRU_FRAGMENTATION_START/BCH_LRU_BUCKET_FRAGMENTATION/ FRAGMENTATION_START was incorrect, there's currently only one fragmentation LRU (at the end of the reserved bits for LRU type), and we're getting ready to add a stripe fragmentation lru - so give it a better name. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:14 -04:00
Kent Overstreet	e130496707	bcachefs: bch2_lru_change() checks for no-op Minor cleanup, no reason for the caller to have to this. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:14 -04:00
Kent Overstreet	cb87f623c1	bcachefs: minor journal errcode cleanup Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:14 -04:00
Kent Overstreet	1ccbcd3205	bcachefs: bch2_write_op_error() now prints info about data update A user has been seeing the "error verifying existing checksum while rewriting existing data (memory corruption?)" error. This generally indicates a hardware issue (and that may be the case here), but it might also indicate a bug, in which case we need more information to look for patterns. Reported-by: Roland Vet <vet.roland@protonmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:14 -04:00
Kent Overstreet	3faa4647a0	bcachefs: metadata_target is not an inode option This option only applies filesystem wide. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:14 -04:00
Andreas Gruenbacher	f27614652c	bcachefs: eytzinger1_{next,prev} cleanup The eytzinger code was previously relying on the following wrap-around properties and their "eytzinger0" equivalents: eytzinger1_prev(0, size) == eytzinger1_last(size) eytzinger1_next(0, size) == eytzinger1_first(size) However, these properties are no longer relied upon and no longer necessary, so remove the corresponding asserts and forbid the use of eytzinger1_prev(0, size) and eytzinger1_next(0, size). This allows to further simplify the code in eytzinger1_next() and eytzinger1_prev(): where the left shifting happens, eytzinger1_next() is trying to move i to the lowest child on the left, which is equivalent to doubling i until the next doubling would cause it to be greater than size. This is implemented by shifting i to the left so that the most significant bits align and then shifting i to the right by one if the result is greater than size. Likewise, eytzinger1_prev() is trying to move to the lowest child on the right; the same applies here. The 1-offset in (size - 1) in eytzinger1_next() isn't needed at all, but the equivalent offset in eytzinger1_prev() is surprisingly needed to preserve the 'eytzinger1_prev(0, size) == eytzinger1_last(size)' property. However, since we no longer support that property, we can get rid of these offsets as well. This saves one addition in each function and makes the code less confusing. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:14 -04:00
Andreas Gruenbacher	68eb4c5fea	bcachefs: convert eytzinger sort to be 1-based (2) In this second step, transform the eytzinger indexes i, j, and k in eytzinger1_sort_r() from 0-based to 1-based. This step looks a bit messy, but the resulting code is slightly better. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:14 -04:00
Andreas Gruenbacher	3ff0dd28d6	bcachefs: convert eytzinger sort to be 1-based (1) In this first step, convert the eytzinger sort functions to use 1-based primitives. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:14 -04:00
Andreas Gruenbacher	3849bcab4d	bcachefs: convert eytzinger0_find to be 1-based Several of the algorithms on eytzinger trees are implemented in terms of the eytzinger0 primitives. However, those algorithms can just as easily be expressed in terms of the eytzinger1 primitives, and that leads to better and easier to understand code. Start by converting eytzinger0_find(). Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:14 -04:00
Andreas Gruenbacher	956032edd2	bcachefs: Add eytzinger0_find self test Function eytzinger0_find() isn't currently covered, so add a self test. We can rely on eytzinger0_find_le() here because it is being tested independently. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:14 -04:00
Andreas Gruenbacher	63ce189b00	bcachefs: add eytzinger0_find_ge self test Add an eytzinger0_find_ge() self test similar to eytzinger0_find_gt(). Note that this test requires eytzinger0_find_ge() to return the first matching element in the array in case of duplicates. To prevent bisection errors, we only add this test after strenghening the original implementation (see the previous commit). Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:14 -04:00
Andreas Gruenbacher	11223d0e7b	bcachefs: implement eytzinger0_find_ge directly Implement eytzinger0_find_ge() directly instead of implementing it in terms of eytzinger0_find_le() and adjusting the result. This turns eytzinger0_find_ge() into a minimum search, so when there are duplicate elements, the result of eytzinger0_find_ge() will now always point at the first matching element. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:14 -04:00
Andreas Gruenbacher	2182f29545	bcachefs: implement eytzinger0_find_gt directly Instead of implementing eytzinger0_find_gt() in terms of eytzinger0_find_le() and adjusting the result, implement it directly. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:13 -04:00
Andreas Gruenbacher	d7cd33f7ef	bcachefs: add eytzinger0_find_gt self test Add an eytzinger0_find_gt() self test similar to eytzinger0_find_le(). Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:13 -04:00
Andreas Gruenbacher	d384dada0e	bcachefs: simplify eytzinger0_find_le Replace the over-complicated implementation of eytzinger0_find_le() by an equivalent, simpler version. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:13 -04:00
Andreas Gruenbacher	d148d804f2	bcachefs: convert eytzinger0_find_le to be 1-based eytzinger0_find_le() is also easy to concert to 1-based eytzinger (but see the next commit). Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:13 -04:00
Andreas Gruenbacher	c722b818a2	bcachefs: improve eytzinger0_find_le self test Rename eytzinger0_find_test_val() to eytzinger0_find_test_le() and add a new eytzinger0_find_test_val() wrapper that calls it. We have already established that the array is sorted in eytzinger order, so we can use the eytzinger iterator functions and check the boundary conditions to verify the result of eytzinger0_find_le(). Only scan the entire array if we get an incorrect result. When we need to scan, use eytzinger0_for_each_prev() so that we'll stop at the highest matching element in the array in case there are duplicates; going through the array linearly wouldn't give us that. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:13 -04:00
Andreas Gruenbacher	dc5ceaaad8	bcachefs: add eytzinger0_for_each_prev Add an eytzinger0_for_each_prev() macro for iterating through an eytzinger array in reverse. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:13 -04:00
Andreas Gruenbacher	e8a0966ffa	bcachefs: eytzinger0_find_test improvement In eytzinger0_find_test(), remember the smallest element seen so far instead of comparing adjacent array elements. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:13 -04:00
Andreas Gruenbacher	ec70103f9b	bcachefs: eytzinger[01]_test improvement In eytzinger[01]_test(), make sure that eytzinger[01]_for_each() iterates over all array elements. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:13 -04:00
Andreas Gruenbacher	0766f5599c	bcachefs: eytzinger self tests: fix cmp_u16 typo Fix an obvious typo in cmp_u16(). Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:13 -04:00
Andreas Gruenbacher	0ede49212a	bcachefs: eytzinger self tests: missing newline termination pr_info() format strings need to be newline terminated. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:13 -04:00
Andreas Gruenbacher	217ad1d7c7	bcachefs: eytzinger self tests: loop cleanups The iterator variable of eytzinger0_for_each() loops has been changed to be locally scoped at some point, so remove variables defined outside the loop that are now unused. In addition and for clarity, use a different variable inside those loops where an outside variable would be shadowed. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:13 -04:00
Andreas Gruenbacher	d54b82ecc4	bcachefs: EYTZINGER_DEBUG fix When EYTZINGER_DEBUG is defined, <linux/bug.h> needs to be included. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:13 -04:00
Andreas Gruenbacher	f7f9be0238	bcachefs: bch2_blacklist_entries_gc cleanup Use an eytzinger0_for_each() loop here. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:13 -04:00
Kent Overstreet	34a493089a	bcachefs: bch2_bkey_ptr_data_type() now correctly returns cached for cached ptrs Necessary for adding backpointers for cached pointers. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:13 -04:00
Kent Overstreet	fd49882f12	bcachefs: Add time_stat for btree writes We have other metadata IO types covered, this was missing. Note: this includes the time until completion, i.e. including parent pointer update. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:13 -04:00
Kent Overstreet	b7f648e2ec	bcachefs: Add comment explaining why asserts in invalidate_one_bucket() are impossible Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:13 -04:00
Kent Overstreet	7606fb4d26	bcachefs: Ignore backpointers to stripes in ec_stripe_update_extents() Prep work for stripe backpointers: this path previously would get very confused at being asked to process (remove redundant replicas) stripes. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:13 -04:00
Kent Overstreet	898bda5b72	bcachefs: Increase JOURNAL_BUF_NR Increase journal pipelining. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:13 -04:00
Kent Overstreet	35282ce9e8	bcachefs: Free journal bufs when not in use Since we're increasing the number of 'struct journal_bufs', we don't want them all permanently holding onto buffers for the journal data - that'd be 16 * 2MB = 32MB, or potentially more. Add a single-element mempool (open coded, since buffer size varies), this also means we won't be hitting the memory allocator every time we open and close a journal entry/buffer. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:13 -04:00
Kent Overstreet	2e853fdbc7	bcachefs: Don't touch journal_buf->data->seq in journal_res_get This is a small optimization, reducing the number of cachelines we touch in the fast path - and it's also necessary for the next patch that increases JOURNAL_BUF_NR. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:13 -04:00
Kent Overstreet	199a3578ed	bcachefs: Kill journal_res.idx More dead code. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:12 -04:00
Kent Overstreet	c2be81d48a	bcachefs: Kill journal_res_state.unwritten_idx Dead code Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:12 -04:00
Kent Overstreet	3eccc02035	bcachefs: add progress indicator to check_allocations Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:12 -04:00
Kent Overstreet	491eda6394	bcachefs: Add a progress indicator to bch2_dev_data_drop() This code needs quite a bit of work: we don't want to be walking all metadata in the filesystem, we should just be walking backpointers, and it should be switched to a data ioctl that can report progress via a file descriptor, not the system console. But that'll take more work - before we can safely walk only backpointers we need to change device add to not reuse device indexes, since with that change accounting being wrong introduces the possibility of removing a device that still has pointers. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:12 -04:00
Kent Overstreet	baabeb4997	bcachefs: Factor out progress.[ch] the backpointers code has progress indicators; these aren't great, since they print to the dmesg console and we much prefer to have progress indicators reporting to a specific userspace program so they're not spamming the system console. But not all codepaths that need progress indicators support that yet, and we don't want users to think "this is hung". Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:12 -04:00
Kent Overstreet	06284963e3	bcachefs: bch2_inum_offset_err_msg_trans() no longer handles transaction restarts we're starting to use error messages with paths in fsck_errors(), where we do not want nested transaction restart handling, so let's prepare for that. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:12 -04:00
Kent Overstreet	45f0e6c838	bcachefs: bch2_indirect_extent_missing_error() prints path, not just inode number We want all error messages converted to print paths, not just inode numbers - users want this information, and it speeds up debugging too. Auditing and converting all error messages is going to be a big project, so for the moment we're just doing this incrementally. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:12 -04:00
Kent Overstreet	e63cf203d7	bcachefs: Convert migrate to move_data_phys() Iterating over backpointers on a specific device is potentially much cheaper than walking all filesystem data. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:12 -04:00
Kent Overstreet	157ea58341	bcachefs: Read/move path counter work Reorganize counters a bit, grouping related counters together. New counters: - io_read_inline - io_read_hole Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:12 -04:00
Alan Huang	7d8321a286	bcachefs: Fix subtraction underflow When ancestor is less than IS_ANCESTOR_BITMAP, we would get an incorrect result. Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:12 -04:00
Kent Overstreet	f269ae55d2	bcachefs: Scrub Add a new data op to walk all data and metadata in a filesystem, checking if it can be read successfully, and on error repairing from another copy if possible. - New helper: bch2_dev_idx_is_online(), so that we can bail out and report to userspace when we're unable to scrub because the device is offline - data_update_opts, which controls the data move path, now understands scrub: data is only read, not written. The read path is responsible for rewriting on read error, as with other reads. - scrub_pred skips data extents that don't have checksums - bch_ioctl_data has a new scrub member, which has a data_types field for data types to check - i.e. all data types, or only metadata. - Add new entries to bch_move_stats so that we can report numbers for corrected and uncorrected errors - Add a new enum to bch_ioctl_data_event for explicitly reporting completion and return code (i.e. device offline) Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:12 -04:00
Kent Overstreet	3e2ad29865	bcachefs: bch2_btree_node_scrub() Add a function for scrubbing btree nodes - reading them in, and kicking off a rewrite if there's an error. The btree_node_read_done() checks have to be duplicated because we're not using a pointer to a struct btree - the btree node might already be in cache, and we need to check a specific replica, which might not be the one we previously read from. This will be used in the next patch implementing high-level scrub. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:12 -04:00
Kent Overstreet	ca24130ee4	bcachefs: bch2_bkey_pick_read_device() can now specify a device To be used for scrub, where we want the read to come from a specific device. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:12 -04:00
Kent Overstreet	2a2f7aaa8d	bcachefs: __bch2_move_data_phys() now uses bch2_btree_node_rewrite_pos() Kill most of the separate logic for btree nodes. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:12 -04:00
Kent Overstreet	987fdbdb40	bcachefs: bch2_move_data_phys() Add a more general version of bch2_evacuate_bucket - to be used for scrub. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:12 -04:00
Kent Overstreet	12188c9e2b	bcachefs: bch2_btree_node_rewrite_pos() Add a new helper for rewriting a btree node given a just the key, not a pointer to the node itself. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:12 -04:00
Kent Overstreet	ca16fa6b86	bcachefs: backpointer_get_key() doesn't pull in btree node We may not need to pull in a btree node when walking backpointers - don't do so unnecessarily when using backpointer_get_key(). It'll still fall back to backpointer_get_node() in a few situations, including btree roots (where an iterator can't point at just the key), and races due to the interior update path not having deleted a backpointer to an old node yet. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:12 -04:00
Kent Overstreet	dff6de9518	bcachefs: Internal reads can now correct errors Rework the read path so that BCH_READ_NODECODE reads now also self-heal after a read error and a successful retry - prerequisite for scrub. - __bch2_read_endio() now handles a read that's both BCH_READ_NODECODE and a bounce. Normally, we don't want a BCH_READ_NODECODE read to ever allocate a split bch_read_bio: we want to maintain the relationship between the bch_read_bio and the data_update it's embedded in. But correcting read errors requires allocating a split/bounce rbio that's embedded in a promote_op. We do still have a 1-1 relationship, i.e. we only allocate a single split/bounce if it's a BCH_READ_NODECODE, so things hopefully don't get too crazy. - __bch2_read_extent() now is allowed to allocate the promote_op for rewriting after a failed read, even if it's BCH_READ_NODECODE. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:12 -04:00
Kent Overstreet	7b1d655106	bcachefs: Don't self-heal if a data update is already rewriting Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:12 -04:00
Kent Overstreet	4dfb76e0ad	bcachefs: Don't start promotes from bch2_rbio_free() we don't want to block completion of the read - starting a promote calls into the write path, which will block. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:12 -04:00
Kent Overstreet	7e9ed60f5f	bcachefs: Bail out early on alloc_nowait data updates If a data update doesn't want to block on allocations (promotes, self healing on read error) - check if the allocation would fail before kicking off the data update and calling into the write path. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:11 -04:00
Kent Overstreet	c37d42a0e2	bcachefs: Rework init order in bch2_data_update_init() Initialize the write op first, so that in the next patch we can check if the allocator would block (for BCH_WRITE_alloc_nowait ops) and bail out before taking nocow locks/dev refs. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:11 -04:00
Kent Overstreet	29ad31c780	bcachefs: Self healing writes are BCH_WRITE_alloc_nowait If a drive is failing and we're moving data off of it, we can't necessairly depend on capacity/disk reservation calculations to avoid deadlocking/blocking on the allocator. And, we don't want to queue up infinite self healing moves anyways. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:11 -04:00
Kent Overstreet	8ff92a9e4e	bcachefs: Promotes should use BCH_WRITE_only_specified_devs Promotes, like most other internal moves, should only go to the specified target and not fall back to allocating from the full filesystem. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:11 -04:00
Kent Overstreet	d0148e7169	bcachefs: Be stricter in bch2_read_retry_nodecode() Now that data_update embeds bch_read_bio, BCH_READ_NODECODE means that the read is embedded in a a data_update - and we can check in the retry path if the extent has changed and bail out. This likely fixes some subtle bugs with read errors and data moves. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:11 -04:00
Kent Overstreet	6f7111f820	bcachefs: cleanup redundant code around data_update_op initialization Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:11 -04:00
Kent Overstreet	536d789781	bcachefs: bch2_update_unwritten_extent() no longer depends on wbio Prep work for improving bch2_data_update_init(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:11 -04:00
Kent Overstreet	8f97793d67	bcachefs: promote_op uses embedded bch_read_bio Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:11 -04:00
Kent Overstreet	a70bd97630	bcachefs: data_update now embeds bch_read_bio Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:11 -04:00
Kent Overstreet	dfa204b169	bcachefs: rbio_init() cleanup Move more initialization to rbio_init(), to assist in further cleanups. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:11 -04:00
Kent Overstreet	0f856b7228	bcachefs: rbio_init_fragment() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:11 -04:00
Kent Overstreet	14e2523fc5	bcachefs: Rename BCH_WRITE flags fer consistency with other x-macros enums The uppercase/lowercase style is nice for making the namespace explicit. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:11 -04:00
Kent Overstreet	9157b3ddfb	bcachefs: x-macroize BCH_READ flags Will be adding a bch2_read_bio_to_text(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:11 -04:00
Kent Overstreet	9f37016cb2	bcachefs: kill bch_read_bio.devs_have Dead code. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:11 -04:00
Kent Overstreet	3075e68d26	bcachefs: bch2_data_update_inflight_to_text() Add a new helper for bch2_moving_ctxt_to_text(), which may be used to debug if moving_ios are getting stuck. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:11 -04:00
Kent Overstreet	50ca857457	bcachefs: BCH_IOCTL_QUERY_COUNTERS Add an ioctl for querying counters, the same ones provided in /sys/fs/bcachefs/<uuid>/counters/, but more suitable for a 'bcachefs top' command. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:11 -04:00
Kent Overstreet	5ee760f667	bcachefs: BCH_COUNTER_bucket_discard_fast Add a separate counter for fastpath bucket discards, which don't require a journal flush. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:11 -04:00
Kent Overstreet	bbd804f2ad	bcachefs: enum bch_persistent_counters_stable Persistent counters, like recovery passes, include a stable enum in their definition - but this was never correctly plumbed. This allows us to add new counters and properly organize them with a non-stable "presentation order", which can also be used in userspace by the new 'bcachefs fs top' tool. Fortunatel, since we haven't yet added any new counters where presentation order ID doesn't match stable ID, this won't cause any reordering issues. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:11 -04:00
Kent Overstreet	999cc1bb68	bcachefs: Separate running/runnable in wp stats We've got per-writepoint statistics to see how well the writepoint index update threads are pipelining; this separates running vs. runnable so we can see at a glance if they're blocking. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:11 -04:00
Kent Overstreet	78c9c6f6cd	bcachefs: Move write_points to debugfs this was hitting the sysfs 4k limit Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:11 -04:00
Kent Overstreet	55a132c37a	bcachefs: Don't inc io_(read\|write) counters for moves This makes 'bcachefs fs top' more useful; we can now see at a glance whether the IO to the device is being done for user reads/writes, or copygc/rebalance. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:10 -04:00
Kent Overstreet	e5a63ad343	bcachefs: Fix missing increment of move_extent_write counter Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:10 -04:00
Kent Overstreet	c3c9957c81	bcachefs: check_bp_exists() check for backpointers for stale pointers Early version of 'bcachefs_metadata_version_cached_backpointers' was creating backpointers for stale cached pointers - whoops. Now we have to repair those. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:10 -04:00
Kent Overstreet	2deae55804	bcachefs: btree_node_(rewrite\|update_key) cleanup Factor out get_iter_to_node() and use it for btree_node_rewrite_get_iter(), to be used for fixing btree node write error behaviour. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 21:02:10 -04:00
Kent Overstreet	be212d86b1	bcachefs: bs > ps support bcachefs removed most PAGE_SIZE references long ago, so this is easy; only readpage_bio_extend() has to be tweaked to respect the minimum order. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 19:45:57 -04:00
Kent Overstreet	1a2b74d0a2	bcachefs: fix build on 32 bit in get_random_u64_below() bare 64 bit divides not allowed, whoops arm-linux-gnueabi-ld: drivers/char/random.o: in function `__get_random_u64_below': drivers/char/random.c:602:(.text+0xc70): undefined reference to `__aeabi_uldivmod' Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 19:45:54 -04:00
Kent Overstreet	90fd9ad5b0	bcachefs: Change btree wb assert to runtime error We just had a report of the assert for "btree in write buffer for non-write buffer btree" popping during the 6.14 upgrade. - 150TB filesystem, after a reboot the upgrade was able to continue from where it left off, so no major damage. But with 6.14 about to come out we want to get this tracked down asap, and need more data if other users hit this. Convert the BUG_ON() to an emergency read-only, and print out btree, the key itself, and stack trace from the original write buffer update (which did not have this check before). Reported-by: Stijn Tintel <stijn@linux-ipv6.be> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-14 10:25:25 -04:00
Kent Overstreet	9c18ea7ffe	bcachefs: bch2_get_random_u64_below() steal the (clever) algorithm from get_random_u32_below() this fixes a bug where we were passing roundup_pow_of_two() a 64 bit number - we're squaring device latencies now: [ +1.681698] ------------[ cut here ]------------ [ +0.000010] UBSAN: shift-out-of-bounds in ./include/linux/log2.h:57:13 [ +0.000011] shift exponent 64 is too large for 64-bit type 'long unsigned int' [ +0.000011] CPU: 1 UID: 0 PID: 196 Comm: kworker/u32:13 Not tainted 6.14.0-rc6-dave+ #10 [ +0.000012] Hardware name: ASUS System Product Name/PRIME B460I-PLUS, BIOS 1301 07/13/2021 [ +0.000005] Workqueue: events_unbound __bch2_read_endio [bcachefs] [ +0.000354] Call Trace: [ +0.000005] <TASK> [ +0.000007] dump_stack_lvl+0x5d/0x80 [ +0.000018] ubsan_epilogue+0x5/0x30 [ +0.000008] __ubsan_handle_shift_out_of_bounds.cold+0x61/0xe6 [ +0.000011] bch2_rand_range.cold+0x17/0x20 [bcachefs] [ +0.000231] bch2_bkey_pick_read_device+0x547/0x920 [bcachefs] [ +0.000229] __bch2_read_extent+0x1e4/0x18e0 [bcachefs] [ +0.000241] ? bch2_btree_iter_peek_slot+0x3df/0x800 [bcachefs] [ +0.000180] ? bch2_read_retry_nodecode+0x270/0x330 [bcachefs] [ +0.000230] bch2_read_retry_nodecode+0x270/0x330 [bcachefs] [ +0.000230] bch2_rbio_retry+0x1fa/0x600 [bcachefs] [ +0.000224] ? bch2_printbuf_make_room+0x71/0xb0 [bcachefs] [ +0.000243] ? bch2_read_csum_err+0x4a4/0x610 [bcachefs] [ +0.000278] bch2_read_csum_err+0x4a4/0x610 [bcachefs] [ +0.000227] ? __bch2_read_endio+0x58b/0x870 [bcachefs] [ +0.000220] __bch2_read_endio+0x58b/0x870 [bcachefs] [ +0.000268] ? try_to_wake_up+0x31c/0x7f0 [ +0.000011] ? process_one_work+0x176/0x330 [ +0.000008] process_one_work+0x176/0x330 [ +0.000008] worker_thread+0x252/0x390 [ +0.000008] ? __pfx_worker_thread+0x10/0x10 [ +0.000006] kthread+0xec/0x230 [ +0.000011] ? __pfx_kthread+0x10/0x10 [ +0.000009] ret_from_fork+0x31/0x50 [ +0.000009] ? __pfx_kthread+0x10/0x10 [ +0.000008] ret_from_fork_asm+0x1a/0x30 [ +0.000012] </TASK> [ +0.000046] ---[ end trace ]--- Reported-by: Roland Vet <vet.roland@protonmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-13 12:40:22 -04:00
Kent Overstreet	69a5a13a22	bcachefs: target_congested -> get_random_u32_below() get_random_u32_below() has a better algorithm than bch2_rand_range(), it just didn't exist at the time. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-13 12:39:21 -04:00
Kent Overstreet	3bcde88d38	bcachefs: fix tiny leak in bch2_dev_add() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-13 00:23:19 -04:00
Kent Overstreet	dbac8feb23	bcachefs: Make sure trans is unlocked when submitting read IO We were still using the trans after the unlock, leading to this bug in the retry path: 00255 ------------[ cut here ]------------ 00255 kernel BUG at fs/bcachefs/btree_iter.c:3348! 00255 Internal error: Oops - BUG: 00000000f2000800 [#1] SMP 00255 bcachefs (0ca38fe8-0a26-41f9-9b5d-6a27796c7803): /fiotest offset 86048768: no device to read from: 00255 u64s 8 type extent 4098:168192:U32_MAX len 128 ver 0: durability: 0 crc: c_size 128 size 128 offset 0 nonce 0 csum crc32c 0:8040a368 compress none ec: idx 83 block 1 ptr: 0:302:128 gen 0 00255 bcachefs (0ca38fe8-0a26-41f9-9b5d-6a27796c7803): /fiotest offset 85983232: no device to read from: 00255 u64s 8 type extent 4098:168064:U32_MAX len 128 ver 0: durability: 0 crc: c_size 128 size 128 offset 0 nonce 0 csum crc32c 0:43311336 compress none ec: idx 83 block 1 ptr: 0:302:0 gen 0 00255 Modules linked in: 00255 CPU: 5 UID: 0 PID: 304 Comm: kworker/u70:2 Not tainted 6.14.0-rc6-ktest-g526aae23d67d #16040 00255 Hardware name: linux,dummy-virt (DT) 00255 Workqueue: events_unbound bch2_rbio_retry 00255 pstate: 60001005 (nZCv daif -PAN -UAO -TCO -DIT +SSBS BTYPE=--) 00255 pc : __bch2_trans_get+0x100/0x378 00255 lr : __bch2_trans_get+0xa0/0x378 00255 sp : ffffff80c865b760 00255 x29: ffffff80c865b760 x28: 0000000000000000 x27: ffffff80d76ed880 00255 x26: 0000000000000018 x25: 0000000000000000 x24: ffffff80f4ec3760 00255 x23: ffffff80f4010140 x22: 0000000000000056 x21: ffffff80f4ec0000 00255 x20: ffffff80f4ec3788 x19: ffffff80d75f8000 x18: 00000000ffffffff 00255 x17: 2065707974203820 x16: 7334367520200a3a x15: 0000000000000008 00255 x14: 0000000000000001 x13: 0000000000000100 x12: 0000000000000006 00255 x11: ffffffc080b47a40 x10: 0000000000000000 x9 : ffffffc08038dea8 00255 x8 : ffffff80d75fc018 x7 : 0000000000000000 x6 : 0000000000003788 00255 x5 : 0000000000003760 x4 : ffffff80c922de80 x3 : ffffff80f18f0000 00255 x2 : ffffff80c922de80 x1 : 0000000000000130 x0 : 0000000000000006 00255 Call trace: 00255 __bch2_trans_get+0x100/0x378 (P) 00255 bch2_read_io_err+0x98/0x260 00255 bch2_read_endio+0xb8/0x2d0 00255 __bch2_read_extent+0xce8/0xfe0 00255 __bch2_read+0x2a8/0x978 00255 bch2_rbio_retry+0x188/0x318 00255 process_one_work+0x154/0x390 00255 worker_thread+0x20c/0x3b8 00255 kthread+0xf0/0x1b0 00255 ret_from_fork+0x10/0x20 00255 Code: 6b01001f 54ffff01 79408460 3617fec0 (d4210000) 00255 ---[ end trace 0000000000000000 ]--- 00255 Kernel panic - not syncing: Oops - BUG: Fatal exception 00255 SMP: stopping secondary CPUs 00255 Kernel Offset: disabled 00255 CPU features: 0x000,00000070,00000010,8240500b 00255 Memory Limit: none 00255 ---[ end Kernel panic - not syncing: Oops - BUG: Fatal exception ]--- Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-11 11:21:44 -04:00
Roxana Nicolescu	58517f4df8	bcachefs: Initialize from_inode members for bch_io_opts When there is no inode source, all "from_inode" members in the structure bhc_io_opts should be set false. Fixes: `7a7c43a0c1` ("bcachefs: Add bch_io_opts fields for indicating whether the opts came from the inode") Reported-by: syzbot+c17ad4b4367b72a853cb@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=c17ad4b4367b72a853cb Signed-off-by: Roxana Nicolescu <nicolescu.roxana@protonmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-11 11:19:33 -04:00
Alan Huang	3a04334d62	bcachefs: Fix b->written overflow When bset past end of btree node, we should not add sectors to b->written, which will overflow b->written. Reported-by: syzbot+3cb3d9e8c3f197754825@syzkaller.appspotmail.com Tested-by: syzbot+3cb3d9e8c3f197754825@syzkaller.appspotmail.com Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-11 09:19:23 -04:00
Luis Chamberlain	a64e5a5960	bdev: add back PAGE_SIZE block size validation for sb_set_blocksize() The commit titled "block/bdev: lift block size restrictions to 64k" lifted the block layer's max supported block size to 64k inside the helper blk_validate_block_size() now that we support large folios. However in lifting the block size we also removed the silly use cases many filesystems have to use sb_set_blocksize() to verify that the block size <= PAGE_SIZE. The call to sb_set_blocksize() was used to check the block size <= PAGE_SIZE since historically we've always supported userspace to create for example 64k block size filesystems even on 4k page size systems, but what we didn't allow was mounting them. Older filesystems have been using the check with sb_set_blocksize() for years. While, we could argue that such checks should be filesystem specific, there are much more users of sb_set_blocksize() than LBS enabled filesystem on upstream, so just do the easier thing and bring back the PAGE_SIZE check for sb_set_blocksize() users and only skip it for LBS enabled filesystems. This will ensure that tests such as generic/466 when run in a loop against say, ext4, won't try to try to actually mount a filesystem with a block size larger than your filesystem supports given your PAGE_SIZE and in the worst case crash. Cc: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Link: https://lore.kernel.org/r/20250307020403.3068567-1-mcgrof@kernel.org Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-03-07 12:56:05 +01:00
Kent Overstreet	8ba73f53dc	bcachefs: copygc now skips non-rw devices There's no point in doing copygc on non-rw devices: the fragmentation doesn't matter if we're not writing to them, and we may not have anywhere to put the data on our other devices. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-06 18:15:01 -05:00
Kent Overstreet	33255c161a	bcachefs: Fix bch2_dev_journal_alloc() spuriously failing Previously, we fixed journal resize spuriousl failing with -BCH_ERR_open_buckets_empty, but initial journal allocation was missed because it didn't invoke the "block on allocator" loop at all. Factor out the "loop on allocator" code to fix that. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-03-06 18:15:01 -05:00
Kent Overstreet	4a4f9b5c7c	bcachefs: Don't set BCH_FEATURE_incompat_version_field unless requested We shouldn't be setting incompatible bits or the incompatible version field unless explicitly request or allowed - otherwise we break mounting with old kernels or userspace. Reported-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-02-28 19:07:33 -05:00
NeilBrown	88d5baf690	Change inode_operations.mkdir to return struct dentry * Some filesystems, such as NFS, cifs, ceph, and fuse, do not have complete control of sequencing on the actual filesystem (e.g. on a different server) and may find that the inode created for a mkdir request already exists in the icache and dcache by the time the mkdir request returns. For example, if the filesystem is mounted twice the directory could be visible on the other mount before it is on the original mount, and a pair of name_to_handle_at(), open_by_handle_at() calls could instantiate the directory inode with an IS_ROOT() dentry before the first mkdir returns. This means that the dentry passed to ->mkdir() may not be the one that is associated with the inode after the ->mkdir() completes. Some callers need to interact with the inode after the ->mkdir completes and they currently need to perform a lookup in the (rare) case that the dentry is no longer hashed. This lookup-after-mkdir requires that the directory remains locked to avoid races. Planned future patches to lock the dentry rather than the directory will mean that this lookup cannot be performed atomically with the mkdir. To remove this barrier, this patch changes ->mkdir to return the resulting dentry if it is different from the one passed in. Possible returns are: NULL - the directory was created and no other dentry was used ERR_PTR() - an error occurred non-NULL - this other dentry was spliced in This patch only changes file-systems to return "ERR_PTR(err)" instead of "err" or equivalent transformations. Subsequent patches will make further changes to some file-systems to return a correct dentry. Not all filesystems reliably result in a positive hashed dentry: - NFS, cifs, hostfs will sometimes need to perform a lookup of the name to get inode information. Races could result in this returning something different. Note that this lookup is non-atomic which is what we are trying to avoid. Placing the lookup in filesystem code means it only happens when the filesystem has no other option. - kernfs and tracefs leave the dentry negative and the ->revalidate operation ensures that lookup will be called to correctly populate the dentry. This could be fixed but I don't think it is important to any of the users of vfs_mkdir() which look at the dentry. The recommendation to use d_drop();d_splice_alias() is ugly but fits with current practice. A planned future patch will change this. Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: NeilBrown <neilb@suse.de> Link: https://lore.kernel.org/r/20250227013949.536172-2-neilb@suse.de Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-02-27 20:00:17 +01:00
Christian Brauner	71628584df	Merge patch series "prep patches for my mkdir series" NeilBrown <neilb@suse.de> says: These two patches are cleanup are dependencies for my mkdir changes and subsequence directory locking changes. * patches from https://lore.kernel.org/r/20250226062135.2043651-1-neilb@suse.de: (2 commits) nfsd: drop fh_update() from S_IFDIR branch of nfsd_create_locked() nfs/vfs: discard d_exact_alias() Link: https://lore.kernel.org/r/20250226062135.2043651-1-neilb@suse.de Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-02-27 09:25:34 +01:00
Kent Overstreet	eb54d2695b	bcachefs: Fix truncate sometimes failing and returning 1 __bch_truncate_folio() may return 1 to indicate dirtyness of the folio being truncated, needed for fpunch to get the i_size writes correct. But truncate was forgetting to clear ret, and sometimes returning it as an error. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-02-26 19:31:05 -05:00
Alan Huang	677bdb7346	bcachefs: Fix deadlock This fixes two deadlocks: 1.pcpu_alloc_mutex involved one as pointed by syzbot[1] 2.recursion deadlock. The root cause is that we hold the bc lock during alloc_percpu, fix it by following the pattern used by __btree_node_mem_alloc(). [1] https://lore.kernel.org/all/66f97d9a.050a0220.6bad9.001d.GAE@google.com/T/ Reported-by: syzbot+fe63f377148a6371a9db@syzkaller.appspotmail.com Tested-by: syzbot+fe63f377148a6371a9db@syzkaller.appspotmail.com Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-02-26 19:31:05 -05:00
Kent Overstreet	7909d1fb90	bcachefs: Check for -BCH_ERR_open_buckets_empty in journal resize This fixes occasional failures from journal resize. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-02-26 19:31:05 -05:00
Kent Overstreet	4804f3ac26	bcachefs: Revert directory i_size This turned out to have several bugs, which were missed because the fsck code wasn't properly reporting errors - whoops. Kicking it out for now, hopefully it can make 6.15. Cc: Hongbo Li <lihongbo22@huawei.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-02-26 19:30:38 -05:00
Kent Overstreet	cf3e696026	bcachefs: fix bch2_extent_ptr_eq() Reviewed-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-02-23 23:35:33 -05:00
Alan Huang	c522093b02	bcachefs: Fix memmove when move keys down The fix alone doesn't fix [1], but should be applied before debugging that. [1] https://syzkaller.appspot.com/bug?extid=38a0cbd267eff2d286ff Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-02-20 16:40:34 -05:00
Kent Overstreet	68aaa63716	bcachefs: print op->nonce on data update inconsistency "nonce inconstancy" is popping up again, causing us to go emergency read-only. This one looks less serious, i.e. specific to the encryption path and not indicative of a data corruption bug. But we'll need more info to track it down. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-02-20 16:39:28 -05:00
Kent Overstreet	b04974f759	bcachefs: Fix srcu lock warning in btree_update_nodes_written() We don't want to be holding the srcu lock while waiting on btree write completions - easily fixed. Reported-by: Janpieter Sollie <janpieter.sollie@edpnet.be> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-02-19 18:52:42 -05:00
Kent Overstreet	4fd509c10f	bcachefs: Fix bch2_indirect_extent_missing_error() We had some error handling confusion here; -BCH_ERR_missing_indirect_extent is thrown by trans_trigger_reflink_p_segment(); at this point we haven't decide whether we're generating an error. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-02-19 17:33:13 -05:00
Kent Overstreet	b9ddb3e1a8	bcachefs: Fix fsck directory i_size checking Error handling was wrong, causing unhandled transaction restart errors. check_directory_size() was also inefficient, since keys in multiple snapshots would be iterated over once for every snapshot. Convert it to the same scheme used for i_sectors and subdir count checking. Cc: Hongbo Li <lihongbo22@huawei.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-02-19 13:52:27 -05:00
NeilBrown	1c3cb50b58	VFS: change kern_path_locked() and user_path_locked_at() to never return negative dentry No callers of kern_path_locked() or user_path_locked_at() want a negative dentry. So change them to return -ENOENT instead. This simplifies callers. This results in a subtle change to bcachefs in that an ioctl will now return -ENOENT in preference to -EXDEV. I believe this restores the behaviour to what it was prior to Commit `bbe6a7c899` ("bch2_ioctl_subvolume_destroy(): fix locking") Signed-off-by: NeilBrown <neilb@suse.de> Link: https://lore.kernel.org/r/20250217003020.3170652-2-neilb@suse.de Acked-by: Paul Moore <paul@paul-moore.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-02-19 14:08:41 +01:00
Alan Huang	406e445b3c	bcachefs: Reuse transaction bch2_nocow_write_convert_unwritten is already in transaction context: 00191 ========= TEST generic/648 00242 kernel BUG at fs/bcachefs/btree_iter.c:3332! 00242 Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP 00242 Modules linked in: 00242 CPU: 4 UID: 0 PID: 2593 Comm: fsstress Not tainted 6.13.0-rc3-ktest-g345af8f855b7 #14403 00242 Hardware name: linux,dummy-virt (DT) 00242 pstate: 60001005 (nZCv daif -PAN -UAO -TCO -DIT +SSBS BTYPE=--) 00242 pc : __bch2_trans_get+0x120/0x410 00242 lr : __bch2_trans_get+0xcc/0x410 00242 sp : ffffff80d89af600 00242 x29: ffffff80d89af600 x28: ffffff80ddb23000 x27: 00000000fffff705 00242 x26: ffffff80ddb23028 x25: ffffff80d8903fe0 x24: ffffff80ebb30168 00242 x23: ffffff80c8aeb500 x22: 000000000000005d x21: ffffff80d8904078 00242 x20: ffffff80d8900000 x19: ffffff80da9e8000 x18: 0000000000000000 00242 x17: 64747568735f6c61 x16: 6e72756f6a20726f x15: 0000000000000028 00242 x14: 0000000000000004 x13: 000000000000f787 x12: ffffffc081bbcdc8 00242 x11: 0000000000000000 x10: 0000000000000003 x9 : ffffffc08094efbc 00242 x8 : 000000001092c111 x7 : 000000000000000c x6 : ffffffc083c31fc4 00242 x5 : ffffffc083c31f28 x4 : ffffff80c8aeb500 x3 : ffffff80ebb30000 00242 x2 : 0000000000000001 x1 : 0000000000000a21 x0 : 000000000000028e 00242 Call trace: 00242 __bch2_trans_get+0x120/0x410 (P) 00242 bch2_inum_offset_err_msg+0x48/0xb0 00242 bch2_nocow_write_convert_unwritten+0x3d0/0x530 00242 bch2_nocow_write+0xeb0/0x1000 00242 __bch2_write+0x330/0x4e8 00242 bch2_write+0x1f0/0x530 00242 bch2_direct_write+0x530/0xc00 00242 bch2_write_iter+0x160/0xbe0 00242 vfs_write+0x1cc/0x360 00242 ksys_write+0x5c/0xf0 00242 __arm64_sys_write+0x20/0x30 00242 invoke_syscall.constprop.0+0x54/0xe8 00242 do_el0_svc+0x44/0xc0 00242 el0_svc+0x34/0xa0 00242 el0t_64_sync_handler+0x104/0x130 00242 el0t_64_sync+0x154/0x158 00242 Code: 6b01001f 54ffff01 79408460 3617fec0 (d4210000) 00242 ---[ end trace 0000000000000000 ]--- 00242 Kernel panic - not syncing: Oops - BUG: Fatal exception Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-02-12 18:44:50 -05:00
Alan Huang	531323a2ef	bcachefs: Pass _orig_restart_count to trans_was_restarted _orig_restart_count is unused now, according to the logic, trans_was_restarted should be using _orig_restart_count. Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-02-12 18:40:19 -05:00
Kent Overstreet	9cf6b84b71	bcachefs: CONFIG_BCACHEFS_INJECT_TRANSACTION_RESTARTS Incorrectly handled transaction restarts can be a source of heisenbugs; add a mode where we randomly inject them to shake them out. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-02-12 18:40:19 -05:00
Kent Overstreet	9f734cd076	bcachefs: Fix want_new_bset() so we write until the end of the btree node want_new_bset() returns the address of a new bset to initialize if we wish to do so in a btree node - either because the previous one is too big, or because it's been written. The case for 'previous bset was written' was wrong: it's only supposed to check for if we have space in the node for one more block, but because it subtracted the header from the space available it would never initialize a new bset if we were down to the last block in a node. Fixing this results in fewer btree node splits/compactions, which fixes a bug with flushing the journal to go read-only sometimes not terminating or taking excessively long. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-02-11 10:10:32 -05:00
Kent Overstreet	1e690efa72	bcachefs: Split out journal pins by btree level This lets us flush the journal to go read-only more effectively. Flushing the journal and going read-only requires halting mutually recursive processes, which strictly speaking are not guaranteed to terminate. Flushing btree node journal pins will kick off a btree node write, and btree node writes on completion must do another btree update to the parent node to update the 'sectors_written' field for that node's key. If the parent node is full and requires a split or compaction, that's going to generate a whole bunch of additional btree updates - alloc info, LRU btree, and more - which then have to be flushed, and the cycle repeats. This process will terminate much more effectively if we tweak journal reclaim to flush btree updates leaf to root: i.e., don't flush updates for a given btree node (kicking off a write, and consuming space within that node up to the next block boundary) if there might still be unflushed updates in child nodes. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-02-11 10:10:32 -05:00
Alan Huang	1c316eb57c	bcachefs: Fix use after free acc->k.data should be used with the lock hold: 00221 ========= TEST generic/187 00221 run fstests generic/187 at 2025-02-09 21:08:10 00221 spectre-v4 mitigation disabled by command-line option 00222 bcachefs (vdc): starting version 1.20: directory_size opts=errors=ro 00222 bcachefs (vdc): initializing new filesystem 00222 bcachefs (vdc): going read-write 00222 bcachefs (vdc): marking superblocks 00222 bcachefs (vdc): initializing freespace 00222 bcachefs (vdc): done initializing freespace 00222 bcachefs (vdc): reading snapshots table 00222 bcachefs (vdc): reading snapshots done 00222 bcachefs (vdc): done starting filesystem 00222 bcachefs (vdc): shutting down 00222 bcachefs (vdc): going read-only 00222 bcachefs (vdc): finished waiting for writes to stop 00223 bcachefs (vdc): flushing journal and stopping allocators, journal seq 6 00223 bcachefs (vdc): flushing journal and stopping allocators complete, journal seq 8 00223 bcachefs (vdc): clean shutdown complete, journal seq 9 00223 bcachefs (vdc): marking filesystem clean 00223 bcachefs (vdc): shutdown complete 00223 bcachefs (vdc): starting version 1.20: directory_size opts=errors=ro 00223 bcachefs (vdc): initializing new filesystem 00223 bcachefs (vdc): going read-write 00223 bcachefs (vdc): marking superblocks 00223 bcachefs (vdc): initializing freespace 00223 bcachefs (vdc): done initializing freespace 00223 bcachefs (vdc): reading snapshots table 00223 bcachefs (vdc): reading snapshots done 00223 bcachefs (vdc): done starting filesystem 00244 hrtimer: interrupt took 123350440 ns 00264 bcachefs (vdc): shutting down 00264 bcachefs (vdc): going read-only 00264 bcachefs (vdc): finished waiting for writes to stop 00264 bcachefs (vdc): flushing journal and stopping allocators, journal seq 97 00265 bcachefs (vdc): flushing journal and stopping allocators complete, journal seq 101 00265 bcachefs (vdc): clean shutdown complete, journal seq 102 00265 bcachefs (vdc): marking filesystem clean 00265 bcachefs (vdc): shutdown complete 00265 bcachefs (vdc): starting version 1.20: directory_size opts=errors=ro 00265 bcachefs (vdc): recovering from clean shutdown, journal seq 102 00265 bcachefs (vdc): accounting_read... 00265 ================================================================== 00265 done 00265 BUG: KASAN: slab-use-after-free in bch2_fs_to_text+0x12b4/0x1728 00265 bcachefs (vdc): alloc_read... done 00265 bcachefs (vdc): stripes_read... done 00265 Read of size 4 at addr ffffff80c57eac00 by task cat/7531 00265 bcachefs (vdc): snapshots_read... done 00265 00265 CPU: 6 UID: 0 PID: 7531 Comm: cat Not tainted 6.13.0-rc3-ktest-g16fc6fa3819d #14103 00265 Hardware name: linux,dummy-virt (DT) 00265 Call trace: 00265 show_stack+0x1c/0x30 (C) 00265 dump_stack_lvl+0x6c/0x80 00265 print_report+0xf8/0x5d8 00265 kasan_report+0x90/0xd0 00265 __asan_report_load4_noabort+0x1c/0x28 00265 bch2_fs_to_text+0x12b4/0x1728 00265 bch2_fs_show+0x94/0x188 00265 sysfs_kf_seq_show+0x1a4/0x348 00265 kernfs_seq_show+0x12c/0x198 00265 seq_read_iter+0x27c/0xfd0 00265 kernfs_fop_read_iter+0x390/0x4f8 00265 vfs_read+0x480/0x7f0 00265 ksys_read+0xe0/0x1e8 00265 __arm64_sys_read+0x70/0xa8 00265 invoke_syscall.constprop.0+0x74/0x1e8 00265 do_el0_svc+0xc8/0x1c8 00265 el0_svc+0x20/0x60 00265 el0t_64_sync_handler+0x104/0x130 00265 el0t_64_sync+0x154/0x158 00265 00265 Allocated by task 7510: 00265 kasan_save_stack+0x28/0x50 00265 kasan_save_track+0x1c/0x38 00265 kasan_save_alloc_info+0x3c/0x50 00265 __kasan_kmalloc+0xac/0xb0 00265 __kmalloc_node_noprof+0x168/0x348 00265 __kvmalloc_node_noprof+0x20/0x140 00265 __bch2_darray_resize_noprof+0x90/0x1b0 00265 __bch2_accounting_mem_insert+0x76c/0xb08 00265 bch2_accounting_mem_insert+0x224/0x3b8 00265 bch2_accounting_mem_mod_locked+0x480/0xc58 00265 bch2_accounting_read+0xa94/0x3eb8 00265 bch2_run_recovery_pass+0x80/0x178 00265 bch2_run_recovery_passes+0x340/0x698 00265 bch2_fs_recovery+0x1c98/0x2bd8 00265 bch2_fs_start+0x240/0x490 00265 bch2_fs_get_tree+0xe1c/0x1458 00265 vfs_get_tree+0x7c/0x250 00265 path_mount+0xe24/0x1648 00265 __arm64_sys_mount+0x240/0x438 00265 invoke_syscall.constprop.0+0x74/0x1e8 00265 do_el0_svc+0xc8/0x1c8 00265 el0_svc+0x20/0x60 00265 el0t_64_sync_handler+0x104/0x130 00265 el0t_64_sync+0x154/0x158 00265 00265 Freed by task 7510: 00265 kasan_save_stack+0x28/0x50 00265 kasan_save_track+0x1c/0x38 00265 kasan_save_free_info+0x48/0x88 00265 __kasan_slab_free+0x48/0x60 00265 kfree+0x188/0x408 00265 kvfree+0x3c/0x50 00265 __bch2_darray_resize_noprof+0xe0/0x1b0 00265 __bch2_accounting_mem_insert+0x76c/0xb08 00265 bch2_accounting_mem_insert+0x224/0x3b8 00265 bch2_accounting_mem_mod_locked+0x480/0xc58 00265 bch2_accounting_read+0xa94/0x3eb8 00265 bch2_run_recovery_pass+0x80/0x178 00265 bch2_run_recovery_passes+0x340/0x698 00265 bch2_fs_recovery+0x1c98/0x2bd8 00265 bch2_fs_start+0x240/0x490 00265 bch2_fs_get_tree+0xe1c/0x1458 00265 vfs_get_tree+0x7c/0x250 00265 path_mount+0xe24/0x1648 00265 bcachefs (vdc): going read-write 00265 __arm64_sys_mount+0x240/0x438 00265 invoke_syscall.constprop.0+0x74/0x1e8 00265 do_el0_svc+0xc8/0x1c8 00265 el0_svc+0x20/0x60 00265 el0t_64_sync_handler+0x104/0x130 00265 el0t_64_sync+0x154/0x158 00265 00265 The buggy address belongs to the object at ffffff80c57eac00 00265 which belongs to the cache kmalloc-128 of size 128 00265 The buggy address is located 0 bytes inside of 00265 freed 128-byte region [ffffff80c57eac00, ffffff80c57eac80) 00265 00265 The buggy address belongs to the physical page: 00265 page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1057ea 00265 head: order:1 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0 00265 flags: 0x8000000000000040(head\|zone=2) 00265 page_type: f5(slab) 00265 raw: 8000000000000040 ffffff80c0002800 dead000000000100 dead000000000122 00265 raw: 0000000000000000 0000000000200020 00000001f5000000 ffffff80c57a6400 00265 head: 8000000000000040 ffffff80c0002800 dead000000000100 dead000000000122 00265 head: 0000000000000000 0000000000200020 00000001f5000000 ffffff80c57a6400 00265 head: 8000000000000001 fffffffec315fa81 ffffffffffffffff 0000000000000000 00265 head: 0000000000000002 0000000000000000 00000000ffffffff 0000000000000000 00265 page dumped because: kasan: bad access detected 00265 00265 Memory state around the buggy address: 00265 ffffff80c57eab00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00265 ffffff80c57eab80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc 00265 >ffffff80c57eac00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb 00265 ^ 00265 ffffff80c57eac80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc 00265 ffffff80c57ead00: 00 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc 00265 ================================================================== 00265 Kernel panic - not syncing: kasan.fault=panic set ... 00265 CPU: 6 UID: 0 PID: 7531 Comm: cat Not tainted 6.13.0-rc3-ktest-g16fc6fa3819d #14103 00265 Hardware name: linux,dummy-virt (DT) 00265 Call trace: 00265 show_stack+0x1c/0x30 (C) 00265 dump_stack_lvl+0x30/0x80 00265 dump_stack+0x18/0x20 00265 panic+0x4d4/0x518 00265 start_report.constprop.0+0x0/0x90 00265 kasan_report+0xa0/0xd0 00265 __asan_report_load4_noabort+0x1c/0x28 00265 bch2_fs_to_text+0x12b4/0x1728 00265 bch2_fs_show+0x94/0x188 00265 sysfs_kf_seq_show+0x1a4/0x348 00265 kernfs_seq_show+0x12c/0x198 00265 seq_read_iter+0x27c/0xfd0 00265 kernfs_fop_read_iter+0x390/0x4f8 00265 vfs_read+0x480/0x7f0 00265 ksys_read+0xe0/0x1e8 00265 __arm64_sys_read+0x70/0xa8 00265 invoke_syscall.constprop.0+0x74/0x1e8 00265 do_el0_svc+0xc8/0x1c8 00265 el0_svc+0x20/0x60 00265 el0t_64_sync_handler+0x104/0x130 00265 el0t_64_sync+0x154/0x158 00265 SMP: stopping secondary CPUs 00265 Kernel Offset: disabled 00265 CPU features: 0x000,00000070,00000010,8240500b 00265 Memory Limit: none 00265 ---[ end Kernel panic - not syncing: kasan.fault=panic set ... ]--- 00270 ========= FAILED TIMEOUT generic.187 in 1200s Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-02-11 10:10:32 -05:00
Kent Overstreet	595170d4b6	bcachefs: Fix marking reflink pointers to missing indirect extents reflink pointers to missing indirect extents aren't deleted, they just have an error bit set - in case the indirect extent somehow reappears. fsck/mark and sweep thus needs to ignore these errors. Also, they can be marked AUTOFIX now. Reported-by: Roland Vet <vet.roland@protonmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-02-07 14:49:47 -05:00
Kent Overstreet	4be214c269	bcachefs: bch2_bkey_sectors_need_rebalance() now only depends on bch_extent_rebalance Previously, bch2_bkey_sectors_need_rebalance() called bch2_target_accepts_data(), checking whether the target is writable. However, this means that adding or removing devices from a target would change the value of bch2_bkey_sectors_need_rebalance() for an existing extent; this needs to be invariant so that the extent trigger can correctly maintain rebalance_work accounting. Instead, check target_accepts_data() in io_opts_to_rebalance_opts(), before creating the bch_extent_rebalance entry. This fixes (one?) cause of rebalance_work accounting being off. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-02-06 22:35:11 -05:00
Kent Overstreet	3539880ef1	bcachefs: Fix rcu imbalance in bch2_fs_btree_key_cache_exit() Spotted by sparse. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-02-06 22:35:11 -05:00
Kent Overstreet	9e9033522a	bcachefs: Fix discard path journal flushing The discard path is supposed to issue journal flushes when there's too many buckets empty buckets that need a journal commit before they can be written to again, but at some point this code seems to have been lost. Bring it back with a new optimization to make sure we don't issue too many journal flushes: the journal now tracks the sequence number of the most recent flush in progress, which the discard path uses when deciding which buckets need a journal flush. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-02-06 22:35:11 -05:00
Jeongjun Park	2ef995df0c	bcachefs: fix deadlock in journal_entry_open() In the previous commit `b3d82c2f27`, code was added to prevent journal sequence overflow. Among them, the code added to journal_entry_open() uses the bch2_fs_fatal_err_on() function to handle errors. However, __journal_res_get() , which calls journal_entry_open() , calls journal_entry_open() while holding journal->lock , but bch2_fs_fatal_err_on() internally tries to acquire journal->lock , which results in a deadlock. So we need to add a locked helper to handle fatal errors even when the journal->lock is held. Fixes: `b3d82c2f27` ("bcachefs: Guard against journal seq overflow") Signed-off-by: Jeongjun Park <aha310510@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-02-06 22:35:11 -05:00
Jeongjun Park	6b37037d6d	bcachefs: fix incorrect pointer check in __bch2_subvolume_delete() For some unknown reason, checks on struct bkey_s_c_snapshot and struct bkey_s_c_snapshot_tree pointers are missing. Therefore, I think it would be appropriate to fix the incorrect pointer checking through this patch. Fixes: `4bd06f07bc` ("bcachefs: Fixes for snapshot_tree.master_subvol") Signed-off-by: Jeongjun Park <aha310510@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-02-06 22:35:11 -05:00
Linus Torvalds	a86bf2283d	assorted stuff for this merge window -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZ5yJdgAKCRBZ7Krx/gZQ 69W4AQDwgxceiQ6icx3rFhCWQigne4jdMO84kd8tNaa+xHGe1AD/WnkeChc5DqjQ wZWZxAAzml9SS01IcSiHWaF5fgrjlA0= =rXOq -----END PGP SIGNATURE----- Merge tag 'pull-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull misc vfs cleanups from Al Viro: "Two unrelated patches - one is a removal of long-obsolete include in overlayfs (it used to need fs/internal.h, but the extern it wanted has been moved back to include/linux/namei.h) and another introduces convenience helper constructing struct qstr by a NUL-terminated string" * tag 'pull-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: add a string-to-qstr constructor fs/overlayfs/namei.c: get rid of include ../internal.h	2025-02-01 15:07:56 -08:00
Linus Torvalds	8080ff5ac6	bcachefs fixes for 6.14-rc1 - second half of a fix for a bug that'd been causing oopses on filesystems using snapshots with memory pressure (key cache fills for snaphots btrees are tricky) - build fix for strange compiler configurations that double stack frame size - "journal stuck timeout" now takes into account device latency: this fixes some spurious warnings, and the main remaining source of SRCU lock hold time warnings (I'm no longer seeing this in my CI, so any users still seeing this should definitely ping me) - fix for slow/hanging unmounts (" Improve journal pin flushing") - some more tracepoint fixes/improvements, to chase down the "rebalance isn't making progress" issues -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmeajBsACgkQE6szbY3K bna2BA/9E/+WBDFQHkLJ4kNQBxKL4u1xfav5kKGZ79mUlqruhr3AckLPFzWmQO21 eOJE0NeyvpsvLewDXMGZ8w/Nm3Vdc53X6ATKkQaF/UoTYVWbubmF62sXBzSS8TUh YIM6s24/CbCi8lT49JAuIaG3OC21KH0X0zOcvyepfmn1aiPNLr4y7zOWKynOhgCt mAt374ayUDDTgoQmXqPIrGp8eD/C+vUjo1ief+DIGMQmDj4uHpb5iBbmjXm8FF9x 4TcrP1UjWpiWPcHeb98H/CBWOnjDSgOFhYmxhVOvkDpC6XbtPSgKQIOs7tSJ0Nuo IOzrGuBPVfd2m+wgXsn7zbn0HNOjS76sCo92K1lAdS86k0eqRfXmCxkU6FUphNkA WCG8WrK0RjHL132iR97dtv36No8ji5mZN1ILPk/h4KRkoKC+9fA8BaJdAGVt+6NP wZLtZxZkV8BqgXF41HwzHt54YftRPn2kR47Jfu1rPimSUd4Uqy8Yjw2J/fUT7eAd 6JdfiadhAtMRWnFGzmVs4LsEWJ7Ja7GnG7jhjzlACsqbXsTU8k16Wq38IchC6mi+ p+hqq9pLAeosW9Lk/QTGFrq52aQfyOzdUjq1pyCcEYtZFNqjj8GmmHVejxZWiRTo C6dTEkSIMcBx+9QP8BJ5o+xMR02KABn+8x43TzQxQ2DXj0QamTA= =QJHL -----END PGP SIGNATURE----- Merge tag 'bcachefs-2025-01-29' of git://evilpiepirate.org/bcachefs Pull bcachefs fixes from Kent Overstreet: - second half of a fix for a bug that'd been causing oopses on filesystems using snapshots with memory pressure (key cache fills for snaphots btrees are tricky) - build fix for strange compiler configurations that double stack frame size - "journal stuck timeout" now takes into account device latency: this fixes some spurious warnings, and the main remaining source of SRCU lock hold time warnings (I'm no longer seeing this in my CI, so any users still seeing this should definitely ping me) - fix for slow/hanging unmounts (" Improve journal pin flushing") - some more tracepoint fixes/improvements, to chase down the "rebalance isn't making progress" issues * tag 'bcachefs-2025-01-29' of git://evilpiepirate.org/bcachefs: bcachefs: Improve trace_move_extent_finish bcachefs: Fix trace_copygc bcachefs: Journal writes are now IOPRIO_CLASS_RT bcachefs: Improve journal pin flushing bcachefs: fix bch2_btree_node_flags bcachefs: rebalance, copygc enabled are runtime opts bcachefs: Improve decompression error messages bcachefs: bset_blacklisted_journal_seq is now AUTOFIX bcachefs: "Journal stuck" timeout now takes into account device latency bcachefs: Reduce stack frame size of __bch2_str_hash_check_key() bcachefs: Fix btree_trans_peek_key_cache()	2025-01-30 08:42:50 -08:00
Al Viro	c1feab95e0	add a string-to-qstr constructor Quite a few places want to build a struct qstr by given string; it would be convenient to have a primitive doing that, rather than open-coding it via QSTR_INIT(). The closest approximation was in bcachefs, but that expands to initializer list - {.len = strlen(string), .name = string}. It would be more useful to have it as compound literal - (struct qstr){.len = strlen(string), .name = string}. Unlike initializer list it's a valid expression. What's more, it's a valid lvalue - it's an equivalent of anonymous local variable with such initializer, so the things like path->dentry = d_alloc_pseudo(mnt->mnt_sb, &QSTR(name)); are valid. It can also be used as initializer, with identical effect - struct qstr x = (struct qstr){.name = s, .len = strlen(s)}; is equivalent to struct qstr anon_variable = {.name = s, .len = strlen(s)}; struct qstr x = anon_variable; // anon_variable is never used after that point and any even remotely sane compiler will manage to collapse that into struct qstr x = {.name = s, .len = strlen(s)}; What compound literals can't be used for is initialization of global variables, but those are covered by QSTR_INIT(). This commit lifts definition(s) of QSTR() into linux/dcache.h, converts it to compound literal (all bcachefs users are fine with that) and converts assorted open-coded instances to using that. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-01-27 19:25:45 -05:00
Kent Overstreet	5d9ccda9ba	bcachefs: Improve trace_move_extent_finish We're currently debugging issues with rebalance, where it's not making progress as quickly as it should be (or sometimes not at all). Add the full data_update to the move_extent_finish tracepoint, so we can check that the replicas we wrote match what we were supposed to do. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-26 23:02:28 -05:00
Kent Overstreet	0e458a616f	bcachefs: Fix trace_copygc Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-26 23:02:28 -05:00
Kent Overstreet	75474a54ed	bcachefs: Journal writes are now IOPRIO_CLASS_RT System performance is particularly sensitive to journal write latency, the number of outstanding journal writes is bounded and we can't issue journal flushes until other journal writes have completed. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-26 23:02:28 -05:00
Kent Overstreet	35f5197009	bcachefs: Improve journal pin flushing Running the preempt tiering tests with a lower than normal journal reclaim delay turned up a shutdown hang - a lost wakeup, caused because flushing a journal pin (e.g. key cache/write buffer) can generate a new journal pin. The "simple" fix of adding the correct wakeup didn't work because of ordering issues; if we flush btree node pins too aggressively before other pins have completed, we end up spinning where each flush iteration generates new work. So to fix this correctly: - The list of flushed journal pins is now broken out by type, so that we can wait for key cache/write buffer pin flushing to complete before flushing dirty btree nodes - A new closure_waitlist is added for bch2_journal_flush_pins; this one is only used under or when we're taking the journal lock, so it's pretty cheap to add rigorously correct wakeups to journal_pin_set() and journal_pin_drop(). Additionally, bch2_journal_seq_pins_to_text() is moved to journal_reclaim.c, where it belongs, along with a bit of other small renaming and refactoring. Besides fixing the hang, the better ordering between key cache/write buffer flushing and btree node flushing should help or fix the "unmount taking excessively long" a few users have been noticing. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-25 19:37:43 -05:00
Kent Overstreet	0c74c85bbe	bcachefs: fix bch2_btree_node_flags Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-25 19:33:19 -05:00
Kent Overstreet	37fd6b8176	bcachefs: rebalance, copygc enabled are runtime opts Fix a regression from when these were switched to normal opts.h options. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-25 19:33:19 -05:00
Kent Overstreet	2efbc3518f	bcachefs: Improve decompression error messages Ratelimit them, and use the new bch2_write_op_error() helper that prints path and file offset. Reported-by: https://github.com/koverstreet/bcachefs/issues/819 Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-25 14:43:13 -05:00
Linus Torvalds	37b33c68b0	CRC updates for 6.14 - Reorganize the architecture-optimized CRC32 and CRC-T10DIF code to be directly accessible via the library API, instead of requiring the crypto API. This is much simpler and more efficient. - Convert some users such as ext4 to use the CRC32 library API instead of the crypto API. More conversions like this will come later. - Add a KUnit test that tests and benchmarks multiple CRC variants. Remove older, less-comprehensive tests that are made redundant by this. - Add an entry to MAINTAINERS for the kernel's CRC library code. I'm volunteering to maintain it. I have additional cleanups and optimizations planned for future cycles. These patches have been in linux-next since -rc1. -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCZ418ZRQcZWJpZ2dlcnNA Z29vZ2xlLmNvbQAKCRDzXCl4vpKOKyJYAP9kBlpm8W9/XY6N8SpjKaXE/vKQYHQl Nobhak06Us8uJwEAkcUTymWP4IwQj5A9jgBAPRw53FQcNVKIc+01C7gRHw0= =mqSH -----END PGP SIGNATURE----- Merge tag 'crc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux Pull CRC updates from Eric Biggers: - Reorganize the architecture-optimized CRC32 and CRC-T10DIF code to be directly accessible via the library API, instead of requiring the crypto API. This is much simpler and more efficient. - Convert some users such as ext4 to use the CRC32 library API instead of the crypto API. More conversions like this will come later. - Add a KUnit test that tests and benchmarks multiple CRC variants. Remove older, less-comprehensive tests that are made redundant by this. - Add an entry to MAINTAINERS for the kernel's CRC library code. I'm volunteering to maintain it. I have additional cleanups and optimizations planned for future cycles. * tag 'crc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux: (31 commits) MAINTAINERS: add entry for CRC library powerpc/crc: delete obsolete crc-vpmsum_test.c lib/crc32test: delete obsolete crc32test.c lib/crc16_kunit: delete obsolete crc16_kunit.c lib/crc_kunit.c: add KUnit test suite for CRC library functions powerpc/crc-t10dif: expose CRC-T10DIF function through lib arm64/crc-t10dif: expose CRC-T10DIF function through lib arm/crc-t10dif: expose CRC-T10DIF function through lib x86/crc-t10dif: expose CRC-T10DIF function through lib crypto: crct10dif - expose arch-optimized lib function lib/crc-t10dif: add support for arch overrides lib/crc-t10dif: stop wrapping the crypto API scsi: target: iscsi: switch to using the crc32c library f2fs: switch to using the crc32 library jbd2: switch to using the crc32c library ext4: switch to using the crc32c library lib/crc32: make crc32c() go directly to lib bcachefs: Explicitly select CRYPTO from BCACHEFS_FS x86/crc32: expose CRC32 functions through lib x86/crc32: update prototype for crc32_pclmul_le_16() ...	2025-01-22 19:55:08 -08:00
Kent Overstreet	c9c8a17f7a	bcachefs: bset_blacklisted_journal_seq is now AUTOFIX Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-21 23:05:32 -05:00
Kent Overstreet	2c5d8a8347	bcachefs: "Journal stuck" timeout now takes into account device latency If a block device (e.g. your typical consumer SSD) is taking multiple seconds for IOs (typically flushes), we don't want to emit the "journal stuck" message prematurely. Also, make sure to drop the btree_trans srcu lock if we're blocking for more than a second. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-21 18:32:05 -05:00
Kent Overstreet	f917016f69	bcachefs: Reduce stack frame size of __bch2_str_hash_check_key() We don't need all the helpers inlined here. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-21 12:57:48 -05:00
Kent Overstreet	a858175227	bcachefs: Fix btree_trans_peek_key_cache() BTREE_ITER_cached_nofill has some tricky corner cases; it's used internally for iterators that aren't walking the key cache, but need to be coherent with the key cache. It tells traverse to look up and lock the key cache entry if present, but don't create one if it doesn't exist. That means we have to have a BTREE_ITER_UPTODATE path (because after traverse the path has to be UPTODATE, or we pop assertions) that doesn't point to anything (which is the less bad option, taken by the previous fix). The previous fix for this path missed an issue that can happen in bch2_trans_peek_key_cache(): we can't set should_be_locked on a path that doesn't point to anything and doesn't hold locks. Fixes: `bd5b09727f` ("bcachefs: Don't set btree_path to updtodate if we don't fill") Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-21 12:26:25 -05:00
Linus Torvalds	1cbfb828e0	for-6.14/block-20250118 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmeL6hoQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgppw2EADQV8nDgLRggZR+il4U03yKHXcQEdAX1GrB Erowx+dasIJuh6kp3n6qRe9QD/pRqt1DKyLvXoWF8Qfuwq85j7oDnDDYxutNYT27 hDgrLJriJ3VeKYtTu+andHWt8P29b5h57UayInDOUJurEPA6rXyFZ5YVIti8n21K uDOrQXiACG3qRWS2+p2f3UNhX0MkFNFdN/lxi13WMIJtRWF5bXAP+JOgIWCID4Ze QuSY6rQD4dp4Q6M2erpX6tn0YZb7Hvw3rPjsd91n6jvYfTUVLH375zg8jCBpi6Wi Syufbb8xcTtriVPTDRNu0ekjebkc8wD8ax/h86g0z9v3Ua4DlNmsx9eXrtv6r5nu YXqDODOad6stI0+owFquW2vas0gHmfNSfyfGdlk2g24PMtP5Yx0V6FIEvwIeqnje ghgxQvBuKUsdhqakByfNnc+XvXi3+RUJek8kvMeUSUQWT1IyMQqPOOk0yp9WdyWD bY1f2ECP5BR1b37zYOyawewsI5xTupHUswn5a4r4qtGn3O15rGDkX98Nab5aLCnR rW/DvX7+wT6gW9EwrRHiwjwfNDZbsJ9Ggu3lMhtUl5GUWdk58yTiVgKaHJLnlX9/ CKFKfyyIR1Vl8+gYIpemyFhhcoN+dCSf06ISkrg0jeS0/tYwydaAaCBPL5J4kxZA h3Rtbh+Pgg== =EXYs -----END PGP SIGNATURE----- Merge tag 'for-6.14/block-20250118' of git://git.kernel.dk/linux Pull block updates from Jens Axboe: - NVMe pull requests via Keith: - Target support for PCI-Endpoint transport (Damien) - TCP IO queue spreading fixes (Sagi, Chaitanya) - Target handling for "limited retry" flags (Guixen) - Poll type fix (Yongsoo) - Xarray storage error handling (Keisuke) - Host memory buffer free size fix on error (Francis) - MD pull requests via Song: - Reintroduce md-linear (Yu Kuai) - md-bitmap refactor and fix (Yu Kuai) - Replace kmap_atomic with kmap_local_page (David Reaver) - Quite a few queue freeze and debugfs deadlock fixes Ming introduced lockdep support for this in the 6.13 kernel, and it has (unsurprisingly) uncovered quite a few issues - Use const attributes for IO schedulers - Remove bio ioprio wrappers - Fixes for stacked device atomic write support - Refactor queue affinity helpers, in preparation for better supporting isolated CPUs - Cleanups of loop O_DIRECT handling - Cleanup of BLK_MQ_F_* flags - Add rotational support for null_blk - Various fixes and cleanups * tag 'for-6.14/block-20250118' of git://git.kernel.dk/linux: (106 commits) block: Don't trim an atomic write block: Add common atomic writes enable flag md/md-linear: Fix a NULL vs IS_ERR() bug in linear_add() block: limit disk max sectors to (LLONG_MAX >> 9) block: Change blk_stack_atomic_writes_limits() unit_min check block: Ensure start sector is aligned for stacking atomic writes blk-mq: Move more error handling into blk_mq_submit_bio() block: Reorder the request allocation code in blk_mq_submit_bio() nvme: fix bogus kzalloc() return check in nvme_init_effects_log() md/md-bitmap: move bitmap_{start, end}write to md upper layer md/raid5: implement pers->bitmap_sector() md: add a new callback pers->bitmap_sector() md/md-bitmap: remove the last parameter for bimtap_ops->endwrite() md/md-bitmap: factor behind write counters out from bitmap_{start/end}write() md: Replace deprecated kmap_atomic() with kmap_local_page() md: reintroduce md-linear partitions: ldm: remove the initial kernel-doc notation blk-cgroup: rwstat: fix kernel-doc warnings in header file blk-cgroup: fix kernel-doc warnings in header file nbd: fix partial sending ...	2025-01-20 19:38:46 -08:00
Kent Overstreet	ff0b7ed607	bcachefs: Fix check_inode_hash_info_matches_root() Can't use memcmp() when the struct contains padding. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-15 15:28:23 -05:00
Kent Overstreet	a4e11cea27	bcachefs: Document issue with bch_stripe layout We've got a problem with bch_stripe that is going to take an on disk format rev to fix - we can't access the block sector counts if the checksum type is unknown. Document it for now, there are a few other things to fix as well. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-14 10:45:31 -05:00
Kent Overstreet	78423deb51	bcachefs: Fix self healing on read error We were incorrectly checking if there'd been an io error. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-14 10:45:31 -05:00
Alan Huang	5dd21b2712	bcachefs: Pop all the transactions from the abort one The transaction is going to abort, so there will be no cycle involving this transaction anymore. Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-14 10:45:25 -05:00
Alan Huang	b169138d48	bcachefs: Only abort the transactions in the cycle When the cycle doesn't involve the initiator of the cycle detection, we might choose a transaction that is not involved in the cycle to abort. It shouldn't be that since it won't break the cycle, this patch therefore chooses the transaction in the cycle to abort. Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-14 10:45:18 -05:00
Alan Huang	6853a5e5d4	bcachefs: Introduce lock_graph_pop_from This patch introduces a helper function called lock_graph_pop_from, it pops the graph from i. Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-14 10:45:13 -05:00
Alan Huang	b5c3dcd0db	bcachefs: Convert open-coded lock_graph_pop_all to helper Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-14 10:45:08 -05:00
Alan Huang	0ef9ab34f4	bcachefs: Do not allow no fail lock request to fail If the transaction chose itself as a victim before and restarted, it might request a no fail lock request this time. But it might be added to others' lock graph and be chose as the victim again, it's no longer safe without additional check. We can also convert the cycle detector to be fully RCU-based to solve that unsoundness, but the latency added to trans_put and additional memory required may not worth it. Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-14 10:45:08 -05:00
Alan Huang	cdc419dbf2	bcachefs: Merge the condition to avoid additional invocation If the lock has been acquired and unlocked, we don't have to do clear and wakeup again, though harmless since we hold the intent lock. Merge the condition might be clearer. Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-14 10:45:08 -05:00
Alan Huang	9c13cc9c7d	Revert "bcachefs: Fix bch2_btree_node_upgrade()" This reverts commit `62448afee7`. six_lock_tryupgrade fails only if there is an intent lock held, it won't fail no matter how many read locks are held. Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-14 10:45:08 -05:00
Hongbo Li	c72deb03ff	bcachefs: bcachefs_metadata_version_directory_size This adds another metadata version for accounting directory size. For the new version of the filesystem, when new subdirectory items are created or deleted, the parent directory's size will change accordingly. For the old version of the existed file system, running fsck will automatically upgrade the metadata version, and it will do the check and recalculationg of the directory size. Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-13 14:58:38 -05:00
Hongbo Li	e614a6c52d	bcachefs: make directory i_size meaningful The isize of directory is 0 in bcachefs if the directory is empty. With more child dirents created, its size ought to change. Many other filesystems changed as that (ie. xfs and btrfs). And many of them changed as the size of child dirent name. Although the directory size may not seem to convey much, we can still give it some meaning. The formula of dentry size as follow: occupied_size = 40 + ALIGN(9 + namelen, 8) Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-13 14:58:38 -05:00
Kent Overstreet	4204e3bf63	bcachefs: check_unreachable_inodes is not actually PASS_ONLINE yet check_unreachable_inodes does work in online mode, with the one caveat that it assumes check_dirents has also run - and check_dirents is not PASS_ONLINE yet. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-09 23:38:42 -05:00
Kent Overstreet	ae153f2e11	bcachefs: Don't use BTREE_ITER_cached when walking alloc btree during fsck No need to pull the whole alloc btree into the btree key cache. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-09 23:38:42 -05:00
Kent Overstreet	15734b5e6f	bcachefs: Check for dirents to overwritten inodes This fixes various "dirent to missing inode" errors. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-09 23:38:42 -05:00
Kent Overstreet	d3d0fac57d	bcachefs: bch2_btree_iter_peek_slot() handles navigating to nonexistent depth Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-09 23:38:42 -05:00
Kent Overstreet	bd5b09727f	bcachefs: Don't set btree_path to updtodate if we don't fill This fixes various locking asserts, and a null ptr deref in bch2_btree_iter_peek_path(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-09 23:38:42 -05:00
Kent Overstreet	cf67f46641	bcachefs: __bch2_btree_pos_to_text() Factor out a version of bch2_btree_pos_to_text() that doesn't take a pointer to a in-memory btree node, to be used for btree node scrub. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-09 23:38:42 -05:00
Kent Overstreet	0a46ea9d46	bcachefs: printbuf_reset() handles tabstops Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-09 23:38:42 -05:00
Kent Overstreet	5906dcb993	bcachefs: Silence read-only errors when deleting snapshots Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-09 23:38:42 -05:00
Kent Overstreet	8b1f46bff3	bcachefs: Dropped superblock write is no longer a fatal error Just emit a warning if errors=continue or fix_safe. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-09 23:38:42 -05:00
Kent Overstreet	8cfdc6ce1f	bcachefs: bch2_trans_node_drop() Factor out a small common helper. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-09 23:38:42 -05:00
Kent Overstreet	0971a72c3d	bcachefs: bch2_trans_unlock_write() New helper for dropping all write locks; which is distinct from the helper the transaction commit path uses, which is faster and only touches updates. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-09 23:38:42 -05:00
Kent Overstreet	e1911d7a69	bcachefs: btree_node_unlock() can now drop write locks Prep work for reworking btree node locking during interior btree updates. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-09 23:38:41 -05:00
Kent Overstreet	9a5232ef0a	bcachefs: six locks: write locks can now be held recursively This is needed for the interior update locking rework, where we'll be holding node write locks for the duration of the update - which is needed for synchronizing with online check_allocations. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-09 23:38:41 -05:00
Kent Overstreet	8f3aaa5d5d	bcachefs: bch2_fs_btree_gc_init() Now returns errors, prep work for check_allocations_done_lock Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-09 23:38:41 -05:00
Kent Overstreet	cb3f34982c	bcachefs: Assert that btree write buffer only touches the right btrees More asserts, more better. Also, clean up the per-btree flags a bit. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-09 23:38:41 -05:00
Kent Overstreet	bdedae70f5	bcachefs: bch2_inum_path() now crosses subvolumes correctly The dirent that points to a subvolume root is in the parent subvolume. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-09 23:38:41 -05:00
Kent Overstreet	ce9a21713b	bcachefs: bch2_inum_path() no longer returns an error for disconnected inums bch2_inum_path() should work even if the filesystem is corrupted - we don't want it to cause fsck to fail. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-09 23:38:41 -05:00
Kent Overstreet	6adc5af50a	bcachefs: btree_path_very_locks(): verify lock seq If the btree_path's lock seq is wrong, the next bch2_trans_relock() operation is guaranteed to fail and we take an unnecessary transaction restart. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-09 23:38:41 -05:00
Kent Overstreet	f908eacc34	bcachefs: fix bch2_btree_key_cache_drop() When evicting, we shouldn't leave a pointer to the key cache entry lying around - that screws up btree path asserts we're adding. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-09 23:38:41 -05:00
Kent Overstreet	bc6fce7870	bcachefs: bch2_btree_node_write_trans() Avoiding screwing up path->lock_seq. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-09 23:38:41 -05:00
Kent Overstreet	4bd06f07bc	bcachefs: Fixes for snapshot_tree.master_subvol Ensure that snapshot_tree.master_subvol is cleared when we delete the master subvolume in a tree of snapshots, and allow for snapshot trees that don't have a master subvolume in fsck. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-09 23:38:41 -05:00
Kent Overstreet	b5e4cd0871	bcachefs: Don't rely on snapshot_tree.master_subvol for reattaching Previously, fsck used the snapshot tree's master subvol for finding the root inode number - but the master subvol might have been deleting, and setting a new one should be a user operation; meaning we can't rely on it existing. Fortunately, for finding the root inode number in a tree of snapshots, finding any associated subvolume works. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-09 23:38:41 -05:00
Kent Overstreet	4541408391	bcachefs: bch2_kvmalloc() Add a version of kvmalloc() that doesn't have the INT_MAX limit; large filesystems do hit this. We'll want to get rid of the in-memory bucket gens array, but we're not there quite yet. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-09 23:38:41 -05:00
Kent Overstreet	fa3e5135e4	bcachefs: Fix assert for online fsck We can't check if we're racing with fsck ending until mark_lock is held. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-09 23:38:41 -05:00
Kent Overstreet	cf3da2d627	bcachefs: Handle -BCH_ERR_need_mark_replicas in gc Locking considerations (possibly no longer relevant?) mean that when an accounting update needs a new superblock replicas entry to be created, it's deferred to the transaction commit error path. But accounting updates for gc/fcsk aren't done from the transaction commit path - so we need to handle -BCH_ERR_btree_insert_need_mark_replicas locally. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-09 23:38:41 -05:00
Kent Overstreet	861cd0f606	bcachefs: Write lock btree node in key cache fills this addresses a key cache coherency bug Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-09 23:38:41 -05:00
Kent Overstreet	baf13d8344	bcachefs: kill __bch2_btree_iter_flags() bch2_btree_iter_flags() now takes a level parameter; this fixes a bug where using a node iterator on a leaf wouldn't set BTREE_ITER_with_key_cache, leading to fun cache coherency bugs. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-09 23:38:41 -05:00
Kent Overstreet	30e32692d6	bcachefs: Drop redundant "read error" call from btree_gc The btree node read error path already calls topology error, so this is entirely redundant, and we're not specific enough about our error codes - this was triggering for bucket_ref_update() errors. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-09 23:38:41 -05:00
Kent Overstreet	6542afe299	bcachefs: Drop racy warning Checking for writing past i_size after unlocking the folio and clearing the dirty bit is racy, and we already check it at the start. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-09 23:38:41 -05:00
Kent Overstreet	0475c7639e	bcachefs: better check_bp_exists() error message Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-09 23:38:41 -05:00
Hongbo Li	d01ea14da7	bcachefs: add counter_flags for counters In bcachefs, io_read and io_write counter record the amount of data which has been read and written. They increase in unit of sector, so to display correctly, they need to be shifted to the left by the size of a sector. Other counters like io_move, move_extent_{read, write, finish} also have this problem. In order to support different unit, we add extra column to mark the counter type by using TYPE_COUNTER and TYPE_SECTORS in BCH_PERSISTENT_COUNTERS(). Fixes: `1c6fdbd8f2` ("bcachefs: Initial commit") Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-09 23:38:41 -05:00
Kent Overstreet	3db3084a86	bcachefs: bcachefs_metadata_version_autofix_errors It's time to make self healing the default: change the error action for old filesystems to fix_safe, matching the default for current filesystems. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-09 23:38:41 -05:00
Kent Overstreet	df448ca355	bcachefs: bcachefs_metadata_version_persistent_inode_cursors Persistent cursors for inode allocation. A free inodes btree would add substantial overhead to inode allocation and freeing - a "next num to allocate" cursor is always going to be faster. We just need it to be persistent, to avoid scanning the inodes btree from the start on startup. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2025-01-09 23:38:41 -05:00
Kent Overstreet	59c50511f7	bcachefs: bcachefs_metadata_version_inode_depth This adds a new inode field, bi_depth, for directory inodes: this allows us to make the check_directory_structure pass much more efficient. Currently, to ensure the filesystem is fully connect and has no loops, for every directory we follow backpointers until we find the root. But by adding a depth counter, it sufficies to only check the parent of each directory, and check that the parent's bi_depth is smaller. (fsck doesn't require that bi_depth = parent->bi_depth + 1; if a rename causes bi_depth off, but the chain to the root is still strictly decreasing, then the algorithm still works and there's no need for fsck to fixup the bi_depth fields). We've already checked backpointers, so we know that every directory (excluding the root)has a valid parent: if bi_depth is always decreasing, every chain must terminate, and terminate at the root directory. bi_depth will not necessarily be correct when fsck runs, due to directory renames - we can't change bi_depth on every child directory when renaming a directory. That's ok; fsck will silently fix the bi_depth field as needed, and future fsck runs will be much faster. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-29 13:30:39 -05:00
Kent Overstreet	80c6352c2c	bcachefs: Option changes now get propagated to reflinked data Now that bch2_move_get_io_opts() re-propagates changed inode io options to bch_extent_rebalance, we can properly suport changing IO path options for reflinked data. Changing a per-file IO path option, either via the xattr interface or via the BCHFS_IOC_REINHERIT_ATTRS ioctl, will now trigger a scan (the inode number is marked as needing a scan, via bch2_set_rebalance_needs_scan()), and rebalance will use bch2_move_data(), which will walk the inode number and pick up the new options. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-29 13:30:39 -05:00
Kent Overstreet	ea4f9e75ec	bcachefs: bcachefs_metadata_version_reflink_p_may_update_opts Previously, io path option changes on a file would be picked up automatically and applied to existing data - but not for reflinked data, as we had no way of doing this safely. A user may have had permission to copy (and reflink) a given file, but not write to it, and if so they shouldn't be allowed to change e.g. nr_replicas or other options. This uses the incompat feature mechanism in the previous patch to add a new incompatible flag to bch_reflink_p, indicating whether a given reflink pointer may propagate io path option changes back to the indirect extent. In this initial patch we're only setting it for the source extents. We'd like to set it for the destination in a reflink copy, when the user has write access to the source, but that requires mnt_idmap which is not curretly plumbed up to remap_file_range. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-29 13:30:39 -05:00
Kent Overstreet	a36d8f0e0e	bcachefs: BCH_SB_VERSION_INCOMPAT We've been getting away from feature bits: they don't have any kind of ordering, and thus it's possible for people to enable weird combinations of features that were never tested or intended to be run. Much better to just give every new feature, compatible or incompatible, a version number. Additionally, we probably won't ever rev the major version number: major version numbers represent incompatible versions, but that doesn't really fit with how we actually roll out incompatible features - we need a better way of rolling out incompatible features. So, this patch adds two new superblock fields: - BCH_SB_VERSION_INCOMPAT - BCH_SB_VERSION_INCOMPAT_ALLOWED BCH_SB_VERSION_INCOMPAT_ALLOWED indicates that incompatible features up to version number x are allowed to be used without user prompting, but it does not by itself deny old versions from mounting. BCH_SB_VERSION_INCOMPAT does deny old versions from mounting, and must be <= BCH_SB_VERSION_INCOMPAT_ALLOWED. BCH_SB_VERSION_INCOMPAT will only be set when a codepath attempts to use an incompatible feature, so as to not unnecessarily break compatibility with old versions. bch2_request_incompat_feature() is the new interface to check if an incompatible feature may be used. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-29 13:30:39 -05:00
Kent Overstreet	d884cf189a	bcachefs: Only run check_backpointers_to_extents in debug mode The backpointers passes, check_backpointers_to_extents() and check_extents_to_backpointers() are the most expensive fsck passes. Now that we're running the same check and repair code when using a backpointer at runtime (via bch2_backpointer_get_key()) that fsck does, there's no reason fsck needs to - except to verify that the filesystem really has no errors in debug mode. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-29 13:30:39 -05:00
Kent Overstreet	7611d6b5d1	bcachefs: better backpointer_target_not_found() error message Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-29 13:30:39 -05:00
Kent Overstreet	c2c2a4d642	bcachefs: bch2_backpointer_get_key() now repairs dangling backpointers Continuing on with the self healing theme, we should be running any check and repair code at runtime that we can - instead of declaring the filesystemt inconsistent. This will also let us skip running the backpointers -> extents fsck pass except in debug mode. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-29 13:30:39 -05:00
Kent Overstreet	c738866e47	bcachefs: check_extents_to_backpointers() now only checks buckets with mismatches Instead of walking every extent and every backpointer it points to, first sum up backpointers in each bucket and check for mismatches, and only look for missing backpointers if mismatches were detected, and only check extents in those buckets. This is a major fsck scalability improvement, since the two backpointers passes (backpointers -> extents and extents -> backpointers) are the most expensive fsck passes by far. Additionally, to speed up the upgrade for backpointer bucket gens, or in situations when we have to rebuild alloc info, add a special case for when no backpointers are found in a bucket - don't check each individual backpointer (in particular, avoiding the write buffer flushes), just recreate them. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-29 13:30:39 -05:00
Kent Overstreet	056cae1c00	bcachefs: Add write buffer flush param to backpointer_get_key() In an upcoming patch bch2_backpointer_get_key() will be repairing when it finds a dangling backpointer; it will need to flush the btree write buffer before it can definitively say there's an error. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-29 13:30:39 -05:00
Kent Overstreet	7171b1fd27	bcachefs: kill __bch2_extent_ptr_to_bp() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-29 13:30:39 -05:00
Kent Overstreet	aca7a26f7f	bcachefs: bch2_extent_ptr_to_bp() no longer depends on device bch_backpointer no longer contains the bucket_offset field, it's just a direct LBA mapping (with low bits to account for compressed extent splitting), so we don't need to refer to the device to construct it anymore. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-29 13:30:39 -05:00
Kent Overstreet	ba9752e5f4	bcachefs: bcachefs_metadata_version_disk_accounting_big_endian Fix sort order for disk accounting keys, in order to fix a regression on mount times. The typetag is now the most significant byte of the key, meaning disk accounting keys of the same type now sort together. This lets us skip over disk accounting keys that aren't mirrored in memory when reading accounting at startup, instead of having them interleaved with other counter types. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-29 13:30:39 -05:00
Kent Overstreet	ebdca07268	bcachefs: bcachefs_metadata_version_backpointer_bucket_gen New on disk format version: backpointers new include the generation number of the bucket they refer to, and the obsolete bucket_offset field (no longer needed because we no longer store backpointers in alloc keys) is gone. This is an expensive forced upgrade - hopefully the last; we have to run the extents_to_backpointers recovery pass to regenerate backpointers. It's a forced incompatible upgrade because the alternative would've been permamently making backpointers bigger, and as one of the biggest btrees (along with the extents btree) that's not an ideal option. It's worth it though, because this allows us to make the check_extents_to_backpointers pass drastically cheaper: an upcoming patch changes it to sum up backpointers in a bucket and check the sum against the sector counts for that bucket, only looking for missing backpointers if they don't match (and then only for specific buckets). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-29 13:30:39 -05:00
Kent Overstreet	6679e363f4	bcachefs: bch2_btree_path_peek_slot() doesn't return errors Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-29 13:30:39 -05:00
Kent Overstreet	07c1a6fa90	bcachefs: trace_key_cache_fill Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-29 13:30:39 -05:00
Kent Overstreet	17d678bcdd	bcachefs: Log message in journal for snapshot deletion Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-29 13:30:39 -05:00
Kent Overstreet	54c9b92fc7	bcachefs: bch2_trans_log_msg() Export a helper for logging to the journal when we're already in a transaction context. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-29 13:30:39 -05:00
Kent Overstreet	d0855e2106	bcachefs: Kill snapshot_t->equiv Now entirely dead code. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-29 13:30:39 -05:00
John Garry	19206d3f5e	block: Delete bio_set_prio() Since commit `43b62ce3ff` ("block: move bio io prio to a new field"), macro bio_set_prio() does nothing but set bio->bi_ioprio. All other places just set bio->bi_ioprio directly, so replace bio_set_prio() remaining callsites with setting bio->bi_ioprio directly and delete that macro. Signed-off-by: John Garry <john.g.garry@oracle.com> Acked-by: Jack Wang <jinpu.wang@ionos.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20241202111957.2311683-3-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-12-23 08:17:23 -07:00
Kent Overstreet	35c5609abf	bcachefs: Snapshot deletion no longer uses snapshot_t->equiv Switch to generating a private list of interior nodes to delete, instead of using the equivalence class in the global data structure. This eliminates possible races with snapshot creation, and is much cleaner - it'll let us delete a lot of janky code for calculating and maintaining the equivalence classes. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:23 -05:00
Kent Overstreet	85c060f62d	bcachefs: Kill equiv_seen arg to delete_dead_snapshots_process_key() When deleting dead snapshots, we move keys from redundant interior snapshot nodes to child nodes - unless there's already a key, in which case the ancestor key is deleted. Previously, we tracked via equiv_seen whether the child snapshot had a key, but this was tricky w.r.t. transaction restarts, and not transactionally safe w.r.t. updates in the child snapshot. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:23 -05:00
Kent Overstreet	92e31d4251	bcachefs: Don't run overwrite triggers before insert This breaks when the trigger is inserting updates for the same btree, as the inode trigger now does. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:23 -05:00
Kent Overstreet	f859bc945e	bcachefs: alloc_data_type_set() happens in alloc trigger Originally, we ran insert triggers before overwrite so that if an extent was being moved (by fallocate insert/collapse range), the bucket sector count wouldn't hit 0 partway through, and so we don't trigger state changes caused by that too soon. But this is better solved by just moving the data type change to the alloc trigger itself, where it's already called. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:23 -05:00
Kent Overstreet	b9a37144da	bcachefs: Fix key cache + BTREE_ITER_all_snapshots Normally, whitouts (KEY_TYPE_whitout) are filtered from btree lookups, since they exist only to represent deletions of keys in ancestor snapshots - except, they should not be filtered in BTREE_ITER_all_snapshots mode, so that e.g. snapshot deletion can clean them up. This means that that the key cache has to store whiteouts, and key cache fills cannot filter them. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:23 -05:00
Kent Overstreet	7e320a4063	bcachefs: Fix btree_trans_peek_key_cache() BTREE_ITER_all_snapshots In BTREE_ITER_all_snapshots mode, we're required to only return keys where the snapshot field matches the iterator position - BTREE_ITER_filter_snapshots requires pulling keys into the key cache from ancestor snapshots, so we have to check for that. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:23 -05:00
Kent Overstreet	c50341be4e	bcachefs: tidy btree_trans_peek_journal() Change to match bch2_btree_trans_peek_updates() calling convention. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:23 -05:00
Kent Overstreet	68eb4fdd8c	bcachefs: tidy up __bch2_btree_iter_peek() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:23 -05:00
Kent Overstreet	25a3123a67	bcachefs: check_indirect_extents can run online Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:23 -05:00
Kent Overstreet	00fa283a41	bcachefs: Refactor c->opts.reconstruct_alloc Now handled in one place. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:23 -05:00
Nathan Chancellor	64833d3965	bcachefs: Add empty statement between label and declaration in check_inode_hash_info_matches_root() Clang 18 and newer warns (or errors with CONFIG_WERROR=y): fs/bcachefs/str_hash.c:164:2: error: label followed by a declaration is a C23 extension [-Werror,-Wc23-extensions] 164 \| struct bch_inode_unpacked inode; \| ^ In Clang 17 and prior, this is an unconditional hard error: fs/bcachefs/str_hash.c:164:2: error: expected expression 164 \| struct bch_inode_unpacked inode; \| ^ fs/bcachefs/str_hash.c:165:30: error: use of undeclared identifier 'inode' 165 \| ret = bch2_inode_unpack(k, &inode); \| ^ fs/bcachefs/str_hash.c:169:55: error: use of undeclared identifier 'inode' 169 \| struct bch_hash_info hash2 = bch2_hash_info_init(c, &inode); \| ^ fs/bcachefs/str_hash.c:171:40: error: use of undeclared identifier 'inode' 171 \| ret = repair_inode_hash_info(trans, &inode); \| ^ Add an empty statement between the label and the declaration to fix the warning/error without disturbing the code too much. Fixes: 2519d3b0d656 ("bcachefs: bch2_str_hash_check_key() now checks inode hash info") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202412092339.QB7hffGC-lkp@intel.com/ Signed-off-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:23 -05:00
Kent Overstreet	3f57171d8d	bcachefs: trace_write_buffer_maybe_flush Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:23 -05:00
Kent Overstreet	9f95fc3c12	bcachefs: bch2_snapshot_exists() bch2_snapshot_equiv() is going away; convert users that just wanted to know if the snapshot exists to something better Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:23 -05:00
Kent Overstreet	be203120dc	bcachefs: bch2_check_key_has_snapshot() prints btree id Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:23 -05:00
Kent Overstreet	6ea607ca61	bcachefs: bch2_str_hash_check_key() now checks inode hash info Versions of the same inode in different snapshots must have the same hash info; this is critical for lookups to work correctly. We're going to be running the str_hash checks online, at readdir or xattr list time, so we now need str_hash_check_key() to check for inode hash seed mismatches, since it won't be run right after check_inodes(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:23 -05:00
Kent Overstreet	644457ed83	bcachefs: Don't BUG_ON() inode unpack error Bkey validation checks that inodes are well-formed and unpack successfully, so an unpack error should always indicate memory corruption or some other kind of hardware bug - but these are still errors we can recover from. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:23 -05:00
Kent Overstreet	7b11260456	bcachefs: Use proper errcodes for inode unpack errors Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:23 -05:00
Kent Overstreet	cd150cf924	bcachefs: kill sysfs internal/accounting Since we added per-inode counters there's now far too many counters to show in one shot - if we want this in the future, it'll have to be in debugfs. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:23 -05:00
Kent Overstreet	49f2d18263	bcachefs: Kill unnecessary mark_lock usage We can't hold mark_lock while calling fsck_err() - that's a deadlock, mark_lock is meant to be a leaf node lock. It's also unnecessary for gc_bucket() and bucket_gen(); rcu suffices since the bucket_gens array describes its size, and we can't race with device removal or resize during gc/fsck since that takes state lock. Reported-by: syzbot+38641fcbda1aaffefdd4@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:22 -05:00
Kent Overstreet	54dacdada6	bcachefs: Don't start rewriting btree nodes until after journal replay This fixes a deadlock during journal replay when btree node read errors kick off a ton of rewrites: we don't want them competing with journal replay. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:22 -05:00
Kent Overstreet	9e779f3f24	bcachefs: Fix reuse of bucket before journal flush on multiple empty -> nonempty transition For each bucket we track when the bucket became nonempty and when it became empty again: if we can ensure that there will be no journal flushes in the range [nonempty, empty) (possibly because they occured at the same journal sequence number), then it's safe to reuse the bucket without waiting for a journal commit. This is a major performance optimization for erasure coding, where writes are initially replicated, but the extra replicas are quickly dropped: if those buckets are reused and overwritten without issuing a cache flush to the underlying device, then they only cost bus bandwidth. But there's a tricky corner case when there's multiple empty -> nonempty -> empty transitions in quick succession, i.e. when data is getting overwritten immediately as it's being written. If this happens and the previous empty transition hasn't been flushed, we need to continue tracking the previous nonempty transition - not start a new one. Fixing this means we now need to track both the nonempty and empty transitions in bch_alloc_v4. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:22 -05:00
Kent Overstreet	89e74eccab	bcachefs: bch2_journal_noflush_seq() now takes [start, end) Harder to screw up if we're explicit about the range, and more correct as journal reservations can be outstanding on multiple journal entries simultaneously. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:22 -05:00
Kent Overstreet	be565740ee	bcachefs: Set bucket needs discard, inc gen on empty -> nonempty transition Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:22 -05:00
Kent Overstreet	44a43cf9fd	bcachefs: Don't add unknown accounting types to eytzinger tree Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:22 -05:00
Kent Overstreet	60558d55f7	bcachefs: Plumb bkey_validate_context to journal_entry_validate This lets us print the exact location in the journal if it was found in the journal, or correctly print if it was found in the superblock. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:22 -05:00
Kent Overstreet	bbe36bd099	bcachefs: Use a heap for handling overwrites in btree node scan Fix an O(n^2) issue when we find many overlapping (overwritten) btree nodes - especially when one node overwrites many smaller nodes. This was discovered to be an issue with the bcachefs merge_torture_flakey test - if we had a large btree that was then emptied, the number of difficult overwrites can be unbounded. Cc: Kuan-Wei Chiu <visitorckw@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:22 -05:00
Kent Overstreet	fbd152bf94	bcachefs: Minor bucket alloc optimization Check open buckets and buckets waiting for journal commit before doing other expensive lookups. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:22 -05:00
Kent Overstreet	f65645d804	bcachefs: Mark more errors autofix tested repairing from a bug uncovered by the merge_torture_flakey test Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:22 -05:00
Kent Overstreet	821ddebbc2	bcachefs: fix bch2_btree_node_header_to_text() format string Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:22 -05:00
Kent Overstreet	58117dbdd6	bcachefs: Journal space calculations should skip durability=0 devices Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:22 -05:00
Kent Overstreet	d4c9fc000b	bcachefs: factor out str_hash.c Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:22 -05:00
Kent Overstreet	ce70157112	bcachefs: kill flags param to bch2_subvolume_get() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:22 -05:00
Kent Overstreet	23f88c1d16	bcachefs: Don't call bch2_btree_interior_update_will_free_node() until after update succeeds Originally, btree splits always succeeded once we got to the point of recursing to the btree_insert_node() call. But that changed when we switched to not taking intent locks all the way up to the root, and that introduced a bug, because bch2_btree_interior_update_will_free_node() cancels paending writes and reparents a node that's going to be made visible on disk by another btree update to the current btree update. This was discovered in recent backpointers work, because bch2_btree_interior_update_will_free_node() also clears the will_make_reachable flag, causing backpointer target lookup to spuriously thing it had found a dangling backpointer (when the backpointer just hadn't been created yet by btree_update_nodes_written()). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:22 -05:00
Kent Overstreet	c67fab0774	bcachefs: Make sure __bch2_run_explicit_recovery_pass() signals to rewind We should always signal to rewind if the requested pass hasn't been run, even if called multiple times. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:22 -05:00
Kent Overstreet	90c6daa6ac	bcachefs: Call bch2_btree_lost_data() on btree read error Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:22 -05:00
Kent Overstreet	ff7e7c5367	bcachefs: Journal write path refactoring, debug improvements Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:22 -05:00
Kent Overstreet	47d6ee766f	bcachefs: dev_alloc_list.devs -> dev_alloc_list.data This lets us use darray macros on dev_alloc_list (and it will become a darray eventually, when we increase the maximum number of devices). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:22 -05:00
Kent Overstreet	49833ce27e	bcachefs: Fix failure to allocate journal write on discard retry When allocating a journal write fails, then retries after doing discards, we were failing to count already allocated replicas. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:22 -05:00
Kent Overstreet	6728f8f829	bcachefs: BCH_ERR_insufficient_journal_devices kill another standard error code use Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:21 -05:00
Kent Overstreet	3f1cf04ff9	bcachefs: Silence "unable to allocate journal write" if we're already RO Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:21 -05:00
Kent Overstreet	400af9a398	bcachefs: trace_accounting_mem_insert Add a tracepoint for inserting new accounting entries: we're seeing odd spinning behaviour in accounting read. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:21 -05:00
Kent Overstreet	e3474394eb	bcachefs: Advance to next bp on BCH_ERR_backpointer_to_overwritten_btree_node Don't spin. Fixes: de95cc201a97 ("bcachefs: Kill bch2_get_next_backpointer()") Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:21 -05:00
Kent Overstreet	8dabb19ff4	bcachefs: Simplify disk accounting validate late The validate late path was iterating over accounting entries in eytzinger order, which is unnecessarily tricky when we may have to remove entries. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:21 -05:00
Kent Overstreet	f78760dede	bcachefs: logged ops only use inum 0 of logged ops btree we wish to use the logged ops btree for other items that aren't strictly logged ops: cursors for inode allocation There's no reason to create another cached btree for inode allocator cursors - so reserve different parts of the keyspace for different purposes. Older versions will ignore or delete the cursors. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:21 -05:00
Kent Overstreet	ad0b2544ec	bcachefs: rcu_pending now works in userspace Introduce a typedef to handle the difference between unsigned long/struct urcu_gp_poll_state. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:21 -05:00
Geert Uytterhoeven	d36b3e74b6	bcachefs: BCACHEFS_PATH_TRACEPOINTS should depend on TRACING When tracing is disabled, there is no point in asking the user about enabling extra btree_path tracepoints in bcachefs. Fixes: `32ed4a620c` ("bcachefs: Btree path tracepoints") Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:21 -05:00
Kent Overstreet	9c22dd02ae	bcachefs: Fix allocating too big journal entry The "journal space available" calculations didn't take into account mismatched bucket sizes; we need to take the minimum space available out of our devices. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:21 -05:00
Kent Overstreet	5cdaec193a	bcachefs: Improve "unable to allocate journal write" message Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:21 -05:00
Kent Overstreet	511ddcdb2d	bcachefs: fix bch2_journal_key_insert_take() seq Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:21 -05:00
Kent Overstreet	c1f618f4f7	bcachefs: bch2_async_btree_node_rewrites_flush() Add a method to flush btree node rewrites at the end of recovery, to ensure that corrected errors are persisted. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:21 -05:00
Kent Overstreet	b29769c72d	bcachefs: If we did repair on a btree node, make sure we rewrite it Ensure that "invalid bkey" repair gets persisted, so that it doesn't repeatedly spam the logs. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:21 -05:00
Kent Overstreet	1302eeb7c5	bcachefs: bkey_fsck_err now respects errors_silent Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:21 -05:00
Kent Overstreet	7807b5b07d	bcachefs: list_pop_entry() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:21 -05:00
Kent Overstreet	097cc9d0d6	bcachefs: Convert write path errors to inum_to_path() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:21 -05:00
Kent Overstreet	f7727a6767	bcachefs: bch2_inum_to_path() Add a function for walking backpointers to find a path from a given inode number, and convert various error messages to use it. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:21 -05:00
Kent Overstreet	c9b9afe78c	bcachefs: Fix fsck.c build in userspace Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:21 -05:00
Yang Li	2f8d5edf55	bcachefs: Add missing parameter description to bch2_bucket_alloc_trans() The function bch2_bucket_alloc_trans() lacked a description for the nowait parameter in its documentation comment block. This patch adds the missing description to ensure all parameters are properly documented. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=12179 Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:21 -05:00
Kent Overstreet	2cd85fea49	bcachefs: Don't recurse in check_discard_freespace_key When calling check_discard_freeespace_key from the allocator, we can't repair without recursing - run it asynchronously instead. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:21 -05:00
Kent Overstreet	9bdb3b73e7	bcachefs: Check for extent crc uncompressed/compressed size mismatch When not compressed, these must be equal - this fixes an assertion pop in bch2_rechecksum_bio(). Reported-by: syzbot+50d3544c9b8db9c99fd2@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:21 -05:00
Kent Overstreet	ff1dd05f82	bcachefs: bch2_trans_relock() is trylock for lockdep fix some spurious lockdep splats Reported-by: syzbot+e088be3c2d5c05aaac35@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:20 -05:00
Kent Overstreet	f7f196170d	bcachefs: cryptographic MACs on superblock are not (yet?) supported We should add support for cryptographic macs on the superblock - and it won't be hard, but it'll need an incompatible feature bit (and we have a new incompatible feature versioning scheme coming). For now, just add a guard to avoid a dull ptr deref in gen_poly_key(). Reported-by: syzbot+dd3d9835055dacb66f35@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:20 -05:00
Kent Overstreet	4746ee182a	bcachefs: Check for inode journal seq in the future More check and repair code: this fixes a warning in bch2_journal_flush_seq_async() Reported-by: syzbot+d119b445ec739e7f3068@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:20 -05:00
Kent Overstreet	0eafe758ac	bcachefs: Check for bucket journal seq in the future This fixes an assertion pop in bch2_journal_noflush_seq() - log the error to the superblock and continue instead. Reported-by: syzbot+85700120f75fc10d4e18@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:20 -05:00
Kent Overstreet	8b10590918	bcachefs: do_fsck_ask_yn() __bch2_fsck_err() is huge, and badly needs more refactoring Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:20 -05:00
Kent Overstreet	052210c3fa	bcachefs: Don't error out when logging fsck error Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:20 -05:00
Kent Overstreet	cfba90aba9	bcachefs: mark more errors AUTOFIX mark errors as autofix where syzbot has hit the repair paths Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:20 -05:00
Kent Overstreet	914381013b	bcachefs: add missing printbuf_reset() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:20 -05:00
Kent Overstreet	0184dfa3b8	bcachefs: Fix journal_iter list corruption Fix exiting an iterator that wasn't initialized. Reported-by: syzbot+2f7c2225ed8a5cb24af1@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:20 -05:00
Kent Overstreet	f11ca2ab18	bcachefs: Guard against backpointers to unknown btrees Reported-by: syzbot+997f0573004dcb964555@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:20 -05:00
Kent Overstreet	f9e0a9be70	bcachefs: Issue a transaction restart after commit in repair transaction commits invalidate pointers to btree values, and they also downgrade intent locks. This breaks the interior btree update path, which takes intent locks and then calls into the allocator. This isn't an ideal solution: we can't unconditionally issue a restart after a transaction commit, because that would break other codepaths. Reported-by: syzbot+78d82470c16a49702682@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:20 -05:00
Kent Overstreet	b3d82c2f27	bcachefs: Guard against journal seq overflow Wraparound is impractical to handle since in various places we use 0 as a sentinal value - but 64 bits (or 56, because the btree write buffer steals a few bits) is enough for all practical purposes. Reported-by: syzbot+73ed43fbe826227bd4e0@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:20 -05:00
Kent Overstreet	9963a14da1	bcachefs: BCH_FS_recovery_running If we're autofixing topology errors, we shouldn't shutdown if we're still in recovery. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:20 -05:00
Kent Overstreet	124e108185	bcachefs: Make topology errors autofix These repair paths are well tested, we can repair them without explicit user intervention This also tweaks bch2_topology_error() so that we run topology repair if we're in recovery, not just fsck. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:20 -05:00
Kent Overstreet	a6f4794fcd	bcachefs: struct bkey_validate_context Add a new parameter to bkey validate functions, and use it to improve invalid bkey error messages: we can now print the btree and depth it came from, or if it came from the journal, or is a btree root. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:20 -05:00
Kent Overstreet	c7e78f7b01	bcachefs: Ignore empty btree root journal entries There's no reason to treat them as errors: just ignore them, and go with a previous btree root if we had one. Reported-by: syzbot+e22007d6acb9c87c2362@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:20 -05:00
Kent Overstreet	90f3683e8f	bcachefs: Fix null ptr deref in btree_path_lock_root() Historically, we required that all btree node roots point to a valid (possibly fake) node, but we're improving our ability to continue in the presence of errors. Reported-by: syzbot+e22007d6acb9c87c2362@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:20 -05:00
Kent Overstreet	db0667a4ed	bcachefs: Go RW earlier, for normal rw mount Previously, when mounting read-write after a clean shutdown, we wouldn't go read-write until after all the recovery passes completed. Now, go RW early in recovery, the same as any other situation we'll need to go read-write. This fixes a bug where we discover unlinked inodes after a clean shutdown: repair fails because we're read only. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:20 -05:00
Kent Overstreet	d941597636	bcachefs: Fix bch2_btree_node_update_key_early() Fix an assertion pop from the recent btree cache freelist fixes. Fixes: `baefd3f849` ("bcachefs: btree_cache.freeable list fixes") Reported-by: Tyler <th020394@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:20 -05:00
Kent Overstreet	686d2ebec6	bcachefs: Change "disk accounting version 0" check to commit only 6.11 had a bug where we'd sometimes create disk accounting keys with version 0, which causes issues for journal replay - but we don't need to delete existing accounting keys with version 0. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:20 -05:00
Kent Overstreet	dba8243f3b	bcachefs: Don't try to en/decrypt when encryption not available If a btree node says it's encrypted, but the superblock never had an encryptino key - whoops, that needs to be handled. Reported-by: syzbot+026f1857b12f5eb3f9e9@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:20 -05:00
Kent Overstreet	75eabea698	bcachefs: Fix dup/misordered check in btree node read We were checking for out of order keys, but not duplicate keys. Reported-by: syzbot+dedbd67513939979f84f@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:19 -05:00
Kent Overstreet	1415265480	bcachefs: Bad btree roots are now autofix Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:19 -05:00
Kent Overstreet	828552ca74	bcachefs: Kill bch2_bucket_alloc_new_fs() The early-early allocation path, bch2_bucket_alloc_new_fs(), is no longer needed - and inconsistencies around new_fs_bucket_idx have been a frequent source of bugs. Reported-by: syzbot+592425844580a6598410@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:19 -05:00
Kent Overstreet	abf23afa36	bcachefs: Fix btree node scan when unknown btree IDs are present btree_root entries for unknown btree IDs are created during recovery, before reading those btree roots. But btree_node_scan may find btree nodes with unknown btree IDs when we haven't seen roots for those btrees. Reported-by: syzbot+1f202d4da221ec6ebf8e@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:19 -05:00
Kent Overstreet	427db7ffe9	bcachefs: backpointer_to_missing_ptr is now autofix Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:19 -05:00
Kent Overstreet	b71d89bd7b	bcachefs: Fix accounting_read when we rewind If we rewind recovery to run topology repair, that causes accounting_read to run twice. This fixes accounting being double counted. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:19 -05:00
Kent Overstreet	a7ecd5f2cc	bcachefs: disk_accounting: bch2_dev_rcu -> bch2_dev_rcu_noerror Accounting keys that reference invalid devices are corrected by fsck, they shouldn't cause an emergency shutdown. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:19 -05:00
Kent Overstreet	6534a404d4	bcachefs: errcode cleanup: journal errors Instead of throwing standard error codes, we should be throwing dedicated private error codes, this greatly improves debugability. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:19 -05:00
Kent Overstreet	525be09e63	bcachefs: Use separate rhltable for bch2_inode_or_descendents_is_open() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:19 -05:00
Kent Overstreet	375d21b76d	bcachefs: BCH_ERR_btree_node_read_error_cached Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:19 -05:00
Kent Overstreet	0eaac0b44f	bcachefs: btree_write_buffer_flush_seq() no longer closes journal Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:19 -05:00
Kent Overstreet	bb61afebca	bcachefs: discard fastpath now uses bch2_discard_one_bucket() The discard bucket fastpath previously was using its own code for discarding buckets and clearing them in the need_discard btree, which didn't have any of the consistency checks of the main discard path. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:19 -05:00
Kent Overstreet	e1cb4f56dc	bcachefs: Bias reads more in favor of faster device Per reports of performance issues on mixed multi device filesystems where we're issuing too much IO to the spinning rust - tweak this algorithm. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:19 -05:00
Kent Overstreet	f4d67f6d5a	bcachefs: trivial btree write buffer refactoring Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:19 -05:00
Kent Overstreet	c601e5d7da	bcachefs: Can now block journal activity without closing cur entry Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:19 -05:00
Kent Overstreet	c80f33b752	bcachefs: New backpointers helpers - bch2_backpointer_del() - bch2_backpointer_maybe_flush() Kill a bit of open coding and make sure we're properly handling the btree write buffer. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:19 -05:00
Kent Overstreet	1ab00b6cdd	bcachefs: kill bch_backpointer.bucket_offset usage bch_backpointer.bucket_offset is going away - it's no longer needed since we no longer store backpointers in alloc keys, the same information is in the key position itself. And we'll be reclaiming the space in bch_backpointer for the bucket generation number. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:19 -05:00
Kent Overstreet	e48fda6cdc	bcachefs: Fix check_backpointers_to_extents range limiting bch2_get_btree_in_memory_pos() will return positions that refer directly to the btree it's checking will fit in memory - i.e. backpointer positions, not buckets. This also means check_bp_exists() no longer has to refer to the device, and we can delete some code. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:19 -05:00
Kent Overstreet	eb25733aba	bcachefs: bch_backpointer -> bkey_i_backpointer Since we no longer store backpointers in alloc keys, there's no reason not to pass around bkey_i_backpointers; this means we don't have to pass the bucket pos separately. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:19 -05:00
Kent Overstreet	abff9b149d	bcachefs: Drop swab code for backpointers in alloc keys Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:19 -05:00
Kent Overstreet	5b5a7ae8fa	bcachefs: bucket_pos_to_bp_end() Better helpers for iterating over backpointers within a specific bucket Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:19 -05:00
Kent Overstreet	debe6965ac	bcachefs: check for backpointers to invalid device Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:19 -05:00
Kent Overstreet	3b6ebc94a0	bcachefs: fix bp_pos_to_bucket_nodev_noerror _noerror means don't produce inconsistent errors, so it should be using bch2_dev_rcu_noerror(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:18 -05:00
Kent Overstreet	4de2c24aa9	bcachefs: Fix evacuate_bucket tracepoint 86a494c8eef9 ("bcachefs: Kill bch2_get_next_backpointer()") dropped some things the tracepoint emitted because bch2_evacuate_bucket() no longer looks at the alloc key - but we did want at least some of that. We still no longer look at the alloc key so we can't report on the fragmentation number, but that's a direct function of dirty_sectors and a copygc concern anyways - copygc should get its own tracepoint that includes information from the fragmentation LRU. But we can report on the number of sectors we moved and the bucket size. Co-developed-by: Piotr Zalewski <pZ010001011111@proton.me> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:18 -05:00
Kent Overstreet	eae6c4a625	bcachefs: fix O(n^2) issue with whiteouts in journal keys The journal_keys array can't be substantially modified after we go RW, because lookups need to be able to check it locklessly - thus we're limited on what we can do when a key in the journal has been overwritten. This is a problem when there's many overwrites to skip over for peek() operations. To fix this, add tracking of ranges of overwrites: we create a range entry when there's more than one contiguous whiteout. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:18 -05:00
Kent Overstreet	854724d116	bcachefs: btree_and_journal_iter: don't iterate over too many whiteouts when prefetching To help ameloriate issues with peek operations having to skip over deletions in the journal - just bail out if all we're doing is prefetching btree nodes. Since btree node prefetching runs every time we iterate to a new node, and has to sequentially scan ahead, this avoids another O(n^2). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:18 -05:00
Kent Overstreet	06d7a56fe0	bcachefs: journal keys: sort keys for interior nodes first There's an unavoidable issue with btree lookups when we're overlaying journal keys and the journal has many deletions for keys present in the btree - peek operations will have to iterate over all those deletions to find the next live key to return. This is mainly a problem for lookups in interior nodes, if we have to traverse to a leaf. Looking up an insert position in a leaf (for journal replay) doesn't have to find the next live key, but walking down the btree does. So to ameloriate this, change journal key sort ordering so that we replay keys from roots and interior nodes first. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:18 -05:00
Kent Overstreet	57026c41c9	bcachefs: kill bch2_journal_entries_free() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:18 -05:00
Kent Overstreet	3d0b3b51c5	bcachefs: Don't BUG_ON() when superblock feature wasn't set for compressed data We don't allocate the mempools for compression/decompression unless we need them - but that means there's an inconsistency to check for. Reported-by: syzbot+cb3fbcfb417448cfd278@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:18 -05:00
Kent Overstreet	e1702b9891	bcachefs: Don't use a shared decompress workspace mempool gzip and zstd require different decompress workspace sizes, and if we start with one and then start using the other at runtime we may not get the correct size Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:18 -05:00
Kent Overstreet	6a4ce7a92f	bcachefs: compression workspaces should be indexed by opt, not type type includes lz4 and lz4_old, which do not get different compression workspaces, and incompressible, a fake type - BCH_COMPRESSION_OPTS() is the correct enum to use. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:18 -05:00
Kent Overstreet	cec51e0a5d	bcachefs: add missing BTREE_ITER_intent this fixes excessive transaction restarts due to trans_commit having to upgrade Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:18 -05:00
Kent Overstreet	9e92d6e9ef	bcachefs: Kill bch2_get_next_backpointer() Since for quite some time backpointers have only been stored in the backpointers btree, not alloc keys (an aborted experiment, support for which has been removed) - we can replace get_next_backpointer() with simple btree iteration. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:18 -05:00
Kent Overstreet	7815809fca	bcachefs: Delete backpointers check in try_alloc_bucket() try_alloc_bucket() has a "safety" check, which avoids allocating a bucket if there's any backpointers present. But backpointers are not the source of truth for live data in a bucket, the bucket sector counts are; this check was fairly useless, and we're also deferring backpointers checks from fsck to runtime in the near future. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:18 -05:00
Kent Overstreet	ac745efb42	bcachefs: peek_prev_min(): Search forwards for extents, snapshots With extents and snapshots, for slightly different reasons, we may have to search forwards to find a key that compares equal to iter->pos (i.e. a key that peek_prev() should return, as it returns keys <= iter->pos). peek_slot() does this, and is an easy way to fix this case. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:18 -05:00
Kent Overstreet	7e5b8e00e2	bcachefs: Implement bch2_btree_iter_prev_min() A user contributed a filessytem dump, where the dump was actually corrupted (due to being taken while the filesystem was online), but which exposed an interesting bug in fsck - reconstruct_inode(). When itearting in BTREE_ITER_filter_snapshots mode, it's required to give an end position for the iteration and it can't span inode numbers; continuing into the next inode might mean we start seeing keys from a different snapshot tree, that the is_ancestor() checks always filter, thus we're never able to return a key and stop iterating. Backwards iteration never implemented the end position because nothing else needed it - except for reconstuct_inode(). Additionally, backwards iteration is now able to overlay keys from the journal, which will be useful if we ever decide to start doing journal replay in the background. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:18 -05:00
Kent Overstreet	acd1fc7b1f	bcachefs: discard_one_bucket() now uses need_discard_or_freespace_err() More conversion of inconsistent errors to fsck errors. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:18 -05:00
Kent Overstreet	c8e588135c	bcachefs: bch2_bucket_do_index(): inconsistent_err -> fsck_err Factor out a common helper, need_discard_or_freespace_err(), which is now used by both fsck and the runtime checks, and can repair. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:18 -05:00
Kent Overstreet	c97118f1dc	bcachefs: try_alloc_bucket() now uses bch2_check_discard_freespace_key() check_discard_freespace_key() was doing all the same checks as try_alloc_bucket(), but with repair. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:18 -05:00
Kent Overstreet	731d06e138	bcachefs: rework bch2_bucket_alloc_freelist() freelist iteration Prep work for converting try_alloc_bucket() to use bch2_check_discard_freespace_key(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:18 -05:00
Kent Overstreet	724e49c677	bcachefs: kill inconsistent err in invalidate_one_bucket() Change it to a normal fsck_err() - meaning it'll get repaired at runtime when that's flipped on. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:18 -05:00
Kent Overstreet	7579c85d9c	bcachefs: Don't delete reflink pointers to missing indirect extents To avoid tragic loss in the event of transient errors (i.e., a btree node topology error that was later corrected by btree node scan), we can't delete reflink pointers to correct errors. This adds a new error bit to bch_reflink_p, indicating that it is known to point to a missing indirect extent, and the error has already been reported. Indirect extent lookups now use bch2_lookup_indirect_extent(), which on error reports it as a fsck_err() and sets the error bit, and clears it if necessary on succesful lookup. This also gets rid of the bch2_inconsistent_error() call in __bch2_read_indirect_extent, and in the reflink_p trigger: part of the online self healing project. An on disk format change isn't necessary here: setting the error bit will be interpreted by older versions as pointing to a different index, which will also be missing - which is fine. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:18 -05:00
Kent Overstreet	3d338378d7	bcachefs: Reorganize reflink.c a bit Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:18 -05:00
Kent Overstreet	61f854da4c	bcachefs: Reserve 8 bits in bch_reflink_p Better repair for reflink pointers, as well as propagating new inode options to indirect extents, are going to require a few extra bits bch_reflink_p: so claim a few from the high end of the destination index. Also add some missing bounds checking. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:17 -05:00
Kent Overstreet	eb73e7773f	bcachefs: Kill FSCK_NEED_FSCK If we find an error that indicates that we need to run fsck, we can specify that directly with run_explicit_recovery_pass(). These are now log_fsck_err() calls: we're just logging in the superblock that an error occurred - and possibly doing an emergency shutdown, depending on policy. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:17 -05:00
Kent Overstreet	79c5e3c793	bcachefs: lru errors are expected when reconstructing alloc Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:17 -05:00
Kent Overstreet	b6269cd0ec	bcachefs: Delete dead code from bch2_discard_one_bucket() alloc key validation ensures that if a bucket is in need_discard state the sector counts are all zero - we don't have to check for that. The NEED_INC_GEN check appears to be dead code, as well: we only see buckets in the need_discard btree, and it's an error if they aren't in the need_discard state. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:17 -05:00
Kent Overstreet	7d1918b0d8	bcachefs: bch2_btree_bit_mod_iter() factor out a new helper, make it handle extents bitset btrees (freespace). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:17 -05:00
Kent Overstreet	1f282f1ee0	bcachefs: delete dead code Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:17 -05:00
Kent Overstreet	d985e63dba	bcachefs: Fix shutdown message Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:17 -05:00
Kent Overstreet	5c3911ac94	bcachefs: Don't use page allocator for sb_read_scratch Kill another unnecessary dependency on PAGE_SIZE Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:17 -05:00
Youling Tang	385d1a3c81	bcachefs: Simplify code in bch2_dev_alloc() - Remove unnecessary variable 'ret'. - Remove unnecessary bch2_dev_free() operations. Signed-off-by: Youling Tang <tangyouling@kylinos.cn> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:17 -05:00
Youling Tang	924e81c530	bcachefs: Remove redundant initialization in bch2_vfs_inode_init() `inode->v.i_ino` has been initialized to `inum.inum`. If `inum.inum` and `bi->bi_inum` are not equal, BUG_ON() is triggered in bch2_inode_update_after_write(). Signed-off-by: Youling Tang <tangyouling@kylinos.cn> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:17 -05:00
Youling Tang	5abd7ac19d	bcachefs: Removes NULL pointer checks for __filemap_get_folio return values __filemap_get_folio the return value cannot be NULL, so unnecessary checks are removed. Signed-off-by: Youling Tang <tangyouling@kylinos.cn> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:17 -05:00
Kent Overstreet	dc003efbc7	bcachefs: Add support for FS_IOC_GETFSSYSFSPATH [TEST]: ``` $ cat ioctl_getsysfspath.c #include <stdio.h> #include <stdlib.h> #include <fcntl.h> #include <sys/ioctl.h> #include <linux/fs.h> #include <unistd.h> int main(int argc, char *argv[]) { int fd; struct fs_sysfs_path sysfs_path = {}; if (argc != 2) { fprintf(stderr, "Usage: %s <path_to_file_or_directory>\n", argv[0]); exit(EXIT_FAILURE); } fd = open(argv[1], O_RDONLY); if (fd == -1) { perror("open"); exit(EXIT_FAILURE); } if (ioctl(fd, FS_IOC_GETFSSYSFSPATH, &sysfs_path) == -1) { perror("ioctl FS_IOC_GETFSSYSFSPATH"); close(fd); exit(EXIT_FAILURE); } printf("FS_IOC_GETFSSYSFSPATH: %s\n", sysfs_path.name); close(fd); return 0; } $ gcc ioctl_getsysfspath.c $ sudo bcachefs format /dev/sda $ sudo mount.bcachefs /dev/sda /mnt $ sudo ./a.out /mnt FS_IOC_GETFSSYSFSPATH: bcachefs/c380b4ab-fbb6-41d2-b805-7a89cae9cadb ``` Original patch link: [1]: https://lore.kernel.org/all/20240207025624.1019754-8-kent.overstreet@linux.dev/ Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Youling Tang <youling.tang@linux.dev> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:17 -05:00
Kent Overstreet	4f1a6b0ab4	bcachefs: Add support for FS_IOC_GETFSUUID Use super_set_uuid() to set `sb->s_uuid_len` to avoid returning `-ENOTTY` with sb->s_uuid_len being 0. Original patch link: [1]: https://lore.kernel.org/all/20240207025624.1019754-2-kent.overstreet@linux.dev/ Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Youling Tang <tangyouling@kylinos.cn> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:17 -05:00
Youling Tang	4fa5d8e166	bcachefs: Correct the description of the '--bucket=size' options Signed-off-by: Youling Tang <tangyouling@kylinos.cn> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:17 -05:00
Integral	394033dcc9	bcachefs: add support for true/false & yes/no in bool-type options Here is the patch which uses existing constant table: Currently, when using bcachefs-tools to set options, bool-type options can only accept 1 or 0. Add support for accepting true/false and yes/no for these options. Signed-off-by: Integral <integral@murena.io> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Acked-by: David Howells <dhowells@redhat.com>	2024-12-21 01:36:17 -05:00
Kent Overstreet	e5ea05293a	bcachefs: Move fsck ioctl code to fsck.c chardev.c and fs-ioctl.c are not organized by subject; let's try to fix this. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:17 -05:00
Kent Overstreet	e69df6adf8	bcachefs: Kill unnecessary iter_rewind() in bkey_get_empty_slot() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:17 -05:00
Kent Overstreet	db6e584b85	bcachefs: Simplify btree_iter_peek() filter_snapshots Collapse all the BTREE_ITER_filter_snapshots handling down into a single block; btree iteration is much simpler in the !filter_snapshots case. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:17 -05:00
Kent Overstreet	000fe8d573	bcachefs: Rename btree_iter_peek_upto() -> btree_iter_peek_max() We'll be introducing btree_iter_peek_prev_min(), so rename for consistency. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:17 -05:00
Kent Overstreet	65b14fa3d8	bcachefs: Assert that we're not violating key cache coherency rules We're not allowed to have a dirty key in the key cache if the key doesn't exist at all in the btree - creation has to bypass the key cache, so that iteration over the btree can check if the key is present in the key cache. Things break in subtle ways if cache coherency is broken, so this needs an assert. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:17 -05:00
Kent Overstreet	b318882022	bcachefs: bch2_trans_verify_not_unlocked_or_in_restart() Fold two asserts into one. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:16 -05:00
Kent Overstreet	a71a1fac90	bcachefs: Better in_restart error We're ramping up on checking transaction restart handling correctness - so, in debug mode we now save a backtrace for where the restart was emitted, which makes it much easier to track down the incorrect handling. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:16 -05:00
Kent Overstreet	2434fc38ef	bcachefs: Assert we're not in a restart in bch2_trans_put() This always indicates a transaction restart handling bug Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:16 -05:00
Kent Overstreet	69785001c6	bcachefs: Fix unhandled transaction restart in evacuate_bucket() Generally, releasing a transaction within a transaction restart means an unhandled transaction restart: but this can happen legitimately within the move code, e.g. when bch2_move_ratelimit() tells us to exit before we've retried. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:16 -05:00
Kent Overstreet	b09b34499c	bcachefs: Improved check_topology() assert On interior btree node updates, we always verify that we're not introducing topology errors: child nodes should exactly span the range of the parent node. single_device.ktest small_nodes has been popping this assert: change it to give us more information. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:16 -05:00
Kent Overstreet	a34b026482	bcachefs: Kill BCH_TRANS_COMMIT_lazy_rw We unconditionally go read-write, if we're going to do so, before journal replay: lazy_rw is obsolete. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:16 -05:00
Kent Overstreet	cc944fbe06	bcachefs: Add assert for use of journal replay keys for updates The journal replay keys mechanism can only be used for updates in early recovery, when still single threaded. Add some asserts to make sure we never accidentally use it elsewhere. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:16 -05:00
Hongbo Li	32e573c362	bcachefs: use attribute define helper for sysfs attribute The sysfs attribute definition has been wrapped into macro: rw_attribute, read_attribute and write_attribute, we can use these helpers to uniform the attribute definition. Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:16 -05:00
Hongbo Li	d3d8ec90ba	bcachefs: remove write permission for gc_gens_pos sysfs interface The gc_gens_pos is used to show the status of bucket gen gc. There is no need to assign write permissions for this attribute. Here we can use read_attribute helper to define this attribute. ``` [Before] $ ll internal/gc_gens_pos -rw-r--r-- 1 root root 4096 Oct 28 15:27 internal/gc_gens_pos [After] $ ll internal/gc_gens_pos -r--r--r-- 1 root root 4096 Oct 28 17:27 internal/gc_gens_pos ``` Fixes: `ac516d0e7d` ("bcachefs: Add the status of bucket gen gc to sysfs") Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:16 -05:00
Kent Overstreet	161d13835e	bcachefs: Move bch_extent_rebalance code to rebalance.c Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:16 -05:00
Kent Overstreet	a652c56590	bcachefs: Improve trace_rebalance_extent We now say explicitly which pointers are being moved or compressed Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:16 -05:00
Kent Overstreet	3de8b72731	bcachefs: Simplify option logic in rebalance Since bch2_move_get_io_opts() now synchronizes io_opts with options from bch_extent_rebalance, delete the ad-hoc logic in rebalance.c that previously did this. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:16 -05:00
Kent Overstreet	6aa0bd0fd5	bcachefs: get_update_rebalance_opts() bch2_move_get_io_opts() now synchronizes options loaded from the filesystem and inode (if present, i.e. not walking the reflink btree directly) with options from the bch_extent_rebalance_entry, updating the extent if necessary. Since bch_extent_rebalance tracks where its option came from we can preserve "inode options override filesystem options", even for indirect extents where we don't have access to the inode the options came from. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:16 -05:00
Kent Overstreet	4ae6bbb522	bcachefs: bch2_write_inode() now checks for changing rebalance options Previously, BCHFS_IOC_REINHERIT_ATTRS didn't trigger rebalance scans when changing rebalance options - it had been missed, only the xattr interface triggered them. Ideally they'd be done by the transactional trigger, but unpacking the inode to get the options is too heavy to be done in the low level trigger - the inode trigger is run on every extent update, since the bch_inode.bi_journal_seq has to be updated for fsync. bch2_write_inode() is a good compromise, it already unpacks and repacks and is not run in any super-fast paths. Additionally, creating the new rebalance entry to trigger the scan is now done in the same transaction as the inode update that changed the options. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:16 -05:00
Kent Overstreet	2d21d11253	bcachefs: New bch_extent_rebalance fields - Add more io path options to bch_extent_rebalance - For each option, track whether it came from the filesystem or the inode This will be used for improved rebalance support for reflinked data. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:16 -05:00
Kent Overstreet	ed13bb5726	bcachefs: bch2_prt_csum_opt() bounds checking helper Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:16 -05:00
Kent Overstreet	c225084704	bcachefs: copygc_enabled, rebalance_enabled now opts.h options They can now be set at mount time Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:16 -05:00
Kent Overstreet	7a7c43a0c1	bcachefs: Add bch_io_opts fields for indicating whether the opts came from the inode This is going to be used in the bch_extent_rebalance improvements, which propagate io_path options into the extent (important for rebalance, which needs something present in the extent for transactionally tagging them in the rebalance_work btree, and also for indirect extents). By tracking in bch_extent_rebalance whether the option came from the filesystem or the inode we can correctly handle options being changed on indirect extents. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:16 -05:00
Kent Overstreet	3000855cab	bcachefs: io_opts_to_rebalance_opts() New helper to simplify bch2_bkey_set_needs_rebalance() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:16 -05:00
Kent Overstreet	5fffe1a3c3	bcachefs: rename bch_extent_rebalance fields to match other opts structs Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:16 -05:00
Kent Overstreet	282faf9524	bcachefs: kill __bch2_bkey_sectors_need_rebalance() Single caller, fold into bch2_bkey_sectors_need_rebalance() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:16 -05:00
Kent Overstreet	c8908959ae	bcachefs: kill bch2_bkey_needs_rebalance() Dead code Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:15 -05:00
Kent Overstreet	015fafc49b	bcachefs: small cleanup for extent ptr bitmasks Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:15 -05:00
Kent Overstreet	eacb755568	bcachefs: bch2_io_opts_fixups() Centralize some io path option fixups - they weren't always being applied correctly: - background_compression uses compression if unset - background_target uses foreground_target if unset - nocow disables most fancy io path options Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:15 -05:00
Kent Overstreet	16de2c856a	bcachefs: use bch2_data_update_opts_to_text() in trace_move_extent_fail() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:15 -05:00
Kent Overstreet	a1ca525b82	bcachefs: avoid 'unsigned flags' flags should have actual types, where possible: fix btree_update.h helpers Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:15 -05:00
Thorsten Blum	901ff6555b	bcachefs: Annotate struct bucket_gens with __counted_by() Add the __counted_by compiler attribute to the flexible array member b to improve access bounds-checking via CONFIG_UBSAN_BOUNDS and CONFIG_FORTIFY_SOURCE. Use struct_size() to calculate the number of bytes to be allocated. Update bucket_gens->nbuckets and bucket_gens->nbuckets_minus_first when resizing. Compile-tested only. Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:15 -05:00
Thorsten Blum	ac9826f147	bcachefs: Use str_write_read() helper in write_super_endio() Remove hard-coded strings by using the str_write_read() helper function. Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:15 -05:00
Thorsten Blum	751d869710	bcachefs: Use str_write_read() helper in ec_block_endio() Remove hard-coded strings by using the helper function str_write_read(). Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:15 -05:00
Thorsten Blum	de902e3b4a	bcachefs: Use str_write_read() helper function Remove hard-coded strings by using the helper function str_write_read(). Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:15 -05:00
Kent Overstreet	e0c8369bc8	bcachefs: Add version check for bch_btree_ptr_v2.sectors_written validate A user popped up with a very old (0.11) filesystem that needed repair and wasn't recently backed up. Reported-by: Manoa <manoa@mail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:15 -05:00
Kent Overstreet	27de0ee39f	bcachefs: Add block plugging to read paths This will help with some of the btree_trans srcu lock hold time warnings that are still turning up; submit_bio() can block for awhile if the device is sufficiently congested. It's not a perfect solution since blk_plug bios are submitted when scheduling; we might want a way to disable the "submit on context switch" behaviour, or switch to our own plugging in the future. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:15 -05:00
Kent Overstreet	be5a7be106	bcachefs: Fix warning about passing flex array member by value this showed up when building in userspace Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:15 -05:00
Kent Overstreet	fb8c835b18	bcachefs: bch2_journal_meta() takes ref on c->writes This part of addressing https://github.com/koverstreet/bcachefs/issues/656 where we're getting stuck in bch2_journal_meta() in the dump tool. We shouldn't be invoking the journal without a ref on c->writes (if we're not RW), and there's no reason for the dump tool to be going read-write. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:15 -05:00
Kent Overstreet	8b22abb4c8	bcachefs: -o norecovery now bails out of recovery earlier -o norecovery (used by the dump tool) should be doing the absolute minimum amount of work to get the filesystem up and readable; we shouldn't be running check and repair code, or going read-write. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:15 -05:00
Kent Overstreet	d55d4a0ca2	bcachefs: Refactor new stripe path to reduce dependencies on ec_stripe_head We need to add a path for reshaping existing stripes (for e.g. device removal), and this new path won't necessarily use ec_stripe_head. Refactor the code to avoid unnecessary references to it for clarity. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:15 -05:00
Kent Overstreet	db514cf677	bcachefs: Avoid bch2_btree_id_str() Prefer bch2_btree_id_to_text() - it prints out the integer ID when unknown. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:15 -05:00
Kent Overstreet	9e2f5f7988	bcachefs: better error message in check_snapshot_tree() If we find a snapshot node and it didn't match the snapshot tree, we should print it. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:15 -05:00
Kent Overstreet	106480e9a8	bcachefs: Factor out jset_entry_log_msg_bytes() Needed for improved userspace cmd_list_journal Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:15 -05:00
Kent Overstreet	0269e27ce3	bcachefs: improved bkey_val_copy() Factor out some common code, add typechecking. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:15 -05:00
Kent Overstreet	e3c43dbe8e	bcachefs: bch2_btree_lost_data() now uses run_explicit_rceovery_pass_persistent() Also get a bit more fine grained about which passes to run for which btrees. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:15 -05:00
Kent Overstreet	d65d126c02	bcachefs: Add locking for bch_fs.curr_recovery_pass Recovery can rewind in certain situations - when we discover we need to run a pass that doesn't normally run. This can happen from another thread for btree node read errors, so we need a bit of locking. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:14 -05:00
Kent Overstreet	26c79fdc58	bcachefs: lru, accounting are alloc btrees They can be regenerated by fsck and don't require a btree node scan, like other alloc btrees. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:14 -05:00
Kent Overstreet	18f5b84a04	bcachefs: bch2_run_explicit_recovery_pass() returns different error when not in recovery if we're not in recovery then there's no way to rewind recovery - give this a different errcode so that any error messages will give us a better idea of what happened. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:14 -05:00
Kent Overstreet	eba3d7e57d	bcachefs: add more path idx debug asserts Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:14 -05:00
Thorsten Blum	55f524b706	bcachefs: Use FOREACH_ACL_ENTRY() macro to iterate over acl entries Use the existing FOREACH_ACL_ENTRY() macro to iterate over POSIX acl entries and remove the custom acl_for_each_entry() macro. Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:14 -05:00
Thorsten Blum	03525de506	bcachefs: Remove duplicate included headers The header files dirent_format.h and disk_groups_format.h are included twice. Remove the redundant includes and the following warnings reported by make includecheck: disk_groups_format.h is included more than once dirent_format.h is included more than once Reviewed-by: Hongbo Li <lihongbo22@huawei.com> Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:14 -05:00
Kent Overstreet	4e1c6ac05a	bcachefs: kill btree_trans_restart_nounlock() Redundant, the normal btree_trans_restart() doesn't unlock. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:14 -05:00
Kent Overstreet	d6cf895847	bcachefs: Remove unnecessary peek_slot() hash_lookup() used to return an errorcode, and a peek_slot() call was required to get the key it looked up. But we're adding fault injection for transaction restarts, so fix this old unconverted code. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:14 -05:00
Thomas Bertschinger	fe818d2039	bcachefs: move bch2_xattr_handlers to .rodata A series posted previously moved all of the `struct xattr_handler` tables to .rodata for each filesystem [1]. However, this appears to have been done shortly before bcachefs was merged, so bcachefs was missed at that time. Link: https://lkml.kernel.org/r/20230930050033.41174-1-wedsonaf@gmail.com [1] Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Thomas Bertschinger <tahbertschinger@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:14 -05:00
Alan Huang	bf4e42d158	bcachefs: Delete dead code lock_fail_root_changed has not been used since commit `0d7009d7ca` ("bcachefs: Delete old deadlock avoidance code") Remove it. Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:14 -05:00
Kent Overstreet	c07beca44f	bcachefs: Pull disk accounting hooks out of trans_commit.c Also, fix a minor bug in the revert path, where we weren't checking the journal entry type correctly. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:14 -05:00
Kent Overstreet	179cdecf22	bcachefs: bch_verbose_ratelimited ratelimit "deleting unlinked inode" messages Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:14 -05:00
Kent Overstreet	a55e2d78ea	bcachefs: rcu_pending: don't invoke __call_rcu() under lock In userspace we don't (yet) have an SRCU implementation, so call_srcu() recurses. But we don't want to be invoking it under the lock anyways. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:14 -05:00
Kent Overstreet	b836f22014	bcachefs: __bch2_key_has_snapshot_overwrites uses for_each_btree_key_reverse_norestart() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:14 -05:00
Kent Overstreet	1325ccf27e	bcachefs: remove_backpointer() now uses dirent_get_by_pos() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:14 -05:00
Kent Overstreet	de92b1ee67	bcachefs: bch2_inode_should_have_bp -> bch2_inode_should_have_single_bp Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:14 -05:00
Colin Ian King	1c6d5841ae	bcachefs: remove superfluous ; after statements There are a several statements with two following semicolons, replace these with just one semicolon. Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:14 -05:00
Kent Overstreet	135c0c8524	bcachefs: Fix racy use of jiffies Calculate the timeout, then check if it's positive before calling schedule_timeout(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-12-21 01:36:13 -05:00
Kent Overstreet	4b8d382b2c	Merge branch 'bcachefs-kill-retry-estale' into HEAD	2024-12-21 01:36:13 -05:00
Eric Biggers	cc354fa7f0	bcachefs: Explicitly select CRYPTO from BCACHEFS_FS Explicitly select CRYPTO from BCACHEFS_FS, so that this dependency of CRYPTO_SHA256, CRYPTO_CHACHA20, and CRYPTO_POLY1305 (which are also selected) is satisfied. Currently this dependency is satisfied indirectly via LIBCRC32C, but this is fragile and is planned to change (https://lore.kernel.org/r/20241021002935.325878-13-ebiggers@kernel.org). Acked-by: Kent Overstreet <kent.overstreet@linux.dev> Link: https://lore.kernel.org/r/20241202010844.144356-15-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>	2024-12-01 17:23:02 -08:00
Linus Torvalds	f5f4745a7f	- The series "resource: A couple of cleanups" from Andy Shevchenko performs some cleanups in the resource management code. - The series "Improve the copy of task comm" from Yafang Shao addresses possible race-induced overflows in the management of task_struct.comm[]. - The series "Remove unnecessary header includes from {tools/}lib/list_sort.c" from Kuan-Wei Chiu adds some cleanups and a small fix to the list_sort library code and to its selftest. - The series "Enhance min heap API with non-inline functions and optimizations" also from Kuan-Wei Chiu optimizes and cleans up the min_heap library code. - The series "nilfs2: Finish folio conversion" from Ryusuke Konishi finishes off nilfs2's folioification. - The series "add detect count for hung tasks" from Lance Yang adds more userspace visibility into the hung-task detector's activity. - Apart from that, singelton patches in many places - please see the individual changelogs for details. -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZ0L6lQAKCRDdBJ7gKXxA jmEIAPwMSglNPKRIOgzOvHh8MUJW1Dy8iKJ2kWCO3f6QTUIM2AEA+PazZbUd/g2m Ii8igH0UBibIgva7MrCyJedDI1O23AA= =8BIU -----END PGP SIGNATURE----- Merge tag 'mm-nonmm-stable-2024-11-24-02-05' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: - The series "resource: A couple of cleanups" from Andy Shevchenko performs some cleanups in the resource management code - The series "Improve the copy of task comm" from Yafang Shao addresses possible race-induced overflows in the management of task_struct.comm[] - The series "Remove unnecessary header includes from {tools/}lib/list_sort.c" from Kuan-Wei Chiu adds some cleanups and a small fix to the list_sort library code and to its selftest - The series "Enhance min heap API with non-inline functions and optimizations" also from Kuan-Wei Chiu optimizes and cleans up the min_heap library code - The series "nilfs2: Finish folio conversion" from Ryusuke Konishi finishes off nilfs2's folioification - The series "add detect count for hung tasks" from Lance Yang adds more userspace visibility into the hung-task detector's activity - Apart from that, singelton patches in many places - please see the individual changelogs for details * tag 'mm-nonmm-stable-2024-11-24-02-05' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (71 commits) gdb: lx-symbols: do not error out on monolithic build kernel/reboot: replace sprintf() with sysfs_emit() lib: util_macros_kunit: add kunit test for util_macros.h util_macros.h: fix/rework find_closest() macros Improve consistency of '#error' directive messages ocfs2: fix uninitialized value in ocfs2_file_read_iter() hung_task: add docs for hung_task_detect_count hung_task: add detect count for hung tasks dma-buf: use atomic64_inc_return() in dma_buf_getfile() fs/proc/kcore.c: fix coccinelle reported ERROR instances resource: avoid unnecessary resource tree walking in __region_intersects() ocfs2: remove unused errmsg function and table ocfs2: cluster: fix a typo lib/scatterlist: use sg_phys() helper checkpatch: always parse orig_commit in fixes tag nilfs2: convert metadata aops from writepage to writepages nilfs2: convert nilfs_recovery_copy_block() to take a folio nilfs2: convert nilfs_page_count_clean_buffers() to take a folio nilfs2: remove nilfs_writepage nilfs2: convert checkpoint file to be folio-based ...	2024-11-25 16:09:48 -08:00
Kent Overstreet	840c2fbcc5	bcachefs: Fix assertion pop in bch2_ptr_swab() This runs on extents that haven't yet been validated, so we don't want to assert that we have a valid entry type. Reported-by: syzbot+4f29c3f12f864d8a8d17@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-11-12 03:46:57 -05:00
Kent Overstreet	657d4282d8	bcachefs: Fix journal_entry_dev_usage_to_text() overrun If the jset_entry_dev_usage is malformed, and too small, our nr_entries calculation will be incorrect - just bail out. Reported-by: syzbot+05d7520be047c9be86e0@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-11-12 03:46:57 -05:00
Kent Overstreet	2642084f26	bcachefs: Allow for unknown key types in backpointers fsck We can't assume that btrees only contain keys of a given type - even if they only have a single key type listed in the allowed key types for that btree; this is a forwards compatibility issue. Reported-by: syzbot+a27c3aaa3640dd3e1dfb@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-11-11 00:37:19 -05:00
Kent Overstreet	0b6ec0c5ac	bcachefs: Fix assertion pop in topology repair Fixes: `baefd3f849` ("bcachefs: btree_cache.freeable list fixes") Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-11-11 00:37:18 -05:00
Kent Overstreet	bcf77a05fb	bcachefs: Fix hidden btree errors when reading roots We silence btree errors in btree_node_scan, since it's probing and errors are expected: add a fake pass so that btree_node_scan is no longer recovery pass 0, and we don't think we're in btree node scan when reading btree roots. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-11-08 14:07:12 -05:00
Kent Overstreet	dc537189b5	bcachefs: Fix validate_bset() repair path When we truncate a bset (due to it extending past the end of the btree node), we can't skip the rest of the validation for e.g. the packed format (if it's the first bset in the node). Reported-by: syzbot+4d722d3c539d77c7bc82@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-11-08 14:07:11 -05:00
Kent Overstreet	f8f1dde686	bcachefs: Fix missing validation for bch_backpointer.level This fixes an assertion pop where we try to navigate to the target of the backpointer, and the path level isn't what we expect. Reported-by: syzbot+b17df21b4d370f2dc330@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-11-08 00:05:53 -05:00
Kent Overstreet	27a036a0c3	bcachefs: Fix bch_member.btree_bitmap_shift validation Needs to match the assert later when we resize... Reported-by: syzbot+e8eff054face85d7ea41@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-11-07 23:31:11 -05:00
Kent Overstreet	ca43f73cd1	bcachefs: bch2_btree_write_buffer_flush_going_ro() The write buffer needs to be specifically flushed when going RO: keys in the journal that haven't yet been moved to the write buffer don't have a journal pin yet. This fixes numerous syzbot bugs, all with symptoms of still doing writes after we've got RO. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-11-07 23:31:11 -05:00
Kent Overstreet	8440da9331	bcachefs: Fix UAF in __promote_alloc() error path If we error in data_update_init() after adding to the rhashtable of outstanding promotes, kfree_rcu() is required. Reported-by: Reed Riley <reed@riley.engineer> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-11-07 16:48:21 -05:00
Piotr Zalewski	f9f0a5390d	bcachefs: Change OPT_STR max to be 1 less than the size of choices array Change OPT_STR max value to be 1 less than the "ARRAY_SIZE" of "_choices" array. As a result, remove -1 from (opt->max-1) in bch2_opt_to_text. The "_choices" array is a null-terminated array, so computing the maximum using "ARRAY_SIZE" without subtracting 1 yields an incorrect result. Since bch2_opt_validate don't subtract 1, as bch2_opt_to_text does, values bigger than the actual maximum would pass through option validation. Reported-by: syzbot+bee87a0c3291c06aa8c6@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=bee87a0c3291c06aa8c6 Fixes: `63c4b25453` ("bcachefs: Better superblock opt validation") Suggested-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Piotr Zalewski <pZ010001011111@proton.me> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-11-07 16:48:21 -05:00
Kent Overstreet	baefd3f849	bcachefs: btree_cache.freeable list fixes When allocating new btree nodes, we were leaving them on the freeable list - unlocked - allowing them to be reclaimed: ouch. Additionally, bch2_btree_node_free_never_used() -> bch2_btree_node_hash_remove was putting it on the freelist, while bch2_btree_node_free_never_used() was putting it back on the btree update reserve list - ouch. Originally, the code was written to always keep btree nodes on a list - live or freeable - and this worked when new nodes were kept locked. But now with the cycle detector, we can't keep nodes locked that aren't tracked by the cycle detector; and this is fine as long as they're not reachable. We also have better and more robust leak detection now, with memory allocation profiling, so the original justification no longer applies. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-11-07 16:48:21 -05:00
Hongbo Li	9bb33852f5	bcachefs: check the invalid parameter for perf test The perf_test does not check the number of iterations and threads when it is zero. If nr_thread is 0, the perf test will keep waiting for wakekup. If iteration is 0, it will cause exception of division by zero. This can be reproduced by: echo "rand_insert 0 1" > /sys/fs/bcachefs/${uuid}/perf_test or echo "rand_insert 1 0" > /sys/fs/bcachefs/${uuid}/perf_test Fixes: `1c6fdbd8f2` ("bcachefs: Initial commit") Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-11-07 16:48:21 -05:00
Pei Xiao	93d53f1caf	bcachefs: add check NULL return of bio_kmalloc in journal_read_bucket bio_kmalloc may return NULL, will cause NULL pointer dereference. Add check NULL return for bio_kmalloc in journal_read_bucket. Signed-off-by: Pei Xiao <xiaopei01@kylinos.cn> Fixes: `ac10a9611d` ("bcachefs: Some fixes for building in userspace") Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-11-07 16:48:21 -05:00
Kent Overstreet	ef4f6c322b	bcachefs: Ensure BCH_FS_may_go_rw is set before exiting recovery If BCH_FS_may_go_rw is not yet set, it indicates to the transaction commit path that updates should be done via the list of journal replay keys. This must be set before multithreaded use commences. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-11-07 16:48:21 -05:00
Kent Overstreet	cec136d348	bcachefs: Fix topology errors on split after merge If a btree split picks a pivot that's being deleted by a btree node merge, we're going to have problems. Fix this by checking if the pivot is being deleted, the same as we check for deletions in journal replay keys. Found by single_devic.ktest small_nodes. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-11-07 16:48:21 -05:00
Kent Overstreet	d335bb3fd3	bcachefs: Ancient versions with bad bkey_formats are no longer supported Syzbot found an assertion pop, by generating an ancient filesystem version with an invalid bkey_format (with fields that can overflow) as well as packed keys that aren't representable unpacked. This breaks key comparisons in all sorts of painful ways. Filesystems have been automatically rewriting nodes with such invalid formats for years; we can safely drop support for them. Reported-by: syzbot+8a0109511de9d4b61217@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-11-07 16:48:20 -05:00
Kent Overstreet	72acab3a7c	bcachefs: Fix error handling in bch2_btree_node_prefetch() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-11-07 16:48:20 -05:00
Kent Overstreet	fd00045f38	bcachefs: Fix null ptr deref in bucket_gen_get() bucket_gen() checks if we're lookup up a valid bucket and returns NULL otherwise, but bucket_gen_get() was failing to check; other callers were correct. Also do a bit of cleanup on callers. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-11-07 16:48:17 -05:00
Kuan-Wei Chiu	75e849f3d0	bcachefs: update min_heap_callbacks to use default builtin swap Replace the swp function pointer in the min_heap_callbacks of bcachefs with NULL, allowing direct usage of the default builtin swap implementation. This modification simplifies the code and improves performance by removing unnecessary function indirection. Link: https://lkml.kernel.org/r/20241020040200.939973-10-visitorckw@gmail.com Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw> Cc: Coly Li <colyli@suse.de> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: "Liang, Kan" <kan.liang@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Sakai <msakai@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-11-05 17:12:36 -08:00
Kuan-Wei Chiu	06ce25145b	bcachefs: clean up duplicate min_heap_callbacks declarations Refactor the bcachefs code to remove multiple redundant declarations of min_heap_callbacks, ensuring that each unique declaration appears only once. Link: https://lore.kernel.org/20241017095520.GV16066@noisy.programming.kicks-ass.net Link: https://lkml.kernel.org/r/20241020040200.939973-9-visitorckw@gmail.com Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com> Suggested-by: Peter Zijlstra <peterz@infradead.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw> Cc: Coly Li <colyli@suse.de> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: "Liang, Kan" <kan.liang@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Sakai <msakai@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-11-05 17:12:36 -08:00
Kuan-Wei Chiu	92a8b224b8	lib/min_heap: introduce non-inline versions of min heap API functions Patch series "Enhance min heap API with non-inline functions and optimizations", v2. Add non-inline versions of the min heap API functions in lib/min_heap.c and updates all users outside of kernel/events/core.c to use these non-inline versions. To mitigate the performance impact of indirect function calls caused by the non-inline versions of the swap and compare functions, a builtin swap has been introduced that swaps elements based on their size. Additionally, it micro-optimizes the efficiency of the min heap by pre-scaling the counter, following the same approach as in lib/sort.c. Documentation for the min heap API has also been added to the core-api section. This patch (of 10): All current min heap API functions are marked with '__always_inline'. However, as the number of users increases, inlining these functions everywhere leads to a increase in kernel size. In performance-critical paths, such as when perf events are enabled and min heap functions are called on every context switch, it is important to retain the inline versions for optimal performance. To balance this, the original inline functions are kept, and additional non-inline versions of the functions have been added in lib/min_heap.c. Link: https://lkml.kernel.org/r/20241020040200.939973-1-visitorckw@gmail.com Link: https://lore.kernel.org/20240522161048.8d8bbc7b153b4ecd92c50666@linux-foundation.org Link: https://lkml.kernel.org/r/20241020040200.939973-2-visitorckw@gmail.com Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw> Cc: Coly Li <colyli@suse.de> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Kuan-Wei Chiu <visitorckw@gmail.com> Cc: "Liang, Kan" <kan.liang@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Sakai <msakai@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-11-05 17:12:34 -08:00
Linus Torvalds	7b83601da4	bcachefs fixes for 6.12-rc6 Various syzbot fixes, and the more notable ones: - Fix for pointers in an extent overflowing the max (16) on a filesystem with many devices: we were creating too many cached copies when moving data around. Now, we only create at most one cached copy if there's a promote target set. Caching will be a bit broken for reflinked data until 6.13: I have larger series queued up which significantly improves the plumbing for data options down into the extent (bch_extent_rebalance) to fix this. - Fix for deadlock on -ENOSPC on tiny filesystems Allocation from the partial open_bucket list wasn't correctly accounting partial open_buckets as free: this fixes the main cause of tests timing out in the automated tests. -----BEGIN PGP SIGNATURE----- iQJOBAABCAA4FiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmckUrIaHGtlbnQub3Zl cnN0cmVldEBsaW51eC5kZXYACgkQE6szbY3KbnaFYA//Qd9SD8v+ypavnaogWhqk 3bufCO8YJDV5DRQVuX/z36fia8zOKzWQGRAYvq0vF0mmgagwBE+AcBh6vvfDCxqZ m1937IXcv/hHh2FFQau9gWEItTH9dGwyQjeDjB3xaTL5ZTGsAdA9558ygf8GAVOe wD+W8Z8Qj09hAErnNS7y50t/PGbZDuG7AV2Dy2unp+fp6U0FVrZ3Z0bhFuhxcR7/ e3j49DoW4EZL7Gu1svn7nzehjWK4wx1wX7QhynPgSOVIhdj2Fc3XG76b3mBsuZF6 A/cBRKmSZsYL9MBK0vferqizqeuwlIJsvwpo/6zzukpyf8QOl+0IqPuAXFoz8vg3 vrdp9cdvzWvQNexTD2+7PYosCKoUswOvo0oIy8Iopkg4VGSreZib1sZeCPzw2FBK AZcQaQSBLKojWpYsn9Dl2AlqEHHTvnopjr5wRXiimqKe/OcA3ugIvebUw2UE2ACp /Z2ZQu615BtRYQM+dRIJJQ2CAy0F3EZxIXEXwc/yrH7kL2VBay8QCKp/k/9YYy4e Nlxxw7alb/XGgT8GQgu24tho3yMKT621dLFOaAZ7x2HtLP8T56zL/L/wKWsocW/V R8Kwqot6F1EVb3Q0LECUJottYQ+5I1Et7ZpVyOPxfqF1y7KOsuxKOmZFLO7i3Spc fg0gOt/fyKrAF3zuSmWXne8= =hzm/ -----END PGP SIGNATURE----- Merge tag 'bcachefs-2024-10-31' of git://evilpiepirate.org/bcachefs Pull bcachefs fixes from Kent Overstreet: "Various syzbot fixes, and the more notable ones: - Fix for pointers in an extent overflowing the max (16) on a filesystem with many devices: we were creating too many cached copies when moving data around. Now, we only create at most one cached copy if there's a promote target set. Caching will be a bit broken for reflinked data until 6.13: I have larger series queued up which significantly improves the plumbing for data options down into the extent (bch_extent_rebalance) to fix this. - Fix for deadlock on -ENOSPC on tiny filesystems Allocation from the partial open_bucket list wasn't correctly accounting partial open_buckets as free: this fixes the main cause of tests timing out in the automated tests" * tag 'bcachefs-2024-10-31' of git://evilpiepirate.org/bcachefs: bcachefs: Fix NULL ptr dereference in btree_node_iter_and_journal_peek bcachefs: fix possible null-ptr-deref in __bch2_ec_stripe_head_get() bcachefs: Fix deadlock on -ENOSPC w.r.t. partial open buckets bcachefs: Don't filter partial list buckets in open_buckets_to_text() bcachefs: Don't keep tons of cached pointers around bcachefs: init freespace inited bits to 0 in bch2_fs_initialize bcachefs: Fix unhandled transaction restart in fallocate bcachefs: Fix UAF in bch2_reconstruct_alloc() bcachefs: fix null-ptr-deref in have_stripes() bcachefs: fix shift oob in alloc_lru_idx_fragmentation bcachefs: Fix invalid shift in validate_sb_layout()	2024-11-01 07:21:03 -10:00
Piotr Zalewski	3726a1970b	bcachefs: Fix NULL ptr dereference in btree_node_iter_and_journal_peek Add NULL check for key returned from bch2_btree_and_journal_iter_peek in btree_node_iter_and_journal_peek to avoid NULL ptr dereference in bch2_bkey_buf_reassemble. When key returned from bch2_btree_and_journal_iter_peek is NULL it means that btree topology needs repair. Print topology error message with position at which node wasn't found, its parent node information and btree_id with level. Return error code returned by bch2_topology_error to ensure that topology error is handled properly by recovery. Reported-by: syzbot+005ef9aa519f30d97657@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=005ef9aa519f30d97657 Fixes: `5222a4607c` ("bcachefs: BTREE_ITER_WITH_JOURNAL") Suggested-by: Alan Huang <mmpgouride@gmail.com> Suggested-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Piotr Zalewski <pZ010001011111@proton.me> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-29 06:34:11 -04:00
Gaosheng Cui	ca959e328b	bcachefs: fix possible null-ptr-deref in __bch2_ec_stripe_head_get() The function ec_new_stripe_head_alloc() returns nullptr if kzalloc() fails. It is crucial to verify its return value before dereferencing it to avoid a potential nullptr dereference. Fixes: `035d72f72c` ("bcachefs: bch2_ec_stripe_head_get() now checks for change in rw devices") Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-29 06:34:10 -04:00
Kent Overstreet	778ac324cc	bcachefs: Fix deadlock on -ENOSPC w.r.t. partial open buckets Open buckets on the partial list should not count as allocated when we're trying to allocate from the partial list. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-29 06:34:10 -04:00
Kent Overstreet	e0fafac5c4	bcachefs: Don't filter partial list buckets in open_buckets_to_text() these are an important source of stranded buckets we need to be able to watch Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-29 06:34:10 -04:00
Kent Overstreet	a34eef6dd1	bcachefs: Don't keep tons of cached pointers around We had a bug report where the data update path was creating an extent that failed to validate because it had too many pointers; almost all of them were cached. To fix this, we have: - want_cached_ptr(), a new helper that checks if we even want a cached pointer (is on appropriate target, device is readable). - bch2_extent_set_ptr_cached() now only sets a pointer cached if we want it. - bch2_extent_normalize_by_opts() now ensures that we only have a single cached pointer that we want. While working on this, it was noticed that this doesn't work well with reflinked data and per-file options. Another patch series is coming that plumbs through additional io path options through bch_extent_rebalance, with improved option handling. Reported-by: Reed Riley <reed@riley.engineer> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-29 06:34:10 -04:00
Piotr Zalewski	3fd27e9c57	bcachefs: init freespace inited bits to 0 in bch2_fs_initialize Initialize freespace_initialized bits to 0 in member's flags and update member's cached version for each device in bch2_fs_initialize. It's possible for the bits to be set to 1 before fs is initialized and if call to bch2_trans_mark_dev_sbs (just before bch2_fs_freespace_init) fails bits remain to be 1 which can later indirectly trigger BUG condition in bch2_bucket_alloc_freelist during shutdown. Reported-by: syzbot+2b6a17991a6af64f9489@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=2b6a17991a6af64f9489 Fixes: `bbe682c767` ("bcachefs: Ensure devices are always correctly initialized") Suggested-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Piotr Zalewski <pZ010001011111@proton.me> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-29 06:34:10 -04:00
Kent Overstreet	c1fa854acc	bcachefs: Fix unhandled transaction restart in fallocate This used to not matter, but now we're being more strict. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-29 06:34:10 -04:00
Kent Overstreet	8e910ca20e	bcachefs: Fix UAF in bch2_reconstruct_alloc() write_super() -> sb_counters_from_cpu() may reallocate the superblock Reported-by: syzbot+9fc4dac4775d07bcfe34@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-25 13:17:23 -04:00
Jeongjun Park	a25a83de45	bcachefs: fix null-ptr-deref in have_stripes() c->btree_roots_known[i].b can be NULL. In this case, a NULL pointer dereference occurs, so you need to add code to check the variable. Reported-by: syzbot+b468b9fef56949c3b528@syzkaller.appspotmail.com Fixes: `7773df19c3` ("bcachefs: metadata version bucket_stripe_sectors") Signed-off-by: Jeongjun Park <aha310510@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-25 13:17:06 -04:00
Jeongjun Park	5c41f75d1b	bcachefs: fix shift oob in alloc_lru_idx_fragmentation The size of a.data_type is set abnormally large, causing shift-out-of-bounds. To fix this, we need to add validation on a.data_type in alloc_lru_idx_fragmentation(). Reported-by: syzbot+7f45fa9805c40db3f108@syzkaller.appspotmail.com Fixes: `260af1562e` ("bcachefs: Kill alloc_v4.fragmentation_lru") Signed-off-by: Jeongjun Park <aha310510@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-24 17:41:43 -04:00
Gianfranco Trad	2045fc4295	bcachefs: Fix invalid shift in validate_sb_layout() Add check on layout->sb_max_size_bits against BCH_SB_LAYOUT_SIZE_BITS_MAX to prevent UBSAN shift-out-of-bounds in validate_sb_layout(). Reported-by: syzbot+089fad5a3a5e77825426@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=089fad5a3a5e77825426 Fixes: `03ef80b469` ("bcachefs: Ignore unknown mount options") Tested-by: syzbot+089fad5a3a5e77825426@syzkaller.appspotmail.com Signed-off-by: Gianfranco Trad <gianf.trad@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-24 17:41:43 -04:00
Linus Torvalds	c1e822754c	bcachefs fixes for 6.12-rc5 Lots of hotfixes: - transaction restart injection has been shaking out a few things - fix a data corruption in the buffered write path on -ENOSPC, found by xfstests generic/299 - Some small show_options fixes - Repair mismatches in inode hash type, seed: different snapshot versions of an inode must have the same hash/type seed, used for directory entries and xattrs. We were checking the hash seed, but not the type, and a user contributed a filesystem where the hash type on one inode had somehow been flipped; these fixes allow his filesystem to repair. Additionally, the hash type flip made some directory entries invisible, which were then recreated by userspace; so the hash check code now checks for duplicate non dangling dirents, and renames one of them if necessary. - Don't use wait_event_interruptible() in recovery: this fixes some filesystems failing to mount with -ERESTARTSYS - Workaround for kvmalloc not supporting > INT_MAX allocations, causing an -ENOMEM when allocating the sorted array of journal keys: this allows a 75 TB filesystem to mount - Make sure bch_inode_unpacked.bi_snapshot is set in the old inode compat path: this alllows Marcin's filesystem (in use since before 6.7) to repair and mount. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmcX4vYACgkQE6szbY3K bnbywxAArBfIJfshWq5Wk9WztenzUmyUmV2HIgntT/iN4ty4eIpZ26VSvHcGvgkU j3wx+OuxMTPBGc3fjUS+gALf/BGcQEgh6oPZCV+6M3kasTzNzG2jYOCkLqKbpcO1 V5n/Le/SM1X2grkgTm/H+TulGHNgG9gJ2U4kjihroJrTbTesZhzcW/qlz6RWo7U1 02NvLop4WE9M6WaW9RzsHK2llRUAl2Z3oRMuwNz3IIijCpm98STGD4gyvGoMV2b8 qNsXjy7b2lkYObKI29yWF0caRzWK1LRz79afRlnNVSJb6DK1QB83ms5Qa8rprCU4 uOq0wsGWyg6lzwQ19X+2TvUYABopVk2HXLlzTO/lJrWeMTuYJVPZ7KZi3l6ubw5T GIsAD5qMdCm8E5nXX8hG//0rOIl6QK288+zMQyRCvAkCL+iN2k0TU8qKAEEC44de vj6ZyNqbuLR39LLz9K09ZhzIZGk09ELpxOJ2Wwwj4ZFriwphWDtFgBtBUpNo/KWA inBfq2lZJsmNjfns9vCqOmNOStOJxXnyMOR25sTv7wM69QPGkl41dPY3oeuG8lRk cU/qJQKlpTKJbFeXiEKWKDnMzWxOnovqLFC0tKu2qAYM6vAz+AtwTXgthVFGh21U QoUDbsnQCCixMkS2AksCo7nivLrxmV/EeYm5pgeiU38VdA5ofBM= =OpYN -----END PGP SIGNATURE----- Merge tag 'bcachefs-2024-10-22' of https://github.com/koverstreet/bcachefs Pull bcachefs fixes from Kent Overstreet: "Lots of hotfixes: - transaction restart injection has been shaking out a few things - fix a data corruption in the buffered write path on -ENOSPC, found by xfstests generic/299 - Some small show_options fixes - Repair mismatches in inode hash type, seed: different snapshot versions of an inode must have the same hash/type seed, used for directory entries and xattrs. We were checking the hash seed, but not the type, and a user contributed a filesystem where the hash type on one inode had somehow been flipped; these fixes allow his filesystem to repair. Additionally, the hash type flip made some directory entries invisible, which were then recreated by userspace; so the hash check code now checks for duplicate non dangling dirents, and renames one of them if necessary. - Don't use wait_event_interruptible() in recovery: this fixes some filesystems failing to mount with -ERESTARTSYS - Workaround for kvmalloc not supporting > INT_MAX allocations, causing an -ENOMEM when allocating the sorted array of journal keys: this allows a 75 TB filesystem to mount - Make sure bch_inode_unpacked.bi_snapshot is set in the old inode compat path: this alllows Marcin's filesystem (in use since before 6.7) to repair and mount" * tag 'bcachefs-2024-10-22' of https://github.com/koverstreet/bcachefs: (26 commits) bcachefs: Set bch_inode_unpacked.bi_snapshot in old inode path bcachefs: Mark more errors as AUTOFIX bcachefs: Workaround for kvmalloc() not supporting > INT_MAX allocations bcachefs: Don't use wait_event_interruptible() in recovery bcachefs: Fix __bch2_fsck_err() warning bcachefs: fsck: Improve hash_check_key() bcachefs: bch2_hash_set_or_get_in_snapshot() bcachefs: Repair mismatches in inode hash seed, type bcachefs: Add hash seed, type to inode_to_text() bcachefs: INODE_STR_HASH() for bch_inode_unpacked bcachefs: Run in-kernel offline fsck without ratelimit errors bcachefs: skip mount option handle for empty string. bcachefs: fix incorrect show_options results bcachefs: Fix data corruption on -ENOSPC in buffered write path bcachefs: bch2_folio_reservation_get_partial() is now better behaved bcachefs: fix disk reservation accounting in bch2_folio_reservation_get() bcachefS: ec: fix data type on stripe deletion bcachefs: Don't use commit_do() unnecessarily bcachefs: handle restarts in bch2_bucket_io_time_reset() bcachefs: fix restart handling in __bch2_resume_logged_op_finsert() ...	2024-10-24 12:38:59 -07:00
Kent Overstreet	a069f01479	bcachefs: Set bch_inode_unpacked.bi_snapshot in old inode path This fixes a fsck bug on a very old filesystem (pre mainline merge). Fixes: `72350ee0ea` ("bcachefs: Kill snapshot arg to fsck_write_inode()") Reported-by: Marcin Mirosław <marcin@mejor.pl> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-20 18:09:09 -04:00
Kent Overstreet	e04ee86089	bcachefs: Mark more errors as AUTOFIX Reported-by: Marcin Mirosław <marcin@mejor.pl> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-20 18:08:53 -04:00
Kent Overstreet	f0d3302073	bcachefs: Workaround for kvmalloc() not supporting > INT_MAX allocations kvmalloc() doesn't support allocations > INT_MAX, but vmalloc() does - the limit should be lifted, but we can work around this for now. A user with a 75 TB filesystem reported the following journal replay error: https://github.com/koverstreet/bcachefs/issues/769 In journal replay we have to sort and dedup all the keys from the journal, which means we need a large contiguous allocation. Given that the user has 128GB of ram, the 2GB limit on allocation size has become far too small. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-20 16:50:14 -04:00
Kent Overstreet	3956ff8bc2	bcachefs: Don't use wait_event_interruptible() in recovery Fix a bug where mount was failing with -ERESTARTSYS: https://github.com/koverstreet/bcachefs/issues/741 We only want the interruptible wait when called from fsync. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-20 16:50:14 -04:00
Kent Overstreet	eb5db64c45	bcachefs: Fix __bch2_fsck_err() warning We only warn about having a btree_trans that wasn't passed in if we'll be prompting. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-20 16:50:14 -04:00
Kent Overstreet	bc6d2d1041	bcachefs: fsck: Improve hash_check_key() hash_check_key() checks and repairs the hash table btrees: dirents and xattrs are open addressing hash tables. We recently had a corruption reported where the hash type on an inode somehow got flipped, which made the existing dirents invisible and allowed new ones to be created with the same name. Now, hash_check_key() can repair duplicates: it will delete one of them, if it has an xattr or dangling dirent, but if it has two valid dirents one of them gets renamed. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-18 00:49:48 -04:00
Kent Overstreet	dc96656b20	bcachefs: bch2_hash_set_or_get_in_snapshot() Add a variant of bch2_hash_set_in_snapshot() that returns the existing key on -EEXIST. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-18 00:49:48 -04:00
Kent Overstreet	15a3836c8e	bcachefs: Repair mismatches in inode hash seed, type Different versions of the same inode (same inode number, different snapshot ID) must have the same hash seed and type - lookups require this, since they see keys from different snapshots simultaneously. To repair we only need to make the inodes consistent, hash_check_key() will do the rest. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-18 00:49:48 -04:00
Kent Overstreet	d8e879377f	bcachefs: Add hash seed, type to inode_to_text() This helped with discovering some filesystem corruption fsck has having trouble with: the str_hash type had gotten flipped on one snapshot's version of an inode. All versions of a given inode number have the same hash seed and hash type, since lookups will be done with a single hash/seed and type and see dirents/xattrs from multiple snapshots. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-18 00:49:48 -04:00
Kent Overstreet	78cf0ae636	bcachefs: INODE_STR_HASH() for bch_inode_unpacked Trivial cleanup - add a normal BITMASK() helper for bch_inode_unpacked. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-18 00:49:48 -04:00
Kent Overstreet	b96f8cd387	bcachefs: Run in-kernel offline fsck without ratelimit errors Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-18 00:49:48 -04:00
Hongbo Li	489ecc4cfd	bcachefs: skip mount option handle for empty string. The options parse in get_tree will split the options buffer, it will get the empty string for last one by strsep(). After commit ea0eeb89b1d5 ("bcachefs: reject unknown mount options") is merged, unknown mount options is not allowed (here is empty string), and this causes this errors. This can be reproduced just by the following steps: bcachefs format /dev/loop mount -t bcachefs -o metadata_target=loop1 /dev/loop1 /mnt/bcachefs/ Fixes: ea0eeb89b1d5 ("bcachefs: reject unknown mount options") Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-18 00:49:48 -04:00
Hongbo Li	07cf8bac2d	bcachefs: fix incorrect show_options results When call show_options in bcachefs, the options buffer is appeneded to the seq variable. In fact, it requires an additional comma to be appended first. This will affect the remount process when reading existing mount options. Fixes: 9305cf91d05e ("bcachefs: bch2_opts_to_text()") Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-18 00:49:48 -04:00
Kent Overstreet	97535cd84f	bcachefs: Fix data corruption on -ENOSPC in buffered write path Found by generic/299: When we have to truncate a write due to -ENOSPC, we may have to read in the folio we're writing to if we're now no longer doing a complete write to a !uptodate folio. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-18 00:49:48 -04:00
Kent Overstreet	335d318ef5	bcachefs: bch2_folio_reservation_get_partial() is now better behaved bch2_folio_reservation_get_partial(), on partial success, will now return a reservation that's aligned to the filesystem blocksize. This is a partial fix for fstests generic/299 - fio verify is badly behaved in the presence of short writes that aren't aligned to its blocksize. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-18 00:49:48 -04:00
Kent Overstreet	81e0b6c7c1	bcachefs: fix disk reservation accounting in bch2_folio_reservation_get() bch2_disk_reservation_put() zeroes out the reservation - oops. This fixes a disk reservation leak when getting a quota reservation returned an error. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-18 00:49:48 -04:00
Kent Overstreet	4007bbb203	bcachefS: ec: fix data type on stripe deletion Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-18 00:49:48 -04:00
Kent Overstreet	a0d11feefb	bcachefs: Don't use commit_do() unnecessarily Using commit_do() to call alloc_sectors_start_trans() breaks when we're randomly injecting transaction restarts - the restart in the commit causes us to leak the lock that alloc_sectorS_start_trans() takes. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-18 00:49:48 -04:00
Kent Overstreet	6bee2a04c5	bcachefs: handle restarts in bch2_bucket_io_time_reset() bch2_bucket_io_time_reset() doesn't need to succeed, which is why it didn't previously retry on transaction restart - but we're now treating these as errors. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-18 00:49:48 -04:00
Kent Overstreet	29fd10a36a	bcachefs: fix restart handling in __bch2_resume_logged_op_finsert() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-18 00:49:48 -04:00
Kent Overstreet	d8b5059774	bcachefs: fix restart handling in bch2_alloc_write_key() This is ugly: We may discover in alloc_write_key that the data type we calculated is wrong, because BCH_DATA_need_discard is checked/set elsewhere, and the disk accounting counters we calculated need to be updated. But bch2_alloc_key_to_dev_counters(..., BTREE_TRIGGER_gc) is not safe w.r.t. transaction restarts, so we need to propagate the fixup back to our gc state in case we take a transaction restart. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-18 00:49:47 -04:00
Kent Overstreet	7ee4be9c62	bcachefs: fix restart handling in bch2_do_invalidates_work() this one is fairly harmless since the invalidate worker will just run again later if it needs to, but still worth fixing Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-18 00:49:47 -04:00
Kent Overstreet	028f3c1d9b	bcachefs: fix missing restart handling in bch2_read_retry_nodecode() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-18 00:49:47 -04:00
Kent Overstreet	e1c4d2f082	bcachefs: fix restart handling in bch2_fiemap() We were leaking transaction restart errors to userspace. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-18 00:49:47 -04:00
Kent Overstreet	94bdeec8f5	bcachefs: fix bch2_hash_delete() error path we were exiting an iterator that hadn't been initialized Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-18 00:49:47 -04:00
Kent Overstreet	74ec2f3024	bcachefs: fix restart handling in bch2_rename2() This should be impossible to hit in practice; the first lookup within a transaction won't return a restart due to lock ordering, but we're adding fault injection for transaction restarts and shaking out bugs. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-18 00:49:47 -04:00
Linus Torvalds	bdc7276512	bcachefs fixes for 6.12-rc4 - New metadata version inode_has_child_snapshots This fixes bugs with handling of unlinked inodes + snapshots, in particular when an inode is reattached after taking a snapshot; deleted inodes now get correctly cleaned up across snapshots. - Disk accounting rewrite fixes - validation fixes for when a device has been removed - fix journal replay failing with "journal_reclaim_would_deadlock" - Some more small fixes for erasure coding + device removal - Assorted small syzbot fixes -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmcNw4UACgkQE6szbY3K bnbSzBAAmSCCQCqRwnFSp4OdNSlBK9q1e5WsbKOqHgtoXZU/mOUBe/5bnPPqm6Mg GkTc7FqVOs/95/rEDKXw2LneFgxRrt8MriJCUdXZvV5fC2R4Kdl0TkwABtMtm2Ae wp37n6iQO81j4uZHfOj67RzC2NRo7dMdun5HnQPRBTKzyuDaZXqwjMmF2LmaeODh oiBFUvD5nFBo5XvXPABBin6xpdquHO+6ZWf6SFD4+iRe11NrJAOAIS/crJvxsFfr I/X152Z+gzKPE+NhANKMxlHyNnVGo7iHUqhUjVuI4SSaXb9Ap6k4sXgfoIzncR17 GA5qWtaNS1W72+awT3R2EaF9Tqi+Vng2RVfxxQ04giImnBq0eziOjlZ26enOE0LU 0ZZrBFzqpItqYbNnzPissHuKb1mAQGPWy6kxoGIrqDKbichA7lzyWDz2lgEE85Sx E1mvHwYbKhUuLC4c4460hueGVUgMWmjqM3E8oex+oNDpauPB+/bnYkcgZEG2RBla +ZlDL28fg4fxtqlUrOQeonQ1RecGNdRMJz7xiGnkYU9rQpUuv8QwFiBZGAbLP6zn 6fbFZGxS/pO95sY7GmAtKz7ZgKxJQCzII4s+Oht5AgOvoBlPjAiol1UbwYadYQxz HKF+WBaPC9z/L6JjP+gx+uUzTWRIfBmhHylhWbKr4vLGfx3Jc1g= =Rkq2 -----END PGP SIGNATURE----- Merge tag 'bcachefs-2024-10-14' of git://evilpiepirate.org/bcachefs Pull bcachefs fixes from Kent Overstreet: - New metadata version inode_has_child_snapshots This fixes bugs with handling of unlinked inodes + snapshots, in particular when an inode is reattached after taking a snapshot; deleted inodes now get correctly cleaned up across snapshots. - Disk accounting rewrite fixes - validation fixes for when a device has been removed - fix journal replay failing with "journal_reclaim_would_deadlock" - Some more small fixes for erasure coding + device removal - Assorted small syzbot fixes * tag 'bcachefs-2024-10-14' of git://evilpiepirate.org/bcachefs: (27 commits) bcachefs: Fix sysfs warning in fstests generic/730,731 bcachefs: Handle race between stripe reuse, invalidate_stripe_to_dev bcachefs: Fix kasan splat in new_stripe_alloc_buckets() bcachefs: Add missing validation for bch_stripe.csum_granularity_bits bcachefs: Fix missing bounds checks in bch2_alloc_read() bcachefs: fix uaf in bch2_dio_write_done() bcachefs: Improve check_snapshot_exists() bcachefs: Fix bkey_nocow_lock() bcachefs: Fix accounting replay flags bcachefs: Fix invalid shift in member_to_text() bcachefs: Fix bch2_have_enough_devs() for BCH_SB_MEMBER_INVALID bcachefs: __wait_for_freeing_inode: Switch to wait_bit_queue_entry bcachefs: Check if stuck in journal_res_get() closures: Add closure_wait_event_timeout() bcachefs: Fix state lock involved deadlock bcachefs: Fix NULL pointer dereference in bch2_opt_to_text bcachefs: Release transaction before wake up bcachefs: add check for btree id against max in try read node bcachefs: Disk accounting device validation fixes bcachefs: bch2_inode_or_descendents_is_open() ...	2024-10-15 11:06:45 -07:00
Kent Overstreet	5e3b72324d	bcachefs: Fix sysfs warning in fstests generic/730,731 sysfs warns if we're removing a symlink from a directory that's no longer in sysfs; this is triggered by fstests generic/730, which simulates hot removal of a block device. This patch is however not a correct fix, since checking kobj->state_in_sysfs on a kobj owned by another subsystem is racy. A better fix would be to add the appropriate check to sysfs_remove_link() - and sysfs_create_link() as well. But kobject_add_internal()/kobject_del() do not as of today have locking that would support that. Note that the block/holder.c code appears to be subject to this race as well. Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-14 05:43:01 -04:00
Kent Overstreet	cb6055e66f	bcachefs: Handle race between stripe reuse, invalidate_stripe_to_dev When creating a new stripe, we may reuse an existing stripe that has some empty and some nonempty blocks. Generally, the existing stripe won't change underneath us - except for block sector counts, which we copy to the new key in ec_stripe_key_update. But the device removal path can now invalidate stripe pointers to a device, and that can race with stripe reuse. Change ec_stripe_key_update() to check for and resolve this inconsistency. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-13 22:03:03 -04:00
Kent Overstreet	b1e562265e	bcachefs: Fix kasan splat in new_stripe_alloc_buckets() Update for BCH_SB_MEMBER_INVALID. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-13 22:03:01 -04:00
Kent Overstreet	9f25dbe0bf	bcachefs: Add missing validation for bch_stripe.csum_granularity_bits Reported-by: syzbot+f8c98a50c323635be65d@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-13 17:55:33 -04:00
Kent Overstreet	a319aeaebb	bcachefs: Fix missing bounds checks in bch2_alloc_read() We were checking that the alloc key was for a valid device, but not a valid bucket. This is the upgrade path from versions prior to bcachefs being mainlined. Reported-by: syzbot+a1b59c8e1a3f022fd301@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-13 17:55:33 -04:00
Kent Overstreet	573ddcdc56	bcachefs: fix uaf in bch2_dio_write_done() Reported-by: syzbot+19ad84d5133871207377@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-13 17:55:33 -04:00
Kent Overstreet	c986dd7ecb	bcachefs: Improve check_snapshot_exists() Check if we have snapshot_trees or subvolumes that refer to the snapshot node being reconstructed, and use them. With this, the kill_btree_root test that blows away the snapshots btree now passes, and we're able to successfully reconstruct. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-12 05:02:48 -04:00
Kent Overstreet	9183c2b11e	bcachefs: Fix bkey_nocow_lock() This fixes an assertion pop in nocow_locking.c 00243 kernel BUG at fs/bcachefs/nocow_locking.c:41! 00243 Internal error: Oops - BUG: 00000000f2000800 [#1] SMP 00243 Modules linked in: 00243 Hardware name: linux,dummy-virt (DT) 00243 pstate: 60001005 (nZCv daif -PAN -UAO -TCO -DIT +SSBS BTYPE=--) 00244 pc : bch2_bucket_nocow_unlock (/home/testdashboard/linux-7/fs/bcachefs/nocow_locking.c:41) 00244 lr : bkey_nocow_lock (/home/testdashboard/linux-7/fs/bcachefs/data_update.c:79) 00244 sp : ffffff80c82373b0 00244 x29: ffffff80c82373b0 x28: ffffff80e08958c0 x27: ffffff80e0880000 00244 x26: ffffff80c8237a98 x25: 00000000000000a0 x24: ffffff80c8237ab0 00244 x23: 00000000000000c0 x22: 0000000000000008 x21: 0000000000000000 00244 x20: ffffff80c8237a98 x19: 0000000000000018 x18: 0000000000000000 00244 x17: 0000000000000000 x16: 000000000000003f x15: 0000000000000000 00244 x14: 0000000000000008 x13: 0000000000000018 x12: 0000000000000000 00244 x11: 0000000000000000 x10: ffffff80e0880000 x9 : ffffffc0803ac1a4 00244 x8 : 0000000000000018 x7 : ffffff80c8237a88 x6 : ffffff80c8237ab0 00244 x5 : ffffff80e08988d0 x4 : 00000000ffffffff x3 : 0000000000000000 00244 x2 : 0000000000000004 x1 : 0003000000000d1e x0 : ffffff80e08988c0 00244 Call trace: 00244 bch2_bucket_nocow_unlock (/home/testdashboard/linux-7/fs/bcachefs/nocow_locking.c:41) 00245 bch2_data_update_init (/home/testdashboard/linux-7/fs/bcachefs/data_update.c:627 (discriminator 1)) 00245 promote_alloc.isra.0 (/home/testdashboard/linux-7/fs/bcachefs/io_read.c:242 /home/testdashboard/linux-7/fs/bcachefs/io_read.c:304) 00245 __bch2_read_extent (/home/testdashboard/linux-7/fs/bcachefs/io_read.c:949) 00246 __bch2_read (/home/testdashboard/linux-7/fs/bcachefs/io_read.c:1215) 00246 bch2_direct_IO_read (/home/testdashboard/linux-7/fs/bcachefs/fs-io-direct.c:132) 00246 bch2_read_iter (/home/testdashboard/linux-7/fs/bcachefs/fs-io-direct.c:201) 00247 aio_read.constprop.0 (/home/testdashboard/linux-7/fs/aio.c:1602) 00247 io_submit_one.constprop.0 (/home/testdashboard/linux-7/fs/aio.c:2003 /home/testdashboard/linux-7/fs/aio.c:2052) 00248 __arm64_sys_io_submit (/home/testdashboard/linux-7/fs/aio.c:2111 /home/testdashboard/linux-7/fs/aio.c:2081 /home/testdashboard/linux-7/fs/aio.c:2081) 00248 invoke_syscall.constprop.0 (/home/testdashboard/linux-7/arch/arm64/include/asm/syscall.h:61 /home/testdashboard/linux-7/arch/arm64/kernel/syscall.c:54) 00248 ========= FAILED TIMEOUT tiering_variable_buckets_replicas in 1200s Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-12 05:01:52 -04:00
Kent Overstreet	672f75238e	bcachefs: Fix accounting replay flags BCH_TRANS_COMMIT_journal_reclaim without BCH_WATERMARK_reclaim means "return an error if low on journal space" - but accounting replay must succeed. Fixes https://github.com/koverstreet/bcachefs/issues/656 Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-12 03:02:16 -04:00
Kent Overstreet	c1bd21bb65	bcachefs: Fix invalid shift in member_to_text() Reported-by: syzbot+064ce437a1ad63d3f6ef@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-12 03:02:16 -04:00
Kent Overstreet	7d84d9f449	bcachefs: Fix bch2_have_enough_devs() for BCH_SB_MEMBER_INVALID This fixes a kasan splat in the ec device removal tests. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-11 22:20:51 -04:00
Kent Overstreet	3d1ea1c0ae	bcachefs: kill retry_estale() in bch2_ioctl_subvolume_create() this was likely originally cribbed, and has been dead code, and Al is working on removing it from the tree. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-11 19:16:40 -04:00
Kent Overstreet	3b80552e70	bcachefs: __wait_for_freeing_inode: Switch to wait_bit_queue_entry inode_bit_waitqueue() is changing - this update clears the way for sched changes. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-09 16:58:18 -04:00
Kent Overstreet	a7e2dd58fb	bcachefs: Check if stuck in journal_res_get() Like how we already do when the allocator seems to be stuck, check if we're waiting too long for a journal reservation and print some debug info. This is specifically to track down https://github.com/koverstreet/bcachefs/issues/656 which is showing up in userspace where we don't have sysfs/debugfs to get the journal debug info. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-09 16:57:59 -04:00
Alan Huang	9205d24cf7	bcachefs: Fix state lock involved deadlock We increased write ref, if the fs went to RO, that would lead to a deadlock, it actually happens: 00171 ========= TEST generic/279 00171 00172 bcachefs (vdb): starting version 1.12: rebalance_work_acct_fix opts=nocow 00172 bcachefs (vdb): recovering from clean shutdown, journal seq 35 00172 bcachefs (vdb): accounting_read... done 00172 bcachefs (vdb): alloc_read... done 00172 bcachefs (vdb): stripes_read... done 00172 bcachefs (vdb): snapshots_read... done 00172 bcachefs (vdb): journal_replay... done 00172 bcachefs (vdb): resume_logged_ops... done 00172 bcachefs (vdb): going read-write 00172 bcachefs (vdb): done starting filesystem 00172 FSTYP -- bcachefs 00172 PLATFORM -- Linux/aarch64 farm3-kvm 6.11.0-rc1-ktest-g3e290a0b8e34 #7030 SMP Tue Oct 8 14:15:12 UTC 2024 00172 MKFS_OPTIONS -- --nocow /dev/vdc 00172 MOUNT_OPTIONS -- /dev/vdc /mnt/scratch 00172 00172 bcachefs (vdc): starting version 1.12: rebalance_work_acct_fix opts=nocow 00172 bcachefs (vdc): initializing new filesystem 00172 bcachefs (vdc): going read-write 00172 bcachefs (vdc): marking superblocks 00172 bcachefs (vdc): initializing freespace 00172 bcachefs (vdc): done initializing freespace 00172 bcachefs (vdc): reading snapshots table 00172 bcachefs (vdc): reading snapshots done 00172 bcachefs (vdc): done starting filesystem 00173 bcachefs (vdc): shutting down 00173 bcachefs (vdc): going read-only 00173 bcachefs (vdc): finished waiting for writes to stop 00173 bcachefs (vdc): flushing journal and stopping allocators, journal seq 4 00173 bcachefs (vdc): flushing journal and stopping allocators complete, journal seq 6 00173 bcachefs (vdc): shutdown complete, journal seq 7 00173 bcachefs (vdc): marking filesystem clean 00173 bcachefs (vdc): shutdown complete 00173 bcachefs (vdb): shutting down 00173 bcachefs (vdb): going read-only 00361 INFO: task umount:6180 blocked for more than 122 seconds. 00361 Not tainted 6.11.0-rc1-ktest-g3e290a0b8e34 #7030 00361 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 00361 task:umount state:D stack:0 pid:6180 tgid:6180 ppid:6176 flags:0x00000004 00361 Call trace: 00362 __switch_to (arch/arm64/kernel/process.c:556) 00362 __schedule (kernel/sched/core.c:5191 kernel/sched/core.c:6529) 00363 schedule (include/asm-generic/bitops/generic-non-atomic.h:128 include/linux/thread_info.h:192 include/linux/sched.h:2084 kernel/sched/core.c:6608 kernel/sched/core.c:6621) 00365 bch2_fs_read_only (fs/bcachefs/super.c:346 (discriminator 41)) 00367 __bch2_fs_stop (fs/bcachefs/super.c:620) 00368 bch2_put_super (fs/bcachefs/fs.c:1942) 00369 generic_shutdown_super (include/linux/list.h:373 (discriminator 2) fs/super.c:650 (discriminator 2)) 00371 bch2_kill_sb (fs/bcachefs/fs.c:2170) 00372 deactivate_locked_super (fs/super.c:434 fs/super.c:475) 00373 deactivate_super (fs/super.c:508) 00374 cleanup_mnt (fs/namespace.c:250 fs/namespace.c:1374) 00376 __cleanup_mnt (fs/namespace.c:1381) 00376 task_work_run (include/linux/sched.h:2024 kernel/task_work.c:224) 00377 do_notify_resume (include/linux/resume_user_mode.h:50 arch/arm64/kernel/entry-common.c:151) 00377 el0_svc (arch/arm64/include/asm/daifflags.h:28 arch/arm64/kernel/entry-common.c:171 arch/arm64/kernel/entry-common.c:178 arch/arm64/kernel/entry-common.c:713) 00377 el0t_64_sync_handler (arch/arm64/kernel/entry-common.c:731) 00378 el0t_64_sync (arch/arm64/kernel/entry.S:598) 00378 INFO: task tee:6182 blocked for more than 122 seconds. 00378 Not tainted 6.11.0-rc1-ktest-g3e290a0b8e34 #7030 00378 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 00378 task:tee state:D stack:0 pid:6182 tgid:6182 ppid:533 flags:0x00000004 00378 Call trace: 00378 __switch_to (arch/arm64/kernel/process.c:556) 00378 __schedule (kernel/sched/core.c:5191 kernel/sched/core.c:6529) 00378 schedule (include/asm-generic/bitops/generic-non-atomic.h:128 include/linux/thread_info.h:192 include/linux/sched.h:2084 kernel/sched/core.c:6608 kernel/sched/core.c:6621) 00378 schedule_preempt_disabled (kernel/sched/core.c:6680) 00379 rwsem_down_read_slowpath (kernel/locking/rwsem.c:1073 (discriminator 1)) 00379 down_read (kernel/locking/rwsem.c:1529) 00381 bch2_gc_gens (fs/bcachefs/sb-members.h:77 fs/bcachefs/sb-members.h:88 fs/bcachefs/sb-members.h:128 fs/bcachefs/btree_gc.c:1240) 00383 bch2_fs_store_inner (fs/bcachefs/sysfs.c:473) 00385 bch2_fs_internal_store (fs/bcachefs/sysfs.c:417 fs/bcachefs/sysfs.c:580 fs/bcachefs/sysfs.c:576) 00386 sysfs_kf_write (fs/sysfs/file.c:137) 00387 kernfs_fop_write_iter (fs/kernfs/file.c:334) 00389 vfs_write (fs/read_write.c:497 fs/read_write.c:590) 00390 ksys_write (fs/read_write.c:643) 00391 __arm64_sys_write (fs/read_write.c:652) 00391 invoke_syscall.constprop.0 (arch/arm64/include/asm/syscall.h:61 arch/arm64/kernel/syscall.c:54) 00392 do_el0_svc (include/linux/thread_info.h:127 (discriminator 2) arch/arm64/kernel/syscall.c:140 (discriminator 2) arch/arm64/kernel/syscall.c:151 (discriminator 2)) 00392 el0_svc (arch/arm64/include/asm/irqflags.h:55 arch/arm64/include/asm/irqflags.h:76 arch/arm64/kernel/entry-common.c:165 arch/arm64/kernel/entry-common.c:178 arch/arm64/kernel/entry-common.c:713) 00392 el0t_64_sync_handler (arch/arm64/kernel/entry-common.c:731) 00392 el0t_64_sync (arch/arm64/kernel/entry.S:598) Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-09 16:42:54 -04:00
Mohammed Anees	a30f32222d	bcachefs: Fix NULL pointer dereference in bch2_opt_to_text This patch adds a bounds check to the bch2_opt_to_text function to prevent NULL pointer dereferences when accessing the opt->choices array. This ensures that the index used is within valid bounds before dereferencing. The new version enhances the readability. Reported-and-tested-by: syzbot+37186860aa7812b331d5@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=37186860aa7812b331d5 Signed-off-by: Mohammed Anees <pvmohammedanees2003@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-09 16:42:53 -04:00
Alan Huang	a154154148	bcachefs: Release transaction before wake up We will get this if we wake up first: Kernel panic - not syncing: btree_node_write_done leaked btree_trans since there are still transactions waiting for cycle detectors after BTREE_NODE_write_in_flight is cleared. Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-09 16:42:53 -04:00
Piotr Zalewski	0151d10a48	bcachefs: add check for btree id against max in try read node Add check for read node's btree_id against BTREE_ID_NR_MAX in try_read_btree_node to prevent triggering EBUG_ON condition in bch2_btree_id_root[1]. [1] https://syzkaller.appspot.com/bug?extid=cf7b2215b5d70600ec00 Reported-by: syzbot+cf7b2215b5d70600ec00@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=cf7b2215b5d70600ec00 Fixes: `4409b8081d` ("bcachefs: Repair pass for scanning for btree nodes") Signed-off-by: Piotr Zalewski <pZ010001011111@proton.me> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-09 16:42:53 -04:00
Kent Overstreet	19773ec997	bcachefs: Disk accounting device validation fixes - Fix failure to validate that accounting replicas entries point to valid devices: this wasn't a real bug since they'd be cleaned up by GC, but is still something we should know about - Fix failure to validate that dev_data_type entries point to valid devices: this does fix a real bug, since bch2_accounting_read() would then try to copy the counters to that device and pop an inconsistent error when the device didn't exist - Remove accounting entries that are zeroed or invalid: if we're not validating them we need to get rid of them: they might not exist in the superblock, so we need the to trigger the superblock mark path when they're readded. This fixes the replication.ktest rereplicate test, which was failing with "superblock not marked for replicas..." Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-09 16:42:53 -04:00
Kent Overstreet	9d86178782	bcachefs: bch2_inode_or_descendents_is_open() fsck can now correctly check if inodes in interior snapshot nodes are open/in use. - Tweak the vfs inode rhashtable so that the subvolume ID isn't hashed, meaning inums in different subvolumes will hash to the same slot. Note that this is a hack, and will cause problems if anyone ever has the same file in many different snapshots open all at the same time. - Then check if any of those subvolumes is a descendent of the snapshot ID being checked Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-09 16:42:53 -04:00
Kent Overstreet	84878e8245	bcachefs: Kill bch2_propagate_key_to_snapshot_leaves() Dead code now. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-09 16:42:53 -04:00
Kent Overstreet	9b23fdbd5d	bcachefs: bcachefs_metadata_version_inode_has_child_snapshots There's an inherent race in taking a snapshot while an unlinked file is open, and then reattaching it in the child snapshot. In the interior snapshot node the file will appear unlinked, as though it should be deleted - it's not referenced by anything in that snapshot - but we can't delete it, because the file data is referenced by the child snapshot. This was being handled incorrectly with propagate_key_to_snapshot_leaves() - but that doesn't resolve the fundamental inconsistency of "this file looks like it should be deleted according to normal rules, but - ". To fix this, we need to fix the rule for when an inode is deleted. The previous rule, ignoring snapshots (there was no well-defined rule for with snapshots) was: Unlinked, non open files are deleted, either at recovery time or during online fsck The new rule is: Unlinked, non open files, that do not exist in child snapshots, are deleted. To make this work transactionally, we add a new inode flag, BCH_INODE_has_child_snapshot; it overrides BCH_INODE_unlinked when considering whether to delete an inode, or put it on the deleted list. For transactional consistency, clearing it handled by the inode trigger: when deleting an inode we check if there are parent inodes which can now have the BCH_INODE_has_child_snapshot flag cleared. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-09 16:42:51 -04:00
Michal Hocko	9897713fe1	bcachefs: do not use PF_MEMALLOC_NORECLAIM Patch series "remove PF_MEMALLOC_NORECLAIM" v3. This patch (of 2): bch2_new_inode relies on PF_MEMALLOC_NORECLAIM to try to allocate a new inode to achieve GFP_NOWAIT semantic while holding locks. If this allocation fails it will drop locks and use GFP_NOFS allocation context. We would like to drop PF_MEMALLOC_NORECLAIM because it is really dangerous to use if the caller doesn't control the full call chain with this flag set. E.g. if any of the function down the chain needed GFP_NOFAIL request the PF_MEMALLOC_NORECLAIM would override this and cause unexpected failure. While this is not the case in this particular case using the scoped gfp semantic is not really needed bacause we can easily pus the allocation context down the chain without too much clutter. [akpm@linux-foundation.org: fix kerneldoc warnings] Link: https://lkml.kernel.org/r/20240926172940.167084-1-mhocko@kernel.org Link: https://lkml.kernel.org/r/20240926172940.167084-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> # For vfs changes Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: James Morris <jmorris@namei.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Paul Moore <paul@paul-moore.com> Cc: Serge E. Hallyn <serge@hallyn.com> Cc: Yafang Shao <laoar.shao@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-10-09 12:47:18 -07:00
Kent Overstreet	cba31b7eee	bcachefs: Delete vestigal check_inode() checks BCH_INODE_i_size_dirty dates from before we had logged operations for truncate (as well as finsert) - it hasn't been needed since before bcachefs was mainlined. BCH_INODE_i_sectors_dirty hasn't been needed since we started always updating i_sectors transactionally - it's been unused for even longer. BCH_INODE_backptr_untrusted also hasn't been used since prior to mainlining; when unlinking a hardling, we zero out the backpointer fields if they're for the dirent being removed. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-06 03:03:45 -04:00
Kent Overstreet	12f286085b	bcachefs: btree_iter_peek_upto() now handles BTREE_ITER_all_snapshots end_pos now compares against snapshot ID when required Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-06 03:03:45 -04:00
Kent Overstreet	38864eccf7	bcachefs: reattach_inode() now correctly handles interior snapshot nodes When we find an unreachable inode, we now reattach it in the oldest version that needs to be reattached (thus avoiding redundant work reattaching every single version), and we now fix up inode -> dirent backpointers in newer versions as needed - or white out the reattaching dirent in newer versions, if the newer version isn't supposed to be reattached. This results in the second verify fsck now passing cleanly after repairing on a user-provided filesystem image with thousands of different snapshots. Reported-by: Christopher Snowhill <chris@kode54.net> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-06 03:03:45 -04:00
Kent Overstreet	bade9711e0	bcachefs: Split out check_unreachable_inodes() pass With inode backpointers, we can write a very simple check_unreachable_inodes() pass that only looks for non-unlinked inodes that are missing backpointers, and reattaches them. This simplifies check_directory_structure() so that it's now only checking for directory structure loops, Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-06 03:03:45 -04:00
Kent Overstreet	bf4baaa087	bcachefs: Fix lockdep splat in bch2_accounting_read We can't take sb_lock while holding mark_lock, so split out replicas_entry_validate() and replicas_entry_sb_validate() - replicas_entry_validate() now uses the normal online device interface. 00039 ========= TEST set_option 00039 00039 WATCHDOG 30 00040 bcachefs (vdb): starting version 1.12: rebalance_work_acct_fix opts=errors=panic 00040 bcachefs (vdb): initializing new filesystem 00040 bcachefs (vdb): going read-write 00040 bcachefs (vdb): marking superblocks 00040 bcachefs (vdb): initializing freespace 00040 bcachefs (vdb): done initializing freespace 00040 bcachefs (vdb): reading snapshots table 00040 bcachefs (vdb): reading snapshots done 00040 bcachefs (vdb): done starting filesystem 00040 zstd 00041 bcachefs (vdb): shutting down 00041 bcachefs (vdb): going read-only 00041 bcachefs (vdb): finished waiting for writes to stop 00041 bcachefs (vdb): flushing journal and stopping allocators, journal seq 3 00041 bcachefs (vdb): flushing journal and stopping allocators complete, journal seq 11 00041 bcachefs (vdb): shutdown complete, journal seq 12 00041 bcachefs (vdb): marking filesystem clean 00041 bcachefs (vdb): shutdown complete 00041 Setting option on offline fs 00041 bch2_write_super(): fatal error : attempting to write superblock that wasn't version downgraded (1.12: (unknown version) > 1.10: disk_accounting_v3) 00041 fatal error - emergency read only 00041 bch2_write_super(): fatal error : attempting to write superblock that wasn't version downgraded (1.12: (unknown version) > 1.10: disk_accounting_v3) 00042 bcachefs (vdb): starting version 1.12: rebalance_work_acct_fix opts=errors=panic,compression=zstd 00042 bcachefs (vdb): recovering from clean shutdown, journal seq 12 00042 bcachefs (vdb): accounting_read... 00042 00042 ====================================================== 00042 WARNING: possible circular locking dependency detected 00042 6.12.0-rc1-ktest-g805e938a8502 #6807 Not tainted 00042 ------------------------------------------------------ 00042 mount.bcachefs/665 is trying to acquire lock: 00045 ffffff80cc280908 (&c->sb_lock){+.+.}-{3:3}, at: bch2_replicas_entry_validate (fs/bcachefs/replicas.c:102) 00045 00045 but task is already holding lock: 00048 ffffff80cc284870 (&c->mark_lock){++++}-{0:0}, at: bch2_accounting_read (fs/bcachefs/disk_accounting.c:670 (discriminator 1)) 00048 00048 which lock already depends on the new lock. 00048 00048 00048 the existing dependency chain (in reverse order) is: 00048 00048 -> #1 (&c->mark_lock){++++}-{0:0}: 00049 percpu_down_write (kernel/locking/percpu-rwsem.c:232) 00052 bch2_sb_replicas_to_cpu_replicas (fs/bcachefs/replicas.c:583) 00055 bch2_sb_to_fs (fs/bcachefs/super-io.c:614) 00057 bch2_fs_open (fs/bcachefs/super.c:828 fs/bcachefs/super.c:2050) 00060 bch2_fs_get_tree (fs/bcachefs/fs.c:2067) 00062 vfs_get_tree (fs/super.c:1801) 00064 path_mount (fs/namespace.c:3507 fs/namespace.c:3834) 00066 __arm64_sys_mount (fs/namespace.c:3847 fs/namespace.c:4055 fs/namespace.c:4032 fs/namespace.c:4032) 00067 invoke_syscall.constprop.0 (arch/arm64/include/asm/syscall.h:61 arch/arm64/kernel/syscall.c:54) 00068 do_el0_svc (include/linux/thread_info.h:127 (discriminator 2) arch/arm64/kernel/syscall.c:140 (discriminator 2) arch/arm64/kernel/syscall.c:151 (discriminator 2)) 00069 el0_svc (arch/arm64/include/asm/irqflags.h:82 arch/arm64/include/asm/irqflags.h:123 arch/arm64/include/asm/irqflags.h:136 arch/arm64/kernel/entry-common.c:165 arch/arm64/kernel/entry-common.c:178 arch/arm64/kernel/entry-common.c:713) 00069 ========= FAILED TIMEOUT set_option in 30s Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-06 03:03:45 -04:00
Linus Torvalds	8f602276d3	bcachefs fixes for 6.12-rc2 A lot of little fixes, bigger ones include: - bcachefs's __wait_on_freeing_inode() was broken in rc1 due to vfs changes, now fixed along with another lost wakeup - fragmentation LRU fixes; fsck now repairs successfully (this is the data structure copygc uses); along with some nice simplification. - Rework logged op error handling, so that if logged op replay errors (due to another filesystem error) we delete the logged op instead of going into an infinite loop) - Various small filesystem connectivitity repair fixes The final part of this patch series, fixing snapshots + unlinked file handling, is now out on the list - I'm giving that part of the series more time for user testing. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmcBhkIACgkQE6szbY3K bnYt8RAAqZo6RcN91sgz6xGsJkUvE6DS4Rtj1J4vlVAmuiIa5NUhRhqFnS6j8V9A AWZw63JwTizrglbLk4Z4knfiViT4GOeiKX4sttaJk7cLW7bxwCUddlho1G5Q7q0I PFurYevqG1ltcl5oZpD6LhZiqEhndQI3XnkpEvKsmoXy9TSB4KEqaU8Y+cewjq4q KCFuxTBhbmatxP9eTGuDhd6uWw5h0EVDGQyMitEcSutIaernGlSsBQ8gZ5n9dWSd lP91qFT5iypmCMo9Arf8Fq1YBvOpV6P91eq8YPa4A3sKDfzHn3CCzsSyjUiGK0RM Wcl+kNwqYJa7Fwtb7aGgTVhaMkqLzPTI+XYye3FXrXjJ6B0JKpl2QvvDoFhDxop9 ZPb57QyRgRBtOvofvFz8fWQOr67n+HNvaMbeG1iwGvqm6/MrgdSLsN6OaRh80uAE 5P0qX7rwTTOfJj5T6dKLxr3KuXKXNrM5AAIG0MjOMsha232+XUAZvofYNmqx7BMi juJvqZc9/GXrcXqdPTYDyBs4UXDkwHsKdr744ooZ64VNiIYFs6eTvXp7V0XuajYH ExLrEEjhO2UGPM5N9R9jw9AMsEhJstexgylHQsiiADtdi+jY4LKa/NZAJSJQQC+C QQyE3Q7ZCpzRPiGPkkpIY/D7IRoIHL2H+LhbXV/K3oMGdbA7hS4= =XnG4 -----END PGP SIGNATURE----- Merge tag 'bcachefs-2024-10-05' of git://evilpiepirate.org/bcachefs Pull bcachefs fixes from Kent Overstreet: "A lot of little fixes, bigger ones include: - bcachefs's __wait_on_freeing_inode() was broken in rc1 due to vfs changes, now fixed along with another lost wakeup - fragmentation LRU fixes; fsck now repairs successfully (this is the data structure copygc uses); along with some nice simplification. - Rework logged op error handling, so that if logged op replay errors (due to another filesystem error) we delete the logged op instead of going into an infinite loop) - Various small filesystem connectivitity repair fixes" * tag 'bcachefs-2024-10-05' of git://evilpiepirate.org/bcachefs: bcachefs: Rework logged op error handling bcachefs: Add warn param to subvol_get_snapshot, peek_inode bcachefs: Kill snapshot arg to fsck_write_inode() bcachefs: Check for unlinked, non-empty dirs in check_inode() bcachefs: Check for unlinked inodes with dirents bcachefs: Check for directories with no backpointers bcachefs: Kill alloc_v4.fragmentation_lru bcachefs: minor lru fsck fixes bcachefs: Mark more errors AUTOFIX bcachefs: Make sure we print error that causes fsck to bail out bcachefs: bkey errors are only AUTOFIX during read bcachefs: Create lost+found in correct snapshot bcachefs: Fix reattach_inode() bcachefs: Add missing wakeup to bch2_inode_hash_remove() bcachefs: Fix trans_commit disk accounting revert bcachefs: Fix bch2_inode_is_open() check bcachefs: Fix return type of dirent_points_to_inode_nowarn() bcachefs: Fix bad shift in bch2_read_flag_list()	2024-10-05 15:18:04 -07:00
Kent Overstreet	0f25eb4b60	bcachefs: Rework logged op error handling Initially it was thought that we just wanted to ignore errors from logged op replay, but it turns out we do need to catch -EROFS, or we'll go into an infinite loop. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-04 20:25:32 -04:00
Kent Overstreet	1f73cb4d34	bcachefs: Add warn param to subvol_get_snapshot, peek_inode These shouldn't always be fatal errors - logged op resume, in particular, and we want it as a parameter there. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-04 20:25:32 -04:00
Kent Overstreet	72350ee0ea	bcachefs: Kill snapshot arg to fsck_write_inode() It was initially believed that it would be better to be explicit about the snapshot we're updating when writing inodes in fsck; however, it turns out that passing around the snapshot separately is more error prone and we're usually updating the inode in the same snapshow we read it from. This is different from normal filesystem paths, where we do the update in the snapshot of the subvolume we're in. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-04 20:25:32 -04:00
Kent Overstreet	c9306a91c3	bcachefs: Check for unlinked, non-empty dirs in check_inode() We want to check for this early so it can be reattached if necessary in check_unreachable_inodes(); better than letting it be deleted and having the children reattached, losing their filenames. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-04 20:25:32 -04:00
Kent Overstreet	c7da5ee2e5	bcachefs: Check for unlinked inodes with dirents link count works differently in bcachefs - it's only nonzero for files with multiple hardlinks, which means we can also avoid checking it except for files that are known to have hardlinks. That means we need a few different checks instead; in particular, we don't want fsck to delet a file that has a dirent pointing to it. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-04 20:25:32 -04:00
Kent Overstreet	1c6051bbd7	bcachefs: Check for directories with no backpointers It's legal for regular files to have missing backpointers (due to hardlinks), and fsck should automatically add them, but for directories this is an error that should be flagged. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-04 20:25:32 -04:00
Kent Overstreet	260af1562e	bcachefs: Kill alloc_v4.fragmentation_lru The fragmentation_lru field hasn't been needed since we reworked the LRU btrees to use the btree write buffer; previously it was used to resolve collisions, but the revised LRU btree uses the backpointer (the bucket) as part of the key. It should have been deleted at the time of the LRU rework; since it wasn't, that left places for bugs to hide, in check/repair. This fixes LRU fsck on a filesystem image helpfully provided by a user who disappeared before I could get his name for the reported-by. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-04 20:25:32 -04:00
Kent Overstreet	01bf5e3bd2	bcachefs: minor lru fsck fixes check_lru_key() wasn't using write buffer updates for deleting bad lru entries - dating from before the lru btree used the btree write buffer. And when possibly flushing the btree write buffer (to make sure we're seeing a real inconsistency), we need to be using the modern bch2_btree_write_buffer_maybe_flush(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-04 20:25:31 -04:00
Kent Overstreet	1bea714c53	bcachefs: Mark more errors AUTOFIX Errors are getting marked as AUTOFIX once they've been (re)-tested and audited. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-04 20:25:31 -04:00
Kent Overstreet	492e24d760	bcachefs: Make sure we print error that causes fsck to bail out Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-04 20:25:31 -04:00
Kent Overstreet	658c82f41e	bcachefs: bkey errors are only AUTOFIX during read Newly generated keys, in the transaction commit path or write path, should not be AUTOFIX; those indicate bugs that we need to fail fast for. Fixes: `5612daafb7` ("bcachefs: Fix fsck warnings from bkey validation") Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-04 20:25:31 -04:00
Kent Overstreet	fda7b1ffde	bcachefs: Create lost+found in correct snapshot Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-04 20:25:31 -04:00
Kent Overstreet	20826fe6b8	bcachefs: Fix reattach_inode() Ensure a copy of the lost+found inode exists in the snapshot that we're reattaching, so that we don't trigger warnings in lookup_inode_for_snapshot() later. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-04 20:25:31 -04:00
Kent Overstreet	6b63a948a7	bcachefs: Add missing wakeup to bch2_inode_hash_remove() This fixes two different bugs: - Looser locking with the rhashtable means we need to recheck if the inode is still hashed after prepare_to_wait(), and add a corresponding wakeup after removing from the hash table. - `da18ecbf0f` ("fs: add i_state helpers") changed the bit waitqueues used for inodes, and bcachefs wasn't updated and thus broke; this updates bcachefs to the new helper. Fixes: `112d21fd1a` ("bcachefs: switch to rhashtable for vfs inodes hash") Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-04 20:25:31 -04:00
Kent Overstreet	d28786606a	bcachefs: Fix trans_commit disk accounting revert We only are applying JSET_ENTRY_TYPE_write_buffer_keys, revert path was missed. Fixes: `a3581ca35d` ("bcachefs: Fix BCH_TRANS_COMMIT_skip_accounting_apply") Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-02 21:37:42 -04:00
Kent Overstreet	3b1425a4eb	bcachefs: Fix bch2_inode_is_open() check Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-02 21:31:31 -04:00
Kent Overstreet	abaa6d4f6a	bcachefs: Fix return type of dirent_points_to_inode_nowarn() we're returning an error code now, not a bool Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-02 21:30:55 -04:00
Linus Torvalds	7ec462100e	Getting rid of asm/unaligned.h includes -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZv3NAgAKCRBZ7Krx/gZQ 68kbAP0YzQxUgl0/o7Soda8XwKSPZTM9ls6kRk1UHTTG/i4ZigEA/G+i/mBQctL0 AB911kK8mxfXppfOXzstFBjoJSqiigQ= =IE7D -----END PGP SIGNATURE----- Merge tag 'pull-work.unaligned' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull generic unaligned.h cleanups from Al Viro: "Get rid of architecture-specific <asm/unaligned.h> includes, replacing them with a single generic <linux/unaligned.h> header file. It's the second largest (after asm/io.h) class of asm/* includes, and all but two architectures actually end up using exact same file. Massage the remaining two (arc and parisc) to do the same and just move the thing to from asm-generic/unaligned.h to linux/unaligned.h" [ This is one of those things that we're better off doing outside the merge window, and would only cause extra conflict noise if it was in linux-next for the next release due to all the trivial #include line updates. Rip off the band-aid. - Linus ] * tag 'pull-work.unaligned' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: move asm/unaligned.h to linux/unaligned.h arc: get rid of private asm/unaligned.h parisc: get rid of private asm/unaligned.h	2024-10-02 16:42:28 -07:00
Al Viro	5f60d5f6bb	move asm/unaligned.h to linux/unaligned.h asm/unaligned.h is always an include of asm-generic/unaligned.h; might as well move that thing to linux/unaligned.h and include that - there's nothing arch-specific in that header. auto-generated by the following: for i in `git grep -l -w asm/unaligned.h`; do sed -i -e "s/asm\/unaligned.h/linux\/unaligned.h/" $i done for i in `git grep -l -w asm-generic/unaligned.h`; do sed -i -e "s/asm-generic\/unaligned.h/linux\/unaligned.h/" $i done git mv include/asm-generic/unaligned.h include/linux/unaligned.h git mv tools/include/asm-generic/unaligned.h tools/include/linux/unaligned.h sed -i -e "/unaligned.h/d" include/asm-generic/Kbuild sed -i -e "s/__ASM_GENERIC/__LINUX/" include/linux/unaligned.h tools/include/linux/unaligned.h	2024-10-02 17:23:23 -04:00
Kent Overstreet	e764e68103	bcachefs: Fix bad shift in bch2_read_flag_list() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-10-01 17:20:24 -04:00
Guenter Roeck	2007d28ec0	bcachefs: rename version -> bversion for big endian builds Builds on big endian systems fail as follows. fs/bcachefs/bkey.h: In function 'bch2_bkey_format_add_key': fs/bcachefs/bkey.h:557:41: error: 'const struct bkey' has no member named 'bversion' The original commit only renamed the variable for little endian builds. Rename it for big endian builds as well to fix the problem. Fixes: `cf49f8a8c2` ("bcachefs: rename version -> bversion") Cc: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-29 23:55:52 -04:00
Linus Torvalds	9f9a534724	bcachefs fixes for 6.11-rc1 Assorted minor syzbot fixes, and for bigger stuff: - Fix two disk accounting rewrite bugs - Disk accounting keys use the version field of bkey so that journal replay can tell which updates have been applied to the btree. This is set in the transaction commit path, after we've gotten our journal reservation (and our time ordering), but the BCH_TRANS_COMMIT_skip_accounting_apply flag that journal replay uses was incorrectly skipping this for new updates generated prior to journal replay. This fixes the underlying cause of an assertion pop in disk_accounting_read. - A couple fixes for disk accounting + device removal. Checking if acocunting replicas entries were marked in the superblock was being done at the wrong point, when deltas in the journal could still zero them out, and then additionally we'd try to add a missing replicas entry to the superblock without checking if it referred to an invalid (removed) device. - A whole slew of repair fixes - fix infinite loop in propagate_key_to_snapshot_leaves(), this fixes an infinite loop when repairing a filesystem with many snapshots - fix incorrect transaction restart handling leading to occasional "fsck counted ..." warnings" - fix warning in __bch2_fsck_err() for bkey fsck errors - check_inode() in fsck now correctly checks if the filesystem was clean - there shouldn't be pending logged ops if the fs was clean, we now check for this - remove_backpointer() doesn't remove a dirent that doesn't actually point to the inode - many more fsck errors are AUTOFIX -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmb4QtsACgkQE6szbY3K bnYx4A//bhGgZYgP55FxduuxUH8XjX2eOnXwuPv/MmYO/4oCok5VBa9bRDTVXhIK PtY4pP2IJZ3+u963mwbwJAawsPA01AEEty9tE+AdXbltDRQ03I33OEuIy0HFIso2 s8VBkVPbru6yU4RCCvYNIVvRG/9GOL+J0GgrR1t05zHVyKXe1FuS00Yq5+z3niNP HtuGTsD273Nnhikz47bqyD+M6VizU+uzSUFLgnB3zrzpb+gPSGETSwgc4ggajlM4 2P10Vc4L/Nb3KYV9RW+C3WpRfUR/o8BZA3wjJfNo0JeA4iDaUbltSjpCA07EcAnA 3D6Omzqkm4aobL2WlvioT0UhZx4t8X/8x5t5F9HyX52i1k+g87oMT9/KIKec1Dzd 8vQCwCdXFfWaLSZoOJsHyIljip7BuRLKhWwKosdzzLIAnRQy5StxAhsG99fNStu6 JOWICPNCn1b6SkktnoKou1unL+K5RczeNfAxMAjcJjTD7IIAmytLe4mdRbP9q+Oa x8no7pttbb4JnoRvfo42GVz8KWQR07oN/Zy7mH3K4Y0Ix+xDOrLqlfLIDLGpxMNv HZz+UPchdlfpYJO+nTLoAOGXZWnKDqg70SAEcWKDc82Ri4vNOhraYDZvXrzl9qE+ 63RPzqDbg3uXGxLYMvujjPe610QkPxS9zKKyDvUZZx0ZiUX4CjI= =cdrz -----END PGP SIGNATURE----- Merge tag 'bcachefs-2024-09-28' of git://evilpiepirate.org/bcachefs Pull more bcachefs updates from Kent Overstreet: "Assorted minor syzbot fixes, and for bigger stuff: Fix two disk accounting rewrite bugs: - Disk accounting keys use the version field of bkey so that journal replay can tell which updates have been applied to the btree. This is set in the transaction commit path, after we've gotten our journal reservation (and our time ordering), but the BCH_TRANS_COMMIT_skip_accounting_apply flag that journal replay uses was incorrectly skipping this for new updates generated prior to journal replay. This fixes the underlying cause of an assertion pop in disk_accounting_read. - A couple of fixes for disk accounting + device removal. Checking if acocunting replicas entries were marked in the superblock was being done at the wrong point, when deltas in the journal could still zero them out, and then additionally we'd try to add a missing replicas entry to the superblock without checking if it referred to an invalid (removed) device. A whole slew of repair fixes: - fix infinite loop in propagate_key_to_snapshot_leaves(), this fixes an infinite loop when repairing a filesystem with many snapshots - fix incorrect transaction restart handling leading to occasional "fsck counted ..." warnings - fix warning in __bch2_fsck_err() for bkey fsck errors - check_inode() in fsck now correctly checks if the filesystem was clean - there shouldn't be pending logged ops if the fs was clean, we now check for this - remove_backpointer() doesn't remove a dirent that doesn't actually point to the inode - many more fsck errors are AUTOFIX" * tag 'bcachefs-2024-09-28' of git://evilpiepirate.org/bcachefs: (35 commits) bcachefs: check_subvol_path() now prints subvol root inode bcachefs: remove_backpointer() now checks if dirent points to inode bcachefs: dirent_points_to_inode() now warns on mismatch bcachefs: Fix lost wake up bcachefs: Check for logged ops when clean bcachefs: BCH_FS_clean_recovery bcachefs: Convert disk accounting BUG_ON() to WARN_ON() bcachefs: Fix BCH_TRANS_COMMIT_skip_accounting_apply bcachefs: Check for accounting keys with bversion=0 bcachefs: rename version -> bversion bcachefs: Don't delete unlinked inodes before logged op resume bcachefs: Fix BCH_SB_ERRS() so we can reorder bcachefs: Fix fsck warnings from bkey validation bcachefs: Move transaction commit path validation to as late as possible bcachefs: Fix disk accounting attempting to mark invalid replicas entry bcachefs: Fix unlocked access to c->disk_sb.sb in bch2_replicas_entry_validate() bcachefs: Fix accounting read + device removal bcachefs: bch_accounting_mode bcachefs: fix transaction restart handling in check_extents(), check_dirents() bcachefs: kill inode_walker_entry.seen_this_pos ...	2024-09-29 09:17:44 -07:00
Kent Overstreet	3a5895e3ac	bcachefs: check_subvol_path() now prints subvol root inode Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-27 22:32:23 -04:00
Kent Overstreet	0b0f0ad93c	bcachefs: remove_backpointer() now checks if dirent points to inode Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-27 22:32:23 -04:00
Kent Overstreet	a6508079b1	bcachefs: dirent_points_to_inode() now warns on mismatch if an inode backpointer points to a dirent that doesn't point back, that's an error we should warn about. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-27 22:32:23 -04:00
Alan Huang	e057a290ef	bcachefs: Fix lost wake up If the reader acquires the read lock and then the writer enters the slow path, while the reader proceeds to the unlock path, the following scenario can occur without the change: writer: pcpu_read_count(lock) return 1 (so __do_six_trylock will return 0) reader: this_cpu_dec(*lock->readers) reader: smp_mb() reader: state = atomic_read(&lock->state) (there is no waiting flag set) writer: six_set_bitmask() then the writer will sleep forever. Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-27 22:32:23 -04:00
Kent Overstreet	d50d7a5fa4	bcachefs: Check for logged ops when clean If we shut down successfully, there shouldn't be any logged ops to resume. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-27 22:32:22 -04:00
Kent Overstreet	1c0ee43b2c	bcachefs: BCH_FS_clean_recovery Add a filesystem flag to indicate whether we did a clean recovery - using c->sb.clean after we've got rw is incorrect, since c->sb is updated whenever we write the superblock. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-27 22:32:22 -04:00
Kent Overstreet	9773547b16	bcachefs: Convert disk accounting BUG_ON() to WARN_ON() We had a bug where disk accounting keys didn't always have their version field set in journal replay; change the BUG_ON() to a WARN(), and exclude this case since it's now checked for elsewhere (in the bkey validate function). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-27 22:32:22 -04:00
Kent Overstreet	a3581ca35d	bcachefs: Fix BCH_TRANS_COMMIT_skip_accounting_apply This was added to avoid double-counting accounting keys in journal replay. But applied incorrectly (easily done since it applies to the transaction commit, not a particular update), it leads to skipping in-mem accounting for real accounting updates, and failure to give them a version number - which leads to journal replay becoming very confused the next time around. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-27 22:32:20 -04:00
Kent Overstreet	f8911ad88d	bcachefs: Check for accounting keys with bversion=0 Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-27 21:46:35 -04:00
Kent Overstreet	cf49f8a8c2	bcachefs: rename version -> bversion give bversions a more distinct name, to aid in grepping Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-27 21:46:35 -04:00
Kent Overstreet	fd65378db9	bcachefs: Don't delete unlinked inodes before logged op resume Previously, check_inode() would delete unlinked inodes if they weren't on the deleted list - this code dating from before there was a deleted list. But, if we crash during a logged op (truncate or finsert/fcollapse) of an unlinked file, logged op resume will get confused if the inode has already been deleted - instead, just add it to the deleted list if it needs to be there; delete_dead_inodes runs after logged op resume. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-27 21:46:35 -04:00
Kent Overstreet	8d65b15f8d	bcachefs: Fix BCH_SB_ERRS() so we can reorder BCH_SB_ERRS() has a field for the actual enum val so that we can reorder to reorganize, but the way BCH_SB_ERR_MAX was defined didn't allow for this. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-27 21:46:35 -04:00
Kent Overstreet	5612daafb7	bcachefs: Fix fsck warnings from bkey validation __bch2_fsck_err() warns if the current task has a btree_trans object and it wasn't passed in, because if it has to prompt for user input it has to be able to unlock it. But plumbing the btree_trans through bkey_validate(), as well as transaction restarts, is problematic - so instead make bkey fsck errors FSCK_AUTOFIX, which doesn't need to warn. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-27 21:46:35 -04:00
Kent Overstreet	7c980a43e9	bcachefs: Move transaction commit path validation to as late as possible In order to check for accounting keys with version=0, we need to run validation after they've been assigned version numbers. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-27 21:46:35 -04:00
Kent Overstreet	431312b59c	bcachefs: Fix disk accounting attempting to mark invalid replicas entry This fixes the following bug, where a disk accounting key has an invalid replicas entry, and we attempt to add it to the superblock: bcachefs (3c0860e8-07ca-4276-8954-11c1774be868): starting version 1.12: rebalance_work_acct_fix opts=metadata_replicas=2,data_replicas=2,foreground_target=ssd,background_target=hdd,nopromote_whole_extents,verbose,fsck,fix_errors=yes bcachefs (3c0860e8-07ca-4276-8954-11c1774be868): recovering from clean shutdown, journal seq 15211644 bcachefs (3c0860e8-07ca-4276-8954-11c1774be868): accounting_read... accounting not marked in superblock replicas replicas cached: 1/1 [0], fixing bcachefs (3c0860e8-07ca-4276-8954-11c1774be868): sb invalid before write: Invalid superblock section replicas_v0: invalid device 0 in entry cached: 1/1 [0] replicas_v0 (size 88): user: 2 [3 5] user: 2 [1 4] cached: 1 [2] btree: 2 [1 2] user: 2 [2 5] cached: 1 [0] cached: 1 [4] journal: 2 [1 5] user: 2 [1 2] user: 2 [2 3] user: 2 [3 4] user: 2 [4 5] cached: 1 [1] cached: 1 [3] cached: 1 [5] journal: 2 [1 2] journal: 2 [2 5] btree: 2 [2 5] user: 2 [1 3] user: 2 [1 5] user: 2 [2 4] bcachefs (3c0860e8-07ca-4276-8954-11c1774be868): inconsistency detected - emergency read only at journal seq 15211644 accounting not marked in superblock replicas replicas user: 1/1 [3], fixing bcachefs (3c0860e8-07ca-4276-8954-11c1774be868): sb invalid before write: Invalid superblock section replicas_v0: invalid device 0 in entry cached: 1/1 [0] replicas_v0 (size 96): user: 2 [3 5] user: 2 [1 3] cached: 1 [2] btree: 2 [1 2] user: 2 [2 4] cached: 1 [0] cached: 1 [4] journal: 2 [1 5] user: 1 [3] user: 2 [1 5] user: 2 [3 4] user: 2 [4 5] cached: 1 [1] cached: 1 [3] cached: 1 [5] journal: 2 [1 2] journal: 2 [2 5] btree: 2 [2 5] user: 2 [1 2] user: 2 [1 4] user: 2 [2 3] user: 2 [2 5] accounting not marked in superblock replicas replicas user: 1/2 [3 7], fixing bcachefs (3c0860e8-07ca-4276-8954-11c1774be868): sb invalid before write: Invalid superblock section replicas_v0: invalid device 7 in entry user: 1/2 [3 7] replicas_v0 (size 96): user: 2 [3 7] user: 2 [1 3] cached: 1 [2] btree: 2 [1 2] user: 2 [2 4] cached: 1 [0] cached: 1 [4] journal: 2 [1 5] user: 1 [3] user: 2 [1 5] user: 2 [3 4] user: 2 [4 5] cached: 1 [1] cached: 1 [3] cached: 1 [5] journal: 2 [1 2] journal: 2 [2 5] btree: 2 [2 5] user: 2 [1 2] user: 2 [1 4] user: 2 [2 3] user: 2 [2 5] user: 2 [3 5] done bcachefs (3c0860e8-07ca-4276-8954-11c1774be868): alloc_read... done bcachefs (3c0860e8-07ca-4276-8954-11c1774be868): stripes_read... done bcachefs (3c0860e8-07ca-4276-8954-11c1774be868): snapshots_read... done Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-27 21:46:35 -04:00
Kent Overstreet	49fd90b2cc	bcachefs: Fix unlocked access to c->disk_sb.sb in bch2_replicas_entry_validate() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-27 21:46:35 -04:00
Kent Overstreet	9104fc1928	bcachefs: Fix accounting read + device removal accounting read was checking if accounting replicas entries were marked in the superblock prior to applying accounting from the journal, which meant that a recently removed device could spuriously trigger a "not marked in superblocked" error (when journal entries zero out the offending counter). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-27 21:46:35 -04:00
Kent Overstreet	1e0272ef47	bcachefs: bch_accounting_mode Minor refactoring - replace multiple bool arguments with an enum; prep work for fixing a bug in accounting read. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-27 21:46:35 -04:00
Kent Overstreet	3672bda8f5	bcachefs: fix transaction restart handling in check_extents(), check_dirents() Dealing with outside state within a btree transaction is always tricky. check_extents() and check_dirents() have to accumulate counters for i_sectors and i_nlink (for subdirectories). There were two bugs: - transaction commit may return a restart; therefore we have to commit before accumulating to those counters - get_inode_all_snapshots() may return a transaction restart, before updating w->last_pos; then, on the restart, check_i_sectors()/check_subdir_count() would see inodes that were not for w->last_pos Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-27 21:46:35 -04:00
Kent Overstreet	22a507d68e	bcachefs: kill inode_walker_entry.seen_this_pos dead code Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-27 21:46:34 -04:00
Kent Overstreet	b29c30ab48	bcachefs: Fix incorrect IS_ERR_OR_NULL usage Returning a positive integer instead of an error code causes error paths to become very confused. Closes: syzbot+c0360e8367d6d8d04a66@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-27 21:46:34 -04:00
Hongbo Li	dc5bfdf8ea	bcachefs: fix the memory leak in exception case The pointer clean points the memory allocated by kmemdup, when the return value of bch2_sb_clean_validate_late is not zero. The memory pointed by clean is leaked. So we should free it in this case. Fixes: `a37ad1a3ab` ("bcachefs: sb-clean.c") Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-27 21:46:34 -04:00
Hongbo Li	3125c95ea6	bcachefs: fast exit when darray_make_room failed In downgrade_table_extra, the return value is needed. When it return failed, we should exit immediately. Fixes: `7773df19c3` ("bcachefs: metadata version bucket_stripe_sectors") Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-27 21:46:34 -04:00
Kent Overstreet	951dd86e7c	bcachefs: Fix iterator leak in check_subvol() A couple small error handling fixes Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-27 21:46:34 -04:00
Kent Overstreet	2a1df87346	bcachefs: Add snapshot to bch_inode_unpacked this allows for various cleanups in fsck Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-27 21:46:34 -04:00
Diogo Jahchan Koike	40d40c6bea	bcachefs: assign return error when iterating through layout syzbot reported a null ptr deref in __copy_user [0] In __bch2_read_super, when a corrupt backup superblock matches the default opts offset, no error is assigned to ret and the freed superblock gets through, possibly being assigned as the best sb in bch2_fs_open and being later dereferenced, causing a fault. Assign EINVALID to ret when iterating through layout. [0]: https://syzkaller.appspot.com/bug?extid=18a5c5e8a9c856944876 Reported-by: syzbot+18a5c5e8a9c856944876@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=18a5c5e8a9c856944876 Signed-off-by: Diogo Jahchan Koike <djahchankoike@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-27 21:46:34 -04:00
Kent Overstreet	c6040447c5	bcachefs: Fix srcu warning in check_topology check_topology doesn't need the srcu lock and doesn't use normal btree transactions - we can just drop the srcu lock. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-27 21:46:34 -04:00
Kent Overstreet	18c520f408	bcachefs: Fix error path in check_dirent_inode_dirent() fsck_err() jumps to the fsck_err label when bailing out; need to make sure bp_iter was initialized... Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-27 21:46:34 -04:00
Piotr Zalewski	0696a18a8c	bcachefs: memset bounce buffer portion to 0 after key_sort_fix_overlapping Zero-initialize part of allocated bounce buffer which wasn't touched by subsequent bch2_key_sort_fix_overlapping to mitigate later uinit-value use KMSAN bug[1]. After applying the patch reproducer still triggers stack overflow[2] but it seems unrelated to the uninit-value use warning. After further investigation it was found that stack overflow occurs because KMSAN adds too many function calls[3]. Backtrace of where the stack magic number gets smashed was added as a reply to syzkaller thread[3]. It was confirmed that task's stack magic number gets smashed after the code path where KSMAN detects uninit-value use is executed, so it can be assumed that it doesn't contribute in any way to uninit-value use detection. [1] https://syzkaller.appspot.com/bug?extid=6f655a60d3244d0c6718 [2] https://lore.kernel.org/lkml/66e57e46.050a0220.115905.0002.GAE@google.com [3] https://lore.kernel.org/all/rVaWgPULej8K7HqMPNIu8kVNyXNjjCiTB-QBtItLFBmk0alH6fV2tk4joVPk97Evnuv4ZRDd8HB5uDCkiFG6u81xKdzDj-KrtIMJSlF6Kt8=@proton.me Reported-by: syzbot+6f655a60d3244d0c6718@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=6f655a60d3244d0c6718 Fixes: `ec4edd7b9d` ("bcachefs: Prep work for variable size btree node buffers") Suggested-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Piotr Zalewski <pZ010001011111@proton.me> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-27 21:46:34 -04:00
Kent Overstreet	51b7cc7c0f	bcachefs: Improve bch2_is_inode_open() warning message Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-27 21:46:34 -04:00
Kent Overstreet	4a8f8fafbd	bcachefs: Add extra padding in bkey_make_mut_noupdate() This fixes a kasan splat in propagate_key_to_snapshot_leaves() - varint_decode_fast() does reads (that it never uses) up to 7 bytes past the end of the integer. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-27 21:46:34 -04:00
Kent Overstreet	f890c8513f	bcachefs: Mark inode errors as autofix Most or all errors will be autofix in the future, we're currently just doing the ones that we know are well tested. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-27 21:46:34 -04:00
Al Viro	cb787f4ac0	[tree-wide] finally take no_llseek out no_llseek had been defined to NULL two years ago, in commit `868941b144` ("fs: remove no_llseek") To quote that commit, At -rc1 we'll need do a mechanical removal of no_llseek - git grep -l -w no_llseek \| grep -v porting.rst \| while read i; do sed -i '/\<no_llseek\>/d' $i done would do it. Unfortunately, that hadn't been done. Linus, could you do that now, so that we could finally put that thing to rest? All instances are of the form .llseek = no_llseek, so it's obviously safe. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2024-09-27 08:18:43 -07:00
Kent Overstreet	7eb4a319db	bcachefs: Fix infinite loop in propagate_key_to_snapshot_leaves() As we iterate we need to mark that we no longer need iterators - otherwise we'll infinite loop via the "too many iters" check when there's many snapshots. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-23 18:46:58 -04:00
Kent Overstreet	6d12d7ace9	bcachefs: Ensure BCH_FS_accounting_replay_done is always set if it doesn't get set we'll never be able to flush the btree write buffer; this only happens in fake rw mode, but prevents us from shutting down. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-23 18:46:58 -04:00
Linus Torvalds	b3f391fddf	bcachefs changes for 6.12-rc1 rcu_pending, btree key cache rework: this solves lock contenting in the key cache, eliminating the biggest source of the srcu lock hold time warnings, and drastically improving performance on some metadata heavy workloads - on multithreaded creates we're now 3-4x faster than xfs. We're now using an rhashtable instead of the system inode hash table; this is another significant performance improvement on multithreaded metadata workloads, eliminating more lock contention. for_each_btree_key_in_subvolume_upto(): new helper for iterating over keys within a specific subvolume, eliminating a lot of open coded "subvolume_get_snapshot()" and also fixing another source of srcu lock time warnings, by running each loop iteration in its own transaction (as the existing for_each_btree_key() does). More work on btree_trans locking asserts; we now assert that we don't hold btree node locks when trans->locked is false, which is important because we don't use lockdep for tracking individual btree node locks. Some cleanups and improvements in the bset.c btree node lookup code, from Alan. Rework of btree node pinning, which we use in backpointers fsck. The old hacky implementation, where the shrinker just skipped over nodes in the pinned range, was causing OOMs; instead we now use another shrinker with a much higher seeks number for pinned nodes. Rebalance now uses BCH_WRITE_ONLY_SPECIFIED_DEVS; this fixes an issue where rebalance would sometimes fall back to allocating from the full filesystem, which is not what we want when it's trying to move data to a specific target. Use __GFP_ACCOUNT, GFP_RECLAIMABLE for btree node, key cache allocations. Idmap mounts are now supported - Hongbo. Rename whiteouts are now supported - Hongbo. Erasure coding can now handle devices being marked as failed, or forcibly removed. We still need the evacuate path for erasure coding, but it's getting very close to ready for people to start using. Status, and when will we be taking off experimental: ---------------------------------------------------- Going by critical, user facing bugs getting found and fixed, we're nearly there. There are a couple key items that need to be finished before we can take off the experimental label: - The end-user experience is still pretty painful when the root filesystem needs a fsck; we need some form of limited self healing so that necessary repair gets run automatically. Errors (by type) are recorded in the superblock, so what we need to do next is convert remaining inconsistent() errors to fsck() errors (so that all runtime inconsistencies are logged in the superblock), and we need to go through the list of fsck errors and classify them by which fsck passes are needed to repair them. - We need comprehensive torture testing for all our repair paths, to shake out remaining bugs there. Thomas has been working on the tooling for this, so this is coming soonish. Slightly less critical items: - We need to improve the end-user experience for degraded mounts: right now, a degraded root filesystem means dropping to an initramfs shell or somehow inputting mount options manually (we don't want to allow degraded mounts without some form of user input, except on unattended servers) - we need the mount helper to prompt the user to allow mounting degraded, and make sure this works with systemd. - Scalabiity: we have users running 100TB+ filesystems, and that's effectively the limit right now due to fsck times. We have some reworks in the pipeline to address this, we're aiming to make petabyte sized filesystems practical. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmbvHQoACgkQE6szbY3K bnYfAw/+IXQ43/O+Jzs0MLD7pKZnrlbHiX9FqYLazD40vWvkyRTQOwgTn8pVNhq3 4YWmtuZyqh036YC+bGqYFOhz20YetS5UdgbClpwmc99JJ6xsY+Z1mdpYfz5oq1Dw /pBX5iYb3rAt8UbQoZ8lcWM+GpT3GKJVgJuiLB2gRp9gATFesuh+0qU42oIVVVU5 4y3VhDBUmRk4XqEnk8hr7EIDMW0wWP3aptxYMZzeUPW0x1cEQ+FWrJo5D6lXv2KK dKv3MogvA0FFNi/eNexclPiu2pXtI7vrxT7umsxAICHLt41rWpV5ttE6io3bC4ZN qvwF9w2CpmKPKchFru9PO+QrWHVR7e6bphwf3TzyoKZ7tTn42f1RQlub7gBzI3bz ai5ZwGRIvpUoPVBj+CO+Ipog81uUb23Ma+gXg1akEFBOAb+o7I3KOOSBh5l+0cHj 3Ov1n0TLcsoO2cqoqfsV2QubW9YcWEZ76g5mKwQnUn8Cs6Fp0wWaIyK9aNkIAxcr tNDPGtH1gKitxUvju5i/LyI7y1UoeFvqJFee0VsU6QnixHn1ySzhePsJt6UEnIJT Ia3C96Igqu2mV9FxhfGHj/qi7TGjqqkZHa8+B610cDpgf15cx7Ps2DYjkuQMFCqZ Q3Q1o5De9roRq5xF2hLiYJCbzJKqd5ichFsBtLQuX572ICxbICg= =oVCy -----END PGP SIGNATURE----- Merge tag 'bcachefs-2024-09-21' of git://evilpiepirate.org/bcachefs Pull bcachefs updates from Kent Overstreet: - rcu_pending, btree key cache rework: this solves lock contenting in the key cache, eliminating the biggest source of the srcu lock hold time warnings, and drastically improving performance on some metadata heavy workloads - on multithreaded creates we're now 3-4x faster than xfs. - We're now using an rhashtable instead of the system inode hash table; this is another significant performance improvement on multithreaded metadata workloads, eliminating more lock contention. - for_each_btree_key_in_subvolume_upto(): new helper for iterating over keys within a specific subvolume, eliminating a lot of open coded "subvolume_get_snapshot()" and also fixing another source of srcu lock time warnings, by running each loop iteration in its own transaction (as the existing for_each_btree_key() does). - More work on btree_trans locking asserts; we now assert that we don't hold btree node locks when trans->locked is false, which is important because we don't use lockdep for tracking individual btree node locks. - Some cleanups and improvements in the bset.c btree node lookup code, from Alan. - Rework of btree node pinning, which we use in backpointers fsck. The old hacky implementation, where the shrinker just skipped over nodes in the pinned range, was causing OOMs; instead we now use another shrinker with a much higher seeks number for pinned nodes. - Rebalance now uses BCH_WRITE_ONLY_SPECIFIED_DEVS; this fixes an issue where rebalance would sometimes fall back to allocating from the full filesystem, which is not what we want when it's trying to move data to a specific target. - Use __GFP_ACCOUNT, GFP_RECLAIMABLE for btree node, key cache allocations. - Idmap mounts are now supported (Hongbo Li) - Rename whiteouts are now supported (Hongbo Li) - Erasure coding can now handle devices being marked as failed, or forcibly removed. We still need the evacuate path for erasure coding, but it's getting very close to ready for people to start using. * tag 'bcachefs-2024-09-21' of git://evilpiepirate.org/bcachefs: (99 commits) bcachefs: return err ptr instead of null in read sb clean bcachefs: Remove duplicated include in backpointers.c bcachefs: Don't drop devices with stripe pointers bcachefs: bch2_ec_stripe_head_get() now checks for change in rw devices bcachefs: bch_fs.rw_devs_change_count bcachefs: bch2_dev_remove_stripes() bcachefs: bch2_trigger_ptr() calculates sectors even when no device bcachefs: improve error messages in bch2_ec_read_extent() bcachefs: improve error message on too few devices for ec bcachefs: improve bch2_new_stripe_to_text() bcachefs: ec_stripe_head.nr_created bcachefs: bch_stripe.disk_label bcachefs: stripe_to_mem() bcachefs: EIO errcode cleanup bcachefs: Rework btree node pinning bcachefs: split up btree cache counters for live, freeable bcachefs: btree cache counters should be size_t bcachefs: Don't count "skipped access bit" as touched in btree cache scan bcachefs: Failed devices no longer require mounting in degraded mode bcachefs: bch2_dev_rcu_noerror() ...	2024-09-23 10:05:41 -07:00
Ahmed Ehab	39c3aad43f	bcachefs: Hold read lock in bch2_snapshot_tree_oldest_subvol() Syzbot reports a problem that a warning is triggered due to suspicious use of rcu_dereference_check(). That is triggered by a call of bch2_snapshot_tree_oldest_subvol(). The cause of the warning is that inside bch2_snapshot_tree_oldest_subvol(), snapshot_t() is called which calls rcu_dereference() that requires a read lock to be held. Also, the call of bch2_snapshot_tree_next() eventually calls snapshot_t(). To fix this, call rcu_read_lock() before calling snapshot_t(). Then, release the lock after the termination of the while loop. Reported-by: <syzbot+f7c41a878676b72c16a6@syzkaller.appspotmail.com> Signed-off-by: Ahmed Ehab <bottaawesome633@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 14:54:18 -04:00
Diogo Jahchan Koike	025c55a4c7	bcachefs: return err ptr instead of null in read sb clean syzbot reported a null-ptr-deref in bch2_fs_start. [0] When a sb is marked clear but doesn't have a clean section bch2_read_superblock_clean returns NULL which PTR_ERR_OR_ZERO lets through, eventually leading to a null ptr dereference down the line. Adjust read sb clean to return an ERR_PTR indicating the invalid clean section. [0] https://syzkaller.appspot.com/bug?extid=1cecc37d87c4286e5543 Reported-by: syzbot+1cecc37d87c4286e5543@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=1cecc37d87c4286e5543 Signed-off-by: Diogo Jahchan Koike <djahchankoike@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:39:49 -04:00
Yang Li	abb43dd677	bcachefs: Remove duplicated include in backpointers.c The header files bbpos.h is included twice in backpointers.c, so one inclusion of each can be removed. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=10783 Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:39:49 -04:00
Kent Overstreet	d5c5b337f8	bcachefs: Don't drop devices with stripe pointers Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:39:49 -04:00
Kent Overstreet	035d72f72c	bcachefs: bch2_ec_stripe_head_get() now checks for change in rw devices This factors out ec_strie_head_devs_update(), which initializes the bitmap of devices we're allocating from, and runs it every time c->rw_devs_change_count changes. We also cancel pending, not allocated stripes, since they may refer to devices that are no longer available. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:39:49 -04:00
Kent Overstreet	83ccd9b31d	bcachefs: bch_fs.rw_devs_change_count Add a counter that's incremented whenever rw devices change; this will be used for erasure coding so that it can keep ec_stripe_head in sync and not deadlock on a new stripe when a device it wants goes away. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:39:49 -04:00
Kent Overstreet	ad8d1f77fc	bcachefs: bch2_dev_remove_stripes() We can now correctly force-remove a device that has stripes on it; this uses the new BCH_SB_MEMBER_INVALID sentinal value. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:39:49 -04:00
Kent Overstreet	934137b0c0	bcachefs: bch2_trigger_ptr() calculates sectors even when no device This is necessary for erasure coded pointers to devices that have been removed. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:39:49 -04:00
Kent Overstreet	2aee59eb21	bcachefs: improve error messages in bch2_ec_read_extent() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:39:49 -04:00
Kent Overstreet	cb771fe891	bcachefs: improve error message on too few devices for ec Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:39:49 -04:00
Kent Overstreet	c9cabfb215	bcachefs: improve bch2_new_stripe_to_text() also print out the new stripe key Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:39:48 -04:00
Kent Overstreet	a4b7a0c037	bcachefs: ec_stripe_head.nr_created additional debug stat Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:39:48 -04:00
Kent Overstreet	fa85c47397	bcachefs: bch_stripe.disk_label When reshaping existing stripes, we should keep them on the same target that they were allocated on; to do this, we need to add a field to the btree stripe type. This is a tad awkward, because we only have 8 bits left, and targets are 16 bits - but we only need to store a label, not a full target. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:39:48 -04:00
Kent Overstreet	1b11c4d365	bcachefs: stripe_to_mem() factor out a common helper Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:39:48 -04:00
Kent Overstreet	54a12984a9	bcachefs: EIO errcode cleanup We want to be using private errcodes whenever possible, for better error messages. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:39:48 -04:00
Kent Overstreet	7a51608d01	bcachefs: Rework btree node pinning In backpointers fsck, we do a seqential scan of one btree, and check references to another: extents <-> backpointers Checking references generates random lookups, so we want to pin that btree in memory (or only a range, if it doesn't fit in ram). Previously, this was done with a simple check in the shrinker - "if btree node is in range being pinned, don't free it" - but this generated OOMs, as our shrinker wasn't well behaved if there was less memory available than expected. Instead, we now have two different shrinkers and lru lists; the second shrinker being for pinned nodes, with seeks set much higher than normal - so they can still be freed if necessary, but we'll prefer not to. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:39:48 -04:00
Kent Overstreet	91ddd71510	bcachefs: split up btree cache counters for live, freeable this is prep for introducing a second live list and shrinker for pinned nodes Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:39:48 -04:00
Kent Overstreet	691f2cba22	bcachefs: btree cache counters should be size_t 32 bits won't overflow any time soon, but size_t is the correct type for counting objects in memory. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:39:48 -04:00
Kent Overstreet	ad5dbe3ce5	bcachefs: Don't count "skipped access bit" as touched in btree cache scan Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:39:48 -04:00
Kent Overstreet	e92e5056e4	bcachefs: Failed devices no longer require mounting in degraded mode Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:39:48 -04:00
Kent Overstreet	805ddc2042	bcachefs: bch2_dev_rcu_noerror() bch2_dev_rcu() now properly errors if the device is invalid Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:39:48 -04:00
Kent Overstreet	b99a94fd7a	bcachefs: Progress indicator for extents_to_backpointers Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:39:48 -04:00
Kent Overstreet	3621ecc10f	bcachefs: bch2_opts_to_text() Factor out bch2_show_options() into a generic helper, for debugging option passing issues. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:39:48 -04:00
Kent Overstreet	bf611567b7	bcachefs: improve "no device to read from" message Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:39:48 -04:00
Hongbo Li	b161ca8096	bcachefs: Fix compilation error for bch2_sb_member_alloc Fix the following compilation error: ``` fs/bcachefs/sb-members.c: In function ‘bch2_sb_member_alloc’: fs/bcachefs/sb-members.c:508:2: error: a label can only be part of a statement and a declaration is not a statement 508 \| unsigned nr_devices = max_t(unsigned, dev_idx + 1, c->sb.nr_devices); ``` Fixes: a7d364a133c7 ("bcachefs: bch2_sb_member_alloc()") Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:39:48 -04:00
Kent Overstreet	17405279e8	bcachefs: bch2_sb_member_alloc() refactoring Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:39:48 -04:00
Kent Overstreet	6b812f1dce	bcachefs: bch2_dev_remove_alloc() -> alloc_background.c Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:39:48 -04:00
Kent Overstreet	8ed4ba3663	bcachefs: Move tabstop setup to bch2_dev_usage_to_text() No reason for it not to be where it's needed. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:39:48 -04:00
Kent Overstreet	4f19a60c32	bcachefs: Options for recovery_passes, recovery_passes_exclude This adds mount options for specifying recovery passes to run, or exclude; the immediate need for this is that backpointers fsck is having trouble completing, so we need a way to skip it. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:39:48 -04:00
Kent Overstreet	ff7f756f2b	bcachefs: Use mm_account_reclaimed_pages() when freeing btree nodes When freeing in a shrinker callback, we need to notify memory reclaim, so it knows forward progress has been made. Normally this is done in e.g. slab code, but we're not freeing through slab - or rather we are, but these allocations are big, and use the kmalloc_large() path. This is really a bug in the slub code, but we're working around it here for now. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:39:48 -04:00
Kent Overstreet	895fbf1cf0	bcachefs: Use __GFP_ACCOUNT for reclaimable memory Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:39:46 -04:00
Sasha Finkelstein	4645855df0	bcachefs: Hook up RENAME_WHITEOUT in rename. This is needed for overlayfs, which is used by container managers. Signed-off-by: Sasha Finkelstein <fnkl.kernel@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:35:20 -04:00
Kent Overstreet	d90c8acd35	bcachefs: rebalance writes use BCH_WRITE_ONLY_SPECIFIED_DEVS this was an oversight: rebalance is moving data to a specific device, so we don't want it falling back to the full filesystem Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:35:20 -04:00
Kent Overstreet	a977f3e162	bcachefs: BCH_WRITE_ALLOC_NOWAIT no longer applies to open bucket allocation rebalance writes must be BCH_WRITE_ALLOC_NOWAIT because they don't allocate from the full filesystem - but we don't want spurious allocation failures due to open buckets. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:35:20 -04:00
Kent Overstreet	2e95497e81	bcachefs: fix prototype to bch2_alloc_sectors_start_trans() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:35:20 -04:00
Kent Overstreet	da2d20c98d	bcachefs: kill redundant is_vmalloc_addr() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:35:20 -04:00
Kent Overstreet	af05633d40	bcachefs: convert __bch2_encrypt_bio() to darray like the previous patch, kill use of bare arrays; the encryption code likes to work in big batches, so this is a small performance improvement. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:35:20 -04:00
Kent Overstreet	b7d8092a1b	bcachefs: do_encrypt() now handles allocation failures convert to darray, and add a fallback when allocation fails Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:35:20 -04:00
Kent Overstreet	3340dee235	bcachefs: Add pinned to btree cache not freed counters Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:35:19 -04:00
Linus Torvalds	2004cef11e	In the v6.12 scheduler development cycle we had 63 commits from 18 contributors: - Implement the SCHED_DEADLINE server infrastructure - Daniel Bristot de Oliveira's last major contribution to the kernel: "SCHED_DEADLINE servers can help fixing starvation issues of low priority tasks (e.g., SCHED_OTHER) when higher priority tasks monopolize CPU cycles. Today we have RT Throttling; DEADLINE servers should be able to replace and improve that." (Daniel Bristot de Oliveira, Peter Zijlstra, Joel Fernandes, Youssef Esmat, Huang Shijie) - Preparatory changes for sched_ext integration: - Use set_next_task(.first) where required - Fix up set_next_task() implementations - Clean up DL server vs. core sched - Split up put_prev_task_balance() - Rework pick_next_task() - Combine the last put_prev_task() and the first set_next_task() - Rework dl_server - Add put_prev_task(.next) (Peter Zijlstra, with a fix by Tejun Heo) - Complete the EEVDF transition and refine EEVDF scheduling: - Implement delayed dequeue - Allow shorter slices to wakeup-preempt - Use sched_attr::sched_runtime to set request/slice suggestion - Document the new feature flags - Remove unused and duplicate-functionality fields - Simplify & unify pick_next_task_fair() - Misc debuggability enhancements (Peter Zijlstra, with fixes/cleanups by Dietmar Eggemann, Valentin Schneider and Chuyi Zhou) - Initialize the vruntime of a new task when it is first enqueued, resulting in significant decrease in latency of newly woken tasks. (Zhang Qiao) - Introduce SM_IDLE and an idle re-entry fast-path in __schedule() (K Prateek Nayak, Peter Zijlstra) - Clean up and clarify the usage of Clean up usage of rt_task() (Qais Yousef) - Preempt SCHED_IDLE entities in strict cgroup hierarchies (Tianchen Ding) - Clarify the documentation of time units for deadline scheduler parameters. (Christian Loehle) - Remove the HZ_BW chicken-bit feature flag introduced a year ago, the original change seems to be working fine. (Phil Auld) - Misc fixes and cleanups (Chen Yu, Dan Carpenter, Huang Shijie, Peilin He, Qais Yousefm and Vincent Guittot) Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmbr8qcRHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1gdbw/+Mj3zWfYP+dtUkfgrR2FClPAJoo1/9Dz0 LYD8XgYHu8rEJ0Aq+VbdkgYGUt9utvzUFPIxvWFDcldQl57KwhF4hp9Ir+PqJyYC NolQ1q8ddo1hnslxnEg6SgHVzQq/4FqMM0nDNUkQETCx6zTyFFeRf+q7o/2c2m5B uI9dSU1Wrx7XrXm2D3kB8+xP+ZRy+qhbFN5Pfuz96mhelfklylgKMfPzgAiCT/7T JTbQhQ2HdcCNgiLoSrWsHBDy2UYpouP4zb4jyd+lDQzhSUJrj3u4Xy4vVmuTKq+y sTgWlgKB+MTuh9UuJ4UYzSnMqg161UlMvtXeH84ABmAqDNGHRPtOKrrlcLtJ3D4x m1SPhNnsvpjOu2pH0XLIS8al3VUesWND5S+rucHRYSq6Nvhivf4MTvRJlicXXurL Mt2APnIlhGJuKBNWnmyZovVdtO0ZUUPlaZWfr3rCS4txAVo+HwWhsm3uhtTycQqN gazsCiuGh6Jds90ZqA/BvdLWG+DY8J0xLlV3ex4pCXuQ/HFrabVWTyThJsULhrZ2 5mTdWIsocPctNMO9/RHMy7vJI7G7ljgHEquWVn5kiGGzXhK6VwVwKAMpfgXGw+YA yVP6/M7a7g2yEzj69gXkcDa8k/kedMVquJ/G/8YhZM7u7sPqsMjpmaGsqsJRfnpT ChngAzap+kA= =TEC6 -----END PGP SIGNATURE----- Merge tag 'sched-core-2024-09-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: - Implement the SCHED_DEADLINE server infrastructure - Daniel Bristot de Oliveira's last major contribution to the kernel: "SCHED_DEADLINE servers can help fixing starvation issues of low priority tasks (e.g., SCHED_OTHER) when higher priority tasks monopolize CPU cycles. Today we have RT Throttling; DEADLINE servers should be able to replace and improve that." (Daniel Bristot de Oliveira, Peter Zijlstra, Joel Fernandes, Youssef Esmat, Huang Shijie) - Preparatory changes for sched_ext integration: - Use set_next_task(.first) where required - Fix up set_next_task() implementations - Clean up DL server vs. core sched - Split up put_prev_task_balance() - Rework pick_next_task() - Combine the last put_prev_task() and the first set_next_task() - Rework dl_server - Add put_prev_task(.next) (Peter Zijlstra, with a fix by Tejun Heo) - Complete the EEVDF transition and refine EEVDF scheduling: - Implement delayed dequeue - Allow shorter slices to wakeup-preempt - Use sched_attr::sched_runtime to set request/slice suggestion - Document the new feature flags - Remove unused and duplicate-functionality fields - Simplify & unify pick_next_task_fair() - Misc debuggability enhancements (Peter Zijlstra, with fixes/cleanups by Dietmar Eggemann, Valentin Schneider and Chuyi Zhou) - Initialize the vruntime of a new task when it is first enqueued, resulting in significant decrease in latency of newly woken tasks (Zhang Qiao) - Introduce SM_IDLE and an idle re-entry fast-path in __schedule() (K Prateek Nayak, Peter Zijlstra) - Clean up and clarify the usage of Clean up usage of rt_task() (Qais Yousef) - Preempt SCHED_IDLE entities in strict cgroup hierarchies (Tianchen Ding) - Clarify the documentation of time units for deadline scheduler parameters (Christian Loehle) - Remove the HZ_BW chicken-bit feature flag introduced a year ago, the original change seems to be working fine (Phil Auld) - Misc fixes and cleanups (Chen Yu, Dan Carpenter, Huang Shijie, Peilin He, Qais Yousefm and Vincent Guittot) * tag 'sched-core-2024-09-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (64 commits) sched/cpufreq: Use NSEC_PER_MSEC for deadline task cpufreq/cppc: Use NSEC_PER_MSEC for deadline task sched/deadline: Clarify nanoseconds in uapi sched/deadline: Convert schedtool example to chrt sched/debug: Fix the runnable tasks output sched: Fix sched_delayed vs sched_core kernel/sched: Fix util_est accounting for DELAY_DEQUEUE kthread: Fix task state in kthread worker if being frozen sched/pelt: Use rq_clock_task() for hw_pressure sched/fair: Move effective_cpu_util() and effective_cpu_util() in fair.c sched/core: Introduce SM_IDLE and an idle re-entry fast-path in __schedule() sched: Add put_prev_task(.next) sched: Rework dl_server sched: Combine the last put_prev_task() and the first set_next_task() sched: Rework pick_next_task() sched: Split up put_prev_task_balance() sched: Clean up DL server vs core sched sched: Fixup set_next_task() implementations sched: Use set_next_task(.first) where required sched/fair: Properly deactivate sched_delayed task upon class change ...	2024-09-19 15:55:58 +02:00
Linus Torvalds	2775df6e5e	vfs-6.12.folio -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZuQEvgAKCRCRxhvAZXjc ou77AQD3U1KjbdgzbUi6kaUmiiWOPhfYTlm8mho8dBjqvTCB+AD/XTWSFCWWhHB4 KyQZTbjRD81xmVNbKjASazp0EA6Ahwc= =gIsD -----END PGP SIGNATURE----- Merge tag 'vfs-6.12.folio' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs Pull vfs folio updates from Christian Brauner: "This contains work to port write_begin and write_end to rely on folios for various filesystems. This converts ocfs2, vboxfs, orangefs, jffs2, hostfs, fuse, f2fs, ecryptfs, ntfs3, nilfs2, reiserfs, minixfs, qnx6, sysv, ufs, and squashfs. After this series lands a bunch of the filesystems in this list do not mention struct page anymore" * tag 'vfs-6.12.folio' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (61 commits) Squashfs: Ensure all readahead pages have been used Squashfs: Rewrite and update squashfs_readahead_fragment() to not use page->index Squashfs: Update squashfs_readpage_block() to not use page->index Squashfs: Update squashfs_readahead() to not use page->index Squashfs: Update page_actor to not use page->index jffs2: Use a folio in jffs2_garbage_collect_dnode() jffs2: Convert jffs2_do_readpage_nolock to take a folio buffer: Convert __block_write_begin() to take a folio ocfs2: Convert ocfs2_write_zero_page to use a folio fs: Convert aops->write_begin to take a folio fs: Convert aops->write_end to take a folio vboxsf: Use a folio in vboxsf_write_end() orangefs: Convert orangefs_write_begin() to use a folio orangefs: Convert orangefs_write_end() to use a folio jffs2: Convert jffs2_write_begin() to use a folio jffs2: Convert jffs2_write_end() to use a folio hostfs: Convert hostfs_write_end() to use a folio fuse: Convert fuse_write_begin() to use a folio fuse: Convert fuse_write_end() to use a folio f2fs: Convert f2fs_write_begin() to use a folio ...	2024-09-16 08:54:30 +02:00
Linus Torvalds	8f72c31f45	vfs-6.12.misc -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZuQEGwAKCRCRxhvAZXjc ojIuAQC433+hBkvjvmQ7H0r5rgZSjUuCTG3bSmdU7RJmPHUHhwEA85v/NGq53f+W IhandK6t+Cf0JYpFZ3N0bT88hDYVhQQ= =9zGL -----END PGP SIGNATURE----- Merge tag 'vfs-6.12.misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "This contains the usual pile of misc updates: Features: - Add F_CREATED_QUERY fcntl() that allows userspace to query whether a file was actually created. Often userspace wants to know whether an O_CREATE request did actually create a file without using O_EXCL. The current logic is that to first attempts to open the file without O_CREAT \| O_EXCL and if ENOENT is returned userspace tries again with both flags. If that succeeds all is well. If it now reports EEXIST it retries. That works fairly well but some corner cases make this more involved. If this operates on a dangling symlink the first openat() without O_CREAT \| O_EXCL will return ENOENT but the second openat() with O_CREAT \| O_EXCL will fail with EEXIST. The reason is that openat() without O_CREAT \| O_EXCL follows the symlink while O_CREAT \| O_EXCL doesn't for security reasons. So it's not something we can really change unless we add an explicit opt-in via O_FOLLOW which seems really ugly. All available workarounds are really nasty (fanotify, bpf lsm etc) so add a simple fcntl(). - Try an opportunistic lookup for O_CREAT. Today, when opening a file we'll typically do a fast lookup, but if O_CREAT is set, the kernel always takes the exclusive inode lock. This was likely done with the expectation that O_CREAT means that we always expect to do the create, but that's often not the case. Many programs set O_CREAT even in scenarios where the file already exists (see related F_CREATED_QUERY patch motivation above). The series contained in the pr rearranges the pathwalk-for-open code to also attempt a fast_lookup in certain O_CREAT cases. If a positive dentry is found, the inode_lock can be avoided altogether and it can stay in rcuwalk mode for the last step_into. - Expose the 64 bit mount id via name_to_handle_at() Now that we provide a unique 64-bit mount ID interface in statx(2), we can now provide a race-free way for name_to_handle_at(2) to provide a file handle and corresponding mount without needing to worry about racing with /proc/mountinfo parsing or having to open a file just to do statx(2). While this is not necessary if you are using AT_EMPTY_PATH and don't care about an extra statx(2) call, users that pass full paths into name_to_handle_at(2) need to know which mount the file handle comes from (to make sure they don't try to open_by_handle_at a file handle from a different filesystem) and switching to AT_EMPTY_PATH would require allocating a file for every name_to_handle_at(2) call - Add a per dentry expire timeout to autofs There are two fairly well known automounter map formats, the autofs format and the amd format (more or less System V and Berkley). Some time ago Linux autofs added an amd map format parser that implemented a fair amount of the amd functionality. This was done within the autofs infrastructure and some functionality wasn't implemented because it either didn't make sense or required extra kernel changes. The idea was to restrict changes to be within the existing autofs functionality as much as possible and leave changes with a wider scope to be considered later. One of these changes is implementing the amd options: 1) "unmount", expire this mount according to a timeout (same as the current autofs default). 2) "nounmount", don't expire this mount (same as setting the autofs timeout to 0 except only for this specific mount) . 3) "utimeout=<seconds>", expire this mount using the specified timeout (again same as setting the autofs timeout but only for this mount) To implement these options per-dentry expire timeouts need to be implemented for autofs indirect mounts. This is because all map keys (mounts) for autofs indirect mounts use an expire timeout stored in the autofs mount super block info. structure and all indirect mounts use the same expire timeout. Fixes: - Fix missing fput for FSCONFIG_SET_FD in autofs - Use param->file for FSCONFIG_SET_FD in coda - Delete the 'fs/netfs' proc subtreee when netfs module exits - Make sure that struct uid_gid_map fits into a single cacheline - Don't flush in-flight wb switches for superblocks without cgroup writeback - Correcting the idmapping mount example in the idmapping documentation - Fix a race between evice_inodes() and find_inode() and iput() - Refine the show_inode_state() macro definition in writeback code - Prevent dump_mapping() from accessing invalid dentry.d_name.name - Show actual source for debugfs in /proc/mounts - Annotate data-race of busy_poll_usecs in eventpoll - Don't WARN for racy path_noexec check in exec code - Handle OOM on mnt_warn_timestamp_expiry() - Fix some spelling in the iomap design documentation - Fix typo in procfs comment - Fix typo in fs/namespace.c comment Cleanups: - Add the VFS git tree to the MAINTAINERS file - Move FMODE_UNSIGNED_OFFSET to fop_flags freeing up another f_mode bit in struct file bringing us to 5 free f_mode bits - Remove the __I_DIO_WAKEUP bit from i_state flags as we can simplify the wait mechanism - Remove the unused path_put_init() helper - Replace a __u32 with u32 for s_fsnotify_mask as __u32 is uapi specific - Replace the unsigned long i_state member with a u32 i_state member in struct inode freeing up 4 bytes in struct inode. Instead of using the bit based wait apis we're now using the var event apis and using the individual bytes of the i_state member to wait on state changes - Explain how per-syscall AT_* flags should be allocated - Use in_group_or_capable() helper to simplify the posix acl mode update code - Switch to LIST_HEAD() in fsync_buffers_list() to simplify the code - Removed comment about d_rcu_to_refcount() as that function doesn't exist anymore - Add kernel documentation for lookup_fast() - Don't re-zero evenpoll fields - Remove outdated comment after close_fd() - Fix imprecise wording in comment about the pipe filesystem - Drop GFP_NOFAIL mode from alloc_page_buffers - Missing blank line warnings and struct declaration improved in file_table - Annotate struct poll_list with __counted_by() - Remove the unused read parameter in percpu-rwsem - Remove linux/prefetch.h include from direct-io code - Use kmemdup_array instead of kmemdup for multiple allocation in mnt_idmapping code - Remove unused mnt_cursor_del() declaration Performance tweaks: - Dodge smp_mb in break_lease and break_deleg in the common case - Only read fops once in fops_{get,put}() - Use RCU in ilookup() - Elide smp_mb in iversion handling in the common case - Drop one lock trip in evict()" * tag 'vfs-6.12.misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (58 commits) uidgid: make sure we fit into one cacheline proc: Fix typo in the comment fs/pipe: Correct imprecise wording in comment fhandle: expose u64 mount id to name_to_handle_at(2) uapi: explain how per-syscall AT_* flags should be allocated fs: drop GFP_NOFAIL mode from alloc_page_buffers writeback: Refine the show_inode_state() macro definition fs/inode: Prevent dump_mapping() accessing invalid dentry.d_name.name mnt_idmapping: Use kmemdup_array instead of kmemdup for multiple allocation netfs: Delete subtree of 'fs/netfs' when netfs module exits fs: use LIST_HEAD() to simplify code inode: make i_state a u32 inode: port __I_LRU_ISOLATING to var event vfs: fix race between evice_inodes() and find_inode()&iput() inode: port __I_NEW to var event inode: port __I_SYNC to var event fs: reorder i_state bits fs: add i_state helpers MAINTAINERS: add the VFS git tree fs: s/__u32/u32/ for s_fsnotify_mask ...	2024-09-16 08:35:09 +02:00
Thorsten Blum	fa1ab1b466	bcachefs: Annotate bch_replicas_entry_{v0,v1} with __counted_by() Add the __counted_by compiler attribute to the flexible array members devs to improve access bounds-checking via CONFIG_UBSAN_BOUNDS and CONFIG_FORTIFY_SOURCE. Increment nr_devs before adding a new device to the devs array and adjust the array indexes accordingly. Add a helper macro for adding a new device. In bch2_journal_read(), explicitly set nr_devs to 0. Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-09 09:41:49 -04:00
Hongbo Li	c24adfa0df	bcachefs: support idmap mounts We enable idmapped mounts for bcachefs. Here, we just pass down the user_namespace argument from the VFS methods to the relevant helpers. The idmap test in bcachefs is as following: ``` 1. losetup /dev/loop1 bcachefs.img 2. ./bcachefs format /dev/loop1 3. mount -t bcachefs /dev/loop1 /mnt/bcachefs/ 4. ./mount-idmapped --map-mount b:0:1000:1 /mnt/bcachefs /mnt/idmapped1/ ll /mnt/bcachefs total 2 drwx------. 2 root root 0 Jun 14 14:10 lost+found -rw-r--r--. 1 root root 1945 Jun 14 14:12 profile ll /mnt/idmapped1/ total 2 drwx------. 2 1000 1000 0 Jun 14 14:10 lost+found -rw-r--r--. 1 1000 1000 1945 Jun 14 14:12 profile Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-09 09:41:49 -04:00
Thorsten Blum	86e92eeeb2	bcachefs: Annotate struct bch_xattr with __counted_by() Add the __counted_by compiler attribute to the flexible array member x_name to improve access bounds-checking via CONFIG_UBSAN_BOUNDS and CONFIG_FORTIFY_SOURCE. Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-09 09:41:49 -04:00
Kent Overstreet	2c6a7bff2a	bcachefs: Switch gc bucket array to a genradix A user with a 30 tb device is overflowing the INT_MAX limit on vmalloc allocations... Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-09 09:41:49 -04:00
Kent Overstreet	a803fa551d	bcachefs: darray: convert to alloc_hooks() better memory allocation profiling support Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-09 09:41:49 -04:00
Chen Yufan	848c3ff882	bcachefs: Convert to use jiffies macros Use jiffies macros instead of using jiffies directly to handle wraparound. Signed-off-by: Chen Yufan <chenyufan@vivo.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-09 09:41:49 -04:00
Alan Huang	94932a0842	bcachefs: Refactor bch2_bset_fix_lookup_table bch2_bset_fix_lookup_table is too complicated to be easily understood, the comment "l now > where" there is also incorrect when where == t->end_offset. This patch therefore refactor the function, the idea is that when where >= rw_aux_tree(b, t)[t->size - 1].offset, we don't need to adjust the rw aux tree. Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-09 09:41:49 -04:00
Kent Overstreet	f1625637b8	bcachefs: Assert that we don't lock nodes when !trans->locked We rely on the trans->locked to know if a trans has nodes locked for assertions about deadlocks; there can't be more than one trans in the same process that is locked. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-09 09:41:49 -04:00
Matthew Wilcox (Oracle)	a8cdf0ff46	bcachefs: Do not check folio_has_private() folio_has_private() is an attractive nuisance; filesystem authors generally don't realise that it actually checks two flags (one of which is never set by bcachefs). There's no need to check the private flag at all; for folios owned by bcachefs, we know that folio->private is NULL when the private flag is clear and non-NULL when the private flag is set. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-09 09:41:49 -04:00
Kent Overstreet	fdbc9c390a	bcachefs: bch2_time_stats_reset() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-09 09:41:49 -04:00
Kent Overstreet	b36f679c99	bcachefs: Drop memalloc_nofs_save() in bch2_btree_node_mem_alloc() It's really not needed: the only locks used here are the btree cache lock, which we drop for GFP_WAIT allocations, and btree node locks - but we also drop those for GFP_WAIT allocations. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-09 09:41:49 -04:00
Youling Tang	42386fbaee	bcachefs: Simplify bch2_xattr_emit() implementation Use helper functions to make code more readable. Similar to commit `a5488f2983` ("fs: simplify ->listxattr() implementation") Signed-off-by: Youling Tang <tangyouling@kylinos.cn> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-09 09:41:49 -04:00
Youling Tang	d3f30f1629	bcachefs: drop unused posix acl handlers Remove struct nop_posix_acl_{access,default} for bcachefs filesystem that don't depend on the xattr handler in their inode->i_op->listxattr() method in any way. There's nothing more to do than to simply remove the handler. It's been effectively unused ever since we introduced the new posix acl api. See [1] for details. Link [1]: https://patchwork.kernel.org/project/linux-fsdevel/cover/20230125-fs-acl-remove-generic-xattr-handlers-v3-0-f760cc58967d@kernel.org/ Signed-off-by: Youling Tang <tangyouling@kylinos.cn> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-09 09:41:49 -04:00

... 11 12 13 14 15 ...

5280 Commits