In attempt_compress(), the return value of zlib_deflateInit2() needs to be
checked. A proper implementation can be found in pstore_compress().
Add an error check and return 0 immediately if the initialization fails.
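A minimal sketch of the check (the surrounding context in
attempt_compress() is assumed, and the exact arguments may differ):

    ret = zlib_deflateInit2(&strm, level, Z_DEFLATED, -MAX_WBITS,
                            DEF_MEM_LEVEL, Z_DEFAULT_STRATEGY);
    if (ret != Z_OK)
        return 0;   /* treat failed init as "did not compress" */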
Fixes: 986e9842fb ("bcachefs: Compression levels")
Signed-off-by: Wentao Liang <vulab@iscas.ac.cn>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When CONFIG_XARRAY_MULTI is not set, reading from a bcachefs file hits
the 'BUG_ON(order > 0);' in xas_set_order(), because it tries to insert
a large folio in the page cache. Fix this by making bcachefs select
XARRAY_MULTI.
Fixes: be212d86b1 ("bcachefs: bs > ps support")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
All the fastpaths that need device usage don't need the sector totals or
fragmentation, just bucket counts.
Split bch_dev_usage into two versions: the normal one with just bucket
counts, and a full version that also includes sector totals and
fragmentation.
This is also a stack usage improvement, since we have a bch_dev_usage on
the stack in the allocation path.
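A sketch of the shape of the split (names illustrative, not the exact
bcachefs definitions):

    struct bch_dev_usage {                  /* fast path */
        u64 buckets[BCH_DATA_NR];           /* bucket counts only */
    };

    struct bch_dev_usage_full {             /* slow path */
        struct {
            u64 buckets;
            u64 sectors;                    /* sector totals */
            u64 fragmented;                 /* fragmentation */
        } d[BCH_DATA_NR];
    };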
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This was planned ages ago and is now finally completed; there are
places where we have quite a few btree_trans objects on the stack, so
this reduces stack usage somewhat.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We now have separate per device io_refs for read and write access.
This fixes a device removal bug where the discard workers were still
running while we're removing alloc info for that device.
It's also a bit of hardening; we no longer allow writes to devices that
are read-only.
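Illustrative shape of the change (hypothetical, simplified):

    /* one ref per direction, indexed by the kernel's READ/WRITE: */
    struct percpu_ref io_ref[2];

    /* writers must take the WRITE ref, which a read-only device
     * never grants: */
    if (!percpu_ref_tryget(&ca->io_ref[WRITE]))
        return -EROFS;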
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
- The series "Enable strict percpu address space checks" from Uros
Bizjak uses x86 named address space qualifiers to provide
compile-time checking of percpu area accesses.
This has caused a small amount of fallout - two or three issues were
reported. In all cases the calling code was found to be incorrect.
- The 4 patch series "Some cleanup for memcg" from Chen Ridong
implements some relatively minor cleanups for the memcontrol code.
- The 17 patch series "mm: fixes for device-exclusive entries (hmm)"
from David Hildenbrand fixes a boatload of issues which David found when
using device-exclusive PTE entries when THP is enabled. More work is
needed, but this makes things better - our own HMM selftests now succeed.
- The 2 patch series "mm: zswap: remove z3fold and zbud" from Yosry
Ahmed removes the z3fold and zbud implementations. They have been
deprecated for half a year and nobody has complained.
- The 5 patch series "mm: further simplify VMA merge operation" from
Lorenzo Stoakes implements numerous simplifications in this area. No
runtime effects are anticipated.
- The 4 patch series "mm/madvise: remove redundant mmap_lock operations
from process_madvise()" from SeongJae Park rationalizes the locking in
the madvise() implementation. Performance gains of 20-25% were observed
in one MADV_DONTNEED microbenchmark.
- The 12 patch series "Tiny cleanup and improvements about SWAP code"
from Baoquan He contains a number of touchups to issues which Baoquan
noticed when working on the swap code.
- The 2 patch series "mm: kmemleak: Usability improvements" from Catalin
Marinas implements a couple of improvements to the kmemleak user-visible
output.
- The 2 patch series "mm/damon/paddr: fix large folios access and
schemes handling" from Usama Arif provides a couple of fixes for DAMON's
handling of large folios.
- The 3 patch series "mm/damon/core: fix wrong and/or useless
damos_walk() behaviors" from SeongJae Park fixes a few issues with the
accuracy of kdamond's walking of DAMON regions.
- The 3 patch series "expose mapping wrprotect, fix fb_defio use" from
Lorenzo Stoakes changes the interaction between framebuffer deferred-io
and core MM. No functional changes are anticipated - this is
preparatory work for the future removal of page structure fields.
- The 4 patch series "mm/damon: add support for hugepage_size DAMOS
filter" from Usama Arif adds a DAMOS filter which permits the filtering
by huge page sizes.
- The 4 patch series "mm: permit guard regions for file-backed/shmem
mappings" from Lorenzo Stoakes extends the guard region feature from its
present "anon mappings only" state. The feature now covers shmem and
file-backed mappings.
- The 4 patch series "mm: batched unmap lazyfree large folios during
reclamation" from Barry Song cleans up and speeds up the unmapping for
pte-mapped large folios.
- The 18 patch series "reimplement per-vma lock as a refcount" from
Suren Baghdasaryan puts the vm_lock back into the vma. Our reasons for
pulling it out were largely bogus and that change made the code more
messy. This patchset provides small (0-10%) improvements on one
microbenchmark.
- The 5 patch series "Docs/mm/damon: misc DAMOS filters documentation
fixes and improves" from SeongJae Park does some maintenance work on the
DAMON docs.
- The 27 patch series "hugetlb/CMA improvements for large systems" from
Frank van der Linden addresses a pile of issues which have been observed
when using CMA on large machines.
- The 2 patch series "mm/damon: introduce DAMOS filter type for unmapped
pages" from SeongJae Park enables users of DMAON/DAMOS to filter my the
page's mapped/unmapped status.
- The 19 patch series "zsmalloc/zram: there be preemption" from Sergey
Senozhatsky teaches zram to run its compression and decompression
operations preemptibly.
- The 12 patch series "selftests/mm: Some cleanups from trying to run
them" from Brendan Jackman fixes a pile of unrelated issues which
Brendan encountered while running our selftests.
- The 2 patch series "fs/proc/task_mmu: add guard region bit to pagemap"
from Lorenzo Stoakes permits userspace to use /proc/pid/pagemap to
determine whether a particular page is a guard page.
- The 7 patch series "mm, swap: remove swap slot cache" from Kairui Song
removes the swap slot cache from the allocation path - it simply wasn't
being effective.
- The 5 patch series "mm: cleanups for device-exclusive entries (hmm)"
from David Hildenbrand implements a number of unrelated cleanups in this
code.
- The 5 patch series "mm: Rework generic PTDUMP configs" from Anshuman
Khandual implements a number of preparatory cleanups to the
GENERIC_PTDUMP Kconfig logic.
- The 8 patch series "mm/damon: auto-tune aggregation interval" from
SeongJae Park implements a feedback-driven automatic tuning feature for
DAMON's aggregation interval tuning.
- The 5 patch series "Fix lazy mmu mode" from Ryan Roberts fixes some
issues in powerpc, sparc and x86 lazy MMU implementations. Ryan did
this in preparation for implementing lazy mmu mode for arm64 to optimize
vmalloc.
- The 2 patch series "mm/page_alloc: Some clarifications for migratetype
fallback" from Brendan Jackman reworks some commentary to make the code
easier to follow.
- The 3 patch series "page_counter cleanup and size reduction" from
Shakeel Butt cleans up the page_counter code and fixes a size increase
which we accidentally added late last year.
- The 3 patch series "Add a command line option that enables control of
how many threads should be used to allocate huge pages" from Thomas
Prescher does that. It allows the careful operator to significantly
reduce boot time by tuning the parallelization of huge page
initialization.
- The 3 patch series "Fix calculations in trace_balance_dirty_pages()
for cgwb" from Tang Yizhou fixes the tracing output from the dirty page
balancing code.
- The 9 patch series "mm/damon: make allow filters after reject filters
useful and intuitive" from SeongJae Park improves the handling of allow
and reject filters. Behaviour is made more consistent and the
documentation is updated accordingly.
- The 5 patch series "Switch zswap to object read/write APIs" from Yosry
Ahmed updates zswap to the new object read/write APIs and thus permits
the removal of some legacy code from zpool and zsmalloc.
- The 6 patch series "Some trivial cleanups for shmem" from Baolin Wang
does as it claims.
- The 20 patch series "fs/dax: Fix ZONE_DEVICE page reference counts"
from Alistair Popple regularizes the weird ZONE_DEVICE page refcount
handling in DAX, permitting the removal of a number of special-case
checks.
- The 4 patch series "refactor mremap and fix bug" from Lorenzo Stoakes
is a preparatory refactoring and cleanup of the mremap() code.
- The 20 patch series "mm: MM owner tracking for large folios (!hugetlb)
+ CONFIG_NO_PAGE_MAPCOUNT" from David Hildenbrand reworks the manner in
which we determine whether a large folio is known to be mapped
exclusively into a single MM.
- The 8 patch series "mm/damon: add sysfs dirs for managing DAMOS
filters based on handling layers" from SeongJae Park adds a couple of
new sysfs directories to ease the management of DAMON/DAMOS filters.
- The 13 patch series "arch, mm: reduce code duplication in mem_init()"
from Mike Rapoport consolidates many per-arch implementations of
mem_init() into generic code, where that is practical.
- The 13 patch series "mm/damon/sysfs: commit parameters online via
damon_call()" from SeongJae Park continues the cleaning up of sysfs
access to DAMON internal data.
- The 3 patch series "mm: page_ext: Introduce new iteration API" from
Luiz Capitulino reworks the page_ext initialization to fix a boot-time
crash which was observed with an unusual combination of compile and
cmdline options.
- The 8 patch series "Buddy allocator like (or non-uniform) folio split"
from Zi Yan reworks the code to split a folio into smaller folios. The
main benefit is lessened memory consumption: fewer post-split folios are
generated.
- The 2 patch series "Minimize xa_node allocation during xarry split"
from Zi Yan reduces the number of xarray xa_nodes which are generated
during an xarray split.
- The 2 patch series "drivers/base/memory: Two cleanups" from Gavin Shan
performs some maintenance work on the drivers/base/memory code.
- The 3 patch series "Add tracepoints for lowmem reserves, watermarks
and totalreserve_pages" from Martin Liu adds some more tracepoints to
the page allocator code.
- The 4 patch series "mm/madvise: cleanup requests validations and
classifications" from SeongJae Park cleans up some warts which SeongJae
observed during his earlier madvise work.
- The 3 patch series "mm/hwpoison: Fix regressions in memory failure
handling" from Shuai Xue addresses two quite serious regressions which
Shuai has observed in the memory-failure implementation.
- The 5 patch series "mm: reliable huge page allocator" from Johannes
Weiner makes huge page allocations cheaper and more reliable by reducing
fragmentation.
- The 5 patch series "Minor memcg cleanups & prep for memdescs" from
Matthew Wilcox is preparatory work for the future implementation of
memdescs.
- The 4 patch series "track memory used by balloon drivers" from Nico
Pache introduces a way to track memory used by our various balloon
drivers.
- The 2 patch series "mm/damon: introduce DAMOS filter type for active
pages" from Nhat Pham permits users to filter for active/inactive pages,
separately for file and anon pages.
- The 2 patch series "Adding Proactive Memory Reclaim Statistics" from
Hao Jia separates the proactive reclaim statistics from the direct
reclaim statistics.
- The 2 patch series "mm/vmscan: don't try to reclaim hwpoison folio"
from Jinjiang Tu fixes our handling of hwpoisoned pages within the
reclaim code.
Merge tag 'mm-stable-2025-03-30-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- The series "Enable strict percpu address space checks" from Uros
Bizjak uses x86 named address space qualifiers to provide
compile-time checking of percpu area accesses.
This has caused a small amount of fallout - two or three issues were
reported. In all cases the calling code was found to be incorrect.
- The series "Some cleanup for memcg" from Chen Ridong implements some
relatively minor cleanups for the memcontrol code.
- The series "mm: fixes for device-exclusive entries (hmm)" from David
Hildenbrand fixes a boatload of issues which David found when using
device-exclusive PTE entries when THP is enabled. More work is
needed, but this makes things better - our own HMM selftests now
succeed.
- The series "mm: zswap: remove z3fold and zbud" from Yosry Ahmed
removes the z3fold and zbud implementations. They have been deprecated
for half a year and nobody has complained.
- The series "mm: further simplify VMA merge operation" from Lorenzo
Stoakes implements numerous simplifications in this area. No runtime
effects are anticipated.
- The series "mm/madvise: remove redundant mmap_lock operations from
process_madvise()" from SeongJae Park rationalizes the locking in the
madvise() implementation. Performance gains of 20-25% were observed
in one MADV_DONTNEED microbenchmark.
- The series "Tiny cleanup and improvements about SWAP code" from
Baoquan He contains a number of touchups to issues which Baoquan
noticed when working on the swap code.
- The series "mm: kmemleak: Usability improvements" from Catalin
Marinas implements a couple of improvements to the kmemleak
user-visible output.
- The series "mm/damon/paddr: fix large folios access and schemes
handling" from Usama Arif provides a couple of fixes for DAMON's
handling of large folios.
- The series "mm/damon/core: fix wrong and/or useless damos_walk()
behaviors" from SeongJae Park fixes a few issues with the accuracy of
kdamond's walking of DAMON regions.
- The series "expose mapping wrprotect, fix fb_defio use" from Lorenzo
Stoakes changes the interaction between framebuffer deferred-io and
core MM. No functional changes are anticipated - this is preparatory
work for the future removal of page structure fields.
- The series "mm/damon: add support for hugepage_size DAMOS filter"
from Usama Arif adds a DAMOS filter which permits the filtering by
huge page sizes.
- The series "mm: permit guard regions for file-backed/shmem mappings"
from Lorenzo Stoakes extends the guard region feature from its
present "anon mappings only" state. The feature now covers shmem and
file-backed mappings.
- The series "mm: batched unmap lazyfree large folios during
reclamation" from Barry Song cleans up and speeds up the unmapping
for pte-mapped large folios.
- The series "reimplement per-vma lock as a refcount" from Suren
Baghdasaryan puts the vm_lock back into the vma. Our reasons for
pulling it out were largely bogus and that change made the code more
messy. This patchset provides small (0-10%) improvements on one
microbenchmark.
- The series "Docs/mm/damon: misc DAMOS filters documentation fixes and
improves" from SeongJae Park does some maintenance work on the DAMON
docs.
- The series "hugetlb/CMA improvements for large systems" from Frank
van der Linden addresses a pile of issues which have been observed
when using CMA on large machines.
- The series "mm/damon: introduce DAMOS filter type for unmapped pages"
from SeongJae Park enables users of DAMON/DAMOS to filter by the
page's mapped/unmapped status.
- The series "zsmalloc/zram: there be preemption" from Sergey
Senozhatsky teaches zram to run its compression and decompression
operations preemptibly.
- The series "selftests/mm: Some cleanups from trying to run them" from
Brendan Jackman fixes a pile of unrelated issues which Brendan
encountered while running our selftests.
- The series "fs/proc/task_mmu: add guard region bit to pagemap" from
Lorenzo Stoakes permits userspace to use /proc/pid/pagemap to
determine whether a particular page is a guard page.
- The series "mm, swap: remove swap slot cache" from Kairui Song
removes the swap slot cache from the allocation path - it simply
wasn't being effective.
- The series "mm: cleanups for device-exclusive entries (hmm)" from
David Hildenbrand implements a number of unrelated cleanups in this
code.
- The series "mm: Rework generic PTDUMP configs" from Anshuman Khandual
implements a number of preparatory cleanups to the GENERIC_PTDUMP
Kconfig logic.
- The series "mm/damon: auto-tune aggregation interval" from SeongJae
Park implements a feedback-driven automatic tuning feature for
DAMON's aggregation interval tuning.
- The series "Fix lazy mmu mode" from Ryan Roberts fixes some issues in
powerpc, sparc and x86 lazy MMU implementations. Ryan did this in
preparation for implementing lazy mmu mode for arm64 to optimize
vmalloc.
- The series "mm/page_alloc: Some clarifications for migratetype
fallback" from Brendan Jackman reworks some commentary to make the
code easier to follow.
- The series "page_counter cleanup and size reduction" from Shakeel
Butt cleans up the page_counter code and fixes a size increase which
we accidentally added late last year.
- The series "Add a command line option that enables control of how
many threads should be used to allocate huge pages" from Thomas
Prescher does that. It allows the careful operator to significantly
reduce boot time by tuning the parallelization of huge page
initialization.
- The series "Fix calculations in trace_balance_dirty_pages() for cgwb"
from Tang Yizhou fixes the tracing output from the dirty page
balancing code.
- The series "mm/damon: make allow filters after reject filters useful
and intuitive" from SeongJae Park improves the handling of allow and
reject filters. Behaviour is made more consistent and the documentation
is updated accordingly.
- The series "Switch zswap to object read/write APIs" from Yosry Ahmed
updates zswap to the new object read/write APIs and thus permits the
removal of some legacy code from zpool and zsmalloc.
- The series "Some trivial cleanups for shmem" from Baolin Wang does as
it claims.
- The series "fs/dax: Fix ZONE_DEVICE page reference counts" from
Alistair Popple regularizes the weird ZONE_DEVICE page refcount
handling in DAX, permitting the removal of a number of special-case
checks.
- The series "refactor mremap and fix bug" from Lorenzo Stoakes is a
preparatory refactoring and cleanup of the mremap() code.
- The series "mm: MM owner tracking for large folios (!hugetlb) +
CONFIG_NO_PAGE_MAPCOUNT" from David Hildenbrand reworks the manner in
which we determine whether a large folio is known to be mapped
exclusively into a single MM.
- The series "mm/damon: add sysfs dirs for managing DAMOS filters based
on handling layers" from SeongJae Park adds a couple of new sysfs
directories to ease the management of DAMON/DAMOS filters.
- The series "arch, mm: reduce code duplication in mem_init()" from
Mike Rapoport consolidates many per-arch implementations of
mem_init() into generic code, where that is practical.
- The series "mm/damon/sysfs: commit parameters online via
damon_call()" from SeongJae Park continues the cleaning up of sysfs
access to DAMON internal data.
- The series "mm: page_ext: Introduce new iteration API" from Luiz
Capitulino reworks the page_ext initialization to fix a boot-time
crash which was observed with an unusual combination of compile and
cmdline options.
- The series "Buddy allocator like (or non-uniform) folio split" from
Zi Yan reworks the code to split a folio into smaller folios. The
main benefit is lessened memory consumption: fewer post-split folios
are generated.
- The series "Minimize xa_node allocation during xarry split" from Zi
Yan reduces the number of xarray xa_nodes which are generated during
an xarray split.
- The series "drivers/base/memory: Two cleanups" from Gavin Shan
performs some maintenance work on the drivers/base/memory code.
- The series "Add tracepoints for lowmem reserves, watermarks and
totalreserve_pages" from Martin Liu adds some more tracepoints to the
page allocator code.
- The series "mm/madvise: cleanup requests validations and
classifications" from SeongJae Park cleans up some warts which
SeongJae observed during his earlier madvise work.
- The series "mm/hwpoison: Fix regressions in memory failure handling"
from Shuai Xue addresses two quite serious regressions which Shuai
has observed in the memory-failure implementation.
- The series "mm: reliable huge page allocator" from Johannes Weiner
makes huge page allocations cheaper and more reliable by reducing
fragmentation.
- The series "Minor memcg cleanups & prep for memdescs" from Matthew
Wilcox is preparatory work for the future implementation of memdescs.
- The series "track memory used by balloon drivers" from Nico Pache
introduces a way to track memory used by our various balloon drivers.
- The series "mm/damon: introduce DAMOS filter type for active pages"
from Nhat Pham permits users to filter for active/inactive pages,
separately for file and anon pages.
- The series "Adding Proactive Memory Reclaim Statistics" from Hao Jia
separates the proactive reclaim statistics from the direct reclaim
statistics.
- The series "mm/vmscan: don't try to reclaim hwpoison folio" from
Jinjiang Tu fixes our handling of hwpoisoned pages within the reclaim
code.
* tag 'mm-stable-2025-03-30-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (431 commits)
mm/page_alloc: remove unnecessary __maybe_unused in order_to_pindex()
x86/mm: restore early initialization of high_memory for 32-bits
mm/vmscan: don't try to reclaim hwpoison folio
mm/hwpoison: introduce folio_contain_hwpoisoned_page() helper
cgroup: docs: add pswpin and pswpout items in cgroup v2 doc
mm: vmscan: split proactive reclaim statistics from direct reclaim statistics
selftests/mm: speed up split_huge_page_test
selftests/mm: uffd-unit-tests support for hugepages > 2M
docs/mm/damon/design: document active DAMOS filter type
mm/damon: implement a new DAMOS filter type for active pages
fs/dax: don't disassociate zero page entries
MM documentation: add "Unaccepted" meminfo entry
selftests/mm: add commentary about 9pfs bugs
fork: use __vmalloc_node() for stack allocation
docs/mm: Physical Memory: Populate the "Zones" section
xen: balloon: update the NR_BALLOON_PAGES state
hv_balloon: update the NR_BALLOON_PAGES state
balloon_compaction: update the NR_BALLOON_PAGES state
meminfo: add a per node counter for balloon drivers
mm: remove references to folio in __memcg_kmem_uncharge_page()
...
All bugfixes and logging improvements.
Minor merge conflict, see:
https://lore.kernel.org/linux-next/20250331092816.778a7c83@canb.auug.org.au/T/#u
CI says the fs-next tree is good:
https://evilpiepirate.org/~testdashboard/ci?user=fs-next&branch=master
Merge tag 'bcachefs-2025-03-31' of git://evilpiepirate.org/bcachefs
Pull more bcachefs updates from Kent Overstreet:
"All bugfixes and logging improvements"
* tag 'bcachefs-2025-03-31' of git://evilpiepirate.org/bcachefs: (35 commits)
bcachefs: fix bch2_write_point_to_text() units
bcachefs: Log original key being moved in data updates
bcachefs: BCH_JSET_ENTRY_log_bkey
bcachefs: Reorder error messages that include journal debug
bcachefs: Don't use designated initializers for disk_accounting_pos
bcachefs: Silence errors after emergency shutdown
bcachefs: fix units in rebalance_status
bcachefs: bch2_ioctl_subvolume_destroy() fixes
bcachefs: Clear fs_path_parent on subvolume unlink
bcachefs: Change btree_insert_node() assertion to error
bcachefs: Better printing of inconsistency errors
bcachefs: bch2_count_fsck_err()
bcachefs: Better helpers for inconsistency errors
bcachefs: Consistent indentation of multiline fsck errors
bcachefs: Add an "ignore unknown" option to bch2_parse_mount_opts()
bcachefs: bch2_time_stats_init_no_pcpu()
bcachefs: Fix bch2_fs_get_tree() error path
bcachefs: fix logging in journal_entry_err_msg()
bcachefs: add missing newline in bch2_trans_updates_to_text()
bcachefs: print_string_as_lines: fix extra newline
...
This was previously hard to hit since it requires racing with device
removal, but splitting up io_ref uncovered it.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
For striping across devices, we maintain "clocks", and we advance them
by the inverse of "how much free space this device has left", so that we
round robin biased in favor of devices with more free space.
This code was trying to do EWMA-ish stuff when it was originally
written, ~10 years ago, and was never properly cleaned up when it was
realized that an EWMA is not the right approach here.
That left a bug in the rescaling we do to keep all the clocks in the
correct range and prevent overflow.
It was assumed that we'd always be allocating from the device with the
smallest clock hand, but that's actually not correct: with the target
options, allocations will be first tried from a subset of devices, and
then the entire filesystem if that fails.
Thus, the rescale from the first allocation - allocating from a subset
of devices - can pick the wrong rescale value and cause the rest of the
clocks to go to 0, losing information.
This results in incorrect striping behaviour when the desired number of
replicas doesn't fit on the foreground target.
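A self-contained sketch of the corrected rescale (hypothetical and
simplified; the real code operates on per-device clock hands):

    static void rescale_clocks(u64 *hands, unsigned nr)
    {
        u64 min_hand = U64_MAX;
        unsigned i;

        /* scan ALL devices, not just the allocation subset: */
        for (i = 0; i < nr; i++)
            if (hands[i] < min_hand)
                min_hand = hands[i];

        /* subtracting the true minimum preserves relative order and
         * never spuriously forces another device's hand to 0: */
        for (i = 0; i < nr; i++)
            hands[i] -= min_hand;
    }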
Link: https://www.reddit.com/r/bcachefs/comments/1jn3t26/replica_allocation_not_evenly_distributed_among/
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
There's something going on with the data move path; log the original key
being moved for debugging.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add a journal entry type for logging - but logging a bkey, not a string;
to be used for data move path debugging.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Not all compilers fully initialize these - they're not guaranteed to
because of the union shenanigans.
Fixes: https://github.com/koverstreet/bcachefs/issues/844
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We don't care about errors from asynchronous ops that happened because
we did an emergency shutdown; silence them.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_evict_subvolume_inodes() was getting stuck - due to incorrectly
pruning the dcache.
Also, fix missing permissions checks.
Reported-by: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This fixes recursive subvolume removal.
Subvolume deletion is asynchronous; fs_path_parent, and thus the entry
in the subvolume_children btree, need to be cleared when the subvolume
is unlinked from the fs hierarchy - else we'll spuriously think a
subvolume has children and deletion will fail.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Build up and emit the error message for an inconsistency error all at
once, instead of spread over multiple printk calls, so they're not
jumbled in the dmesg log.
Also, add better indenting.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Factor out a helper from __bch2_fsck_err(), for counting the error in
the superblock and deciding whether to print or ratelimit - will be used
to replace some log_fsck_err() calls, where we want to lift out printing
the error message.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
An inconsistency error often happens as part of an event with multiple
error messages, and we want to build up one single error message with
proper indenting to produce more readable log messages that don't get
garbled.
Add new helpers that emit messages to a printbuf instead of printing
them directly; the next patch will convert to use them.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add the new helper printbuf_indent_add_nextline(), and use it in
__bch2_fsck_err() to centralize setting the indentation of multiline
fsck errors.
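Hedged usage sketch (assumes the indent takes effect from the next
newline onward):

    printbuf_indent_add_nextline(&buf, 2);
    prt_printf(&buf, "fsck error on some key\n");
    prt_printf(&buf, "continuation line\n");    /* indented by 2 */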
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
To be used by the mount helper in userspace, where we still have options
to be parsed by other layers.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add a mode to disable automatic switching to percpu mode, useful when a
time_stats will only be used by one thread and we don't want to have to
flush the percpu buffers.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When a filesystem is mounted read-only, subsequent attempts to mount it
as read-write fail with EBUSY. Previously, the error path in
bch2_fs_get_tree() would unconditionally call __bch2_fs_stop(),
improperly freeing resources for a filesystem that was still actively
mounted. This change modifies the error path to only call
__bch2_fs_stop() if the superblock has no valid root dentry, ensuring
resources are not cleaned up prematurely when the filesystem is in use.
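A sketch of the fixed error path (simplified; the exact teardown in
bch2_fs_get_tree() may differ):

    err:
        /* only tear the fs down if this mount never produced a root
         * dentry; otherwise it's still in use by an earlier mount: */
        if (!sb->s_root)
            __bch2_fs_stop(c);
        return ret;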
Signed-off-by: Florian Albrechtskirchinger <falbrechtskirchinger@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Don't print a newline on empty string; this was causing us to also print
an extra newline when we got to the end of the string.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
syzbot discovered that this one is possible: we have pointers, but none
of them are to valid devices.
Reported-by: syzbot+336a6e6a2dbb7d4dba9a@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The hole we find in the btree might be fully dirty in the page cache. If
so, keep searching.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We can't call bch2_seek_pagecache_hole(), and block on page locks, with
btree locks held.
This is easily fixed because we're at the end of the transaction - we
can just unlock, we don't need a drop_locks_do().
Reported-by: https://github.com/nagalun
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
state_lock guards against devices coming or leaving, changing state, or
the filesystem changing between ro <-> rw.
But it's not necessary for running recovery passes, and holding it
blocks asynchronous events that would cause us to go RO or kick out
devices.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
There's no reason for this not to be world readable - it provides the
currently supported on disk format version.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
On disk format is now soft frozen: no more required/automatic changes
are anticipated before taking off the experimental label.
Major changes/features since 6.14:
- Scrub
- Blocksize greater than page size support
- A number of "rebalance spinning and doing no work" issues have been
fixed; we now check if the write allocation will succeed in
bch2_data_update_init(), before kicking off the read.
There's still more work to do in this area. Later we may want to add
another bitset btree, like rebalance_work, to track "extents that
rebalance was requested to move but couldn't", e.g. due to destination
target having insufficient online devices.
- We can now support scaling well into the petabyte range: latest
bcachefs-tools will pick an appropriate bucket size at format time to
ensure fsck can run in available memory (e.g. a server with 256GB of
ram and 100PB of storage would want 16MB buckets).
On disk format changes:
- 1.21: cached backpointers (scalability improvement)
Cached replicas now get backpointers, which means we no longer rely on
incrementing bucket generation numbers to invalidate cached data: this
lets us get rid of the bucket generation number garbage collection,
which had to periodically rescan all extents to recompute bucket
oldest_gen.
Bucket generation numbers are now only used as a consistency check,
but they're quite useful for that.
- 1.22: stripe backpointers
Stripes now have backpointers: erasure coded stripes have their own
checksums, separate from the checksums for the extents they contain
(and stripe checksums also cover the parity blocks). This is required
for implementing scrub for stripes.
- 1.23: stripe lru (scalability improvement)
Persistent lru for stripes, ordered by "number of empty blocks". This
is used by the stripe creation path, which depending on free space
may create a new stripe out of a partially empty existing stripe
instead of starting a brand new stripe.
This replaces an in-memory heap, and means we no longer have to read
in the stripes btree at startup.
- 1.24: casefolding
Case insensitive directory support, courtesy of Valve.
This is an incompatible feature, to enable mount with
-o version_upgrade=incompatible
- 1.25: extent_flags
Another incompatible feature requiring explicit opt-in to enable.
This adds a flags entry to extents, and a flag bit that marks extents
as poisoned.
A poisoned extent is an extent that was unreadable due to checksum
errors. We can't move such extents without giving them a new checksum,
and we may have to move them (e.g. for copygc or device evacuate).
We also don't want to delete them: in the future we'll have an API
that lets userspace ignore checksum errors and attempt to deal with
simple bitrot itself. Marking them as poisoned lets us continue to
return the correct error to userspace on normal read calls.
Other changes/features:
- BCH_IOCTL_QUERY_COUNTERS: this is used by the new 'bcachefs fs top'
command, which shows a live view of all internal filesystem counters.
- Improved journal pipelining: we can now have 16 journal writes in
flight concurrently, up from 4. We're logging significantly more to
the journal than we used to with all the recent disk accounting
changes and additions, so some users should see a performance
increase on some workloads.
- BCH_MEMBER_STATE_failed: previously, we would do no IO at all to
devices marked as failed. Now we will attempt to read from them, but
only if we have no better options.
- New option, write_error_timeout: devices will be kicked out of the
filesystem if all writes have been failing for x number of seconds.
We now also kick devices out when notified by blk_holder_ops that
they've gone offline.
- Device option handling improvements: the discard option should now be
working as expected (additionally, in -tools, all device options that
can be set at format time can now be set at device add time, i.e.
data_allowed, state).
- We now try harder to read data after a checksum error: we'll do
additional retries if necessary to a device after it gave us
data with a checksum error.
- More self healing work: the full inode <-> dirent consistency checks
that are currently run by fsck are now also run every time we do a
lookup, meaning we'll be able to correct errors at runtime. Runtime
self healing will be flipped on after the new changes have seen more
testing, currently they're just checking for consistency.
- KMSAN fixes: our KMSAN builds should be nearly clean now, which will
put a massive dent in the syzbot dashboard.
Merge tag 'bcachefs-2025-03-24' of git://evilpiepirate.org/bcachefs
Pull bcachefs updates from Kent Overstreet:
"On disk format is now soft frozen: no more required/automatic are
anticipated before taking off the experimental label.
Major changes/features since 6.14:
- Scrub
- Blocksize greater than page size support
- A number of "rebalance spinning and doing no work" issues have been
fixed; we now check if the write allocation will succeed in
bch2_data_update_init(), before kicking off the read.
There's still more work to do in this area. Later we may want to
add another bitset btree, like rebalance_work, to track "extents
that rebalance was requested to move but couldn't", e.g. due to
destination target having insufficient online devices.
- We can now support scaling well into the petabyte range: latest
bcachefs-tools will pick an appropriate bucket size at format time
to ensure fsck can run in available memory (e.g. a server with
256GB of ram and 100PB of storage would want 16MB buckets).
On disk format changes:
- 1.21: cached backpointers (scalability improvement)
Cached replicas now get backpointers, which means we no longer rely
on incrementing bucket generation numbers to invalidate cached
data: this lets us get rid of the bucket generation number garbage
collection, which had to periodically rescan all extents to
recompute bucket oldest_gen.
Bucket generation numbers are now only used as a consistency check,
but they're quite useful for that.
- 1.22: stripe backpointers
Stripes now have backpointers: erasure coded stripes have their own
checksums, separate from the checksums for the extents they contain
(and stripe checksums also cover the parity blocks). This is
required for implementing scrub for stripes.
- 1.23: stripe lru (scalability improvement)
Persistent lru for stripes, ordered by "number of empty blocks".
This is used by the stripe creation path, which depending on free
space may create a new stripe out of a partially empty existing
stripe instead of starting a brand new stripe.
This replaces an in-memory heap, and means we no longer have to
read in the stripes btree at startup.
- 1.24: casefolding
Case insensitive directory support, courtesy of Valve.
This is an incompatible feature, to enable mount with
-o version_upgrade=incompatible
- 1.25: extent_flags
Another incompatible feature requiring explicit opt-in to enable.
This adds a flags entry to extents, and a flag bit that marks
extents as poisoned.
A poisoned extent is an extent that was unreadable due to checksum
errors. We can't move such extents without giving them a new
checksum, and we may have to move them (e.g. for copygc or device
evacuate). We also don't want to delete them: in the future we'll
have an API that lets userspace ignore checksum errors and attempt
to deal with simple bitrot itself. Marking them as poisoned lets us
continue to return the correct error to userspace on normal read
calls.
Other changes/features:
- BCH_IOCTL_QUERY_COUNTERS: this is used by the new 'bcachefs fs top'
command, which shows a live view of all internal filesystem
counters.
- Improved journal pipelining: we can now have 16 journal writes in
flight concurrently, up from 4. We're logging significantly more to
the journal than we used to with all the recent disk accounting
changes and additions, so some users should see a performance
increase on some workloads.
- BCH_MEMBER_STATE_failed: previously, we would do no IO at all to
devices marked as failed. Now we will attempt to read from them,
but only if we have no better options.
- New option, write_error_timeout: devices will be kicked out of the
filesystem if all writes have been failing for x number of seconds.
We now also kick devices out when notified by blk_holder_ops that
they've gone offline.
- Device option handling improvements: the discard option should now
be working as expected (additionally, in -tools, all device options
that can be set at format time can now be set at device add time,
i.e. data_allowed, state).
- We now try harder to read data after a checksum error: we'll do
additional retries if necessary to a device after it gave us
data with a checksum error.
- More self healing work: the full inode <-> dirent consistency
checks that are currently run by fsck are now also run every time
we do a lookup, meaning we'll be able to correct errors at runtime.
Runtime self healing will be flipped on after the new changes have
seen more testing, currently they're just checking for consistency.
- KMSAN fixes: our KMSAN builds should be nearly clean now, which
will put a massive dent in the syzbot dashboard"
* tag 'bcachefs-2025-03-24' of git://evilpiepirate.org/bcachefs: (180 commits)
bcachefs: Kill unnecessary bch2_dev_usage_read()
bcachefs: btree node write errors now print btree node
bcachefs: Fix race in print_chain()
bcachefs: btree_trans_restart_foreign_task()
bcachefs: bch2_disk_accounting_mod2()
bcachefs: zero init journal bios
bcachefs: Eliminate padding in move_bucket_key
bcachefs: Fix a KMSAN splat in btree_update_nodes_written()
bcachefs: kmsan asserts
bcachefs: Fix kmsan warnings in bch2_extent_crc_pack()
bcachefs: Disable asm memcpys when kmsan enabled
bcachefs: Handle backpointers with unknown data types
bcachefs: Count BCH_DATA_parity backpointers correctly
bcachefs: Run bch2_check_dirent_target() at lookup time
bcachefs: Refactor bch2_check_dirent_target()
bcachefs: Move bch2_check_dirent_target() to namei.c
bcachefs: fs-common.c -> namei.c
bcachefs: EIO cleanup
bcachefs: bch2_write_prep_encoded_data() now returns errcode
bcachefs: Simplify bch2_write_op_error()
...
Fixes "task out to lunch" warnings during recovery on large machines
with lots of dirty data in the journal.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
btree node scan has to wait on kthread workers that scan each device,
potentially for a while.
We would like this to be interruptible, but we may need a different
mechanism than signals for that - we've had bugs in the past where
mounts were failing due to checking for signals, with no explanation of
where they came from.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Data move -> move_get_io_opts -> bch2_get_update_rebalance_opts
requires a not_extents iterator; this fixes the path where we're walking
the extents btree and chasing a reflink pointer into the reflink btree.
bch2_lookup_indirect_extent() requires working with an extents iterator
(due to peek_slot() semantics), so we implement
bch2_lookup_indirect_extent_for_move().
This is simplified because there's no need to report
indirect_extent_missing_errors here; that can be deferred until fsck or
when a user reads that data.
Reported-by: Maël Kerbiriou <mael.kerbiriou@free.fr>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
There are various checks for "are we going to compress this" - but we're
not going to compress if we know it's incompressible.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We weren't checking that accounting keys have the expected number of
counters. Originally we probably wanted to be flexible on this, but it
doesn't look like that will be required - accounting is extended by
adding new counter types, not more counters to an existing type.
This means we can drop a BUG_ON() that popped once in automated testing,
and the new validation will make that bug easier to track down.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Also, improve the message in prep_encoded_data() - it now prints
good/bad checksums, and checksum type.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
__bch2_read, before calling __bch2_read_extent(), sets bvec_iter.bi_size
to "the size we can read from the current extent" with a swap, and
restores it to "the size for the total read" after the read_extent call
with another swap.
But we neglected to do the restore before the "if (ret) goto err;" -
which is a problem if we're retrying those errors.
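Simplified illustration of the pattern and the fix (helper name
hypothetical):

    swap(iter.bi_size, bytes);      /* shrink to the current extent */
    ret = read_one_extent(&iter);   /* hypothetical helper */
    swap(iter.bi_size, bytes);      /* restore the total size FIRST... */
    if (ret)
        goto err;                   /* ...so retries see the full iter */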
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If we're moving an extent that was partially overwritten,
bch2_write_rechecksum() will trim it to the currently live range.
If we then also want to compress it, it'll be decrypted - but the nonce
has been advanced for the overwritten start of the extent that we
dropped, and we were using the nonce we calculated before rechecksum().
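A hedged sketch of the fix, assuming the nonce is rederived from the
post-rechecksum crc:

    /* derive the nonce from the crc that rechecksum produced for the
     * live range, not from a value computed before the trim: */
    nonce = extent_nonce(op->version, crc);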
Reported-by: Gabriel de Perthuis <g2p.code@gmail.com>
Fixes: 127d90d282 ("bcachefs: bch2_write_prep_encoded_data() now returns errcode")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Merge tag 'vfs-6.15-rc1.pagesize' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs pagesize updates from Christian Brauner:
"This enables block sizes greater than the page size for block devices.
With this we can start supporting block devices with logical block
sizes larger than 4k.
It also allows us to lift the device cache sector size support to 64k.
This allows filesystems which can use larger sector sizes up to 64k to
ensure that the filesystem will not generate writes that are smaller
than the specified sector size"
* tag 'vfs-6.15-rc1.pagesize' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
bdev: add back PAGE_SIZE block size validation for sb_set_blocksize()
bdev: use bdev_io_min() for statx block size
block/bdev: lift block size restrictions to 64k
block/bdev: enable large folio support for large logical block sizes
fs/buffer fs/mpage: remove large folio restriction
fs/mpage: use blocks_per_folio instead of blocks_per_page
fs/mpage: avoid negative shift for large blocksize
fs/buffer: remove batching from async read
fs/buffer: simplify block_read_full_folio() with bh_offset()
Merge tag 'vfs-6.15-rc1.async.dir' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs async dir updates from Christian Brauner:
"This contains cleanups that fell out of the work from async directory
handling:
- Change kern_path_locked() and user_path_locked_at() to never return
a negative dentry. This simplifies the usability of these helpers
in various places
- Drop d_exact_alias() from the remaining place in NFS where it is
still used. This also allows us to drop the d_exact_alias() helper
completely
- Drop an unnecessary call to fh_update() from nfsd_create_locked()
- Change i_op->mkdir() to return a struct dentry
Change vfs_mkdir() to return a dentry provided by the filesystems
which is hashed and positive. This allows us to reduce the number
of cases where the resulting dentry is not positive to very few
cases. The code in these places becomes simpler and easier to
understand.
- Repack DENTRY_* and LOOKUP_* flags"
* tag 'vfs-6.15-rc1.async.dir' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
doc: fix inline emphasis warning
VFS: Change vfs_mkdir() to return the dentry.
nfs: change mkdir inode_operation to return alternate dentry if needed.
fuse: return correct dentry for ->mkdir
ceph: return the correct dentry on mkdir
hostfs: store inode in dentry after mkdir if possible.
Change inode_operations.mkdir to return struct dentry *
nfsd: drop fh_update() from S_IFDIR branch of nfsd_create_locked()
nfs/vfs: discard d_exact_alias()
VFS: add common error checks to lookup_one_qstr_excl()
VFS: change kern_path_locked() and user_path_locked_at() to never return negative dentry
VFS: repack LOOKUP_ bit flags.
VFS: repack DENTRY_ flags.
In debug mode, we save the call stack on transaction restart - but
there's no locking, so we can't touch it if we're issuing the restart
from another thread.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We're hitting some issues with uninitialized struct padding, flagged by
kmsan.
They appear to be false positives; otherwise bch2_accounting_validate()
would have flagged them as "junk at end". But for now, we'll need to
initialize disk_accounting_pos with memset().
This adds a new helper, bch2_disk_accounting_mod2(), that initializes a
disk_accounting_pos and does the accounting mod all at once - so overall
things actually get slightly more ergonomic.
BCH_DISK_ACCOUNTING_replicas keys are left for now; KMSAN isn't warning
about them and they're a bit special.
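Illustrative shape (simplified):

    struct disk_accounting_pos acc;

    /* zero everything, including the union padding KMSAN sees: */
    memset(&acc, 0, sizeof(acc));
    acc.type = BCH_DISK_ACCOUNTING_dev_data_type;
    /* fill in the remaining fields, then do the accounting mod */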
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We appear to be tripping over a compiler/kmsan bug with padding fields -
this is an easy workaround.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We may sometimes read from uninitialized memory; we know about it, and
that's ok.
We check if a btree node has been reused before waiting on any
outstanding IO.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We store to all fields, so the kmsan warnings were spurious - but
initializing via stores to bitfields appears to have been giving the
compiler/kmsan trouble, and they're not necessary.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
New data types might be added later, so we don't want to disallow
unknown data types - that'll be a compatibility hassle later. Instead,
ignore them.
Reported-by: syzbot+3a290f5ff67ca3023834@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
More on the "full online self healing" project:
We now run most of the dirent <-> inode consistency checks, with repair
code, at runtime - the exact same check and repair code that fsck runs.
This will allow us to repair the "dirent points to inode that does not
point back" inconsistencies that have been popping up at runtime.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Prep work for calling bch2_check_dirent_target() from bch2_lookup().
- Add an inline wrapper, if the target and backpointer match we can skip
the function call.
- We don't (yet?) want to remove the dirent we did the lookup from (when
we find a directory or subvol with multiple valid dirents pointing to
it); we can defer that until later. For now, add an "are we in
fsck?" parameter.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We're gradually running more and more fsck.c checks at runtime,
wherever applicable; when we do so they get moved out of fsck.c.
Next patch will call bch2_check_dirent_target() from bch2_lookup().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Replace these with proper private error codes, so that when we get an
error message we're not sifting through the entire codebase to see where
it came from.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The smp_rmb() guarantees that reads from reservations.counter
occur before accessing cur_entry_u64s. It's paired with the
atomic64_try_cmpxchg in journal_entry_open.
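Sketch of the pairing (simplified):

    /* reader: */
    v = atomic64_read(&j->reservations.counter);
    smp_rmb();          /* order the read above before the one below */
    u64s = j->cur_entry_u64s;

    /* writer (journal_entry_open) updates cur_entry_u64s, then
     * publishes via atomic64_try_cmpxchg(), which provides full
     * ordering on success */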
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Convert these to standard error codes, which means we can pass them
outside the journal code, they're easier to pass to tracepoints, etc.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
the discard option is special, because it's both a filesystem and a
device option.
When set at the filesystem level, it's supposed to propagate to the
devices (if set persistently, via sysfs) or override them (if set
non-persistently, as a mount option) - that now works correctly.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Other options can normally be set at runtime via sysfs, no reason for
this one not to be as well - it just doesn't support the degraded flags
argument this way; that requires the ioctl.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Device options now use the common option code for sysfs, and can set
superblock fields (in a struct bch_member).
This replaces BCH_DEV_OPT_SETTERS(), which was weird and easy to miss.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Previously, device options had their superblock option field listed
separately, which was weird and easy to miss when defining options.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
atomic64_read(&j->seq) - j->seq_write_started == JOURNAL_STATE_BUF_NR is
the condition in journal_entry_open where we return JOURNAL_ERR_max_open,
so journal_cur_seq(j) - seq == JOURNAL_STATE_BUF_NR means that the buf
corresponding to seq has started to write.
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Rebalance requires a not_extents iterator.
This wasn't hit before because all_snapshots disables is_extents on
snapshots btrees - but has no effect on the reflink btree.
Reported-by: Maël Kerbiriou <mael.kerbiriou@free.fr>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This was missed - but it needs to be correct for the superblock recovery
tool that scans the start and end of the device for backup superblocks:
we don't want to pick up superblocks that belong to a different
partition that starts at a different offset.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Minor refactoring, so that bch2_sb_validate() can be used in the new
userspace superblock recovery tool.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If we can't mount because of an incompatibility, print what's supported
and unsupported - to help solve PEBKAC issues.
Reported-by: Roland Vet <vet.roland@protonmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Just use sha256() instead of the clunky crypto API. This is much
simpler.
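For reference, the one-shot library interface from <crypto/sha2.h>:

    u8 digest[SHA256_DIGEST_SIZE];

    sha256(data, len, digest);  /* no tfm allocation, no error paths */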
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Since bcachefs does not access crc32c and crc64 through the crypto API,
there is no need to use module softdeps to ensure they are loaded.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This fixes another "rebalance spinning and doing no work" issue;
rebalance was reading extents it wanted to move, but then failing in
bch2_write() -> bch2_alloc_sectors_start() due to being unable to
allocate sufficient replicas.
This was triggered by a user playing with the durability settings, the
foreground device was an NVME device with durability=2, and originally
he'd set the background device to durability=2 as well, but changed it
back to 1 (the default) after seeing IO errors.
That meant that with replicas=2, we want to move data off the NVME
device which satisfies that constraint, but with a single durability=1
device on the background target there's no way to move the extent to
that target while satisfying the "required replicas" constraint.
The solution for now is for bch2_data_update_init() to check for this,
and return an error - before kicking off the read.
bch2_data_update_init() already had two different checks for "will we be
able to write this extent", with partially duplicated code, so this
patch combines and improves that logic.
Additionally, we now always bail out and return an error if there's
insufficient space on the destination target. Previously, we only did
this for BCH_WRITE_alloc_nowait moves, because it might be the case that
copygc just needs to free up space on the destination target.
But we really shouldn't kick off a move if the destination is full, we
can't currently distinguish between "really full" and "just need to wait
for copygc", and if we are going to wait on copygc it'd be better to do
that before kicking off the move.
This will additionally fix "rebalance spinning" issues caused by a
filesystem that has more data than can fit in background_target - which
is a valid scenario, since we don't exclude foreground/cache devices
when calculating filesystem capacity.
Reported-by: Maël Kerbiriou <mael.kerbiriou@free.fr>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Now that there are 16 journal buffers, 8 is no longer enough.
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Next patch will be checking if the extent we're reading from matches the
IO failure we saw before marking the failure.
For this to work, __bch2_read() needs to take the same transaction
context that bch2_rbio_retry() uses to do that check.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Read flags are codepath dependent and change as they're passed around,
while the fields in rbio._state are mostly fixed properties of that
particular object.
Losing track of BCH_READ_data_update would be bad, and previously it was
not obvious if it was always correctly set in the rbio, so this is a
safety cleanup.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Use TYPEOF_UNQUAL() to declare variables as a corresponding type without
named address space qualifier to avoid "`__seg_gs' specified for auto
variable `var'" errors.
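As a rough illustration (not taken from the patch itself; the percpu
variable and function are hypothetical), the pattern looks like:

  #include <linux/percpu.h>

  DEFINE_PER_CPU(int, demo_counter);

  void demo(void)
  {
  	/* typeof(demo_counter) would carry __seg_gs on x86, which is
  	 * invalid for an auto variable; TYPEOF_UNQUAL() strips it */
  	TYPEOF_UNQUAL(demo_counter) v = raw_cpu_read(demo_counter);
  	(void)v;
  }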
Link: https://lkml.kernel.org/r/20250127160709.80604-4-ubizjak@gmail.com
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Acked-by: Nadav Amit <nadav.amit@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
It's possible for checksum errors to be transient - e.g. flakey
controller or cable, thus we need additional retries (besides retrying
from different replicas) before we can definitely return an error.
This is particularly important for the next patch, which will allow the
data move path to move extents with checksum errors - we don't want to
accidentally introduce bitrot due to a transient error!
- bch2_bkey_pick_read_device() is substantially reworked, and
bch2_dev_io_failures is expanded to record more information about the
type of failure (i.e. number of checksum errors).
It now returns an error code that describes more precisely the reason
for the failure - checksum error, io error, or offline device, instead
of the previous generic "insufficient devices". This is important for
the next patches that add poisoning, as we only want to poison extents
when we've got real checksum errors (or perhaps IO errors?) - not
because a device was offline.
- Add a new option and superblock field for the number of checksum
retries.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Users have been asking for this, and now that errors are returned to the
top level read retry path - we can.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Next patch will be adding an additional retry loop for checksum errors,
so that we can rule out transient errors before marking an extent as
poisoned.
Prerequisite to this is returning errors to bch2_rbio_retry(); this will
also let us add a "successful retry" message.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Now that the read path uses proper error codes, we can get rid of the
weird rbio->hole signalling to the move path that the read didn't
happen.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When we do a read to a buffer that's mapped into userspace, it's
possible to get a spurious checksum error if userspace modified the
buffer at the same time.
When we retry those, they have to be bounced before we know definitively
whether we're reading corrupt data.
But the retry path propagates read flags differently, so needs special
handling.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kill the READ_ERR/READ_RETRY/READ_RETRY_AVOID enums, and add standard
error codes that describe precisely which error occurred.
This is going to be used for the data move path, to move (but poison)
extents with checksum errors.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
dm-flakey is busted, and this is simpler anyways - this lets us test the
checksum error retry paths.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
These are commonly needed when debugging, and saves from having to ask
users to dig.
Also, rebalance_status now includes pending rebalance work.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Use max() to simplify gen_after() and improve its readability.
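The simplified helper might look like this (a sketch; casting the
wrapped u8 difference to s8 makes an "older" gen come out negative,
which max() then clamps to 0):

  static inline int gen_after(u8 a, u8 b)
  {
  	return max(0, (s8) (a - b));
  }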
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The extra byte is not used - remove it.
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We're improving our handling of write errors - we shouldn't write
degraded data just because a write failed once, we should retry it (on
other devices, if possible).
But for this to work, we need to kick devices out when they're only
returning errors - otherwise those retries will loop infinitely.
This adds a configurable timeout - if writes are failing for too long,
we'll set that device read-only.
In the future we should also implement more tracking and another knob
for an "allowed error rate", so that we can kick out drives that are
acting "unhealthy".
Another thing we'll want is a mechanism (likely in userspace) for
bringing a device back in after a transient error - perhaps a cable was
jiggled, or there was a controller reset.
After transient errors we also need a mechanism to walk (from the
journal) recent btree updates that weren't flushed to that device and
treat them as "degraded", since unflushed data may well not have been
written. Out of scope for this patch, but becoming relevant.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Previously, we wouldn't try to read at all from a failed device - that
doesn't make much sense, the device may be unhealthy (perhaps taking
longer than it should to service reads), but if it's our only option we
should still try to read from it.
Now, bch2_bkey_pick_read_device() will pick failed devices only if there
are no non-failed replicas to read from.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The next patch implementing freezing will change bch2_dev_get_ioref() to
sleep if a device is currently frozen.
Add an annotation and fix the journal code accordingly.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This was completely fubar; it's now simplified a bit as well.
Note that for_each_online_member() takes and releases io_refs as it
iterates, so we need to release that if we break.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We can't use the standard fs_holder_ops because they're meant for single
device filesystems - fs_bdev_mark_dead() in particular - and they assume
that the blk_holder is the super_block, which also doesn't work for a
multi device filesystem.
These generally follow the standard fs_holder_ops; the
locking/refcounting is a bit simplified because c->ro_ref suffices, and
bch2_fs_bdev_mark_dead() is not necessarily shutting down the entire
filesystem.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This is necessary for the new blk_holder_ops, which want the vfs
super_block available for synchronization.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Note that we open block devices before we allocate bch_fs, but once
attached to a filesystem they will be closed before the bch_fs is torn
down - so stashing a pointer without a refcount looks incorrect but it's
not.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We need to start accounting successes for every IO, not just failures,
so introduce a unified hook for io completion accounting and convert
io_read.c.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We were using our device pointer after we'd released our ref to it.
Unlikely to be a race that's practical to hit, since actually removing a
member device is a whole process besides just taking it offline, but -
needs to be fixed.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If a device is ro or failed, we might not have anywhere to move a
replica.
Check for this early, before doing the read and attempting to write.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This implements a new extent field: bitflags that apply to the whole
extent. There's been a couple of things we've wanted this for in the
past, but the immediate need is extent poisoning, to solve a rebalance
issue.
Unknown extent fields can't be parsed (we won't know their size, so we
can't advance to the next field), so this is an incompat feature, and
using it prevents the filesystem from being mounted by old versions.
This also adds the BCH_EXTENT_poisoned flag; this indicates that the
data is known to be bad (i.e. there was a checksum error, and we had to
write a new checksum) and reads will return errors.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Use error type alloc_v3_unpack_error in bch2_alloc_v3_validate().
Fixes: b65db750e2 ("bcachefs: Enumerate fsck errors")
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This patch implements support for case-insensitive file name lookups
in bcachefs.
The implementation uses the same UTF-8 lowering and normalization that
ext4 and f2fs are using.
More information is provided in Documentation/bcachefs/casefolding.rst
Compatibility notes:
This uses the new versioning scheme for incompatible features where an
incompatible feature is tied to a version number: the superblock says
"we may use incompat features up to x" and "incompat features up to x
are in use", disallowing mounting by previous versions.
Additionally, an old-style incompat feature bit is used, so that
kernels without utf8 casefolding support know if casefolding
specifically is in use and they're allowed to mount.
Signed-off-by: Joshua Ashton <joshua@froggi.es>
Cc: André Almeida <andrealmeid@igalia.com>
Cc: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Splits out the code that allocates the dirent and initializes the name
to make things easier to implement casefolding in a future commit.
Cc: André Almeida <andrealmeid@igalia.com>
Cc: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Joshua Ashton <joshua@froggi.es>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add a persistent LRU for stripes, ordered by "number of empty blocks",
i.e. order in which we wish to reuse them.
This will replace the in-memory stripes heap, so we can kill off reading
stripes into memory at startup.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Stripes now have backpointers.
This is needed for proper scrub - stripe checksums need to be verified,
separately from extents within the stripe, since a block may not be full
of live extents but it's still needed for reconstruct.
And this will be needed for (efficient) evacuate/repair paths.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Now that we've got cached backpointers and aren't leaving around stale
pointers on bucket invalidation, we no longer need the periodic (rare)
gc_gens - which recalculates each bucket's oldest gen to avoid wraparound.
We can't delete that code because we've got to support existing
filesystems that will still have stale pointers, but this gets rid of
another scalability limit.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Cached pointers now have backpointers.
This means that we'll be able to kill cached pointers in the
bucket_invalidate path, when invalidating/reusing buckets containing
cached data, instead of leaving them around to be cleaned up by gc_gens
garbage collection - which requires a full metadata scan.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Transactional triggers need to run in a defined ordering, which is not
quite the same as btree ID integer comparison.
Previously this was handled in a hacky way in
bch2_trans_commit_run_triggers(), since it was only the alloc btree that
needed special handling, but upcoming stripe btree changes are going to
require more ordering changes - so, define that ordering.
Next patch will change the transaction commit path to use it.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Pass in the backpointer explicitly, instead of assuming 'referring_k' is
an alloc key and calculating it.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
FRAGMENTATION_START was incorrect, there's currently only one
fragmentation LRU (at the end of the reserved bits for LRU type), and
we're getting ready to add a stripe fragmentation lru - so give it a
better name.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
A user has been seeing the "error verifying existing checksum while
rewriting existing data (memory corruption?)" error.
This generally indicates a hardware issue (and that may be the case
here), but it might also indicate a bug, in which case we need more
information to look for patterns.
Reported-by: Roland Vet <vet.roland@protonmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The eytzinger code was previously relying on the following wrap-around
properties and their "eytzinger0" equivalents:
eytzinger1_prev(0, size) == eytzinger1_last(size)
eytzinger1_next(0, size) == eytzinger1_first(size)
However, these properties are no longer relied upon and no longer
necessary, so remove the corresponding asserts and forbid the use of
eytzinger1_prev(0, size) and eytzinger1_next(0, size).
This allows us to further simplify the code in eytzinger1_next() and
eytzinger1_prev(): where the left shifting happens, eytzinger1_next() is
trying to move i to the lowest child on the left, which is equivalent to
doubling i until the next doubling would cause it to be greater than
size. This is implemented by shifting i to the left so that the most
significant bits align and then shifting i to the right by one if the
result is greater than size.
Likewise, eytzinger1_prev() is trying to move to the lowest child on the
right; the same applies here.
The 1-offset in (size - 1) in eytzinger1_next() isn't needed at all, but
the equivalent offset in eytzinger1_prev() is surprisingly needed to
preserve the 'eytzinger1_prev(0, size) == eytzinger1_last(size)'
property. However, since we no longer support that property, we can get
rid of these offsets as well. This saves one addition in each function
and makes the code less confusing.
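A sketch of that step (the function name is illustrative; fls() as in
linux/bitops.h):

  static unsigned lowest_left_descendant(unsigned i, unsigned size)
  {
  	i <<= fls(size) - fls(i);	/* align most significant bits */
  	i >>= i > size;			/* overshot: back up one level */
  	return i;
  }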
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
In this second step, transform the eytzinger indexes i, j, and k in
eytzinger1_sort_r() from 0-based to 1-based. This step looks a bit
messy, but the resulting code is slightly better.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
In this first step, convert the eytzinger sort functions to use 1-based
primitives.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Several of the algorithms on eytzinger trees are implemented in terms of
the eytzinger0 primitives. However, those algorithms can just as easily
be expressed in terms of the eytzinger1 primitives, and that leads to
better and easier to understand code. Start by converting
eytzinger0_find().
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Function eytzinger0_find() isn't currently covered, so add a self test.
We can rely on eytzinger0_find_le() here because it is being
tested independently.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add an eytzinger0_find_ge() self test similar to eytzinger0_find_gt().
Note that this test requires eytzinger0_find_ge() to return the first
matching element in the array in case of duplicates. To prevent
bisection errors, we only add this test after strengthening the original
implementation (see the previous commit).
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Implement eytzinger0_find_ge() directly instead of implementing it in
terms of eytzinger0_find_le() and adjusting the result.
This turns eytzinger0_find_ge() into a minimum search, so when there are
duplicate elements, the result of eytzinger0_find_ge() will now always
point at the first matching element.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Instead of implementing eytzinger0_find_gt() in terms of
eytzinger0_find_le() and adjusting the result, implement it directly.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add an eytzinger0_find_gt() self test similar to eytzinger0_find_le().
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Replace the over-complicated implementation of eytzinger0_find_le() by
an equivalent, simpler version.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
eytzinger0_find_le() is also easy to convert to 1-based eytzinger (but
see the next commit).
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Rename eytzinger0_find_test_val() to eytzinger0_find_test_le() and add a
new eytzinger0_find_test_val() wrapper that calls it.
We have already established that the array is sorted in eytzinger order,
so we can use the eytzinger iterator functions and check the boundary
conditions to verify the result of eytzinger0_find_le().
Only scan the entire array if we get an incorrect result. When we need
to scan, use eytzinger0_for_each_prev() so that we'll stop at the
highest matching element in the array in case there are duplicates;
going through the array linearly wouldn't give us that.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add an eytzinger0_for_each_prev() macro for iterating through an
eytzinger array in reverse.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
In eytzinger0_find_test(), remember the smallest element seen so far
instead of comparing adjacent array elements.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
In eytzinger[01]_test(), make sure that eytzinger[01]_for_each()
iterates over all array elements.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
pr_info() format strings need to be newline terminated.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The iterator variable of eytzinger0_for_each() loops has been changed to
be locally scoped at some point, so remove variables defined outside the
loop that are now unused. In addition and for clarity, use a different
variable inside those loops where an outside variable would be shadowed.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When EYTZINGER_DEBUG is defined, <linux/bug.h> needs to be included.
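A sketch of the guarded include, assuming the header wraps its
assertions in a debug-only macro:

  #ifdef EYTZINGER_DEBUG
  #include <linux/bug.h>
  #define EYTZINGER_BUG_ON(cond)	BUG_ON(cond)
  #else
  #define EYTZINGER_BUG_ON(cond)
  #endif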
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Use an eytzinger0_for_each() loop here.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We have other metadata IO types covered, this was missing.
Note: this includes the time until completion, i.e. including parent
pointer update.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Prep work for stripe backpointers: this path previously would get very
confused at being asked to process (remove redundant replicas) stripes.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Since we're increasing the number of 'struct journal_bufs', we don't
want them all permanently holding onto buffers for the journal data -
that'd be 16 * 2MB = 32MB, or potentially more.
Add a single-element mempool (open coded, since buffer size varies),
this also means we won't be hitting the memory allocator every time we
open and close a journal entry/buffer.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This is a small optimization, reducing the number of cachelines we touch
in the fast path - and it's also necessary for the next patch that
increases JOURNAL_BUF_NR.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This code needs quite a bit of work: we don't want to be walking all
metadata in the filesystem, we should just be walking backpointers, and
it should be switched to a data ioctl that can report progress via a
file descriptor, not the system console.
But that'll take more work - before we can safely walk only backpointers
we need to change device add to not reuse device indexes, since with
that change accounting being wrong introduces the possibility of
removing a device that still has pointers.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
the backpointers code has progress indicators; these aren't great, since
they print to the dmesg console and we much prefer to have progress
indicators reporting to a specific userspace program so they're not
spamming the system console.
But not all codepaths that need progress indicators support that yet,
and we don't want users to think "this is hung".
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
we're starting to use error messages with paths in fsck_errors(), where
we do not want nested transaction restart handling, so let's prepare for
that.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We want all error messages converted to print paths, not just inode
numbers - users want this information, and it speeds up debugging too.
Auditing and converting all error messages is going to be a big project,
so for the moment we're just doing this incrementally.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Iterating over backpointers on a specific device is potentially much
cheaper than walking all filesystem data.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Reorganize counters a bit, grouping related counters together.
New counters:
- io_read_inline
- io_read_hole
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When ancestor is less than IS_ANCESTOR_BITMAP, we would get an incorrect
result.
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add a new data op to walk all data and metadata in a filesystem,
checking if it can be read successfully, and on error repairing from
another copy if possible.
- New helper: bch2_dev_idx_is_online(), so that we can bail out and
report to userspace when we're unable to scrub because the device is
offline
- data_update_opts, which controls the data move path, now understands
scrub: data is only read, not written. The read path is responsible
for rewriting on read error, as with other reads.
- scrub_pred skips data extents that don't have checksums
- bch_ioctl_data has a new scrub member, which has a data_types field
for data types to check - i.e. all data types, or only metadata.
- Add new entries to bch_move_stats so that we can report numbers for
corrected and uncorrected errors
- Add a new enum to bch_ioctl_data_event for explicitly reporting
completion and return code (i.e. device offline)
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add a function for scrubbing btree nodes - reading them in, and kicking
off a rewrite if there's an error.
The btree_node_read_done() checks have to be duplicated because we're
not using a pointer to a struct btree - the btree node might already be
in cache, and we need to check a specific replica, which might not be
the one we previously read from.
This will be used in the next patch implementing high-level scrub.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add a new helper for rewriting a btree node given just the key, not a
pointer to the node itself.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We may not need to pull in a btree node when walking backpointers -
don't do so unnecessarily when using backpointer_get_key().
It'll still fall back to backpointer_get_node() in a few situations,
including btree roots (where an iterator can't point at just the key),
and races due to the interior update path not having deleted a
backpointer to an old node yet.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Rework the read path so that BCH_READ_NODECODE reads now also self-heal
after a read error and a successful retry - prerequisite for scrub.
- __bch2_read_endio() now handles a read that's both BCH_READ_NODECODE
and a bounce.
Normally, we don't want a BCH_READ_NODECODE read to ever allocate a
split bch_read_bio: we want to maintain the relationship between the
bch_read_bio and the data_update it's embedded in.
But correcting read errors requires allocating a split/bounce rbio
that's embedded in a promote_op. We do still have a 1-1 relationship,
i.e. we only allocate a single split/bounce if it's a
BCH_READ_NODECODE, so things hopefully don't get too crazy.
- __bch2_read_extent() now is allowed to allocate the promote_op for
rewriting after a failed read, even if it's BCH_READ_NODECODE.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
we don't want to block completion of the read - starting a promote calls
into the write path, which will block.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If a data update doesn't want to block on allocations (promotes, self
healing on read error) - check if the allocation would fail before
kicking off the data update and calling into the write path.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Initialize the write op first, so that in the next patch we can check if
the allocator would block (for BCH_WRITE_alloc_nowait ops) and bail out
before taking nocow locks/dev refs.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If a drive is failing and we're moving data off of it, we can't
necessarily depend on capacity/disk reservation calculations to avoid
deadlocking/blocking on the allocator.
And, we don't want to queue up infinite self healing moves anyways.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Promotes, like most other internal moves, should only go to the
specified target and not fall back to allocating from the full
filesystem.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Now that data_update embeds bch_read_bio, BCH_READ_NODECODE means that
the read is embedded in a data_update - and we can check in the retry
path if the extent has changed and bail out.
This likely fixes some subtle bugs with read errors and data moves.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add a new helper for bch2_moving_ctxt_to_text(), which may be used to
debug if moving_ios are getting stuck.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add an ioctl for querying counters, the same ones provided in
/sys/fs/bcachefs/<uuid>/counters/, but more suitable for a 'bcachefs
top' command.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Persistent counters, like recovery passes, include a stable enum in
their definition - but this was never correctly plumbed.
This allows us to add new counters and properly organize them with a
non-stable "presentation order", which can also be used in userspace by
the new 'bcachefs fs top' tool.
Fortunately, since we haven't yet added any new counters where
presentation order ID doesn't match stable ID, this won't cause any
reordering issues.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We've got per-writepoint statistics to see how well the writepoint index
update threads are pipelining; this separates running vs. runnable so we
can see at a glance if they're blocking.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This makes 'bcachefs fs top' more useful; we can now see at a glance
whether the IO to the device is being done for user reads/writes, or
copygc/rebalance.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
An early version of 'bcachefs_metadata_version_cached_backpointers' was
creating backpointers for stale cached pointers - whoops. Now we have to
repair those.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Factor out get_iter_to_node() and use it for
btree_node_rewrite_get_iter(), to be used for fixing btree node write
error behaviour.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bcachefs removed most PAGE_SIZE references long ago, so this is easy;
only readpage_bio_extend() has to be tweaked to respect the minimum
order.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bare 64 bit divides not allowed, whoops
arm-linux-gnueabi-ld: drivers/char/random.o: in function `__get_random_u64_below':
drivers/char/random.c:602:(.text+0xc70): undefined reference to `__aeabi_uldivmod'
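The general rule, illustratively (div_u64() and friends live in
linux/math64.h):

  u64 q = div_u64(x, n);		/* u64 / u32: no libgcc helper */
  u64 r;
  u64 q2 = div64_u64_rem(x, y, &r);	/* u64 / u64, with remainder */

A bare 'x / n' or 'x % n' on u64 operands compiles to a call to
__aeabi_uldivmod on 32-bit ARM, which the kernel doesn't provide.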
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We just had a report of the assert for "btree in write buffer for
non-write buffer btree" popping during the 6.14 upgrade.
- 150TB filesystem, after a reboot the upgrade was able to continue from
where it left off, so no major damage.
But with 6.14 about to come out we want to get this tracked down asap,
and need more data if other users hit this.
Convert the BUG_ON() to an emergency read-only, and print out the
btree, the key itself, and a stack trace from the original write buffer
update (which did not have this check before).
Reported-by: Stijn Tintel <stijn@linux-ipv6.be>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
get_random_u32_below() has a better algorithm than bch2_rand_range(),
it just didn't exist at the time.
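At call sites the conversion is mechanical; roughly (hypothetical call
site):

  u32 v = get_random_u32_below(n);	/* was: bch2_rand_range(n) */

get_random_u32_below() uses an unbiased multiply-based (Lemire)
rejection method rather than masking to a power of two and retrying.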
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The commit titled "block/bdev: lift block size restrictions to 64k"
lifted the block layer's max supported block size to 64k inside the
helper blk_validate_block_size() now that we support large folios.
However, in lifting the block size we also removed the check that many
filesystems rely on sb_set_blocksize() to perform: *verifying* that the
block size <= PAGE_SIZE. Historically we've always allowed userspace to
create, for example, 64k block size filesystems even on 4k page size
systems - what we didn't allow was mounting them - and older
filesystems have been relying on the sb_set_blocksize() check for years.
While we could argue that such checks should be filesystem specific,
there are many more users of sb_set_blocksize() upstream than
LBS-enabled filesystems, so just do the easier thing and bring back
the PAGE_SIZE check for sb_set_blocksize() users, skipping it only
for LBS-enabled filesystems.
This will ensure that tests such as generic/466, when run in a loop
against, say, ext4, won't try to actually mount a filesystem with
a block size larger than the filesystem supports for your PAGE_SIZE
and, in the worst case, crash.
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20250307020403.3068567-1-mcgrof@kernel.org
Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
There's no point in doing copygc on non-rw devices: the fragmentation
doesn't matter if we're not writing to them, and we may not have
anywhere to put the data on our other devices.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Previously, we fixed journal resize spuriously failing with
-BCH_ERR_open_buckets_empty, but initial journal allocation was missed
because it didn't invoke the "block on allocator" loop at all.
Factor out the "loop on allocator" code to fix that.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We shouldn't be setting incompatible bits or the incompatible version
field unless explicitly requested or allowed - otherwise we break mounting
with old kernels or userspace.
Reported-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Some filesystems, such as NFS, cifs, ceph, and fuse, do not have
complete control of sequencing on the actual filesystem (e.g. on a
different server) and may find that the inode created for a mkdir
request already exists in the icache and dcache by the time the mkdir
request returns. For example, if the filesystem is mounted twice the
directory could be visible on the other mount before it is on the
original mount, and a pair of name_to_handle_at(), open_by_handle_at()
calls could instantiate the directory inode with an IS_ROOT() dentry
before the first mkdir returns.
This means that the dentry passed to ->mkdir() may not be the one that
is associated with the inode after the ->mkdir() completes. Some
callers need to interact with the inode after the ->mkdir completes and
they currently need to perform a lookup in the (rare) case that the
dentry is no longer hashed.
This lookup-after-mkdir requires that the directory remains locked to
avoid races. Planned future patches to lock the dentry rather than the
directory will mean that this lookup cannot be performed atomically with
the mkdir.
To remove this barrier, this patch changes ->mkdir to return the
resulting dentry if it is different from the one passed in.
Possible returns are:
NULL - the directory was created and no other dentry was used
ERR_PTR() - an error occurred
non-NULL - this other dentry was spliced in
This patch only changes file-systems to return "ERR_PTR(err)" instead of
"err" or equivalent transformations. Subsequent patches will make
further changes to some file-systems to return a correct dentry.
Not all filesystems reliably result in a positive hashed dentry:
- NFS, cifs, hostfs will sometimes need to perform a lookup of
the name to get inode information. Races could result in this
returning something different. Note that this lookup is
non-atomic which is what we are trying to avoid. Placing the
lookup in filesystem code means it only happens when the filesystem
has no other option.
- kernfs and tracefs leave the dentry negative and the ->revalidate
operation ensures that lookup will be called to correctly populate
the dentry. This could be fixed but I don't think it is important
to any of the users of vfs_mkdir() which look at the dentry.
The recommendation to use
d_drop();d_splice_alias()
is ugly but fits with current practice. A planned future patch will
change this.
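A sketch of that recommended pattern in a filesystem's ->mkdir()
(foofs_new_inode() is hypothetical; the return convention is as
described above):

  static struct dentry *foofs_mkdir(struct mnt_idmap *idmap,
  				    struct inode *dir,
  				    struct dentry *dentry, umode_t mode)
  {
  	struct inode *inode = foofs_new_inode(dir, S_IFDIR | mode);

  	if (IS_ERR(inode))
  		return ERR_CAST(inode);		/* an error occurred */

  	d_drop(dentry);
  	/* NULL: 'dentry' was used; non-NULL: another dentry spliced in */
  	return d_splice_alias(inode, dentry);
  }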
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: NeilBrown <neilb@suse.de>
Link: https://lore.kernel.org/r/20250227013949.536172-2-neilb@suse.de
Signed-off-by: Christian Brauner <brauner@kernel.org>
__bch_truncate_folio() may return 1 to indicate dirtiness of the folio
being truncated, needed for fpunch to get the i_size writes correct.
But truncate was forgetting to clear ret, and sometimes returning it as
an error.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This turned out to have several bugs, which were missed because the fsck
code wasn't properly reporting errors - whoops.
Kicking it out for now, hopefully it can make 6.15.
Cc: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The fix alone doesn't fix [1], but should be applied before debugging
that.
[1] https://syzkaller.appspot.com/bug?extid=38a0cbd267eff2d286ff
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
"nonce inconstancy" is popping up again, causing us to go emergency
read-only.
This one looks less serious, i.e. specific to the encryption path and
not indicative of a data corruption bug. But we'll need more info to
track it down.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We don't want to be holding the srcu lock while waiting on btree write
completions - easily fixed.
Reported-by: Janpieter Sollie <janpieter.sollie@edpnet.be>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We had some error handling confusion here;
-BCH_ERR_missing_indirect_extent is thrown by
trans_trigger_reflink_p_segment(); at this point we haven't decided
whether we're generating an error.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Error handling was wrong, causing unhandled transaction restart errors.
check_directory_size() was also inefficient, since keys in multiple
snapshots would be iterated over once for every snapshot. Convert it to
the same scheme used for i_sectors and subdir count checking.
Cc: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
No callers of kern_path_locked() or user_path_locked_at() want a
negative dentry. So change them to return -ENOENT instead. This
simplifies callers.
This results in a subtle change to bcachefs in that an ioctl will now
return -ENOENT in preference to -EXDEV. I believe this restores the
behaviour to what it was prior to
Commit bbe6a7c899 ("bch2_ioctl_subvolume_destroy(): fix locking")
Signed-off-by: NeilBrown <neilb@suse.de>
Link: https://lore.kernel.org/r/20250217003020.3170652-2-neilb@suse.de
Acked-by: Paul Moore <paul@paul-moore.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
_orig_restart_count is now unused; according to the logic,
trans_was_restarted should be using _orig_restart_count.
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Incorrectly handled transaction restarts can be a source of heisenbugs;
add a mode where we randomly inject them to shake them out.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
want_new_bset() returns the address of a new bset to initialize if we
wish to do so in a btree node - either because the previous one is too
big, or because it's been written.
The case for 'previous bset was written' was wrong: it's only supposed
to check whether we have space in the node for one more block, but
because it subtracted the header from the space available, it would
never initialize a new bset if we were down to the last block in a node.
Fixing this results in fewer btree node splits/compactions, which fixes
a bug with flushing the journal to go read-only sometimes not
terminating or taking excessively long.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This lets us flush the journal to go read-only more effectively.
Flushing the journal and going read-only requires halting mutually
recursive processes, which strictly speaking are not guaranteed to
terminate.
Flushing btree node journal pins will kick off a btree node write, and
btree node writes on completion must do another btree update to the
parent node to update the 'sectors_written' field for that node's key.
If the parent node is full and requires a split or compaction, that's
going to generate a whole bunch of additional btree updates - alloc
info, LRU btree, and more - which then have to be flushed, and the cycle
repeats.
This process will terminate much more effectively if we tweak journal
reclaim to flush btree updates leaf to root: i.e., don't flush updates
for a given btree node (kicking off a write, and consuming space within
that node up to the next block boundary) if there might still be
unflushed updates in child nodes.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
reflink pointers to missing indirect extents aren't deleted, they just
have an error bit set - in case the indirect extent somehow reappears.
fsck/mark and sweep thus needs to ignore these errors.
Also, they can be marked AUTOFIX now.
Reported-by: Roland Vet <vet.roland@protonmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Previously, bch2_bkey_sectors_need_rebalance() called
bch2_target_accepts_data(), checking whether the target is writable.
However, this means that adding or removing devices from a target would
change the value of bch2_bkey_sectors_need_rebalance() for an existing
extent; this needs to be invariant so that the extent trigger can
correctly maintain rebalance_work accounting.
Instead, check target_accepts_data() in io_opts_to_rebalance_opts(),
before creating the bch_extent_rebalance entry.
This fixes (one?) cause of rebalance_work accounting being off.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The discard path is supposed to issue journal flushes when there's too
many empty buckets that need a journal commit before they can be
written to again, but at some point this code seems to have been lost.
Bring it back with a new optimization to make sure we don't issue too
many journal flushes: the journal now tracks the sequence number of the
most recent flush in progress, which the discard path uses when deciding
which buckets need a journal flush.
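Illustratively, the decision might look like this (field names are
hypothetical, not the actual struct journal layout):

  static bool discard_should_issue_flush(struct journal *j, u64 bucket_seq)
  {
  	/* issue a journal flush only if the bucket's seq is neither
  	 * flushed on disk nor covered by a flush already in flight */
  	return bucket_seq > j->flushed_seq_ondisk &&
  	       bucket_seq > j->flushing_seq;
  }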
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
In the previous commit b3d82c2f27, code was added to prevent journal sequence
overflow. Among them, the code added to journal_entry_open() uses the
bch2_fs_fatal_err_on() function to handle errors.
However, __journal_res_get(), which calls journal_entry_open(), does so
while holding journal->lock, but bch2_fs_fatal_err_on() internally
tries to acquire journal->lock, which results in a deadlock.
So we need to add a locked helper to handle fatal errors even when the
journal->lock is held.
Fixes: b3d82c2f27 ("bcachefs: Guard against journal seq overflow")
Signed-off-by: Jeongjun Park <aha310510@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
For some unknown reason, checks on struct bkey_s_c_snapshot and struct
bkey_s_c_snapshot_tree pointers are missing.
Fix the incorrect pointer checks in this patch.
Fixes: 4bd06f07bc ("bcachefs: Fixes for snapshot_tree.master_subvol")
Signed-off-by: Jeongjun Park <aha310510@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Merge tag 'pull-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs cleanups from Al Viro:
"Two unrelated patches - one is a removal of long-obsolete include in
overlayfs (it used to need fs/internal.h, but the extern it wanted has
been moved back to include/linux/namei.h) and another introduces
convenience helper constructing struct qstr by a NUL-terminated
string"
* tag 'pull-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
add a string-to-qstr constructor
fs/overlayfs/namei.c: get rid of include ../internal.h
Merge tag 'bcachefs-2025-01-29' of git://evilpiepirate.org/bcachefs
Pull bcachefs fixes from Kent Overstreet:
- second half of a fix for a bug that'd been causing oopses on
filesystems using snapshots with memory pressure (key cache fills for
snapshots btrees are tricky)
- build fix for strange compiler configurations that double stack frame
size
- "journal stuck timeout" now takes into account device latency: this
fixes some spurious warnings, and the main remaining source of SRCU
lock hold time warnings (I'm no longer seeing this in my CI, so any
users still seeing this should definitely ping me)
- fix for slow/hanging unmounts ("Improve journal pin flushing")
- some more tracepoint fixes/improvements, to chase down the "rebalance
isn't making progress" issues
* tag 'bcachefs-2025-01-29' of git://evilpiepirate.org/bcachefs:
bcachefs: Improve trace_move_extent_finish
bcachefs: Fix trace_copygc
bcachefs: Journal writes are now IOPRIO_CLASS_RT
bcachefs: Improve journal pin flushing
bcachefs: fix bch2_btree_node_flags
bcachefs: rebalance, copygc enabled are runtime opts
bcachefs: Improve decompression error messages
bcachefs: bset_blacklisted_journal_seq is now AUTOFIX
bcachefs: "Journal stuck" timeout now takes into account device latency
bcachefs: Reduce stack frame size of __bch2_str_hash_check_key()
bcachefs: Fix btree_trans_peek_key_cache()
Quite a few places want to build a struct qstr by given string;
it would be convenient to have a primitive doing that, rather
than open-coding it via QSTR_INIT().
The closest approximation was in bcachefs, but that expands to
initializer list - {.len = strlen(string), .name = string}.
It would be more useful to have it as compound literal -
(struct qstr){.len = strlen(string), .name = string}.
Unlike initializer list it's a valid expression. What's more,
it's a valid lvalue - it's an equivalent of anonymous local
variable with such initializer, so the things like
path->dentry = d_alloc_pseudo(mnt->mnt_sb, &QSTR(name));
are valid. It can also be used as initializer, with identical
effect -
struct qstr x = (struct qstr){.name = s, .len = strlen(s)};
is equivalent to
struct qstr anon_variable = {.name = s, .len = strlen(s)};
struct qstr x = anon_variable;
// anon_variable is never used after that point
and any even remotely sane compiler will manage to collapse that
into
struct qstr x = {.name = s, .len = strlen(s)};
What compound literals can't be used for is initialization of
global variables, but those are covered by QSTR_INIT().
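Concretely, the lifted helper can be written as the compound literal
described above:

  #define QSTR(n) (struct qstr) { .len = strlen(n), .name = (n) }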
This commit lifts definition(s) of QSTR() into linux/dcache.h,
converts it to compound literal (all bcachefs users are fine
with that) and converts assorted open-coded instances to using
that.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We're currently debugging issues with rebalance, where it's not making
progress as quickly as it should be (or sometimes not at all).
Add the full data_update to the move_extent_finish tracepoint, so we can
check that the replicas we wrote match what we were supposed to do.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
System performance is particularly sensitive to journal write latency:
the number of outstanding journal writes is bounded, and we can't issue
journal flushes until other journal writes have completed.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Running the preempt tiering tests with a lower than normal journal
reclaim delay turned up a shutdown hang - a lost wakeup, caused because
flushing a journal pin (e.g. key cache/write buffer) can generate a new
journal pin.
The "simple" fix of adding the correct wakeup didn't work because of
ordering issues; if we flush btree node pins too aggressively before
other pins have completed, we end up spinning where each flush iteration
generates new work.
So to fix this correctly:
- The list of flushed journal pins is now broken out by type, so that
we can wait for key cache/write buffer pin flushing to complete
before flushing dirty btree nodes
- A new closure_waitlist is added for bch2_journal_flush_pins; this one
is only used under (or when taking) the journal lock, so it's pretty
cheap to add rigorously correct wakeups to journal_pin_set()
and journal_pin_drop().
Additionally, bch2_journal_seq_pins_to_text() is moved to
journal_reclaim.c, where it belongs, along with a bit of other small
renaming and refactoring.
Besides fixing the hang, the better ordering between key cache/write
buffer flushing and btree node flushing should help or fix the "unmount
taking excessively long" a few users have been noticing.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Ratelimit them, and use the new bch2_write_op_error() helper that prints
path and file offset.
Reported-by: https://github.com/koverstreet/bcachefs/issues/819
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Merge tag 'crc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux
Pull CRC updates from Eric Biggers:
- Reorganize the architecture-optimized CRC32 and CRC-T10DIF code to be
directly accessible via the library API, instead of requiring the
crypto API. This is much simpler and more efficient.
- Convert some users such as ext4 to use the CRC32 library API instead
of the crypto API. More conversions like this will come later.
- Add a KUnit test that tests and benchmarks multiple CRC variants.
Remove older, less-comprehensive tests that are made redundant by
this.
- Add an entry to MAINTAINERS for the kernel's CRC library code. I'm
volunteering to maintain it. I have additional cleanups and
optimizations planned for future cycles.
* tag 'crc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux: (31 commits)
MAINTAINERS: add entry for CRC library
powerpc/crc: delete obsolete crc-vpmsum_test.c
lib/crc32test: delete obsolete crc32test.c
lib/crc16_kunit: delete obsolete crc16_kunit.c
lib/crc_kunit.c: add KUnit test suite for CRC library functions
powerpc/crc-t10dif: expose CRC-T10DIF function through lib
arm64/crc-t10dif: expose CRC-T10DIF function through lib
arm/crc-t10dif: expose CRC-T10DIF function through lib
x86/crc-t10dif: expose CRC-T10DIF function through lib
crypto: crct10dif - expose arch-optimized lib function
lib/crc-t10dif: add support for arch overrides
lib/crc-t10dif: stop wrapping the crypto API
scsi: target: iscsi: switch to using the crc32c library
f2fs: switch to using the crc32 library
jbd2: switch to using the crc32c library
ext4: switch to using the crc32c library
lib/crc32: make crc32c() go directly to lib
bcachefs: Explicitly select CRYPTO from BCACHEFS_FS
x86/crc32: expose CRC32 functions through lib
x86/crc32: update prototype for crc32_pclmul_le_16()
...
If a block device (e.g. your typical consumer SSD) is taking multiple
seconds for IOs (typically flushes), we don't want to emit the "journal
stuck" message prematurely.
Also, make sure to drop the btree_trans srcu lock if we're blocking for
more than a second.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
BTREE_ITER_cached_nofill has some tricky corner cases; it's used
internally for iterators that aren't walking the key cache, but need to
be coherent with the key cache.
It tells traverse to look up and lock the key cache entry if present,
but don't create one if it doesn't exist.
That means we have to have a BTREE_ITER_UPTODATE path (because after
traverse the path has to be UPTODATE, or we pop assertions) that doesn't
point to anything (which is the less bad option, taken by the previous
fix).
The previous fix for this path missed an issue that can happen in
bch2_trans_peek_key_cache(): we can't set should_be_locked on a path
that doesn't point to anything and doesn't hold locks.
Fixes: bd5b09727f ("bcachefs: Don't set btree_path to updtodate if we don't fill")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Merge tag 'for-6.14/block-20250118' of git://git.kernel.dk/linux
Pull block updates from Jens Axboe:
- NVMe pull requests via Keith:
- Target support for PCI-Endpoint transport (Damien)
- TCP IO queue spreading fixes (Sagi, Chaitanya)
- Target handling for "limited retry" flags (Guixen)
- Poll type fix (Yongsoo)
- Xarray storage error handling (Keisuke)
- Host memory buffer free size fix on error (Francis)
- MD pull requests via Song:
- Reintroduce md-linear (Yu Kuai)
- md-bitmap refactor and fix (Yu Kuai)
- Replace kmap_atomic with kmap_local_page (David Reaver)
- Quite a few queue freeze and debugfs deadlock fixes
Ming introduced lockdep support for this in the 6.13 kernel, and it
has (unsurprisingly) uncovered quite a few issues
- Use const attributes for IO schedulers
- Remove bio ioprio wrappers
- Fixes for stacked device atomic write support
- Refactor queue affinity helpers, in preparation for better supporting
isolated CPUs
- Cleanups of loop O_DIRECT handling
- Cleanup of BLK_MQ_F_* flags
- Add rotational support for null_blk
- Various fixes and cleanups
* tag 'for-6.14/block-20250118' of git://git.kernel.dk/linux: (106 commits)
block: Don't trim an atomic write
block: Add common atomic writes enable flag
md/md-linear: Fix a NULL vs IS_ERR() bug in linear_add()
block: limit disk max sectors to (LLONG_MAX >> 9)
block: Change blk_stack_atomic_writes_limits() unit_min check
block: Ensure start sector is aligned for stacking atomic writes
blk-mq: Move more error handling into blk_mq_submit_bio()
block: Reorder the request allocation code in blk_mq_submit_bio()
nvme: fix bogus kzalloc() return check in nvme_init_effects_log()
md/md-bitmap: move bitmap_{start, end}write to md upper layer
md/raid5: implement pers->bitmap_sector()
md: add a new callback pers->bitmap_sector()
md/md-bitmap: remove the last parameter for bimtap_ops->endwrite()
md/md-bitmap: factor behind write counters out from bitmap_{start/end}write()
md: Replace deprecated kmap_atomic() with kmap_local_page()
md: reintroduce md-linear
partitions: ldm: remove the initial kernel-doc notation
blk-cgroup: rwstat: fix kernel-doc warnings in header file
blk-cgroup: fix kernel-doc warnings in header file
nbd: fix partial sending
...
We've got a problem with bch_stripe that is going to take an on disk
format rev to fix - we can't access the block sector counts if the
checksum type is unknown.
Document it for now, there are a few other things to fix as well.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The transaction is going to abort, so there will be no cycle involving
this transaction anymore.
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When the cycle doesn't involve the initiator of the cycle detection,
we might choose a transaction that is not involved in the cycle to
abort. That's wrong, since aborting it won't break the cycle; this
patch therefore chooses a transaction within the cycle to abort.
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This patch introduces a helper function, lock_graph_pop_from(), which
pops the lock graph from position i.
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If a transaction chose itself as a victim before and restarted, it
might make a nofail lock request this time. But it might be added to
another transaction's lock graph and be chosen as the victim again,
which is no longer safe without an additional check. We could also
convert the cycle detector to be fully RCU-based to solve that
unsoundness, but the latency added to trans_put and the additional
memory required may not be worth it.
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If the lock has been acquired and unlocked, we don't have to clear and
wake up again - though doing so is harmless, since we hold the intent
lock. Merging the conditions makes this clearer.
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This reverts commit 62448afee7.
six_lock_tryupgrade() fails only if an intent lock is held; it won't
fail no matter how many read locks are held.
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This adds another metadata version for accounting directory sizes.
On the new version, when subdirectory items are created or deleted,
the parent directory's size changes accordingly. For existing
filesystems on the old version, running fsck will automatically
upgrade the metadata version and perform the check and recalculation
of directory sizes.
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
In bcachefs, an empty directory has an isize of 0. As child dirents
are created, its size ought to change. Many other filesystems behave
this way (e.g. xfs and btrfs), and many of them size the directory by
the size of its child dirent names. Although the directory size may
not seem to convey much, we can still give it some meaning.
The formula for a dirent's occupied size is as follows:
occupied_size = 40 + ALIGN(9 + namelen, 8)
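A minimal sketch of that calculation in C, assuming the 40-byte fixed
portion and 8-byte name alignment described above (the helper name is
illustrative, not the actual bcachefs function):
```
#include <linux/align.h>

/* Occupied size of a dirent: a 40-byte fixed portion, plus the name
 * (with 9 bytes of terminator/overhead) rounded up to 8-byte
 * alignment. */
static inline unsigned dirent_occupied_size(unsigned namelen)
{
	return 40 + ALIGN(9 + namelen, 8);
}
```
For example, a 12-character name occupies 40 + ALIGN(21, 8) = 64 bytes.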
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
check_unreachable_inodes does work in online mode, with the one caveat
that it assumes check_dirents has also run - and check_dirents is not
PASS_ONLINE yet.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Factor out a version of bch2_btree_pos_to_text() that doesn't take a
pointer to an in-memory btree node, to be used for btree node scrub.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
New helper for dropping all write locks; which is distinct from the
helper the transaction commit path uses, which is faster and only
touches updates.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This is needed for the interior update locking rework, where we'll be
holding node write locks for the duration of the update - which is
needed for synchronizing with online check_allocations.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_inum_path() should work even if the filesystem is corrupted - we
don't want it to cause fsck to fail.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If the btree_path's lock seq is wrong, the next bch2_trans_relock()
operation is guaranteed to fail and we take an unnecessary transaction
restart.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When evicting, we shouldn't leave a pointer to the key cache entry lying
around - that screws up btree path asserts we're adding.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Ensure that snapshot_tree.master_subvol is cleared when we delete the
master subvolume in a tree of snapshots, and allow for snapshot trees
that don't have a master subvolume in fsck.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Previously, fsck used the snapshot tree's master subvol for finding the
root inode number - but the master subvol might have been deleted, and
setting a new one should be a user operation; meaning we can't rely on
it existing.
Fortunately, for finding the root inode number in a tree of snapshots,
finding any associated subvolume works.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add a version of kvmalloc() that doesn't have the INT_MAX limit; large
filesystems do hit this.
We'll want to get rid of the in-memory bucket gens array, but we're not
there quite yet.
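A rough sketch of the idea, with a hypothetical helper name (the
actual interface may differ):
```
#include <linux/slab.h>
#include <linux/vmalloc.h>

/* Hypothetical: like kvmalloc(), but without the INT_MAX size limit -
 * allocations too large for kvmalloc() go straight to vmalloc. */
static void *kvmalloc_huge(size_t size, gfp_t gfp)
{
	return size <= INT_MAX ? kvmalloc(size, gfp)
			       : __vmalloc(size, gfp);
}
```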
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Locking considerations (possibly no longer relevant?) mean that when an
accounting update needs a new superblock replicas entry to be created,
it's deferred to the transaction commit error path.
But accounting updates for gc/fsck aren't done from the transaction
commit path - so we need to handle
-BCH_ERR_btree_insert_need_mark_replicas locally.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_btree_iter_flags() now takes a level parameter; this fixes a bug
where using a node iterator on a leaf wouldn't set
BTREE_ITER_with_key_cache, leading to fun cache coherency bugs.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The btree node read error path already calls topology error, so this is
entirely redundant, and we're not specific enough about our error codes
- this was triggering for bucket_ref_update() errors.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Checking for writing past i_size after unlocking the folio and clearing
the dirty bit is racy, and we already check it at the start.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
In bcachefs, the io_read and io_write counters record the amount of
data read and written. They are incremented in units of sectors, so to
display correctly they need to be shifted left by the sector shift.
Other counters like io_move and move_extent_{read, write, finish} have
the same problem.
In order to support different units, add an extra column that marks
each counter's type, using TYPE_COUNTER and TYPE_SECTORS in
BCH_PERSISTENT_COUNTERS().
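For example, converting a counter for display might look like this
sketch (names are illustrative):
```
#include <linux/blk_types.h>	/* SECTOR_SHIFT */

/* TYPE_SECTORS counters are kept in units of 512-byte sectors and
 * must be shifted to bytes for display; TYPE_COUNTER values are plain
 * event counts and are shown as-is. */
static u64 counter_display_bytes(u64 raw, bool is_sectors)
{
	return is_sectors ? raw << SECTOR_SHIFT : raw;
}
```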
Fixes: 1c6fdbd8f2 ("bcachefs: Initial commit")
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
It's time to make self healing the default: change the error action for
old filesystems to fix_safe, matching the default for current
filesystems.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Persistent cursors for inode allocation.
A free inodes btree would add substantial overhead to inode allocation
and freeing - a "next num to allocate" cursor is always going to be
faster.
We just need it to be persistent, to avoid scanning the inodes btree
from the start on startup.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This adds a new inode field, bi_depth, for directory inodes: this allows
us to make the check_directory_structure pass much more efficient.
Currently, to ensure the filesystem is fully connected and has no
loops, for every directory we follow backpointers until we find the
root. But by adding a depth counter, it suffices to only check the
parent of each directory, and check that the parent's bi_depth is
smaller.
(fsck doesn't require that bi_depth = parent->bi_depth + 1; if a rename
causes bi_depth to be off, but the chain to the root is still strictly
decreasing, then the algorithm still works and there's no need for fsck
to fix up the bi_depth fields.)
We've already checked backpointers, so we know that every directory
(excluding the root) has a valid parent: if bi_depth is always
decreasing, every chain must terminate, and terminate at the root
directory.
bi_depth will not necessarily be correct when fsck runs, due to
directory renames - we can't change bi_depth on every child directory
when renaming a directory. That's ok; fsck will silently fix the
bi_depth field as needed, and future fsck runs will be much faster.
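A sketch of the resulting check (names and repair logic are
illustrative, not the actual fsck code):
```
/* With bi_depth, connectivity checking only needs to compare each
 * directory against its immediate parent: */
static bool check_dir_depth(u32 parent_depth, u32 *child_depth)
{
	if (*child_depth > parent_depth)
		return true;	/* strictly decreasing toward the root */

	/* Stale after a rename: silently repair rather than report. */
	*child_depth = parent_depth + 1;
	return false;
}
```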
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Now that bch2_move_get_io_opts() re-propagates changed inode io options
to bch_extent_rebalance, we can properly support changing IO path options
for reflinked data.
Changing a per-file IO path option, either via the xattr interface or
via the BCHFS_IOC_REINHERIT_ATTRS ioctl, will now trigger a scan (the
inode number is marked as needing a scan, via
bch2_set_rebalance_needs_scan()), and rebalance will use
bch2_move_data(), which will walk the inode number and pick up the new
options.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Previously, io path option changes on a file would be picked up
automatically and applied to existing data - but not for reflinked data,
as we had no way of doing this safely. A user may have had permission to
copy (and reflink) a given file, but not write to it, and if so they
shouldn't be allowed to change e.g. nr_replicas or other options.
This uses the incompat feature mechanism in the previous patch to add a
new incompatible flag to bch_reflink_p, indicating whether a given
reflink pointer may propagate io path option changes back to the
indirect extent.
In this initial patch we're only setting it for the source extents.
We'd like to set it for the destination in a reflink copy, when the user
has write access to the source, but that requires mnt_idmap which is not
currently plumbed up to remap_file_range.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We've been getting away from feature bits: they don't have any kind of
ordering, and thus it's possible for people to enable weird combinations
of features that were never tested or intended to be run.
Much better to just give every new feature, compatible or incompatible,
a version number.
Additionally, we probably won't ever rev the major version number: major
version numbers represent incompatible versions, but that doesn't really
fit with how we actually roll out incompatible features - we need a
better way of rolling out incompatible features.
So, this patch adds two new superblock fields:
- BCH_SB_VERSION_INCOMPAT
- BCH_SB_VERSION_INCOMPAT_ALLOWED
BCH_SB_VERSION_INCOMPAT_ALLOWED indicates that incompatible features up
to version number x are allowed to be used without user prompting, but
it does not by itself deny old versions from mounting.
BCH_SB_VERSION_INCOMPAT does deny old versions from mounting, and must
be <= BCH_SB_VERSION_INCOMPAT_ALLOWED.
BCH_SB_VERSION_INCOMPAT will only be set when a codepath attempts to use
an incompatible feature, so as to not unnecessarily break compatibility
with old versions.
bch2_request_incompat_feature() is the new interface to check if an
incompatible feature may be used.
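A sketch of the intended semantics, with plain struct fields standing
in for the actual superblock accessors:
```
struct sb_version_state {
	unsigned version_incompat;	   /* denies old versions mounting */
	unsigned version_incompat_allowed; /* usable without prompting */
};

/* An incompatible feature may be used only if allowed; actually using
 * one ratchets version_incompat up, locking out old versions only
 * when necessary. */
static bool request_incompat_feature(struct sb_version_state *sb,
				     unsigned version)
{
	if (version > sb->version_incompat_allowed)
		return false;
	if (version > sb->version_incompat)
		sb->version_incompat = version;
	return true;
}
```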
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The backpointers passes, check_backpointers_to_extents() and
check_extents_to_backpointers() are the most expensive fsck passes.
Now that we're running the same check and repair code when using a
backpointer at runtime (via bch2_backpointer_get_key()) that fsck does,
there's no reason fsck needs to - except to verify that the filesystem
really has no errors in debug mode.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Continuing on with the self healing theme, we should be running any
check and repair code at runtime that we can - instead of declaring the
filesystem inconsistent.
This will also let us skip running the backpointers -> extents fsck pass
except in debug mode.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Instead of walking every extent and every backpointer it points to,
first sum up backpointers in each bucket and check for mismatches, and
only look for missing backpointers if mismatches were detected, and only
check extents in those buckets.
This is a major fsck scalability improvement, since the two backpointers
passes (backpointers -> extents and extents -> backpointers) are the
most expensive fsck passes by far.
Additionally, to speed up the upgrade for backpointer bucket gens, or in
situations when we have to rebuild alloc info, add a special case for
when no backpointers are found in a bucket - don't check each individual
backpointer (in particular, avoiding the write buffer flushes), just
recreate them.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
In an upcoming patch bch2_backpointer_get_key() will be repairing when
it finds a dangling backpointer; it will need to flush the btree write
buffer before it can definitively say there's an error.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch_backpointer no longer contains the bucket_offset field, it's just a
direct LBA mapping (with low bits to account for compressed extent
splitting), so we don't need to refer to the device to construct it
anymore.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Fix sort order for disk accounting keys, in order to fix a regression on
mount times.
The typetag is now the most significant byte of the key, meaning disk
accounting keys of the same type now sort together.
This lets us skip over disk accounting keys that aren't mirrored in
memory when reading accounting at startup, instead of having them
interleaved with other counter types.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
New on disk format version: backpointers now include the generation
number of the bucket they refer to, and the obsolete bucket_offset field
(no longer needed because we no longer store backpointers in alloc keys)
is gone.
This is an expensive forced upgrade - hopefully the last; we have to run
the extents_to_backpointers recovery pass to regenerate backpointers.
It's a forced incompatible upgrade because the alternative would've been
permanently making backpointers bigger, and as one of the biggest btrees
(along with the extents btree) that's not an ideal option.
It's worth it though, because this allows us to make the
check_extents_to_backpointers pass drastically cheaper: an upcoming
patch changes it to sum up backpointers in a bucket and check the sum
against the sector counts for that bucket, only looking for missing
backpointers if they don't match (and then only for specific buckets).
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Since commit 43b62ce3ff ("block: move bio io prio to a new field"), macro
bio_set_prio() does nothing but set bio->bi_ioprio. All other places just
set bio->bi_ioprio directly, so replace bio_set_prio() remaining
callsites with setting bio->bi_ioprio directly and delete that macro.
Signed-off-by: John Garry <john.g.garry@oracle.com>
Acked-by: Jack Wang <jinpu.wang@ionos.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20241202111957.2311683-3-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Switch to generating a private list of interior nodes to delete, instead
of using the equivalence class in the global data structure.
This eliminates possible races with snapshot creation, and is much
cleaner - it'll let us delete a lot of janky code for calculating and
maintaining the equivalence classes.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When deleting dead snapshots, we move keys from redundant interior
snapshot nodes to child nodes - unless there's already a key, in which
case the ancestor key is deleted.
Previously, we tracked via equiv_seen whether the child snapshot had a
key, but this was tricky w.r.t. transaction restarts, and not
transactionally safe w.r.t. updates in the child snapshot.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This breaks when the trigger is inserting updates for the same btree, as
the inode trigger now does.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Originally, we ran insert triggers before overwrite so that if an extent
was being moved (by fallocate insert/collapse range), the bucket sector
count wouldn't hit 0 partway through, and so we don't trigger state
changes caused by that too soon.
But this is better solved by just moving the data type change to the
alloc trigger itself, where it's already called.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Normally, whiteouts (KEY_TYPE_whiteout) are filtered from btree lookups,
since they exist only to represent deletions of keys in ancestor
snapshots - except, they should not be filtered in
BTREE_ITER_all_snapshots mode, so that e.g. snapshot deletion can clean
them up.
This means that the key cache has to store whiteouts, and key cache
fills cannot filter them.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
In BTREE_ITER_all_snapshots mode, we're required to only return keys
where the snapshot field matches the iterator position -
BTREE_ITER_filter_snapshots requires pulling keys into the key cache
from ancestor snapshots, so we have to check for that.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Clang 18 and newer warns (or errors with CONFIG_WERROR=y):
fs/bcachefs/str_hash.c:164:2: error: label followed by a declaration is a C23 extension [-Werror,-Wc23-extensions]
164 | struct bch_inode_unpacked inode;
| ^
In Clang 17 and prior, this is an unconditional hard error:
fs/bcachefs/str_hash.c:164:2: error: expected expression
164 | struct bch_inode_unpacked inode;
| ^
fs/bcachefs/str_hash.c:165:30: error: use of undeclared identifier 'inode'
165 | ret = bch2_inode_unpack(k, &inode);
| ^
fs/bcachefs/str_hash.c:169:55: error: use of undeclared identifier 'inode'
169 | struct bch_hash_info hash2 = bch2_hash_info_init(c, &inode);
| ^
fs/bcachefs/str_hash.c:171:40: error: use of undeclared identifier 'inode'
171 | ret = repair_inode_hash_info(trans, &inode);
| ^
Add an empty statement between the label and the declaration to fix the
warning/error without disturbing the code too much.
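A minimal standalone illustration of the construct and the fix:
```
static int example(int x)
{
	if (x < 0)
		goto err;
	return x;
err:;	/* empty statement: before C23, a label may not be directly
	 * followed by a declaration */
	int ret = -1;
	return ret;
}
```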
Fixes: 2519d3b0d656 ("bcachefs: bch2_str_hash_check_key() now checks inode hash info")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202412092339.QB7hffGC-lkp@intel.com/
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_snapshot_equiv() is going away; convert users that just wanted to
know if the snapshot exists to something better.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Versions of the same inode in different snapshots must have the same
hash info; this is critical for lookups to work correctly.
We're going to be running the str_hash checks online, at readdir or
xattr list time, so we now need str_hash_check_key() to check for inode
hash seed mismatches, since it won't be run right after check_inodes().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Bkey validation checks that inodes are well-formed and unpack
successfully, so an unpack error should always indicate memory
corruption or some other kind of hardware bug - but these are still
errors we can recover from.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Since we added per-inode counters there's now far too many counters to
show in one shot - if we want this in the future, it'll have to be in
debugfs.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We can't hold mark_lock while calling fsck_err() - that's a deadlock,
mark_lock is meant to be a leaf node lock.
It's also unnecessary for gc_bucket() and bucket_gen(); rcu suffices
since the bucket_gens array describes its size, and we can't race with
device removal or resize during gc/fsck since that takes state lock.
Reported-by: syzbot+38641fcbda1aaffefdd4@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This fixes a deadlock during journal replay when btree node read errors
kick off a ton of rewrites: we don't want them competing with journal
replay.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
For each bucket we track when the bucket became nonempty and when it
became empty again: if we can ensure that there will be no journal
flushes in the range [nonempty, empty) (possibly because they occurred at
the same journal sequence number), then it's safe to reuse the bucket
without waiting for a journal commit.
This is a major performance optimization for erasure coding, where
writes are initially replicated, but the extra replicas are quickly
dropped: if those buckets are reused and overwritten without issuing a
cache flush to the underlying device, then they only cost bus bandwidth.
But there's a tricky corner case when there's multiple empty -> nonempty
-> empty transitions in quick succession, i.e. when data is getting
overwritten immediately as it's being written.
If this happens and the previous empty transition hasn't been flushed,
we need to continue tracking the previous nonempty transition - not
start a new one.
Fixing this means we now need to track both the nonempty and empty
transitions in bch_alloc_v4.
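A sketch of one sufficient condition this tracking enables (field and
helper names are illustrative):
```
struct bucket_journal_seqs {
	u64 nonempty;	/* journal seq when the bucket became nonempty */
	u64 empty;	/* journal seq when it became empty again */
};

/* Reuse without waiting is safe if both transitions happened at the
 * same journal sequence number, or the empty transition has already
 * been flushed: */
static bool bucket_needs_journal_commit(const struct bucket_journal_seqs *b,
					u64 journal_flushed_seq)
{
	return b->nonempty != b->empty &&
	       b->empty > journal_flushed_seq;
}
```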
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Harder to screw up if we're explicit about the range, and more correct
as journal reservations can be outstanding on multiple journal entries
simultaneously.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This lets us print the exact location in the journal if it was found in
the journal, or correctly print if it was found in the superblock.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Fix an O(n^2) issue when we find many overlapping (overwritten) btree
nodes - especially when one node overwrites many smaller nodes.
This was discovered to be an issue with the bcachefs
merge_torture_flakey test - if we had a large btree that was then
emptied, the number of difficult overwrites can be unbounded.
Cc: Kuan-Wei Chiu <visitorckw@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Check open buckets and buckets waiting for journal commit before doing
other expensive lookups.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Originally, btree splits always succeeded once we got to the point of
recursing to the btree_insert_node() call.
But that changed when we switched to not taking intent locks all the way
up to the root, and that introduced a bug, because
bch2_btree_interior_update_will_free_node() cancels pending writes and
reparents a node that's going to be made visible on disk by another
btree update to the current btree update.
This was discovered in recent backpointers work, because
bch2_btree_interior_update_will_free_node() also clears the
will_make_reachable flag, causing backpointer target lookup to
spuriously think it had found a dangling backpointer (when the
backpointer just hadn't been created yet by
btree_update_nodes_written()).
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We should always signal to rewind if the requested pass hasn't been run,
even if called multiple times.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This lets us use darray macros on dev_alloc_list (and it will become a
darray eventually, when we increase the maximum number of devices).
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When allocating a journal write fails, then retries after doing
discards, we were failing to count already allocated replicas.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add a tracepoint for inserting new accounting entries: we're seeing odd
spinning behaviour in accounting read.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The validate late path was iterating over accounting entries in
eytzinger order, which is unnecessarily tricky when we may have to
remove entries.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We wish to use the logged ops btree for other items that aren't
strictly logged ops: cursors for inode allocation.
There's no reason to create another cached btree for inode allocator
cursors - so reserve different parts of the keyspace for different
purposes.
Older versions will ignore or delete the cursors.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Introduce a typedef to handle the difference between unsigned
long/struct urcu_gp_poll_state.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When tracing is disabled, there is no point in asking the user about
enabling extra btree_path tracepoints in bcachefs.
Fixes: 32ed4a620c ("bcachefs: Btree path tracepoints")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The "journal space available" calculations didn't take into account
mismatched bucket sizes; we need to take the minimum space available out
of our devices.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add a method to flush btree node rewrites at the end of recovery, to
ensure that corrected errors are persisted.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Ensure that "invalid bkey" repair gets persisted, so that it doesn't
repeatedly spam the logs.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add a function for walking backpointers to find a path from a given
inode number, and convert various error messages to use it.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The function bch2_bucket_alloc_trans() lacked a description for the
nowait parameter in its documentation comment block. This patch adds the
missing description to ensure all parameters are properly documented.
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=12179
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When calling check_discard_freespace_key from the allocator, we can't
repair without recursing - run it asynchronously instead.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When not compressed, these must be equal - this fixes an assertion pop
in bch2_rechecksum_bio().
Reported-by: syzbot+50d3544c9b8db9c99fd2@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We should add support for cryptographic macs on the superblock - and it
won't be hard, but it'll need an incompatible feature bit (and we have a
new incompatible feature versioning scheme coming).
For now, just add a guard to avoid a null ptr deref in gen_poly_key().
Reported-by: syzbot+dd3d9835055dacb66f35@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This fixes an assertion pop in bch2_journal_noflush_seq() - log the
error to the superblock and continue instead.
Reported-by: syzbot+85700120f75fc10d4e18@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Transaction commits invalidate pointers to btree values, and they also
downgrade intent locks.
This breaks the interior btree update path, which takes intent locks and
then calls into the allocator.
This isn't an ideal solution: we can't unconditionally issue a restart
after a transaction commit, because that would break other codepaths.
Reported-by: syzbot+78d82470c16a49702682@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Wraparound is impractical to handle since in various places we use 0 as
a sentinel value - but 64 bits (or 56, because the btree write buffer
steals a few bits) is enough for all practical purposes.
Reported-by: syzbot+73ed43fbe826227bd4e0@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
These repair paths are well tested; we can repair them without
explicit user intervention.
This also tweaks bch2_topology_error() so that we run topology repair if
we're in recovery, not just fsck.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add a new parameter to bkey validate functions, and use it to improve
invalid bkey error messages: we can now print the btree and depth it
came from, or if it came from the journal, or is a btree root.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
There's no reason to treat them as errors: just ignore them, and go with
a previous btree root if we had one.
Reported-by: syzbot+e22007d6acb9c87c2362@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Historically, we required that all btree node roots point to a valid
(possibly fake) node, but we're improving our ability to continue in the
presence of errors.
Reported-by: syzbot+e22007d6acb9c87c2362@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Previously, when mounting read-write after a clean shutdown, we wouldn't
go read-write until after all the recovery passes completed.
Now, go RW early in recovery, the same as any other situation we'll need
to go read-write. This fixes a bug where we discover unlinked inodes
after a clean shutdown: repair fails because we're read only.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Fix an assertion pop from the recent btree cache freelist fixes.
Fixes: baefd3f849 ("bcachefs: btree_cache.freeable list fixes")
Reported-by: Tyler <th020394@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
6.11 had a bug where we'd sometimes create disk accounting keys with
version 0, which causes issues for journal replay - but we don't need to
delete existing accounting keys with version 0.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If a btree node says it's encrypted, but the superblock never had an
encryption key - whoops, that needs to be handled.
Reported-by: syzbot+026f1857b12f5eb3f9e9@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The early-early allocation path, bch2_bucket_alloc_new_fs(), is no
longer needed - and inconsistencies around new_fs_bucket_idx have been a
frequent source of bugs.
Reported-by: syzbot+592425844580a6598410@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
btree_root entries for unknown btree IDs are created during recovery,
before reading those btree roots.
But btree_node_scan may find btree nodes with unknown btree IDs when we
haven't seen roots for those btrees.
Reported-by: syzbot+1f202d4da221ec6ebf8e@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If we rewind recovery to run topology repair, that causes
accounting_read to run twice.
This fixes accounting being double counted.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Accounting keys that reference invalid devices are corrected by fsck,
they shouldn't cause an emergency shutdown.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Instead of throwing standard error codes, we should be throwing
dedicated private error codes; this greatly improves debuggability.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The discard bucket fastpath previously was using its own code for
discarding buckets and clearing them in the need_discard btree, which
didn't have any of the consistency checks of the main discard path.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Per reports of performance issues on mixed multi device filesystems
where we're issuing too much IO to the spinning rust - tweak this
algorithm.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
New helpers:
- bch2_backpointer_del()
- bch2_backpointer_maybe_flush()
Kill a bit of open coding and make sure we're properly handling the
btree write buffer.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch_backpointer.bucket_offset is going away - it's no longer needed
since we no longer store backpointers in alloc keys, the same
information is in the key position itself.
And we'll be reclaiming the space in bch_backpointer for the bucket
generation number.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_get_btree_in_memory_pos() will return positions that refer
directly to the btree whose in-memory footprint it's checking - i.e.
backpointer positions, not buckets.
This also means check_bp_exists() no longer has to refer to the device,
and we can delete some code.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Since we no longer store backpointers in alloc keys, there's no reason
not to pass around bkey_i_backpointers; this means we don't have to pass
the bucket pos separately.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
_noerror means don't produce inconsistent errors, so it should be using
bch2_dev_rcu_noerror().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
86a494c8eef9 ("bcachefs: Kill bch2_get_next_backpointer()") dropped some
things the tracepoint emitted because bch2_evacuate_bucket() no longer
looks at the alloc key - but we did want at least some of that.
We still no longer look at the alloc key so we can't report on the
fragmentation number, but that's a direct function of dirty_sectors and
a copygc concern anyways - copygc should get its own tracepoint that
includes information from the fragmentation LRU.
But we can report on the number of sectors we moved and the bucket size.
Co-developed-by: Piotr Zalewski <pZ010001011111@proton.me>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The journal_keys array can't be substantially modified after we go RW,
because lookups need to be able to check it locklessly - thus we're
limited on what we can do when a key in the journal has been
overwritten.
This is a problem when there are many overwrites to skip over for peek()
operations. To fix this, add tracking of ranges of overwrites: we create
a range entry when there's more than one contiguous whiteout.
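A sketch of the idea (names hypothetical):
```
/* A range entry covering a run of contiguous whiteouts lets peek()
 * skip the whole run in one step instead of iterating key by key: */
struct overwrite_range {
	size_t start, end;	/* [start, end) indexes into journal keys */
};

static size_t skip_overwritten(const struct overwrite_range *r, size_t idx)
{
	return r && idx >= r->start && idx < r->end ? r->end : idx;
}
```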
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
To help ameliorate issues with peek operations having to skip over
deletions in the journal - just bail out if all we're doing is
prefetching btree nodes.
Since btree node prefetching runs every time we iterate to a new node,
and has to sequentially scan ahead, this avoids another O(n^2).
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
There's an unavoidable issue with btree lookups when we're overlaying
journal keys and the journal has many deletions for keys present in the
btree - peek operations will have to iterate over all those deletions to
find the next live key to return.
This is mainly a problem for lookups in interior nodes, if we have to
traverse to a leaf. Looking up an insert position in a leaf (for journal
replay) doesn't have to find the next live key, but walking down the
btree does.
So to ameliorate this, change the journal key sort ordering so that we
replay keys from roots and interior nodes first.
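One possible ordering, as an illustrative sketch (not the actual
comparator):
```
struct jkey_sort_pos {
	unsigned btree_id;
	unsigned level;
};

/* Sort by btree, then by level descending, so interior node keys -
 * needed for traversal - replay before leaf keys: */
static int journal_key_cmp(const struct jkey_sort_pos *l,
			   const struct jkey_sort_pos *r)
{
	if (l->btree_id != r->btree_id)
		return l->btree_id < r->btree_id ? -1 : 1;
	if (l->level != r->level)
		return l->level > r->level ? -1 : 1;	/* descending */
	return 0;
}
```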
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We don't allocate the mempools for compression/decompression unless we
need them - but that means there's an inconsistency to check for.
Reported-by: syzbot+cb3fbcfb417448cfd278@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
gzip and zstd require different decompress workspace sizes, and if we
start with one and then start using the other at runtime, we may not
get the correct size.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The compression type enum includes lz4 and lz4_old, which do not get
different compression workspaces, and incompressible, a fake type -
BCH_COMPRESSION_OPTS() is the correct enum to use.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Since for quite some time backpointers have only been stored in the
backpointers btree, not alloc keys (an aborted experiment, support for
which has been removed) - we can replace get_next_backpointer() with
simple btree iteration.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
try_alloc_bucket() has a "safety" check, which avoids allocating a
bucket if there are any backpointers present.
But backpointers are not the source of truth for live data in a bucket,
the bucket sector counts are; this check was fairly useless, and we're
also deferring backpointers checks from fsck to runtime in the near
future.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
With extents and snapshots, for slightly different reasons, we may have
to search forwards to find a key that compares equal to iter->pos (i.e.
a key that peek_prev() should return, as it returns keys <= iter->pos).
peek_slot() does this, and is an easy way to fix this case.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
A user contributed a filesystem dump, where the dump was actually
corrupted (due to being taken while the filesystem was online), but
which exposed an interesting bug in fsck - reconstruct_inode().
When iterating in BTREE_ITER_filter_snapshots mode, it's required to
give an end position for the iteration, and it can't span inode
numbers; continuing into the next inode might mean we start seeing keys
from a different snapshot tree, which the is_ancestor() checks always
filter out, so we'd never be able to return a key and stop iterating.
Backwards iteration never implemented the end position because nothing
else needed it - except for reconstruct_inode().
Additionally, backwards iteration is now able to overlay keys from the
journal, which will be useful if we ever decide to start doing journal
replay in the background.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Factor out a common helper, need_discard_or_freespace_err(), which is
now used by both fsck and the runtime checks, and can repair.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
check_discard_freespace_key() was doing all the same checks as
try_alloc_bucket(), but with repair.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Change it to a normal fsck_err() - meaning it'll get repaired at runtime
when that's flipped on.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
To avoid tragic data loss in the event of transient errors (e.g. a
btree node topology error that was later corrected by btree node
scan), we can't delete reflink pointers to correct errors.
This adds a new error bit to bch_reflink_p, indicating that it is known
to point to a missing indirect extent, and the error has already been
reported.
Indirect extent lookups now use bch2_lookup_indirect_extent(), which on
error reports it as a fsck_err() and sets the error bit, and clears it
if necessary on successful lookup.
This also gets rid of the bch2_inconsistent_error() call in
__bch2_read_indirect_extent, and in the reflink_p trigger: part of the
online self healing project.
An on disk format change isn't necessary here: setting the error bit
will be interpreted by older versions as pointing to a different index,
which will also be missing - which is fine.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Better repair for reflink pointers, as well as propagating new inode
options to indirect extents, are going to require a few extra bits in
bch_reflink_p: so claim a few from the high end of the destination
index.
Also add some missing bounds checking.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If we find an error that indicates that we need to run fsck, we can
specify that directly with run_explicit_recovery_pass().
These are now log_fsck_err() calls: we're just logging in the superblock
that an error occurred - and possibly doing an emergency shutdown,
depending on policy.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
alloc key validation ensures that if a bucket is in need_discard state
the sector counts are all zero - we don't have to check for that.
The NEED_INC_GEN check appears to be dead code, as well: we only see
buckets in the need_discard btree, and it's an error if they aren't in
the need_discard state.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
`inode->v.i_ino` has been initialized to `inum.inum`. If `inum.inum` and
`bi->bi_inum` are not equal, BUG_ON() is triggered in
bch2_inode_update_after_write().
Signed-off-by: Youling Tang <tangyouling@kylinos.cn>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
__filemap_get_folio() cannot return NULL, so the unnecessary NULL
checks are removed.
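A sketch of the resulting pattern, assuming the modern
__filemap_get_folio() that returns an ERR_PTR() on failure rather than
NULL:
```
#include <linux/pagemap.h>

static int get_locked_folio_example(struct address_space *mapping,
				    pgoff_t index)
{
	struct folio *folio = __filemap_get_folio(mapping, index,
						  FGP_LOCK | FGP_CREAT,
						  GFP_KERNEL);
	if (IS_ERR(folio))	/* never NULL, so no NULL check needed */
		return PTR_ERR(folio);

	folio_unlock(folio);
	folio_put(folio);
	return 0;
}
```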
Signed-off-by: Youling Tang <tangyouling@kylinos.cn>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Use super_set_uuid() to set `sb->s_uuid_len` to avoid returning `-ENOTTY`
with sb->s_uuid_len being 0.
Original patch link:
[1]: https://lore.kernel.org/all/20240207025624.1019754-2-kent.overstreet@linux.dev/
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Youling Tang <tangyouling@kylinos.cn>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Currently, when using bcachefs-tools to set options, bool-type options
can only accept 1 or 0. Add support for accepting true/false and yes/no
for these options, using the existing constant table.
Signed-off-by: Integral <integral@murena.io>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Acked-by: David Howells <dhowells@redhat.com>
Collapse all the BTREE_ITER_filter_snapshots handling down into a single
block; btree iteration is much simpler in the !filter_snapshots case.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We're not allowed to have a dirty key in the key cache if the key
doesn't exist at all in the btree - creation has to bypass the key
cache, so that iteration over the btree can check if the key is present
in the key cache.
Things break in subtle ways if cache coherency is broken, so this needs
an assert.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We're ramping up on checking transaction restart handling correctness -
so, in debug mode we now save a backtrace for where the restart was
emitted, which makes it much easier to track down the incorrect
handling.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Generally, releasing a transaction within a transaction restart means an
unhandled transaction restart: but this can happen legitimately within
the move code, e.g. when bch2_move_ratelimit() tells us to exit before
we've retried.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
On interior btree node updates, we always verify that we're not
introducing topology errors: child nodes should exactly span the range
of the parent node.
single_device.ktest small_nodes has been popping this assert: change it
to give us more information.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We unconditionally go read-write, if we're going to do so, before
journal replay: lazy_rw is obsolete.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The journal replay keys mechanism can only be used for updates in early
recovery, when still single threaded.
Add some asserts to make sure we never accidentally use it elsewhere.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The sysfs attribute definitions have been wrapped into macros:
rw_attribute, read_attribute and write_attribute; use these helpers to
make the attribute definitions uniform.
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
gc_gens_pos shows the status of bucket gen gc. There is no need to
give this attribute write permission, so use the read_attribute helper
to define it.
```
[Before]
$ ll internal/gc_gens_pos
-rw-r--r-- 1 root root 4096 Oct 28 15:27 internal/gc_gens_pos
[After]
$ ll internal/gc_gens_pos
-r--r--r-- 1 root root 4096 Oct 28 17:27 internal/gc_gens_pos
```
Fixes: ac516d0e7d ("bcachefs: Add the status of bucket gen gc to sysfs")
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Since bch2_move_get_io_opts() now synchronizes io_opts with options from
bch_extent_rebalance, delete the ad-hoc logic in rebalance.c that
previously did this.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_move_get_io_opts() now synchronizes options loaded from the
filesystem and inode (if present, i.e. not walking the reflink btree
directly) with options from the bch_extent_rebalance entry, updating the
extent if necessary.
Since bch_extent_rebalance tracks where its option came from we can
preserve "inode options override filesystem options", even for indirect
extents where we don't have access to the inode the options came from.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Previously, BCHFS_IOC_REINHERIT_ATTRS didn't trigger rebalance scans
when changing rebalance options - it had been missed, only the xattr
interface triggered them.
Ideally they'd be done by the transactional trigger, but unpacking the
inode to get the options is too heavy to be done in the low level
trigger - the inode trigger is run on every extent update, since the
bch_inode.bi_journal_seq has to be updated for fsync.
bch2_write_inode() is a good compromise, it already unpacks and repacks
and is not run in any super-fast paths.
Additionally, creating the new rebalance entry to trigger the scan is
now done in the same transaction as the inode update that changed the
options.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
- Add more io path options to bch_extent_rebalance
- For each option, track whether it came from the filesystem or the
inode
This will be used for improved rebalance support for reflinked data.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This is going to be used in the bch_extent_rebalance improvements, which
propagate io_path options into the extent (important for rebalance,
which needs something present in the extent for transactionally tagging
them in the rebalance_work btree, and also for indirect extents).
By tracking in bch_extent_rebalance whether the option came from the
filesystem or the inode we can correctly handle options being changed on
indirect extents.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add the __counted_by compiler attribute to the flexible array member b
to improve access bounds-checking via CONFIG_UBSAN_BOUNDS and
CONFIG_FORTIFY_SOURCE.
Use struct_size() to calculate the number of bytes to be allocated.
Update bucket_gens->nbuckets and bucket_gens->nbuckets_minus_first when
resizing.
Compile-tested only.
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Remove hard-coded strings by using the str_write_read() helper function.
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Remove hard-coded strings by using the helper function str_write_read().
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Remove hard-coded strings by using the helper function str_write_read().
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
A user popped up with a very old (0.11) filesystem that needed repair
and wasn't recently backed up.
Reported-by: Manoa <manoa@mail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This will help with some of the btree_trans srcu lock hold time warnings
that are still turning up; submit_bio() can block for a while if the
device is sufficiently congested.
It's not a perfect solution since blk_plug bios are submitted when
scheduling; we might want a way to disable the "submit on context
switch" behaviour, or switch to our own plugging in the future.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This is part of addressing
https://github.com/koverstreet/bcachefs/issues/656
where we're getting stuck in bch2_journal_meta() in the dump tool.
We shouldn't be invoking the journal without a ref on c->writes (if
we're not RW), and there's no reason for the dump tool to be going
read-write.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
-o norecovery (used by the dump tool) should be doing the absolute
minimum amount of work to get the filesystem up and readable; we
shouldn't be running check and repair code, or going read-write.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We need to add a path for reshaping existing stripes (for e.g. device
removal), and this new path won't necessarily use ec_stripe_head.
Refactor the code to avoid unnecessary references to it for clarity.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Recovery can rewind in certain situations - when we discover we need to
run a pass that doesn't normally run.
This can happen from another thread for btree node read errors, so we
need a bit of locking.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
They can be regenerated by fsck and don't require a btree node scan,
like other alloc btrees.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If we're not in recovery then there's no way to rewind recovery - give
this a different errcode so that any error messages will give us a
better idea of what happened.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Use the existing FOREACH_ACL_ENTRY() macro to iterate over POSIX acl
entries and remove the custom acl_for_each_entry() macro.
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The header files dirent_format.h and disk_groups_format.h are included
twice. Remove the redundant includes and the following warnings reported
by make includecheck:
disk_groups_format.h is included more than once
dirent_format.h is included more than once
Reviewed-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
hash_lookup() used to return an error code, and a peek_slot() call was
required to get the key it looked up. But we're adding fault injection
for transaction restarts, so fix this old unconverted code.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
A series posted previously moved all of the `struct xattr_handler`
tables to .rodata for each filesystem [1].
However, this appears to have been done shortly before bcachefs was
merged, so bcachefs was missed at that time.
Link: https://lkml.kernel.org/r/20230930050033.41174-1-wedsonaf@gmail.com [1]
Cc: Wedson Almeida Filho <wedsonaf@gmail.com>
Signed-off-by: Thomas Bertschinger <tahbertschinger@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
lock_fail_root_changed has not been used since commit
0d7009d7ca ("bcachefs: Delete old deadlock avoidance code")
Remove it.
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Also, fix a minor bug in the revert path, where we weren't checking the
journal entry type correctly.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
In userspace we don't (yet) have an SRCU implementation, so call_srcu()
recurses.
But we don't want to be invoking it under the lock anyways.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
There are several statements followed by two semicolons; replace these
with just one semicolon.
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Merge tag 'mm-nonmm-stable-2024-11-24-02-05' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton:
- The series "resource: A couple of cleanups" from Andy Shevchenko
performs some cleanups in the resource management code
- The series "Improve the copy of task comm" from Yafang Shao addresses
possible race-induced overflows in the management of
task_struct.comm[]
- The series "Remove unnecessary header includes from
{tools/}lib/list_sort.c" from Kuan-Wei Chiu adds some cleanups and a
small fix to the list_sort library code and to its selftest
- The series "Enhance min heap API with non-inline functions and
optimizations" also from Kuan-Wei Chiu optimizes and cleans up the
min_heap library code
- The series "nilfs2: Finish folio conversion" from Ryusuke Konishi
finishes off nilfs2's folioification
- The series "add detect count for hung tasks" from Lance Yang adds
more userspace visibility into the hung-task detector's activity
- Apart from that, singleton patches in many places - please see the
individual changelogs for details
* tag 'mm-nonmm-stable-2024-11-24-02-05' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (71 commits)
gdb: lx-symbols: do not error out on monolithic build
kernel/reboot: replace sprintf() with sysfs_emit()
lib: util_macros_kunit: add kunit test for util_macros.h
util_macros.h: fix/rework find_closest() macros
Improve consistency of '#error' directive messages
ocfs2: fix uninitialized value in ocfs2_file_read_iter()
hung_task: add docs for hung_task_detect_count
hung_task: add detect count for hung tasks
dma-buf: use atomic64_inc_return() in dma_buf_getfile()
fs/proc/kcore.c: fix coccinelle reported ERROR instances
resource: avoid unnecessary resource tree walking in __region_intersects()
ocfs2: remove unused errmsg function and table
ocfs2: cluster: fix a typo
lib/scatterlist: use sg_phys() helper
checkpatch: always parse orig_commit in fixes tag
nilfs2: convert metadata aops from writepage to writepages
nilfs2: convert nilfs_recovery_copy_block() to take a folio
nilfs2: convert nilfs_page_count_clean_buffers() to take a folio
nilfs2: remove nilfs_writepage
nilfs2: convert checkpoint file to be folio-based
...
This runs on extents that haven't yet been validated, so we don't want
to assert that we have a valid entry type.
Reported-by: syzbot+4f29c3f12f864d8a8d17@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If the jset_entry_dev_usage is malformed, and too small, our nr_entries
calculation will be incorrect - just bail out.
Reported-by: syzbot+05d7520be047c9be86e0@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We can't assume that btrees only contain keys of a given type - even if
they only have a single key type listed in the allowed key types for
that btree; this is a forwards compatibility issue.
Reported-by: syzbot+a27c3aaa3640dd3e1dfb@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>