Commit Graph

5280 Commits

Author SHA1 Message Date
Wentao Liang
9364f17ba4 bcachefs: Add error handling for zlib_deflateInit2()
In attempt_compress(), the return value of zlib_deflateInit2() needs to be
checked. A proper implementation can be found in  pstore_compress().

Add an error check and return 0 immediately if the initialzation fails.

Fixes: 986e9842fb ("bcachefs: Compression levels")
Signed-off-by: Wentao Liang <vulab@iscas.ac.cn>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-03 12:11:42 -04:00
Eric Biggers
a07c43e6c2 bcachefs: add missing selection of XARRAY_MULTI
When CONFIG_XARRAY_MULTI is not set, reading from a bcachefs file hits
the 'BUG_ON(order > 0);' in xas_set_order(), because it tries to insert
a large folio in the page cache.  Fix this by making bcachefs select
XARRAY_MULTI.

Fixes: be212d86b1 ("bcachefs: bs > ps support")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-02 10:24:34 -04:00
Kent Overstreet
955ba7b5ea bcachefs: bch_dev_usage_full
All the fastpaths that need device usage don't need the sector totals or
fragmentation, just bucket counts.

Split bch_dev_usage up into two different versions, the normal one with
just bucket counts.

This is also a stack usage improvement, since we have a bch_dev_usage on
the stack in the allocation path.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-02 10:24:34 -04:00
Kent Overstreet
9180ad2e16 bcachefs: Kill btree_iter.trans
This was planned to be done ages ago, now finally completed; there are
places where we have quite a few btree_trans objects on the stack, so
this reduces stack usage somewhat.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-02 10:24:34 -04:00
Kent Overstreet
1c8f4587d2 bcachefs: do_trace_key_cache_fill()
Reducing stack frame usage; this moves the printbuf out of the main
stack frame.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-02 10:24:34 -04:00
Kent Overstreet
dcffc3b1ae bcachefs: Split up bch_dev.io_ref
We now have separate per device io_refs for read and write access.

This fixes a device removal bug where the discard workers were still
running while we're removing alloc info for that device.

It's also a bit of hardening; we no longer allow writes to devices that
are read-only.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-02 10:24:34 -04:00
Linus Torvalds
eb0ece1602 - The 6 patch series "Enable strict percpu address space checks" from
Uros Bizjak uses x86 named address space qualifiers to provide
   compile-time checking of percpu area accesses.
 
   This has caused a small amount of fallout - two or three issues were
   reported.  In all cases the calling code was founf to be incorrect.
 
 - The 4 patch series "Some cleanup for memcg" from Chen Ridong
   implements some relatively monir cleanups for the memcontrol code.
 
 - The 17 patch series "mm: fixes for device-exclusive entries (hmm)"
   from David Hildenbrand fixes a boatload of issues which David found then
   using device-exclusive PTE entries when THP is enabled.  More work is
   needed, but this makes thins better - our own HMM selftests now succeed.
 
 - The 2 patch series "mm: zswap: remove z3fold and zbud" from Yosry
   Ahmed remove the z3fold and zbud implementations.  They have been
   deprecated for half a year and nobody has complained.
 
 - The 5 patch series "mm: further simplify VMA merge operation" from
   Lorenzo Stoakes implements numerous simplifications in this area.  No
   runtime effects are anticipated.
 
 - The 4 patch series "mm/madvise: remove redundant mmap_lock operations
   from process_madvise()" from SeongJae Park rationalizes the locking in
   the madvise() implementation.  Performance gains of 20-25% were observed
   in one MADV_DONTNEED microbenchmark.
 
 - The 12 patch series "Tiny cleanup and improvements about SWAP code"
   from Baoquan He contains a number of touchups to issues which Baoquan
   noticed when working on the swap code.
 
 - The 2 patch series "mm: kmemleak: Usability improvements" from Catalin
   Marinas implements a couple of improvements to the kmemleak user-visible
   output.
 
 - The 2 patch series "mm/damon/paddr: fix large folios access and
   schemes handling" from Usama Arif provides a couple of fixes for DAMON's
   handling of large folios.
 
 - The 3 patch series "mm/damon/core: fix wrong and/or useless
   damos_walk() behaviors" from SeongJae Park fixes a few issues with the
   accuracy of kdamond's walking of DAMON regions.
 
 - The 3 patch series "expose mapping wrprotect, fix fb_defio use" from
   Lorenzo Stoakes changes the interaction between framebuffer deferred-io
   and core MM.  No functional changes are anticipated - this is
   preparatory work for the future removal of page structure fields.
 
 - The 4 patch series "mm/damon: add support for hugepage_size DAMOS
   filter" from Usama Arif adds a DAMOS filter which permits the filtering
   by huge page sizes.
 
 - The 4 patch series "mm: permit guard regions for file-backed/shmem
   mappings" from Lorenzo Stoakes extends the guard region feature from its
   present "anon mappings only" state.  The feature now covers shmem and
   file-backed mappings.
 
 - The 4 patch series "mm: batched unmap lazyfree large folios during
   reclamation" from Barry Song cleans up and speeds up the unmapping for
   pte-mapped large folios.
 
 - The 18 patch series "reimplement per-vma lock as a refcount" from
   Suren Baghdasaryan puts the vm_lock back into the vma.  Our reasons for
   pulling it out were largely bogus and that change made the code more
   messy.  This patchset provides small (0-10%) improvements on one
   microbenchmark.
 
 - The 5 patch series "Docs/mm/damon: misc DAMOS filters documentation
   fixes and improves" from SeongJae Park does some maintenance work on the
   DAMON docs.
 
 - The 27 patch series "hugetlb/CMA improvements for large systems" from
   Frank van der Linden addresses a pile of issues which have been observed
   when using CMA on large machines.
 
 - The 2 patch series "mm/damon: introduce DAMOS filter type for unmapped
   pages" from SeongJae Park enables users of DMAON/DAMOS to filter my the
   page's mapped/unmapped status.
 
 - The 19 patch series "zsmalloc/zram: there be preemption" from Sergey
   Senozhatsky teaches zram to run its compression and decompression
   operations preemptibly.
 
 - The 12 patch series "selftests/mm: Some cleanups from trying to run
   them" from Brendan Jackman fixes a pile of unrelated issues which
   Brendan encountered while runnimg our selftests.
 
 - The 2 patch series "fs/proc/task_mmu: add guard region bit to pagemap"
   from Lorenzo Stoakes permits userspace to use /proc/pid/pagemap to
   determine whether a particular page is a guard page.
 
 - The 7 patch series "mm, swap: remove swap slot cache" from Kairui Song
   removes the swap slot cache from the allocation path - it simply wasn't
   being effective.
 
 - The 5 patch series "mm: cleanups for device-exclusive entries (hmm)"
   from David Hildenbrand implements a number of unrelated cleanups in this
   code.
 
 - The 5 patch series "mm: Rework generic PTDUMP configs" from Anshuman
   Khandual implements a number of preparatoty cleanups to the
   GENERIC_PTDUMP Kconfig logic.
 
 - The 8 patch series "mm/damon: auto-tune aggregation interval" from
   SeongJae Park implements a feedback-driven automatic tuning feature for
   DAMON's aggregation interval tuning.
 
 - The 5 patch series "Fix lazy mmu mode" from Ryan Roberts fixes some
   issues in powerpc, sparc and x86 lazy MMU implementations.  Ryan did
   this in preparation for implementing lazy mmu mode for arm64 to optimize
   vmalloc.
 
 - The 2 patch series "mm/page_alloc: Some clarifications for migratetype
   fallback" from Brendan Jackman reworks some commentary to make the code
   easier to follow.
 
 - The 3 patch series "page_counter cleanup and size reduction" from
   Shakeel Butt cleans up the page_counter code and fixes a size increase
   which we accidentally added late last year.
 
 - The 3 patch series "Add a command line option that enables control of
   how many threads should be used to allocate huge pages" from Thomas
   Prescher does that.  It allows the careful operator to significantly
   reduce boot time by tuning the parallalization of huge page
   initialization.
 
 - The 3 patch series "Fix calculations in trace_balance_dirty_pages()
   for cgwb" from Tang Yizhou fixes the tracing output from the dirty page
   balancing code.
 
 - The 9 patch series "mm/damon: make allow filters after reject filters
   useful and intuitive" from SeongJae Park improves the handling of allow
   and reject filters.  Behaviour is made more consistent and the
   documention is updated accordingly.
 
 - The 5 patch series "Switch zswap to object read/write APIs" from Yosry
   Ahmed updates zswap to the new object read/write APIs and thus permits
   the removal of some legacy code from zpool and zsmalloc.
 
 - The 6 patch series "Some trivial cleanups for shmem" from Baolin Wang
   does as it claims.
 
 - The 20 patch series "fs/dax: Fix ZONE_DEVICE page reference counts"
   from Alistair Popple regularizes the weird ZONE_DEVICE page refcount
   handling in DAX, permittig the removal of a number of special-case
   checks.
 
 - The 4 patch series "refactor mremap and fix bug" from Lorenzo Stoakes
   is a preparatoty refactoring and cleanup of the mremap() code.
 
 - The 20 patch series "mm: MM owner tracking for large folios (!hugetlb)
   + CONFIG_NO_PAGE_MAPCOUNT" from David Hildenbrand reworks the manner in
   which we determine whether a large folio is known to be mapped
   exclusively into a single MM.
 
 - The 8 patch series "mm/damon: add sysfs dirs for managing DAMOS
   filters based on handling layers" from SeongJae Park adds a couple of
   new sysfs directories to ease the management of DAMON/DAMOS filters.
 
 - The 13 patch series "arch, mm: reduce code duplication in mem_init()"
   from Mike Rapoport consolidates many per-arch implementations of
   mem_init() into code generic code, where that is practical.
 
 - The 13 patch series "mm/damon/sysfs: commit parameters online via
   damon_call()" from SeongJae Park continues the cleaning up of sysfs
   access to DAMON internal data.
 
 - The 3 patch series "mm: page_ext: Introduce new iteration API" from
   Luiz Capitulino reworks the page_ext initialization to fix a boot-time
   crash which was observed with an unusual combination of compile and
   cmdline options.
 
 - The 8 patch series "Buddy allocator like (or non-uniform) folio split"
   from Zi Yan reworks the code to split a folio into smaller folios.  The
   main benefit is lessened memory consumption: fewer post-split folios are
   generated.
 
 - The 2 patch series "Minimize xa_node allocation during xarry split"
   from Zi Yan reduces the number of xarray xa_nodes which are generated
   during an xarray split.
 
 - The 2 patch series "drivers/base/memory: Two cleanups" from Gavin Shan
   performs some maintenance work on the drivers/base/memory code.
 
 - The 3 patch series "Add tracepoints for lowmem reserves, watermarks
   and totalreserve_pages" from Martin Liu adds some more tracepoints to
   the page allocator code.
 
 - The 4 patch series "mm/madvise: cleanup requests validations and
   classifications" from SeongJae Park cleans up some warts which SeongJae
   observed during his earlier madvise work.
 
 - The 3 patch series "mm/hwpoison: Fix regressions in memory failure
   handling" from Shuai Xue addresses two quite serious regressions which
   Shuai has observed in the memory-failure implementation.
 
 - The 5 patch series "mm: reliable huge page allocator" from Johannes
   Weiner makes huge page allocations cheaper and more reliable by reducing
   fragmentation.
 
 - The 5 patch series "Minor memcg cleanups & prep for memdescs" from
   Matthew Wilcox is preparatory work for the future implementation of
   memdescs.
 
 - The 4 patch series "track memory used by balloon drivers" from Nico
   Pache introduces a way to track memory used by our various balloon
   drivers.
 
 - The 2 patch series "mm/damon: introduce DAMOS filter type for active
   pages" from Nhat Pham permits users to filter for active/inactive pages,
   separately for file and anon pages.
 
 - The 2 patch series "Adding Proactive Memory Reclaim Statistics" from
   Hao Jia separates the proactive reclaim statistics from the direct
   reclaim statistics.
 
 - The 2 patch series "mm/vmscan: don't try to reclaim hwpoison folio"
   from Jinjiang Tu fixes our handling of hwpoisoned pages within the
   reclaim code.
 -----BEGIN PGP SIGNATURE-----
 
 iHQEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZ+nZaAAKCRDdBJ7gKXxA
 jsOWAPiP4r7CJHMZRK4eyJOkvS1a1r+TsIarrFZtjwvf/GIfAQCEG+JDxVfUaUSF
 Ee93qSSLR1BkNdDw+931Pu0mXfbnBw==
 =Pn2K
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2025-03-30-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - The series "Enable strict percpu address space checks" from Uros
   Bizjak uses x86 named address space qualifiers to provide
   compile-time checking of percpu area accesses.

   This has caused a small amount of fallout - two or three issues were
   reported. In all cases the calling code was found to be incorrect.

 - The series "Some cleanup for memcg" from Chen Ridong implements some
   relatively monir cleanups for the memcontrol code.

 - The series "mm: fixes for device-exclusive entries (hmm)" from David
   Hildenbrand fixes a boatload of issues which David found then using
   device-exclusive PTE entries when THP is enabled. More work is
   needed, but this makes thins better - our own HMM selftests now
   succeed.

 - The series "mm: zswap: remove z3fold and zbud" from Yosry Ahmed
   remove the z3fold and zbud implementations. They have been deprecated
   for half a year and nobody has complained.

 - The series "mm: further simplify VMA merge operation" from Lorenzo
   Stoakes implements numerous simplifications in this area. No runtime
   effects are anticipated.

 - The series "mm/madvise: remove redundant mmap_lock operations from
   process_madvise()" from SeongJae Park rationalizes the locking in the
   madvise() implementation. Performance gains of 20-25% were observed
   in one MADV_DONTNEED microbenchmark.

 - The series "Tiny cleanup and improvements about SWAP code" from
   Baoquan He contains a number of touchups to issues which Baoquan
   noticed when working on the swap code.

 - The series "mm: kmemleak: Usability improvements" from Catalin
   Marinas implements a couple of improvements to the kmemleak
   user-visible output.

 - The series "mm/damon/paddr: fix large folios access and schemes
   handling" from Usama Arif provides a couple of fixes for DAMON's
   handling of large folios.

 - The series "mm/damon/core: fix wrong and/or useless damos_walk()
   behaviors" from SeongJae Park fixes a few issues with the accuracy of
   kdamond's walking of DAMON regions.

 - The series "expose mapping wrprotect, fix fb_defio use" from Lorenzo
   Stoakes changes the interaction between framebuffer deferred-io and
   core MM. No functional changes are anticipated - this is preparatory
   work for the future removal of page structure fields.

 - The series "mm/damon: add support for hugepage_size DAMOS filter"
   from Usama Arif adds a DAMOS filter which permits the filtering by
   huge page sizes.

 - The series "mm: permit guard regions for file-backed/shmem mappings"
   from Lorenzo Stoakes extends the guard region feature from its
   present "anon mappings only" state. The feature now covers shmem and
   file-backed mappings.

 - The series "mm: batched unmap lazyfree large folios during
   reclamation" from Barry Song cleans up and speeds up the unmapping
   for pte-mapped large folios.

 - The series "reimplement per-vma lock as a refcount" from Suren
   Baghdasaryan puts the vm_lock back into the vma. Our reasons for
   pulling it out were largely bogus and that change made the code more
   messy. This patchset provides small (0-10%) improvements on one
   microbenchmark.

 - The series "Docs/mm/damon: misc DAMOS filters documentation fixes and
   improves" from SeongJae Park does some maintenance work on the DAMON
   docs.

 - The series "hugetlb/CMA improvements for large systems" from Frank
   van der Linden addresses a pile of issues which have been observed
   when using CMA on large machines.

 - The series "mm/damon: introduce DAMOS filter type for unmapped pages"
   from SeongJae Park enables users of DMAON/DAMOS to filter my the
   page's mapped/unmapped status.

 - The series "zsmalloc/zram: there be preemption" from Sergey
   Senozhatsky teaches zram to run its compression and decompression
   operations preemptibly.

 - The series "selftests/mm: Some cleanups from trying to run them" from
   Brendan Jackman fixes a pile of unrelated issues which Brendan
   encountered while runnimg our selftests.

 - The series "fs/proc/task_mmu: add guard region bit to pagemap" from
   Lorenzo Stoakes permits userspace to use /proc/pid/pagemap to
   determine whether a particular page is a guard page.

 - The series "mm, swap: remove swap slot cache" from Kairui Song
   removes the swap slot cache from the allocation path - it simply
   wasn't being effective.

 - The series "mm: cleanups for device-exclusive entries (hmm)" from
   David Hildenbrand implements a number of unrelated cleanups in this
   code.

 - The series "mm: Rework generic PTDUMP configs" from Anshuman Khandual
   implements a number of preparatoty cleanups to the GENERIC_PTDUMP
   Kconfig logic.

 - The series "mm/damon: auto-tune aggregation interval" from SeongJae
   Park implements a feedback-driven automatic tuning feature for
   DAMON's aggregation interval tuning.

 - The series "Fix lazy mmu mode" from Ryan Roberts fixes some issues in
   powerpc, sparc and x86 lazy MMU implementations. Ryan did this in
   preparation for implementing lazy mmu mode for arm64 to optimize
   vmalloc.

 - The series "mm/page_alloc: Some clarifications for migratetype
   fallback" from Brendan Jackman reworks some commentary to make the
   code easier to follow.

 - The series "page_counter cleanup and size reduction" from Shakeel
   Butt cleans up the page_counter code and fixes a size increase which
   we accidentally added late last year.

 - The series "Add a command line option that enables control of how
   many threads should be used to allocate huge pages" from Thomas
   Prescher does that. It allows the careful operator to significantly
   reduce boot time by tuning the parallalization of huge page
   initialization.

 - The series "Fix calculations in trace_balance_dirty_pages() for cgwb"
   from Tang Yizhou fixes the tracing output from the dirty page
   balancing code.

 - The series "mm/damon: make allow filters after reject filters useful
   and intuitive" from SeongJae Park improves the handling of allow and
   reject filters. Behaviour is made more consistent and the documention
   is updated accordingly.

 - The series "Switch zswap to object read/write APIs" from Yosry Ahmed
   updates zswap to the new object read/write APIs and thus permits the
   removal of some legacy code from zpool and zsmalloc.

 - The series "Some trivial cleanups for shmem" from Baolin Wang does as
   it claims.

 - The series "fs/dax: Fix ZONE_DEVICE page reference counts" from
   Alistair Popple regularizes the weird ZONE_DEVICE page refcount
   handling in DAX, permittig the removal of a number of special-case
   checks.

 - The series "refactor mremap and fix bug" from Lorenzo Stoakes is a
   preparatoty refactoring and cleanup of the mremap() code.

 - The series "mm: MM owner tracking for large folios (!hugetlb) +
   CONFIG_NO_PAGE_MAPCOUNT" from David Hildenbrand reworks the manner in
   which we determine whether a large folio is known to be mapped
   exclusively into a single MM.

 - The series "mm/damon: add sysfs dirs for managing DAMOS filters based
   on handling layers" from SeongJae Park adds a couple of new sysfs
   directories to ease the management of DAMON/DAMOS filters.

 - The series "arch, mm: reduce code duplication in mem_init()" from
   Mike Rapoport consolidates many per-arch implementations of
   mem_init() into code generic code, where that is practical.

 - The series "mm/damon/sysfs: commit parameters online via
   damon_call()" from SeongJae Park continues the cleaning up of sysfs
   access to DAMON internal data.

 - The series "mm: page_ext: Introduce new iteration API" from Luiz
   Capitulino reworks the page_ext initialization to fix a boot-time
   crash which was observed with an unusual combination of compile and
   cmdline options.

 - The series "Buddy allocator like (or non-uniform) folio split" from
   Zi Yan reworks the code to split a folio into smaller folios. The
   main benefit is lessened memory consumption: fewer post-split folios
   are generated.

 - The series "Minimize xa_node allocation during xarry split" from Zi
   Yan reduces the number of xarray xa_nodes which are generated during
   an xarray split.

 - The series "drivers/base/memory: Two cleanups" from Gavin Shan
   performs some maintenance work on the drivers/base/memory code.

 - The series "Add tracepoints for lowmem reserves, watermarks and
   totalreserve_pages" from Martin Liu adds some more tracepoints to the
   page allocator code.

 - The series "mm/madvise: cleanup requests validations and
   classifications" from SeongJae Park cleans up some warts which
   SeongJae observed during his earlier madvise work.

 - The series "mm/hwpoison: Fix regressions in memory failure handling"
   from Shuai Xue addresses two quite serious regressions which Shuai
   has observed in the memory-failure implementation.

 - The series "mm: reliable huge page allocator" from Johannes Weiner
   makes huge page allocations cheaper and more reliable by reducing
   fragmentation.

 - The series "Minor memcg cleanups & prep for memdescs" from Matthew
   Wilcox is preparatory work for the future implementation of memdescs.

 - The series "track memory used by balloon drivers" from Nico Pache
   introduces a way to track memory used by our various balloon drivers.

 - The series "mm/damon: introduce DAMOS filter type for active pages"
   from Nhat Pham permits users to filter for active/inactive pages,
   separately for file and anon pages.

 - The series "Adding Proactive Memory Reclaim Statistics" from Hao Jia
   separates the proactive reclaim statistics from the direct reclaim
   statistics.

 - The series "mm/vmscan: don't try to reclaim hwpoison folio" from
   Jinjiang Tu fixes our handling of hwpoisoned pages within the reclaim
   code.

* tag 'mm-stable-2025-03-30-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (431 commits)
  mm/page_alloc: remove unnecessary __maybe_unused in order_to_pindex()
  x86/mm: restore early initialization of high_memory for 32-bits
  mm/vmscan: don't try to reclaim hwpoison folio
  mm/hwpoison: introduce folio_contain_hwpoisoned_page() helper
  cgroup: docs: add pswpin and pswpout items in cgroup v2 doc
  mm: vmscan: split proactive reclaim statistics from direct reclaim statistics
  selftests/mm: speed up split_huge_page_test
  selftests/mm: uffd-unit-tests support for hugepages > 2M
  docs/mm/damon/design: document active DAMOS filter type
  mm/damon: implement a new DAMOS filter type for active pages
  fs/dax: don't disassociate zero page entries
  MM documentation: add "Unaccepted" meminfo entry
  selftests/mm: add commentary about 9pfs bugs
  fork: use __vmalloc_node() for stack allocation
  docs/mm: Physical Memory: Populate the "Zones" section
  xen: balloon: update the NR_BALLOON_PAGES state
  hv_balloon: update the NR_BALLOON_PAGES state
  balloon_compaction: update the NR_BALLOON_PAGES state
  meminfo: add a per node counter for balloon drivers
  mm: remove references to folio in __memcg_kmem_uncharge_page()
  ...
2025-04-01 09:29:18 -07:00
Kent Overstreet
f1350c2c74 bcachefs: fix ref leak in btree_node_read_all_replicas
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-01 02:47:41 -04:00
Linus Torvalds
98fb679d19 bcachefs updates for 6.15, part 2
All bugfixes and logging improvements.
 
 Minor merge conflict, see:
 https://lore.kernel.org/linux-next/20250331092816.778a7c83@canb.auug.org.au/T/#u
 
 CI says the fs-next tree is good:
 https://evilpiepirate.org/~testdashboard/ci?user=fs-next&branch=master
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmfqyxsACgkQE6szbY3K
 bnY4aQ/7BylMgHZsAG2OLRRtegCsuFZ5fZt148TObofSGTTDPcVKYQWcz249Hlao
 RZzv9nbqq2M7fJrUK5Xloc4DA0ICuWIh9n+uRf+5od7JgtygjJpXqMRz9HrGBtTo
 QZJE/wAzsa1A8xBORjWVki4koHT3YivaMW2zdgbHIHWTjDJso5Es7RW0/WZYv3lW
 cFLFEfOnBCEXckhEtK7TAPnJpHEPw+/d0bMFU/PIHbokUwTjxCR0bmRL/RecKUXa
 5U1o1x7gAo1iPi3XPGLJVVxXWgjmxzQlF/3aXva+DYaeLgxPMqKxUlC6hkV4f6Oc
 9lH/w1pEiCMcANbbp2E3Q91sDFRlafFCgvsKhEz79W5WoNq+vSrxLhLaynyuBT/K
 lfoiig6IFRTWJDYHu2L6YHFMmp8JOxgJSJ0+dcgyVRnaDJQeGgbuv1tEldonQLsg
 9DT8iRJpVDomffwPUoVhujlvJOqUi8zFkxyMCgVWExFzC3ief2B5s3D4uLXcpApO
 nZfb01W0ElW7qBMQxjyD0Vy+wY8EryzTht9ZKJq5Id1T/LWc9Qi+jPaY86OBC9/w
 GJgW9OcYLFjYdsDokk5XkwOd/IAXz6fU+vHGtahFJPVfH4T8zzdBnxfPbiR2mXo8
 4EfeNmRevZP/oK7/2l2cqIzY7tYBJBUK1gFyvz1+7bcuFwVI8rc=
 =Udka
 -----END PGP SIGNATURE-----

Merge tag 'bcachefs-2025-03-31' of git://evilpiepirate.org/bcachefs

Pull more bcachefs updates from Kent Overstreet:
 "All bugfixes and logging improvements"

* tag 'bcachefs-2025-03-31' of git://evilpiepirate.org/bcachefs: (35 commits)
  bcachefs: fix bch2_write_point_to_text() units
  bcachefs: Log original key being moved in data updates
  bcachefs: BCH_JSET_ENTRY_log_bkey
  bcachefs: Reorder error messages that include journal debug
  bcachefs: Don't use designated initializers for disk_accounting_pos
  bcachefs: Silence errors after emergency shutdown
  bcachefs: fix units in rebalance_status
  bcachefs: bch2_ioctl_subvolume_destroy() fixes
  bcachefs: Clear fs_path_parent on subvolume unlink
  bcachefs: Change btree_insert_node() assertion to error
  bcachefs: Better printing of inconsistency errors
  bcachefs: bch2_count_fsck_err()
  bcachefs: Better helpers for inconsistency errors
  bcachefs: Consistent indentation of multiline fsck errors
  bcachefs: Add an "ignore unknown" option to bch2_parse_mount_opts()
  bcachefs: bch2_time_stats_init_no_pcpu()
  bcachefs: Fix bch2_fs_get_tree() error path
  bcachefs: fix logging in journal_entry_err_msg()
  bcachefs: add missing newline in bch2_trans_updates_to_text()
  bcachefs: print_string_as_lines: fix extra newline
  ...
2025-03-31 18:33:51 -07:00
Kent Overstreet
de39965858 bcachefs: Fix null ptr deref in bch2_write_endio()
This was previously hard to hit since it requires racing with device
removal, but splitting up io_ref uncovered it.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-31 17:39:27 -04:00
Kent Overstreet
7f10fde38f bcachefs: Fix field spanning write warning
Struct with embedded VLA...

memcpy: detected field-spanning write (size 8) of single field "&gc->r.e" at fs/bcachefs/ec.c:465 (size 3)
WARNING: CPU: 1 PID: 936 at fs/bcachefs/ec.c:465 bch2_trigger_stripe+0x706/0x730
Modules linked in:
CPU: 1 UID: 0 PID: 936 Comm: mount.bcachefs Not tainted 6.14.0-rc6-ktest-00236-gefb0b5c62dbc #55
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
RIP: 0010:bch2_trigger_stripe+0x706/0x730
Code: b4 00 01 b9 03 00 00 00 48 89 fb 48 c7 c7 33 54 da 81 48 89 d6 49 89 d6 48 c7 c2 c3 36 db 81 e8 60 54 c5 ff 48 89 df 4c 89 f2 <0f> 0b e9 5c fd ff ff e8 fe 5e 4e 00 bf 10 00 00 00 48 c7 c6 ff ff
RSP: 0018:ffff88817081f680 EFLAGS: 00010246
RAX: f8fe7dd1c56b5600 RBX: ffff888101265368 RCX: 0000000000000027
RDX: 0000000000000008 RSI: 00000000fffbffff RDI: ffff888101265368
RBP: 0000000000000000 R08: 000000000003ffff R09: ffff88817f1fe000
R10: 00000000000bfffd R11: 0000000000000004 R12: ffff8881012652c0
R13: 0000000000000000 R14: 0000000000000008 R15: ffff88817081f6c9
FS:  00007fc428bc7c80(0000) GS:ffff888179280000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffd3ee4a038 CR3: 000000010a9bc000 CR4: 0000000000750eb0
PKRU: 55555554
Call Trace:
 <TASK>
 ? __warn+0xce/0x1b0
 ? bch2_trigger_stripe+0x706/0x730
 ? report_bug+0x11b/0x1a0
 ? bch2_trigger_stripe+0x706/0x730
 ? handle_bug+0x5e/0x90
 ? exc_invalid_op+0x1a/0x50
 ? asm_exc_invalid_op+0x1a/0x20
 ? bch2_trigger_stripe+0x706/0x730
 bch2_gc_mark_key+0x2cf/0x430
 bch2_check_allocations+0x1a64/0x1ed0
 ? vsnprintf+0x1ad/0x420
 ? bch2_check_allocations+0x191f/0x1ed0
 bch2_run_recovery_passes+0x13b/0x2b0
 bch2_fs_recovery+0x9b7/0x1290
 ? __bch2_print+0xb2/0xf0
 ? bch2_printbuf_exit+0x1e/0x30
 ? print_mount_opts+0x153/0x180
 bch2_fs_start+0x274/0x3b0
 bch2_fs_get_tree+0x516/0x6e0
 vfs_get_tree+0x21/0xa0
 do_new_mount+0x153/0x350
 __x64_sys_mount+0x16c/0x1f0
 do_syscall_64+0x6c/0x140
 ? arch_exit_to_user_mode_prepare+0x9/0x40
 entry_SYSCALL_64_after_hwframe+0x4b/0x53

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-31 17:39:10 -04:00
Kent Overstreet
f540876f4e bcachefs: Fix striping behaviour
For striping across devices, we maintain "clocks", and we advance them
by the inverse of "how much free space this device has left", so that we
round robin biased in favor of devices with more free space.

This code was originally trying to do EWMA-ish stuff when originally
written, ~10 years ago, and was never properly cleaned up when it was
realized that an EWMA is not the right approach here.

That left a bug, when we rescale to keep all the clocks in the correct
range and prevent overflow.

It was assumed that we'd always be allocated from the device with the
smallest clock hand, but that's actually not correct: with the target
options, allocations will be first tried from a subset of devices, and
then the entire filesystem if that fails.

Thus, the rescale from the first allocation - allocating from a subset
of devices - can pick the wrong rescale value and cause the rest of the
clocks to go to 0, losing information.

This resuls in incorrect striping behaviour when the desired number of
replicas doesn't fit on the foreground target.

Link: https://www.reddit.com/r/bcachefs/comments/1jn3t26/replica_allocation_not_evenly_distributed_among/
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-31 17:39:10 -04:00
Kent Overstreet
650f5353dc bcachefs: fix bch2_write_point_to_text() units
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-30 20:04:16 -04:00
Kent Overstreet
7fdc3fa3cb bcachefs: Log original key being moved in data updates
There's something going on with the data move path; log the original key
being moved for debugging.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-30 18:25:12 -04:00
Kent Overstreet
edaed8ee8c bcachefs: BCH_JSET_ENTRY_log_bkey
Add a journal entry type for logging - but logging a bkey, not a string;
to be used for data move path debugging.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-30 18:25:12 -04:00
Kent Overstreet
2b47102b93 bcachefs: Reorder error messages that include journal debug
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-30 16:36:27 -04:00
Kent Overstreet
393a05a741 bcachefs: Don't use designated initializers for disk_accounting_pos
Not all compilers fully initialize these - they're not guaranteed to
because of the union shenanigans.

Fixes: https://github.com/koverstreet/bcachefs/issues/844
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-30 16:35:13 -04:00
Kent Overstreet
f548db4d31 bcachefs: Silence errors after emergency shutdown
We don't care about errors from asynchronous ops that were because we
did an emergency shutdown; silence them.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-30 16:35:13 -04:00
Kent Overstreet
458e2ef882 bcachefs: fix units in rebalance_status
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-30 16:35:13 -04:00
Kent Overstreet
707549600c bcachefs: bch2_ioctl_subvolume_destroy() fixes
bch2_evict_subvolume_inodes() was getting stuck - due to incorrectly
pruning the dcache.

Also, fix missing permissions checks.

Reported-by: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-30 16:33:22 -04:00
Kent Overstreet
b3981564ca bcachefs: Clear fs_path_parent on subvolume unlink
This fixes recursive subvolume removal.

Subvolume deletion is asynchronous; fs_path_parent, and thus the entry
in the subvolume_children btree, need to be cleared when the subvolume
is unlinked from the fs heirarchy - else we'll spuriously think a
subvolume has children and deletion will fail.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-29 20:16:49 -04:00
Kent Overstreet
63c3b8f616 bcachefs: Change btree_insert_node() assertion to error
Debug for https://github.com/koverstreet/bcachefs/issues/843

Print useful debug info and go emergency read-only.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-29 14:24:48 -04:00
Kent Overstreet
6d77ce4a27 bcachefs: Better printing of inconsistency errors
Build up and emit the error message for an inconsistency error all at
once, instead of spread over multiple printk calls, so they're not
jumbled in the dmesg log.

Also, add better indenting.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-29 13:26:13 -04:00
Kent Overstreet
7337f9f14e bcachefs: bch2_count_fsck_err()
Factor out a helper from __bch2_fsck_err(), for counting the error in
the superblock and deciding whether to print or ratelimit - will be used
to replace some log_fsck_err() calls, where we want to lift out printing
the error message.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-29 13:26:13 -04:00
Kent Overstreet
b00750c2e5 bcachefs: Better helpers for inconsistency errors
An inconsistency error often happens as part of an event with multiple
error messages, and we want to build up one single error message with
proper indenting to produce more readable log messages that don't get
garbled.

Add new helpers that emit messages to a printbuf instead of printing
them directly, next patch will convert to use them.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-28 22:31:47 -04:00
Kent Overstreet
1ece53237e bcachefs: Consistent indentation of multiline fsck errors
Add the new helper printbuf_indent_add_nextline(), and use it in
__bch2_fsck_err() to centralize setting the indentation of multiline
fsck errors.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-28 22:31:47 -04:00
Kent Overstreet
a7cdf2276e bcachefs: Add an "ignore unknown" option to bch2_parse_mount_opts()
To be used by the mount helper in userspace, where we still have options
to be parsed by other layers.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-28 22:31:47 -04:00
Kent Overstreet
daa771332e bcachefs: bch2_time_stats_init_no_pcpu()
Add a mode to disable automatic switching to percpu mode, useful when a
time_stats will only be used by one thread and we don't want to have to
flush the percpu buffers.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-28 22:31:47 -04:00
Florian Albrechtskirchinger
7c4cb50e1a bcachefs: Fix bch2_fs_get_tree() error path
When a filesystem is mounted read-only, subsequent attempts to mount it
as read-write fail with EBUSY. Previously, the error path in
bch2_fs_get_tree() would unconditionally call __bch2_fs_stop(),
improperly freeing resources for a filesystem that was still actively
mounted. This change modifies the error path to only call
__bch2_fs_stop() if the superblock has no valid root dentry, ensuring
resources are not cleaned up prematurely when the filesystem is in use.

Signed-off-by: Florian Albrechtskirchinger <falbrechtskirchinger@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-28 13:51:09 -04:00
Kent Overstreet
6b1e0b9e18 bcachefs: fix logging in journal_entry_err_msg()
We want to log errors all at once, not spread across multiple printks.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-28 12:36:32 -04:00
Kent Overstreet
ff4e0f7de6 bcachefs: add missing newline in bch2_trans_updates_to_text()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-28 12:36:32 -04:00
Kent Overstreet
35a11506a3 bcachefs: print_string_as_lines: fix extra newline
Don't print a newline on empty string; this was causing us to also print
an extra newline when we got to the end of th string.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-28 12:36:32 -04:00
Kent Overstreet
3c72d3eea9 bcachefs: Fix WARN() in bch2_bkey_pick_read_device()
syzbot discovered that this one is possible: we have pointers, but none
of them are to valid devices.

Reported-by: syzbot+336a6e6a2dbb7d4dba9a@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-28 12:35:05 -04:00
Kent Overstreet
af3d4c276a bcachefs: Don't return 0 size holes from bch2_seek_hole()
The hole we find in the btree might be fully dirty in the page cache. If
so, keep searching.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-28 11:30:14 -04:00
Kent Overstreet
1f4bb8254c bcachefs: Fix bch2_seek_hole() locking
We can't call bch2_seek_pagecache_hole(), and block on page locks, with
btree locks held.

This is easily fixed because we're at the end of the transaction - we
can just unlock, we don't need a drop_locks_do().

Reported-by: https://github.com/nagalun
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-28 11:30:14 -04:00
Kent Overstreet
2dd202dbaf bcachefs: Recovery no longer holds state_lock
state_lock guards against devices coming or leaving, changing state, or
the filesystem changing between ro <-> rw.

But it's not necessary for running recovery passes, and holding it
blocks asynchronous events that would cause us to go RO or kick out
devices.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-28 11:13:25 -04:00
Kent Overstreet
c6c6a39109 bcachefs: Fix permissions on version modparam
There's no reason for this not to be world readable - it provides the
currently supported on disk format version.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-28 11:13:23 -04:00
Linus Torvalds
4a4b30ea80 bcachefs updates for 6.15
On disk format is now soft frozen: no more required/automatic are
 anticipated before taking off the experimental label.
 
 Major changes/features since 6.14:
 
 - Scrub
 
 - Blocksize greater than page size support
 
 - A number of "rebalance spinning and doing no work" issues have been
   fixed; we now check if the write allocation will succeed in
   bch2_data_update_init(), before kicking off the read.
 
   There's still more work to do in this area. Later we may want to add
   another bitset btree, like rebalance_work, to track "extents that
   rebalance was requested to move but couldn't", e.g. due to destination
   target having insufficient online devices.
 
 - We can now support scaling well into the petabyte range: latest
   bcachefs-tools will pick an appropriate bucket size at format time to
   ensure fsck can run in available memory (e.g. a server with 256GB of
   ram and 100PB of storage would want 16MB buckets).
 
 On disk format changes:
 
 - 1.21: cached backpointers (scalability improvement)
 
   Cached replicas now get backpointers, which means we no longer rely on
   incrementing bucket generation numbers to invalidate cached data: this
   lets us get rid of the bucket generation number garbage collection,
   which had to periodically rescan all extents to recompute bucket
   oldest_gen.
 
   Bucket generation numbers are now only used as a consistency check,
   but they're quite useful for that.
 
 - 1.22: stripe backpointers
 
   Stripes now have backpointers: erasure coded stripes have their own
   checksums, separate from the checksums for the extents they contain
   (and stripe checksums also cover the parity blocks). This is required
   for implementing scrub for stripes.
 
 - 1.23: stripe lru (scalability improvement)
 
   Persistent lru for stripes, ordered by "number of empty blocks". This
   is used by the stripe creation path, which depending on free space
   may create a new stripe out of a partially empty existing stripe
   instead of starting a brand new stripe.
 
   This replaces an in-memory heap, and means we no longer have to read
   in the stripes btree at startup.
 
 - 1.24: casefolding
 
   Case insensitive directory support, courtesy of Valve.
 
   This is an incompatible feature, to enable mount with
     -o version_upgrade=incompatible
 
 - 1.25: extent_flags
 
   Another incompatible feature requiring explicit opt-in to enable.
 
   This adds a flags entry to extents, and a flag bit that marks extents
   as poisoned.
 
   A poisoned extent is an extent that was unreadable due to checksum
   errors. We can't move such extents without giving them a new checksum,
   and we may have to move them (for e.g. copygc or device evacuate).
   We also don't want to delete them: in the future we'll have an API
   that lets userspace ignore checksum errors and attempt to deal with
   simple bitrot itself. Marking them as poisoned lets us continue to
   return the correct error to userspace on normal read calls.
 
 Other changes/features:
 
 - BCH_IOCTL_QUERY_COUNTERS: this is used by the new 'bcachefs fs top'
   command, which shows a live view of all internal filesystem counters.
 
 - Improved journal pipelining: we can now have 16 journal writes in
   flight concurrently, up from 4. We're logging significantly more to
   the journal than we used to with all the recent disk accounting
   changes and additions, so some users should see a performance
   increase on some workloads.
 
 - BCH_MEMBER_STATE_failed: previously, we would do no IO at all to
   devices marked as failed. Now we will attempt to read from them, but
   only if we have no better options.
 
 - New option, write_error_timeout: devices will be kicked out of the
   filesystem if all writes have been failing for x number of seconds.
 
   We now also kick devices out when notified by blk_holder_ops that
   they've gone offline.
 
 - Device option handling improvements: the discard option should now be
   working as expected (additionally, in -tools, all device options that
   can be set at format time can now be set at device add time, i.e.
   data_allowed, state).
 
 - We now try harder to read data after a checksum error: we'll do
   additional retries if necessary to a device after after it gave us
   data with a checksum error.
 
 - More self healing work: the full inode <-> dirent consistency checks
   that are currently run by fsck are now also run every time we do a
   lookup, meaning we'll be able to correct errors at runtime. Runtime
   self healing will be flipped on after the new changes have seen more
   testing, currently they're just checking for consistency.
 
 - KMSAN fixes: our KMSAN builds should be nearly clean now, which will
   put a massive dent in the syzbot dashboard.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmfhbnsACgkQE6szbY3K
 bnY6ew/9FXh3m71BvVpuqTYcUGzIC7gVrnkFy6n4W96v07OjSOoTNHOVVovajxc3
 P9LvA77BHC4Xro3H7ORpsIurOZUc6yx18ZizzulVbQFuYa7LY/kNri4ZBtGHcRiV
 pIdQDLSNmwFjPA4x2S1qTFSF1c586lad+UNQiLam5ophBwQPEO6vG51ZEHa4wld9
 +OhWTDYfrvij4D3Lt1ppvhuDP+PQBjhu/QFc0bGjHvKOjfV6sw9XU91sCYKOJIzd
 qzpsiQd5sepnX717Br3f5SLdxMq2lJYvRp9756vltOCaMBvJYJtHqtXCglHQEkFw
 yjhmPjk4r3VlKTF8K+wEJfAHwbC2kEn7csJNbt0+Nko5PPtFyrb8ok6QUbHCKscL
 L0VMnzaXHVqvG2VgYa31temfdz7HM/zHjQ8Al3eQPaqTHIoTXIBQxOQSea/apVMt
 TIlastvLoHfR8W7+LrwOmTjnBJGCJ+MrdcJzJDVk2tQmmcMA0boeZvl4aSklFuyB
 zNN5fxp0VMsxNyIHLJjQ3UcwVqHXC5w+f5H1ByQLUyQh+m/xaAaz7S+BTVdVbFPa
 1Z1xDuvuHOTnjIOamnOD1l36afJnhq5RciPCXCNtQSB819mc+AfNGQNQTVNOTReC
 iTiUCcNxu0/DIPlPmeJzAlukVJUgz+/knOI/6zPs3eI7/o88ZGg=
 =k3cV
 -----END PGP SIGNATURE-----

Merge tag 'bcachefs-2025-03-24' of git://evilpiepirate.org/bcachefs

Pull bcachefs updates from Kent Overstreet:
 "On disk format is now soft frozen: no more required/automatic are
  anticipated before taking off the experimental label.

  Major changes/features since 6.14:

   - Scrub

   - Blocksize greater than page size support

   - A number of "rebalance spinning and doing no work" issues have been
     fixed; we now check if the write allocation will succeed in
     bch2_data_update_init(), before kicking off the read.

     There's still more work to do in this area. Later we may want to
     add another bitset btree, like rebalance_work, to track "extents
     that rebalance was requested to move but couldn't", e.g. due to
     destination target having insufficient online devices.

   - We can now support scaling well into the petabyte range: latest
     bcachefs-tools will pick an appropriate bucket size at format time
     to ensure fsck can run in available memory (e.g. a server with
     256GB of ram and 100PB of storage would want 16MB buckets).

  On disk format changes:

   - 1.21: cached backpointers (scalability improvement)

     Cached replicas now get backpointers, which means we no longer rely
     on incrementing bucket generation numbers to invalidate cached
     data: this lets us get rid of the bucket generation number garbage
     collection, which had to periodically rescan all extents to
     recompute bucket oldest_gen.

     Bucket generation numbers are now only used as a consistency check,
     but they're quite useful for that.

   - 1.22: stripe backpointers

     Stripes now have backpointers: erasure coded stripes have their own
     checksums, separate from the checksums for the extents they contain
     (and stripe checksums also cover the parity blocks). This is
     required for implementing scrub for stripes.

   - 1.23: stripe lru (scalability improvement)

     Persistent lru for stripes, ordered by "number of empty blocks".
     This is used by the stripe creation path, which depending on free
     space may create a new stripe out of a partially empty existing
     stripe instead of starting a brand new stripe.

     This replaces an in-memory heap, and means we no longer have to
     read in the stripes btree at startup.

   - 1.24: casefolding

     Case insensitive directory support, courtesy of Valve.

     This is an incompatible feature, to enable mount with
       -o version_upgrade=incompatible

   - 1.25: extent_flags

     Another incompatible feature requiring explicit opt-in to enable.

     This adds a flags entry to extents, and a flag bit that marks
     extents as poisoned.

     A poisoned extent is an extent that was unreadable due to checksum
     errors. We can't move such extents without giving them a new
     checksum, and we may have to move them (for e.g. copygc or device
     evacuate). We also don't want to delete them: in the future we'll
     have an API that lets userspace ignore checksum errors and attempt
     to deal with simple bitrot itself. Marking them as poisoned lets us
     continue to return the correct error to userspace on normal read
     calls.

  Other changes/features:

   - BCH_IOCTL_QUERY_COUNTERS: this is used by the new 'bcachefs fs top'
     command, which shows a live view of all internal filesystem
     counters.

   - Improved journal pipelining: we can now have 16 journal writes in
     flight concurrently, up from 4. We're logging significantly more to
     the journal than we used to with all the recent disk accounting
     changes and additions, so some users should see a performance
     increase on some workloads.

   - BCH_MEMBER_STATE_failed: previously, we would do no IO at all to
     devices marked as failed. Now we will attempt to read from them,
     but only if we have no better options.

   - New option, write_error_timeout: devices will be kicked out of the
     filesystem if all writes have been failing for x number of seconds.

     We now also kick devices out when notified by blk_holder_ops that
     they've gone offline.

   - Device option handling improvements: the discard option should now
     be working as expected (additionally, in -tools, all device options
     that can be set at format time can now be set at device add time,
     i.e. data_allowed, state).

   - We now try harder to read data after a checksum error: we'll do
     additional retries if necessary to a device after after it gave us
     data with a checksum error.

   - More self healing work: the full inode <-> dirent consistency
     checks that are currently run by fsck are now also run every time
     we do a lookup, meaning we'll be able to correct errors at runtime.
     Runtime self healing will be flipped on after the new changes have
     seen more testing, currently they're just checking for consistency.

   - KMSAN fixes: our KMSAN builds should be nearly clean now, which
     will put a massive dent in the syzbot dashboard"

* tag 'bcachefs-2025-03-24' of git://evilpiepirate.org/bcachefs: (180 commits)
  bcachefs: Kill unnecessary bch2_dev_usage_read()
  bcachefs: btree node write errors now print btree node
  bcachefs: Fix race in print_chain()
  bcachefs: btree_trans_restart_foreign_task()
  bcachefs: bch2_disk_accounting_mod2()
  bcachefs: zero init journal bios
  bcachefs: Eliminate padding in move_bucket_key
  bcachefs: Fix a KMSAN splat in btree_update_nodes_written()
  bcachefs: kmsan asserts
  bcachefs: Fix kmsan warnings in bch2_extent_crc_pack()
  bcachefs: Disable asm memcpys when kmsan enabled
  bcachefs: Handle backpointers with unknown data types
  bcachefs: Count BCH_DATA_parity backpointers correctly
  bcachefs: Run bch2_check_dirent_target() at lookup time
  bcachefs: Refactor bch2_check_dirent_target()
  bcachefs: Move bch2_check_dirent_target() to namei.c
  bcachefs: fs-common.c -> namei.c
  bcachefs: EIO cleanup
  bcachefs: bch2_write_prep_encoded_data() now returns errcode
  bcachefs: Simplify bch2_write_op_error()
  ...
2025-03-27 13:20:07 -07:00
Kent Overstreet
d0fb2a266a bcachefs: cond_resched() in journal_key_sort_cmp()
Fixes "task out to lunch" warnings during recovery on large machines
with lots of dirty data in the journal.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-26 16:26:35 -04:00
Kent Overstreet
ef488bb5d0 bcachefs: Fix 'hung task' messages in btree node scan
btree node scan has to wait on kthread workers that scan each device,
potentially for awhile.

We would like this to be interruptible, but we may need a different
mechanism than signals for that - we've had bugs in the past where
mounts were failing due to checking for signals, and no explanation on
where they came from.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-26 16:26:35 -04:00
Kent Overstreet
9314e2fb26 bcachefs: Fix btree iter flags in data move (2)
Data move -> move_get_io_opts -> bch2_get_update_rebalance_opts

requires a not_extents iterator; this fixes the path where we're walking
the extents btree and chase a reflink pointer into the reflink btree.

bch2_lookup_indirect_extent() requires working with an extents iterator
(due to peek_slot() semantics), so we implement
bch2_lookup_indirect_extent_for_move().

This is simplified because there's no need to report
indirect_extent_missing_errors here, that can be deferred until fsck or
when a user reads that data.

Reported-by: Maël Kerbiriou <mael.kerbiriou@free.fr>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-26 16:26:35 -04:00
Kent Overstreet
19ff84b20d bcachefs: Don't unnecessarily decrypt data when moving
There's various checks for "are we going to compress this" - but we're
not going to compress if we know it's incompressible.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-26 16:26:35 -04:00
Kent Overstreet
a44e4f8f00 bcachefs: Document disk accounting keys and conuters
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-26 16:26:35 -04:00
Kent Overstreet
9c893face2 bcachefs: Validate number of counters for accounting keys
We weren't checking that accounting keys have the expected number of
accounters. Originally we probably wanted to be flexible on this, but it
doesn't look like that will be required - accounting is extended by
adding new counter types, not more counters to an existing type.

This means we can drop a BUG_ON() that popped once in automated testing,
and the new validation will make that bug easier to track down.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-26 16:26:35 -04:00
Kent Overstreet
e1e50a6330 bcachefs: Use print_string_as_lines() for journal stuck messages
They were being truncated, printk has a 1k limit per call

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-25 11:49:46 -04:00
Kent Overstreet
a76db26a96 bcachefs: Fix duplicate checksum error messages in write path
Also, improve the message in prep_encoded_data() - it now prints
good/bad checksums, and checksum type.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-25 11:49:43 -04:00
Kent Overstreet
3ba0240a87 bcachefs: Fix silent short reads in data read retry path
__bch2_read, before calling __bch2_read_extent(), sets bvec_iter.bi_size
to "the size we can read from the current extent" with a swap, and
restores it to "the size for the total read" after the read_extent call
with another swap.

But we neglected to do the restore before the "if (ret) goto err;" -
which is a problem if we're retrying those errors.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-25 11:49:16 -04:00
Kent Overstreet
5af61dbd96 bcachefs: Fix nonce inconsistency in bch2_write_prep_encoded_data()
If we're moving an extent that was partially overwritten,
bch2_write_rechecksum() will trim it to the currenty live range.

If we then also want to compress it, it'll be decrypted - but the nonce
has been advanced for the overwritten start of the extent that we
dropped, and we were using the nonce we calculated before rechecksum().

Reported-by: Gabriel de Perthuis <g2p.code@gmail.com>
Fixes: 127d90d282 ("bcachefs: bch2_write_prep_encoded_data() now returns errcode")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-25 11:40:35 -04:00
Linus Torvalds
e41170cc5e vfs-6.15-rc1.pagesize
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZ90rxAAKCRCRxhvAZXjc
 ooIPAQCwMjDjtWegvBy8kefiRw+fa4z3ZWHrwRT9DJrD/K9WyAD+JVd0ou27SVpQ
 jKpRSRct2eTbyxdYiGydHQGm5F5sLg4=
 =0FyQ
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.15-rc1.pagesize' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs pagesize updates from Christian Brauner:
 "This enables block sizes greater than the page size for block devices.

  With this we can start supporting block devices with logical block
  sizes larger than 4k.

  It also allows to lift the device cache sector size support to 64k.
  This allows filesystems which can use larger sector sizes up to 64k to
  ensure that the filesystem will not generate writes that are smaller
  than the specified sector size"

* tag 'vfs-6.15-rc1.pagesize' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  bdev: add back PAGE_SIZE block size validation for sb_set_blocksize()
  bdev: use bdev_io_min() for statx block size
  block/bdev: lift block size restrictions to 64k
  block/bdev: enable large folio support for large logical block sizes
  fs/buffer fs/mpage: remove large folio restriction
  fs/mpage: use blocks_per_folio instead of blocks_per_page
  fs/mpage: avoid negative shift for large blocksize
  fs/buffer: remove batching from async read
  fs/buffer: simplify block_read_full_folio() with bh_offset()
2025-03-24 12:01:29 -07:00
Linus Torvalds
26d8e43079 vfs-6.15-rc1.async.dir
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZ90rNwAKCRCRxhvAZXjc
 onBJAP9Z8Ywmlb5KQ1E3HvDmkwyY6yOSyZ9/CmbzrkCJ8ywYkQD/d9/xt0EP/O/q
 N8YtzXArHWt7u0YbcVpy9WK3F72BdwU=
 =VJgY
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.15-rc1.async.dir' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs async dir updates from Christian Brauner:
 "This contains cleanups that fell out of the work from async directory
  handling:

   - Change kern_path_locked() and user_path_locked_at() to never return
     a negative dentry. This simplifies the usability of these helpers
     in various places

   - Drop d_exact_alias() from the remaining place in NFS where it is
     still used. This also allows us to drop the d_exact_alias() helper
     completely

   - Drop an unnecessary call to fh_update() from nfsd_create_locked()

   - Change i_op->mkdir() to return a struct dentry

     Change vfs_mkdir() to return a dentry provided by the filesystems
     which is hashed and positive. This allows us to reduce the number
     of cases where the resulting dentry is not positive to very few
     cases. The code in these places becomes simpler and easier to
     understand.

   - Repack DENTRY_* and LOOKUP_* flags"

* tag 'vfs-6.15-rc1.async.dir' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  doc: fix inline emphasis warning
  VFS: Change vfs_mkdir() to return the dentry.
  nfs: change mkdir inode_operation to return alternate dentry if needed.
  fuse: return correct dentry for ->mkdir
  ceph: return the correct dentry on mkdir
  hostfs: store inode in dentry after mkdir if possible.
  Change inode_operations.mkdir to return struct dentry *
  nfsd: drop fh_update() from S_IFDIR branch of nfsd_create_locked()
  nfs/vfs: discard d_exact_alias()
  VFS: add common error checks to lookup_one_qstr_excl()
  VFS: change kern_path_locked() and user_path_locked_at() to never return negative dentry
  VFS: repack LOOKUP_ bit flags.
  VFS: repack DENTRY_ flags.
2025-03-24 10:47:14 -07:00
Kent Overstreet
d8bdc8daac bcachefs: Kill unnecessary bch2_dev_usage_read()
bch2_dev_usage_read() is fairly expensive, we should optimize this more.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:37 -04:00
Kent Overstreet
2adfa46734 bcachefs: btree node write errors now print btree node
It turned out a user was wondering why we were going read-only after a
write error, and he didn't realize he didn't have replication enabled -
this will make that more obvious, and we should be printing it anyways.

Link: https://www.reddit.com/r/bcachefs/comments/1jf9akl/large_data_transfers_switched_bcachefs_to_readonly/
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:37 -04:00
Kent Overstreet
739200c573 bcachefs: Fix race in print_chain()
00636 Unable to handle kernel NULL pointer dereference at virtual address 00000000000000b0
00636 Mem abort info:
00636   ESR = 0x0000000096000005
00636   EC = 0x25: DABT (current EL), IL = 32 bits
00636   SET = 0, FnV = 0
00636   EA = 0, S1PTW = 0
00636   FSC = 0x05: level 1 translation fault
00636 Data abort info:
00636   ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000
00636   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
00636   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
00636 user pgtable: 4k pages, 39-bit VAs, pgdp=0000000101b10000
00636 [00000000000000b0] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000
00636 Internal error: Oops: 0000000096000005 [#1] SMP
00636 Modules linked in:
00636 CPU: 12 UID: 0 PID: 79369 Comm: cat Not tainted 6.14.0-rc6-ktest-g3783b8973ab7 #17757
00636 Hardware name: linux,dummy-virt (DT)
00636 pstate: 20001005 (nzCv daif -PAN -UAO -TCO -DIT +SSBS BTYPE=--)
00636 pc : print_chain+0xb8/0x170
00636 lr : print_chain+0xa0/0x170
00636 sp : ffffff80d9c1bbb0
00636 x29: ffffff80d9c1bbb0 x28: 0000000000000002 x27: ffffff80c1be8250
00636 x26: ffffff80dd9b0000 x25: 0000000000000020 x24: 000000000000002d
00636 x23: 000000000000003c x22: ffffffc080a54518 x21: ffffff80da6e00d0
00636 x20: ffffff80da6e0170 x19: ffffff80c1a1d240 x18: 00000000ffffffff
00636 x17: 3535303937202d3c x16: 203139202d3c2035 x15: 00000000ffffffff
00636 x14: 0000000000000000 x13: ffffff80d71b63f1 x12: 0000000000000006
00636 x11: ffffffc080beb1c0 x10: 0000000000000020 x9 : 00000000000134cc
00636 x8 : 0000000000000020 x7 : 0000000000000004 x6 : 0000000000000020
00636 x5 : ffffff80d71b63f7 x4 : ffffffc080a5451b x3 : 0000000000000000
00636 x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000000000000
00636 Call trace:
00636  print_chain+0xb8/0x170 (P)
00636  bch2_check_for_deadlock+0x444/0x5a0
00636  bch2_btree_deadlock_read+0xb4/0x1c8
00636  full_proxy_read+0x74/0xd8
00636  vfs_read+0x90/0x300

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:37 -04:00
Kent Overstreet
0b4fd56726 bcachefs: btree_trans_restart_foreign_task()
In debug mode, we save the call stack on transaction restart - but
there's no locking, so we can't touch it if we're issuing the restart
from another thread.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:37 -04:00
Kent Overstreet
f4a584f4bf bcachefs: bch2_disk_accounting_mod2()
We're hitting some issues with uninitialized struct padding, flagged by
kmsan.

They appear to be falso positives, otherwise bch2_accounting_validate()
would have flagged them as "junk at end". But for now, we'll need to
initialize disk_accounting_pos with memset().

This adds a new helper, bch2_disk_accounting_mod2(), that initializes a
disk_accounting_pos and does the accounting mod all at once - so overall
things actually get slightly more ergonomic.

BCH_DISK_ACCOUNTING_replicas keys are left for now; KMSAN isn't warning
about them and they're a bit special.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:37 -04:00
Kent Overstreet
5ae6f33053 bcachefs: zero init journal bios
fix a kmsan splat

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:37 -04:00
Kent Overstreet
9ea24b287b bcachefs: Eliminate padding in move_bucket_key
We appear to be tripping over a compiler/kmsan bug with padding fields -
this is an easy workaround.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:37 -04:00
Kent Overstreet
1f88c35674 bcachefs: Fix a KMSAN splat in btree_update_nodes_written()
We may sometimes read from uninitialized memory; we know, and that's ok.

We check if a btree node has been reused before waiting on any
outstanding IO.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:37 -04:00
Kent Overstreet
28aa859b6b bcachefs: kmsan asserts
Catching these early makes them a lot easier to track down.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:36 -04:00
Kent Overstreet
53cf2a3daa bcachefs: Fix kmsan warnings in bch2_extent_crc_pack()
We store to all fields, so the kmsan warnings were spurious - but
initializing via stores to bitfields appear to have been giving the
compiler/kmsan trouble, and they're not necessary.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:36 -04:00
Kent Overstreet
9c3a2c9b47 bcachefs: Disable asm memcpys when kmsan enabled
kmsan doesn't know about inline assembly, obviously; this will close a
ton of syzbot bugs.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:36 -04:00
Kent Overstreet
962322475b bcachefs: Handle backpointers with unknown data types
New data types might be added later, so we don't want to disallow
unknown data types - that'll be a compatibility hassle later. Instead,
ignore them.

Reported-by: syzbot+3a290f5ff67ca3023834@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:36 -04:00
Kent Overstreet
6a9f681ef6 bcachefs: Count BCH_DATA_parity backpointers correctly
These are counted as stripe data in the corresponding alloc keys.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:36 -04:00
Kent Overstreet
04e90891be bcachefs: Run bch2_check_dirent_target() at lookup time
More on the "full online self healing" project:

We now run most of the dirent <-> inode consistency checks, with repair
code, at runtime - the exact same check and repair code that fsck runs.

This will allow us to repair the "dirent points to inode that does not
point back" inconsistencies that have been popping up at runtime.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:36 -04:00
Kent Overstreet
9b0d00a369 bcachefs: Refactor bch2_check_dirent_target()
Prep work for calling bch2_check_dirent_target() from bch2_lookup().

- Add an inline wrapper, if the target and backpointer match we can skip
  the function call.

- We don't (yet?) want to remove the dirent we did the lookup from (when
  we find a directory or subvol with multiple valid dirents pointing to
  it), we can defer on that until later. For now, add an "are we in
  fsck?" parameter.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:36 -04:00
Kent Overstreet
758ea4ff81 bcachefs: Move bch2_check_dirent_target() to namei.c
We're gradually running more and more fsck.c checks at runtime,
whereever applicable; when we do so they get moved out of fsck.c.

Next patch will call bch2_check_dirent_target() from bch2_lookup().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:36 -04:00
Kent Overstreet
4fcd4de0a6 bcachefs: fs-common.c -> namei.c
name <-> inode, code for managing the relationships between inodes and
dirents.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:36 -04:00
Kent Overstreet
8a9f3d0582 bcachefs: EIO cleanup
Replace these with proper private error codes, so that when we get an
error message we're not sifting through the entire codebase to see where
it came from.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:36 -04:00
Kent Overstreet
127d90d282 bcachefs: bch2_write_prep_encoded_data() now returns errcode
Prep work for killing off EIO and replacing them with proper private
error codes.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:36 -04:00
Kent Overstreet
2fe208303a bcachefs: Simplify bch2_write_op_error()
There's no reason for the caller to do the actual logging, it's all done
the same.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:36 -04:00
Kent Overstreet
af2ff37da7 bcachefs: Fix block/btree node size defaults
We're fixing option parsing in userspace, it now obeys
OPT_SB_FIELD_SECTORS

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:35 -04:00
Alan Huang
5d361ae5af bcachefs: Add missing smp_rmb()
The smp_rmb() guarantees that reads from reservations.counter
occur before accessing cur_entry_u64s. It's paired with the
atomic64_try_cmpxchg in journal_entry_open.

Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:35 -04:00
Kent Overstreet
4a4000b9a6 bcachefs: Kill JOURNAL_ERRORS()
Convert these to standard error codes, which means we can pass them
outside the journal code, they're easier to pass to tracepoints, etc.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:35 -04:00
Kent Overstreet
80be08cdb5 bcachefs: Filesystem discard option now propagates to devices
the discard option is special, because it's both a filesystem and a
device option.

When set at the filesytsem level, it's supposed to propagate to (if set
persistently via sysfs) or override (if non persistently as a mount
option) the devices - that now works correctly.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:35 -04:00
Kent Overstreet
8d7b7ac367 bcachefs: Device state is now a runtime option
Other options can normally be set at runtime via sysfs, no reason for
this one not to be as well - it just doesn't support the degraded flags
argument this way, that requires the ioctl.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:35 -04:00
Kent Overstreet
7b84d934a1 bcachefs: Setting foreground_target at runtime now triggers rebalance
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:35 -04:00
Kent Overstreet
8b294a9b5c bcachefs: Device options now use standard sysfs code
Device options now use the common code for sysfs, and can superblock
fields (in a struct bch_member).

This replaces BCH_DEV_OPT_SETTERS(), which was weird and easy to miss.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:35 -04:00
Kent Overstreet
d2bad59255 bcachefs: Kill BCH_DEV_OPT_SETTERS()
Previously, device options had their superblock option field listed
separately, which was weird and easy to miss when defining options.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:35 -04:00
Alan Huang
dd7ae389ff bcachefs: Remove spurious smp_mb()
The smp_mb() is paired with nothing.

Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:35 -04:00
Alan Huang
5cc0ab39fb bcachefs: Fix incorrect state count
atomic64_read(&j->seq) - j->seq_write_started == JOURNAL_STATE_BUF_NR is
the condition in journal_entry_open where we return JOURNAL_ERR_max_open,
so journal_cur_seq(j) - seq == JOURNAL_STATE_BUF_NR means that the buf
corresponding to seq has started to write.

Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:35 -04:00
Kent Overstreet
16a8d5d00b bcachefs: Fix btree iter flags in data move
Rebalance requires a not_extents iterator.

This wasn't hit before because all_snapshots disableds is_extents on
snapshots btrees - but has no effect on the reflink btree.

Reported-by: Maël Kerbiriou <mael.kerbiriou@free.fr>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:35 -04:00
Kent Overstreet
92c7789a9e bcachefs: Validate bch_sb.offset field
This was missed - but it needs to be correct for the superblock recovery
tool that scans the start and end of the device for backup superblocks:
we don't want to pick up superblocks that belong to a different
partition that starts at a different offset.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:35 -04:00
Kent Overstreet
8bd875ae47 bcachefs: bch2_sb_validate() doesn't need bch_sb_handle
Minor refactoring, so that bch2_sb_validate() can be used in the new
userspace superblock recovery tool.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:35 -04:00
Kent Overstreet
5e67243ea6 bcachefs: Add missing random.h includes
Fix build in userspace, and good hygeine.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:34 -04:00
Kent Overstreet
2eb985c549 bcachefs: Better incompat version/feature error messages
If we can't mount because of an incompatibility, print what's supported
and unsupported - to help solve PEBKAC issues.

Reported-by: Roland Vet <vet.roland@protonmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:34 -04:00
Kent Overstreet
6aa446c05a bcachefs: Fix offset_into_extent in data move path
Fixes the following:

[   17.607394] kernel BUG at fs/bcachefs/reflink.c:261!
[   17.608316] Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[   17.608485] CPU: 0 UID: 0 PID: 564 Comm: bch-rebalance/3 Tainted: G           OE      6.14.0-rc6-arch1-gfcb0bd9609d2 #7 0efd7a8f4a00afeb2c5fb6e7ecb1aec8ddcbb1e1
[   17.608616] Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[   17.608736] Hardware name: Micro-Star International Co., Ltd. MS-7D75/MAG B650 TOMAHAWK WIFI (MS-7D75), BIOS 1.74 08/01/2023
[   17.608855] RIP: 0010:bch2_lookup_indirect_extent+0x252/0x290 [bcachefs]
[   17.609006] Code: 00 00 00 00 e8 7f 51 f5 ff 89 c3 85 c0 74 52 48 8b 7d b0 4c 89 ee e8 4d 4b f4 ff 48 63 d3 48 89 d0 31 d2 e9 2e ff ff ff 0f 0b <0f> 0b 48 8b 7d b0 4c 89 ee 48 89 55 a8 e8 2c 4b f4 ff 4c 8b 55 a8
[   17.609136] RSP: 0018:ffffa3714455f850 EFLAGS: 00010246
[   17.609261] RAX: 0000000000000080 RBX: ffff895891098790 RCX: 0000000000000000
[   17.609387] RDX: 0000000000000080 RSI: ffffa3714455fa90 RDI: ffff895889550000
[   17.609511] RBP: ffffa3714455f8c0 R08: ffff895891098790 R09: 0000000000000001
[   17.609637] R10: ffffa3714455f8d8 R11: ffffa3714455f950 R12: ffffa3714455fa58
[   17.609763] R13: ffff895891098790 R14: ffffa3714455fa58 R15: ffff895889550000
[   17.609888] FS:  0000000000000000(0000) GS:ffff896757c00000(0000) knlGS:0000000000000000
[   17.610015] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   17.610143] CR2: 0000716b8cda2750 CR3: 0000000914e22000 CR4: 0000000000f50ef0
[   17.610272] PKRU: 55555554
[   17.610403] Call Trace:
[   17.610535]  <TASK>
[   17.610662]  ? __die_body.cold+0x19/0x27
[   17.610791]  ? die+0x2e/0x50
[   17.610918]  ? do_trap+0xca/0x110
[   17.611049]  ? do_error_trap+0x6a/0x90
[   17.611178]  ? bch2_lookup_indirect_extent+0x252/0x290 [bcachefs c42b95c23facdfe11d39755520127cd771dddec2]
[   17.611331]  ? exc_invalid_op+0x50/0x70
[   17.611468]  ? bch2_lookup_indirect_extent+0x252/0x290 [bcachefs c42b95c23facdfe11d39755520127cd771dddec2]
[   17.611620]  ? asm_exc_invalid_op+0x1a/0x20
[   17.611757]  ? bch2_lookup_indirect_extent+0x252/0x290 [bcachefs c42b95c23facdfe11d39755520127cd771dddec2]
[   17.611911]  ? bch2_move_data_btree+0x58a/0x6c0 [bcachefs c42b95c23facdfe11d39755520127cd771dddec2]
[   17.612084]  bch2_move_data_btree+0x58a/0x6c0 [bcachefs c42b95c23facdfe11d39755520127cd771dddec2]
[   17.612256]  ? __pfx_rebalance_pred+0x10/0x10 [bcachefs c42b95c23facdfe11d39755520127cd771dddec2]
[   17.612431]  ? bch2_move_extent+0x3d7/0x6e0 [bcachefs c42b95c23facdfe11d39755520127cd771dddec2]
[   17.612607]  ? __bch2_move_data+0xea/0x200 [bcachefs c42b95c23facdfe11d39755520127cd771dddec2]
[   17.612782]  __bch2_move_data+0xea/0x200 [bcachefs c42b95c23facdfe11d39755520127cd771dddec2]
[   17.612959]  ? __pfx_rebalance_pred+0x10/0x10 [bcachefs c42b95c23facdfe11d39755520127cd771dddec2]
[   17.613149]  do_rebalance+0x517/0x8d0 [bcachefs c42b95c23facdfe11d39755520127cd771dddec2]
[   17.613342]  ? local_clock_noinstr+0xd/0xd0
[   17.613518]  ? local_clock+0x15/0x30
[   17.613693]  ? __bch2_trans_get+0x152/0x300 [bcachefs c42b95c23facdfe11d39755520127cd771dddec2]
[   17.613890]  ? __pfx_bch2_rebalance_thread+0x10/0x10 [bcachefs c42b95c23facdfe11d39755520127cd771dddec2]
[   17.614090]  bch2_rebalance_thread+0x66/0xb0 [bcachefs c42b95c23facdfe11d39755520127cd771dddec2]

The offset_into_extent bit was copied from the read path, but it's
unnecessary here, where we always want to read and move the entire
indirect extent, and it causes the assertion pop - because we're using a
non-extents iterator, which always points to the end of the reflink
pointer.

Reported-by: Maël Kerbiriou <mael.kerbiriou@free.fr>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:34 -04:00
Eric Biggers
71fbb0b86e bcachefs: use sha256() instead of crypto_shash API
Just use sha256() instead of the clunky crypto API.  This is much
simpler.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:34 -04:00
Eric Biggers
39abc73b59 bcachefs: Remove unnecessary softdeps on crc32c and crc64
Since bcachefs does not access crc32c and crc64 through the crypto API,
there is no need to use module softdeps to ensure they are loaded.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:34 -04:00
Kent Overstreet
9b39835e93 bcachefs: #if 0 out (enable|disable)_encryption()
These weren't hooked up, but they probably should be - add some comments
for context.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:34 -04:00
Kent Overstreet
9962cb7748 bcachefs: Improve can_write_extent()
This fixes another "rebalance spinning and doing no work" issue;
rebalance was reading extents it wanted to move, but then failing in
bch2_write() -> bch2_alloc_sectors_start() due to being unable to
allocate sufficient replicas.

This was triggered by a user playing with the durability settings, the
foreground device was an NVME device with durability=2, and originally
he'd set the background device to durability=2 as well, but changed it
back to 1 (the default) after seeing IO errors.

That meant that with replicas=2, we want to move data off the NVME
device which satisfies that constraint, but with a single durability=1
device on the background target there's no way to move the extent to
that target while satisfiying the "required replicas" constraint.

The solution for now is for bch2_data_update_init() to check for this,
and return an error - before kicking off the read.

bch2_data_update_init() already had two different checks for "will we be
able to write this extent", with partially duplicated code, so this
patch combines and improves that logic.

Additionally, we now always bail out and return an error if there's
insufficient space on the destination target. Previously, we only did
this for BCH_WRITE_alloc_nowait moves, because it might be the case that
copygc just needs to free up space on the destination target.

But we really shouldn't kick off a move if the destination is full, we
can't currently distinguish between "really full" and "just need to wait
for copygc", and if we are going to wait on copygc it'd be better to do
that before kicking off the move.

This will additionally fix "rebalance spinning" issues caused by a
filesystem that has more data than can fit in background_target - which
is a valid scenario, since we don't exclude foreground/cache devices
when calculating filesystem capacity.

Reported-by: Maël Kerbiriou <mael.kerbiriou@free.fr>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:34 -04:00
Kent Overstreet
fb8a9a32cc bcachefs: trace_io_move_write_fail
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:34 -04:00
Alan Huang
76bc6e51cd bcachefs: Increase blacklist range
Now there are 16 journal buffers, 8 is too small to be enough.

Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:34 -04:00
Kent Overstreet
35de2abc22 bcachefs: __bch2_read() now takes a btree_trans
Next patch will be checking if the extent we're reading from matches the
IO failure we saw before marking the failure.

For this to work, __bch2_read() needs to take the same transaction
context that bch2_rbio_retry() uses to do that check.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:34 -04:00
Kent Overstreet
3fb8bacb14 bcachefs: BCH_READ_data_update -> bch_read_bio.data_update
Read flags are codepath dependent and change as they're passed around,
while the fields in rbio._state are mostly fixed properties of that
particular object.

Losing track of BCH_READ_data_update would be bad, and previously it was
not obvious if it was always correctly set in the rbio, so this is a
safety cleanup.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 09:50:34 -04:00
Uros Bizjak
8a3c392388 percpu: use TYPEOF_UNQUAL() in variable declarations
Use TYPEOF_UNQUAL() to declare variables as a corresponding type without
named address space qualifier to avoid "`__seg_gs' specified for auto
variable `var'" errors.

Link: https://lkml.kernel.org/r/20250127160709.80604-4-ubizjak@gmail.com
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Acked-by: Nadav Amit <nadav.amit@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16 22:05:53 -07:00
Kent Overstreet
be31e412ac bcachefs: Checksum errors get additional retries
It's possible for checksum errors to be transient - e.g. flakey
controller or cable, thus we need additional retries (besides retrying
from different replicas) before we can definitely return an error.

This is particularly important for the next patch, which will allow the
data move path to move extents with checksum errors - we don't want to
accidentally introduce bitrot due to a transient error!

- bch2_bkey_pick_read_device() is substantially reworked, and
  bch2_dev_io_failures is expanded to record more information about the
  type of failure (i.e. number of checksum errors).

  It now returns an error code that describes more precisely the reason
  for the failure - checksum error, io error, or offline device, instead
  of the previous generic "insufficient devices". This is important for
  the next patches that add poisoning, as we only want to poison extents
  when we've got real checksum errors (or perhaps IO errors?) - not
  because a device was offline.

- Add a new option and superblock field for the number of checksum
  retries.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-16 13:47:55 -04:00
Kent Overstreet
ccba9029b0 bcachefs: Print message on successful read retry
Users have been asking for this, and now that errors are returned to the
top level read retry path - we can.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-16 13:47:55 -04:00
Kent Overstreet
de73677ff8 bcachefs: Return errors to top level bch2_rbio_retry()
Next patch will be adding an additional retry loop for checksum errors,
so that we can rule out transient errors before marking an extent as
poisoned.

Prerequisite to this is returning errors to bch2_rbio_retry(); this will
also let us add a "successful retry" message.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-16 13:47:55 -04:00
Kent Overstreet
881b598ef1 bcachefs: BCH_ERR_data_read_buffer_too_small
Now that the read path uses proper error codes, we can get rid of the
weird rbio->hole signalling to the move path that the read didn't
happen.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-16 13:47:55 -04:00
Kent Overstreet
f4b84bac20 bcachefs: Read error message now indicates if it was for an internal move
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-16 13:47:55 -04:00
Kent Overstreet
e75993b0bf bcachefs: Fix BCH_ERR_data_read_csum_err_maybe_userspace in retry path
When we do a read to a buffer that's mapped into userspace, it's
possible to get a spurious checksum error if userspace was modified the
buffer at the same time.

When we retry those, they have to be bounced before we know definitively
whether we're reading corrupt data.

But the retry path propagates read flags differently, so needs special
handling.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-16 13:47:55 -04:00
Kent Overstreet
943f0cfb15 bcachefs: Convert read path to standard error codes
Kill the READ_ERR/READ_RETRY/READ_RETRY_AVOID enums, and add standard
error codes that describe precisely which error occured.

This is going to be used for the data move path, to move but poison
extents with checksum errors.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-16 13:47:55 -04:00
Kent Overstreet
5a06cb8000 bcachefs: Debug params for data corruption injection
dm-flakey is busted, and this is simpler anyways - this lets us test the
checksum error retry ptahs

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-16 13:47:55 -04:00
Kent Overstreet
6d80fca9ef bcachefs: Don't create bch_io_failures unless it's needed
Only needed in retry path, no point in wasting stack space.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-16 13:47:55 -04:00
Kent Overstreet
9ec0089149 bcachefs: bch2_bkey_ptrs_rebalance_opts()
Small optimization for bch2_bkey_sectors_need_rebalance()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-16 13:47:55 -04:00
Kent Overstreet
7c1e2a254f bcachefs: Add a cond_resched() to btree cache teardown
[12308.606480] watchdog: BUG: soft lockup - CPU#18 stuck for 26s! [umount:48479]
[12308.606485] Modules linked in: bcachefs lz4hc_compress lz4_compress lz4_decompress sunrpc overlay nf_conntrack_netlink xt_nat xt_tcpudp veth xt_conntrack xt_MASQUERADE bridge stp llc xfrm_user ip6table_nat ip6table_filter ip6_tables iptable_nat xt_addrtype iptable_filter ip_tables x_tables nfnetlink_cttimeout nfnetlink openvswitch nsh nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 psample ext4 mbcache jbd2 nls_iso8859_1 nls_cp850 vfat fat binfmt_misc skx_edac_common nfit edac_core libnvdimm cbc encrypted_keys intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common ipmi_ssif x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm drivetemp rapl intel_cstate coretemp mgag200 i2c_algo_bit ixgbe drm_shmem_helper drm_kms_helper mdio_devres xfrm_algo mdio drm ptp intel_uncore mei_me efi_pstore evdev uas pl2303 pps_core libphy usb_storage usbserial lpc_ich mei drm_panel_orientation_quirks acpi_power_meter tiny_power_button ipmi_si mfd_core intel_pch_thermal acpi_tad acpi_ipmi ioatdma
[12308.606541]  ipmi_devintf ipmi_msghandler dca wmi button efivarfs polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 sha1_generic xhci_pci xhci_hcd aesni_intel ehci_pci ehci_hcd gf128mul crypto_simd cryptd usbcore hpwdt usb_common
[12308.606557] CPU: 18 UID: 0 PID: 48479 Comm: umount Tainted: G             L     6.14.0-rc6-x86_64-00159-ga09496a03e63 #1
[12308.606560] Tainted: [L]=SOFTLOCKUP
[12308.606561] Hardware name: HPE ProLiant DL380 Gen10/ProLiant DL380 Gen10, BIOS U30 07/20/2023
[12308.606563] RIP: 0010:clear_page_erms+0x7/0x10
[12308.606570] Code: 48 89 47 38 48 8d 7f 40 75 d9 90 c3 cc cc cc cc 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 b9 00 10 00 00 31 c0 <f3> aa c3 cc cc cc cc 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90
[12308.606572] RSP: 0018:ffff9ed5b622fba0 EFLAGS: 00010246
[12308.606574] RAX: 0000000000000000 RBX: ffff90347fffe6c0 RCX: 00000000000004c0
[12308.606575] RDX: ffffe34ea9bec1c0 RSI: 00000000000405f0 RDI: ffff902eafb07b40
[12308.606576] RBP: ffff9ed5b622fbf0 R08: 0000000000000001 R09: 0000000000000006
[12308.606577] R10: 0000000000040001 R11: 0000000000000000 R12: ffffe34ea9bec000
[12308.606578] R13: 0000000000000000 R14: 0000000000000006 R15: ffffe34ea9bed000
[12308.606580] FS:  00007fe704ecfb68(0000) GS:ffff9053fea00000(0000) knlGS:0000000000000000
[12308.606581] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[12308.606582] CR2: 00007f18159068ae CR3: 00000001314d0005 CR4: 00000000007726f0
[12308.606583] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[12308.606584] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[12308.606584] PKRU: 55555554
[12308.606585] Call Trace:
[12308.606587]  <IRQ>
[12308.606590]  ? show_regs.cold+0x19/0x28
[12308.606595]  ? watchdog_timer_fn.cold+0x3d/0x9d
[12308.606598]  ? __pfx_watchdog_timer_fn+0x10/0x10
[12308.606602]  ? __hrtimer_run_queues+0x12e/0x250
[12308.606607]  ? hrtimer_interrupt+0xfd/0x220
[12308.606609]  ? __sysvec_apic_timer_interrupt+0x53/0xe0
[12308.606614]  ? sysvec_apic_timer_interrupt+0x76/0xa0
[12308.606619]  </IRQ>
[12308.606620]  <TASK>
[12308.606620]  ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
[12308.606626]  ? clear_page_erms+0x7/0x10
[12308.606628]  ? __free_pages_ok+0x374/0x640
[12308.606633]  free_frozen_pages+0x34/0x570
[12308.606636]  __folio_put+0x87/0xe0
[12308.606641]  free_large_kmalloc+0x70/0x80
[12308.606645]  kfree+0x2f6/0x390
[12308.606648]  kvfree+0x2d/0x40
[12308.606653]  __btree_node_data_free+0xaf/0xf0 [bcachefs]
[12308.606726]  btree_node_data_free+0x6a/0x80 [bcachefs]
[12308.606778]  bch2_fs_btree_cache_exit+0x262/0x440 [bcachefs]
[12308.606829]  bch2_fs_release+0xe8/0x340 [bcachefs]
[12308.606905]  kobject_put+0x60/0xc0
[12308.606908]  bch2_fs_free+0xdd/0x120 [bcachefs]
[12308.606981]  bch2_kill_sb+0x1e/0x30 [bcachefs]
[12308.607051]  deactivate_locked_super+0x32/0xb0
[12308.607055]  deactivate_super+0x40/0x50
[12308.607057]  cleanup_mnt+0xc3/0x160
[12308.607060]  __cleanup_mnt+0x12/0x20
[12308.607062]  task_work_run+0x5f/0xa0
[12308.607064]  syscall_exit_to_user_mode+0x194/0x1a0
[12308.607066]  do_syscall_64+0x67/0x170
[12308.607068]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[12308.607070] RIP: 0033:0x7fe704e66eed
[12308.607073] Code: 08 49 89 ca b8 a5 00 00 00 0f 05 48 89 c7 e8 8a e6 ff ff 48 83 c4

Reported-by: Stijn Tintel <stijn@linux-ipv6.be>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-16 13:47:55 -04:00
Kent Overstreet
c991fbee8e bcachefs: rebalance, copygc status also print stacktrace
These are commonly needed when debugging, and saves from having to ask
users to dig.

Also, rebalance_status now includes pending rebalance work.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-16 13:47:55 -04:00
Kent Overstreet
8dc4514d58 bcachefs: Kill bch2_remount()
Single caller, so inline it.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:03:16 -04:00
Kent Overstreet
a2e9e68746 bcachefs: Kill a bit of dead code
Found with CC=clang W=1

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:16 -04:00
Thorsten Blum
ff4cb203cc bcachefs: Use max() to improve gen_after()
Use max() to simplify gen_after() and improve its readability.

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:16 -04:00
Thorsten Blum
c073ec6bec bcachefs: Remove unnecessary byte allocation
The extra byte is not used - remove it.

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:16 -04:00
Kent Overstreet
94373026d9 bcachefs: We no longer read stripes into memory at startup
And the stripes heap gets deleted.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:16 -04:00
Kent Overstreet
434a3f2ffa bcachefs: trace_stripe_create
Add a simple tracepoint for stripe creation, we'll want to expand this
later.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:16 -04:00
Kent Overstreet
6c336144b9 bcachefs: get_existing_stripe() uses new stripe lru
Convert to the new persistent stripe LRU.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:16 -04:00
Kent Overstreet
039790cfb5 bcachefs: ec_stripe_delete() uses new stripe lru
Convert to the new persistent stripe LRU.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:16 -04:00
Kent Overstreet
4b0fac4bed bcachefs: journal write path comment
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:16 -04:00
Kent Overstreet
981e380144 bcachefs: Kick devices out after too many write IO errors
We're improving our handling of write errors - we shouldn't write
degraded data just because a write failed once, we should retry it (on
other devices, if possible).

But for this to work, we need to kick devices out when they're only
returning errors - otherwise those retries will loop infinitely.

This adds a configurable timeout - if writes are failing for too long,
we'll set that device read-only.

In the future we should also implement more tracking and another knob
for an "allowed error rate", so that we can kick out drives that are
acting "unhealthy".

Another thing we'll want is a mechanism (likely in userspace) for
bringing a device back in after a transient error - perhaps a cable was
jiggled, or there was a controller reset.

After transient errors we also need a mechanism to walk (from the
journal) recent btree updates that weren't flushed to that device and
treat them as "degraded", since unflushed data may well not have been
written. Out of scope for this patch, but becoming relevant.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:16 -04:00
Kent Overstreet
d71e023376 bcachefs: Change BCH_MEMBER_STATE_failed semantics
Previously, we woudn't try to read at all from a failed device - that
doesn't make much sense, the device may be unhealthy (perhaps taking
longer than it should to service reads), but if it's our only option we
should still try to read from it.

Now, bch2_bkey_pick_read_device() will pick failed devices only if there
are no non-failed replicas to read from.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:16 -04:00
Kent Overstreet
cf164a9106 bcachefs: bch2_dev_get_ioref() may now sleep
The next patch implementing freezing will change bch2_dev_get_ioref() to
sleep if a device is currently frozen.

Add an annotation and fix the journal code accordingly.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:16 -04:00
Kent Overstreet
2efa8397ca bcachefs: Fix btree_node_scan io_ref handling
This was completely fubar; it's now simplified a bit as well.
Note that for_each_online_member() takes and releases io_refs as it
iterates, so we need to release that if we break.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:16 -04:00
Kent Overstreet
d5308203a8 bcachefs: Implement blk_holder_ops
We can't use the standard fs_holder_ops because they're meant for single
device filesystems - fs_bdev_mark_dead() in particular - and they assume
that the blk_holder is the super_block, which also doesn't work for a
multi device filesystem.

These generally follow the standard fs_holder_ops; the
locking/refcounting is a bit simplified because c->ro_ref suffices, and
bch2_fs_bdev_mark_dead() is not necessarily shutting down the entire
filesystem.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:16 -04:00
Kent Overstreet
1fdbe0b184 bcachefs: Make sure c->vfs_sb is set before starting fs
This is necessary for the new blk_holder_ops, which want the vfs
super_block available for synchronization.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:16 -04:00
Kent Overstreet
13fd6be102 bcachefs: Stash a pointer to the filesystem for blk_holder_ops
Note that we open block devices before we allocate bch_fs, but once
attached to a filesystem they will be closed before the bch_fs is torn
down - so stashing a pointer without a refcount looks incorrect but it's
not.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:16 -04:00
Kent Overstreet
b31c070407 bcachefs: Finish bch2_account_io_completion() conversions
More prep work for automatically kicking devices out after too many IO
errors.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:16 -04:00
Kent Overstreet
3526bca36b bcachefs: bch2_account_io_completion()
We need to start accounting successes for every IO, not just failures,
so introduce a unified hook for io completion accounting and convert
io_read.c.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:16 -04:00
Kent Overstreet
3480aecd5f bcachefs: Fix read path io_ref handling
We were using our device pointer after we'd released our ref to it.

Unlikely to be a race that's practical to hit, since actually removing a
member device is a whole process besides just taking it offline, but -
needs to be fixed.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:16 -04:00
Kent Overstreet
7bc5808168 bcachefs: data_update now checks for extents that can't be moved
If a device is ro or failed, we might not have anywhere to move a
replica.

Check for this early, before doing the read and attempting to write.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:15 -04:00
Kent Overstreet
fba513a9ee bcachefs: give bch2_write_super() a proper error code
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:15 -04:00
Kent Overstreet
4a90675cfe bcachefs: bcachefs_metadata_version_extent_flags
This implements a new extent field bitflags that apply to the whole
extent. There's been a couple things we've wanted this for in the past,
but the immediate need is extent poisoning, to solve a rebalance issue.

Unknown extent fields can't be parsed (we won't known their size, so we
can't advance to the next field), so this is an incompat feature, and
using it prevents the filesystem from being mounted by old versions.

This also adds the BCH_EXTENT_poisoned flag; this indicates that the
data is known to be bad (i.e. there was a checksum error, and we had to
write a new checksum) and reads will return errors.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:15 -04:00
Kent Overstreet
6422bf8117 bcachefs: bch2_request_incompat_feature() now returns error code
For future usage, we'll want a dedicated error code for better
debugging.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:15 -04:00
Thorsten Blum
bafd41b435 bcachefs: Fix error type in bch2_alloc_v3_validate()
Use error type alloc_v3_unpack_error in bch2_alloc_v3_validate().

Fixes: b65db750e2 ("bcachefs: Enumerate fsck errors")
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:15 -04:00
Kent Overstreet
fb195fa753 bcachefs: BCH_SB_FEATURES_ALL includes BCH_FEATURE_incompat_verison_field
These features are set on format and incompat upgarde.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:15 -04:00
Kent Overstreet
24d790a7da bcachefs: sysfs internal/trigger_btree_updates
Add a debug knob to manually trigger the btree updates worker.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:15 -04:00
Joshua Ashton
d37c14ac6f bcachefs: bcachefs_metadata_version_casefolding
This patch implements support for case-insensitive file name lookups
in bcachefs.

The implementation uses the same UTF-8 lowering and normalization that
ext4 and f2fs is using.

More information is provided in Documentation/bcachefs/casefolding.rst

Compatibility notes:

This uses the new versioning scheme for incompatible features where an
incompatible feature is tied to a version number: the superblock says
"we may use incompat features up to x" and "incompat features up to x
are in use", disallowing mounting by previous versions.

Additionally, and old style incompat feature bit is used, so that
kernels without utf8 casefolding support know if casefolding
specifically is in use and they're allowed to mount.

Signed-off-by: Joshua Ashton <joshua@froggi.es>
Cc: André Almeida <andrealmeid@igalia.com>
Cc: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:15 -04:00
Joshua Ashton
76872d46b7 bcachefs: Split out dirent alloc and name initialization
Splits out the code that allocates the dirent and initializes the name
to make things easier to implement casefolding in a future commit.

Cc: André Almeida <andrealmeid@igalia.com>
Cc: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Joshua Ashton <joshua@froggi.es>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:15 -04:00
Kent Overstreet
72f4edcf45 bcachefs: Kill dirent_occupied_size() in create path
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:15 -04:00
Kent Overstreet
68171d91ce bcachefs: Kill dirent_occupied_size() in rename path
Cc: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:15 -04:00
Kent Overstreet
6756e385a5 bcachefs: bcachefs_metadata_version_stripe_lru
Add a persistent LRU for stripes, ordered by "number of empty blocks",
i.e. order in which we wish to reuse them.

This will replace the in-memory stripes heap, so we can kill off reading
stripes into memory at startup.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:15 -04:00
Kent Overstreet
88d961b518 bcachefs: bcachefs_metadata_version_stripe_backpointers
Stripes now have backpointers.

This is needed for proper scrub - stripe checksums need to be verified,
separately from extents within the stripe, since a block may not be full
of live extents but it's still needed for reconstruct.

And this will be needed for (efficient) evacuate/repair paths.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:15 -04:00
Kent Overstreet
69bd8a9277 bcachefs: Advance bch_alloc.oldest_gen if no stale pointers
Now that we've got cached backpointers and aren't leaving around stale
pointers on bucket invalidation, we no longer need the periodic (rare)
gc_gens - which recalculates each bucket's oldest gen to avoid wraparound.

We can't delete that code because we've got to support existing
filesystems that will still have stale pointers, but this gets rid of
another scalability limit.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:14 -04:00
Kent Overstreet
942a418c7a bcachefs: Invalidate cached data by backpointers
If we don't leave stale pointers around, we won't have to deal with
bucket gen wraparound.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:14 -04:00
Kent Overstreet
15800f3d4b bcachefs: bcachefs_metadata_version_cached_backpointers
Cached pointers now have backpointers.

This means that we'll be able to kill cached pointers in the
bucket_invalidate path, when invalidating/reusing buckets containing
cached data, instead of leaving them around to be cleaned up by gc_gens
garbago collection - which requires a full metadata scan.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:14 -04:00
Kent Overstreet
65bc7688b8 bcachefs: rework bch2_trans_commit_run_triggers()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:14 -04:00
Kent Overstreet
c7c07bf250 bcachefs: Better trigger ordering
Transactional triggers need to run in a defined ordering, which is not
quite the same as btree ID integer comparison.

Previously this was handled in a hacky way in
bch2_trans_commit_run_triggers(), since it was only the alloc btree that
needed special handling, but upcoming stripe btree changes are going to
require more ordering changes - so, define that ordering.

Next patch will change the transaction commit path to use it.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:14 -04:00
Kent Overstreet
cc297dfb41 bcachefs: bch2_trigger_stripe_ptr() no longer uses ec_stripes_heap_lock
Introduce per-entry locks, like with struct bucket - the stripes heap is
going away.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:14 -04:00
Kent Overstreet
bc76ba70d2 bcachefs: Rework bch2_check_lru_key()
It's now easier to add new LRU types.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:14 -04:00
Kent Overstreet
3aff608b86 bcachefs: decouple bch2_lru_check_set() from alloc btree
Pass in the backpointer explicitly, instead of assuming 'referring_k' is
an alloc key and calculating it.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:14 -04:00
Kent Overstreet
b8e37c1645 bcachefs: s/BCH_LRU_FRAGMENTATION_START/BCH_LRU_BUCKET_FRAGMENTATION/
FRAGMENTATION_START was incorrect, there's currently only one
fragmentation LRU (at the end of the reserved bits for LRU type), and
we're getting ready to add a stripe fragmentation lru - so give it a
better name.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:14 -04:00
Kent Overstreet
e130496707 bcachefs: bch2_lru_change() checks for no-op
Minor cleanup, no reason for the caller to have to this.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:14 -04:00
Kent Overstreet
cb87f623c1 bcachefs: minor journal errcode cleanup
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:14 -04:00
Kent Overstreet
1ccbcd3205 bcachefs: bch2_write_op_error() now prints info about data update
A user has been seeing the "error verifying existing checksum while
rewriting existing data (memory corruption?)" error.

This generally indicates a hardware issue (and that may be the case
here), but it might also indicate a bug, in which case we need more
information to look for patterns.

Reported-by: Roland Vet <vet.roland@protonmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:14 -04:00
Kent Overstreet
3faa4647a0 bcachefs: metadata_target is not an inode option
This option only applies filesystem wide.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:14 -04:00
Andreas Gruenbacher
f27614652c bcachefs: eytzinger1_{next,prev} cleanup
The eytzinger code was previously relying on the following wrap-around
properties and their "eytzinger0" equivalents:

  eytzinger1_prev(0, size) == eytzinger1_last(size)
  eytzinger1_next(0, size) == eytzinger1_first(size)

However, these properties are no longer relied upon and no longer
necessary, so remove the corresponding asserts and forbid the use of
eytzinger1_prev(0, size) and eytzinger1_next(0, size).

This allows to further simplify the code in eytzinger1_next() and
eytzinger1_prev(): where the left shifting happens, eytzinger1_next() is
trying to move i to the lowest child on the left, which is equivalent to
doubling i until the next doubling would cause it to be greater than
size.  This is implemented by shifting i to the left so that the most
significant bits align and then shifting i to the right by one if the
result is greater than size.

Likewise, eytzinger1_prev() is trying to move to the lowest child on the
right; the same applies here.

The 1-offset in (size - 1) in eytzinger1_next() isn't needed at all, but
the equivalent offset in eytzinger1_prev() is surprisingly needed to
preserve the 'eytzinger1_prev(0, size) == eytzinger1_last(size)'
property.  However, since we no longer support that property, we can get
rid of these offsets as well.  This saves one addition in each function
and makes the code less confusing.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:14 -04:00
Andreas Gruenbacher
68eb4c5fea bcachefs: convert eytzinger sort to be 1-based (2)
In this second step, transform the eytzinger indexes i, j, and k in
eytzinger1_sort_r() from 0-based to 1-based.  This step looks a bit
messy, but the resulting code is slightly better.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:14 -04:00
Andreas Gruenbacher
3ff0dd28d6 bcachefs: convert eytzinger sort to be 1-based (1)
In this first step, convert the eytzinger sort functions to use 1-based
primitives.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:14 -04:00
Andreas Gruenbacher
3849bcab4d bcachefs: convert eytzinger0_find to be 1-based
Several of the algorithms on eytzinger trees are implemented in terms of
the eytzinger0 primitives.  However, those algorithms can just as easily
be expressed in terms of the eytzinger1 primitives, and that leads to
better and easier to understand code.  Start by converting
eytzinger0_find().

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:14 -04:00
Andreas Gruenbacher
956032edd2 bcachefs: Add eytzinger0_find self test
Function eytzinger0_find() isn't currently covered, so add a self test.

We can rely on eytzinger0_find_le() here because it is being
tested independently.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:14 -04:00
Andreas Gruenbacher
63ce189b00 bcachefs: add eytzinger0_find_ge self test
Add an eytzinger0_find_ge() self test similar to eytzinger0_find_gt().

Note that this test requires eytzinger0_find_ge() to return the first
matching element in the array in case of duplicates.  To prevent
bisection errors, we only add this test after strenghening the original
implementation (see the previous commit).

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:14 -04:00
Andreas Gruenbacher
11223d0e7b bcachefs: implement eytzinger0_find_ge directly
Implement eytzinger0_find_ge() directly instead of implementing it in
terms of eytzinger0_find_le() and adjusting the result.

This turns eytzinger0_find_ge() into a minimum search, so when there are
duplicate elements, the result of eytzinger0_find_ge() will now always
point at the first matching element.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:14 -04:00
Andreas Gruenbacher
2182f29545 bcachefs: implement eytzinger0_find_gt directly
Instead of implementing eytzinger0_find_gt() in terms of
eytzinger0_find_le() and adjusting the result, implement it directly.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:13 -04:00
Andreas Gruenbacher
d7cd33f7ef bcachefs: add eytzinger0_find_gt self test
Add an eytzinger0_find_gt() self test similar to eytzinger0_find_le().

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:13 -04:00
Andreas Gruenbacher
d384dada0e bcachefs: simplify eytzinger0_find_le
Replace the over-complicated implementation of eytzinger0_find_le() by
an equivalent, simpler version.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:13 -04:00
Andreas Gruenbacher
d148d804f2 bcachefs: convert eytzinger0_find_le to be 1-based
eytzinger0_find_le() is also easy to concert to 1-based eytzinger (but
see the next commit).

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:13 -04:00
Andreas Gruenbacher
c722b818a2 bcachefs: improve eytzinger0_find_le self test
Rename eytzinger0_find_test_val() to eytzinger0_find_test_le() and add a
new eytzinger0_find_test_val() wrapper that calls it.

We have already established that the array is sorted in eytzinger order,
so we can use the eytzinger iterator functions and check the boundary
conditions to verify the result of eytzinger0_find_le().

Only scan the entire array if we get an incorrect result.  When we need
to scan, use eytzinger0_for_each_prev() so that we'll stop at the
highest matching element in the array in case there are duplicates;
going through the array linearly wouldn't give us that.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:13 -04:00
Andreas Gruenbacher
dc5ceaaad8 bcachefs: add eytzinger0_for_each_prev
Add an eytzinger0_for_each_prev() macro for iterating through an
eytzinger array in reverse.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:13 -04:00
Andreas Gruenbacher
e8a0966ffa bcachefs: eytzinger0_find_test improvement
In eytzinger0_find_test(), remember the smallest element seen so far
instead of comparing adjacent array elements.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:13 -04:00
Andreas Gruenbacher
ec70103f9b bcachefs: eytzinger[01]_test improvement
In eytzinger[01]_test(), make sure that eytzinger[01]_for_each()
iterates over all array elements.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:13 -04:00
Andreas Gruenbacher
0766f5599c bcachefs: eytzinger self tests: fix cmp_u16 typo
Fix an obvious typo in cmp_u16().

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:13 -04:00
Andreas Gruenbacher
0ede49212a bcachefs: eytzinger self tests: missing newline termination
pr_info() format strings need to be newline terminated.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:13 -04:00
Andreas Gruenbacher
217ad1d7c7 bcachefs: eytzinger self tests: loop cleanups
The iterator variable of eytzinger0_for_each() loops has been changed to
be locally scoped at some point, so remove variables defined outside the
loop that are now unused.  In addition and for clarity, use a different
variable inside those loops where an outside variable would be shadowed.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:13 -04:00
Andreas Gruenbacher
d54b82ecc4 bcachefs: EYTZINGER_DEBUG fix
When EYTZINGER_DEBUG is defined, <linux/bug.h> needs to be included.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:13 -04:00
Andreas Gruenbacher
f7f9be0238 bcachefs: bch2_blacklist_entries_gc cleanup
Use an eytzinger0_for_each() loop here.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:13 -04:00
Kent Overstreet
34a493089a bcachefs: bch2_bkey_ptr_data_type() now correctly returns cached for cached ptrs
Necessary for adding backpointers for cached pointers.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:13 -04:00
Kent Overstreet
fd49882f12 bcachefs: Add time_stat for btree writes
We have other metadata IO types covered, this was missing.

Note: this includes the time until completion, i.e. including parent
pointer update.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:13 -04:00
Kent Overstreet
b7f648e2ec bcachefs: Add comment explaining why asserts in invalidate_one_bucket() are impossible
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:13 -04:00
Kent Overstreet
7606fb4d26 bcachefs: Ignore backpointers to stripes in ec_stripe_update_extents()
Prep work for stripe backpointers: this path previously would get very
confused at being asked to process (remove redundant replicas) stripes.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:13 -04:00
Kent Overstreet
898bda5b72 bcachefs: Increase JOURNAL_BUF_NR
Increase journal pipelining.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:13 -04:00
Kent Overstreet
35282ce9e8 bcachefs: Free journal bufs when not in use
Since we're increasing the number of 'struct journal_bufs', we don't
want them all permanently holding onto buffers for the journal data -
that'd be 16 * 2MB = 32MB, or potentially more.

Add a single-element mempool (open coded, since buffer size varies),
this also means we won't be hitting the memory allocator every time we
open and close a journal entry/buffer.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:13 -04:00
Kent Overstreet
2e853fdbc7 bcachefs: Don't touch journal_buf->data->seq in journal_res_get
This is a small optimization, reducing the number of cachelines we touch
in the fast path - and it's also necessary for the next patch that
increases JOURNAL_BUF_NR.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:13 -04:00
Kent Overstreet
199a3578ed bcachefs: Kill journal_res.idx
More dead code.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:12 -04:00
Kent Overstreet
c2be81d48a bcachefs: Kill journal_res_state.unwritten_idx
Dead code

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:12 -04:00
Kent Overstreet
3eccc02035 bcachefs: add progress indicator to check_allocations
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:12 -04:00
Kent Overstreet
491eda6394 bcachefs: Add a progress indicator to bch2_dev_data_drop()
This code needs quite a bit of work: we don't want to be walking all
metadata in the filesystem, we should just be walking backpointers, and
it should be switched to a data ioctl that can report progress via a
file descriptor, not the system console.

But that'll take more work - before we can safely walk only backpointers
we need to change device add to not reuse device indexes, since with
that change accounting being wrong introduces the possibility of
removing a device that still has pointers.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:12 -04:00
Kent Overstreet
baabeb4997 bcachefs: Factor out progress.[ch]
the backpointers code has progress indicators; these aren't great, since
they print to the dmesg console and we much prefer to have progress
indicators reporting to a specific userspace program so they're not
spamming the system console.

But not all codepaths that need progress indicators support that yet,
and we don't want users to think "this is hung".

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:12 -04:00
Kent Overstreet
06284963e3 bcachefs: bch2_inum_offset_err_msg_trans() no longer handles transaction restarts
we're starting to use error messages with paths in fsck_errors(), where
we do not want nested transaction restart handling, so let's prepare for
that.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:12 -04:00
Kent Overstreet
45f0e6c838 bcachefs: bch2_indirect_extent_missing_error() prints path, not just inode number
We want all error messages converted to print paths, not just inode
numbers - users want this information, and it speeds up debugging too.

Auditing and converting all error messages is going to be a big project,
so for the moment we're just doing this incrementally.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:12 -04:00
Kent Overstreet
e63cf203d7 bcachefs: Convert migrate to move_data_phys()
Iterating over backpointers on a specific device is potentially much
cheaper than walking all filesystem data.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:12 -04:00
Kent Overstreet
157ea58341 bcachefs: Read/move path counter work
Reorganize counters a bit, grouping related counters together.

New counters:
- io_read_inline
- io_read_hole

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:12 -04:00
Alan Huang
7d8321a286 bcachefs: Fix subtraction underflow
When ancestor is less than IS_ANCESTOR_BITMAP, we would get an incorrect
result.

Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:12 -04:00
Kent Overstreet
f269ae55d2 bcachefs: Scrub
Add a new data op to walk all data and metadata in a filesystem,
checking if it can be read successfully, and on error repairing from
another copy if possible.

- New helper: bch2_dev_idx_is_online(), so that we can bail out and
  report to userspace when we're unable to scrub because the device is
  offline

- data_update_opts, which controls the data move path, now understands
  scrub: data is only read, not written. The read path is responsible
  for rewriting on read error, as with other reads.

- scrub_pred skips data extents that don't have checksums

- bch_ioctl_data has a new scrub member, which has a data_types field
  for data types to check - i.e. all data types, or only metadata.

- Add new entries to bch_move_stats so that we can report numbers for
  corrected and uncorrected errors

- Add a new enum to bch_ioctl_data_event for explicitly reporting
  completion and return code (i.e. device offline)

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:12 -04:00
Kent Overstreet
3e2ad29865 bcachefs: bch2_btree_node_scrub()
Add a function for scrubbing btree nodes - reading them in, and kicking
off a rewrite if there's an error.

The btree_node_read_done() checks have to be duplicated because we're
not using a pointer to a struct btree - the btree node might already be
in cache, and we need to check a specific replica, which might not be
the one we previously read from.

This will be used in the next patch implementing high-level scrub.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:12 -04:00
Kent Overstreet
ca24130ee4 bcachefs: bch2_bkey_pick_read_device() can now specify a device
To be used for scrub, where we want the read to come from a specific
device.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:12 -04:00
Kent Overstreet
2a2f7aaa8d bcachefs: __bch2_move_data_phys() now uses bch2_btree_node_rewrite_pos()
Kill most of the separate logic for btree nodes.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:12 -04:00
Kent Overstreet
987fdbdb40 bcachefs: bch2_move_data_phys()
Add a more general version of bch2_evacuate_bucket - to be used for
scrub.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:12 -04:00
Kent Overstreet
12188c9e2b bcachefs: bch2_btree_node_rewrite_pos()
Add a new helper for rewriting a btree node given a just the key, not a
pointer to the node itself.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:12 -04:00
Kent Overstreet
ca16fa6b86 bcachefs: backpointer_get_key() doesn't pull in btree node
We may not need to pull in a btree node when walking backpointers -
don't do so unnecessarily when using backpointer_get_key().

It'll still fall back to backpointer_get_node() in a few situations,
including btree roots (where an iterator can't point at just the key),
and races due to the interior update path not having deleted a
backpointer to an old node yet.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:12 -04:00
Kent Overstreet
dff6de9518 bcachefs: Internal reads can now correct errors
Rework the read path so that BCH_READ_NODECODE reads now also self-heal
after a read error and a successful retry - prerequisite for scrub.

- __bch2_read_endio() now handles a read that's both BCH_READ_NODECODE
  and a bounce.

  Normally, we don't want a BCH_READ_NODECODE read to ever allocate a
  split bch_read_bio: we want to maintain the relationship between the
  bch_read_bio and the data_update it's embedded in.

  But correcting read errors requires allocating a split/bounce rbio
  that's embedded in a promote_op. We do still have a 1-1 relationship,
  i.e. we only allocate a single split/bounce if it's a
  BCH_READ_NODECODE, so things hopefully don't get too crazy.

- __bch2_read_extent() now is allowed to allocate the promote_op for
  rewriting after a failed read, even if it's BCH_READ_NODECODE.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:12 -04:00
Kent Overstreet
7b1d655106 bcachefs: Don't self-heal if a data update is already rewriting
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:12 -04:00
Kent Overstreet
4dfb76e0ad bcachefs: Don't start promotes from bch2_rbio_free()
we don't want to block completion of the read - starting a promote calls
into the write path, which will block.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:12 -04:00
Kent Overstreet
7e9ed60f5f bcachefs: Bail out early on alloc_nowait data updates
If a data update doesn't want to block on allocations (promotes, self
healing on read error) - check if the allocation would fail before
kicking off the data update and calling into the write path.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:11 -04:00
Kent Overstreet
c37d42a0e2 bcachefs: Rework init order in bch2_data_update_init()
Initialize the write op first, so that in the next patch we can check if
the allocator would block (for BCH_WRITE_alloc_nowait ops) and bail out
before taking nocow locks/dev refs.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:11 -04:00
Kent Overstreet
29ad31c780 bcachefs: Self healing writes are BCH_WRITE_alloc_nowait
If a drive is failing and we're moving data off of it, we can't
necessairly depend on capacity/disk reservation calculations to avoid
deadlocking/blocking on the allocator.

And, we don't want to queue up infinite self healing moves anyways.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:11 -04:00
Kent Overstreet
8ff92a9e4e bcachefs: Promotes should use BCH_WRITE_only_specified_devs
Promotes, like most other internal moves, should only go to the
specified target and not fall back to allocating from the full
filesystem.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:11 -04:00
Kent Overstreet
d0148e7169 bcachefs: Be stricter in bch2_read_retry_nodecode()
Now that data_update embeds bch_read_bio, BCH_READ_NODECODE means that
the read is embedded in a a data_update - and we can check in the retry
path if the extent has changed and bail out.

This likely fixes some subtle bugs with read errors and data moves.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:11 -04:00
Kent Overstreet
6f7111f820 bcachefs: cleanup redundant code around data_update_op initialization
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:11 -04:00
Kent Overstreet
536d789781 bcachefs: bch2_update_unwritten_extent() no longer depends on wbio
Prep work for improving bch2_data_update_init().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:11 -04:00
Kent Overstreet
8f97793d67 bcachefs: promote_op uses embedded bch_read_bio
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:11 -04:00
Kent Overstreet
a70bd97630 bcachefs: data_update now embeds bch_read_bio
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:11 -04:00
Kent Overstreet
dfa204b169 bcachefs: rbio_init() cleanup
Move more initialization to rbio_init(), to assist in further cleanups.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:11 -04:00
Kent Overstreet
0f856b7228 bcachefs: rbio_init_fragment()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:11 -04:00
Kent Overstreet
14e2523fc5 bcachefs: Rename BCH_WRITE flags fer consistency with other x-macros enums
The uppercase/lowercase style is nice for making the namespace explicit.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:11 -04:00
Kent Overstreet
9157b3ddfb bcachefs: x-macroize BCH_READ flags
Will be adding a bch2_read_bio_to_text().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:11 -04:00
Kent Overstreet
9f37016cb2 bcachefs: kill bch_read_bio.devs_have
Dead code.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:11 -04:00
Kent Overstreet
3075e68d26 bcachefs: bch2_data_update_inflight_to_text()
Add a new helper for bch2_moving_ctxt_to_text(), which may be used to
debug if moving_ios are getting stuck.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:11 -04:00
Kent Overstreet
50ca857457 bcachefs: BCH_IOCTL_QUERY_COUNTERS
Add an ioctl for querying counters, the same ones provided in
/sys/fs/bcachefs/<uuid>/counters/, but more suitable for a 'bcachefs
top' command.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:11 -04:00
Kent Overstreet
5ee760f667 bcachefs: BCH_COUNTER_bucket_discard_fast
Add a separate counter for fastpath bucket discards, which don't require
a journal flush.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:11 -04:00
Kent Overstreet
bbd804f2ad bcachefs: enum bch_persistent_counters_stable
Persistent counters, like recovery passes, include a stable enum in
their definition - but this was never correctly plumbed.

This allows us to add new counters and properly organize them with a
non-stable "presentation order", which can also be used in userspace by
the new 'bcachefs fs top' tool.

Fortunatel, since we haven't yet added any new counters where
presentation order ID doesn't match stable ID, this won't cause any
reordering issues.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:11 -04:00
Kent Overstreet
999cc1bb68 bcachefs: Separate running/runnable in wp stats
We've got per-writepoint statistics to see how well the writepoint index
update threads are pipelining; this separates running vs. runnable so we
can see at a glance if they're blocking.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:11 -04:00
Kent Overstreet
78c9c6f6cd bcachefs: Move write_points to debugfs
this was hitting the sysfs 4k limit

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:11 -04:00
Kent Overstreet
55a132c37a bcachefs: Don't inc io_(read|write) counters for moves
This makes 'bcachefs fs top' more useful; we can now see at a glance
whether the IO to the device is being done for user reads/writes, or
copygc/rebalance.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:10 -04:00
Kent Overstreet
e5a63ad343 bcachefs: Fix missing increment of move_extent_write counter
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:10 -04:00
Kent Overstreet
c3c9957c81 bcachefs: check_bp_exists() check for backpointers for stale pointers
Early version of 'bcachefs_metadata_version_cached_backpointers' was
creating backpointers for stale cached pointers - whoops. Now we have to
repair those.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:10 -04:00
Kent Overstreet
2deae55804 bcachefs: btree_node_(rewrite|update_key) cleanup
Factor out get_iter_to_node() and use it for
btree_node_rewrite_get_iter(), to be used for fixing btree node write
error behaviour.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 21:02:10 -04:00
Kent Overstreet
be212d86b1 bcachefs: bs > ps support
bcachefs removed most PAGE_SIZE references long ago, so this is easy;
only readpage_bio_extend() has to be tweaked to respect the minimum
order.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 19:45:57 -04:00
Kent Overstreet
1a2b74d0a2 bcachefs: fix build on 32 bit in get_random_u64_below()
bare 64 bit divides not allowed, whoops

arm-linux-gnueabi-ld: drivers/char/random.o: in function `__get_random_u64_below':
drivers/char/random.c:602:(.text+0xc70): undefined reference to `__aeabi_uldivmod'

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 19:45:54 -04:00
Kent Overstreet
90fd9ad5b0 bcachefs: Change btree wb assert to runtime error
We just had a report of the assert for "btree in write buffer for
non-write buffer btree" popping during the 6.14 upgrade.

- 150TB filesystem, after a reboot the upgrade was able to continue from
  where it left off, so no major damage.

But with 6.14 about to come out we want to get this tracked down asap,
and need more data if other users hit this.

Convert the BUG_ON() to an emergency read-only, and print out btree, the
key itself, and stack trace from the original write buffer update (which
did not have this check before).

Reported-by: Stijn Tintel <stijn@linux-ipv6.be>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14 10:25:25 -04:00
Kent Overstreet
9c18ea7ffe bcachefs: bch2_get_random_u64_below()
steal the (clever) algorithm from get_random_u32_below()

this fixes a bug where we were passing roundup_pow_of_two() a 64 bit
number - we're squaring device latencies now:

[  +1.681698] ------------[ cut here ]------------
[  +0.000010] UBSAN: shift-out-of-bounds in ./include/linux/log2.h:57:13
[  +0.000011] shift exponent 64 is too large for 64-bit type 'long unsigned int'
[  +0.000011] CPU: 1 UID: 0 PID: 196 Comm: kworker/u32:13 Not tainted 6.14.0-rc6-dave+ #10
[  +0.000012] Hardware name: ASUS System Product Name/PRIME B460I-PLUS, BIOS 1301 07/13/2021
[  +0.000005] Workqueue: events_unbound __bch2_read_endio [bcachefs]
[  +0.000354] Call Trace:
[  +0.000005]  <TASK>
[  +0.000007]  dump_stack_lvl+0x5d/0x80
[  +0.000018]  ubsan_epilogue+0x5/0x30
[  +0.000008]  __ubsan_handle_shift_out_of_bounds.cold+0x61/0xe6
[  +0.000011]  bch2_rand_range.cold+0x17/0x20 [bcachefs]
[  +0.000231]  bch2_bkey_pick_read_device+0x547/0x920 [bcachefs]
[  +0.000229]  __bch2_read_extent+0x1e4/0x18e0 [bcachefs]
[  +0.000241]  ? bch2_btree_iter_peek_slot+0x3df/0x800 [bcachefs]
[  +0.000180]  ? bch2_read_retry_nodecode+0x270/0x330 [bcachefs]
[  +0.000230]  bch2_read_retry_nodecode+0x270/0x330 [bcachefs]
[  +0.000230]  bch2_rbio_retry+0x1fa/0x600 [bcachefs]
[  +0.000224]  ? bch2_printbuf_make_room+0x71/0xb0 [bcachefs]
[  +0.000243]  ? bch2_read_csum_err+0x4a4/0x610 [bcachefs]
[  +0.000278]  bch2_read_csum_err+0x4a4/0x610 [bcachefs]
[  +0.000227]  ? __bch2_read_endio+0x58b/0x870 [bcachefs]
[  +0.000220]  __bch2_read_endio+0x58b/0x870 [bcachefs]
[  +0.000268]  ? try_to_wake_up+0x31c/0x7f0
[  +0.000011]  ? process_one_work+0x176/0x330
[  +0.000008]  process_one_work+0x176/0x330
[  +0.000008]  worker_thread+0x252/0x390
[  +0.000008]  ? __pfx_worker_thread+0x10/0x10
[  +0.000006]  kthread+0xec/0x230
[  +0.000011]  ? __pfx_kthread+0x10/0x10
[  +0.000009]  ret_from_fork+0x31/0x50
[  +0.000009]  ? __pfx_kthread+0x10/0x10
[  +0.000008]  ret_from_fork_asm+0x1a/0x30
[  +0.000012]  </TASK>
[  +0.000046] ---[ end trace ]---

Reported-by: Roland Vet <vet.roland@protonmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-13 12:40:22 -04:00
Kent Overstreet
69a5a13a22 bcachefs: target_congested -> get_random_u32_below()
get_random_u32_below() has a better algorithm than bch2_rand_range(),
it just didn't exist at the time.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-13 12:39:21 -04:00
Kent Overstreet
3bcde88d38 bcachefs: fix tiny leak in bch2_dev_add()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-13 00:23:19 -04:00
Kent Overstreet
dbac8feb23 bcachefs: Make sure trans is unlocked when submitting read IO
We were still using the trans after the unlock, leading to this bug in
the retry path:

00255 ------------[ cut here ]------------
00255 kernel BUG at fs/bcachefs/btree_iter.c:3348!
00255 Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
00255 bcachefs (0ca38fe8-0a26-41f9-9b5d-6a27796c7803): /fiotest offset 86048768: no device to read from:
00255   u64s 8 type extent 4098:168192:U32_MAX len 128 ver 0: durability: 0 crc: c_size 128 size 128 offset 0 nonce 0 csum crc32c 0:8040a368  compress none ec: idx 83 block 1 ptr: 0:302:128 gen 0
00255 bcachefs (0ca38fe8-0a26-41f9-9b5d-6a27796c7803): /fiotest offset 85983232: no device to read from:
00255   u64s 8 type extent 4098:168064:U32_MAX len 128 ver 0: durability: 0 crc: c_size 128 size 128 offset 0 nonce 0 csum crc32c 0:43311336  compress none ec: idx 83 block 1 ptr: 0:302:0 gen 0
00255 Modules linked in:
00255 CPU: 5 UID: 0 PID: 304 Comm: kworker/u70:2 Not tainted 6.14.0-rc6-ktest-g526aae23d67d #16040
00255 Hardware name: linux,dummy-virt (DT)
00255 Workqueue: events_unbound bch2_rbio_retry
00255 pstate: 60001005 (nZCv daif -PAN -UAO -TCO -DIT +SSBS BTYPE=--)
00255 pc : __bch2_trans_get+0x100/0x378
00255 lr : __bch2_trans_get+0xa0/0x378
00255 sp : ffffff80c865b760
00255 x29: ffffff80c865b760 x28: 0000000000000000 x27: ffffff80d76ed880
00255 x26: 0000000000000018 x25: 0000000000000000 x24: ffffff80f4ec3760
00255 x23: ffffff80f4010140 x22: 0000000000000056 x21: ffffff80f4ec0000
00255 x20: ffffff80f4ec3788 x19: ffffff80d75f8000 x18: 00000000ffffffff
00255 x17: 2065707974203820 x16: 7334367520200a3a x15: 0000000000000008
00255 x14: 0000000000000001 x13: 0000000000000100 x12: 0000000000000006
00255 x11: ffffffc080b47a40 x10: 0000000000000000 x9 : ffffffc08038dea8
00255 x8 : ffffff80d75fc018 x7 : 0000000000000000 x6 : 0000000000003788
00255 x5 : 0000000000003760 x4 : ffffff80c922de80 x3 : ffffff80f18f0000
00255 x2 : ffffff80c922de80 x1 : 0000000000000130 x0 : 0000000000000006
00255 Call trace:
00255  __bch2_trans_get+0x100/0x378 (P)
00255  bch2_read_io_err+0x98/0x260
00255  bch2_read_endio+0xb8/0x2d0
00255  __bch2_read_extent+0xce8/0xfe0
00255  __bch2_read+0x2a8/0x978
00255  bch2_rbio_retry+0x188/0x318
00255  process_one_work+0x154/0x390
00255  worker_thread+0x20c/0x3b8
00255  kthread+0xf0/0x1b0
00255  ret_from_fork+0x10/0x20
00255 Code: 6b01001f 54ffff01 79408460 3617fec0 (d4210000)
00255 ---[ end trace 0000000000000000 ]---
00255 Kernel panic - not syncing: Oops - BUG: Fatal exception
00255 SMP: stopping secondary CPUs
00255 Kernel Offset: disabled
00255 CPU features: 0x000,00000070,00000010,8240500b
00255 Memory Limit: none
00255 ---[ end Kernel panic - not syncing: Oops - BUG: Fatal exception ]---

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-11 11:21:44 -04:00
Roxana Nicolescu
58517f4df8 bcachefs: Initialize from_inode members for bch_io_opts
When there is no inode source, all "from_inode" members in the structure
bhc_io_opts should be set false.

Fixes: 7a7c43a0c1 ("bcachefs: Add bch_io_opts fields for indicating whether the opts came from the inode")
Reported-by: syzbot+c17ad4b4367b72a853cb@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=c17ad4b4367b72a853cb
Signed-off-by: Roxana Nicolescu <nicolescu.roxana@protonmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-11 11:19:33 -04:00
Alan Huang
3a04334d62 bcachefs: Fix b->written overflow
When bset past end of btree node, we should not add sectors to
b->written, which will overflow b->written.

Reported-by: syzbot+3cb3d9e8c3f197754825@syzkaller.appspotmail.com
Tested-by: syzbot+3cb3d9e8c3f197754825@syzkaller.appspotmail.com
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-11 09:19:23 -04:00
Luis Chamberlain
a64e5a5960
bdev: add back PAGE_SIZE block size validation for sb_set_blocksize()
The commit titled "block/bdev: lift block size restrictions to 64k"
lifted the block layer's max supported block size to 64k inside the
helper blk_validate_block_size() now that we support large folios.
However in lifting the block size we also removed the silly use
cases many filesystems have to use sb_set_blocksize() to *verify*
that the block size <= PAGE_SIZE. The call to sb_set_blocksize() was
used to check the block size <= PAGE_SIZE since historically we've
always supported userspace to create for example 64k block size
filesystems even on 4k page size systems, but what we didn't allow
was mounting them. Older filesystems have been using the check with
sb_set_blocksize() for years.

While, we could argue that such checks should be filesystem specific,
there are much more users of sb_set_blocksize() than LBS enabled
filesystem on upstream, so just do the easier thing and bring back
the PAGE_SIZE check for sb_set_blocksize() users and only skip it
for LBS enabled filesystems.

This will ensure that tests such as generic/466 when run in a loop
against say, ext4, won't try to try to actually mount a filesystem with
a block size larger than your filesystem supports given your PAGE_SIZE
and in the worst case crash.

Cc: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20250307020403.3068567-1-mcgrof@kernel.org
Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-07 12:56:05 +01:00
Kent Overstreet
8ba73f53dc bcachefs: copygc now skips non-rw devices
There's no point in doing copygc on non-rw devices: the fragmentation
doesn't matter if we're not writing to them, and we may not have
anywhere to put the data on our other devices.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-06 18:15:01 -05:00
Kent Overstreet
33255c161a bcachefs: Fix bch2_dev_journal_alloc() spuriously failing
Previously, we fixed journal resize spuriousl failing with
-BCH_ERR_open_buckets_empty, but initial journal allocation was missed
because it didn't invoke the "block on allocator" loop at all.

Factor out the "loop on allocator" code to fix that.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-06 18:15:01 -05:00
Kent Overstreet
4a4f9b5c7c bcachefs: Don't set BCH_FEATURE_incompat_version_field unless requested
We shouldn't be setting incompatible bits or the incompatible version
field unless explicitly request or allowed - otherwise we break mounting
with old kernels or userspace.

Reported-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-02-28 19:07:33 -05:00
NeilBrown
88d5baf690
Change inode_operations.mkdir to return struct dentry *
Some filesystems, such as NFS, cifs, ceph, and fuse, do not have
complete control of sequencing on the actual filesystem (e.g.  on a
different server) and may find that the inode created for a mkdir
request already exists in the icache and dcache by the time the mkdir
request returns.  For example, if the filesystem is mounted twice the
directory could be visible on the other mount before it is on the
original mount, and a pair of name_to_handle_at(), open_by_handle_at()
calls could instantiate the directory inode with an IS_ROOT() dentry
before the first mkdir returns.

This means that the dentry passed to ->mkdir() may not be the one that
is associated with the inode after the ->mkdir() completes.  Some
callers need to interact with the inode after the ->mkdir completes and
they currently need to perform a lookup in the (rare) case that the
dentry is no longer hashed.

This lookup-after-mkdir requires that the directory remains locked to
avoid races.  Planned future patches to lock the dentry rather than the
directory will mean that this lookup cannot be performed atomically with
the mkdir.

To remove this barrier, this patch changes ->mkdir to return the
resulting dentry if it is different from the one passed in.
Possible returns are:
  NULL - the directory was created and no other dentry was used
  ERR_PTR() - an error occurred
  non-NULL - this other dentry was spliced in

This patch only changes file-systems to return "ERR_PTR(err)" instead of
"err" or equivalent transformations.  Subsequent patches will make
further changes to some file-systems to return a correct dentry.

Not all filesystems reliably result in a positive hashed dentry:

- NFS, cifs, hostfs will sometimes need to perform a lookup of
  the name to get inode information.  Races could result in this
  returning something different. Note that this lookup is
  non-atomic which is what we are trying to avoid.  Placing the
  lookup in filesystem code means it only happens when the filesystem
  has no other option.
- kernfs and tracefs leave the dentry negative and the ->revalidate
  operation ensures that lookup will be called to correctly populate
  the dentry.  This could be fixed but I don't think it is important
  to any of the users of vfs_mkdir() which look at the dentry.

The recommendation to use
    d_drop();d_splice_alias()
is ugly but fits with current practice.  A planned future patch will
change this.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: NeilBrown <neilb@suse.de>
Link: https://lore.kernel.org/r/20250227013949.536172-2-neilb@suse.de
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-27 20:00:17 +01:00
Christian Brauner
71628584df
Merge patch series "prep patches for my mkdir series"
NeilBrown <neilb@suse.de> says:

These two patches are cleanup are dependencies for my mkdir changes and
subsequence directory locking changes.

* patches from https://lore.kernel.org/r/20250226062135.2043651-1-neilb@suse.de: (2 commits)
  nfsd: drop fh_update() from S_IFDIR branch of nfsd_create_locked()
  nfs/vfs: discard d_exact_alias()

Link: https://lore.kernel.org/r/20250226062135.2043651-1-neilb@suse.de
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-27 09:25:34 +01:00
Kent Overstreet
eb54d2695b bcachefs: Fix truncate sometimes failing and returning 1
__bch_truncate_folio() may return 1 to indicate dirtyness of the folio
being truncated, needed for fpunch to get the i_size writes correct.

But truncate was forgetting to clear ret, and sometimes returning it as
an error.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-02-26 19:31:05 -05:00
Alan Huang
677bdb7346 bcachefs: Fix deadlock
This fixes two deadlocks:

1.pcpu_alloc_mutex involved one as pointed by syzbot[1]
2.recursion deadlock.

The root cause is that we hold the bc lock during alloc_percpu, fix it
by following the pattern used by __btree_node_mem_alloc().

[1] https://lore.kernel.org/all/66f97d9a.050a0220.6bad9.001d.GAE@google.com/T/

Reported-by: syzbot+fe63f377148a6371a9db@syzkaller.appspotmail.com
Tested-by: syzbot+fe63f377148a6371a9db@syzkaller.appspotmail.com
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-02-26 19:31:05 -05:00
Kent Overstreet
7909d1fb90 bcachefs: Check for -BCH_ERR_open_buckets_empty in journal resize
This fixes occasional failures from journal resize.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-02-26 19:31:05 -05:00
Kent Overstreet
4804f3ac26 bcachefs: Revert directory i_size
This turned out to have several bugs, which were missed because the fsck
code wasn't properly reporting errors - whoops.

Kicking it out for now, hopefully it can make 6.15.

Cc: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-02-26 19:30:38 -05:00
Kent Overstreet
cf3e696026 bcachefs: fix bch2_extent_ptr_eq()
Reviewed-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-02-23 23:35:33 -05:00
Alan Huang
c522093b02 bcachefs: Fix memmove when move keys down
The fix alone doesn't fix [1], but should be applied before debugging
that.

[1] https://syzkaller.appspot.com/bug?extid=38a0cbd267eff2d286ff

Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-02-20 16:40:34 -05:00
Kent Overstreet
68aaa63716 bcachefs: print op->nonce on data update inconsistency
"nonce inconstancy" is popping up again, causing us to go emergency
read-only.

This one looks less serious, i.e. specific to the encryption path and
not indicative of a data corruption bug. But we'll need more info to
track it down.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-02-20 16:39:28 -05:00
Kent Overstreet
b04974f759 bcachefs: Fix srcu lock warning in btree_update_nodes_written()
We don't want to be holding the srcu lock while waiting on btree write
completions - easily fixed.

Reported-by: Janpieter Sollie <janpieter.sollie@edpnet.be>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-02-19 18:52:42 -05:00
Kent Overstreet
4fd509c10f bcachefs: Fix bch2_indirect_extent_missing_error()
We had some error handling confusion here;
-BCH_ERR_missing_indirect_extent is thrown by
trans_trigger_reflink_p_segment(); at this point we haven't decide
whether we're generating an error.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-02-19 17:33:13 -05:00
Kent Overstreet
b9ddb3e1a8 bcachefs: Fix fsck directory i_size checking
Error handling was wrong, causing unhandled transaction restart errors.

check_directory_size() was also inefficient, since keys in multiple
snapshots would be iterated over once for every snapshot. Convert it to
the same scheme used for i_sectors and subdir count checking.

Cc: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-02-19 13:52:27 -05:00
NeilBrown
1c3cb50b58
VFS: change kern_path_locked() and user_path_locked_at() to never return negative dentry
No callers of kern_path_locked() or user_path_locked_at() want a
negative dentry.  So change them to return -ENOENT instead.  This
simplifies callers.

This results in a subtle change to bcachefs in that an ioctl will now
return -ENOENT in preference to -EXDEV.  I believe this restores the
behaviour to what it was prior to
 Commit bbe6a7c899 ("bch2_ioctl_subvolume_destroy(): fix locking")

Signed-off-by: NeilBrown <neilb@suse.de>
Link: https://lore.kernel.org/r/20250217003020.3170652-2-neilb@suse.de
Acked-by: Paul Moore <paul@paul-moore.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-19 14:08:41 +01:00
Alan Huang
406e445b3c bcachefs: Reuse transaction
bch2_nocow_write_convert_unwritten is already in transaction context:

00191 ========= TEST   generic/648
00242 kernel BUG at fs/bcachefs/btree_iter.c:3332!
00242 Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
00242 Modules linked in:
00242 CPU: 4 UID: 0 PID: 2593 Comm: fsstress Not tainted 6.13.0-rc3-ktest-g345af8f855b7 #14403
00242 Hardware name: linux,dummy-virt (DT)
00242 pstate: 60001005 (nZCv daif -PAN -UAO -TCO -DIT +SSBS BTYPE=--)
00242 pc : __bch2_trans_get+0x120/0x410
00242 lr : __bch2_trans_get+0xcc/0x410
00242 sp : ffffff80d89af600
00242 x29: ffffff80d89af600 x28: ffffff80ddb23000 x27: 00000000fffff705
00242 x26: ffffff80ddb23028 x25: ffffff80d8903fe0 x24: ffffff80ebb30168
00242 x23: ffffff80c8aeb500 x22: 000000000000005d x21: ffffff80d8904078
00242 x20: ffffff80d8900000 x19: ffffff80da9e8000 x18: 0000000000000000
00242 x17: 64747568735f6c61 x16: 6e72756f6a20726f x15: 0000000000000028
00242 x14: 0000000000000004 x13: 000000000000f787 x12: ffffffc081bbcdc8
00242 x11: 0000000000000000 x10: 0000000000000003 x9 : ffffffc08094efbc
00242 x8 : 000000001092c111 x7 : 000000000000000c x6 : ffffffc083c31fc4
00242 x5 : ffffffc083c31f28 x4 : ffffff80c8aeb500 x3 : ffffff80ebb30000
00242 x2 : 0000000000000001 x1 : 0000000000000a21 x0 : 000000000000028e
00242 Call trace:
00242  __bch2_trans_get+0x120/0x410 (P)
00242  bch2_inum_offset_err_msg+0x48/0xb0
00242  bch2_nocow_write_convert_unwritten+0x3d0/0x530
00242  bch2_nocow_write+0xeb0/0x1000
00242  __bch2_write+0x330/0x4e8
00242  bch2_write+0x1f0/0x530
00242  bch2_direct_write+0x530/0xc00
00242  bch2_write_iter+0x160/0xbe0
00242  vfs_write+0x1cc/0x360
00242  ksys_write+0x5c/0xf0
00242  __arm64_sys_write+0x20/0x30
00242  invoke_syscall.constprop.0+0x54/0xe8
00242  do_el0_svc+0x44/0xc0
00242  el0_svc+0x34/0xa0
00242  el0t_64_sync_handler+0x104/0x130
00242  el0t_64_sync+0x154/0x158
00242 Code: 6b01001f 54ffff01 79408460 3617fec0 (d4210000)
00242 ---[ end trace 0000000000000000 ]---
00242 Kernel panic - not syncing: Oops - BUG: Fatal exception

Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-02-12 18:44:50 -05:00
Alan Huang
531323a2ef bcachefs: Pass _orig_restart_count to trans_was_restarted
_orig_restart_count is unused now, according to the logic, trans_was_restarted
should be using _orig_restart_count.

Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-02-12 18:40:19 -05:00
Kent Overstreet
9cf6b84b71 bcachefs: CONFIG_BCACHEFS_INJECT_TRANSACTION_RESTARTS
Incorrectly handled transaction restarts can be a source of heisenbugs;
add a mode where we randomly inject them to shake them out.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-02-12 18:40:19 -05:00
Kent Overstreet
9f734cd076 bcachefs: Fix want_new_bset() so we write until the end of the btree node
want_new_bset() returns the address of a new bset to initialize if we
wish to do so in a btree node - either because the previous one is too
big, or because it's been written.

The case for 'previous bset was written' was wrong: it's only supposed
to check for if we have space in the node for one more block, but
because it subtracted the header from the space available it would never
initialize a new bset if we were down to the last block in a node.

Fixing this results in fewer btree node splits/compactions, which fixes
a bug with flushing the journal to go read-only sometimes not
terminating or taking excessively long.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-02-11 10:10:32 -05:00
Kent Overstreet
1e690efa72 bcachefs: Split out journal pins by btree level
This lets us flush the journal to go read-only more effectively.

Flushing the journal and going read-only requires halting mutually
recursive processes, which strictly speaking are not guaranteed to
terminate.

Flushing btree node journal pins will kick off a btree node write, and
btree node writes on completion must do another btree update to the
parent node to update the 'sectors_written' field for that node's key.

If the parent node is full and requires a split or compaction, that's
going to generate a whole bunch of additional btree updates - alloc
info, LRU btree, and more - which then have to be flushed, and the cycle
repeats.

This process will terminate much more effectively if we tweak journal
reclaim to flush btree updates leaf to root: i.e., don't flush updates
for a given btree node (kicking off a write, and consuming space within
that node up to the next block boundary) if there might still be
unflushed updates in child nodes.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-02-11 10:10:32 -05:00
Alan Huang
1c316eb57c bcachefs: Fix use after free
acc->k.data should be used with the lock hold:

00221 ========= TEST   generic/187
00221        run fstests generic/187 at 2025-02-09 21:08:10
00221 spectre-v4 mitigation disabled by command-line option
00222 bcachefs (vdc): starting version 1.20: directory_size opts=errors=ro
00222 bcachefs (vdc): initializing new filesystem
00222 bcachefs (vdc): going read-write
00222 bcachefs (vdc): marking superblocks
00222 bcachefs (vdc): initializing freespace
00222 bcachefs (vdc): done initializing freespace
00222 bcachefs (vdc): reading snapshots table
00222 bcachefs (vdc): reading snapshots done
00222 bcachefs (vdc): done starting filesystem
00222 bcachefs (vdc): shutting down
00222 bcachefs (vdc): going read-only
00222 bcachefs (vdc): finished waiting for writes to stop
00223 bcachefs (vdc): flushing journal and stopping allocators, journal seq 6
00223 bcachefs (vdc): flushing journal and stopping allocators complete, journal seq 8
00223 bcachefs (vdc): clean shutdown complete, journal seq 9
00223 bcachefs (vdc): marking filesystem clean
00223 bcachefs (vdc): shutdown complete
00223 bcachefs (vdc): starting version 1.20: directory_size opts=errors=ro
00223 bcachefs (vdc): initializing new filesystem
00223 bcachefs (vdc): going read-write
00223 bcachefs (vdc): marking superblocks
00223 bcachefs (vdc): initializing freespace
00223 bcachefs (vdc): done initializing freespace
00223 bcachefs (vdc): reading snapshots table
00223 bcachefs (vdc): reading snapshots done
00223 bcachefs (vdc): done starting filesystem
00244 hrtimer: interrupt took 123350440 ns
00264 bcachefs (vdc): shutting down
00264 bcachefs (vdc): going read-only
00264 bcachefs (vdc): finished waiting for writes to stop
00264 bcachefs (vdc): flushing journal and stopping allocators, journal seq 97
00265 bcachefs (vdc): flushing journal and stopping allocators complete, journal seq 101
00265 bcachefs (vdc): clean shutdown complete, journal seq 102
00265 bcachefs (vdc): marking filesystem clean
00265 bcachefs (vdc): shutdown complete
00265 bcachefs (vdc): starting version 1.20: directory_size opts=errors=ro
00265 bcachefs (vdc): recovering from clean shutdown, journal seq 102
00265 bcachefs (vdc): accounting_read...
00265 ==================================================================
00265  done
00265 BUG: KASAN: slab-use-after-free in bch2_fs_to_text+0x12b4/0x1728
00265 bcachefs (vdc): alloc_read... done
00265 bcachefs (vdc): stripes_read... done
00265 Read of size 4 at addr ffffff80c57eac00 by task cat/7531
00265 bcachefs (vdc): snapshots_read... done
00265
00265 CPU: 6 UID: 0 PID: 7531 Comm: cat Not tainted 6.13.0-rc3-ktest-g16fc6fa3819d #14103
00265 Hardware name: linux,dummy-virt (DT)
00265 Call trace:
00265  show_stack+0x1c/0x30 (C)
00265  dump_stack_lvl+0x6c/0x80
00265  print_report+0xf8/0x5d8
00265  kasan_report+0x90/0xd0
00265  __asan_report_load4_noabort+0x1c/0x28
00265  bch2_fs_to_text+0x12b4/0x1728
00265  bch2_fs_show+0x94/0x188
00265  sysfs_kf_seq_show+0x1a4/0x348
00265  kernfs_seq_show+0x12c/0x198
00265  seq_read_iter+0x27c/0xfd0
00265  kernfs_fop_read_iter+0x390/0x4f8
00265  vfs_read+0x480/0x7f0
00265  ksys_read+0xe0/0x1e8
00265  __arm64_sys_read+0x70/0xa8
00265  invoke_syscall.constprop.0+0x74/0x1e8
00265  do_el0_svc+0xc8/0x1c8
00265  el0_svc+0x20/0x60
00265  el0t_64_sync_handler+0x104/0x130
00265  el0t_64_sync+0x154/0x158
00265
00265 Allocated by task 7510:
00265  kasan_save_stack+0x28/0x50
00265  kasan_save_track+0x1c/0x38
00265  kasan_save_alloc_info+0x3c/0x50
00265  __kasan_kmalloc+0xac/0xb0
00265  __kmalloc_node_noprof+0x168/0x348
00265  __kvmalloc_node_noprof+0x20/0x140
00265  __bch2_darray_resize_noprof+0x90/0x1b0
00265  __bch2_accounting_mem_insert+0x76c/0xb08
00265  bch2_accounting_mem_insert+0x224/0x3b8
00265  bch2_accounting_mem_mod_locked+0x480/0xc58
00265  bch2_accounting_read+0xa94/0x3eb8
00265  bch2_run_recovery_pass+0x80/0x178
00265  bch2_run_recovery_passes+0x340/0x698
00265  bch2_fs_recovery+0x1c98/0x2bd8
00265  bch2_fs_start+0x240/0x490
00265  bch2_fs_get_tree+0xe1c/0x1458
00265  vfs_get_tree+0x7c/0x250
00265  path_mount+0xe24/0x1648
00265  __arm64_sys_mount+0x240/0x438
00265  invoke_syscall.constprop.0+0x74/0x1e8
00265  do_el0_svc+0xc8/0x1c8
00265  el0_svc+0x20/0x60
00265  el0t_64_sync_handler+0x104/0x130
00265  el0t_64_sync+0x154/0x158
00265
00265 Freed by task 7510:
00265  kasan_save_stack+0x28/0x50
00265  kasan_save_track+0x1c/0x38
00265  kasan_save_free_info+0x48/0x88
00265  __kasan_slab_free+0x48/0x60
00265  kfree+0x188/0x408
00265  kvfree+0x3c/0x50
00265  __bch2_darray_resize_noprof+0xe0/0x1b0
00265  __bch2_accounting_mem_insert+0x76c/0xb08
00265  bch2_accounting_mem_insert+0x224/0x3b8
00265  bch2_accounting_mem_mod_locked+0x480/0xc58
00265  bch2_accounting_read+0xa94/0x3eb8
00265  bch2_run_recovery_pass+0x80/0x178
00265  bch2_run_recovery_passes+0x340/0x698
00265  bch2_fs_recovery+0x1c98/0x2bd8
00265  bch2_fs_start+0x240/0x490
00265  bch2_fs_get_tree+0xe1c/0x1458
00265  vfs_get_tree+0x7c/0x250
00265  path_mount+0xe24/0x1648
00265 bcachefs (vdc): going read-write
00265  __arm64_sys_mount+0x240/0x438
00265  invoke_syscall.constprop.0+0x74/0x1e8
00265  do_el0_svc+0xc8/0x1c8
00265  el0_svc+0x20/0x60
00265  el0t_64_sync_handler+0x104/0x130
00265  el0t_64_sync+0x154/0x158
00265
00265 The buggy address belongs to the object at ffffff80c57eac00
00265  which belongs to the cache kmalloc-128 of size 128
00265 The buggy address is located 0 bytes inside of
00265  freed 128-byte region [ffffff80c57eac00, ffffff80c57eac80)
00265
00265 The buggy address belongs to the physical page:
00265 page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1057ea
00265 head: order:1 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
00265 flags: 0x8000000000000040(head|zone=2)
00265 page_type: f5(slab)
00265 raw: 8000000000000040 ffffff80c0002800 dead000000000100 dead000000000122
00265 raw: 0000000000000000 0000000000200020 00000001f5000000 ffffff80c57a6400
00265 head: 8000000000000040 ffffff80c0002800 dead000000000100 dead000000000122
00265 head: 0000000000000000 0000000000200020 00000001f5000000 ffffff80c57a6400
00265 head: 8000000000000001 fffffffec315fa81 ffffffffffffffff 0000000000000000
00265 head: 0000000000000002 0000000000000000 00000000ffffffff 0000000000000000
00265 page dumped because: kasan: bad access detected
00265
00265 Memory state around the buggy address:
00265  ffffff80c57eab00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00265  ffffff80c57eab80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
00265 >ffffff80c57eac00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
00265                    ^
00265  ffffff80c57eac80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
00265  ffffff80c57ead00: 00 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc
00265 ==================================================================
00265 Kernel panic - not syncing: kasan.fault=panic set ...
00265 CPU: 6 UID: 0 PID: 7531 Comm: cat Not tainted 6.13.0-rc3-ktest-g16fc6fa3819d #14103
00265 Hardware name: linux,dummy-virt (DT)
00265 Call trace:
00265  show_stack+0x1c/0x30 (C)
00265  dump_stack_lvl+0x30/0x80
00265  dump_stack+0x18/0x20
00265  panic+0x4d4/0x518
00265  start_report.constprop.0+0x0/0x90
00265  kasan_report+0xa0/0xd0
00265  __asan_report_load4_noabort+0x1c/0x28
00265  bch2_fs_to_text+0x12b4/0x1728
00265  bch2_fs_show+0x94/0x188
00265  sysfs_kf_seq_show+0x1a4/0x348
00265  kernfs_seq_show+0x12c/0x198
00265  seq_read_iter+0x27c/0xfd0
00265  kernfs_fop_read_iter+0x390/0x4f8
00265  vfs_read+0x480/0x7f0
00265  ksys_read+0xe0/0x1e8
00265  __arm64_sys_read+0x70/0xa8
00265  invoke_syscall.constprop.0+0x74/0x1e8
00265  do_el0_svc+0xc8/0x1c8
00265  el0_svc+0x20/0x60
00265  el0t_64_sync_handler+0x104/0x130
00265  el0t_64_sync+0x154/0x158
00265 SMP: stopping secondary CPUs
00265 Kernel Offset: disabled
00265 CPU features: 0x000,00000070,00000010,8240500b
00265 Memory Limit: none
00265 ---[ end Kernel panic - not syncing: kasan.fault=panic set ... ]---
00270 ========= FAILED TIMEOUT generic.187 in 1200s

Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-02-11 10:10:32 -05:00
Kent Overstreet
595170d4b6 bcachefs: Fix marking reflink pointers to missing indirect extents
reflink pointers to missing indirect extents aren't deleted, they just
have an error bit set - in case the indirect extent somehow reappears.

fsck/mark and sweep thus needs to ignore these errors.

Also, they can be marked AUTOFIX now.

Reported-by: Roland Vet <vet.roland@protonmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-02-07 14:49:47 -05:00
Kent Overstreet
4be214c269 bcachefs: bch2_bkey_sectors_need_rebalance() now only depends on bch_extent_rebalance
Previously, bch2_bkey_sectors_need_rebalance() called
bch2_target_accepts_data(), checking whether the target is writable.

However, this means that adding or removing devices from a target would
change the value of bch2_bkey_sectors_need_rebalance() for an existing
extent; this needs to be invariant so that the extent trigger can
correctly maintain rebalance_work accounting.

Instead, check target_accepts_data() in io_opts_to_rebalance_opts(),
before creating the bch_extent_rebalance entry.

This fixes (one?) cause of rebalance_work accounting being off.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-02-06 22:35:11 -05:00
Kent Overstreet
3539880ef1 bcachefs: Fix rcu imbalance in bch2_fs_btree_key_cache_exit()
Spotted by sparse.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-02-06 22:35:11 -05:00
Kent Overstreet
9e9033522a bcachefs: Fix discard path journal flushing
The discard path is supposed to issue journal flushes when there's too
many buckets empty buckets that need a journal commit before they can be
written to again, but at some point this code seems to have been lost.

Bring it back with a new optimization to make sure we don't issue too
many journal flushes: the journal now tracks the sequence number of the
most recent flush in progress, which the discard path uses when deciding
which buckets need a journal flush.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-02-06 22:35:11 -05:00
Jeongjun Park
2ef995df0c bcachefs: fix deadlock in journal_entry_open()
In the previous commit b3d82c2f27, code was added to prevent journal sequence
overflow. Among them, the code added to journal_entry_open() uses the
bch2_fs_fatal_err_on() function to handle errors.

However, __journal_res_get() , which calls journal_entry_open() , calls
journal_entry_open() while holding journal->lock , but bch2_fs_fatal_err_on()
internally tries to acquire journal->lock , which results in a deadlock.

So we need to add a locked helper to handle fatal errors even when the
journal->lock is held.

Fixes: b3d82c2f27 ("bcachefs: Guard against journal seq overflow")
Signed-off-by: Jeongjun Park <aha310510@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-02-06 22:35:11 -05:00
Jeongjun Park
6b37037d6d bcachefs: fix incorrect pointer check in __bch2_subvolume_delete()
For some unknown reason, checks on struct bkey_s_c_snapshot and struct
bkey_s_c_snapshot_tree pointers are missing.

Therefore, I think it would be appropriate to fix the incorrect pointer checking
through this patch.

Fixes: 4bd06f07bc ("bcachefs: Fixes for snapshot_tree.master_subvol")
Signed-off-by: Jeongjun Park <aha310510@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-02-06 22:35:11 -05:00
Linus Torvalds
a86bf2283d assorted stuff for this merge window
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZ5yJdgAKCRBZ7Krx/gZQ
 69W4AQDwgxceiQ6icx3rFhCWQigne4jdMO84kd8tNaa+xHGe1AD/WnkeChc5DqjQ
 wZWZxAAzml9SS01IcSiHWaF5fgrjlA0=
 =rXOq
 -----END PGP SIGNATURE-----

Merge tag 'pull-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull misc vfs cleanups from Al Viro:
 "Two unrelated patches - one is a removal of long-obsolete include in
  overlayfs (it used to need fs/internal.h, but the extern it wanted has
  been moved back to include/linux/namei.h) and another introduces
  convenience helper constructing struct qstr by a NUL-terminated
  string"

* tag 'pull-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  add a string-to-qstr constructor
  fs/overlayfs/namei.c: get rid of include ../internal.h
2025-02-01 15:07:56 -08:00
Linus Torvalds
8080ff5ac6 bcachefs fixes for 6.14-rc1
- second half of a fix for a bug that'd been causing oopses on
   filesystems using snapshots with memory pressure (key cache fills for
   snaphots btrees are tricky)
 - build fix for strange compiler configurations that double stack frame
   size
 - "journal stuck timeout" now takes into account device latency: this
   fixes some spurious warnings, and the main remaining source of SRCU
   lock hold time warnings (I'm no longer seeing this in my CI, so any
   users still seeing this should definitely ping me)
 - fix for slow/hanging unmounts (" Improve journal pin flushing")
 - some more tracepoint fixes/improvements, to chase down the "rebalance
   isn't making progress" issues
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmeajBsACgkQE6szbY3K
 bna2BA/9E/+WBDFQHkLJ4kNQBxKL4u1xfav5kKGZ79mUlqruhr3AckLPFzWmQO21
 eOJE0NeyvpsvLewDXMGZ8w/Nm3Vdc53X6ATKkQaF/UoTYVWbubmF62sXBzSS8TUh
 YIM6s24/CbCi8lT49JAuIaG3OC21KH0X0zOcvyepfmn1aiPNLr4y7zOWKynOhgCt
 mAt374ayUDDTgoQmXqPIrGp8eD/C+vUjo1ief+DIGMQmDj4uHpb5iBbmjXm8FF9x
 4TcrP1UjWpiWPcHeb98H/CBWOnjDSgOFhYmxhVOvkDpC6XbtPSgKQIOs7tSJ0Nuo
 IOzrGuBPVfd2m+wgXsn7zbn0HNOjS76sCo92K1lAdS86k0eqRfXmCxkU6FUphNkA
 WCG8WrK0RjHL132iR97dtv36No8ji5mZN1ILPk/h4KRkoKC+9fA8BaJdAGVt+6NP
 wZLtZxZkV8BqgXF41HwzHt54YftRPn2kR47Jfu1rPimSUd4Uqy8Yjw2J/fUT7eAd
 6JdfiadhAtMRWnFGzmVs4LsEWJ7Ja7GnG7jhjzlACsqbXsTU8k16Wq38IchC6mi+
 p+hqq9pLAeosW9Lk/QTGFrq52aQfyOzdUjq1pyCcEYtZFNqjj8GmmHVejxZWiRTo
 C6dTEkSIMcBx+9QP8BJ5o+xMR02KABn+8x43TzQxQ2DXj0QamTA=
 =QJHL
 -----END PGP SIGNATURE-----

Merge tag 'bcachefs-2025-01-29' of git://evilpiepirate.org/bcachefs

Pull bcachefs fixes from Kent Overstreet:

 - second half of a fix for a bug that'd been causing oopses on
   filesystems using snapshots with memory pressure (key cache fills for
   snaphots btrees are tricky)

 - build fix for strange compiler configurations that double stack frame
   size

 - "journal stuck timeout" now takes into account device latency: this
   fixes some spurious warnings, and the main remaining source of SRCU
   lock hold time warnings (I'm no longer seeing this in my CI, so any
   users still seeing this should definitely ping me)

 - fix for slow/hanging unmounts (" Improve journal pin flushing")

 - some more tracepoint fixes/improvements, to chase down the "rebalance
   isn't making progress" issues

* tag 'bcachefs-2025-01-29' of git://evilpiepirate.org/bcachefs:
  bcachefs: Improve trace_move_extent_finish
  bcachefs: Fix trace_copygc
  bcachefs: Journal writes are now IOPRIO_CLASS_RT
  bcachefs: Improve journal pin flushing
  bcachefs: fix bch2_btree_node_flags
  bcachefs: rebalance, copygc enabled are runtime opts
  bcachefs: Improve decompression error messages
  bcachefs: bset_blacklisted_journal_seq is now AUTOFIX
  bcachefs: "Journal stuck" timeout now takes into account device latency
  bcachefs: Reduce stack frame size of __bch2_str_hash_check_key()
  bcachefs: Fix btree_trans_peek_key_cache()
2025-01-30 08:42:50 -08:00
Al Viro
c1feab95e0 add a string-to-qstr constructor
Quite a few places want to build a struct qstr by given string;
it would be convenient to have a primitive doing that, rather
than open-coding it via QSTR_INIT().

The closest approximation was in bcachefs, but that expands to
initializer list - {.len = strlen(string), .name = string}.
It would be more useful to have it as compound literal -
(struct qstr){.len = strlen(string), .name = string}.

Unlike initializer list it's a valid expression.  What's more,
it's a valid lvalue - it's an equivalent of anonymous local
variable with such initializer, so the things like
	path->dentry = d_alloc_pseudo(mnt->mnt_sb, &QSTR(name));
are valid.  It can also be used as initializer, with identical
effect -
	struct qstr x = (struct qstr){.name = s, .len = strlen(s)};
is equivalent to
	struct qstr anon_variable = {.name = s, .len = strlen(s)};
	struct qstr x = anon_variable;
	// anon_variable is never used after that point
and any even remotely sane compiler will manage to collapse that
into
	struct qstr x = {.name = s, .len = strlen(s)};

What compound literals can't be used for is initialization of
global variables, but those are covered by QSTR_INIT().

This commit lifts definition(s) of QSTR() into linux/dcache.h,
converts it to compound literal (all bcachefs users are fine
with that) and converts assorted open-coded instances to using
that.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-01-27 19:25:45 -05:00
Kent Overstreet
5d9ccda9ba bcachefs: Improve trace_move_extent_finish
We're currently debugging issues with rebalance, where it's not making
progress as quickly as it should be (or sometimes not at all).

Add the full data_update to the move_extent_finish tracepoint, so we can
check that the replicas we wrote match what we were supposed to do.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-26 23:02:28 -05:00
Kent Overstreet
0e458a616f bcachefs: Fix trace_copygc
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-26 23:02:28 -05:00
Kent Overstreet
75474a54ed bcachefs: Journal writes are now IOPRIO_CLASS_RT
System performance is particularly sensitive to journal write latency,
the number of outstanding journal writes is bounded and we can't issue
journal flushes until other journal writes have completed.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-26 23:02:28 -05:00
Kent Overstreet
35f5197009 bcachefs: Improve journal pin flushing
Running the preempt tiering tests with a lower than normal journal
reclaim delay turned up a shutdown hang - a lost wakeup, caused because
flushing a journal pin (e.g. key cache/write buffer) can generate a new
journal pin.

The "simple" fix of adding the correct wakeup didn't work because of
ordering issues; if we flush btree node pins too aggressively before
other pins have completed, we end up spinning where each flush iteration
generates new work.

So to fix this correctly:
- The list of flushed journal pins is now broken out by type, so that
  we can wait for key cache/write buffer pin flushing to complete
  before flushing dirty btree nodes

- A new closure_waitlist is added for bch2_journal_flush_pins; this one
  is only used under or when we're taking the journal lock, so it's
  pretty cheap to add rigorously correct wakeups to journal_pin_set()
  and journal_pin_drop().

Additionally, bch2_journal_seq_pins_to_text() is moved to
journal_reclaim.c, where it belongs, along with a bit of other small
renaming and refactoring.

Besides fixing the hang, the better ordering between key cache/write
buffer flushing and btree node flushing should help or fix the "unmount
taking excessively long" a few users have been noticing.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-25 19:37:43 -05:00
Kent Overstreet
0c74c85bbe bcachefs: fix bch2_btree_node_flags
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-25 19:33:19 -05:00
Kent Overstreet
37fd6b8176 bcachefs: rebalance, copygc enabled are runtime opts
Fix a regression from when these were switched to normal opts.h options.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-25 19:33:19 -05:00
Kent Overstreet
2efbc3518f bcachefs: Improve decompression error messages
Ratelimit them, and use the new bch2_write_op_error() helper that prints
path and file offset.

Reported-by: https://github.com/koverstreet/bcachefs/issues/819
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-25 14:43:13 -05:00
Linus Torvalds
37b33c68b0 CRC updates for 6.14
- Reorganize the architecture-optimized CRC32 and CRC-T10DIF code to be
   directly accessible via the library API, instead of requiring the
   crypto API.  This is much simpler and more efficient.
 
 - Convert some users such as ext4 to use the CRC32 library API instead
   of the crypto API.  More conversions like this will come later.
 
 - Add a KUnit test that tests and benchmarks multiple CRC variants.
   Remove older, less-comprehensive tests that are made redundant by
   this.
 
 - Add an entry to MAINTAINERS for the kernel's CRC library code.  I'm
   volunteering to maintain it.  I have additional cleanups and
   optimizations planned for future cycles.
 
 These patches have been in linux-next since -rc1.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCZ418ZRQcZWJpZ2dlcnNA
 Z29vZ2xlLmNvbQAKCRDzXCl4vpKOKyJYAP9kBlpm8W9/XY6N8SpjKaXE/vKQYHQl
 Nobhak06Us8uJwEAkcUTymWP4IwQj5A9jgBAPRw53FQcNVKIc+01C7gRHw0=
 =mqSH
 -----END PGP SIGNATURE-----

Merge tag 'crc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux

Pull CRC updates from Eric Biggers:

 - Reorganize the architecture-optimized CRC32 and CRC-T10DIF code to be
   directly accessible via the library API, instead of requiring the
   crypto API. This is much simpler and more efficient.

 - Convert some users such as ext4 to use the CRC32 library API instead
   of the crypto API. More conversions like this will come later.

 - Add a KUnit test that tests and benchmarks multiple CRC variants.
   Remove older, less-comprehensive tests that are made redundant by
   this.

 - Add an entry to MAINTAINERS for the kernel's CRC library code. I'm
   volunteering to maintain it. I have additional cleanups and
   optimizations planned for future cycles.

* tag 'crc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux: (31 commits)
  MAINTAINERS: add entry for CRC library
  powerpc/crc: delete obsolete crc-vpmsum_test.c
  lib/crc32test: delete obsolete crc32test.c
  lib/crc16_kunit: delete obsolete crc16_kunit.c
  lib/crc_kunit.c: add KUnit test suite for CRC library functions
  powerpc/crc-t10dif: expose CRC-T10DIF function through lib
  arm64/crc-t10dif: expose CRC-T10DIF function through lib
  arm/crc-t10dif: expose CRC-T10DIF function through lib
  x86/crc-t10dif: expose CRC-T10DIF function through lib
  crypto: crct10dif - expose arch-optimized lib function
  lib/crc-t10dif: add support for arch overrides
  lib/crc-t10dif: stop wrapping the crypto API
  scsi: target: iscsi: switch to using the crc32c library
  f2fs: switch to using the crc32 library
  jbd2: switch to using the crc32c library
  ext4: switch to using the crc32c library
  lib/crc32: make crc32c() go directly to lib
  bcachefs: Explicitly select CRYPTO from BCACHEFS_FS
  x86/crc32: expose CRC32 functions through lib
  x86/crc32: update prototype for crc32_pclmul_le_16()
  ...
2025-01-22 19:55:08 -08:00
Kent Overstreet
c9c8a17f7a bcachefs: bset_blacklisted_journal_seq is now AUTOFIX
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-21 23:05:32 -05:00
Kent Overstreet
2c5d8a8347 bcachefs: "Journal stuck" timeout now takes into account device latency
If a block device (e.g. your typical consumer SSD) is taking multiple
seconds for IOs (typically flushes), we don't want to emit the "journal
stuck" message prematurely.

Also, make sure to drop the btree_trans srcu lock if we're blocking for
more than a second.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-21 18:32:05 -05:00
Kent Overstreet
f917016f69 bcachefs: Reduce stack frame size of __bch2_str_hash_check_key()
We don't need all the helpers inlined here.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-21 12:57:48 -05:00
Kent Overstreet
a858175227 bcachefs: Fix btree_trans_peek_key_cache()
BTREE_ITER_cached_nofill has some tricky corner cases; it's used
internally for iterators that aren't walking the key cache, but need to
be coherent with the key cache.

It tells traverse to look up and lock the key cache entry if present,
but don't create one if it doesn't exist.

That means we have to have a BTREE_ITER_UPTODATE path (because after
traverse the path has to be UPTODATE, or we pop assertions) that doesn't
point to anything (which is the less bad option, taken by the previous
fix).

The previous fix for this path missed an issue that can happen in
bch2_trans_peek_key_cache(): we can't set should_be_locked on a path
that doesn't point to anything and doesn't hold locks.

Fixes: bd5b09727f ("bcachefs: Don't set btree_path to updtodate if we don't fill")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-21 12:26:25 -05:00
Linus Torvalds
1cbfb828e0 for-6.14/block-20250118
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmeL6hoQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgppw2EADQV8nDgLRggZR+il4U03yKHXcQEdAX1GrB
 Erowx+dasIJuh6kp3n6qRe9QD/pRqt1DKyLvXoWF8Qfuwq85j7oDnDDYxutNYT27
 hDgrLJriJ3VeKYtTu+andHWt8P29b5h57UayInDOUJurEPA6rXyFZ5YVIti8n21K
 uDOrQXiACG3qRWS2+p2f3UNhX0MkFNFdN/lxi13WMIJtRWF5bXAP+JOgIWCID4Ze
 QuSY6rQD4dp4Q6M2erpX6tn0YZb7Hvw3rPjsd91n6jvYfTUVLH375zg8jCBpi6Wi
 Syufbb8xcTtriVPTDRNu0ekjebkc8wD8ax/h86g0z9v3Ua4DlNmsx9eXrtv6r5nu
 YXqDODOad6stI0+owFquW2vas0gHmfNSfyfGdlk2g24PMtP5Yx0V6FIEvwIeqnje
 ghgxQvBuKUsdhqakByfNnc+XvXi3+RUJek8kvMeUSUQWT1IyMQqPOOk0yp9WdyWD
 bY1f2ECP5BR1b37zYOyawewsI5xTupHUswn5a4r4qtGn3O15rGDkX98Nab5aLCnR
 rW/DvX7+wT6gW9EwrRHiwjwfNDZbsJ9Ggu3lMhtUl5GUWdk58yTiVgKaHJLnlX9/
 CKFKfyyIR1Vl8+gYIpemyFhhcoN+dCSf06ISkrg0jeS0/tYwydaAaCBPL5J4kxZA
 h3Rtbh+Pgg==
 =EXYs
 -----END PGP SIGNATURE-----

Merge tag 'for-6.14/block-20250118' of git://git.kernel.dk/linux

Pull block updates from Jens Axboe:

 - NVMe pull requests via Keith:
      - Target support for PCI-Endpoint transport (Damien)
      - TCP IO queue spreading fixes (Sagi, Chaitanya)
      - Target handling for "limited retry" flags (Guixen)
      - Poll type fix (Yongsoo)
      - Xarray storage error handling (Keisuke)
      - Host memory buffer free size fix on error (Francis)

 - MD pull requests via Song:
      - Reintroduce md-linear (Yu Kuai)
      - md-bitmap refactor and fix (Yu Kuai)
      - Replace kmap_atomic with kmap_local_page (David Reaver)

 - Quite a few queue freeze and debugfs deadlock fixes

   Ming introduced lockdep support for this in the 6.13 kernel, and it
   has (unsurprisingly) uncovered quite a few issues

 - Use const attributes for IO schedulers

 - Remove bio ioprio wrappers

 - Fixes for stacked device atomic write support

 - Refactor queue affinity helpers, in preparation for better supporting
   isolated CPUs

 - Cleanups of loop O_DIRECT handling

 - Cleanup of BLK_MQ_F_* flags

 - Add rotational support for null_blk

 - Various fixes and cleanups

* tag 'for-6.14/block-20250118' of git://git.kernel.dk/linux: (106 commits)
  block: Don't trim an atomic write
  block: Add common atomic writes enable flag
  md/md-linear: Fix a NULL vs IS_ERR() bug in linear_add()
  block: limit disk max sectors to (LLONG_MAX >> 9)
  block: Change blk_stack_atomic_writes_limits() unit_min check
  block: Ensure start sector is aligned for stacking atomic writes
  blk-mq: Move more error handling into blk_mq_submit_bio()
  block: Reorder the request allocation code in blk_mq_submit_bio()
  nvme: fix bogus kzalloc() return check in nvme_init_effects_log()
  md/md-bitmap: move bitmap_{start, end}write to md upper layer
  md/raid5: implement pers->bitmap_sector()
  md: add a new callback pers->bitmap_sector()
  md/md-bitmap: remove the last parameter for bimtap_ops->endwrite()
  md/md-bitmap: factor behind write counters out from bitmap_{start/end}write()
  md: Replace deprecated kmap_atomic() with kmap_local_page()
  md: reintroduce md-linear
  partitions: ldm: remove the initial kernel-doc notation
  blk-cgroup: rwstat: fix kernel-doc warnings in header file
  blk-cgroup: fix kernel-doc warnings in header file
  nbd: fix partial sending
  ...
2025-01-20 19:38:46 -08:00
Kent Overstreet
ff0b7ed607 bcachefs: Fix check_inode_hash_info_matches_root()
Can't use memcmp() when the struct contains padding.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-15 15:28:23 -05:00
Kent Overstreet
a4e11cea27 bcachefs: Document issue with bch_stripe layout
We've got a problem with bch_stripe that is going to take an on disk
format rev to fix - we can't access the block sector counts if the
checksum type is unknown.

Document it for now, there are a few other things to fix as well.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-14 10:45:31 -05:00
Kent Overstreet
78423deb51 bcachefs: Fix self healing on read error
We were incorrectly checking if there'd been an io error.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-14 10:45:31 -05:00
Alan Huang
5dd21b2712 bcachefs: Pop all the transactions from the abort one
The transaction is going to abort, so there will be no cycle involving
this transaction anymore.

Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-14 10:45:25 -05:00
Alan Huang
b169138d48 bcachefs: Only abort the transactions in the cycle
When the cycle doesn't involve the initiator of the cycle detection,
we might choose a transaction that is not involved in the cycle to abort.
It shouldn't be that since it won't break the cycle, this patch
therefore chooses the transaction in the cycle to abort.

Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-14 10:45:18 -05:00
Alan Huang
6853a5e5d4 bcachefs: Introduce lock_graph_pop_from
This patch introduces a helper function called lock_graph_pop_from,
it pops the graph from i.

Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-14 10:45:13 -05:00
Alan Huang
b5c3dcd0db bcachefs: Convert open-coded lock_graph_pop_all to helper
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-14 10:45:08 -05:00
Alan Huang
0ef9ab34f4 bcachefs: Do not allow no fail lock request to fail
If the transaction chose itself as a victim before and restarted, it
might request a no fail lock request this time. But it might be added to
others' lock graph and be chose as the victim again, it's no longer safe
without additional check. We can also convert the cycle detector to be
fully RCU-based to solve that unsoundness, but the latency added to trans_put
and additional memory required may not worth it.

Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-14 10:45:08 -05:00
Alan Huang
cdc419dbf2 bcachefs: Merge the condition to avoid additional invocation
If the lock has been acquired and unlocked, we don't have to do clear
and wakeup again, though harmless since we hold the intent lock. Merge
the condition might be clearer.

Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-14 10:45:08 -05:00
Alan Huang
9c13cc9c7d Revert "bcachefs: Fix bch2_btree_node_upgrade()"
This reverts commit 62448afee7.

six_lock_tryupgrade fails only if there is an intent lock held,
it won't fail no matter how many read locks are held.

Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-14 10:45:08 -05:00
Hongbo Li
c72deb03ff bcachefs: bcachefs_metadata_version_directory_size
This adds another metadata version for accounting directory size.
For the new version of the filesystem, when new subdirectory items
are created or deleted, the parent directory's size will change
accordingly. For the old version of the existed file system, running
fsck will automatically upgrade the metadata version, and it will
do the check and recalculationg of the directory size.

Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-13 14:58:38 -05:00
Hongbo Li
e614a6c52d bcachefs: make directory i_size meaningful
The isize of directory is 0 in bcachefs if the directory is empty.
With more child dirents created, its size ought to change. Many
other filesystems changed as that (ie. xfs and btrfs). And many of
them changed as the size of child dirent name. Although the directory
size may not seem to convey much, we can still give it some meaning.

The formula of dentry size as follow:
    occupied_size = 40 + ALIGN(9 + namelen, 8)

Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-13 14:58:38 -05:00
Kent Overstreet
4204e3bf63 bcachefs: check_unreachable_inodes is not actually PASS_ONLINE yet
check_unreachable_inodes does work in online mode, with the one caveat
that it assumes check_dirents has also run - and check_dirents is not
PASS_ONLINE yet.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-09 23:38:42 -05:00
Kent Overstreet
ae153f2e11 bcachefs: Don't use BTREE_ITER_cached when walking alloc btree during fsck
No need to pull the whole alloc btree into the btree key cache.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-09 23:38:42 -05:00
Kent Overstreet
15734b5e6f bcachefs: Check for dirents to overwritten inodes
This fixes various "dirent to missing inode" errors.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-09 23:38:42 -05:00
Kent Overstreet
d3d0fac57d bcachefs: bch2_btree_iter_peek_slot() handles navigating to nonexistent depth
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-09 23:38:42 -05:00
Kent Overstreet
bd5b09727f bcachefs: Don't set btree_path to updtodate if we don't fill
This fixes various locking asserts, and a null ptr deref in
bch2_btree_iter_peek_path().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-09 23:38:42 -05:00
Kent Overstreet
cf67f46641 bcachefs: __bch2_btree_pos_to_text()
Factor out a version of bch2_btree_pos_to_text() that doesn't take a
pointer to a in-memory btree node, to be used for btree node scrub.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-09 23:38:42 -05:00
Kent Overstreet
0a46ea9d46 bcachefs: printbuf_reset() handles tabstops
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-09 23:38:42 -05:00
Kent Overstreet
5906dcb993 bcachefs: Silence read-only errors when deleting snapshots
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-09 23:38:42 -05:00
Kent Overstreet
8b1f46bff3 bcachefs: Dropped superblock write is no longer a fatal error
Just emit a warning if errors=continue or fix_safe.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-09 23:38:42 -05:00
Kent Overstreet
8cfdc6ce1f bcachefs: bch2_trans_node_drop()
Factor out a small common helper.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-09 23:38:42 -05:00
Kent Overstreet
0971a72c3d bcachefs: bch2_trans_unlock_write()
New helper for dropping all write locks; which is distinct from the
helper the transaction commit path uses, which is faster and only
touches updates.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-09 23:38:42 -05:00
Kent Overstreet
e1911d7a69 bcachefs: btree_node_unlock() can now drop write locks
Prep work for reworking btree node locking during interior btree
updates.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-09 23:38:41 -05:00
Kent Overstreet
9a5232ef0a bcachefs: six locks: write locks can now be held recursively
This is needed for the interior update locking rework, where we'll be
holding node write locks for the duration of the update - which is
needed for synchronizing with online check_allocations.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-09 23:38:41 -05:00
Kent Overstreet
8f3aaa5d5d bcachefs: bch2_fs_btree_gc_init()
Now returns errors, prep work for check_allocations_done_lock

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-09 23:38:41 -05:00
Kent Overstreet
cb3f34982c bcachefs: Assert that btree write buffer only touches the right btrees
More asserts, more better.

Also, clean up the per-btree flags a bit.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-09 23:38:41 -05:00
Kent Overstreet
bdedae70f5 bcachefs: bch2_inum_path() now crosses subvolumes correctly
The dirent that points to a subvolume root is in the parent subvolume.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-09 23:38:41 -05:00
Kent Overstreet
ce9a21713b bcachefs: bch2_inum_path() no longer returns an error for disconnected inums
bch2_inum_path() should work even if the filesystem is corrupted - we
don't want it to cause fsck to fail.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-09 23:38:41 -05:00
Kent Overstreet
6adc5af50a bcachefs: btree_path_very_locks(): verify lock seq
If the btree_path's lock seq is wrong, the next bch2_trans_relock()
operation is guaranteed to fail and we take an unnecessary transaction
restart.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-09 23:38:41 -05:00
Kent Overstreet
f908eacc34 bcachefs: fix bch2_btree_key_cache_drop()
When evicting, we shouldn't leave a pointer to the key cache entry lying
around - that screws up btree path asserts we're adding.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-09 23:38:41 -05:00
Kent Overstreet
bc6fce7870 bcachefs: bch2_btree_node_write_trans()
Avoiding screwing up path->lock_seq.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-09 23:38:41 -05:00
Kent Overstreet
4bd06f07bc bcachefs: Fixes for snapshot_tree.master_subvol
Ensure that snapshot_tree.master_subvol is cleared when we delete the
master subvolume in a tree of snapshots, and allow for snapshot trees
that don't have a master subvolume in fsck.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-09 23:38:41 -05:00
Kent Overstreet
b5e4cd0871 bcachefs: Don't rely on snapshot_tree.master_subvol for reattaching
Previously, fsck used the snapshot tree's master subvol for finding the
root inode number - but the master subvol might have been deleting, and
setting a new one should be a user operation; meaning we can't rely on
it existing.

Fortunately, for finding the root inode number in a tree of snapshots,
finding any associated subvolume works.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-09 23:38:41 -05:00
Kent Overstreet
4541408391 bcachefs: bch2_kvmalloc()
Add a version of kvmalloc() that doesn't have the INT_MAX limit; large
filesystems do hit this.

We'll want to get rid of the in-memory bucket gens array, but we're not
there quite yet.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-09 23:38:41 -05:00
Kent Overstreet
fa3e5135e4 bcachefs: Fix assert for online fsck
We can't check if we're racing with fsck ending until mark_lock is held.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-09 23:38:41 -05:00
Kent Overstreet
cf3da2d627 bcachefs: Handle -BCH_ERR_need_mark_replicas in gc
Locking considerations (possibly no longer relevant?) mean that when an
accounting update needs a new superblock replicas entry to be created,
it's deferred to the transaction commit error path.

But accounting updates for gc/fcsk aren't done from the transaction
commit path - so we need to handle
-BCH_ERR_btree_insert_need_mark_replicas locally.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-09 23:38:41 -05:00
Kent Overstreet
861cd0f606 bcachefs: Write lock btree node in key cache fills
this addresses a key cache coherency bug

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-09 23:38:41 -05:00
Kent Overstreet
baf13d8344 bcachefs: kill __bch2_btree_iter_flags()
bch2_btree_iter_flags() now takes a level parameter; this fixes a bug
where using a node iterator on a leaf wouldn't set
BTREE_ITER_with_key_cache, leading to fun cache coherency bugs.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-09 23:38:41 -05:00
Kent Overstreet
30e32692d6 bcachefs: Drop redundant "read error" call from btree_gc
The btree node read error path already calls topology error, so this is
entirely redundant, and we're not specific enough about our error codes
- this was triggering for bucket_ref_update() errors.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-09 23:38:41 -05:00
Kent Overstreet
6542afe299 bcachefs: Drop racy warning
Checking for writing past i_size after unlocking the folio and clearing
the dirty bit is racy, and we already check it at the start.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-09 23:38:41 -05:00
Kent Overstreet
0475c7639e bcachefs: better check_bp_exists() error message
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-09 23:38:41 -05:00
Hongbo Li
d01ea14da7 bcachefs: add counter_flags for counters
In bcachefs, io_read and io_write counter record the amount
of data which has been read and written. They increase in
unit of sector, so to display correctly, they need to be
shifted to the left by the size of a sector. Other counters
like io_move, move_extent_{read, write, finish} also have
this problem.

In order to support different unit, we add extra column to
mark the counter type by using TYPE_COUNTER and TYPE_SECTORS
in BCH_PERSISTENT_COUNTERS().

Fixes: 1c6fdbd8f2 ("bcachefs: Initial commit")
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-09 23:38:41 -05:00
Kent Overstreet
3db3084a86 bcachefs: bcachefs_metadata_version_autofix_errors
It's time to make self healing the default: change the error action for
old filesystems to fix_safe, matching the default for current
filesystems.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-09 23:38:41 -05:00
Kent Overstreet
df448ca355 bcachefs: bcachefs_metadata_version_persistent_inode_cursors
Persistent cursors for inode allocation.

A free inodes btree would add substantial overhead to inode allocation
and freeing - a "next num to allocate" cursor is always going to be
faster.

We just need it to be persistent, to avoid scanning the inodes btree
from the start on startup.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-09 23:38:41 -05:00
Kent Overstreet
59c50511f7 bcachefs: bcachefs_metadata_version_inode_depth
This adds a new inode field, bi_depth, for directory inodes: this allows
us to make the check_directory_structure pass much more efficient.

Currently, to ensure the filesystem is fully connect and has no loops,
for every directory we follow backpointers until we find the root. But
by adding a depth counter, it sufficies to only check the parent of each
directory, and check that the parent's bi_depth is smaller.

(fsck doesn't require that bi_depth = parent->bi_depth + 1; if a rename
causes bi_depth off, but the chain to the root is still strictly
decreasing, then the algorithm still works and there's no need for fsck
to fixup the bi_depth fields).

We've already checked backpointers, so we know that every directory
(excluding the root)has a valid parent: if bi_depth is always
decreasing, every chain must terminate, and terminate at the root
directory.

bi_depth will not necessarily be correct when fsck runs, due to
directory renames - we can't change bi_depth on every child directory
when renaming a directory. That's ok; fsck will silently fix the
bi_depth field as needed, and future fsck runs will be much faster.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-29 13:30:39 -05:00
Kent Overstreet
80c6352c2c bcachefs: Option changes now get propagated to reflinked data
Now that bch2_move_get_io_opts() re-propagates changed inode io options
to bch_extent_rebalance, we can properly suport changing IO path options
for reflinked data.

Changing a per-file IO path option, either via the xattr interface or
via the BCHFS_IOC_REINHERIT_ATTRS ioctl, will now trigger a scan (the
inode number is marked as needing a scan, via
bch2_set_rebalance_needs_scan()), and rebalance will use
bch2_move_data(), which will walk the inode number and pick up the new
options.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-29 13:30:39 -05:00
Kent Overstreet
ea4f9e75ec bcachefs: bcachefs_metadata_version_reflink_p_may_update_opts
Previously, io path option changes on a file would be picked up
automatically and applied to existing data - but not for reflinked data,
as we had no way of doing this safely. A user may have had permission to
copy (and reflink) a given file, but not write to it, and if so they
shouldn't be allowed to change e.g. nr_replicas or other options.

This uses the incompat feature mechanism in the previous patch to add a
new incompatible flag to bch_reflink_p, indicating whether a given
reflink pointer may propagate io path option changes back to the
indirect extent.

In this initial patch we're only setting it for the source extents.

We'd like to set it for the destination in a reflink copy, when the user
has write access to the source, but that requires mnt_idmap which is not
curretly plumbed up to remap_file_range.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-29 13:30:39 -05:00
Kent Overstreet
a36d8f0e0e bcachefs: BCH_SB_VERSION_INCOMPAT
We've been getting away from feature bits: they don't have any kind of
ordering, and thus it's possible for people to enable weird combinations
of features that were never tested or intended to be run.

Much better to just give every new feature, compatible or incompatible,
a version number.

Additionally, we probably won't ever rev the major version number: major
version numbers represent incompatible versions, but that doesn't really
fit with how we actually roll out incompatible features - we need a
better way of rolling out incompatible features.

So, this patch adds two new superblock fields:
- BCH_SB_VERSION_INCOMPAT
- BCH_SB_VERSION_INCOMPAT_ALLOWED

BCH_SB_VERSION_INCOMPAT_ALLOWED indicates that incompatible features up
to version number x are allowed to be used without user prompting, but
it does not by itself deny old versions from mounting.

BCH_SB_VERSION_INCOMPAT does deny old versions from mounting, and must
be <= BCH_SB_VERSION_INCOMPAT_ALLOWED.

BCH_SB_VERSION_INCOMPAT will only be set when a codepath attempts to use
an incompatible feature, so as to not unnecessarily break compatibility
with old versions.

bch2_request_incompat_feature() is the new interface to check if an
incompatible feature may be used.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-29 13:30:39 -05:00
Kent Overstreet
d884cf189a bcachefs: Only run check_backpointers_to_extents in debug mode
The backpointers passes, check_backpointers_to_extents() and
check_extents_to_backpointers() are the most expensive fsck passes.

Now that we're running the same check and repair code when using a
backpointer at runtime (via bch2_backpointer_get_key()) that fsck does,
there's no reason fsck needs to - except to verify that the filesystem
really has no errors in debug mode.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-29 13:30:39 -05:00
Kent Overstreet
7611d6b5d1 bcachefs: better backpointer_target_not_found() error message
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-29 13:30:39 -05:00
Kent Overstreet
c2c2a4d642 bcachefs: bch2_backpointer_get_key() now repairs dangling backpointers
Continuing on with the self healing theme, we should be running any
check and repair code at runtime that we can - instead of declaring the
filesystemt inconsistent.

This will also let us skip running the backpointers -> extents fsck pass
except in debug mode.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-29 13:30:39 -05:00
Kent Overstreet
c738866e47 bcachefs: check_extents_to_backpointers() now only checks buckets with mismatches
Instead of walking every extent and every backpointer it points to,
first sum up backpointers in each bucket and check for mismatches, and
only look for missing backpointers if mismatches were detected, and only
check extents in those buckets.

This is a major fsck scalability improvement, since the two backpointers
passes (backpointers -> extents and extents -> backpointers) are the
most expensive fsck passes by far.

Additionally, to speed up the upgrade for backpointer bucket gens, or in
situations when we have to rebuild alloc info, add a special case for
when no backpointers are found in a bucket - don't check each individual
backpointer (in particular, avoiding the write buffer flushes), just
recreate them.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-29 13:30:39 -05:00
Kent Overstreet
056cae1c00 bcachefs: Add write buffer flush param to backpointer_get_key()
In an upcoming patch bch2_backpointer_get_key() will be repairing when
it finds a dangling backpointer; it will need to flush the btree write
buffer before it can definitively say there's an error.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-29 13:30:39 -05:00
Kent Overstreet
7171b1fd27 bcachefs: kill __bch2_extent_ptr_to_bp()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-29 13:30:39 -05:00
Kent Overstreet
aca7a26f7f bcachefs: bch2_extent_ptr_to_bp() no longer depends on device
bch_backpointer no longer contains the bucket_offset field, it's just a
direct LBA mapping (with low bits to account for compressed extent
splitting), so we don't need to refer to the device to construct it
anymore.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-29 13:30:39 -05:00
Kent Overstreet
ba9752e5f4 bcachefs: bcachefs_metadata_version_disk_accounting_big_endian
Fix sort order for disk accounting keys, in order to fix a regression on
mount times.

The typetag is now the most significant byte of the key, meaning disk
accounting keys of the same type now sort together.

This lets us skip over disk accounting keys that aren't mirrored in
memory when reading accounting at startup, instead of having them
interleaved with other counter types.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-29 13:30:39 -05:00
Kent Overstreet
ebdca07268 bcachefs: bcachefs_metadata_version_backpointer_bucket_gen
New on disk format version: backpointers new include the generation
number of the bucket they refer to, and the obsolete bucket_offset field
(no longer needed because we no longer store backpointers in alloc keys)
is gone.

This is an expensive forced upgrade - hopefully the last; we have to run
the extents_to_backpointers recovery pass to regenerate backpointers.

It's a forced incompatible upgrade because the alternative would've been
permamently making backpointers bigger, and as one of the biggest btrees
(along with the extents btree) that's not an ideal option.

It's worth it though, because this allows us to make the
check_extents_to_backpointers pass drastically cheaper: an upcoming
patch changes it to sum up backpointers in a bucket and check the sum
against the sector counts for that bucket, only looking for missing
backpointers if they don't match (and then only for specific buckets).

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-29 13:30:39 -05:00
Kent Overstreet
6679e363f4 bcachefs: bch2_btree_path_peek_slot() doesn't return errors
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-29 13:30:39 -05:00
Kent Overstreet
07c1a6fa90 bcachefs: trace_key_cache_fill
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-29 13:30:39 -05:00
Kent Overstreet
17d678bcdd bcachefs: Log message in journal for snapshot deletion
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-29 13:30:39 -05:00
Kent Overstreet
54c9b92fc7 bcachefs: bch2_trans_log_msg()
Export a helper for logging to the journal when we're already in a
transaction context.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-29 13:30:39 -05:00
Kent Overstreet
d0855e2106 bcachefs: Kill snapshot_t->equiv
Now entirely dead code.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-29 13:30:39 -05:00
John Garry
19206d3f5e block: Delete bio_set_prio()
Since commit 43b62ce3ff ("block: move bio io prio to a new field"), macro
bio_set_prio() does nothing but set bio->bi_ioprio. All other places just
set bio->bi_ioprio directly, so replace bio_set_prio() remaining
callsites with setting bio->bi_ioprio directly and delete that macro.

Signed-off-by: John Garry <john.g.garry@oracle.com>
Acked-by: Jack Wang <jinpu.wang@ionos.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20241202111957.2311683-3-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-23 08:17:23 -07:00
Kent Overstreet
35c5609abf bcachefs: Snapshot deletion no longer uses snapshot_t->equiv
Switch to generating a private list of interior nodes to delete, instead
of using the equivalence class in the global data structure.

This eliminates possible races with snapshot creation, and is much
cleaner - it'll let us delete a lot of janky code for calculating and
maintaining the equivalence classes.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:23 -05:00
Kent Overstreet
85c060f62d bcachefs: Kill equiv_seen arg to delete_dead_snapshots_process_key()
When deleting dead snapshots, we move keys from redundant interior
snapshot nodes to child nodes - unless there's already a key, in which
case the ancestor key is deleted.

Previously, we tracked via equiv_seen whether the child snapshot had a
key, but this was tricky w.r.t. transaction restarts, and not
transactionally safe w.r.t. updates in the child snapshot.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:23 -05:00
Kent Overstreet
92e31d4251 bcachefs: Don't run overwrite triggers before insert
This breaks when the trigger is inserting updates for the same btree, as
the inode trigger now does.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:23 -05:00
Kent Overstreet
f859bc945e bcachefs: alloc_data_type_set() happens in alloc trigger
Originally, we ran insert triggers before overwrite so that if an extent
was being moved (by fallocate insert/collapse range), the bucket sector
count wouldn't hit 0 partway through, and so we don't trigger state
changes caused by that too soon.

But this is better solved by just moving the data type change to the
alloc trigger itself, where it's already called.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:23 -05:00
Kent Overstreet
b9a37144da bcachefs: Fix key cache + BTREE_ITER_all_snapshots
Normally, whitouts (KEY_TYPE_whitout) are filtered from btree lookups,
since they exist only to represent deletions of keys in ancestor
snapshots - except, they should not be filtered in
BTREE_ITER_all_snapshots mode, so that e.g. snapshot deletion can clean
them up.

This means that that the key cache has to store whiteouts, and key cache
fills cannot filter them.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:23 -05:00
Kent Overstreet
7e320a4063 bcachefs: Fix btree_trans_peek_key_cache() BTREE_ITER_all_snapshots
In BTREE_ITER_all_snapshots mode, we're required to only return keys
where the snapshot field matches the iterator position -
BTREE_ITER_filter_snapshots requires pulling keys into the key cache
from ancestor snapshots, so we have to check for that.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:23 -05:00
Kent Overstreet
c50341be4e bcachefs: tidy btree_trans_peek_journal()
Change to match bch2_btree_trans_peek_updates() calling convention.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:23 -05:00
Kent Overstreet
68eb4fdd8c bcachefs: tidy up __bch2_btree_iter_peek()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:23 -05:00
Kent Overstreet
25a3123a67 bcachefs: check_indirect_extents can run online
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:23 -05:00
Kent Overstreet
00fa283a41 bcachefs: Refactor c->opts.reconstruct_alloc
Now handled in one place.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:23 -05:00
Nathan Chancellor
64833d3965 bcachefs: Add empty statement between label and declaration in check_inode_hash_info_matches_root()
Clang 18 and newer warns (or errors with CONFIG_WERROR=y):

  fs/bcachefs/str_hash.c:164:2: error: label followed by a declaration is a C23 extension [-Werror,-Wc23-extensions]
    164 |         struct bch_inode_unpacked inode;
        |         ^

In Clang 17 and prior, this is an unconditional hard error:

  fs/bcachefs/str_hash.c:164:2: error: expected expression
    164 |         struct bch_inode_unpacked inode;
        |         ^
  fs/bcachefs/str_hash.c:165:30: error: use of undeclared identifier 'inode'
    165 |         ret = bch2_inode_unpack(k, &inode);
        |                                     ^
  fs/bcachefs/str_hash.c:169:55: error: use of undeclared identifier 'inode'
    169 |         struct bch_hash_info hash2 = bch2_hash_info_init(c, &inode);
        |                                                              ^
  fs/bcachefs/str_hash.c:171:40: error: use of undeclared identifier 'inode'
    171 |                 ret = repair_inode_hash_info(trans, &inode);
        |                                                      ^

Add an empty statement between the label and the declaration to fix the
warning/error without disturbing the code too much.

Fixes: 2519d3b0d656 ("bcachefs: bch2_str_hash_check_key() now checks inode hash info")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202412092339.QB7hffGC-lkp@intel.com/
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:23 -05:00
Kent Overstreet
3f57171d8d bcachefs: trace_write_buffer_maybe_flush
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:23 -05:00
Kent Overstreet
9f95fc3c12 bcachefs: bch2_snapshot_exists()
bch2_snapshot_equiv() is going away; convert users that just wanted to
know if the snapshot exists to something better

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:23 -05:00
Kent Overstreet
be203120dc bcachefs: bch2_check_key_has_snapshot() prints btree id
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:23 -05:00
Kent Overstreet
6ea607ca61 bcachefs: bch2_str_hash_check_key() now checks inode hash info
Versions of the same inode in different snapshots must have the same
hash info; this is critical for lookups to work correctly.

We're going to be running the str_hash checks online, at readdir or
xattr list time, so we now need str_hash_check_key() to check for inode
hash seed mismatches, since it won't be run right after check_inodes().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:23 -05:00
Kent Overstreet
644457ed83 bcachefs: Don't BUG_ON() inode unpack error
Bkey validation checks that inodes are well-formed and unpack
successfully, so an unpack error should always indicate memory
corruption or some other kind of hardware bug - but these are still
errors we can recover from.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:23 -05:00
Kent Overstreet
7b11260456 bcachefs: Use proper errcodes for inode unpack errors
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:23 -05:00
Kent Overstreet
cd150cf924 bcachefs: kill sysfs internal/accounting
Since we added per-inode counters there's now far too many counters to
show in one shot - if we want this in the future, it'll have to be in
debugfs.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:23 -05:00
Kent Overstreet
49f2d18263 bcachefs: Kill unnecessary mark_lock usage
We can't hold mark_lock while calling fsck_err() - that's a deadlock,
mark_lock is meant to be a leaf node lock.

It's also unnecessary for gc_bucket() and bucket_gen(); rcu suffices
since the bucket_gens array describes its size, and we can't race with
device removal or resize during gc/fsck since that takes state lock.

Reported-by: syzbot+38641fcbda1aaffefdd4@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:22 -05:00
Kent Overstreet
54dacdada6 bcachefs: Don't start rewriting btree nodes until after journal replay
This fixes a deadlock during journal replay when btree node read errors
kick off a ton of rewrites: we don't want them competing with journal
replay.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:22 -05:00
Kent Overstreet
9e779f3f24 bcachefs: Fix reuse of bucket before journal flush on multiple empty -> nonempty transition
For each bucket we track when the bucket became nonempty and when it
became empty again: if we can ensure that there will be no journal
flushes in the range [nonempty, empty) (possibly because they occured at
the same journal sequence number), then it's safe to reuse the bucket
without waiting for a journal commit.

This is a major performance optimization for erasure coding, where
writes are initially replicated, but the extra replicas are quickly
dropped: if those buckets are reused and overwritten without issuing a
cache flush to the underlying device, then they only cost bus bandwidth.

But there's a tricky corner case when there's multiple empty -> nonempty
-> empty transitions in quick succession, i.e. when data is getting
overwritten immediately as it's being written.

If this happens and the previous empty transition hasn't been flushed,
we need to continue tracking the previous nonempty transition - not
start a new one.

Fixing this means we now need to track both the nonempty and empty
transitions in bch_alloc_v4.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:22 -05:00
Kent Overstreet
89e74eccab bcachefs: bch2_journal_noflush_seq() now takes [start, end)
Harder to screw up if we're explicit about the range, and more correct
as journal reservations can be outstanding on multiple journal entries
simultaneously.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:22 -05:00
Kent Overstreet
be565740ee bcachefs: Set bucket needs discard, inc gen on empty -> nonempty transition
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:22 -05:00
Kent Overstreet
44a43cf9fd bcachefs: Don't add unknown accounting types to eytzinger tree
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:22 -05:00
Kent Overstreet
60558d55f7 bcachefs: Plumb bkey_validate_context to journal_entry_validate
This lets us print the exact location in the journal if it was found in
the journal, or correctly print if it was found in the superblock.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:22 -05:00
Kent Overstreet
bbe36bd099 bcachefs: Use a heap for handling overwrites in btree node scan
Fix an O(n^2) issue when we find many overlapping (overwritten) btree
nodes - especially when one node overwrites many smaller nodes.

This was discovered to be an issue with the bcachefs
merge_torture_flakey test - if we had a large btree that was then
emptied, the number of difficult overwrites can be unbounded.

Cc: Kuan-Wei Chiu <visitorckw@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:22 -05:00
Kent Overstreet
fbd152bf94 bcachefs: Minor bucket alloc optimization
Check open buckets and buckets waiting for journal commit before doing
other expensive lookups.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:22 -05:00
Kent Overstreet
f65645d804 bcachefs: Mark more errors autofix
tested repairing from a bug uncovered by the merge_torture_flakey test

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:22 -05:00
Kent Overstreet
821ddebbc2 bcachefs: fix bch2_btree_node_header_to_text() format string
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:22 -05:00
Kent Overstreet
58117dbdd6 bcachefs: Journal space calculations should skip durability=0 devices
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:22 -05:00
Kent Overstreet
d4c9fc000b bcachefs: factor out str_hash.c
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:22 -05:00
Kent Overstreet
ce70157112 bcachefs: kill flags param to bch2_subvolume_get()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:22 -05:00
Kent Overstreet
23f88c1d16 bcachefs: Don't call bch2_btree_interior_update_will_free_node() until after update succeeds
Originally, btree splits always succeeded once we got to the point of
recursing to the btree_insert_node() call.

But that changed when we switched to not taking intent locks all the way
up to the root, and that introduced a bug, because
bch2_btree_interior_update_will_free_node() cancels paending writes and
reparents a node that's going to be made visible on disk by another
btree update to the current btree update.

This was discovered in recent backpointers work, because
bch2_btree_interior_update_will_free_node() also clears the
will_make_reachable flag, causing backpointer target lookup to
spuriously thing it had found a dangling backpointer (when the
backpointer just hadn't been created yet by
btree_update_nodes_written()).

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:22 -05:00
Kent Overstreet
c67fab0774 bcachefs: Make sure __bch2_run_explicit_recovery_pass() signals to rewind
We should always signal to rewind if the requested pass hasn't been run,
even if called multiple times.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:22 -05:00
Kent Overstreet
90c6daa6ac bcachefs: Call bch2_btree_lost_data() on btree read error
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:22 -05:00
Kent Overstreet
ff7e7c5367 bcachefs: Journal write path refactoring, debug improvements
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:22 -05:00
Kent Overstreet
47d6ee766f bcachefs: dev_alloc_list.devs -> dev_alloc_list.data
This lets us use darray macros on dev_alloc_list (and it will become a
darray eventually, when we increase the maximum number of devices).

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:22 -05:00
Kent Overstreet
49833ce27e bcachefs: Fix failure to allocate journal write on discard retry
When allocating a journal write fails, then retries after doing
discards, we were failing to count already allocated replicas.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:22 -05:00
Kent Overstreet
6728f8f829 bcachefs: BCH_ERR_insufficient_journal_devices
kill another standard error code use

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:21 -05:00
Kent Overstreet
3f1cf04ff9 bcachefs: Silence "unable to allocate journal write" if we're already RO
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:21 -05:00
Kent Overstreet
400af9a398 bcachefs: trace_accounting_mem_insert
Add a tracepoint for inserting new accounting entries: we're seeing odd
spinning behaviour in accounting read.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:21 -05:00
Kent Overstreet
e3474394eb bcachefs: Advance to next bp on BCH_ERR_backpointer_to_overwritten_btree_node
Don't spin.

Fixes: de95cc201a97 ("bcachefs: Kill bch2_get_next_backpointer()")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:21 -05:00
Kent Overstreet
8dabb19ff4 bcachefs: Simplify disk accounting validate late
The validate late path was iterating over accounting entries in
eytzinger order, which is unnecessarily tricky when we may have to
remove entries.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:21 -05:00
Kent Overstreet
f78760dede bcachefs: logged ops only use inum 0 of logged ops btree
we wish to use the logged ops btree for other items that aren't strictly
logged ops: cursors for inode allocation

There's no reason to create another cached btree for inode allocator
cursors - so reserve different parts of the keyspace for different
purposes.

Older versions will ignore or delete the cursors.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:21 -05:00
Kent Overstreet
ad0b2544ec bcachefs: rcu_pending now works in userspace
Introduce a typedef to handle the difference between unsigned
long/struct urcu_gp_poll_state.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:21 -05:00
Geert Uytterhoeven
d36b3e74b6 bcachefs: BCACHEFS_PATH_TRACEPOINTS should depend on TRACING
When tracing is disabled, there is no point in asking the user about
enabling extra btree_path tracepoints in bcachefs.

Fixes: 32ed4a620c ("bcachefs: Btree path tracepoints")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:21 -05:00
Kent Overstreet
9c22dd02ae bcachefs: Fix allocating too big journal entry
The "journal space available" calculations didn't take into account
mismatched bucket sizes; we need to take the minimum space available out
of our devices.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:21 -05:00
Kent Overstreet
5cdaec193a bcachefs: Improve "unable to allocate journal write" message
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:21 -05:00
Kent Overstreet
511ddcdb2d bcachefs: fix bch2_journal_key_insert_take() seq
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:21 -05:00
Kent Overstreet
c1f618f4f7 bcachefs: bch2_async_btree_node_rewrites_flush()
Add a method to flush btree node rewrites at the end of recovery, to
ensure that corrected errors are persisted.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:21 -05:00
Kent Overstreet
b29769c72d bcachefs: If we did repair on a btree node, make sure we rewrite it
Ensure that "invalid bkey" repair gets persisted, so that it doesn't
repeatedly spam the logs.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:21 -05:00
Kent Overstreet
1302eeb7c5 bcachefs: bkey_fsck_err now respects errors_silent
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:21 -05:00
Kent Overstreet
7807b5b07d bcachefs: list_pop_entry()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:21 -05:00
Kent Overstreet
097cc9d0d6 bcachefs: Convert write path errors to inum_to_path()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:21 -05:00
Kent Overstreet
f7727a6767 bcachefs: bch2_inum_to_path()
Add a function for walking backpointers to find a path from a given
inode number, and convert various error messages to use it.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:21 -05:00
Kent Overstreet
c9b9afe78c bcachefs: Fix fsck.c build in userspace
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:21 -05:00
Yang Li
2f8d5edf55 bcachefs: Add missing parameter description to bch2_bucket_alloc_trans()
The function bch2_bucket_alloc_trans() lacked a description for the
nowait parameter in its documentation comment block. This patch adds the
missing description to ensure all parameters are properly documented.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=12179
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:21 -05:00
Kent Overstreet
2cd85fea49 bcachefs: Don't recurse in check_discard_freespace_key
When calling check_discard_freeespace_key from the allocator, we can't
repair without recursing - run it asynchronously instead.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:21 -05:00
Kent Overstreet
9bdb3b73e7 bcachefs: Check for extent crc uncompressed/compressed size mismatch
When not compressed, these must be equal - this fixes an assertion pop
in bch2_rechecksum_bio().

Reported-by: syzbot+50d3544c9b8db9c99fd2@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:21 -05:00
Kent Overstreet
ff1dd05f82 bcachefs: bch2_trans_relock() is trylock for lockdep
fix some spurious lockdep splats

Reported-by: syzbot+e088be3c2d5c05aaac35@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:20 -05:00
Kent Overstreet
f7f196170d bcachefs: cryptographic MACs on superblock are not (yet?) supported
We should add support for cryptographic macs on the superblock - and it
won't be hard, but it'll need an incompatible feature bit (and we have a
new incompatible feature versioning scheme coming).

For now, just add a guard to avoid a dull ptr deref in gen_poly_key().

Reported-by: syzbot+dd3d9835055dacb66f35@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:20 -05:00
Kent Overstreet
4746ee182a bcachefs: Check for inode journal seq in the future
More check and repair code: this fixes a warning in
bch2_journal_flush_seq_async()

Reported-by: syzbot+d119b445ec739e7f3068@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:20 -05:00
Kent Overstreet
0eafe758ac bcachefs: Check for bucket journal seq in the future
This fixes an assertion pop in bch2_journal_noflush_seq() - log the
error to the superblock and continue instead.

Reported-by: syzbot+85700120f75fc10d4e18@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:20 -05:00
Kent Overstreet
8b10590918 bcachefs: do_fsck_ask_yn()
__bch2_fsck_err() is huge, and badly needs more refactoring

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:20 -05:00
Kent Overstreet
052210c3fa bcachefs: Don't error out when logging fsck error
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:20 -05:00
Kent Overstreet
cfba90aba9 bcachefs: mark more errors AUTOFIX
mark errors as autofix where syzbot has hit the repair paths

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:20 -05:00
Kent Overstreet
914381013b bcachefs: add missing printbuf_reset()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:20 -05:00
Kent Overstreet
0184dfa3b8 bcachefs: Fix journal_iter list corruption
Fix exiting an iterator that wasn't initialized.

Reported-by: syzbot+2f7c2225ed8a5cb24af1@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:20 -05:00
Kent Overstreet
f11ca2ab18 bcachefs: Guard against backpointers to unknown btrees
Reported-by: syzbot+997f0573004dcb964555@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:20 -05:00
Kent Overstreet
f9e0a9be70 bcachefs: Issue a transaction restart after commit in repair
transaction commits invalidate pointers to btree values, and they also
downgrade intent locks.

This breaks the interior btree update path, which takes intent locks and
then calls into the allocator.

This isn't an ideal solution: we can't unconditionally issue a restart
after a transaction commit, because that would break other codepaths.

Reported-by: syzbot+78d82470c16a49702682@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:20 -05:00
Kent Overstreet
b3d82c2f27 bcachefs: Guard against journal seq overflow
Wraparound is impractical to handle since in various places we use 0 as
a sentinal value - but 64 bits (or 56, because the btree write buffer
steals a few bits) is enough for all practical purposes.

Reported-by: syzbot+73ed43fbe826227bd4e0@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:20 -05:00
Kent Overstreet
9963a14da1 bcachefs: BCH_FS_recovery_running
If we're autofixing topology errors, we shouldn't shutdown if we're
still in recovery.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:20 -05:00
Kent Overstreet
124e108185 bcachefs: Make topology errors autofix
These repair paths are well tested, we can repair them without explicit
user intervention

This also tweaks bch2_topology_error() so that we run topology repair if
we're in recovery, not just fsck.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:20 -05:00
Kent Overstreet
a6f4794fcd bcachefs: struct bkey_validate_context
Add a new parameter to bkey validate functions, and use it to improve
invalid bkey error messages: we can now print the btree and depth it
came from, or if it came from the journal, or is a btree root.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:20 -05:00
Kent Overstreet
c7e78f7b01 bcachefs: Ignore empty btree root journal entries
There's no reason to treat them as errors: just ignore them, and go with
a previous btree root if we had one.

Reported-by: syzbot+e22007d6acb9c87c2362@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:20 -05:00
Kent Overstreet
90f3683e8f bcachefs: Fix null ptr deref in btree_path_lock_root()
Historically, we required that all btree node roots point to a valid
(possibly fake) node, but we're improving our ability to continue in the
presence of errors.

Reported-by: syzbot+e22007d6acb9c87c2362@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:20 -05:00
Kent Overstreet
db0667a4ed bcachefs: Go RW earlier, for normal rw mount
Previously, when mounting read-write after a clean shutdown, we wouldn't
go read-write until after all the recovery passes completed.

Now, go RW early in recovery, the same as any other situation we'll need
to go read-write. This fixes a bug where we discover unlinked inodes
after a clean shutdown: repair fails because we're read only.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:20 -05:00
Kent Overstreet
d941597636 bcachefs: Fix bch2_btree_node_update_key_early()
Fix an assertion pop from the recent btree cache freelist fixes.

Fixes: baefd3f849 ("bcachefs: btree_cache.freeable list fixes")
Reported-by: Tyler <th020394@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:20 -05:00
Kent Overstreet
686d2ebec6 bcachefs: Change "disk accounting version 0" check to commit only
6.11 had a bug where we'd sometimes create disk accounting keys with
version 0, which causes issues for journal replay - but we don't need to
delete existing accounting keys with version 0.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:20 -05:00
Kent Overstreet
dba8243f3b bcachefs: Don't try to en/decrypt when encryption not available
If a btree node says it's encrypted, but the superblock never had an
encryptino key - whoops, that needs to be handled.

Reported-by: syzbot+026f1857b12f5eb3f9e9@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:20 -05:00
Kent Overstreet
75eabea698 bcachefs: Fix dup/misordered check in btree node read
We were checking for out of order keys, but not duplicate keys.

Reported-by: syzbot+dedbd67513939979f84f@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:19 -05:00
Kent Overstreet
1415265480 bcachefs: Bad btree roots are now autofix
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:19 -05:00
Kent Overstreet
828552ca74 bcachefs: Kill bch2_bucket_alloc_new_fs()
The early-early allocation path, bch2_bucket_alloc_new_fs(), is no
longer needed - and inconsistencies around new_fs_bucket_idx have been a
frequent source of bugs.

Reported-by: syzbot+592425844580a6598410@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:19 -05:00
Kent Overstreet
abf23afa36 bcachefs: Fix btree node scan when unknown btree IDs are present
btree_root entries for unknown btree IDs are created during recovery,
before reading those btree roots.

But btree_node_scan may find btree nodes with unknown btree IDs when we
haven't seen roots for those btrees.

Reported-by: syzbot+1f202d4da221ec6ebf8e@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:19 -05:00
Kent Overstreet
427db7ffe9 bcachefs: backpointer_to_missing_ptr is now autofix
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:19 -05:00
Kent Overstreet
b71d89bd7b bcachefs: Fix accounting_read when we rewind
If we rewind recovery to run topology repair, that causes
accounting_read to run twice.

This fixes accounting being double counted.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:19 -05:00
Kent Overstreet
a7ecd5f2cc bcachefs: disk_accounting: bch2_dev_rcu -> bch2_dev_rcu_noerror
Accounting keys that reference invalid devices are corrected by fsck,
they shouldn't cause an emergency shutdown.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:19 -05:00
Kent Overstreet
6534a404d4 bcachefs: errcode cleanup: journal errors
Instead of throwing standard error codes, we should be throwing
dedicated private error codes, this greatly improves debugability.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:19 -05:00
Kent Overstreet
525be09e63 bcachefs: Use separate rhltable for bch2_inode_or_descendents_is_open()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:19 -05:00
Kent Overstreet
375d21b76d bcachefs: BCH_ERR_btree_node_read_error_cached
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:19 -05:00
Kent Overstreet
0eaac0b44f bcachefs: btree_write_buffer_flush_seq() no longer closes journal
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:19 -05:00
Kent Overstreet
bb61afebca bcachefs: discard fastpath now uses bch2_discard_one_bucket()
The discard bucket fastpath previously was using its own code for
discarding buckets and clearing them in the need_discard btree, which
didn't have any of the consistency checks of the main discard path.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:19 -05:00
Kent Overstreet
e1cb4f56dc bcachefs: Bias reads more in favor of faster device
Per reports of performance issues on mixed multi device filesystems
where we're issuing too much IO to the spinning rust - tweak this
algorithm.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:19 -05:00
Kent Overstreet
f4d67f6d5a bcachefs: trivial btree write buffer refactoring
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:19 -05:00
Kent Overstreet
c601e5d7da bcachefs: Can now block journal activity without closing cur entry
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:19 -05:00
Kent Overstreet
c80f33b752 bcachefs: New backpointers helpers
- bch2_backpointer_del()
- bch2_backpointer_maybe_flush()

Kill a bit of open coding and make sure we're properly handling the
btree write buffer.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:19 -05:00
Kent Overstreet
1ab00b6cdd bcachefs: kill bch_backpointer.bucket_offset usage
bch_backpointer.bucket_offset is going away - it's no longer needed
since we no longer store backpointers in alloc keys, the same
information is in the key position itself.

And we'll be reclaiming the space in bch_backpointer for the bucket
generation number.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:19 -05:00
Kent Overstreet
e48fda6cdc bcachefs: Fix check_backpointers_to_extents range limiting
bch2_get_btree_in_memory_pos() will return positions that refer directly
to the btree it's checking will fit in memory - i.e. backpointer
positions, not buckets.

This also means check_bp_exists() no longer has to refer to the device,
and we can delete some code.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:19 -05:00
Kent Overstreet
eb25733aba bcachefs: bch_backpointer -> bkey_i_backpointer
Since we no longer store backpointers in alloc keys, there's no reason
not to pass around bkey_i_backpointers; this means we don't have to pass
the bucket pos separately.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:19 -05:00
Kent Overstreet
abff9b149d bcachefs: Drop swab code for backpointers in alloc keys
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:19 -05:00
Kent Overstreet
5b5a7ae8fa bcachefs: bucket_pos_to_bp_end()
Better helpers for iterating over backpointers within a specific bucket

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:19 -05:00
Kent Overstreet
debe6965ac bcachefs: check for backpointers to invalid device
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:19 -05:00
Kent Overstreet
3b6ebc94a0 bcachefs: fix bp_pos_to_bucket_nodev_noerror
_noerror means don't produce inconsistent errors, so it should be using
bch2_dev_rcu_noerror().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:18 -05:00
Kent Overstreet
4de2c24aa9 bcachefs: Fix evacuate_bucket tracepoint
86a494c8eef9 ("bcachefs: Kill bch2_get_next_backpointer()") dropped some
things the tracepoint emitted because bch2_evacuate_bucket() no longer
looks at the alloc key - but we did want at least some of that.

We still no longer look at the alloc key so we can't report on the
fragmentation number, but that's a direct function of dirty_sectors and
a copygc concern anyways - copygc should get its own tracepoint that
includes information from the fragmentation LRU.

But we can report on the number of sectors we moved and the bucket size.

Co-developed-by: Piotr Zalewski <pZ010001011111@proton.me>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:18 -05:00
Kent Overstreet
eae6c4a625 bcachefs: fix O(n^2) issue with whiteouts in journal keys
The journal_keys array can't be substantially modified after we go RW,
because lookups need to be able to check it locklessly - thus we're
limited on what we can do when a key in the journal has been
overwritten.

This is a problem when there's many overwrites to skip over for peek()
operations. To fix this, add tracking of ranges of overwrites: we create
a range entry when there's more than one contiguous whiteout.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:18 -05:00
Kent Overstreet
854724d116 bcachefs: btree_and_journal_iter: don't iterate over too many whiteouts when prefetching
To help ameloriate issues with peek operations having to skip over
deletions in the journal - just bail out if all we're doing is
prefetching btree nodes.

Since btree node prefetching runs every time we iterate to a new node,
and has to sequentially scan ahead, this avoids another O(n^2).

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:18 -05:00
Kent Overstreet
06d7a56fe0 bcachefs: journal keys: sort keys for interior nodes first
There's an unavoidable issue with btree lookups when we're overlaying
journal keys and the journal has many deletions for keys present in the
btree - peek operations will have to iterate over all those deletions to
find the next live key to return.

This is mainly a problem for lookups in interior nodes, if we have to
traverse to a leaf. Looking up an insert position in a leaf (for journal
replay) doesn't have to find the next live key, but walking down the
btree does.

So to ameloriate this, change journal key sort ordering so that we
replay keys from roots and interior nodes first.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:18 -05:00
Kent Overstreet
57026c41c9 bcachefs: kill bch2_journal_entries_free()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:18 -05:00
Kent Overstreet
3d0b3b51c5 bcachefs: Don't BUG_ON() when superblock feature wasn't set for compressed data
We don't allocate the mempools for compression/decompression unless we
need them - but that means there's an inconsistency to check for.

Reported-by: syzbot+cb3fbcfb417448cfd278@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:18 -05:00
Kent Overstreet
e1702b9891 bcachefs: Don't use a shared decompress workspace mempool
gzip and zstd require different decompress workspace sizes, and if we
start with one and then start using the other at runtime we may not get
the correct size

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:18 -05:00
Kent Overstreet
6a4ce7a92f bcachefs: compression workspaces should be indexed by opt, not type
type includes lz4 and lz4_old, which do not get different compression
workspaces, and incompressible, a fake type - BCH_COMPRESSION_OPTS() is
the correct enum to use.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:18 -05:00
Kent Overstreet
cec51e0a5d bcachefs: add missing BTREE_ITER_intent
this fixes excessive transaction restarts due to trans_commit having to
upgrade

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:18 -05:00
Kent Overstreet
9e92d6e9ef bcachefs: Kill bch2_get_next_backpointer()
Since for quite some time backpointers have only been stored in the
backpointers btree, not alloc keys (an aborted experiment, support for
which has been removed) - we can replace get_next_backpointer() with
simple btree iteration.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:18 -05:00
Kent Overstreet
7815809fca bcachefs: Delete backpointers check in try_alloc_bucket()
try_alloc_bucket() has a "safety" check, which avoids allocating a
bucket if there's any backpointers present.

But backpointers are not the source of truth for live data in a bucket,
the bucket sector counts are; this check was fairly useless, and we're
also deferring backpointers checks from fsck to runtime in the near
future.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:18 -05:00
Kent Overstreet
ac745efb42 bcachefs: peek_prev_min(): Search forwards for extents, snapshots
With extents and snapshots, for slightly different reasons, we may have
to search forwards to find a key that compares equal to iter->pos (i.e.
a key that peek_prev() should return, as it returns keys <= iter->pos).

peek_slot() does this, and is an easy way to fix this case.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:18 -05:00
Kent Overstreet
7e5b8e00e2 bcachefs: Implement bch2_btree_iter_prev_min()
A user contributed a filessytem dump, where the dump was actually
corrupted (due to being taken while the filesystem was online), but
which exposed an interesting bug in fsck - reconstruct_inode().

When itearting in BTREE_ITER_filter_snapshots mode, it's required to
give an end position for the iteration and it can't span inode numbers;
continuing into the next inode might mean we start seeing keys from a
different snapshot tree, that the is_ancestor() checks always filter,
thus we're never able to return a key and stop iterating.

Backwards iteration never implemented the end position because nothing
else needed it - except for reconstuct_inode().

Additionally, backwards iteration is now able to overlay keys from the
journal, which will be useful if we ever decide to start doing journal
replay in the background.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:18 -05:00
Kent Overstreet
acd1fc7b1f bcachefs: discard_one_bucket() now uses need_discard_or_freespace_err()
More conversion of inconsistent errors to fsck errors.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:18 -05:00
Kent Overstreet
c8e588135c bcachefs: bch2_bucket_do_index(): inconsistent_err -> fsck_err
Factor out a common helper, need_discard_or_freespace_err(), which is
now used by both fsck and the runtime checks, and can repair.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:18 -05:00
Kent Overstreet
c97118f1dc bcachefs: try_alloc_bucket() now uses bch2_check_discard_freespace_key()
check_discard_freespace_key() was doing all the same checks as
try_alloc_bucket(), but with repair.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:18 -05:00
Kent Overstreet
731d06e138 bcachefs: rework bch2_bucket_alloc_freelist() freelist iteration
Prep work for converting try_alloc_bucket() to use
bch2_check_discard_freespace_key().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:18 -05:00
Kent Overstreet
724e49c677 bcachefs: kill inconsistent err in invalidate_one_bucket()
Change it to a normal fsck_err() - meaning it'll get repaired at runtime
when that's flipped on.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:18 -05:00
Kent Overstreet
7579c85d9c bcachefs: Don't delete reflink pointers to missing indirect extents
To avoid tragic loss in the event of transient errors (i.e., a btree
node topology error that was later corrected by btree node scan), we
can't delete reflink pointers to correct errors.

This adds a new error bit to bch_reflink_p, indicating that it is known
to point to a missing indirect extent, and the error has already been
reported.

Indirect extent lookups now use bch2_lookup_indirect_extent(), which on
error reports it as a fsck_err() and sets the error bit, and clears it
if necessary on succesful lookup.

This also gets rid of the bch2_inconsistent_error() call in
__bch2_read_indirect_extent, and in the reflink_p trigger: part of the
online self healing project.

An on disk format change isn't necessary here: setting the error bit
will be interpreted by older versions as pointing to a different index,
which will also be missing - which is fine.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:18 -05:00
Kent Overstreet
3d338378d7 bcachefs: Reorganize reflink.c a bit
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:18 -05:00
Kent Overstreet
61f854da4c bcachefs: Reserve 8 bits in bch_reflink_p
Better repair for reflink pointers, as well as propagating new inode
options to indirect extents, are going to require a few extra bits
bch_reflink_p: so claim a few from the high end of the destination
index.

Also add some missing bounds checking.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:17 -05:00
Kent Overstreet
eb73e7773f bcachefs: Kill FSCK_NEED_FSCK
If we find an error that indicates that we need to run fsck, we can
specify that directly with run_explicit_recovery_pass().

These are now log_fsck_err() calls: we're just logging in the superblock
that an error occurred - and possibly doing an emergency shutdown,
depending on policy.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:17 -05:00
Kent Overstreet
79c5e3c793 bcachefs: lru errors are expected when reconstructing alloc
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:17 -05:00
Kent Overstreet
b6269cd0ec bcachefs: Delete dead code from bch2_discard_one_bucket()
alloc key validation ensures that if a bucket is in need_discard state
the sector counts are all zero - we don't have to check for that.

The NEED_INC_GEN check appears to be dead code, as well: we only see
buckets in the need_discard btree, and it's an error if they aren't in
the need_discard state.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:17 -05:00
Kent Overstreet
7d1918b0d8 bcachefs: bch2_btree_bit_mod_iter()
factor out a new helper, make it handle extents bitset btrees
(freespace).

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:17 -05:00
Kent Overstreet
1f282f1ee0 bcachefs: delete dead code
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:17 -05:00
Kent Overstreet
d985e63dba bcachefs: Fix shutdown message
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:17 -05:00
Kent Overstreet
5c3911ac94 bcachefs: Don't use page allocator for sb_read_scratch
Kill another unnecessary dependency on PAGE_SIZE

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:17 -05:00
Youling Tang
385d1a3c81 bcachefs: Simplify code in bch2_dev_alloc()
- Remove unnecessary variable 'ret'.
- Remove unnecessary bch2_dev_free() operations.

Signed-off-by: Youling Tang <tangyouling@kylinos.cn>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:17 -05:00
Youling Tang
924e81c530 bcachefs: Remove redundant initialization in bch2_vfs_inode_init()
`inode->v.i_ino` has been initialized to `inum.inum`. If `inum.inum` and
`bi->bi_inum` are not equal, BUG_ON() is triggered in
bch2_inode_update_after_write().

Signed-off-by: Youling Tang <tangyouling@kylinos.cn>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:17 -05:00
Youling Tang
5abd7ac19d bcachefs: Removes NULL pointer checks for __filemap_get_folio return values
__filemap_get_folio the return value cannot be NULL, so unnecessary checks
are removed.

Signed-off-by: Youling Tang <tangyouling@kylinos.cn>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:17 -05:00
Kent Overstreet
dc003efbc7 bcachefs: Add support for FS_IOC_GETFSSYSFSPATH
[TEST]:
```
$ cat ioctl_getsysfspath.c
 #include <stdio.h>
 #include <stdlib.h>
 #include <fcntl.h>
 #include <sys/ioctl.h>
 #include <linux/fs.h>
 #include <unistd.h>

 int main(int argc, char *argv[]) {
     int fd;
     struct fs_sysfs_path sysfs_path = {};

     if (argc != 2) {
         fprintf(stderr, "Usage: %s <path_to_file_or_directory>\n", argv[0]);
         exit(EXIT_FAILURE);
     }

     fd = open(argv[1], O_RDONLY);
     if (fd == -1) {
         perror("open");
         exit(EXIT_FAILURE);
     }

     if (ioctl(fd, FS_IOC_GETFSSYSFSPATH, &sysfs_path) == -1) {
         perror("ioctl FS_IOC_GETFSSYSFSPATH");
         close(fd);
         exit(EXIT_FAILURE);
     }

     printf("FS_IOC_GETFSSYSFSPATH: %s\n", sysfs_path.name);
     close(fd);
     return 0;
 }

$ gcc ioctl_getsysfspath.c
$ sudo bcachefs format /dev/sda
$ sudo mount.bcachefs /dev/sda /mnt
$ sudo ./a.out /mnt
  FS_IOC_GETFSSYSFSPATH: bcachefs/c380b4ab-fbb6-41d2-b805-7a89cae9cadb
```

Original patch link:
[1]: https://lore.kernel.org/all/20240207025624.1019754-8-kent.overstreet@linux.dev/

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Youling Tang <youling.tang@linux.dev>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:17 -05:00
Kent Overstreet
4f1a6b0ab4 bcachefs: Add support for FS_IOC_GETFSUUID
Use super_set_uuid() to set `sb->s_uuid_len` to avoid returning `-ENOTTY`
with sb->s_uuid_len being 0.

Original patch link:
[1]: https://lore.kernel.org/all/20240207025624.1019754-2-kent.overstreet@linux.dev/

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Youling Tang <tangyouling@kylinos.cn>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:17 -05:00
Youling Tang
4fa5d8e166 bcachefs: Correct the description of the '--bucket=size' options
Signed-off-by: Youling Tang <tangyouling@kylinos.cn>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:17 -05:00
Integral
394033dcc9 bcachefs: add support for true/false & yes/no in bool-type options
Here is the patch which uses existing constant table:

Currently, when using bcachefs-tools to set options, bool-type options
can only accept 1 or 0. Add support for accepting true/false and yes/no
for these options.

Signed-off-by: Integral <integral@murena.io>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Acked-by: David Howells <dhowells@redhat.com>
2024-12-21 01:36:17 -05:00
Kent Overstreet
e5ea05293a bcachefs: Move fsck ioctl code to fsck.c
chardev.c and fs-ioctl.c are not organized by subject; let's try to fix
this.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:17 -05:00
Kent Overstreet
e69df6adf8 bcachefs: Kill unnecessary iter_rewind() in bkey_get_empty_slot()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:17 -05:00
Kent Overstreet
db6e584b85 bcachefs: Simplify btree_iter_peek() filter_snapshots
Collapse all the BTREE_ITER_filter_snapshots handling down into a single
block; btree iteration is much simpler in the !filter_snapshots case.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:17 -05:00
Kent Overstreet
000fe8d573 bcachefs: Rename btree_iter_peek_upto() -> btree_iter_peek_max()
We'll be introducing btree_iter_peek_prev_min(), so rename for
consistency.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:17 -05:00
Kent Overstreet
65b14fa3d8 bcachefs: Assert that we're not violating key cache coherency rules
We're not allowed to have a dirty key in the key cache if the key
doesn't exist at all in the btree - creation has to bypass the key
cache, so that iteration over the btree can check if the key is present
in the key cache.

Things break in subtle ways if cache coherency is broken, so this needs
an assert.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:17 -05:00
Kent Overstreet
b318882022 bcachefs: bch2_trans_verify_not_unlocked_or_in_restart()
Fold two asserts into one.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:16 -05:00
Kent Overstreet
a71a1fac90 bcachefs: Better in_restart error
We're ramping up on checking transaction restart handling correctness -
so, in debug mode we now save a backtrace for where the restart was
emitted, which makes it much easier to track down the incorrect
handling.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:16 -05:00
Kent Overstreet
2434fc38ef bcachefs: Assert we're not in a restart in bch2_trans_put()
This always indicates a transaction restart handling bug

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:16 -05:00
Kent Overstreet
69785001c6 bcachefs: Fix unhandled transaction restart in evacuate_bucket()
Generally, releasing a transaction within a transaction restart means an
unhandled transaction restart: but this can happen legitimately within
the move code, e.g. when bch2_move_ratelimit() tells us to exit before
we've retried.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:16 -05:00
Kent Overstreet
b09b34499c bcachefs: Improved check_topology() assert
On interior btree node updates, we always verify that we're not
introducing topology errors: child nodes should exactly span the range
of the parent node.

single_device.ktest small_nodes has been popping this assert: change it
to give us more information.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:16 -05:00
Kent Overstreet
a34b026482 bcachefs: Kill BCH_TRANS_COMMIT_lazy_rw
We unconditionally go read-write, if we're going to do so, before
journal replay: lazy_rw is obsolete.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:16 -05:00
Kent Overstreet
cc944fbe06 bcachefs: Add assert for use of journal replay keys for updates
The journal replay keys mechanism can only be used for updates in early
recovery, when still single threaded.

Add some asserts to make sure we never accidentally use it elsewhere.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:16 -05:00
Hongbo Li
32e573c362 bcachefs: use attribute define helper for sysfs attribute
The sysfs attribute definition has been wrapped into macro:
rw_attribute, read_attribute and write_attribute, we can
use these helpers to uniform the attribute definition.

Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:16 -05:00
Hongbo Li
d3d8ec90ba bcachefs: remove write permission for gc_gens_pos sysfs interface
The gc_gens_pos is used to show the status of bucket gen gc.
There is no need to assign write permissions for this attribute.
Here we can use read_attribute helper to define this attribute.

```
[Before]
  $ ll internal/gc_gens_pos
  -rw-r--r-- 1 root root 4096 Oct 28 15:27 internal/gc_gens_pos

[After]
  $ ll internal/gc_gens_pos
  -r--r--r-- 1 root root 4096 Oct 28 17:27 internal/gc_gens_pos
```

Fixes: ac516d0e7d ("bcachefs: Add the status of bucket gen gc to sysfs")
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:16 -05:00
Kent Overstreet
161d13835e bcachefs: Move bch_extent_rebalance code to rebalance.c
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:16 -05:00
Kent Overstreet
a652c56590 bcachefs: Improve trace_rebalance_extent
We now say explicitly which pointers are being moved or compressed

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:16 -05:00
Kent Overstreet
3de8b72731 bcachefs: Simplify option logic in rebalance
Since bch2_move_get_io_opts() now synchronizes io_opts with options from
bch_extent_rebalance, delete the ad-hoc logic in rebalance.c that
previously did this.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:16 -05:00
Kent Overstreet
6aa0bd0fd5 bcachefs: get_update_rebalance_opts()
bch2_move_get_io_opts() now synchronizes options loaded from the
filesystem and inode (if present, i.e. not walking the reflink btree
directly) with options from the bch_extent_rebalance_entry, updating the
extent if necessary.

Since bch_extent_rebalance tracks where its option came from we can
preserve "inode options override filesystem options", even for indirect
extents where we don't have access to the inode the options came from.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:16 -05:00
Kent Overstreet
4ae6bbb522 bcachefs: bch2_write_inode() now checks for changing rebalance options
Previously, BCHFS_IOC_REINHERIT_ATTRS didn't trigger rebalance scans
when changing rebalance options - it had been missed, only the xattr
interface triggered them.

Ideally they'd be done by the transactional trigger, but unpacking the
inode to get the options is too heavy to be done in the low level
trigger - the inode trigger is run on every extent update, since the
bch_inode.bi_journal_seq has to be updated for fsync.

bch2_write_inode() is a good compromise, it already unpacks and repacks
and is not run in any super-fast paths.

Additionally, creating the new rebalance entry to trigger the scan is
now done in the same transaction as the inode update that changed the
options.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:16 -05:00
Kent Overstreet
2d21d11253 bcachefs: New bch_extent_rebalance fields
- Add more io path options to bch_extent_rebalance
- For each option, track whether it came from the filesystem or the
  inode

This will be used for improved rebalance support for reflinked data.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:16 -05:00
Kent Overstreet
ed13bb5726 bcachefs: bch2_prt_csum_opt()
bounds checking helper

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:16 -05:00
Kent Overstreet
c225084704 bcachefs: copygc_enabled, rebalance_enabled now opts.h options
They can now be set at mount time

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:16 -05:00
Kent Overstreet
7a7c43a0c1 bcachefs: Add bch_io_opts fields for indicating whether the opts came from the inode
This is going to be used in the bch_extent_rebalance improvements, which
propagate io_path options into the extent (important for rebalance,
which needs something present in the extent for transactionally tagging
them in the rebalance_work btree, and also for indirect extents).

By tracking in bch_extent_rebalance whether the option came from the
filesystem or the inode we can correctly handle options being changed on
indirect extents.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:16 -05:00
Kent Overstreet
3000855cab bcachefs: io_opts_to_rebalance_opts()
New helper to simplify bch2_bkey_set_needs_rebalance()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:16 -05:00
Kent Overstreet
5fffe1a3c3 bcachefs: rename bch_extent_rebalance fields to match other opts structs
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:16 -05:00
Kent Overstreet
282faf9524 bcachefs: kill __bch2_bkey_sectors_need_rebalance()
Single caller, fold into bch2_bkey_sectors_need_rebalance()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:16 -05:00
Kent Overstreet
c8908959ae bcachefs: kill bch2_bkey_needs_rebalance()
Dead code

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:15 -05:00
Kent Overstreet
015fafc49b bcachefs: small cleanup for extent ptr bitmasks
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:15 -05:00
Kent Overstreet
eacb755568 bcachefs: bch2_io_opts_fixups()
Centralize some io path option fixups - they weren't always being
applied correctly:

- background_compression uses compression if unset
- background_target uses foreground_target if unset
- nocow disables most fancy io path options

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:15 -05:00
Kent Overstreet
16de2c856a bcachefs: use bch2_data_update_opts_to_text() in trace_move_extent_fail()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:15 -05:00
Kent Overstreet
a1ca525b82 bcachefs: avoid 'unsigned flags'
flags should have actual types, where possible: fix btree_update.h
helpers

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:15 -05:00
Thorsten Blum
901ff6555b bcachefs: Annotate struct bucket_gens with __counted_by()
Add the __counted_by compiler attribute to the flexible array member b
to improve access bounds-checking via CONFIG_UBSAN_BOUNDS and
CONFIG_FORTIFY_SOURCE.

Use struct_size() to calculate the number of bytes to be allocated.

Update bucket_gens->nbuckets and bucket_gens->nbuckets_minus_first when
resizing.

Compile-tested only.

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:15 -05:00
Thorsten Blum
ac9826f147 bcachefs: Use str_write_read() helper in write_super_endio()
Remove hard-coded strings by using the str_write_read() helper function.

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:15 -05:00
Thorsten Blum
751d869710 bcachefs: Use str_write_read() helper in ec_block_endio()
Remove hard-coded strings by using the helper function str_write_read().

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:15 -05:00
Thorsten Blum
de902e3b4a bcachefs: Use str_write_read() helper function
Remove hard-coded strings by using the helper function str_write_read().

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:15 -05:00
Kent Overstreet
e0c8369bc8 bcachefs: Add version check for bch_btree_ptr_v2.sectors_written validate
A user popped up with a very old (0.11) filesystem that needed repair
and wasn't recently backed up.

Reported-by: Manoa <manoa@mail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:15 -05:00
Kent Overstreet
27de0ee39f bcachefs: Add block plugging to read paths
This will help with some of the btree_trans srcu lock hold time warnings
that are still turning up; submit_bio() can block for awhile if the
device is sufficiently congested.

It's not a perfect solution since blk_plug bios are submitted when
scheduling; we might want a way to disable the "submit on context
switch" behaviour, or switch to our own plugging in the future.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:15 -05:00
Kent Overstreet
be5a7be106 bcachefs: Fix warning about passing flex array member by value
this showed up when building in userspace

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:15 -05:00
Kent Overstreet
fb8c835b18 bcachefs: bch2_journal_meta() takes ref on c->writes
This part of addressing
https://github.com/koverstreet/bcachefs/issues/656

where we're getting stuck in bch2_journal_meta() in the dump tool.

We shouldn't be invoking the journal without a ref on c->writes (if
we're not RW), and there's no reason for the dump tool to be going
read-write.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:15 -05:00
Kent Overstreet
8b22abb4c8 bcachefs: -o norecovery now bails out of recovery earlier
-o norecovery (used by the dump tool) should be doing the absolute
minimum amount of work to get the filesystem up and readable; we
shouldn't be running check and repair code, or going read-write.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:15 -05:00
Kent Overstreet
d55d4a0ca2 bcachefs: Refactor new stripe path to reduce dependencies on ec_stripe_head
We need to add a path for reshaping existing stripes (for e.g. device
removal), and this new path won't necessarily use ec_stripe_head.

Refactor the code to avoid unnecessary references to it for clarity.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:15 -05:00
Kent Overstreet
db514cf677 bcachefs: Avoid bch2_btree_id_str()
Prefer bch2_btree_id_to_text() - it prints out the integer ID when
unknown.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:15 -05:00
Kent Overstreet
9e2f5f7988 bcachefs: better error message in check_snapshot_tree()
If we find a snapshot node and it didn't match the snapshot tree, we
should print it.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:15 -05:00
Kent Overstreet
106480e9a8 bcachefs: Factor out jset_entry_log_msg_bytes()
Needed for improved userspace cmd_list_journal

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:15 -05:00
Kent Overstreet
0269e27ce3 bcachefs: improved bkey_val_copy()
Factor out some common code, add typechecking.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:15 -05:00
Kent Overstreet
e3c43dbe8e bcachefs: bch2_btree_lost_data() now uses run_explicit_rceovery_pass_persistent()
Also get a bit more fine grained about which passes to run for which
btrees.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:15 -05:00
Kent Overstreet
d65d126c02 bcachefs: Add locking for bch_fs.curr_recovery_pass
Recovery can rewind in certain situations - when we discover we need to
run a pass that doesn't normally run.

This can happen from another thread for btree node read errors, so we
need a bit of locking.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:14 -05:00
Kent Overstreet
26c79fdc58 bcachefs: lru, accounting are alloc btrees
They can be regenerated by fsck and don't require a btree node scan,
like other alloc btrees.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:14 -05:00
Kent Overstreet
18f5b84a04 bcachefs: bch2_run_explicit_recovery_pass() returns different error when not in recovery
if we're not in recovery then there's no way to rewind recovery - give
this a different errcode so that any error messages will give us a
better idea of what happened.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:14 -05:00
Kent Overstreet
eba3d7e57d bcachefs: add more path idx debug asserts
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:14 -05:00
Thorsten Blum
55f524b706 bcachefs: Use FOREACH_ACL_ENTRY() macro to iterate over acl entries
Use the existing FOREACH_ACL_ENTRY() macro to iterate over POSIX acl
entries and remove the custom acl_for_each_entry() macro.

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:14 -05:00
Thorsten Blum
03525de506 bcachefs: Remove duplicate included headers
The header files dirent_format.h and disk_groups_format.h are included
twice. Remove the redundant includes and the following warnings reported
by make includecheck:

  disk_groups_format.h is included more than once
  dirent_format.h is included more than once

Reviewed-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:14 -05:00
Kent Overstreet
4e1c6ac05a bcachefs: kill btree_trans_restart_nounlock()
Redundant, the normal btree_trans_restart() doesn't unlock.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:14 -05:00
Kent Overstreet
d6cf895847 bcachefs: Remove unnecessary peek_slot()
hash_lookup() used to return an errorcode, and a peek_slot() call was
required to get the key it looked up. But we're adding fault injection
for transaction restarts, so fix this old unconverted code.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:14 -05:00
Thomas Bertschinger
fe818d2039 bcachefs: move bch2_xattr_handlers to .rodata
A series posted previously moved all of the `struct xattr_handler`
tables to .rodata for each filesystem [1].

However, this appears to have been done shortly before bcachefs was
merged, so bcachefs was missed at that time.

Link: https://lkml.kernel.org/r/20230930050033.41174-1-wedsonaf@gmail.com [1]
Cc: Wedson Almeida Filho <wedsonaf@gmail.com>
Signed-off-by: Thomas Bertschinger <tahbertschinger@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:14 -05:00
Alan Huang
bf4e42d158 bcachefs: Delete dead code
lock_fail_root_changed has not been used since commit
0d7009d7ca ("bcachefs: Delete old deadlock avoidance code")

Remove it.

Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:14 -05:00
Kent Overstreet
c07beca44f bcachefs: Pull disk accounting hooks out of trans_commit.c
Also, fix a minor bug in the revert path, where we weren't checking the
journal entry type correctly.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:14 -05:00
Kent Overstreet
179cdecf22 bcachefs: bch_verbose_ratelimited
ratelimit "deleting unlinked inode" messages

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:14 -05:00
Kent Overstreet
a55e2d78ea bcachefs: rcu_pending: don't invoke __call_rcu() under lock
In userspace we don't (yet) have an SRCU implementation, so call_srcu()
recurses.

But we don't want to be invoking it under the lock anyways.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:14 -05:00
Kent Overstreet
b836f22014 bcachefs: __bch2_key_has_snapshot_overwrites uses for_each_btree_key_reverse_norestart()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:14 -05:00
Kent Overstreet
1325ccf27e bcachefs: remove_backpointer() now uses dirent_get_by_pos()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:14 -05:00
Kent Overstreet
de92b1ee67 bcachefs: bch2_inode_should_have_bp -> bch2_inode_should_have_single_bp
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:14 -05:00
Colin Ian King
1c6d5841ae bcachefs: remove superfluous ; after statements
There are a several statements with two following semicolons, replace
these with just one semicolon.

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:14 -05:00
Kent Overstreet
135c0c8524 bcachefs: Fix racy use of jiffies
Calculate the timeout, then check if it's positive before calling
schedule_timeout().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:13 -05:00
Kent Overstreet
4b8d382b2c Merge branch 'bcachefs-kill-retry-estale' into HEAD 2024-12-21 01:36:13 -05:00
Eric Biggers
cc354fa7f0 bcachefs: Explicitly select CRYPTO from BCACHEFS_FS
Explicitly select CRYPTO from BCACHEFS_FS, so that this dependency of
CRYPTO_SHA256, CRYPTO_CHACHA20, and CRYPTO_POLY1305 (which are also
selected) is satisfied.  Currently this dependency is satisfied
indirectly via LIBCRC32C, but this is fragile and is planned to change
(https://lore.kernel.org/r/20241021002935.325878-13-ebiggers@kernel.org).

Acked-by: Kent Overstreet <kent.overstreet@linux.dev>
Link: https://lore.kernel.org/r/20241202010844.144356-15-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
2024-12-01 17:23:02 -08:00
Linus Torvalds
f5f4745a7f - The series "resource: A couple of cleanups" from Andy Shevchenko
performs some cleanups in the resource management code.
 
 - The series "Improve the copy of task comm" from Yafang Shao addresses
   possible race-induced overflows in the management of task_struct.comm[].
 
 - The series "Remove unnecessary header includes from
   {tools/}lib/list_sort.c" from Kuan-Wei Chiu adds some cleanups and a
   small fix to the list_sort library code and to its selftest.
 
 - The series "Enhance min heap API with non-inline functions and
   optimizations" also from Kuan-Wei Chiu optimizes and cleans up the
   min_heap library code.
 
 - The series "nilfs2: Finish folio conversion" from Ryusuke Konishi
   finishes off nilfs2's folioification.
 
 - The series "add detect count for hung tasks" from Lance Yang adds more
   userspace visibility into the hung-task detector's activity.
 
 - Apart from that, singelton patches in many places - please see the
   individual changelogs for details.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZ0L6lQAKCRDdBJ7gKXxA
 jmEIAPwMSglNPKRIOgzOvHh8MUJW1Dy8iKJ2kWCO3f6QTUIM2AEA+PazZbUd/g2m
 Ii8igH0UBibIgva7MrCyJedDI1O23AA=
 =8BIU
 -----END PGP SIGNATURE-----

Merge tag 'mm-nonmm-stable-2024-11-24-02-05' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull non-MM updates from Andrew Morton:

 - The series "resource: A couple of cleanups" from Andy Shevchenko
   performs some cleanups in the resource management code

 - The series "Improve the copy of task comm" from Yafang Shao addresses
   possible race-induced overflows in the management of
   task_struct.comm[]

 - The series "Remove unnecessary header includes from
   {tools/}lib/list_sort.c" from Kuan-Wei Chiu adds some cleanups and a
   small fix to the list_sort library code and to its selftest

 - The series "Enhance min heap API with non-inline functions and
   optimizations" also from Kuan-Wei Chiu optimizes and cleans up the
   min_heap library code

 - The series "nilfs2: Finish folio conversion" from Ryusuke Konishi
   finishes off nilfs2's folioification

 - The series "add detect count for hung tasks" from Lance Yang adds
   more userspace visibility into the hung-task detector's activity

 - Apart from that, singelton patches in many places - please see the
   individual changelogs for details

* tag 'mm-nonmm-stable-2024-11-24-02-05' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (71 commits)
  gdb: lx-symbols: do not error out on monolithic build
  kernel/reboot: replace sprintf() with sysfs_emit()
  lib: util_macros_kunit: add kunit test for util_macros.h
  util_macros.h: fix/rework find_closest() macros
  Improve consistency of '#error' directive messages
  ocfs2: fix uninitialized value in ocfs2_file_read_iter()
  hung_task: add docs for hung_task_detect_count
  hung_task: add detect count for hung tasks
  dma-buf: use atomic64_inc_return() in dma_buf_getfile()
  fs/proc/kcore.c: fix coccinelle reported ERROR instances
  resource: avoid unnecessary resource tree walking in __region_intersects()
  ocfs2: remove unused errmsg function and table
  ocfs2: cluster: fix a typo
  lib/scatterlist: use sg_phys() helper
  checkpatch: always parse orig_commit in fixes tag
  nilfs2: convert metadata aops from writepage to writepages
  nilfs2: convert nilfs_recovery_copy_block() to take a folio
  nilfs2: convert nilfs_page_count_clean_buffers() to take a folio
  nilfs2: remove nilfs_writepage
  nilfs2: convert checkpoint file to be folio-based
  ...
2024-11-25 16:09:48 -08:00
Kent Overstreet
840c2fbcc5 bcachefs: Fix assertion pop in bch2_ptr_swab()
This runs on extents that haven't yet been validated, so we don't want
to assert that we have a valid entry type.

Reported-by: syzbot+4f29c3f12f864d8a8d17@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-11-12 03:46:57 -05:00
Kent Overstreet
657d4282d8 bcachefs: Fix journal_entry_dev_usage_to_text() overrun
If the jset_entry_dev_usage is malformed, and too small, our nr_entries
calculation will be incorrect - just bail out.

Reported-by: syzbot+05d7520be047c9be86e0@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-11-12 03:46:57 -05:00
Kent Overstreet
2642084f26 bcachefs: Allow for unknown key types in backpointers fsck
We can't assume that btrees only contain keys of a given type - even if
they only have a single key type listed in the allowed key types for
that btree; this is a forwards compatibility issue.

Reported-by: syzbot+a27c3aaa3640dd3e1dfb@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-11-11 00:37:19 -05:00
Kent Overstreet
0b6ec0c5ac bcachefs: Fix assertion pop in topology repair
Fixes: baefd3f849 ("bcachefs: btree_cache.freeable list fixes")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-11-11 00:37:18 -05:00